Repository: chiphuyen/stanford-tensorflow-tutorials
Branch: master
Commit: 51e53daaa2a3
Files: 116
Total size: 19.7 MB

Directory structure:
gitextract_fr18ta9_/
├── .gitignore
├── 2017/
│   ├── README.md
│   ├── assignments/
│   │   ├── chatbot/
│   │   │   ├── README.md
│   │   │   ├── chatbot.py
│   │   │   ├── config.py
│   │   │   ├── data.py
│   │   │   ├── model.py
│   │   │   └── output_convo.txt
│   │   ├── exercises/
│   │   │   ├── e01.py
│   │   │   └── e01_sol.py
│   │   ├── style_transfer/
│   │   │   ├── readme.md
│   │   │   ├── style_transfer.py
│   │   │   ├── utils.py
│   │   │   └── vgg_model.py
│   │   └── style_transfer_starter/
│   │       ├── readme.md
│   │       ├── style_transfer.py
│   │       ├── utils.py
│   │       └── vgg_model.py
│   ├── data/
│   │   ├── arvix_abstracts.txt
│   │   ├── fire_theft.xls
│   │   ├── friday.tfrecord
│   │   ├── heart.csv
│   │   └── heart.txt
│   ├── examples/
│   │   ├── 02_feed_dict.py
│   │   ├── 02_lazy_loading.py
│   │   ├── 02_simple_tf.py
│   │   ├── 02_variables.py
│   │   ├── 03_linear_regression_sol.py
│   │   ├── 03_linear_regression_starter.py
│   │   ├── 03_logistic_regression_mnist_sol.py
│   │   ├── 03_logistic_regression_mnist_starter.py
│   │   ├── 04_word2vec_no_frills.py
│   │   ├── 04_word2vec_starter.py
│   │   ├── 04_word2vec_visualize.py
│   │   ├── 05_csv_reader.py
│   │   ├── 05_randomization.py
│   │   ├── 07_basic_filters.py
│   │   ├── 07_convnet_mnist.py
│   │   ├── 07_convnet_mnist_starter.py
│   │   ├── 09_queue_example.py
│   │   ├── 09_tfrecord_example.py
│   │   ├── 11_char_rnn_gist.py
│   │   ├── autoencoder/
│   │   │   ├── autoencoder.py
│   │   │   ├── layer_utils.py
│   │   │   ├── layers.py
│   │   │   ├── train.py
│   │   │   └── utils.py
│   │   ├── cgru/
│   │   │   ├── README.md
│   │   │   ├── custom_getter.py
│   │   │   ├── data_reader.py
│   │   │   ├── my_layers.py
│   │   │   └── neural_gpu_v3.py
│   │   ├── data/
│   │   │   ├── arvix_abstracts.txt
│   │   │   ├── fire_theft.xls
│   │   │   ├── heart.csv
│   │   │   └── heart.txt
│   │   ├── deepdream/
│   │   │   ├── deepdream_exercise.py
│   │   │   └── deepdream_solution.py
│   │   ├── graphs/
│   │   │   ├── gist/
│   │   │   │   ├── events.out.tfevents.1499787135.MacBook-Pro
│   │   │   │   ├── events.out.tfevents.1499787150.MacBook-Pro
│   │   │   │   └── events.out.tfevents.1499787321.MacBook-Pro
│   │   │   ├── l2/
│   │   │   │   ├── events.out.tfevents.1499786503.MacBook-Pro
│   │   │   │   └── events.out.tfevents.1499786515.MacBook-Pro
│   │   │   └── linear_reg/
│   │   │       └── events.out.tfevents.1499786822.MacBook-Pro
│   │   ├── kernels.py
│   │   ├── process_data.py
│   │   └── utils.py
│   └── setup/
│       ├── requirements.txt
│       └── setup_instruction.md
├── LICENSE
├── README.md
├── assignments/
│   ├── 01/
│   │   ├── q1.py
│   │   └── q1_sol.py
│   ├── 02_style_transfer/
│   │   ├── load_vgg.py
│   │   ├── load_vgg_sol.py
│   │   ├── style_transfer.py
│   │   ├── style_transfer_sol.py
│   │   └── utils.py
│   ├── chatbot/
│   │   ├── README.md
│   │   ├── chatbot.py
│   │   ├── config.py
│   │   ├── data.py
│   │   ├── model.py
│   │   └── output_convo.txt
│   ├── trump_bot/
│   │   └── trump_tweets.txt
│   └── word_transform/
│       ├── common.en.vocab
│       ├── eval.vocab
│       └── train.vocab
├── examples/
│   ├── 02_lazy_loading.py
│   ├── 02_placeholder.py
│   ├── 02_simple_tf.py
│   ├── 02_variables.py
│   ├── 03_linreg_dataset.py
│   ├── 03_linreg_placeholder.py
│   ├── 03_linreg_starter.py
│   ├── 03_logreg.py
│   ├── 03_logreg_placeholder.py
│   ├── 03_logreg_starter.py
│   ├── 04_linreg_eager.py
│   ├── 04_linreg_eager_starter.py
│   ├── 04_word2vec.py
│   ├── 04_word2vec_eager.py
│   ├── 04_word2vec_eager_starter.py
│   ├── 04_word2vec_visualize.py
│   ├── 05_randomization.py
│   ├── 05_variable_sharing.py
│   ├── 07_convnet_layers.py
│   ├── 07_convnet_mnist.py
│   ├── 07_convnet_mnist_starter.py
│   ├── 07_run_kernels.py
│   ├── 11_char_rnn.py
│   ├── kernels.py
│   ├── utils.py
│   └── word2vec_utils.py
└── setup/
    ├── requirements.txt
    └── setup_instruction.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.pdf
*.SUNet
*.pyc
.env/*
examples/data
examples/graphs/*
examples/checkpoints/*
examples/visualization/*

================================================
FILE: 2017/README.md
================================================
# tf-stanford-tutorials
This repository contains code examples for the 2017 course CS 20SI: TensorFlow for Deep Learning Research.
Detailed syllabus and lecture notes can be found [here](http://cs20si.stanford.edu/2017).
For this year's course, please see [here](http://cs20.stanford.edu)

# Note (as of July 11, 2017)
I've updated the code to TensorFlow 1.2 and Python3, except the code for chatbot. I will update the code for chatbot soon.

## Models include:
### In the folder "examples":
Linear Regression with Chicago's Fire-Theft dataset
Logistic Regression with MNIST
Word2vec skip-gram model with NCE loss
Convnets with MNIST
Autoencoder (by Nishith Khandwala)
Deepdream (by Jon Shlens)
Character-level language modeling

### In the folder "assignments":
Style Transfer
Chatbot using sequence to sequence with attention

## Misc
Examples on how to use data readers, TFRecord
Embedding visualization with TensorBoard
Usage of summary ops
Exercises to be familiar with other special TensorFlow ops
Demonstration of the danger of lazy loading
Convolutional GRU (CGRU) (by Lukasz Kaiser)

================================================
FILE: 2017/assignments/chatbot/README.md
================================================
<<<<<<<
A neural chatbot using sequence to sequence model with attentional decoder. This is a fully functional chatbot.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

The detailed assignment handout and information on training time can be found at http://web.stanford.edu/class/cs20si/assignments/a3.pdf

Sample conversation

HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm chip
BOT ++++ please .
HUMAN ++++ please what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ do you like california?
BOT ++++ i ' m okay .
HUMAN ++++ i like california a lot
BOT ++++ of what ?
HUMAN ++++ i live here
BOT ++++ and what is it ?
HUMAN ++++ california
BOT ++++ i don ' t know what to say .
HUMAN ++++ i'm a student here
BOT ++++ and get it out of here .
HUMAN ++++ what do you like to do in your free time?
BOT ++++ and what about the gun ?
HUMAN ++++ do you have a gun?
BOT ++++ yes .
HUMAN ++++ you're dangerous!
BOT ++++ i ' m not .
HUMAN ++++ do you go hunting?
BOT ++++ i can do .
HUMAN ++++ what do you hunt?
BOT ++++ virgin .
HUMAN ++++ that's very bad
BOT ++++ and you in love ?
HUMAN ++++ no i'm not
BOT ++++ please - -
HUMAN ++++ okay, i am
BOT ++++ and you are good .
HUMAN ++++ yeah i am good too.
BOT ++++ good good to you .
HUMAN ++++ are you in love?
BOT ++++ i ' m ready .
HUMAN ++++ ready for what?
BOT ++++ for what ?
HUMAN ++++ for what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ tell me
BOT ++++ i can do it .
HUMAN ++++ okay, tell me what you're ready for
BOT ++++ i ' m doing you .
HUMAN ++++ no you're not
BOT ++++ please !
HUMAN ++++ this is a strange conversation
BOT ++++ i don ' t know what to say .
HUMAN ++++ are you a guy?
BOT ++++ of what ?
HUMAN ++++ are you a boy?
BOT ++++ i ' m fine .
HUMAN ++++ are you a man?
BOT ++++ i ' m a little .
HUMAN ++++ what's your name?
BOT ++++ chris -
HUMAN ++++ chris who?
BOT ++++ cornelius .
HUMAN ++++ so i guess you're a guy.
BOT ++++ i am .
HUMAN ++++ okay
BOT ++++ okay ?
HUMAN ++++ do you have a girlfriend?
BOT ++++ i ' m sure .
HUMAN ++++ what's her name?
BOT ++++ let ' s talk about something else .
See output_convo.txt for more sample conversations.
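
The transcript above uses the plain-text layout that chat mode appends to output_convo.txt: a speaker tag (HUMAN or BOT), the separator `++++`, then the utterance, with a line of `=` characters closing each session. Below is a minimal sketch of reading that layout back into (speaker, utterance) pairs, assuming one turn per line. The helper name `parse_convo` is hypothetical and not part of the starter code.

```python
def parse_convo(text):
    """Split a chat log into sessions of (speaker, utterance) pairs.

    Hypothetical helper, not part of the starter code. Assumes one turn per
    line in the form 'SPEAKER ++++ utterance' and a line made only of '='
    characters between sessions.
    """
    sessions, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {'='}:
            # session separator: close the current session if it has turns
            if current:
                sessions.append(current)
                current = []
            continue
        speaker, sep, utterance = line.partition(' ++++ ')
        if sep and speaker in ('HUMAN', 'BOT'):
            current.append((speaker, utterance))
    if current:
        sessions.append(current)
    return sessions


sample = (
    "HUMAN ++++ hi\n"
    "BOT ++++ hi . what ' s your name ?\n"
    + "=" * 45 + "\n"
    "HUMAN ++++ okay\n"
    "BOT ++++ okay ?\n"
)
sessions = parse_convo(sample)
```

Lines that match neither a turn nor a separator are skipped silently, so a partially written log still parses.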

Usage

Step 1: create a data folder in your project directory, download the Cornell Movie-Dialogs Corpus from
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Unzip it

Step 2: python data.py
This will do all the pre-processing for the Cornell dataset.

Step 3: python chatbot.py --mode [train/chat]
If mode is train, then you train the chatbot. By default, the model will restore the previously trained weights (if there are any) and continue training from there.

If you want to start training from scratch, please delete all the checkpoints in the checkpoints folder.

If the mode is chat, you'll go into the interaction mode with the bot.

By default, all the conversations you have with the chatbot will be written into the file output_convo.txt in the processed folder. If you run this chatbot, I kindly ask you to send me the output_convo.txt so that I can improve the chatbot. My email is huyenn@stanford.edu

If you find the tutorial helpful, please head over to Anonymous Chatlog Donation to see how you can help us create the first realistic dialogue dataset.

Thank you very much!
>>>>>>> origin/master

================================================
FILE: 2017/assignments/chatbot/chatbot.py
================================================
""" A neural chatbot using sequence to sequence model with attentional decoder.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the code to run the model.

See readme.md for instruction on how to run the starter code.
"""
from __future__ import division
from __future__ import print_function

import argparse
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import random
import sys
import time

import numpy as np
import tensorflow as tf

from model import ChatBotModel
import config
import data

def _get_random_bucket(train_buckets_scale):
    """ Get a random bucket from which to choose a training sample """
    rand = random.random()
    return min([i for i in range(len(train_buckets_scale))
                if train_buckets_scale[i] > rand])

def _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks):
    """ Assert that the encoder inputs, decoder inputs, and decoder masks are
    of the expected lengths """
    if len(encoder_inputs) != encoder_size:
        raise ValueError("Encoder length must be equal to the one in bucket,"
                         " %d != %d." % (len(encoder_inputs), encoder_size))
    if len(decoder_inputs) != decoder_size:
        raise ValueError("Decoder length must be equal to the one in bucket,"
                         " %d != %d." % (len(decoder_inputs), decoder_size))
    if len(decoder_masks) != decoder_size:
        raise ValueError("Weights length must be equal to the one in bucket,"
                         " %d != %d." % (len(decoder_masks), decoder_size))

def run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, forward_only):
    """ Run one step in training.
    @forward_only: boolean value to decide whether a backward path should be created
    forward_only is set to True when you just want to evaluate on the test set,
    or when you want the bot to be in chat mode. """
    encoder_size, decoder_size = config.BUCKETS[bucket_id]
    _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks)

    # input feed: encoder inputs, decoder inputs, target_weights, as provided.
    input_feed = {}
    for step in range(encoder_size):
        input_feed[model.encoder_inputs[step].name] = encoder_inputs[step]
    for step in range(decoder_size):
        input_feed[model.decoder_inputs[step].name] = decoder_inputs[step]
        input_feed[model.decoder_masks[step].name] = decoder_masks[step]

    last_target = model.decoder_inputs[decoder_size].name
    input_feed[last_target] = np.zeros([model.batch_size], dtype=np.int32)

    # output feed: depends on whether we do a backward step or not.
    if not forward_only:
        output_feed = [model.train_ops[bucket_id],  # update op that does SGD.
                       model.gradient_norms[bucket_id],  # gradient norm.
                       model.losses[bucket_id]]  # loss for this batch.
    else:
        output_feed = [model.losses[bucket_id]]  # loss for this batch.
        for step in range(decoder_size):  # output logits.
            output_feed.append(model.outputs[bucket_id][step])

    outputs = sess.run(output_feed, input_feed)
    if not forward_only:
        return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.
    else:
        return None, outputs[0], outputs[1:]  # No gradient norm, loss, outputs.

def _get_buckets():
    """ Load the dataset into buckets based on their lengths.
    train_buckets_scale is the interval that'll help us
    choose a random bucket later on.
    """
    test_buckets = data.load_data('test_ids.enc', 'test_ids.dec')
    data_buckets = data.load_data('train_ids.enc', 'train_ids.dec')
    train_bucket_sizes = [len(data_buckets[b]) for b in range(len(config.BUCKETS))]
    print("Number of samples in each bucket:\n", train_bucket_sizes)
    train_total_size = sum(train_bucket_sizes)
    # list of increasing numbers from 0 to 1 that we'll use to select a bucket.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                           for i in range(len(train_bucket_sizes))]
    print("Bucket scale:\n", train_buckets_scale)
    return test_buckets, data_buckets, train_buckets_scale

def _get_skip_step(iteration):
    """ How many steps should the model train before it saves all the weights. """
    if iteration < 100:
        return 30
    return 100

def _check_restore_parameters(sess, saver):
    """ Restore the previously trained parameters if there are any. """
    ckpt = tf.train.get_checkpoint_state(os.path.dirname(config.CPT_PATH + '/checkpoint'))
    if ckpt and ckpt.model_checkpoint_path:
        print("Loading parameters for the Chatbot")
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        print("Initializing fresh parameters for the Chatbot")

def _eval_test_set(sess, model, test_buckets):
    """ Evaluate on the test set. """
    for bucket_id in range(len(config.BUCKETS)):
        if len(test_buckets[bucket_id]) == 0:
            print("  Test: empty bucket %d" % (bucket_id))
            continue
        start = time.time()
        encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(test_buckets[bucket_id],
                                                                       bucket_id,
                                                                       batch_size=config.BATCH_SIZE)
        _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs,
                                   decoder_masks, bucket_id, True)
        print('Test bucket {}: loss {}, time {}'.format(bucket_id, step_loss, time.time() - start))

def train():
    """ Train the bot """
    test_buckets, data_buckets, train_buckets_scale = _get_buckets()
    # in train mode, we need to create the backward path, so forward_only is False
    model = ChatBotModel(False, config.BATCH_SIZE)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        print('Running session')
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)

        iteration = model.global_step.eval()
        total_loss = 0
        while True:
            skip_step = _get_skip_step(iteration)
            bucket_id = _get_random_bucket(train_buckets_scale)
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(data_buckets[bucket_id],
                                                                           bucket_id,
                                                                           batch_size=config.BATCH_SIZE)
            start = time.time()
            _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, False)
            total_loss += step_loss
            iteration += 1

            if iteration % skip_step == 0:
                print('Iter {}: loss {}, time {}'.format(iteration, total_loss/skip_step, time.time() - start))
                start = time.time()
                total_loss = 0
                saver.save(sess, os.path.join(config.CPT_PATH, 'chatbot'), global_step=model.global_step)
                if iteration % (10 * skip_step) == 0:
                    # Run evals on development set and print their loss
                    _eval_test_set(sess, model, test_buckets)
                    start = time.time()
                sys.stdout.flush()

def _get_user_input():
    """ Get user's input, which will be transformed into encoder input later """
    print("> ", end="")
    sys.stdout.flush()
    return sys.stdin.readline()

def _find_right_bucket(length):
    """ Find the proper bucket for an encoder input based on its length """
    return min([b for b in range(len(config.BUCKETS))
                if config.BUCKETS[b][0] >= length])

def _construct_response(output_logits, inv_dec_vocab):
    """ Construct a response to the user's encoder input.
    @output_logits: the outputs from sequence to sequence wrapper.
    output_logits is decoder_size np array, each of dim 1 x DEC_VOCAB

    This is a greedy decoder - outputs are just argmaxes of output_logits.
    """
    print(output_logits[0])
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
    # If there is an EOS symbol in outputs, cut them at that point.
    if config.EOS_ID in outputs:
        outputs = outputs[:outputs.index(config.EOS_ID)]
    # Print out sentence corresponding to outputs.
    return " ".join([tf.compat.as_str(inv_dec_vocab[output]) for output in outputs])

def chat():
    """ in test mode, we don't need to create the backward path
    """
    _, enc_vocab = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.enc'))
    inv_dec_vocab, _ = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.dec'))

    model = ChatBotModel(True, batch_size=1)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)
        output_file = open(os.path.join(config.PROCESSED_PATH, config.OUTPUT_FILE), 'a+')
        # Decode from standard input.
        max_length = config.BUCKETS[-1][0]
        print('Welcome to TensorBro. Say something. Enter to exit. Max length is', max_length)
        while True:
            line = _get_user_input()
            if len(line) > 0 and line[-1] == '\n':
                line = line[:-1]
            if line == '':
                break
            output_file.write('HUMAN ++++ ' + line + '\n')
            # Get token-ids for the input sentence.
            token_ids = data.sentence2id(enc_vocab, str(line))
            if (len(token_ids) > max_length):
                print('Max length I can handle is:', max_length)
                # go back to the top of the loop, which prompts for the next input
                continue
            # Which bucket does it belong to?
            bucket_id = _find_right_bucket(len(token_ids))
            # Get a 1-element batch to feed the sentence to the model.
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch([(token_ids, [])],
                                                                           bucket_id,
                                                                           batch_size=1)
            # Get output logits for the sentence.
            _, _, output_logits = run_step(sess, model, encoder_inputs, decoder_inputs,
                                           decoder_masks, bucket_id, True)
            response = _construct_response(output_logits, inv_dec_vocab)
            print(response)
            output_file.write('BOT ++++ ' + response + '\n')
        output_file.write('=============================================\n')
        output_file.close()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', choices={'train', 'chat'},
                        default='train', help="mode. if not specified, it's in the train mode")
    args = parser.parse_args()

    if not os.path.isdir(config.PROCESSED_PATH):
        data.prepare_raw_data()
        data.process_data()
    print('Data ready!')
    # create checkpoints folder if there isn't one already
    data.make_dir(config.CPT_PATH)

    if args.mode == 'train':
        train()
    elif args.mode == 'chat':
        chat()

if __name__ == '__main__':
    main()

================================================
FILE: 2017/assignments/chatbot/config.py
================================================
""" A neural chatbot using sequence to sequence model with attentional decoder.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the hyperparameters for the model.

See readme.md for instruction on how to run the starter code.
"""

# parameters for processing the dataset
DATA_PATH = '/Users/Chip/data/cornell movie-dialogs corpus'
CONVO_FILE = 'movie_conversations.txt'
LINE_FILE = 'movie_lines.txt'
OUTPUT_FILE = 'output_convo.txt'
PROCESSED_PATH = 'processed'
CPT_PATH = 'checkpoints'

THRESHOLD = 2

PAD_ID = 0
UNK_ID = 1
START_ID = 2
EOS_ID = 3

TESTSET_SIZE = 25000

# model parameters
""" Train encoder length distribution:
[175, 92, 11883, 8387, 10656, 13613, 13480, 12850, 11802, 10165,
8973, 7731, 7005, 6073, 5521, 5020, 4530, 4421, 3746, 3474, 3192,
2724, 2587, 2413, 2252, 2015, 1816, 1728, 1555, 1392, 1327, 1248,
1128, 1084, 1010, 884, 843, 755, 705, 660, 649, 594, 558, 517, 475,
426, 444, 388, 349, 337]
These bucket sizes seem to work the best
"""
# [19530, 17449, 17585, 23444, 22884, 16435, 17085, 18291, 18931]
# BUCKETS = [(6, 8), (8, 10), (10, 12), (13, 15), (16, 19), (19, 22), (23, 26), (29, 32), (39, 44)]

# [37049, 33519, 30223, 33513, 37371]
# BUCKETS = [(8, 10), (12, 14), (16, 19), (23, 26), (39, 43)]

# BUCKETS = [(8, 10), (12, 14), (16, 19)]
BUCKETS = [(16, 19)]

NUM_LAYERS = 3
HIDDEN_SIZE = 256
BATCH_SIZE = 64

LR = 0.5
MAX_GRAD_NORM = 5.0

NUM_SAMPLES = 512

================================================
FILE: 2017/assignments/chatbot/data.py
================================================
""" A neural chatbot using sequence to sequence model with attentional decoder.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the code to do the pre-processing for the
Cornell Movie-Dialogs Corpus.

See readme.md for instruction on how to run the starter code.
"""
from __future__ import print_function

import os
import random
import re

import numpy as np

import config

def get_lines():
    id2line = {}
    file_path = os.path.join(config.DATA_PATH, config.LINE_FILE)
    with open(file_path, 'rb') as f:
        lines = f.readlines()
        for line in lines:
            parts = line.split(' +++$+++ ')
            if len(parts) == 5:
                if parts[4][-1] == '\n':
                    parts[4] = parts[4][:-1]
                id2line[parts[0]] = parts[4]
    return id2line

def get_convos():
    """ Get conversations from the raw data """
    file_path = os.path.join(config.DATA_PATH, config.CONVO_FILE)
    convos = []
    with open(file_path, 'rb') as f:
        for line in f.readlines():
            parts = line.split(' +++$+++ ')
            if len(parts) == 4:
                convo = []
                for line in parts[3][1:-2].split(', '):
                    convo.append(line[1:-1])
                convos.append(convo)

    return convos

def question_answers(id2line, convos):
    """ Divide the dataset into two sets: questions and answers. """
    questions, answers = [], []
    for convo in convos:
        for index, line in enumerate(convo[:-1]):
            questions.append(id2line[convo[index]])
            answers.append(id2line[convo[index + 1]])
    assert len(questions) == len(answers)
    return questions, answers

def prepare_dataset(questions, answers):
    # create path to store all the train & test encoder & decoder
    make_dir(config.PROCESSED_PATH)

    # random convos to create the test set
    test_ids = random.sample([i for i in range(len(questions))],config.TESTSET_SIZE)

    filenames = ['train.enc', 'train.dec', 'test.enc', 'test.dec']
    files = []
    for filename in filenames:
        files.append(open(os.path.join(config.PROCESSED_PATH, filename),'wb'))

    for i in range(len(questions)):
        if i in test_ids:
            files[2].write(questions[i] + '\n')
            files[3].write(answers[i] + '\n')
        else:
            files[0].write(questions[i] + '\n')
            files[1].write(answers[i] + '\n')

    for file in files:
        file.close()

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def basic_tokenizer(line, normalize_digits=True):
    """ A basic tokenizer to tokenize text into tokens.
    Feel free to change this to suit your need. """
    # strip the <u> ... </u> markup and square brackets from the raw lines
    line = re.sub('<u>', '', line)
    line = re.sub('</u>', '', line)
    line = re.sub('\[', '', line)
    line = re.sub('\]', '', line)
    words = []
    _WORD_SPLIT = re.compile(b"([.,!?\"'-<>:;)(])")
    _DIGIT_RE = re.compile(r"\d")
    for fragment in line.strip().lower().split():
        for token in re.split(_WORD_SPLIT, fragment):
            if not token:
                continue
            if normalize_digits:
                token = re.sub(_DIGIT_RE, b'#', token)
            words.append(token)
    return words

def build_vocab(filename, normalize_digits=True):
    in_path = os.path.join(config.PROCESSED_PATH, filename)
    out_path = os.path.join(config.PROCESSED_PATH, 'vocab.{}'.format(filename[-3:]))

    vocab = {}
    with open(in_path, 'rb') as f:
        for line in f.readlines():
            for token in basic_tokenizer(line):
                if not token in vocab:
                    vocab[token] = 0
                vocab[token] += 1

    sorted_vocab = sorted(vocab, key=vocab.get, reverse=True)
    with open(out_path, 'wb') as f:
        # special tokens, in the order of PAD_ID, UNK_ID, START_ID, EOS_ID in config.py
        f.write('<pad>' + '\n')
        f.write('<unk>' + '\n')
        f.write('<s>' + '\n')
        f.write('<\s>' + '\n')
        index = 4
        for word in sorted_vocab:
            if vocab[word] < config.THRESHOLD:
                # record the vocabulary size in config.py once the rare words begin
                with open('config.py', 'ab') as cf:
                    if filename[-3:] == 'enc':
                        cf.write('ENC_VOCAB = ' + str(index) + '\n')
                    else:
                        cf.write('DEC_VOCAB = ' + str(index) + '\n')
                break
            f.write(word + '\n')
            index += 1

def load_vocab(vocab_path):
    with open(vocab_path, 'rb') as f:
        words = f.read().splitlines()
    return words, {words[i]: i for i in range(len(words))}

def sentence2id(vocab, line):
    return [vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)]

def token2id(data, mode):
    """ Convert all the tokens in the data into their corresponding
    index in the vocabulary. """
    vocab_path = 'vocab.' + mode
    in_path = data + '.' + mode
    out_path = data + '_ids.' + mode

    _, vocab = load_vocab(os.path.join(config.PROCESSED_PATH, vocab_path))
    in_file = open(os.path.join(config.PROCESSED_PATH, in_path), 'rb')
    out_file = open(os.path.join(config.PROCESSED_PATH, out_path), 'wb')

    lines = in_file.read().splitlines()
    for line in lines:
        if mode == 'dec': # we only care about '<s>' and '<\s>' in decoder
            ids = [vocab['<s>']]
        else:
            ids = []
        ids.extend(sentence2id(vocab, line))
        # ids.extend([vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)])
        if mode == 'dec':
            ids.append(vocab['<\s>'])
        out_file.write(' '.join(str(id_) for id_ in ids) + '\n')

def prepare_raw_data():
    print('Preparing raw data into train set and test set ...')
    id2line = get_lines()
    convos = get_convos()
    questions, answers = question_answers(id2line, convos)
    prepare_dataset(questions, answers)

def process_data():
    print('Preparing data to be model-ready ...')
    build_vocab('train.enc')
    build_vocab('train.dec')
    token2id('train', 'enc')
    token2id('train', 'dec')
    token2id('test', 'enc')
    token2id('test', 'dec')

def load_data(enc_filename, dec_filename, max_training_size=None):
    encode_file = open(os.path.join(config.PROCESSED_PATH, enc_filename), 'rb')
    decode_file = open(os.path.join(config.PROCESSED_PATH, dec_filename), 'rb')
    encode, decode = encode_file.readline(), decode_file.readline()
    data_buckets = [[] for _ in config.BUCKETS]
    i = 0
    while encode and decode:
        if (i + 1) % 10000 == 0:
            print("Bucketing conversation number", i)
        encode_ids = [int(id_) for id_ in encode.split()]
        decode_ids = [int(id_) for id_ in decode.split()]
        for bucket_id, (encode_max_size, decode_max_size) in enumerate(config.BUCKETS):
            if len(encode_ids) <= encode_max_size and len(decode_ids) <= decode_max_size:
                data_buckets[bucket_id].append([encode_ids, decode_ids])
                break
        encode, decode = encode_file.readline(), decode_file.readline()
        i += 1
    return data_buckets

def _pad_input(input_, size):
    return input_ + [config.PAD_ID] * (size - len(input_))

def _reshape_batch(inputs, size, batch_size):
    """ Create batch-major inputs. Batch inputs are just re-indexed inputs
    """
    batch_inputs = []
    for length_id in range(size):
        batch_inputs.append(np.array([inputs[batch_id][length_id]
                                      for batch_id in range(batch_size)], dtype=np.int32))
    return batch_inputs

def get_batch(data_bucket, bucket_id, batch_size=1):
    """ Return one batch to feed into the model """
    # only pad to the max length of the bucket
    encoder_size, decoder_size = config.BUCKETS[bucket_id]
    encoder_inputs, decoder_inputs = [], []

    for _ in range(batch_size):
        encoder_input, decoder_input = random.choice(data_bucket)
        # pad both encoder and decoder, reverse the encoder
        encoder_inputs.append(list(reversed(_pad_input(encoder_input, encoder_size))))
        decoder_inputs.append(_pad_input(decoder_input, decoder_size))

    # now we create batch-major vectors from the data selected above.
    batch_encoder_inputs = _reshape_batch(encoder_inputs, encoder_size, batch_size)
    batch_decoder_inputs = _reshape_batch(decoder_inputs, decoder_size, batch_size)

    # create decoder_masks to be 0 for decoders that are padding.
    batch_masks = []
    for length_id in range(decoder_size):
        batch_mask = np.ones(batch_size, dtype=np.float32)
        for batch_id in range(batch_size):
            # we set mask to 0 if the corresponding target is a PAD symbol.
            # the corresponding decoder is decoder_input shifted by 1 forward.
            if length_id < decoder_size - 1:
                target = decoder_inputs[batch_id][length_id + 1]
            if length_id == decoder_size - 1 or target == config.PAD_ID:
                batch_mask[batch_id] = 0.0
        batch_masks.append(batch_mask)
    return batch_encoder_inputs, batch_decoder_inputs, batch_masks

if __name__ == '__main__':
    prepare_raw_data()
    process_data()

================================================
FILE: 2017/assignments/chatbot/model.py
================================================
""" A neural chatbot using sequence to sequence model with attentional decoder.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the code to build the model

See readme.md for instruction on how to run the starter code.
"""
from __future__ import print_function

import time

import numpy as np
import tensorflow as tf

import config

class ChatBotModel(object):
    def __init__(self, forward_only, batch_size):
        """forward_only: if set, we do not construct the backward pass in the model.
        """
        print('Initialize new model')
        self.fw_only = forward_only
        self.batch_size = batch_size

    def _create_placeholders(self):
        # Feeds for inputs. It's a list of placeholders
        print('Create placeholders')
        self.encoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='encoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][0])]
        self.decoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='decoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][1] + 1)]
        self.decoder_masks = [tf.placeholder(tf.float32, shape=[None], name='mask{}'.format(i))
                              for i in range(config.BUCKETS[-1][1] + 1)]

        # Our targets are decoder inputs shifted by one (to ignore <s> symbol)
        self.targets = self.decoder_inputs[1:]

    def _inference(self):
        print('Create inference')
        # If we use sampled softmax, we need an output projection.
        # Sampled softmax only makes sense if we sample less than vocabulary size.
        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:
            w = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])
            b = tf.get_variable('proj_b', [config.DEC_VOCAB])
            self.output_projection = (w, b)

        def sampled_loss(inputs, labels):
            labels = tf.reshape(labels, [-1, 1])
            return tf.nn.sampled_softmax_loss(tf.transpose(w), b, inputs, labels,
                                              config.NUM_SAMPLES, config.DEC_VOCAB)
        self.softmax_loss_function = sampled_loss

        single_cell = tf.nn.rnn_cell.GRUCell(config.HIDDEN_SIZE)
        self.cell = tf.nn.rnn_cell.MultiRNNCell([single_cell] * config.NUM_LAYERS)

    def _create_loss(self):
        print('Creating loss... \nIt might take a couple of minutes depending on how many buckets you have.')
        start = time.time()
        def _seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
            return tf.nn.seq2seq.embedding_attention_seq2seq(
                    encoder_inputs, decoder_inputs, self.cell,
                    num_encoder_symbols=config.ENC_VOCAB,
                    num_decoder_symbols=config.DEC_VOCAB,
                    embedding_size=config.HIDDEN_SIZE,
                    output_projection=self.output_projection,
                    feed_previous=do_decode)

        if self.fw_only:
            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
                                        self.encoder_inputs,
                                        self.decoder_inputs,
                                        self.targets,
                                        self.decoder_masks,
                                        config.BUCKETS,
                                        lambda x, y: _seq2seq_f(x, y, True),
                                        softmax_loss_function=self.softmax_loss_function)
            # If we use output projection, we need to project outputs for decoding.
            if self.output_projection:
                for bucket in range(len(config.BUCKETS)):
                    self.outputs[bucket] = [tf.matmul(output,
                                            self.output_projection[0]) + self.output_projection[1]
                                            for output in self.outputs[bucket]]
        else:
            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
                                        self.encoder_inputs,
                                        self.decoder_inputs,
                                        self.targets,
                                        self.decoder_masks,
                                        config.BUCKETS,
                                        lambda x, y: _seq2seq_f(x, y, False),
                                        softmax_loss_function=self.softmax_loss_function)
        print('Time:', time.time() - start)

    def _creat_optimizer(self):
        print('Create optimizer... \nIt might take a couple of minutes depending on how many buckets you have.')
        with tf.variable_scope('training') as scope:
            self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

            if not self.fw_only:
                self.optimizer = tf.train.GradientDescentOptimizer(config.LR)
                trainables = tf.trainable_variables()
                self.gradient_norms = []
                self.train_ops = []
                start = time.time()
                for bucket in range(len(config.BUCKETS)):
                    clipped_grads, norm = tf.clip_by_global_norm(tf.gradients(self.losses[bucket],
                                                                              trainables),
                                                                 config.MAX_GRAD_NORM)
                    self.gradient_norms.append(norm)
                    self.train_ops.append(self.optimizer.apply_gradients(zip(clipped_grads, trainables),
                                                                         global_step=self.global_step))
                    print('Creating opt for bucket {} took {} seconds'.format(bucket, time.time() - start))
                    start = time.time()

    def _create_summary(self):
        pass

    def build_graph(self):
        self._create_placeholders()
        self._inference()
        self._create_loss()
        self._creat_optimizer()
        self._create_summary()

================================================
FILE: 2017/assignments/chatbot/output_convo.txt
================================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ really?
BOT ++++ i don ' t know what to say .
HUMAN ++++ how're you?
BOT ++++ i ' m fine .
HUMAN ++++ where're you from?
BOT ++++ california .
HUMAN ++++ which city?
BOT ++++ let ' s talk about something else .
HUMAN ++++ you don't like california?
BOT ++++ oh . . .
HUMAN ++++ why not?
BOT ++++ i can ' t .
HUMAN ++++ how old are you?
BOT ++++ eighteen .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ what's your name?
BOT ++++ what ?
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ prisoner .
HUMAN ++++ ha ha you really don't want to tell your name
BOT ++++ i ' m not going to do it .
HUMAN ++++ how old are you?
BOT ++++ can you guess ?
HUMAN ++++ eighteen?
BOT ++++ in the bathroom .
HUMAN ++++ in the bathroom?
BOT ++++ in the next time .
HUMAN ++++ why are you talking so funny?
BOT ++++ i ' m looking .
HUMAN ++++ where are you from?
BOT ++++ general .
HUMAN ++++ where are you from?
BOT ++++ how ?
HUMAN ++++ do you like the us?
BOT ++++ i ' m sure .
HUMAN ++++ do you vote?
BOT ++++ i ' m afraid .
HUMAN ++++ what are you afraid of?
BOT ++++ i ' m looking .
HUMAN ++++ what are you looking for?
BOT ++++ can you guess ?
HUMAN ++++ no
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm chip
BOT ++++ please .
HUMAN ++++ please what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ do you like california?
BOT ++++ i ' m okay .
HUMAN ++++ i like california a lot
BOT ++++ of what ?
HUMAN ++++ i live here
BOT ++++ and what is it ?
HUMAN ++++ california
BOT ++++ i don ' t know what to say .
HUMAN ++++ i'm a student here
BOT ++++ and get it out of here .
HUMAN ++++ what do you like to do in your free time?
BOT ++++ and what about the gun ?
HUMAN ++++ do you have a gun?
BOT ++++ yes .
HUMAN ++++ you're dangerous!
BOT ++++ i ' m not .
HUMAN ++++ do you go hunting?
BOT ++++ i can do .
HUMAN ++++ what do you hunt?
BOT ++++ virgin .
HUMAN ++++ that's very bad
BOT ++++ and you in love ?
HUMAN ++++ no i'm not
BOT ++++ please - -
HUMAN ++++ okay, i am
BOT ++++ and you are good .
HUMAN ++++ yeah i am good too.
BOT ++++ good good to you .
HUMAN ++++ are you in love?
BOT ++++ i ' m ready .
HUMAN ++++ ready for what?
BOT ++++ for what ?
HUMAN ++++ for what? BOT ++++ i don ' t know what to say . HUMAN ++++ tell me BOT ++++ i can do it . HUMAN ++++ okay, tell me what you're ready for BOT ++++ i ' m doing you . HUMAN ++++ no you're not BOT ++++ please ! HUMAN ++++ this is a strange conversation BOT ++++ i don ' t know what to say . HUMAN ++++ are you a guy? BOT ++++ of what ? HUMAN ++++ are you a boy? BOT ++++ i ' m fine . HUMAN ++++ are you a man? BOT ++++ i ' m a little . HUMAN ++++ what's your name? BOT ++++ chris - HUMAN ++++ chris who? BOT ++++ cornelius . HUMAN ++++ so i guess you're a guy. BOT ++++ i am . HUMAN ++++ okay BOT ++++ okay ? HUMAN ++++ do you have a girlfriend? BOT ++++ i ' m sure . HUMAN ++++ what's her name? BOT ++++ let ' s talk about something else . ============================================= HUMAN ++++ hi chris BOT ++++ hi HUMAN ++++ how're you today? BOT ++++ okay . HUMAN ++++ the weather is nice today BOT ++++ i don ' t know what to say . HUMAN ++++ just say the weather is nice BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ hi BOT ++++ hi HUMAN ++++ how're you today? BOT ++++ okay . HUMAN ++++ just okay? BOT ++++ i ' m okay . HUMAN ++++ what's on your mind? BOT ++++ i ' m in my room . HUMAN ++++ doing what? BOT ++++ i don ' t know what to say . HUMAN ++++ you're boring. BOT ++++ i ' m not . HUMAN ++++ prove it BOT ++++ and what ? HUMAN ++++ what what? BOT ++++ sorry , i ' m still very stupid . HUMAN ++++ it's okay. we're all stupid sometimes. BOT ++++ ' s good to see . HUMAN ++++ what's good to see? BOT ++++ i ' m in . HUMAN ++++ you're in your room? BOT ++++ i ' m here . HUMAN ++++ where's your room? BOT ++++ i ' m in my room . HUMAN ++++ are you alone? BOT ++++ i ' m fine . HUMAN ++++ are you by yourself? BOT ++++ in the morning . HUMAN ++++ what time is it? BOT ++++ # # # # # # # . ============================================= HUMAN ++++ hi BOT ++++ hi . what ' s your name ? HUMAN ++++ i'm marie BOT ++++ you are . 
HUMAN ++++ yes i am BOT ++++ and you are my friend . HUMAN ++++ yes you are BOT ++++ and you are my best . HUMAN ++++ that's so sweet BOT ++++ please . HUMAN ++++ do you know what time it is? BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ what time is it? BOT ++++ 5 9 6 9 1 4 5 . HUMAN ++++ it doesn't make any sense BOT ++++ are you ? HUMAN ++++ i do BOT ++++ and you are ? HUMAN ++++ i'm a human. BOT ++++ you in . HUMAN ++++ are you in trouble? BOT ++++ i ' m in . HUMAN ++++ what trouble? BOT ++++ i ' m in love . HUMAN ++++ wow that's nice. BOT ++++ like that ? HUMAN ++++ yeah BOT ++++ i don ' t know what to say . HUMAN ++++ who are you? BOT ++++ i don ' t understand . HUMAN ++++ who are you? BOT ++++ no idea . HUMAN ++++ what's your name? BOT ++++ chris - HUMAN ++++ chris who? BOT ++++ cornelius . HUMAN ++++ where are you from? BOT ++++ california . HUMAN ++++ are you american? BOT ++++ i ' m fine . HUMAN ++++ of course you're fine. BOT ++++ and you ? HUMAN ++++ i'm okay. BOT ++++ you are . HUMAN ++++ yes, i am. BOT ++++ and you are my friend . HUMAN ++++ yes, i am. BOT ++++ and you are my friend . HUMAN ++++ that's correct. BOT ++++ and you ' re in trouble ? HUMAN ++++ no, i'm not. why did you think so? BOT ++++ because it ' s a good idea . HUMAN ++++ what is your good idea? BOT ++++ i ' ll be fine . ============================================= HUMAN ++++ hello BOT ++++ and you are in trouble ? HUMAN ++++ what did you say that? BOT ++++ ' s life . HUMAN ++++ i'm not in trouble. BOT ++++ what the hell is that ? HUMAN ++++ okay, what trouble? BOT ++++ i ' m looking . HUMAN ++++ why did you say i'm in trouble? BOT ++++ let ' s talk about something else . HUMAN ++++ what do you want to talk about? BOT ++++ i can ' t . HUMAN ++++ what can't? BOT ++++ i ' ll have a drink . HUMAN ++++ okay, me too. BOT ++++ you were here ? HUMAN ++++ i'm still here. BOT ++++ what is it ? HUMAN ++++ i thought you wanted a drink. 
BOT ++++ i ' m not . HUMAN ++++ you just said so! BOT ++++ i was wrong . HUMAN ++++ it's fine. BOT ++++ okay ? HUMAN ++++ yeah BOT ++++ i don ' t know what to say . HUMAN ++++ do you drink a lot? BOT ++++ i can do it . HUMAN ++++ do you drink often? BOT ++++ in the time . HUMAN ++++ that doesn't make any sense. BOT ++++ are you ? HUMAN ++++ what do you drink? BOT ++++ ' s in a drink . HUMAN ++++ do you drink alcohol? BOT ++++ in time . HUMAN ++++ i like beer. BOT ++++ oh . . . HUMAN ++++ don't you like beer? BOT ++++ i don ' t know what to say . HUMAN ++++ where do you live? BOT ++++ up . HUMAN ++++ up where? BOT ++++ in the hospital . HUMAN ++++ i'm sorry. are you okay? BOT ++++ i ' m fine . HUMAN ++++ why are you in the hospital? BOT ++++ and what ? HUMAN ++++ how many words do you know? BOT ++++ i ' d do it . HUMAN ++++ how many? BOT ++++ i ' m not . HUMAN ++++ how many? BOT ++++ i ' m not . HUMAN ++++ how many? BOT ++++ i ' m not . HUMAN ++++ whatever BOT ++++ i don ' t know what to say . HUMAN ++++ you're boring. BOT ++++ i ' m not . HUMAN ++++ yes you are. BOT ++++ and you are my friend . HUMAN ++++ no i'm not. BOT ++++ please . HUMAN ++++ okay fine. i'm your friend. BOT ++++ you ' re my mother ? HUMAN ++++ i have no son like you. BOT ++++ i don ' t know what to say . HUMAN ++++ i have no son. BOT ++++ i see you . HUMAN ++++ i have no children. BOT ++++ you are . HUMAN ++++ omg you're so dumb. BOT ++++ i ' m not . HUMAN ++++ you're stupid! BOT ++++ i ' m not . HUMAN ++++ do you always talk like this? BOT ++++ of course . HUMAN ++++ how old are you? BOT ++++ eighteen . HUMAN ++++ you're big enough. BOT ++++ i ' m not . HUMAN ++++ are you an adult? BOT ++++ i ' m going . HUMAN ++++ you're going to be an adult? BOT ++++ i ' m looking for you . HUMAN ++++ i'm in my room. BOT ++++ and your mother ? HUMAN ++++ she's in vietnam. BOT ++++ you ' re here . HUMAN ++++ yes i'm here. BOT ++++ and what are you talking about ? HUMAN ++++ you asked me where i am. 
BOT ++++ and you are your friend . HUMAN ++++ i need to go now. BOT ++++ i will not . HUMAN ++++ that's fine. you can stay. i'm leaving. BOT ++++ oh . . . HUMAN ++++ bye BOT ++++ i don ' t know what to say . HUMAN ++++ goodbye BOT ++++ i don ' t know what to say . HUMAN ++++ bye bye BOT ++++ okay ? HUMAN ++++ see you later BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ hi BOT ++++ hi . what ' s your name ? HUMAN ++++ i'm maria BOT ++++ please . HUMAN ++++ i'm jackson BOT ++++ you are . HUMAN ++++ i'm jack BOT ++++ let ' s talk about something else . HUMAN ++++ i have no go BOT ++++ you are ? HUMAN ++++ i have to go BOT ++++ and what do you mean ? HUMAN ++++ i'm leaving. BOT ++++ what ? HUMAN ++++ yeah, bye BOT ++++ bye . have a nice day ! ============================================= HUMAN ++++ that's very bad. BOT ++++ i ' ve got to work . HUMAN ++++ what do you do? BOT ++++ i ' ll be there . HUMAN ++++ where do you work? BOT ++++ in the trunk . HUMAN ++++ are you a machenics? BOT ++++ i ' m not . HUMAN ++++ what are you? BOT ++++ no idea . ============================================= ================================================ FILE: 2017/assignments/exercises/e01.py ================================================ """ Simple exercises to get used to TensorFlow API You should thoroughly test your code """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import tensorflow as tf sess = tf.InteractiveSession() ############################################################################### # 1a: Create two random 0-d tensors x and y of any distribution. # Create a TensorFlow object that returns x + y if x > y, and x - y otherwise. # Hint: look up tf.cond() # I do the first problem for you ############################################################################### x = tf.random_uniform([]) # Empty array as shape creates a scalar. 
y = tf.random_uniform([]) out = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y)) print(sess.run(out)) ############################################################################### # 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1). # Return x + y if x < y, x - y if x > y, 0 otherwise. # Hint: Look up tf.case(). ############################################################################### # YOUR CODE ############################################################################### # 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] # and y as a tensor of zeros with the same shape as x. # Return a boolean tensor that yields Trues if x equals y element-wise. # Hint: Look up tf.equal(). ############################################################################### # YOUR CODE ############################################################################### # 1d: Create the tensor x of value # [29.05088806, 27.61298943, 31.19073486, 29.35532951, # 30.97266006, 26.67541885, 38.08450317, 20.74983215, # 34.94445419, 34.45999146, 29.06485367, 36.01657104, # 27.88236427, 20.56035233, 30.20379066, 29.51215172, # 33.71149445, 28.59134293, 36.05556488, 28.66994858]. # Get the indices of elements in x whose values are greater than 30. # Hint: Use tf.where(). # Then extract elements whose values are greater than 30. # Hint: Use tf.gather(). ############################################################################### # YOUR CODE ############################################################################### # 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1, # 2, ..., 6 # Hint: Use tf.range() and tf.diag(). ############################################################################### # YOUR CODE ############################################################################### # 1f: Create a random 2-d tensor of size 10 x 10 from any distribution. # Calculate its determinant. 
# Hint: Look at tf.matrix_determinant(). ############################################################################### # YOUR CODE ############################################################################### # 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9]. # Return the unique elements in x # Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple. ############################################################################### # YOUR CODE ############################################################################### # 1h: Create two tensors x and y of shape 300 from any normal distribution, # as long as they are from the same distribution. # Use tf.cond() to return: # - The mean squared error of (x - y) if the average of all elements in (x - y) # is negative, or # - The sum of absolute value of all elements in the tensor (x - y) otherwise. # Hint: see the Huber loss function in the lecture slides 3. ############################################################################### # YOUR CODE ================================================ FILE: 2017/assignments/exercises/e01_sol.py ================================================ """ Solution to simple TensorFlow exercises For the problems """ import tensorflow as tf ############################################################################### # 1a: Create two random 0-d tensors x and y of any distribution. # Create a TensorFlow object that returns x + y if x > y, and x - y otherwise. # Hint: look up tf.cond() # I do the first problem for you ############################################################################### x = tf.random_uniform([]) # Empty array as shape creates a scalar. y = tf.random_uniform([]) out = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y)) ############################################################################### # 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1). 
# Return x + y if x < y, x - y if x > y, 0 otherwise. # Hint: Look up tf.case(). ############################################################################### x = tf.random_uniform([], -1, 1, dtype=tf.float32) y = tf.random_uniform([], -1, 1, dtype=tf.float32) out = tf.case({tf.less(x, y): lambda: tf.add(x, y), tf.greater(x, y): lambda: tf.subtract(x, y)}, default=lambda: tf.constant(0.0), exclusive=True) sess = tf.InteractiveSession() print(sess.run(out)) ############################################################################### # 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] # and y as a tensor of zeros with the same shape as x. # Return a boolean tensor that yields Trues if x equals y element-wise. # Hint: Look up tf.equal(). ############################################################################### x = tf.constant([[0, -2, -1], [0, 1, 2]]) y = tf.zeros_like(x) out = tf.equal(x, y) ############################################################################### # 1d: Create the tensor x of value # [29.05088806, 27.61298943, 31.19073486, 29.35532951, # 30.97266006, 26.67541885, 38.08450317, 20.74983215, # 34.94445419, 34.45999146, 29.06485367, 36.01657104, # 27.88236427, 20.56035233, 30.20379066, 29.51215172, # 33.71149445, 28.59134293, 36.05556488, 28.66994858]. # Get the indices of elements in x whose values are greater than 30. # Hint: Use tf.where(). # Then extract elements whose values are greater than 30. # Hint: Use tf.gather(). 
############################################################################### x = tf.constant([29.05088806, 27.61298943, 31.19073486, 29.35532951, 30.97266006, 26.67541885, 38.08450317, 20.74983215, 34.94445419, 34.45999146, 29.06485367, 36.01657104, 27.88236427, 20.56035233, 30.20379066, 29.51215172, 33.71149445, 28.59134293, 36.05556488, 28.66994858]) indices = tf.where(x > 30) out = tf.gather(x, indices) ############################################################################### # 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1, # 2, ..., 6 # Hint: Use tf.range() and tf.diag(). ############################################################################### values = tf.range(1, 7) out = tf.diag(values) ############################################################################### # 1f: Create a random 2-d tensor of size 10 x 10 from any distribution. # Calculate its determinant. # Hint: Look at tf.matrix_determinant(). ############################################################################### m = tf.random_normal([10, 10], mean=10, stddev=1) out = tf.matrix_determinant(m) ############################################################################### # 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9]. # Return the unique elements in x # Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple. ############################################################################### x = tf.constant([5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9]) unique_values, indices = tf.unique(x) ############################################################################### # 1h: Create two tensors x and y of shape 300 from any normal distribution, # as long as they are from the same distribution. 
# Use tf.cond() to return: # - The mean squared error of (x - y) if the average of all elements in (x - y) # is negative, or # - The sum of absolute value of all elements in the tensor (x - y) otherwise. 
# Hint: see the Huber loss function in the lecture slides 3. ############################################################################### x = tf.random_normal([300], mean=5, stddev=1) y = tf.random_normal([300], mean=5, stddev=1) average = tf.reduce_mean(x - y) def f1(): return tf.reduce_mean(tf.square(x - y)) def f2(): return tf.reduce_sum(tf.abs(x - y)) out = tf.cond(average < 0, f1, f2) ================================================ FILE: 2017/assignments/style_transfer/readme.md ================================================ For detailed instruction, you should read the assignment handout on the course website: http://web.stanford.edu/class/cs20si/assignments/a2.pdf ================================================ FILE: 2017/assignments/style_transfer/style_transfer.py ================================================ """ An implementation of the paper "A Neural Algorithm of Artistic Style" by Gatys et al. in TensorFlow. Author: Chip Huyen (huyenn@stanford.edu) Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" For more details, please read the assignment handout: http://web.stanford.edu/class/cs20si/assignments/a2.pdf """ from __future__ import print_function import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import numpy as np import tensorflow as tf import vgg_model import utils # parameters to manage experiments STYLE = 'guernica' CONTENT = 'deadpool' STYLE_IMAGE = 'styles/' + STYLE + '.jpg' CONTENT_IMAGE = 'content/' + CONTENT + '.jpg' IMAGE_HEIGHT = 250 IMAGE_WIDTH = 333 NOISE_RATIO = 0.6 # percentage of weight of the noise for intermixing with the content image CONTENT_WEIGHT = 0.01 STYLE_WEIGHT = 1 # Layers used for style features. You can change this. STYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1'] W = [0.5, 1.0, 1.5, 3.0, 4.0] # give more weights to deeper layers. # Layer used for content features. You can change this. 
CONTENT_LAYER = 'conv4_2' ITERS = 300 LR = 2.0 MEAN_PIXELS = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3)) """ MEAN_PIXELS is defined according to description on their github: https://gist.github.com/ksimonyan/211839e770f7b538e2d8 'In the paper, the model is denoted as the configuration D trained with scale jittering. The input images should be zero-centered by mean pixel (rather than mean image) subtraction. Namely, the following BGR values should be subtracted: [103.939, 116.779, 123.68].' """ # VGG-19 parameters file VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat' VGG_MODEL = 'imagenet-vgg-verydeep-19.mat' EXPECTED_BYTES = 534904783 def _create_content_loss(p, f): """ Calculate the loss between the feature representation of the content image and the generated image. Inputs: p, f are just P, F in the paper (read the assignment handout if you're confused) Note: we won't use the coefficient 0.5 as defined in the paper but the coefficient as defined in the assignment handout. Output: the content loss """ return tf.reduce_sum((f - p) ** 2) / (4.0 * p.size) def _gram_matrix(F, N, M): """ Create and return the gram matrix for tensor F Hint: you'll first have to reshape F """ F = tf.reshape(F, (M, N)) return tf.matmul(tf.transpose(F), F) def _single_style_loss(a, g): """ Calculate the style loss at a certain layer Inputs: a is the feature representation of the real image g is the feature representation of the generated image Output: the style loss at a certain layer (which is E_l in the paper) Hint: 1. you'll have to use the function _gram_matrix() 2. we'll use the same coefficient for style loss as in the paper 3. 
a and g are feature representation, not gram matrices """ N = a.shape[3] # number of filters M = a.shape[1] * a.shape[2] # height times width of the feature map A = _gram_matrix(a, N, M) G = _gram_matrix(g, N, M) return tf.reduce_sum((G - A) ** 2 / ((2 * N * M) ** 2)) def _create_style_loss(A, model): """ Return the total style loss """ n_layers = len(STYLE_LAYERS) E = [_single_style_loss(A[i], model[STYLE_LAYERS[i]]) for i in range(n_layers)] ############################### ## TO DO: return total style loss return sum([W[i] * E[i] for i in range(n_layers)]) ############################### def _create_losses(model, input_image, content_image, style_image): with tf.variable_scope('loss') as scope: with tf.Session() as sess: sess.run(input_image.assign(content_image)) # assign content image to the input variable p = sess.run(model[CONTENT_LAYER]) content_loss = _create_content_loss(p, model[CONTENT_LAYER]) with tf.Session() as sess: sess.run(input_image.assign(style_image)) A = sess.run([model[layer_name] for layer_name in STYLE_LAYERS]) style_loss = _create_style_loss(A, model) ########################################## ## TO DO: create total loss. ## Hint: don't forget the content loss and style loss weights total_loss = CONTENT_WEIGHT * content_loss + STYLE_WEIGHT * style_loss ########################################## return content_loss, style_loss, total_loss def _create_summary(model): """ Create summary ops necessary Hint: don't forget to merge them """ with tf.name_scope('summaries'): tf.summary.scalar('content loss', model['content_loss']) tf.summary.scalar('style loss', model['style_loss']) tf.summary.scalar('total loss', model['total_loss']) tf.summary.histogram('histogram content loss', model['content_loss']) tf.summary.histogram('histogram style loss', model['style_loss']) tf.summary.histogram('histogram total loss', model['total_loss']) return tf.summary.merge_all() def train(model, generated_image, initial_image): """ Train your model. 
Don't forget to create folders for checkpoints and outputs. """ skip_step = 1 with tf.Session() as sess: saver = tf.train.Saver() ############################### ## TO DO: ## 1. initialize your variables ## 2. create writer to write your graph sess.run(tf.global_variables_initializer()) writer = tf.summary.FileWriter('graphs', sess.graph) ############################### sess.run(generated_image.assign(initial_image)) ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint')) if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) initial_step = model['global_step'].eval() start_time = time.time() for index in range(initial_step, ITERS): if index >= 5 and index < 20: skip_step = 10 elif index >= 20: skip_step = 20 sess.run(model['optimizer']) if (index + 1) % skip_step == 0: ############################### ## TO DO: obtain generated image and loss gen_image, total_loss, summary = sess.run([generated_image, model['total_loss'], model['summary_op']]) ############################### gen_image = gen_image + MEAN_PIXELS writer.add_summary(summary, global_step=index) print('Step {}\n Sum: {:5.1f}'.format(index + 1, np.sum(gen_image))) print(' Loss: {:5.1f}'.format(total_loss)) print(' Time: {}'.format(time.time() - start_time)) start_time = time.time() filename = 'outputs/%d.png' % (index) utils.save_image(filename, gen_image) if (index + 1) % 20 == 0: saver.save(sess, 'checkpoints/style_transfer', index) def main(): with tf.variable_scope('input') as scope: # use variable instead of placeholder because we're training the initial image to make it # look like both the content image and the style image input_image = tf.Variable(np.zeros([1, IMAGE_HEIGHT, IMAGE_WIDTH, 3]), dtype=tf.float32) utils.download(VGG_DOWNLOAD_LINK, VGG_MODEL, EXPECTED_BYTES) utils.make_dir('checkpoints') utils.make_dir('outputs') model = vgg_model.load_vgg(VGG_MODEL, input_image) model['global_step'] = tf.Variable(0, 
dtype=tf.int32, trainable=False, name='global_step') content_image = utils.get_resized_image(CONTENT_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH) content_image = content_image - MEAN_PIXELS style_image = utils.get_resized_image(STYLE_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH) style_image = style_image - MEAN_PIXELS model['content_loss'], model['style_loss'], model['total_loss'] = _create_losses(model, input_image, content_image, style_image) ############################### ## TO DO: create optimizer model['optimizer'] = tf.train.AdamOptimizer(LR).minimize(model['total_loss'], global_step=model['global_step']) ############################### model['summary_op'] = _create_summary(model) initial_image = utils.generate_noise_image(content_image, IMAGE_HEIGHT, IMAGE_WIDTH, NOISE_RATIO) train(model, input_image, initial_image) if __name__ == '__main__': main() ================================================ FILE: 2017/assignments/style_transfer/utils.py ================================================ """ Utils needed for the implementation of the paper "A Neural Algorithm of Artistic Style" by Gatys et al. in TensorFlow. Author: Chip Huyen (huyenn@stanford.edu) Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" For more details, please read the assignment handout: http://web.stanford.edu/class/cs20si/assignments/a2.pdf """ from __future__ import print_function import os from PIL import Image, ImageOps import numpy as np import scipy.misc from six.moves import urllib def download(download_link, file_name, expected_bytes): """ Download the pretrained VGG-19 model if it's not already downloaded """ if os.path.exists(file_name): print("VGG-19 pre-trained model ready") return print("Downloading the VGG pre-trained model. 
This might take a while ...") file_name, _ = urllib.request.urlretrieve(download_link, file_name) file_stat = os.stat(file_name) if file_stat.st_size == expected_bytes: print('Successfully downloaded VGG-19 pre-trained model', file_name) else: raise Exception('File ' + file_name + ' might be corrupted. You should try downloading it with a browser.') def get_resized_image(img_path, height, width, save=True): image = Image.open(img_path) # it's because PIL is column major so you have to change place of width and height # this is stupid, i know image = ImageOps.fit(image, (width, height), Image.ANTIALIAS) if save: image_dirs = img_path.split('/') image_dirs[-1] = 'resized_' + image_dirs[-1] out_path = '/'.join(image_dirs) if not os.path.exists(out_path): image.save(out_path) image = np.asarray(image, np.float32) return np.expand_dims(image, 0) def generate_noise_image(content_image, height, width, noise_ratio=0.6): noise_image = np.random.uniform(-20, 20, (1, height, width, 3)).astype(np.float32) return noise_image * noise_ratio + content_image * (1 - noise_ratio) def save_image(path, image): # Output should add back the mean pixels we subtracted at the beginning image = image[0] # the image image = np.clip(image, 0, 255).astype('uint8') scipy.misc.imsave(path, image) def make_dir(path): """ Create a directory if there isn't one already. """ try: os.mkdir(path) except OSError: pass ================================================ FILE: 2017/assignments/style_transfer/vgg_model.py ================================================ """ Load VGGNet weights needed for the implementation of the paper "A Neural Algorithm of Artistic Style" by Gatys et al. in TensorFlow. 
Author: Chip Huyen (huyenn@stanford.edu) Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" For more details, please read the assignment handout: http://web.stanford.edu/class/cs20si/assignments/a2.pdf """ import numpy as np import tensorflow as tf import scipy.io def _weights(vgg_layers, layer, expected_layer_name): """ Return the weights and biases already trained by VGG """ W = vgg_layers[0][layer][0][0][2][0][0] b = vgg_layers[0][layer][0][0][2][0][1] layer_name = vgg_layers[0][layer][0][0][0][0] assert layer_name == expected_layer_name return W, b.reshape(b.size) def _conv2d_relu(vgg_layers, prev_layer, layer, layer_name): """ Return the Conv2D layer with RELU using the weights, biases from the VGG model at 'layer'. Inputs: vgg_layers: holding all the layers of VGGNet prev_layer: the output tensor from the previous layer layer: the index to current layer in vgg_layers layer_name: the string that is the name of the current layer. It's used to specify variable_scope. Output: relu applied on the convolution. Note that you first need to obtain W and b from vgg-layers using the function _weights() defined above. W and b returned from _weights() are numpy arrays, so you have to convert them to TF tensors using tf.constant. Note that you'll have to apply relu on the convolution. Hint for choosing strides size: for small images, you probably don't want to skip any pixel """ with tf.variable_scope(layer_name) as scope: W, b = _weights(vgg_layers, layer, layer_name) W = tf.constant(W, name='weights') b = tf.constant(b, name='bias') conv2d = tf.nn.conv2d(prev_layer, filter=W, strides=[1, 1, 1, 1], padding='SAME') return tf.nn.relu(conv2d + b) def _avgpool(prev_layer): """ Return the average pooling layer. The paper suggests that average pooling actually works better than max pooling. Input: prev_layer: the output tensor from the previous layer Output: the output of the tf.nn.avg_pool() function. 
    Hint for choosing strides and ksize: choose what you feel appropriate
    """
    return tf.nn.avg_pool(prev_layer, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME', name='avg_pool_')

def load_vgg(path, input_image):
    """ Load VGG into a TensorFlow model.
    Use a dictionary to hold the model instead of using a Python class
    """
    vgg = scipy.io.loadmat(path)
    vgg_layers = vgg['layers']

    graph = {}
    graph['conv1_1']  = _conv2d_relu(vgg_layers, input_image, 0, 'conv1_1')
    graph['conv1_2']  = _conv2d_relu(vgg_layers, graph['conv1_1'], 2, 'conv1_2')
    graph['avgpool1'] = _avgpool(graph['conv1_2'])
    graph['conv2_1']  = _conv2d_relu(vgg_layers, graph['avgpool1'], 5, 'conv2_1')
    graph['conv2_2']  = _conv2d_relu(vgg_layers, graph['conv2_1'], 7, 'conv2_2')
    graph['avgpool2'] = _avgpool(graph['conv2_2'])
    graph['conv3_1']  = _conv2d_relu(vgg_layers, graph['avgpool2'], 10, 'conv3_1')
    graph['conv3_2']  = _conv2d_relu(vgg_layers, graph['conv3_1'], 12, 'conv3_2')
    graph['conv3_3']  = _conv2d_relu(vgg_layers, graph['conv3_2'], 14, 'conv3_3')
    graph['conv3_4']  = _conv2d_relu(vgg_layers, graph['conv3_3'], 16, 'conv3_4')
    graph['avgpool3'] = _avgpool(graph['conv3_4'])
    graph['conv4_1']  = _conv2d_relu(vgg_layers, graph['avgpool3'], 19, 'conv4_1')
    graph['conv4_2']  = _conv2d_relu(vgg_layers, graph['conv4_1'], 21, 'conv4_2')
    graph['conv4_3']  = _conv2d_relu(vgg_layers, graph['conv4_2'], 23, 'conv4_3')
    graph['conv4_4']  = _conv2d_relu(vgg_layers, graph['conv4_3'], 25, 'conv4_4')
    graph['avgpool4'] = _avgpool(graph['conv4_4'])
    graph['conv5_1']  = _conv2d_relu(vgg_layers, graph['avgpool4'], 28, 'conv5_1')
    graph['conv5_2']  = _conv2d_relu(vgg_layers, graph['conv5_1'], 30, 'conv5_2')
    graph['conv5_3']  = _conv2d_relu(vgg_layers, graph['conv5_2'], 32, 'conv5_3')
    graph['conv5_4']  = _conv2d_relu(vgg_layers, graph['conv5_3'], 34, 'conv5_4')
    graph['avgpool5'] = _avgpool(graph['conv5_4'])

    return graph

================================================
FILE: 2017/assignments/style_transfer_starter/readme.md
================================================
For detailed instructions, you should read the assignment handout on the course website:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf

================================================
FILE: 2017/assignments/style_transfer_starter/style_transfer.py
================================================
""" An implementation of the paper "A Neural Algorithm of Artistic Style"
by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import tensorflow as tf

import vgg_model
import utils

# parameters to manage experiments
STYLE = 'guernica'
CONTENT = 'deadpool'
STYLE_IMAGE = 'styles/' + STYLE + '.jpg'
CONTENT_IMAGE = 'content/' + CONTENT + '.jpg'
IMAGE_HEIGHT = 250
IMAGE_WIDTH = 333
NOISE_RATIO = 0.6 # fraction of weight given to the noise when mixing it with the content image

# Layers used for style features. You can change this.
STYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
W = [0.5, 1.0, 1.5, 3.0, 4.0] # give more weight to deeper layers.

# Layer used for content features. You can change this.
CONTENT_LAYER = 'conv4_2'

ITERS = 300
LR = 2.0

SAVE_EVERY = 20

# RGB order; the quote below lists the same values in BGR order
MEAN_PIXELS = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))
""" MEAN_PIXELS is defined according to the description in the VGG authors' gist:
https://gist.github.com/ksimonyan/211839e770f7b538e2d8
'In the paper, the model is denoted as the configuration D trained with scale jittering.
The input images should be zero-centered by mean pixel (rather than mean image) subtraction.
Namely, the following BGR values should be subtracted: [103.939, 116.779, 123.68].'
"""

# VGG-19 parameters file
VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'
VGG_MODEL = 'imagenet-vgg-verydeep-19.mat'
EXPECTED_BYTES = 534904783

def _create_content_loss(p, f):
    """ Calculate the loss between the feature representation of the
    content image and the generated image.

    Inputs:
        p, f are just P, F in the paper
        (read the assignment handout if you're confused)
        Note: we won't use the coefficient 0.5 as defined in the paper
        but the coefficient as defined in the assignment handout.
    Output:
        the content loss

    """
    pass

def _gram_matrix(F, N, M):
    """ Create and return the gram matrix for tensor F
        Hint: you'll first have to reshape F
    """
    pass

def _single_style_loss(a, g):
    """ Calculate the style loss at a certain layer
    Inputs:
        a is the feature representation of the real (style) image
        g is the feature representation of the generated image
    Output:
        the style loss at a certain layer (which is E_l in the paper)

    Hint: 1. you'll have to use the function _gram_matrix()
        2. we'll use the same coefficient for style loss as in the paper
        3. a and g are feature representations, not gram matrices
    """
    pass

def _create_style_loss(A, model):
    """ Return the total style loss
    """
    n_layers = len(STYLE_LAYERS)
    E = [_single_style_loss(A[i], model[STYLE_LAYERS[i]]) for i in range(n_layers)]

    ###############################
    ## TO DO: return total style loss
    pass
    ###############################

def _create_losses(model, input_image, content_image, style_image):
    with tf.variable_scope('loss') as scope:
        with tf.Session() as sess:
            sess.run(input_image.assign(content_image)) # assign content image to the input variable
            p = sess.run(model[CONTENT_LAYER])
        content_loss = _create_content_loss(p, model[CONTENT_LAYER])

        with tf.Session() as sess:
            sess.run(input_image.assign(style_image))
            A = sess.run([model[layer_name] for layer_name in STYLE_LAYERS])
        style_loss = _create_style_loss(A, model)

        ##########################################
        ## TO DO: create total loss.
        ## Hint: don't forget the content loss and style loss weights

        ##########################################

    return content_loss, style_loss, total_loss

def _create_summary(model):
    """ Create the necessary summary ops
        Hint: don't forget to merge them
    """
    pass

def train(model, generated_image, initial_image):
    """ Train your model.
    Don't forget to create folders for checkpoints and outputs.
    """
    skip_step = 1
    with tf.Session() as sess:
        saver = tf.train.Saver()
        ###############################
        ## TO DO:
        ## 1. initialize your variables
        ## 2. create writer to write your graph
        ###############################
        sess.run(generated_image.assign(initial_image))
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        initial_step = model['global_step'].eval()

        start_time = time.time()
        for index in range(initial_step, ITERS):
            if index >= 5 and index < 20:
                skip_step = 10
            elif index >= 20:
                skip_step = 20

            sess.run(model['optimizer'])
            if (index + 1) % skip_step == 0:
                ###############################
                ## TO DO: obtain generated image and loss

                ###############################
                gen_image = gen_image + MEAN_PIXELS
                writer.add_summary(summary, global_step=index)
                print('Step {}\n Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))
                print(' Loss: {:5.1f}'.format(total_loss))
                print(' Time: {}'.format(time.time() - start_time))
                start_time = time.time()

                filename = 'outputs/%d.png' % (index)
                utils.save_image(filename, gen_image)

                if (index + 1) % SAVE_EVERY == 0:
                    saver.save(sess, 'checkpoints/style_transfer', index)

def main():
    with tf.variable_scope('input') as scope:
        # use variable instead of placeholder because we're training the initial image to make it
        # look like both the content image and the style image
        input_image = tf.Variable(np.zeros([1, IMAGE_HEIGHT, IMAGE_WIDTH, 3]), dtype=tf.float32)

    utils.download(VGG_DOWNLOAD_LINK, VGG_MODEL, EXPECTED_BYTES)
    utils.make_dir('checkpoints')
    utils.make_dir('outputs')
    model = vgg_model.load_vgg(VGG_MODEL, input_image)
    model['global_step'] = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    content_image = utils.get_resized_image(CONTENT_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)
    content_image = content_image - MEAN_PIXELS
    style_image = utils.get_resized_image(STYLE_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)
    style_image = style_image - MEAN_PIXELS

    model['content_loss'], model['style_loss'], model['total_loss'] = _create_losses(
        model, input_image, content_image, style_image)
    ###############################
    ## TO DO: create optimizer
    ## model['optimizer'] = ...
    ###############################
    model['summary_op'] = _create_summary(model)

    initial_image = utils.generate_noise_image(content_image, IMAGE_HEIGHT, IMAGE_WIDTH, NOISE_RATIO)
    train(model, input_image, initial_image)

if __name__ == '__main__':
    main()

================================================
FILE: 2017/assignments/style_transfer_starter/utils.py
================================================
""" Utils needed for the implementation of the paper "A Neural Algorithm of Artistic Style"
by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""
from __future__ import print_function

import os

from PIL import Image, ImageOps
import numpy as np
import scipy.misc
from six.moves import urllib

def download(download_link, file_name, expected_bytes):
    """ Download the pretrained VGG-19 model if it's not already downloaded """
    if os.path.exists(file_name):
        print("VGG-19 pre-trained model ready")
        return
    print("Downloading the VGG pre-trained model. This might take a while ...")
    file_name, _ = urllib.request.urlretrieve(download_link, file_name)
    file_stat = os.stat(file_name)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded VGG-19 pre-trained model', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')

def get_resized_image(img_path, height, width, save=True):
    image = Image.open(img_path)
    # PIL takes sizes as (width, height), so width and height are swapped here
    # relative to the (height, width) order used everywhere else
    image = ImageOps.fit(image, (width, height), Image.ANTIALIAS)
    if save:
        image_dirs = img_path.split('/')
        image_dirs[-1] = 'resized_' + image_dirs[-1]
        out_path = '/'.join(image_dirs)
        if not os.path.exists(out_path):
            image.save(out_path)
    image = np.asarray(image, np.float32)
    return np.expand_dims(image, 0)

def generate_noise_image(content_image, height, width, noise_ratio=0.6):
    noise_image = np.random.uniform(-20, 20,
                                    (1, height, width, 3)).astype(np.float32)
    return noise_image * noise_ratio + content_image * (1 - noise_ratio)

def save_image(path, image):
    # Output should add back the mean pixels we subtracted at the beginning
    image = image[0] # the image
    image = np.clip(image, 0, 255).astype('uint8')
    scipy.misc.imsave(path, image)

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

================================================
FILE: 2017/assignments/style_transfer_starter/vgg_model.py
================================================
""" Load VGGNet weights needed for the implementation of the paper
"A Neural Algorithm of Artistic Style" by Gatys et al. in TensorFlow.
Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""

import numpy as np
import tensorflow as tf
import scipy.io

def _weights(vgg_layers, layer, expected_layer_name):
    """ Return the weights and biases already trained by VGG
    """
    W = vgg_layers[0][layer][0][0][2][0][0]
    b = vgg_layers[0][layer][0][0][2][0][1]
    layer_name = vgg_layers[0][layer][0][0][0][0]
    assert layer_name == expected_layer_name
    return W, b.reshape(b.size)

def _conv2d_relu(vgg_layers, prev_layer, layer, layer_name):
    """ Return the Conv2D layer with RELU using the weights, biases from the VGG
    model at 'layer'.
    Inputs:
        vgg_layers: holding all the layers of VGGNet
        prev_layer: the output tensor from the previous layer
        layer: the index to current layer in vgg_layers
        layer_name: the string that is the name of the current layer.
                    It's used to specify variable_scope.

    Output:
        relu applied on the convolution.

    Note that you first need to obtain W and b from vgg_layers using the function
    _weights() defined above.
    W and b returned from _weights() are numpy arrays, so you have
    to convert them to TF tensors using tf.constant.
    Note that you'll have to apply relu on the convolution.
    Hint for choosing strides size:
        for small images, you probably don't want to skip any pixel
    """
    pass

def _avgpool(prev_layer):
    """ Return the average pooling layer. The paper suggests that average pooling
    actually works better than max pooling.
    Input:
        prev_layer: the output tensor from the previous layer

    Output:
        the output of the tf.nn.avg_pool() function.
    Hint for choosing strides and ksize: choose what you feel appropriate
    """
    pass

def load_vgg(path, input_image):
    """ Load VGG into a TensorFlow model.
    Use a dictionary to hold the model instead of using a Python class
    """
    vgg = scipy.io.loadmat(path)
    vgg_layers = vgg['layers']

    graph = {}
    graph['conv1_1']  = _conv2d_relu(vgg_layers, input_image, 0, 'conv1_1')
    graph['conv1_2']  = _conv2d_relu(vgg_layers, graph['conv1_1'], 2, 'conv1_2')
    graph['avgpool1'] = _avgpool(graph['conv1_2'])
    graph['conv2_1']  = _conv2d_relu(vgg_layers, graph['avgpool1'], 5, 'conv2_1')
    graph['conv2_2']  = _conv2d_relu(vgg_layers, graph['conv2_1'], 7, 'conv2_2')
    graph['avgpool2'] = _avgpool(graph['conv2_2'])
    graph['conv3_1']  = _conv2d_relu(vgg_layers, graph['avgpool2'], 10, 'conv3_1')
    graph['conv3_2']  = _conv2d_relu(vgg_layers, graph['conv3_1'], 12, 'conv3_2')
    graph['conv3_3']  = _conv2d_relu(vgg_layers, graph['conv3_2'], 14, 'conv3_3')
    graph['conv3_4']  = _conv2d_relu(vgg_layers, graph['conv3_3'], 16, 'conv3_4')
    graph['avgpool3'] = _avgpool(graph['conv3_4'])
    graph['conv4_1']  = _conv2d_relu(vgg_layers, graph['avgpool3'], 19, 'conv4_1')
    graph['conv4_2']  = _conv2d_relu(vgg_layers, graph['conv4_1'], 21, 'conv4_2')
    graph['conv4_3']  = _conv2d_relu(vgg_layers, graph['conv4_2'], 23, 'conv4_3')
    graph['conv4_4']  = _conv2d_relu(vgg_layers, graph['conv4_3'], 25, 'conv4_4')
    graph['avgpool4'] = _avgpool(graph['conv4_4'])
    graph['conv5_1']  = _conv2d_relu(vgg_layers, graph['avgpool4'], 28, 'conv5_1')
    graph['conv5_2']  = _conv2d_relu(vgg_layers, graph['conv5_1'], 30, 'conv5_2')
    graph['conv5_3']  = _conv2d_relu(vgg_layers, graph['conv5_2'], 32, 'conv5_3')
    graph['conv5_4']  = _conv2d_relu(vgg_layers, graph['conv5_3'], 34, 'conv5_4')
    graph['avgpool5'] = _avgpool(graph['conv5_4'])

    return graph

================================================
FILE: 2017/data/arvix_abstracts.txt
================================================
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired.
Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU) Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
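The idea above of combining max and average pooling can be sketched in a few lines. This is a minimal illustration under an assumption the abstract does not spell out: that the combination is a convex mix, alpha * max + (1 - alpha) * average, where alpha is a hypothetical mixing weight in [0, 1] that would be learned by gradient descent in a real network. Here alpha is simply passed in, and the function names are illustrative.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    alpha:  assumed mixing weight in [0, 1]; 1.0 gives pure max pooling,
            0.0 gives pure average pooling.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg


def mixed_pool_2d(fmap, alpha, size=2):
    """Apply mixed_pool over non-overlapping size x size windows."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(mixed_pool(region, alpha))
        out.append(out_row)
    return out


if __name__ == "__main__":
    fmap = [[1.0, 2.0, 0.0, 0.0],
            [3.0, 6.0, 0.0, 4.0]]
    print(mixed_pool_2d(fmap, alpha=0.5))
```

Because the output is a differentiable function of alpha, a training framework could treat alpha as one extra parameter per layer, which matches the abstract's remark that the added parameter count is very modest.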
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
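The stochastic pooling procedure described above can be sketched as follows. This is a minimal pure-Python illustration, assuming non-negative activations (for example after a rectifier) so that the activities can be normalized into multinomial probabilities. The function names and the handling of an all-zero region are assumptions made for this sketch, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. p_i = a_i / sum(a). Assumes all activations are non-negative.
    An all-zero region returns 0.0 (an assumption of this sketch).
    """
    total = sum(region)
    if total == 0:
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_2d(fmap, size=2, rng=random):
    """Apply stochastic_pool over non-overlapping size x size windows."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool(region, rng))
        out.append(out_row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for repeatable sampling
    fmap = [[0, 0, 1, 2],
            [0, 0, 3, 4],
            [5, 0, 1, 1],
            [0, 0, 1, 1]]
    print(stochastic_pool_2d(fmap, size=2, rng=rng))
```

Large activations are picked most often, so the layer behaves like max pooling on average while still injecting noise during training, which is where the regularizing effect comes from.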
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of 
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called the $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
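The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the projections from the layer below have already been computed and are passed in as a plain list; the function name `lp_unit` and the sample values are hypothetical and are not taken from the paper. With p = 1 the normalized norm is the average of absolute values, with p = 2 it is the root-mean-square, and as p grows it approaches the maximum absolute value.

```python
import math


def lp_unit(projections, p):
    """Normalized L_p norm of the projections: (mean(|z_i|^p))^(1/p).

    `projections` is assumed to hold the already-computed linear
    projections of the units in the layer below (an illustrative
    simplification of the unit described in the abstract).
    """
    if p < 1:
        raise ValueError("order p must be at least 1")
    if not projections:
        raise ValueError("projections must be non-empty")
    mean_power = sum(abs(z) ** p for z in projections) / len(projections)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    z = [1.0, -2.0, 3.0]  # hypothetical projection values
    for order in (1, 2, 200):
        print(order, lp_unit(z, order))
```

Because the norm is divided by the number of inputs before taking the root, the result never exceeds the largest absolute input, which is why large orders behave like max pooling.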
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
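The dropout-based gradient descent idea in the preceding abstract can be illustrated with a minimal sketch. It assumes the simplest convex case, squared loss on a linear model, with dropout applied to the input coordinates; the function name `dropout_sgd_linear` and all of its parameters are hypothetical and are not taken from the paper. For this particular model, the expected dropout loss equals the ordinary squared loss plus a penalty of ((1 - p) / p) * sum_i w_i^2 x_i^2, which is one way to see the regularizing effect.

```python
import random


def dropout_sgd_linear(xs, ys, keep_prob=0.5, lr=0.1, epochs=500, seed=0):
    """SGD on squared loss for a linear model with dropout on the inputs.

    Illustrative sketch only: each input coordinate is kept with probability
    keep_prob and rescaled by 1 / keep_prob (inverted dropout), so the
    prediction is unbiased with respect to the dropout mask.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Sample a fresh mask for every example at every step.
            xd = [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]
            err = sum(wi * xi for wi, xi in zip(w, xd)) - y
            # Gradient of 0.5 * err**2 with respect to w is err * xd.
            w = [wi - lr * err * xi for wi, xi in zip(w, xd)]
    return w


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ys = [2.0, -1.0, 1.0]
    print("no dropout:  ", dropout_sgd_linear(xs, ys, keep_prob=1.0))
    print("with dropout:", dropout_sgd_linear(xs, ys, keep_prob=0.5))
```

With `keep_prob=1.0` the routine reduces to plain SGD and recovers the exact weights on this noise-free toy data; with `keep_prob=0.5` the weights keep fluctuating from step to step because a fresh mask is sampled each time.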
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
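The input-to-hidden variant described in the preceding abstract can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes a plain tanh RNN, training-time mini-batch statistics only (no running averages), and scalar scale and shift parameters; `batch_norm`, `rnn_step`, `gamma`, `beta` and `eps` are hypothetical names.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature column of a mini-batch, then scale and shift."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma * (col[i] - mean) * inv_std + beta
    return out


def matmul(a, b):
    """Multiply an (n x k) list-of-lists matrix by a (k x m) one."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rnn_step(x_batch, h_batch, w_x, w_h):
    """One tanh RNN step with batch normalization on the input-to-hidden term only."""
    xw = batch_norm(matmul(x_batch, w_x))  # normalized input-to-hidden term
    hw = matmul(h_batch, w_h)              # hidden-to-hidden term left as-is
    return [[math.tanh(a + b) for a, b in zip(r1, r2)] for r1, r2 in zip(xw, hw)]


if __name__ == "__main__":
    x = [[1.0], [3.0]]                # batch of 2 examples, 1 input feature
    h = [[0.0, 0.0], [0.0, 0.0]]      # previous hidden state, 2 hidden units
    w_x = [[1.0, 2.0]]
    w_h = [[0.5, 0.0], [0.0, 0.5]]
    print(rnn_step(x, h, w_x, w_h))
```

Normalizing only `xw` leaves the recurrent path untouched, which matches the variant the abstract reports as converging faster.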
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a WER of 3.47 on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
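The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common definition: a value is rounded up to the next fixed-point grid point with probability equal to its fractional distance from the lower grid point, which makes the rounding unbiased in expectation. The function names, the 8 fractional bits, and the saturation step are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a grid with step 2**-frac_bits, unbiased in expectation.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, and down otherwise.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits range.

    The word/fraction split is a hypothetical choice for illustration.
    """
    scale = 1 << frac_bits
    max_val = ((1 << (word_bits - 1)) - 1) / scale
    min_val = -(1 << (word_bits - 1)) / scale
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

Averaging many calls such as `stochastic_round(0.3, frac_bits=2)` should approach 0.3, whereas round-to-nearest on the same grid would always return 0.25.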
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
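The first direction in the pooling abstract above, combining max and average pooling, can be sketched as a convex blend of the two. This is only an illustration of one plausible reading: the mixing weight `mix` stands in for a parameter that would be learned during training, and the function names and the non-overlapping window layout are assumptions made here, not details from the paper.

```python
def mixed_pool(region, mix):
    """Blend max and average pooling over one pooling region.

    mix = 1.0 reduces to max pooling and mix = 0.0 to average pooling.
    In a trained network, mix would be a learned parameter (assumption).
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return mix * max_val + (1.0 - mix) * avg_val


def mixed_pool_2d(feature_map, size, mix):
    """Apply mixed_pool over non-overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Flatten the current window into a single list of activations.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(mixed_pool(window, mix))
        pooled.append(pooled_row)
    return pooled
```

For the window `[1, 2, 3, 6]`, the maximum is 6 and the average is 3, so `mix = 0.5` gives 4.5.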
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
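The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region. The sketch assumes non-negative activations (for example, after a rectifier) and normalizes them into sampling probabilities; returning 0.0 for an all-zero region and the function name are choices made here for illustration. Only the training-time sampling step is shown.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. from a multinomial distribution over the region's activities.
    Activations are assumed to be non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total <= 0:
        return 0.0  # all-zero region: nothing to sample (assumed convention)
    threshold = rng.random() * total
    cumulative = 0.0
    for a in region:
        cumulative += a
        if threshold < cumulative:
            return a
    return region[-1]  # guards against floating-point round-off
```

With the region `[0, 1, 3]`, the value 3 should be picked about three times as often as the value 1, and 0 should never be picked.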
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing a power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This not only helps us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggests that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
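The Fisher-information statement in the CIF abstract above can be made concrete with the standard scalar form of the Cramér-Rao bound. This is the textbook statement, not a formula taken from the paper itself; the symbols $\theta$, $\hat{\theta}$ and $I(\theta)$ are introduced here for illustration only.

```latex
% Fisher information of a scalar parameter \theta under the model p(x;\theta)
I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\!\left[\left(\frac{\partial}{\partial\theta}\log p(x;\theta)\right)^{2}\right]

% Cramer-Rao bound for any unbiased estimator \hat{\theta} of \theta
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)}
```

Read this way, a parameter with large Fisher information admits a low-variance (confident) estimate, which is the sense in which the abstract ties confidence to Fisher information.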
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
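To illustrate what "randomly dropping a few nodes of a one-hidden-layer neural network" can look like in practice, here is a minimal NumPy sketch. It assumes the common "inverted dropout" convention (survivors are rescaled by 1/keep_prob) and a ReLU hidden layer; neither choice is stated in the abstract, and the function names are hypothetical.

```python
import numpy as np


def dropout_forward(h, keep_prob, rng):
    """Zero each hidden unit with probability 1 - keep_prob.

    Assumption: inverted dropout, i.e. surviving units are divided by
    keep_prob so the expected activation is unchanged.
    """
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob, mask


def one_hidden_layer(x, W1, W2, keep_prob, rng):
    """Forward pass of a one-hidden-layer network with dropout on the hidden units."""
    h = np.maximum(0.0, x @ W1)  # ReLU hidden layer (an illustrative choice)
    h_drop, _ = dropout_forward(h, keep_prob, rng)
    return h_drop @ W2


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))
y = one_hidden_layer(x, W1, W2, keep_prob=0.8, rng=rng)
```

At test time one would call the same forward pass with keep_prob=1.0, which leaves the hidden activations untouched.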
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
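One layer of the multilayer bootstrap network described above can be sketched as follows. The abstract only states that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; the one-hot "nearest center" encoding, the squared Euclidean distance, and the name bootstrap_layer are assumptions made for this illustration.

```python
import numpy as np


def bootstrap_layer(X, n_clusterings, k, rng):
    """Encode X with a group of independent k-centers clusterings.

    Each clustering draws k data points at random as its centers and maps
    every input to a one-hot indicator of its nearest center (assumed
    encoding). The indicators of all clusterings are concatenated.
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        codes.append(np.eye(k)[nearest])
    return np.hstack(codes)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
H = bootstrap_layer(X, n_clusterings=4, k=5, rng=rng)  # shape (20, 20)
```

Stacking several such layers, each fed the sparse output of the previous one, gives the multilayer structure; how k is varied across layers is not specified in the abstract and is left out here.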
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
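The distinction between input-to-hidden and hidden-to-hidden normalization in the abstract above can be sketched with a single vanilla RNN step. This is an illustrative NumPy sketch, not the authors' implementation: it assumes a tanh unit, statistics computed over the mini-batch at each time step, and hypothetical parameter names (W_x, W_h, gamma, beta).

```python
import numpy as np


def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature using mini-batch statistics, then scale and shift."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta


def rnn_step(x_t, h_prev, W_x, W_h, gamma, beta):
    """One RNN step with batch normalization on the input-to-hidden term only.

    The hidden-to-hidden term h_prev @ W_h is left unnormalized.
    """
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h)


rng = np.random.default_rng(0)
batch, n_in, n_hid = 32, 3, 5
x_t = rng.normal(size=(batch, n_in))
h_prev = np.zeros((batch, n_hid))
W_x = rng.normal(size=(n_in, n_hid))
W_h = rng.normal(size=(n_hid, n_hid))
h_t = rnn_step(x_t, h_prev, W_x, W_h, np.ones(n_hid), np.zeros(n_hid))
```

Here beta doubles as the bias of the recurrent unit, so no separate bias term is added.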
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
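The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common definition: a value is rounded to one of its two neighbouring fixed-point grid points, and the probability of rounding up equals the fractional remainder, so the result is unbiased in expectation. The function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and the saturation to a fixed word length that a real 16-bit format needs is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, and down otherwise, so the
    expected value of the result equals x.
    """
    scale = 1 << frac_bits            # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)        # nearest grid point at or below x
    prob_up = scaled - lower          # remainder in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; the average should land close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates below half a grid step would always be lost; the randomized version keeps them on average.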
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
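The first pooling direction described above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes a single pooling region given as a list of numbers and a fixed mixing coefficient `alpha`; the function name `mixed_pool` and the fixed (rather than learned) coefficient are illustrative assumptions, not the abstract's actual method.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 reduces to max pooling, alpha = 0.0 to average pooling.
    In a real network alpha would be a learned parameter; here it is
    passed in by hand (an assumption made for illustration).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return alpha * max_val + (1.0 - alpha) * avg_val


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

Because the output is a convex combination, it always lies between the average and the maximum of the region.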
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
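The stochastic pooling procedure described above can be sketched for a single pooling region as follows. This is a minimal illustration that assumes non-negative activations (for example, after a rectifier) and assumes the multinomial probabilities are obtained by dividing each activation by the region's sum; the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value.
    Assumes all activations are non-negative; an all-zero region yields 0.0.
    """
    total = sum(region)
    if total <= 0:
        return 0.0
    # Normalise the activities into multinomial probabilities (assumption).
    probs = [a / total for a in region]
    # Draw a single index according to those probabilities.
    idx = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[idx]


if __name__ == "__main__":
    random.seed(0)
    region = [0.5, 0.0, 2.0, 1.5]
    print([stochastic_pool(region) for _ in range(5)])
```

Larger activations are selected more often, but smaller non-zero ones still get picked occasionally, which is where the regularizing effect is meant to come from.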
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
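The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked with a small sketch. It assumes the normalized norm takes the form $\left(\frac{1}{N}\sum_{i=1}^{N}|z_i|^p\right)^{1/p}$ over already-projected inputs $z_i$; the function name `lp_unit` and the omission of the learned projections are simplifying assumptions for illustration.

```python
def lp_unit(z, p):
    """Normalized L_p norm of the inputs z for order p >= 1.

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    The learned projections mentioned in the text are left out here.
    """
    if p < 1:
        raise ValueError("order p is assumed to be at least 1")
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [1.0, -2.0, 3.0]
    for p in (1, 2, 100):
        print(p, lp_unit(z, p))
```

Varying `p` per unit is what lets different units sit anywhere between average-like and max-like behaviour.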
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
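As a rough illustration of the dropout technique discussed in the abstract above, the following sketch applies a random mask to the activations of one hidden layer. The helper name `dropout`, the keep probability, and the "inverted" rescaling by 1 / keep_prob are assumptions made for illustration; the abstract does not specify any of them.

```python
import random


def dropout(activations, keep_prob=0.5, rng=None):
    """Zero each activation with probability 1 - keep_prob (hypothetical helper).

    Surviving units are rescaled by 1 / keep_prob ("inverted" dropout, an
    assumption) so that the expected value of each unit is unchanged.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = rng or random.Random()
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


hidden = [0.2, 1.5, 0.7, 3.0]
dropped = dropout(hidden, keep_prob=0.5, rng=random.Random(0))
print(dropped)
```

With `keep_prob=1.0` the function returns the activations unchanged, which is the usual behaviour at test time under this convention.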
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
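One layer of the multilayer bootstrap network described in the abstract above might be sketched as follows. The sketch assumes squared Euclidean distance and a one-hot nearest-center encoding, neither of which is stated in the abstract; the names `nearest_center`, `bootstrap_layer`, `num_clusterings`, and `k` are hypothetical.

```python
import random


def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))


def bootstrap_layer(data, num_clusterings, k, rng):
    """Encode each point with several independent k-centers clusterings.

    The centers of each clustering are k data points sampled at random.
    A point's code concatenates one one-hot block per clustering, marking
    the nearest center (the encoding is an assumption of this sketch).
    """
    clusterings = [rng.sample(data, k) for _ in range(num_clusterings)]
    codes = []
    for point in data:
        code = []
        for centers in clusterings:
            one_hot = [0] * k
            one_hot[nearest_center(point, centers)] = 1
            code.extend(one_hot)
        codes.append(code)
    return codes


data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.5]]
codes = bootstrap_layer(data, num_clusterings=3, k=2, rng=random.Random(0))
for row in codes:
    print(row)
```

Stacking would feed `codes` into another call of `bootstrap_layer`, typically with a smaller `k`; that schedule is likewise an assumption here.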
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
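The mini-batch standardization at the heart of batch normalization, as mentioned in the abstract above, can be sketched for a single feature as follows. This is a minimal illustration, not the paper's RNN setup: the function name `batch_norm`, the stabilizing constant `eps`, and the default values of the scale `gamma` and shift `beta` are assumptions.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature across a mini-batch, then scale and shift.

    `values` holds that feature's value for each example in the batch.
    `gamma`, `beta`, and `eps` are illustrative defaults (assumptions).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)
```

In the recurrent setting, `values` would plausibly be the input-to-hidden pre-activations of one hidden unit across the mini-batch at a given time step.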
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
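The geometric sampling idea in the Hessian-free abstract above (use a small amount of data early and grow it by a constant factor per outer iteration) can be sketched as a simple schedule. The function name geometric_sample_sizes, the starting size and the growth factor are assumptions for illustration; the abstract does not give the actual values or the rule used to pick them.

```python
def geometric_sample_sizes(initial, growth, total):
    """Return per-iteration sample sizes that grow geometrically up to total."""
    if initial < 1 or growth <= 1.0:
        raise ValueError("need initial >= 1 and growth > 1")
    sizes = []
    size = float(initial)
    while True:
        current = min(int(round(size)), total)  # never exceed the dataset size
        sizes.append(current)
        if current >= total:
            break
        size *= growth
    return sizes


# Hypothetical values: start with 1000 examples and double each iteration.
schedule = geometric_sample_sizes(initial=1000, growth=2.0, total=10000)
print(schedule)  # [1000, 2000, 4000, 8000, 10000]
```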
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
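Stochastic rounding, which the low-precision training abstract above identifies as crucial, rounds a value up or down to the nearest representable grid point with probability proportional to its proximity, so the rounding is unbiased in expectation. The sketch below is a minimal illustration; the signed fixed-point range used for saturation and the names stochastic_round and to_fixed_point are assumptions, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the expected result equals x.
    """
    scale = 2 ** frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, int_bits, frac_bits, rng):
    """Stochastically round, then saturate to an assumed signed range."""
    hi = 2 ** (int_bits - 1) - 2 ** (-frac_bits)
    lo = -(2 ** (int_bits - 1))
    return max(lo, min(hi, stochastic_round(x, frac_bits, rng)))


rng = random.Random(0)
# With 2 fractional bits the grid spacing is 0.25, so 0.3 becomes 0.25 or 0.5.
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Averaged over many draws, mean stays close to 0.3, whereas round-to-nearest would always return 0.25 and silently discard small updates.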
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
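The claim in the last abstract above, that composition lets deep piecewise linear networks reuse computation exponentially often in depth, can be illustrated with a standard toy construction: a two-unit ReLU layer that folds the interval [0, 1] onto itself. Stacking this layer doubles the number of linear pieces each time. The tent-shaped layer and the function names are illustrative assumptions, not the construction analysed in the paper.

```python
def relu(z):
    return max(0.0, z)


def tent_layer(x):
    """Two ReLU units forming a tent map that folds [0, 1] onto itself."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_net(x, depth):
    """Apply the folding layer depth times."""
    for _ in range(depth):
        x = tent_layer(x)
    return x


def count_linear_pieces(depth):
    """Count linear pieces on [0, 1] via slope changes on a dyadic grid.

    The grid is fine enough that every breakpoint lies on a grid point,
    and all values are dyadic rationals, so the arithmetic is exact.
    """
    steps = 2 ** (depth + 2)
    ys = [deep_net(i / steps, depth) for i in range(steps + 1)]
    slopes = [(ys[i + 1] - ys[i]) * steps for i in range(steps)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)


pieces = [count_linear_pieces(d) for d in range(1, 6)]
print(pieces)  # [2, 4, 8, 16, 32]
```

With two units per layer, depth d uses 2d ReLU units and yields 2**d pieces, whereas a single hidden layer with n ReLU units on a one-dimensional input can produce at most n + 1 pieces.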
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
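As an illustration of direction (1) in the pooling abstract above, the following is a minimal sketch of blending max and average pooling over one region. It assumes a single scalar mixing proportion `alpha` in [0, 1]; the abstract only says the combination is learned, so the function names, the scalar form of `alpha`, and the 1-D non-overlapping windows are assumptions made here for illustration, not the authors' implementation.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 reduces to max pooling and alpha = 0.0 to average pooling.
    In a trained network alpha would be a learned parameter; here it is
    passed in as a plain number (an assumption of this sketch).
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average


def pool_1d(signal, size, alpha):
    """Apply mixed pooling over non-overlapping windows of a 1-D signal."""
    return [
        mixed_pool(signal[start:start + size], alpha)
        for start in range(0, len(signal) - size + 1, size)
    ]


if __name__ == "__main__":
    print(pool_1d([1.0, 3.0, 2.0, 6.0], size=2, alpha=0.5))
```

Keeping `alpha` explicit makes the two conventional operators the end points of one family, which is the property the abstract relies on when it speaks of combining max and average pooling.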
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
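The stochastic pooling procedure in the abstract above can be sketched as follows: each activation in a pooling region is selected with probability proportional to its value. This sketch assumes non-negative activations (for example, after a rectifier) and returns 0.0 for an all-zero region; both choices, and all function names, are assumptions of this illustration rather than details stated in the abstract.

```python
import random


def pooling_probabilities(region):
    """Normalize non-negative activations into multinomial probabilities."""
    if any(a < 0 for a in region):
        raise ValueError("this sketch assumes non-negative activations")
    total = sum(region)
    if total == 0:
        return None  # no activation to sample from
    return [a / total for a in region]


def stochastic_pool(region, rng):
    """Sample one activation from the region, weighted by its own value."""
    probs = pooling_probabilities(region)
    if probs is None:
        return 0.0  # assumption: an all-zero region pools to zero
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs draw the same samples
    region = [0.0, 1.0, 3.0]
    print(pooling_probabilities(region))
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Because the sampling weights come directly from the activations, the procedure introduces no extra hyper-parameter, which matches the abstract's description of the approach as hyper-parameter free.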
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
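The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the simplest reading of a "normalized $L_p$ norm", namely the p-th root of the mean of the p-th powers of the absolute inputs. The learned projections and the learned order are left out, and the name `lp_pool` is hypothetical rather than taken from the paper.

```python
import math


def lp_pool(values, p):
    """Normalized L_p norm: (mean(|v|^p))^(1/p), assuming p >= 1."""
    if not values:
        raise ValueError("values must be non-empty")
    if p < 1:
        raise ValueError("p must be >= 1")
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return 0.0
    # Factor out the largest magnitude so that large p does not overflow.
    mean_pow = sum((abs(v) / scale) ** p for v in values) / len(values)
    return scale * mean_pow ** (1.0 / p)


signals = [3.0, -4.0]
mean_abs = lp_pool(signals, 1)    # average of the absolute values
rms = lp_pool(signals, 2)         # root-mean-square
near_max = lp_pool(signals, 200)  # approaches max(|v|) as p grows
```

Under this reading, p = 1 gives average pooling of the magnitudes, p = 2 gives root-mean-square pooling, and large p approaches max pooling, which is the sense in which one unit covers all three operators.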
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each individual training point while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
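The input-to-hidden variant described in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses plain Python lists so that it runs without dependencies, it omits the learned scale and shift parameters and the running statistics that batch normalization normally keeps for test time, and the names batch_norm, matmul and rnn_step, the layer sizes and the random inputs are all assumptions made for the example.

```python
import math
import random


def batch_norm(rows, eps=1e-5):
    """Standardize each feature (column) with mini-batch mean and variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in rows
    ]


def matmul(a, b):
    """Multiply a (batch x n) list-of-lists by b (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rnn_step(x_t, h_prev, w_x, w_h):
    """One tanh RNN step; only the input-to-hidden term is batch normalized."""
    input_term = batch_norm(matmul(x_t, w_x))  # normalized over the batch
    recurrent_term = matmul(h_prev, w_h)       # hidden-to-hidden left as is
    return [
        [math.tanh(i + r) for i, r in zip(row_i, row_r)]
        for row_i, row_r in zip(input_term, recurrent_term)
    ]


# Hypothetical sizes and random weights, chosen only for illustration.
rng = random.Random(0)
batch, n_in, n_hid = 6, 3, 2
w_x = [[rng.gauss(0, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
w_h = [[rng.gauss(0, 0.5) for _ in range(n_hid)] for _ in range(n_hid)]
h = [[0.0] * n_hid for _ in range(batch)]
for _ in range(4):  # unroll four time steps on random inputs
    x_t = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(batch)]
    h = rnn_step(x_t, h, w_x, w_h)
```

Normalizing only the input term keeps the recurrent path untouched, which matches the abstract's finding that normalizing the hidden-to-hidden transition did not help.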
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
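Stochastic rounding to a fixed-point grid, as referred to in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the hardware scheme of that work: the function name stochastic_round_fixed, the default split of a 16-bit word into 8 fractional bits, and the saturation of out-of-range values are all choices made for the example.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its distance above the
    lower grid point (measured in grid steps), so the result equals x in
    expectation whenever x lies inside the representable range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    if rng.random() < scaled - q:  # round up with probability = remainder
        q += 1
    # Saturate to the signed range of the word (an assumption of this sketch).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = max(min_int, min(max_int, q))
    return q / scale


# With 2 fractional bits the grid step is 0.25, so 0.3 lands on 0.25 or 0.5.
demo_rng = random.Random(0)
samples = [stochastic_round_fixed(0.3, frac_bits=2, rng=demo_rng) for _ in range(20000)]
average = sum(samples) / len(samples)  # close to 0.3, unlike round-to-nearest
```

Round-to-nearest would map 0.3 to 0.25 every time, a systematic bias. The randomized rule removes that bias on average, which is one plausible reading of why the rounding scheme matters so much for training at low precision.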
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and denoising auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that, except for a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
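As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the following sketch mixes the two with a single proportion `alpha`. It assumes a simple convex combination over non-overlapping square regions. The abstract does not give the exact parametrization, and in the described method the proportion would be learned rather than passed in by hand, so `mixed_pool_2d` and `alpha` are hypothetical names used for illustration only.

```python
import numpy as np

def mixed_pool_2d(x, size, alpha):
    """Blend max and average pooling over non-overlapping size x size regions.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives average pooling.
    Rows and columns that do not fill a whole region are dropped.
    """
    h, w = x.shape
    trimmed = x[:h - h % size, :w - w % size]
    # blocks[i, a, j, b] == trimmed[i * size + a, j * size + b]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    max_pool = blocks.max(axis=(1, 3))
    avg_pool = blocks.mean(axis=(1, 3))
    return alpha * max_pool + (1.0 - alpha) * avg_pool

x = np.arange(16.0).reshape(4, 4)
print(mixed_pool_2d(x, size=2, alpha=0.5))
```

Making `alpha` a trainable parameter (per layer, per channel, or per region) is one way such a mix could be learned by gradient descent, since the output is differentiable in `alpha`.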
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
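The stochastic pooling procedure in the abstract above can be sketched in a few lines of NumPy. This is a minimal training-time illustration under stated assumptions, not the authors' implementation: activations are assumed non-negative (e.g. after a rectifier), regions are non-overlapping squares, and an all-zero region is assumed to pool to zero. The function names are hypothetical.

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Pick one activation from a region, with probability proportional to its value."""
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total == 0.0:
        return 0.0  # assumed convention for an all-zero region
    probs = a / total  # multinomial distribution given by the activities
    idx = rng.choice(a.size, p=probs)
    return a[idx]

def stochastic_pool_2d(x, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    h, w = x.shape
    out = np.empty((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            region = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = stochastic_pool_region(region, rng)
    return out

rng = np.random.default_rng(0)
x = np.array([[0.0, 0.0, 1.0, 2.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 5.0, 0.0]])
print(stochastic_pool_2d(x, size=2, rng=rng))
```

Larger activations are more likely to be selected, but smaller non-zero ones still get through occasionally, which is where the regularizing noise comes from.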
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout and related methods such as DropConnect are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
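As a rough illustration of the input-to-hidden variant described in the batch normalization abstract above, the sketch below standardizes only the input projection of a plain tanh RNN step and leaves the recurrent term untouched. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the names batch_norm, matmul and rnn_step, the tanh nonlinearity, and the omission of a learned gain and bias are all choices made here for brevity.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) with mini-batch mean and variance."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in batch
    ]


def matmul(a, b):
    """Multiply a (n x k) by b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rnn_step(x_t, h_prev, w_x, w_h):
    """One step h_t = tanh(BN(x_t W_x) + h_prev W_h).

    Assumption: only the input-to-hidden term is normalized; the
    hidden-to-hidden term is used as-is.
    """
    in_term = batch_norm(matmul(x_t, w_x))
    rec_term = matmul(h_prev, w_h)
    return [
        [math.tanh(a + b) for a, b in zip(row_in, row_rec)]
        for row_in, row_rec in zip(in_term, rec_term)
    ]
```

Calling rnn_step with a mini-batch x_t of shape (batch, inputs) and a previous state h_prev of shape (batch, hidden) returns the next state with the same shape as h_prev. Because the statistics are computed per mini-batch, the result for one sequence depends on the other sequences in the batch.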
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
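The idea of geometrically increasing the amount of data used per iteration, mentioned in the Hessian-free abstract above, can be pictured with a small schedule generator. This is only a sketch of a geometric schedule in general: the function name geometric_sample_sizes, the starting size, and the growth factor are hypothetical and are not taken from the paper.

```python
import math


def geometric_sample_sizes(total, initial, growth, steps):
    """Return per-iteration sample sizes that grow geometrically.

    Assumption: iteration k uses ceil(initial * growth**k) examples,
    capped at the full dataset size `total`.
    """
    sizes = []
    for k in range(steps):
        n_k = min(total, math.ceil(initial * growth ** k))
        sizes.append(n_k)
    return sizes
```

For example, geometric_sample_sizes(1000, 100, 2.0, 5) doubles the sample each iteration until it reaches the full 1000 examples, so early iterations are cheap and later ones use all the data.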
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) 
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
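The dropout-based gradient descent for convex ERMs mentioned in the dropout abstract above can be illustrated with a minimal sketch. It trains a logistic regression (one instance of a GLM) with stochastic gradient descent while randomly dropping input features. The toy data, the function names, the learning rate, the drop probability and the use of inverted-dropout rescaling are all illustrative assumptions; this is not the algorithm analysed in the abstract, and it makes no claim about its privacy or stability guarantees.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_dropout_logreg(xs, ys, drop_prob=0.5, lr=0.1, epochs=200, seed=0):
    """SGD on the logistic loss, dropping each input feature with prob drop_prob."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    keep = 1.0 - drop_prob
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Inverted dropout (assumed variant): zero features at random
            # and rescale the survivors so the expected input is unchanged.
            x_drop = [xi / keep if rng.random() < keep else 0.0 for xi in x]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_drop)))
            # Gradient of the logistic loss with respect to w is (p - y) * x_drop.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_drop)]
    return w


def predict(w, x):
    # Dropout is switched off at prediction time.
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


# Hypothetical toy data: the label is 1 exactly when the first feature is positive.
xs = [[1.0, 0.2], [2.0, -0.1], [-1.0, 0.3], [-2.0, -0.2]]
ys = [1, 1, 0, 0]
w = train_dropout_logreg(xs, ys)
predictions = [predict(w, x) for x in xs]
```

Because the loss is convex in w, the only randomness in the sketch comes from the dropout masks, which is the setting the abstract's stability argument refers to.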
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
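One layer of the multilayer bootstrap network described in the abstract above can be sketched as follows: a group of independent k-centers clusterings whose centers are data points sampled at random. The abstract does not say how a clustering encodes its input, so the one-hot nearest-center indicator used here, the squared Euclidean distance, the toy data and every function name are assumptions made for illustration only.

```python
import random


def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_layer(data, num_clusterings, k, seed=0):
    """Build one layer: each clustering is k data points sampled at random."""
    rng = random.Random(seed)
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(layer, x):
    """Concatenate one-hot nearest-center indicators from every clustering."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        code.extend(1 if j == nearest else 0 for j in range(len(centers)))
    return code


# Hypothetical toy data: two well-separated groups of 2-D points.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
layer = fit_layer(data, num_clusterings=4, k=3)
codes = [encode(layer, x) for x in data]
```

Each code has num_clusterings * k entries with exactly one active entry per clustering. Stacking several such layers, each fitted on the codes produced by the previous one, would give a multilayer version under the same assumptions.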
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition task. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
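The input-to-hidden variant discussed in the batch normalization abstract above can be sketched as a single recurrent step of the form h_t = tanh(BN(W_x x_t) + W_h h_{t-1}), where only the input term is standardized with mini-batch statistics. The weight values, the toy batch and the function names are hypothetical, and the learned scale and shift parameters that batch normalization normally carries are left out to keep the sketch short.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature with the mean and variance of the mini-batch."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def matvec(W, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def rnn_step(x_batch, h_batch, W_x, W_h):
    """One step of a tanh RNN with batch norm on the input-to-hidden term only."""
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]  # left unnormalized
    return [[math.tanh(a + b) for a, b in zip(i, r)]
            for i, r in zip(in_term, rec_term)]


# Hypothetical weights and a mini-batch of three 2-D inputs.
W_x = [[0.5, -0.3], [0.8, 0.1]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
x_batch = [[1.0, 2.0], [3.0, 0.5], [-1.0, 1.5]]
h_batch = [[0.0, 0.0] for _ in x_batch]
h_next = rnn_step(x_batch, h_batch, W_x, W_h)
```

Normalizing the recurrent term as well would correspond to the hidden-to-hidden variant that the abstract reports as unhelpful.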
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
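The stochastic rounding idea in the preceding abstract can be illustrated with a minimal sketch. The function name `to_fixed_point`, the 8-bit fractional width, and the saturation behaviour are assumptions made for illustration; the abstract specifies only a 16-bit fixed-point representation with stochastic rounding.

```python
import math
import random


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    Assumed layout (hypothetical): word_bits total bits, frac_bits of which
    are fractional, so the grid step is 2**-frac_bits.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid
    # point, which makes the rounded value unbiased in expectation.
    q = lower + (1 if rng.random() < scaled - lower else 0)
    # Saturate to the range representable in word_bits signed bits.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [to_fixed_point(0.3, rng=rng) for _ in range(20000)]
    # Each sample is one of the two grid neighbours of 0.3; their mean
    # stays close to 0.3, unlike round-to-nearest, which always picks one.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Values that already lie on the grid are returned unchanged, and values outside the representable range are clamped rather than wrapped.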
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
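One way to read "combining of max and average pooling" in the preceding abstract is as a weighted blend of the two results over each pooling region. The sketch below is only an illustration under that assumption: the function name `mixed_pool_region` and the parameter `mix` are hypothetical, and in a real network `mix` would be learned by gradient descent rather than passed in.

```python
def mixed_pool_region(activations, mix):
    """Blend max and average pooling over one pooling region.

    mix is a hypothetical mixing proportion in [0, 1]: 1.0 gives pure max
    pooling, 0.0 gives pure average pooling. It is a plain float here; in a
    trained network it would be a learnable parameter.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_value = max(activations)
    avg_value = sum(activations) / len(activations)
    return mix * max_value + (1.0 - mix) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for mix in (0.0, 0.5, 1.0):
        print(mix, mixed_pool_region(region, mix))
```

Because the output is linear in `mix`, its gradient with respect to `mix` is simply the difference between the max and the average of the region, which is what would make such a parameter straightforward to train.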
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
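The stochastic pooling procedure described in the abstract above can be illustrated with a small pure-Python sketch. It assumes non-negative activations (for example, after a rectifier) and one-dimensional, non-overlapping pooling regions; the function names `stochastic_pool` and `stochastic_pool_1d` are hypothetical and are not taken from the paper.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Assumption: activations are non-negative. Each activation is chosen
    with probability proportional to its value, i.e. according to a
    multinomial distribution given by the normalized activities.
    """
    total = sum(region)
    if total == 0:
        # All activations are zero, so every choice yields zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(activations, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of length `size`."""
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations), size)
    ]
```

For instance, `stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], 2)` always returns 2.0 for the first region and 0.0 for the second, while the third region yields 3.0 with probability 0.75 and 1.0 with probability 0.25.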
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
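The normalized $L_p$ norm mentioned in the abstract above can be sketched in a few lines of Python. This is only an illustration under an assumed form: the unit output is taken to be (mean of |z_i|^p)^(1/p) over projections z_i = w_i . x - c_i. The names `project` and `lp_unit`, and the weights and offsets used, are hypothetical.

```python
def project(x, weights, offsets):
    """Compute projections z_i = w_i . x - c_i (assumed form of the unit's inputs)."""
    return [
        sum(w_k * x_k for w_k, x_k in zip(w, x)) - c
        for w, c in zip(weights, offsets)
    ]


def lp_unit(z, p):
    """Normalized L_p norm of the projections z: (mean(|z_i| ** p)) ** (1 / p)."""
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)
```

With p = 1 this reduces to the mean absolute value, with p = 2 to the root-mean-square, and as p grows the output approaches the largest absolute projection, which is the sense in which the unit generalizes average, root-mean-square and max pooling.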
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of the computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
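The stochastic rounding mentioned in the preceding abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name, the 8 fractional bits, and the saturation behaviour are assumptions chosen for illustration. A value is rounded up to the next fixed-point grid step with probability equal to its fractional distance from the step below, so the rounding is unbiased in expectation.

```python
import numpy as np


def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point using stochastic rounding.

    Hypothetical helper: frac_bits and word_bits are illustrative choices,
    not values taken from the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits                      # grid step is 1 / scale
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)                      # nearest grid point below
    prob_up = scaled - lower                      # distance to it, in [0, 1)
    # Round up with probability prob_up, otherwise keep the lower grid point.
    rounded = lower + (rng.random(scaled.shape) < prob_up)
    # Saturate to the range representable in word_bits signed bits.
    lo = -(2 ** (word_bits - 1))
    hi = 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.full(100_000, 0.3)
    q = stochastic_round_fixed_point(x, frac_bits=2, rng=rng)
    # Every entry lands on a neighbouring grid point (0.25 or 0.5),
    # while the average stays close to the original value 0.3.
    print(np.unique(q), q.mean())
```

Round-to-nearest would map every 0.3 above to 0.25 and lose the remainder each time. Stochastic rounding preserves it on average, which is one plausible reading of why the rounding scheme matters for low-precision training.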
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
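The stochastic pooling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes non-negative activations (for example after a rectifier) and assumes the multinomial probabilities are the activations normalized by their sum. The name `stochastic_pool` and its `rng` parameter are hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Assumption: activations are non-negative, and each one is picked with
    probability proportional to its value within the region.
    """
    total = sum(region)
    if total == 0:
        # All-zero region: there is nothing to sample, so return zero.
        return 0.0
    weights = [a / total for a in region]
    # Draw a single index from the multinomial defined by the weights.
    idx = rng.choices(range(len(region)), weights=weights, k=1)[0]
    return region[idx]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for repeatable draws
    region = [0.0, 1.5, 0.5, 2.0]  # one flattened 2x2 pooling region
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are selected more often, but smaller non-zero ones still get picked occasionally, which is the source of the regularizing noise.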
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
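The normalized $L_p$ norm mentioned above can be illustrated with a short sketch. It assumes the unit computes (1/N * sum_i |z_i|^p)^(1/p) over N already-projected inputs z_i, and it leaves out the learned projections and any per-unit learning of p. The function name `lp_unit` is hypothetical and not taken from the paper.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a list of projected inputs.

    Assumption: the unit is (mean of |z_i|**p) ** (1/p) with p >= 1. The
    learned projections that produce z_i are outside this sketch.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not projections:
        raise ValueError("projections must be non-empty")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [3.0, -4.0]
    # p = 1 is the mean absolute value, p = 2 the root-mean-square, and a
    # large p approaches the maximum absolute value.
    for p in (1, 2, 100):
        print(p, lp_unit(z, p))
```

Varying p moves the unit between the average-like and max-like pooling behaviors that the abstract describes.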
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
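The dropout step mentioned above, randomly dropping nodes of a one-hidden-layer network, can be sketched as follows. This is an illustration only and not the analysis or algorithm of the abstract: the ReLU activation, the "inverted dropout" rescaling by 1/keep_prob, and all names (forward_with_dropout, W1, w2, keep_prob) are assumptions made for the example.

```python
import random

def relu(v):
    return max(0.0, v)

def forward_with_dropout(x, W1, w2, keep_prob, rng):
    """One-hidden-layer forward pass with dropout on the hidden units."""
    # Hidden activations: h_j = relu(W1[j] . x)
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # Keep each hidden unit independently with probability keep_prob
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in hidden]
    # Rescale kept units so the expected output matches the no-dropout output
    dropped = [h * m / keep_prob for h, m in zip(hidden, mask)]
    output = sum(w * h for w, h in zip(w2, dropped))
    return output, mask

# Hypothetical toy network: 2 inputs, 3 hidden units, 1 output.
x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w2 = [1.0, 1.0, 1.0]

rng = random.Random(0)
full_output, full_mask = forward_with_dropout(x, W1, w2, 1.0, rng)

# Averaging many dropout passes approximates the no-dropout output.
samples = [forward_with_dropout(x, W1, w2, 0.5, rng)[0] for _ in range(20000)]
mean_output = sum(samples) / len(samples)
```

With keep_prob set to 1.0 no unit is dropped, so the function reduces to an ordinary forward pass; that makes it easy to compare the two regimes on the same weights.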
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
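A single layer of the multilayer bootstrap network described above might be sketched as follows. This is only an illustration under stated assumptions: the function name `bootstrap_layer`, the one-hot encoding of the nearest center, and the use of squared Euclidean distance are hypothetical choices that the abstract does not specify.

```python
import random


def bootstrap_layer(data, num_clusterings, k, rng):
    """Encode data with a group of independent k-centers clusterings.

    Each clustering picks k data points at random as its centers and maps
    every input to a one-hot indicator of its nearest center. The one-hot
    encoding and squared Euclidean distance are assumptions of this sketch.
    """
    codes = [[] for _ in data]
    for _ in range(num_clusterings):
        # Centers are randomly sampled data points, as in the abstract.
        centers = rng.sample(data, k)
        for code, point in zip(codes, data):
            dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
            nearest = dists.index(min(dists))
            code.extend(1 if i == nearest else 0 for i in range(k))
    return codes


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    for row in bootstrap_layer(points, num_clusterings=3, k=2, rng=random.Random(0)):
        print(row)
```

Stacking would amount to feeding the returned codes into another call of the same function; how the number of centers changes from layer to layer is left open here.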
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
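The standardization step that batch normalization performs on mini-batch statistics can be sketched in a few lines. The function name `batch_normalize`, the scalar `gamma` and `beta` shared across features, and the value of `eps` are assumptions made for illustration; in the setting of the abstract above, the input would be the input-to-hidden pre-activations of one time step.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature of a mini-batch using its own statistics.

    batch is a list of equal-length feature lists (one per example).
    gamma and beta are scalar scale and shift terms, shared across
    features here for brevity (an assumption of this sketch).
    """
    n = len(batch)
    dim = len(batch[0])
    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dim)
        ]
        for row in batch
    ]


if __name__ == "__main__":
    print(batch_normalize([[1.0, 10.0], [3.0, 30.0]]))
```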
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
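The stochastic rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with spacing 2^-frac_bits and rounds a value up with probability equal to its fractional position between the two neighbouring grid points, which makes the rounded value unbiased in expectation. The function name `stochastic_round`, the parameter `frac_bits`, and the omission of saturation to a finite word length are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    position between the two neighbouring grid points, so the expected
    result equals x. Saturation to a finite word length is omitted
    (an assumption made to keep the sketch short).
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # grid index at or below x
    prob_up = scaled - lower        # lies in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average stays close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest with the same grid would always map 0.3 to 0.25, so small updates below half a grid step would be lost every time; the randomized rule keeps them on average.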
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
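The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes the simplest possible parameterization: one scalar mixing weight `alpha` in [0, 1] blended as a convex combination. The abstract does not spell out how the combination is parameterized, so `mixed_pool`, `mixed_pool_grad_alpha` and `alpha` are hypothetical names chosen for illustration only.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    alpha:  mixing weight (assumed scalar); 1.0 gives max pooling,
            0.0 gives average pooling.
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


def mixed_pool_grad_alpha(region):
    """Derivative of mixed_pool with respect to alpha.

    Because the output is linear in alpha, the derivative is simply
    max(region) - mean(region); this is what would let alpha be
    learned by gradient descent alongside the other weights.
    """
    return max(region) - sum(region) / len(region)


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(window, 1.0))   # pure max pooling
    print(mixed_pool(window, 0.0))   # pure average pooling
    print(mixed_pool(window, 0.5))   # halfway blend
```

Under this assumption the gradient with respect to `alpha` is non-negative for every region, so whether `alpha` moves toward max or average pooling depends only on the sign of the upstream gradient.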
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
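The stochastic pooling procedure described in the abstract above can be sketched as follows. This is a minimal illustration that assumes non-negative activations (for example, after a rectifier), non-overlapping square windows, and a plain list-of-lists feature map; the function names are hypothetical, and only the training-time sampling step mentioned in the abstract is shown.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value, i.e. a multinomial distribution given by the activities
    within the region. Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # All activations are zero: nothing to sample from.
        return 0.0
    # random.choices draws one element according to the given weights.
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_2d(feature_map, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool(region, rng))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    fm = [[1.0, 2.0, 0.0, 0.0],
          [3.0, 4.0, 0.0, 5.0],
          [0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]
    print(stochastic_pool_2d(fm, size=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the sampling reproducible, which is convenient when comparing against deterministic max or average pooling.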
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
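The normalized $L_p$ norm described in the abstract above can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the form used here, (1/N * sum_i |w_i . x + b_i|^p)^(1/p), is an assumption, as are the names lp_unit, weights and biases. The example only shows how the order p moves the unit between average pooling of absolute values (p = 1), root-mean-square pooling (p = 2) and max pooling (p tending to infinity).

```python
import math


def lp_unit(x, weights, biases, p):
    """Normalized L_p norm over N linear projections of x (assumed form).

    x       : list of input values
    weights : list of N weight vectors, each the same length as x
    biases  : list of N bias terms (hypothetical; sign convention assumed)
    p       : order of the norm, p >= 1, or math.inf for the max limit
    """
    # One linear projection per weight vector.
    projections = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    if math.isinf(p):
        # Limit p -> infinity: max pooling over absolute values.
        return max(abs(z) for z in projections)
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


x = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # projections: 1, 2, -1
biases = [0.0, 0.0, 0.0]

mean_abs = lp_unit(x, weights, biases, 1)         # average of |z|
rms = lp_unit(x, weights, biases, 2)              # root-mean-square
max_abs = lp_unit(x, weights, biases, math.inf)   # max of |z|
```

With these inputs the three values increase with p, which is consistent with reading the order p as a dial between average-like and max-like pooling.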
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
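The layer construction described in the multilayer bootstrap network abstract above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names `nearest_center` and `bootstrap_layer`, the squared Euclidean distance, and the one-hot encoding of the nearest center are all assumptions made for the example; the abstract itself only states that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points.

```python
import random


def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    return dists.index(min(dists))


def bootstrap_layer(data, n_clusterings, k, rng):
    """Build one layer made of `n_clusterings` independent k-centers clusterings.

    The centers of each clustering are k data points sampled at random.
    Assumption (not stated in the abstract): a point is encoded as the
    concatenation of its one-hot nearest-center assignments.
    """
    clusterings = [rng.sample(data, k) for _ in range(n_clusterings)]
    encoded = []
    for point in data:
        code = []
        for centers in clusterings:
            one_hot = [0] * k
            one_hot[nearest_center(point, centers)] = 1
            code.extend(one_hot)
        encoded.append(code)
    return clusterings, encoded
```

Stacking several such layers, each fed with the codes of the previous one, would give a multilayer variant; how the layer sizes should shrink with depth is not specified in the abstract and is left open here.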
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
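To make the batch normalization idea in the abstract above concrete, here is a minimal pure-Python sketch, under stated assumptions, of standardizing features with mini-batch statistics and of an RNN step in which only the input-to-hidden term has been normalized. The names `batch_norm` and `rnn_step`, the `eps` value, and the tanh nonlinearity are assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift.

    batch: list of rows (one per example), each a list of floats.
    gamma, beta: per-feature scale and shift (learned in a real model).
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out


def rnn_step(x_proj_bn, h_prev, w_hh):
    """One tanh RNN step for a single example.

    x_proj_bn: input-to-hidden projection, already batch-normalized.
    The hidden-to-hidden term w_hh @ h_prev is left unnormalized.
    """
    hidden = len(h_prev)
    return [
        math.tanh(x_proj_bn[k] + sum(w_hh[k][m] * h_prev[m] for m in range(hidden)))
        for k in range(hidden)
    ]
```

In use, one would compute the input projections for the whole mini-batch, pass them through `batch_norm`, and then call `rnn_step` per example with the corresponding normalized row.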
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
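The stochastic rounding scheme mentioned in the limited-precision abstract above can be illustrated with a short sketch. It assumes the common definition in which a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded value is unbiased in expectation. The function names, the 8/8 split of the 16-bit word into integer and fractional bits, and the saturation behaviour are assumptions for this example, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with `frac_bits` fractional bits.

    x is rounded up with probability equal to its fractional distance
    from the lower grid point, so E[result] equals x.
    """
    scale = 1 << frac_bits          # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower           # lies in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


def to_fixed_point(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round, then saturate to a signed fixed-point range.

    Assumed format: `int_bits` integer bits (including sign) plus
    `frac_bits` fractional bits, i.e. 16 bits with the defaults.
    """
    hi = (1 << (int_bits - 1)) - 1.0 / (1 << frac_bits)
    lo = -float(1 << (int_bits - 1))
    return max(lo, min(hi, stochastic_round(x, frac_bits, rng)))
```

By contrast, round-to-nearest would map every update smaller than half a grid step to zero, which is one intuition for why the choice of rounding scheme can matter during training.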
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
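One way to picture the first direction in the pooling abstract above (combining max and average pooling) is a convex mix of the two. The sketch below is a hedged illustration, not the paper's method: the function name `mixed_pool2d`, the non-overlapping windows, and the fixed scalar `a` are assumptions. In a trained network the mixing weight would be a learned parameter rather than a constant.

```python
def mixed_pool2d(image, size=2, a=0.5):
    """Pool non-overlapping size x size windows of a 2D list.

    Each output is a * max(window) + (1 - a) * mean(window).
    Hypothetical helper: `a` is fixed here, whereas a learned pooling
    function would train it by gradient descent.
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            max_val = max(window)
            avg_val = sum(window) / len(window)
            pooled_row.append(a * max_val + (1 - a) * avg_val)
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    image = [[1, 2, 0, 0],
             [3, 4, 0, 8]]
    print(mixed_pool2d(image, a=1.0))  # pure max pooling
    print(mixed_pool2d(image, a=0.0))  # pure average pooling
    print(mixed_pool2d(image, a=0.5))  # equal mix of the two
```

Setting `a` to 1 or 0 recovers ordinary max or average pooling, so the mix contains both conventional operations as special cases.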
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
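The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `stochastic_pool_region` is hypothetical, activations are assumed non-negative (for example, outputs of rectifier units), and an all-zero region is assumed to return 0.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution given by the normalized activities.
    Assumes non-negative activations; returns 0.0 for an all-zero region
    (an assumption made for this sketch).
    """
    total = sum(activations)
    if total <= 0:
        return 0.0
    # Inverse-CDF sampling over the unnormalized activations.
    threshold = rng.random() * total
    cumulative = 0.0
    for value in activations:
        cumulative += value
        if threshold < cumulative:
            return value
    return activations[-1]  # guard against floating-point round-off


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]
    picks = [stochastic_pool_region(region, rng) for _ in range(20000)]
    # The larger activation should be picked about three times as often.
    print(picks.count(3.0) / len(picks))
```

Because larger activations are selected more often but not always, the output varies between forward passes during training, which is where the regularizing effect comes from.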
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
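The layer described in the multilayer bootstrap network abstract above can be illustrated with a minimal sketch. It assumes squared Euclidean distance and a one-hot encoding of the nearest center per clustering; the function names, the toy data, and the choice of four clusterings with three centers each are hypothetical and are not taken from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_bootstrap_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering is k data points sampled at random as centers."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(point, layer):
    """Concatenate one one-hot indicator per clustering (the nearest center is set to 1)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0], [9.8, 0.1], [10.0, 0.0]]
layer = fit_bootstrap_layer(data, num_clusterings=4, k=3, rng=rng)
codes = [encode(p, layer) for p in data]
```

Each point is mapped to a sparse binary code of length num_clusterings * k; stacking several such layers, with each layer trained on the codes of the previous one, would give a multilayer version under the same assumptions.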
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
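To make the input-to-hidden variant mentioned in the abstract above concrete, here is a minimal sketch of one recurrent step in which only the input term is standardized with mini-batch statistics. It assumes a plain tanh RNN and omits the learned scale and shift parameters that batch normalization normally includes; the weight values and helper names are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature with the mean and variance of the mini-batch."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1}); the hidden term is left unnormalized."""
    input_term = batch_norm([matvec(w_xh, x) for x in x_batch])
    hidden_term = [matvec(w_hh, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(i_row, h_row)]
            for i_row, h_row in zip(input_term, hidden_term)]


# Toy mini-batch: three examples, two input features, two hidden units.
x_batch = [[1.0, 2.0], [3.0, 0.5], [-1.0, 4.0]]
h_batch = [[0.0, 0.0], [0.1, -0.1], [0.2, 0.3]]
w_xh = [[0.5, -0.2], [0.1, 0.4]]
w_hh = [[0.3, 0.0], [0.0, 0.3]]
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh)
normalized_inputs = batch_norm([matvec(w_xh, x) for x in x_batch])
```

Normalizing only the input term keeps the recurrence itself untouched, which matches the variant the abstract reports as converging faster; it says nothing about generalization.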
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
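The geometric sampling idea in the Hessian-free abstract above can be sketched as a simple schedule of per-iteration sample sizes. This is only an illustration: the initial size, the growth factor, and the function name are assumptions, and the abstract does not state the values or the exact rule the authors use.

```python
import math


def geometric_schedule(total, initial, growth):
    """Sample sizes that grow by a constant factor until the full data set is used."""
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1 for the schedule to terminate")
    if initial < 1:
        raise ValueError("initial must be at least 1")
    sizes = []
    size = float(initial)
    while True:
        current = min(total, math.ceil(size))  # never exceed the data set
        sizes.append(current)
        if current >= total:
            return sizes
        size *= growth


# Hypothetical numbers: 1000 training examples, start with 100, double each time.
schedule = geometric_schedule(total=1000, initial=100, growth=2.0)
```

Early iterations then work on a small subset, and the cost per iteration rises only as the optimizer gets closer to a solution, which is the motivation the abstract gives for reducing the amount of data used.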
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
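As a rough illustration of the stochastic rounding mentioned in the abstract above, the sketch below rounds a real value onto a fixed-point grid, rounding up with probability equal to the fractional distance from the lower grid point, so the rounded value is unbiased in expectation. The function name, the `frac_bits` and `word_bits` parameters, and the saturation behaviour are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    frac_bits / word_bits are illustrative parameters (assumptions).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    clipped = max(min_int, min(max_int, lower))
    return clipped / scale
```

Values that already lie on the grid are returned unchanged, while off-grid values land on one of their two neighbours.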
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
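One plausible reading of "combining max and average pooling" in the abstract above is a convex blend of the two pooled values, with the mixing proportion either a single parameter or the output of a sigmoid gate over the pooling region. The sketch below illustrates that reading only. The function names are hypothetical, and `alpha` and `weights` are fixed by hand here, whereas the abstract describes the pooling function as learned.

```python
import math


def mixed_pool(region, alpha):
    """Blend max and average pooling; alpha=1 is pure max, alpha=0 is pure average."""
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average


def gated_pool(region, weights):
    """Choose the blend per region with a sigmoid gate (weights are hypothetical)."""
    score = sum(w * a for w, a in zip(weights, region))
    gate = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, gate)
```

With all-zero gate weights the gate sits at 0.5, so the output is halfway between the max and the average of the region.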
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
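The stochastic pooling procedure in the abstract above can be sketched as follows: within each pooling region, one activation is sampled with probability proportional to its value. This is a minimal illustration that assumes non-negative activations (for example, after a rectifier) and non-overlapping square regions. The function names and the all-zero fallback are assumptions, not taken from the paper.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a region, with probability proportional to its value."""
    total = sum(region)
    if total <= 0:
        return 0.0  # assumed fallback when every activation is zero
    threshold = rng.random() * total
    cumulative = 0.0
    for activation in region:
        cumulative += activation
        if threshold < cumulative:
            return activation
    return region[-1]  # guard against floating-point round-off


def stochastic_pool_2d(feature_map, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(stochastic_pool(region, rng))
        pooled.append(pooled_row)
    return pooled
```

Larger activations are picked more often, but smaller non-zero ones still get through occasionally, which is where the regularizing noise comes from.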
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
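The statement that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the unit's inputs have already been projected to a plain list of values `z`. The function name `lp_pool` is hypothetical, and the learned projections and learned order $p$ mentioned in the abstract are left out.

```python
def lp_pool(z, p):
    """Normalized L_p norm: ((1/N) * sum(|z_i|^p)) ** (1/p)."""
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)

z = [1.0, -2.0, 3.0, -4.0]
mean_abs = lp_pool(z, 1)     # p = 1: average of the absolute values
rms = lp_pool(z, 2)          # p = 2: root-mean-square
near_max = lp_pool(z, 100)   # large p: approaches max(|z_i|) from below
```

Under these assumptions a single formula covers all three pooling operators, and intermediate values of `p` interpolate between them.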
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
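As a rough illustration of the "simple dropout based GD method for convex ERMs" mentioned in the abstract above, the following sketch runs stochastic gradient descent on a least-squares problem while randomly dropping input features at every step. It is not the authors' algorithm: the function name dropout_gd, the inverted-dropout rescaling by keep_prob, the step size, and the synthetic data are all assumptions chosen for illustration.

```python
import random


def dropout_gd(xs, ys, keep_prob, lr, steps, rng):
    """SGD on squared loss with feature dropout (hypothetical sketch).

    Each step picks one training point, zeroes every feature independently
    with probability 1 - keep_prob, rescales the kept features by
    1 / keep_prob, and takes a gradient step on the squared error.
    """
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(steps):
        i = rng.randrange(len(xs))
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]
        x_drop = [m * x / keep_prob for m, x in zip(mask, xs[i])]
        err = sum(wj * xj for wj, xj in zip(w, x_drop)) - ys[i]
        w = [wj - lr * err * xj for wj, xj in zip(w, x_drop)]
    return w


# Synthetic noiseless data from an assumed linear model y = 2*x0 - x1.
rng = random.Random(0)
xs = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(50)]
ys = [2.0 * x0 - x1 for x0, x1 in xs]

w_plain = dropout_gd(xs, ys, keep_prob=1.0, lr=0.1, steps=3000, rng=rng)
w_drop = dropout_gd(xs, ys, keep_prob=0.8, lr=0.1, steps=3000, rng=rng)
print("no dropout:", w_plain)
print("dropout   :", w_drop)
```

With keep_prob=1.0 the procedure reduces to plain SGD and recovers the generating weights; with keep_prob below 1 the random masking perturbs every update, which is the mechanism the abstract describes as acting like a regularizer.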
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
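The layer described in the abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name bootstrap_layer, the one-hot encoding of the nearest center, the squared Euclidean distance, and the synthetic data are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bootstrap_layer(data, num_clusterings, k, rng):
    """One layer of independent k-centers clusterings (hypothetical sketch).

    Each clustering draws k data points at random as its centers. A point is
    encoded by a one-hot vector marking its nearest center, and the codes
    from all clusterings are concatenated into the layer output.
    """
    codes = [[] for _ in data]
    for _ in range(num_clusterings):
        centers = rng.sample(data, k)  # centers are sampled data points
        for i, point in enumerate(data):
            nearest = min(range(k),
                          key=lambda j: squared_distance(point, centers[j]))
            one_hot = [0] * k
            one_hot[nearest] = 1
            codes[i].extend(one_hot)
    return codes


# Synthetic 2-D data, used only to exercise the sketch.
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
features = bootstrap_layer(data, num_clusterings=3, k=4, rng=rng)
print(len(features), len(features[0]))
```

Stacking several such layers, each taking the previous layer's codes as its input, would give a multilayer version under the same assumptions.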
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
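To make the input-to-hidden variant from the abstract above concrete, here is a minimal sketch of one recurrent step in which only the input-to-hidden term is standardized with mini-batch statistics. The exact placement of the normalization, the omission of learned scale, shift and bias terms, the tanh nonlinearity, and all names and values are assumptions made for illustration, not details taken from the paper.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature across the mini-batch (no learned scale/shift)."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dim)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n
                 for d in range(dim)]
    return [[(row[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dim)] for row in batch]


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """Assumed form: h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1})."""
    input_term = batch_norm([matvec(w_xh, x) for x in x_batch])
    hidden_term = [matvec(w_hh, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(i_row, h_row)]
            for i_row, h_row in zip(input_term, hidden_term)]


# A mini-batch of three 2-D inputs and a 2-unit hidden state (toy values).
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [0.0, 0.2]]
x_batch = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh)
print(h_next)
```

The hidden-to-hidden term is left unnormalized here, matching the abstract's observation that normalizing that transition did not help training.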
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
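The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a short sketch. This is not the authors' implementation: the function name `quantize`, the split into 8 fractional bits, and the saturation behaviour are assumptions made for illustration. The idea shown is only the textbook one: a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased on average.

```python
import math
import random


def quantize(x, frac_bits=8, word_bits=16, rng=random):
    """Stochastically round x onto a signed fixed-point grid.

    Hypothetical helper: the grid step is 2**-frac_bits and the stored
    integer is saturated to a signed word of word_bits bits.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected result equals x (before saturation).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    hi = (1 << (word_bits - 1)) - 1
    lo = -(1 << (word_bits - 1))
    lower = max(lo, min(hi, lower))
    return lower / scale
```

Averaging many calls such as `quantize(0.3, frac_bits=2)` gives a value close to 0.3 even though each individual result is 0.25 or 0.5, which is the property that round-to-nearest lacks.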
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
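The claim in the preceding abstract that depth lets a piecewise linear network re-use computation exponentially often can be made concrete with a toy example. The sketch below is an illustrative assumption, not a construction taken from the paper: each layer computes the fold |2x - 1| on [0, 1] using two ReLU units, and the helper names (`fold`, `deep_fold`, `count_linear_pieces`) are hypothetical. Stacking the fold `depth` times maps many input intervals onto the same output, and the number of linear pieces doubles with every layer.

```python
def relu(z):
    return max(0.0, z)


def fold(x):
    # |2x - 1| written with two ReLU units; maps [0, 1] onto [0, 1]
    # and sends x and 1 - x to the same output.
    return relu(2.0 * x - 1.0) + relu(1.0 - 2.0 * x)


def deep_fold(x, depth):
    # Compose the same two-unit layer `depth` times.
    for _ in range(depth):
        x = fold(x)
    return x


def count_linear_pieces(depth):
    # Evaluate on a dyadic grid whose points include every breakpoint,
    # so each grid cell lies inside a single linear piece and all
    # arithmetic is exact in floating point.
    n = 2 ** (depth + 2)
    ys = [deep_fold(i / n, depth) for i in range(n + 1)]
    slopes = [(ys[i + 1] - ys[i]) * n for i in range(n)]
    # A new piece starts wherever the slope changes.
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
```

With two units per layer, `count_linear_pieces(depth)` grows as 2**depth, whereas a single hidden layer with the same total number of ReLU units on a one-dimensional input can only produce a number of pieces linear in the unit count.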
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
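The layer construction described in the multilayer bootstrap network abstract above can be illustrated with a minimal sketch. It assumes NumPy, squared Euclidean distance, and a one-hot nearest-center encoding as the layer output; the abstract specifies none of these, and the names bootstrap_layer and encode are hypothetical, not taken from the paper.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings, k, rng):
    """Build one layer: each clustering gets k centers sampled at random from X."""
    # Each clustering draws its own centers, independently of the others.
    return [X[rng.choice(len(X), size=k, replace=False)]
            for _ in range(n_clusterings)]

def encode(X, layer):
    """Concatenate one-hot nearest-center indicators from every clustering."""
    codes = []
    for centers in layer:
        # Squared Euclidean distance from every point to every center (assumed metric).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        one_hot = np.zeros_like(d)
        one_hot[np.arange(len(X)), d.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    return np.hstack(codes)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # toy data, not from the paper
layer = bootstrap_layer(X, n_clusterings=5, k=10, rng=rng)
H = encode(X, layer)
print(H.shape)  # (100, 50): 5 clusterings x 10 centers each
```

Stacking further layers would, under the same assumptions, amount to calling bootstrap_layer on H and encoding again.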
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
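The input-to-hidden variant described in the abstract above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`batch_norm`, `matmul_batch`, `rnn_step`), the tanh nonlinearity, the fixed `gamma`/`beta` values and the use of training-time batch statistics only are all assumptions made for illustration.

```python
import math


def batch_norm(rows, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize every feature (column) using mini-batch statistics."""
    batch, n_feat = len(rows), len(rows[0])
    out = [[0.0] * n_feat for _ in range(batch)]
    for j in range(n_feat):
        mean = sum(r[j] for r in rows) / batch
        var = sum((r[j] - mean) ** 2 for r in rows) / batch
        for i in range(batch):
            out[i][j] = gamma * (rows[i][j] - mean) / math.sqrt(var + eps) + beta
    return out


def matmul_batch(xs, w):
    """(batch x n_in) times (n_in x n_out) -> (batch x n_out)."""
    return [[sum(x[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for x in xs]


def rnn_step(x_t, h_prev, w_x, w_h):
    """One recurrent step: only the input-to-hidden term is normalized,
    the hidden-to-hidden term is left untouched (hypothetical layout)."""
    a_x = batch_norm(matmul_batch(x_t, w_x))   # normalized input contribution
    a_h = matmul_batch(h_prev, w_h)            # un-normalized recurrent contribution
    return [[math.tanh(u + v) for u, v in zip(ru, rv)]
            for ru, rv in zip(a_x, a_h)]


# Toy mini-batch: 3 sequences, 2 input features, 2 hidden units.
x_t = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
h_prev = [[0.0, 0.0] for _ in x_t]
w_x = [[0.5, -0.2], [0.1, 0.3]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
h_t = rnn_step(x_t, h_prev, w_x, w_h)
```

At test time a real system would normally replace the mini-batch mean and variance with running averages; that detail is omitted here.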
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
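Why the rounding scheme matters in the last abstract above can be seen in a small standard-library sketch. The 16-bit word comes from the abstract; the split into 8 fractional bits, the saturation behaviour and the helper names `stochastic_round` and `round_nearest` are assumptions chosen for illustration, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed point, rounding up with probability equal
    to the fractional remainder, so the result is unbiased in expectation
    (as long as x lies inside the representable range)."""
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    if rng.random() < scaled - q:      # remainder in [0, 1)
        q += 1
    lo = -(1 << (word_bits - 1))       # saturate to the signed word range
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, q)) / scale


def round_nearest(x, frac_bits=8):
    """Conventional round-to-nearest onto the same fixed-point grid."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


# A weight update much smaller than the grid step 1/256:
rng = random.Random(0)
update = 0.001
samples = [stochastic_round(update, rng=rng) for _ in range(20000)]
mean_stochastic = sum(samples) / len(samples)   # close to 0.001 on average
always_zero = round_nearest(update)             # the update is lost entirely
```

With round-to-nearest, every update smaller than half a grid step is discarded, whereas stochastic rounding preserves such updates on average, which is one intuition for the behaviour the abstract reports.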
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
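The first direction mentioned in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. The abstract does not give the exact form of the combination, so the convex mix used here is an assumption, and the names `mixed_pool_region`, `mixed_pool_1d` and `alpha` are hypothetical. In a learned version `alpha` would be a trainable parameter; here it is fixed for clarity.

```python
def mixed_pool_region(values, alpha):
    """Blend max and average pooling over one region.

    Assumed form (not taken from the abstract):
        alpha * max(values) + (1 - alpha) * mean(values)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not values:
        raise ValueError("pooling region must not be empty")
    mean = sum(values) / len(values)
    return alpha * max(values) + (1.0 - alpha) * mean


def mixed_pool_1d(row, size, alpha):
    """Apply mixed pooling to non-overlapping windows of a 1-D list."""
    return [
        mixed_pool_region(row[i:i + size], alpha)
        for i in range(0, len(row) - size + 1, size)
    ]


if __name__ == "__main__":
    # alpha = 1 recovers max pooling, alpha = 0 recovers average pooling.
    print(mixed_pool_1d([1.0, 3.0, 2.0, 6.0], size=2, alpha=0.5))
```

Setting `alpha` to 1 or 0 reduces the sketch to plain max or average pooling, which makes it easy to compare against the conventional operations.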
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
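The stochastic pooling procedure described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: activations are assumed non-negative (e.g. after a rectifier), the multinomial probabilities are assumed to be the activations divided by their sum, and an all-zero region is assumed to pool to zero. The function names are hypothetical.

```python
import random


def stochastic_pool_region(activations, rng):
    """Pick one activation from a region with probability proportional
    to its value (assumed normalisation: a_i / sum(a))."""
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    total = sum(activations)
    if total == 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    # random.Random.choices samples using the given relative weights.
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_1d(row, size, rng=None):
    """Apply stochastic pooling to non-overlapping windows of a 1-D list."""
    rng = rng or random.Random()
    return [
        stochastic_pool_region(row[i:i + size], rng)
        for i in range(0, len(row) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], size=2, rng=rng))
```

Because larger activations are sampled more often but not always, the pooled output varies between passes, which is the source of the regularizing effect the abstract refers to.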
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
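As a rough illustration of the normalized $L_p$ norm described in the abstract above, the following Python sketch applies it to a small group of values that are assumed to be already projected. The function name `lp_unit` and the example numbers are hypothetical; the learned projections and the learned per-unit order $p$ are left out, so this is not the authors' implementation. It only shows how the order $p$ moves the unit between average-like, root-mean-square and max-like pooling.

```python
import numpy as np

def lp_unit(z, p):
    """Normalized L_p norm of a group of projected inputs z (hypothetical helper).

    Computes (mean(|z_i|^p))^(1/p); the projections are assumed given.
    """
    z = np.abs(np.asarray(z, dtype=float))
    return np.mean(z ** p) ** (1.0 / p)

# Example group of four projected values (made up for illustration).
z = np.array([1.0, 2.0, 3.0, 4.0])

avg_abs = lp_unit(z, 1)     # p = 1: mean of absolute values
rms = lp_unit(z, 2)         # p = 2: root-mean-square
near_max = lp_unit(z, 200)  # large p: approaches max(|z_i|)

print(avg_abs, rms, near_max)
```

With a large order the result approaches the largest absolute value in the group, which is the sense in which the unit generalizes max pooling.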
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
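The link between Fisher information and the inverse variance of an unbiased estimate, mentioned in the abstract above, can be checked on the simplest binary case. The Python sketch below uses a single Bernoulli parameter, which is an assumption made for illustration and far simpler than the multivariate binary distributions the abstract considers; the function names are hypothetical. It shows that the variance of the sample mean equals the Cramér-Rao lower bound $1/(n I(\theta))$.

```python
def bernoulli_fisher_information(theta):
    """Fisher information of one Bernoulli(theta) observation: 1 / (theta * (1 - theta))."""
    return 1.0 / (theta * (1.0 - theta))

def cramer_rao_bound(theta, n):
    """Lower bound on the variance of any unbiased estimate of theta from n i.i.d. samples."""
    return 1.0 / (n * bernoulli_fisher_information(theta))

# Example values chosen for illustration only.
theta, n = 0.3, 50

bound = cramer_rao_bound(theta, n)
# The sample mean is unbiased for theta and has variance theta * (1 - theta) / n.
sample_mean_variance = theta * (1.0 - theta) / n

print(bound, sample_mean_variance)
```

A parameter with high Fisher information therefore admits a low-variance estimate, which is the notion of confidence the abstract refers to.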
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
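The standardization step that the abstract above attributes to batch normalization ("uses mini-batch statistics to standardize features") can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `batch_norm`, the `eps` constant, and the list-of-rows input format are assumptions made for illustration, and the learned scale and shift parameters used in practice are omitted.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature column of a mini-batch.

    `batch` is a list of equal-length rows (hypothetical input format).
    `eps` is an assumed small constant that avoids division by zero.
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-feature mean over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    # Per-feature (biased) variance over the mini-batch.
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    # Subtract the mean and divide by the standard deviation, feature by feature.
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


normalized = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

After the call, each column of `normalized` has approximately zero mean and unit variance, regardless of the original scale of the feature.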
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
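The stochastic rounding scheme mentioned in the abstract above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `stochastic_round` and the `frac_bits` parameter (number of fractional bits of the fixed-point grid) are hypothetical, and saturation to a 16-bit word length is left out.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the result equals x in
    expectation. `frac_bits` and `rng` are assumptions for illustration;
    clipping to a fixed word length is not modeled here.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


# Averaging many stochastically rounded copies of a value recovers it
# more closely than the grid spacing would suggest.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=8, rng=rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
```

Round-to-nearest would always map 0.3 to the same grid point, whereas the stochastic variant alternates between the two neighbouring grid points so that small updates are preserved on average.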
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
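As a rough sketch of what combining max and average pooling could look like, the code below mixes the two with a proportion alpha, and also shows a variant where alpha is computed from the pooling region itself through a sigmoid gate. The abstract does not give formulas, so both functions, their names (mixed_pool, gated_pool) and the gate_weights parameter are assumptions made for illustration; in a real network alpha or the gate weights would be learned by gradient descent rather than fixed by hand.

```python
import math


def mixed_pool(region, alpha):
    """Blend max and average pooling; alpha=1 is pure max, alpha=0 pure average."""
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg


def gated_pool(region, gate_weights):
    """Let the region choose its own mix via a sigmoid gate (assumed form)."""
    score = sum(w * x for w, x in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, alpha)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]  # a flattened 2x2 pooling window
    print(mixed_pool(region, 0.5))
    print(gated_pool(region, [0.0, 0.0, 0.0, 0.0]))
```

With all gate weights at zero the gate outputs 0.5, so the gated variant starts out as an even blend of the two conventional operations.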
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
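The stochastic pooling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: activations are assumed non-negative (for example after a ReLU) so they can be normalized into multinomial probabilities, pooling regions are non-overlapping, and an all-zero region is assumed to pool to 0. The function names are hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation with probability proportional to its value."""
    total = sum(region)
    if total == 0:
        return 0.0  # assumption: nothing to sample from an all-zero region
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_2d(fmap, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool(region, rng))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    fmap = [[0.0, 0.0, 1.0, 3.0],
            [0.0, 2.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0]]
    print(stochastic_pool_2d(fmap, size=2, rng=random.Random(0)))
```

This covers only the training-time sampling step; how the pooled value is computed at test time is not stated in the abstract and is left out here.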
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al.
(2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
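The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal Python sketch. It assumes the normalized $L_p$ norm takes the form (mean of |v|^p)^(1/p) over inputs that have already been projected; the function name lp_pool is hypothetical, and the learned projections and per-unit learned orders mentioned in the abstract are deliberately left out.

```python
def lp_pool(values, p):
    """Normalized L_p norm: (mean of |v|^p) ** (1/p).

    Assumed form for illustration only; `values` stands in for the
    projected inputs to one unit, and `p` (>= 1) is the order of the norm.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


inputs = [1.0, -2.0, 4.0]
mean_abs = lp_pool(inputs, 1)    # p = 1: average of absolute values
rms = lp_pool(inputs, 2)         # p = 2: root-mean-square
near_max = lp_pool(inputs, 200)  # large p: approaches max(|v|) from below
print(mean_abs, rms, near_max)
```

Varying p moves the same unit between average-like and max-like pooling, which is the sense in which a single parameterized unit can stand in for several conventional pooling operators.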
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased.
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
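The layer construction described in the multilayer bootstrap network abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `bootstrap_layer`, the squared Euclidean distance, and the one-hot encoding of the nearest center are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings=10, k=5, rng=None):
    """One layer: a group of independent k-centers clusterings.

    Each clustering picks k data points at random as its centers
    (as the abstract states). Encoding each sample as a one-hot
    indicator of its nearest center is an assumption made here.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are randomly sampled data points, drawn without replacement.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every sample to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), d2.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    # Concatenate the codes of all clusterings into one sparse representation.
    return np.hstack(codes)
```

Under these assumptions, the output of one call could be passed as the input `X` of another call to stack layers.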
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
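To make the distinction between input-to-hidden and hidden-to-hidden transitions concrete, here is a minimal NumPy sketch of a vanilla tanh RNN in which only the input-to-hidden term is batch-normalized. It is an illustration under assumptions, not the paper's code: the function names, the per-time-step mini-batch statistics, and the plain tanh cell are all choices made here for brevity.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_forward_bn_input(x, W_x, W_h, b, gamma, beta):
    """Forward pass of a tanh RNN; x has shape (T, B, D), output (T, B, H).

    Only the input-to-hidden term x[t] @ W_x is normalized; the
    hidden-to-hidden term h @ W_h is left untouched.
    """
    T, B, _ = x.shape
    H = W_h.shape[0]
    h = np.zeros((B, H))
    outputs = []
    for t in range(T):
        # Assumption: statistics are computed separately at each time step.
        in_term = batch_norm(x[t] @ W_x, gamma, beta)
        h = np.tanh(in_term + h @ W_h + b)
        outputs.append(h)
    return np.stack(outputs)
```

A training implementation would also need running statistics for inference and gradients through the normalization, both omitted here.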
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
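The stochastic rounding scheme mentioned in the preceding abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name `stochastic_round`, the default of 8 fractional bits, and the omission of saturation to a 16-bit range are all assumptions made for illustration. The idea is that a value is rounded up to the next fixed-point grid value with probability equal to its fractional distance from the lower grid value, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    Hypothetical helper: rounds up with probability equal to the
    fractional distance from the lower grid point, so E[result] == x.
    Saturation to a fixed word length is deliberately left out.
    """
    scale = 1 << frac_bits           # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)       # nearest grid index at or below x
    p_up = scaled - lower            # in [0, 1): chance of rounding up
    step = 1 if rng.random() < p_up else 0
    return (lower + step) / scale


if __name__ == "__main__":
    random.seed(0)
    samples = [stochastic_round(0.3) for _ in range(20000)]
    # The sample mean should sit close to 0.3 even though 0.3 is not on the grid.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to the same grid value every time and lose the residual, whereas the randomized version keeps small updates alive on average.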
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
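One simple way to picture the "combining of max and average pooling" described in the preceding abstract is a convex combination of the two. The sketch below is only an illustration under that assumption: the abstract does not give the exact form, the name `mixed_pool` is hypothetical, and the mixing weight `alpha` is a fixed argument here, whereas a learned pooling function would train it.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    Hypothetical helper: alpha = 1.0 gives plain max pooling,
    alpha = 0.0 gives plain average pooling. In a learned variant
    alpha would be a trainable parameter; here it is fixed.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


if __name__ == "__main__":
    activations = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(activations, a))
```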
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
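The stochastic pooling procedure in the preceding abstract, picking one activation per pooling region from a multinomial distribution given by the activities, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and an all-zero region is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Hypothetical helper: each activation is chosen with probability
    proportional to its value, i.e. p_i = a_i / sum(a). Assumes
    non-negative activations; an all-zero region returns 0.0.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    random.seed(0)
    draws = [stochastic_pool([1.0, 3.0]) for _ in range(10000)]
    # The larger activation should be picked roughly three times in four.
    print(draws.count(3.0) / len(draws))
```

Because larger activations are only more likely, not certain, to be selected, the procedure injects noise at the pooling stage, which is the regularizing effect the abstract refers to.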
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
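The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the unit takes N affine projections of its input and returns the p-th root of the mean of their absolute values raised to p; the function name `lp_unit`, its arguments, and the toy numbers are hypothetical and are not taken from the paper.

```python
def lp_unit(inputs, weights, biases, p):
    """Normalized L_p norm over N affine projections of `inputs`.

    Assumed form (illustrative only): (1/N * sum_i |w_i . x + b_i|^p)^(1/p).
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    # One projection per weight row: z_i = w_i . x + b_i
    projections = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


# Toy input: identity projections, so the unit pools over |1.0| and |2.0|.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

average_like = lp_unit(x, W, b, 1)    # p = 1: mean of absolute values
rms_like = lp_unit(x, W, b, 2)        # p = 2: root-mean-square
max_like = lp_unit(x, W, b, 200)      # large p: approaches the maximum
```

With p = 1 the unit averages the absolute projections, with p = 2 it gives their root-mean-square, and as p grows the result approaches the largest absolute projection, which is the sense in which the unit generalizes average, root-mean-square and max pooling.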
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
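The dropout-based gradient descent for convex ERMs mentioned in the abstract above can be illustrated with a minimal sketch, assuming logistic regression as the GLM and dropout applied to the input features with inverse-keep-probability rescaling. The function names, the masking scheme, the hyperparameters, and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dropout_sgd_logreg(xs, ys, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Logistic regression trained by SGD with dropout on the input features."""
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Drop each feature independently; rescale the survivors so the
            # masked input equals the original input in expectation.
            x_drop = [xi / keep_prob if rng.random() < keep_prob else 0.0
                      for xi in x]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_drop)))
            # Gradient step on the logistic loss of this single example.
            for j in range(dim):
                w[j] -= lr * (p - y) * x_drop[j]
    return w


def predict(w, x):
    # Dropout is not applied at prediction time.
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


# Hypothetical toy data: positives have positive features, negatives negative.
xs = [[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]]
ys = [1, 1, 0, 0]
w = dropout_sgd_logreg(xs, ys)
preds = [predict(w, x) for x in xs]
```

Because each example is perturbed by a fresh random mask, no single training point determines an update exactly, which is the intuition behind reading dropout as a stabilizer in the convex setting.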
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
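A single layer of the kind described in the multilayer bootstrap network abstract above can be sketched as follows: several independent clusterings, each of which simply samples k data points as its centers. The Euclidean distance, the one-hot nearest-center encoding, the helper names, and the toy data are assumptions made for illustration; the abstract itself does not specify them.

```python
import random


def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_layer(data, num_clusterings, k, rng):
    """Each clustering is just k data points sampled at random as centers."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(layer, x):
    """Concatenate one one-hot indicator (nearest center) per clustering."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: sq_dist(x, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


rng = random.Random(0)
# Hypothetical toy data: two well-separated blobs in 2-D.
data = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(20)]
data += [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(20)]
layer = fit_layer(data, num_clusterings=4, k=3, rng=rng)
codes = [encode(layer, x) for x in data]
```

Each code has length num_clusterings * k and contains exactly one active unit per clustering. Stacking would feed these codes into a further layer built the same way.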
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
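The input-to-hidden variant described in the abstract above can be illustrated with a minimal sketch, assuming a plain tanh RNN in which only the term W_x x_t is standardized with mini-batch statistics and the recurrent term W_h h_{t-1} is left untouched. The learned scale and shift parameters of batch normalization are omitted, and all names, shapes, and random inputs are assumptions made for illustration.

```python
import math
import random


def matvec(W, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        for i in range(n):
            out[i][j] = (col[i] - mean) / math.sqrt(var + eps)
    return out


def rnn_step(W_x, W_h, x_batch, h_batch):
    # Normalize only the input-to-hidden term; the hidden-to-hidden
    # term W_h h_{t-1} is left unnormalized.
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(ia, ra)]
            for ia, ra in zip(in_term, rec_term)]


rng = random.Random(0)
batch, in_dim, hid_dim, steps = 8, 3, 4, 5
W_x = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hid_dim)]
W_h = [[rng.gauss(0, 0.1) for _ in range(hid_dim)] for _ in range(hid_dim)]
h = [[0.0] * hid_dim for _ in range(batch)]
for _ in range(steps):
    # Inputs with a large mean and variance; normalization removes both.
    x_t = [[rng.gauss(3, 2) for _ in range(in_dim)] for _ in range(batch)]
    h = rnn_step(W_x, W_h, x_t, h)
```

Statistics here are computed separately at every time step from the current mini-batch. How to share or track them across time steps and at test time is a design choice this sketch does not address.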
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
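The stochastic rounding idea in the low-precision abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of 8 fractional bits and the test value are assumptions made for illustration, and overflow (saturation) handling is omitted. A value is rounded up to the next fixed-point grid point with probability equal to its fractional remainder, so the rounding is unbiased on average, whereas round-to-nearest silently drops updates smaller than half a grid step.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with resolution 2**-frac_bits.

    The value is rounded up with probability equal to its distance
    from the lower grid point (in grid units), so the expected value
    of the result equals x. frac_bits is a hypothetical setting.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # The fractional remainder in [0, 1) is the round-up probability.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest on the same fixed-point grid."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.001  # smaller than half a grid step (1/512) at 8 fractional bits
    samples = [stochastic_round(x, 8, rng) for _ in range(20000)]
    print(round_to_nearest(x))          # 0.0: the small value is lost
    print(sum(samples) / len(samples))  # close to 0.001 on average
```

Each individual result is still one of the two neighbouring grid points; only the average over many roundings recovers the small value, which is one plausible reading of why the rounding scheme matters for accumulating small gradient updates.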
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
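The idea of combining max and average pooling, mentioned in the pooling abstract above, can be illustrated with a small sketch. This is only one plausible reading of "combining": a convex combination of the two operators over a single pooling region. The function name `mixed_pool_region` and the parameter `mix` are hypothetical, and `mix` is a fixed argument here, whereas the abstract describes learning the pooling function.

```python
def mixed_pool_region(values, mix):
    """Blend max and average pooling over one pooling region.

    mix = 1.0 gives plain max pooling, mix = 0.0 gives plain average
    pooling. In a trained network `mix` would be a learned parameter;
    here it is a fixed argument (an assumption for illustration).
    """
    if not values:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_part = max(values)
    avg_part = sum(values) / len(values)
    return mix * max_part + (1.0 - mix) * avg_part


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for mix in (0.0, 0.5, 1.0):
        print(mix, mixed_pool_region(region, mix))
```

Keeping `mix` in [0, 1] means the output always lies between the average and the maximum of the region, so the sketch degrades gracefully to either conventional operator.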
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
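The stochastic pooling procedure described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes activations are non-negative (for example, after a rectifier), that sampling probabilities are the activations normalized by their sum within the region, and that an all-zero region pools to zero. The function names are hypothetical.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value (assumption: activations are non-negative).
    """
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    if sum(activations) == 0:
        return 0.0  # all-zero region: no distribution to sample from
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_1d(row, size, rng=random):
    """Apply stochastic pooling over non-overlapping 1-D regions."""
    return [
        stochastic_pool_region(row[i:i + size], rng)
        for i in range(0, len(row) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(stochastic_pool_1d([0.0, 2.0, 1.0, 3.0, 0.0, 0.0], 2, rng))
```

Because the sampled value is always one of the region's own activations, large activations dominate on average while smaller ones are still occasionally propagated, which is the source of the regularizing noise.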
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
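To make the "normalized $L_p$ norm" in the abstract above concrete, the following sketch computes (1/N * sum_i |z_i|^p)^(1/p) over a list of projection outputs z. The exact normalization is an assumption chosen so that p = 1 gives the mean absolute value, p = 2 gives the root-mean-square, and p = infinity gives the maximum absolute value, matching the pooling operators the abstract lists. The projections themselves are taken as given, and the function name `lp_unit` is hypothetical.

```python
import math


def lp_unit(z, p):
    """Normalized L_p norm of projection outputs z (illustrative form).

    Assumed definition: (mean(|z_i| ** p)) ** (1 / p), with the
    limiting case p = inf returning max(|z_i|).
    """
    if not z:
        raise ValueError("z must not be empty")
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(abs(v) for v in z)
    mean_power = sum(abs(v) ** p for v in z) / len(z)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    z = [3.0, -4.0]
    for p in (1, 2, 50, math.inf):
        print(p, lp_unit(z, p))
```

As p grows, the output moves from the average magnitude toward the largest magnitude, which is why a single learned order p can interpolate between average-like and max-like pooling.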
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets such as MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
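The single-layer structure described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names fit_layer and encode are hypothetical, and encoding each point as a concatenation of one-hot nearest-center indicators under squared Euclidean distance is an assumption the abstract does not spell out.

```python
import random


def fit_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering's centers are k randomly sampled data points."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(point, layer):
    """Concatenate one one-hot code per clustering (assumed encoding scheme)."""
    code = []
    for centers in layer:
        # Index of the center closest to the point in this clustering.
        nearest = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
    layer = fit_layer(data, num_clusterings=3, k=2, rng=rng)
    print(encode([0.1, 0.1], layer))
```

Stacking would feed the codes of one layer to the next as its input data; how k changes from layer to layer is left out here because the abstract does not specify it.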
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
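To make the input-to-hidden variant in the batch normalization abstract above concrete, here is a minimal sketch of one recurrent step in which only the input-to-hidden term is standardized with mini-batch statistics. It assumes a plain tanh RNN, a scalar gain and shift (gamma, beta), and training-time statistics only; the function names and those choices are illustrative assumptions, not the authors' code.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) of `batch` using mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = gamma * (col[i] - mean) / math.sqrt(var + eps) + beta
    return out


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_batch, h_batch, W_x, W_h, b):
    """One tanh-RNN step: h_t = tanh(BN(W_x x_t) + W_h h_{t-1} + b)."""
    ih = batch_norm([matvec(W_x, x) for x in x_batch])  # normalized input-to-hidden term
    hh = [matvec(W_h, h) for h in h_batch]              # hidden-to-hidden term left as is
    return [
        [math.tanh(a + c + bias) for a, c, bias in zip(ih_i, hh_i, b)]
        for ih_i, hh_i in zip(ih, hh)
    ]


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0]]   # mini-batch of two inputs
    h0 = [[0.0], [0.0]]            # previous hidden states (one hidden unit)
    print(rnn_step(x, h0, W_x=[[1.0, -1.0]], W_h=[[0.5]], b=[0.0]))
```

Normalizing only the W_x x_t term keeps the recurrence itself untouched, which matches the variant the abstract reports as converging faster.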
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
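Stochastic rounding, as named in the abstract just above, rounds a value up or down to the nearest fixed-point grid point with probability proportional to its proximity, so the rounded value is unbiased in expectation. The sketch below is a minimal illustration; the function name, the split of 8 fractional bits within a 16-bit signed word, and the saturation behaviour are assumptions chosen for the example, not details taken from the abstract.

```python
import math
import random

def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding."""
    scale = 1 << frac_bits          # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so the expected result equals x (before saturation).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed range of the word.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale

rng = random.Random(0)
samples = [stochastic_round(0.3, rng=rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
```

Here `mean_estimate` stays close to 0.3 even though 0.3 is not on the grid, whereas round-to-nearest would always return the same neighbouring grid point and systematically lose small updates.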
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
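The claim in the abstract just above, that depth lets a piecewise linear network re-use computation exponentially often, can be illustrated with a toy one-dimensional example. The folding layer below, built from two ReLU units, is a standard textbook-style construction chosen for this sketch and is not claimed to be the construction used in the paper; all function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def fold(x):
    """A two-unit ReLU layer computing |2x - 1|, which folds [0, 1] onto itself."""
    return relu(2.0 * x - 1.0) + relu(1.0 - 2.0 * x)

def deep_fold(x, depth):
    """Compose the folding layer `depth` times."""
    for _ in range(depth):
        x = fold(x)
    return x

def count_linear_pieces(depth, grid_bits=10):
    """Count linear pieces of deep_fold on [0, 1] by detecting slope changes.

    The grid is dyadic, so for depth <= grid_bits every kink falls exactly
    on a grid point and all arithmetic is exact in floating point.
    """
    n = 1 << grid_bits
    ys = [deep_fold(i / n, depth) for i in range(n + 1)]
    slopes = [(ys[i + 1] - ys[i]) * n for i in range(n)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)

pieces_by_depth = [count_linear_pieces(d) for d in range(1, 6)]
```

With two units per layer, the number of linear pieces doubles with every added layer, while a single hidden layer with the same total of 2L ReLU units on a one-dimensional input can produce at most 2L + 1 pieces.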
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
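The first direction named in the pooling abstract above, combining max and average pooling, can be illustrated with a small sketch. This is only one plausible reading: the abstract does not give a formula, so the convex combination used here, the mixing weight `alpha`, and the helper names `mixed_pool` and `pool_1d` are assumptions made for illustration. In a learned variant `alpha` would be a trainable parameter; here the caller fixes it.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha is a hypothetical mixing weight in [0, 1]:
    1.0 gives pure max pooling, 0.0 gives pure average pooling.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return alpha * max_value + (1.0 - alpha) * avg_value


def pool_1d(values, size, alpha):
    """Apply mixed_pool over non-overlapping windows of length `size`.

    Trailing values that do not fill a whole window are ignored.
    """
    return [mixed_pool(values[i:i + size], alpha)
            for i in range(0, len(values) - size + 1, size)]


if __name__ == "__main__":
    signal = [1.0, 3.0, 2.0, 6.0]
    print(pool_1d(signal, size=2, alpha=0.5))
```

Setting `alpha` to 0 or 1 recovers the two conventional operations, which makes the sketch easy to check against an existing pooling layer.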
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
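The stochastic pooling procedure described in the abstract above (pick one activation per pooling region from a multinomial distribution given by the activities in that region) can be sketched in plain Python. The sketch assumes non-negative activations, for example the outputs of rectifier units, and returns 0.0 for an all-zero region; both choices, along with the function names, are assumptions rather than details stated in the abstract.

```python
import random


def pooling_probabilities(region):
    """Normalize non-negative activations into multinomial probabilities."""
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        raise ValueError("all-zero region has no defined distribution")
    return [a / total for a in region]


def stochastic_pool(region, rng=random):
    """Sample one activation from the region, weighted by its own value.

    `rng` may be the `random` module or a `random.Random` instance,
    which makes the sampling reproducible in tests.
    """
    if sum(region) == 0:
        # Assumed convention: an inactive region pools to zero.
        return 0.0
    probs = pooling_probabilities(region)
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 0.0, 3.0, 4.0]
    print(pooling_probabilities(region))
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are picked more often, but smaller non-zero ones still have a chance, which is the source of the regularizing noise. Training and test-time behavior may differ in a full implementation; this sketch covers only the sampling step.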
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and makes it difficult to find appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
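The batch normalization step mentioned in the abstract above, standardizing each feature with mini-batch statistics, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name batch_norm, the scalar gamma and beta parameters, and the eps value are assumptions chosen for illustration, and the running statistics normally kept for inference are omitted. In the RNN setting the abstract describes, a transform of this kind would be applied to the input-to-hidden pre-activations.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature of a mini-batch, then scale and shift.

    batch: list of equal-length feature vectors (one list per example).
    gamma, beta: scale and shift, shared by all features here for brevity
                 (an assumption; they are usually learned per feature).
    eps: small constant that keeps the division stable when variance is 0.
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased (divide-by-n) variance computed over the mini-batch.
        var = sum((v - mean) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) * inv_std + beta
    return out


if __name__ == "__main__":
    # Two examples with two features on very different scales.
    print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))
```

After normalization both features have roughly zero mean and unit variance across the mini-batch, regardless of their original scale.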
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
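The stochastic rounding idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme from that work: the function name stochastic_round, the split into frac_bits fractional bits of a word_bits-wide signed word, and the saturation behaviour are all hypothetical choices. The key property is that a value is rounded up with probability equal to its distance from the lower grid point, so the rounded value equals the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits: number of fractional bits (grid spacing is 2**-frac_bits).
    word_bits: total width of the signed word, used only for saturation.
    rng: any object with a random() method returning a float in [0, 1).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder, so that
    # the expected rounded value equals x (no systematic rounding bias).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable range of a signed word_bits integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    # With 2 fractional bits the grid spacing is 0.25, so 0.3 rounds to
    # either 0.25 or 0.5; the average over many draws approaches 0.3.
    draws = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    print(sum(draws) / len(draws))
```

Round-to-nearest would map 0.3 to 0.25 every time, so small updates below half the grid spacing would be lost; the randomized version preserves them on average.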
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
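The idea of learning a pooling function by combining max and average pooling can be sketched in a few lines. The exact form is not given in the abstract above, so this sketch assumes a convex combination with a mixing weight `alpha` in [0, 1], and, as a second hypothetical strategy, a mixing weight computed from the pooling region itself through a sigmoid gate. The names `mixed_pool`, `gated_pool` and `gate_weights` are illustrative, and in practice `alpha` or `gate_weights` would be trained by gradient descent rather than set by hand.

```python
import math


def mixed_pool(region, alpha):
    # Convex combination of max pooling and average pooling.
    # alpha = 1.0 recovers max pooling, alpha = 0.0 recovers average pooling.
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg


def gated_pool(region, gate_weights):
    # Assumed variant: the mixing weight depends on the region contents,
    # computed as a sigmoid of a dot product with (learnable) gate weights.
    z = sum(w * x for w, x in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, alpha)


region = [1.0, 3.0]
print(mixed_pool(region, 1.0))         # pure max pooling
print(mixed_pool(region, 0.0))         # pure average pooling
print(gated_pool(region, [0.0, 0.0]))  # zero gate weights give alpha = 0.5
```

Because both functions are differentiable in `alpha` and `gate_weights`, they can sit inside a network and be trained with the rest of the parameters, at the cost of only a few extra parameters per pooling layer.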
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
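The stochastic pooling procedure described above can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example after a rectifier) and sampling probabilities equal to the activations normalized to sum to one; the function names and the one-dimensional, non-overlapping windows are simplifications chosen for the example.

```python
import random


def stochastic_pool(region, rng=random):
    # Pick one activation from the region with probability proportional
    # to its value. Assumes all activations are non-negative.
    total = sum(region)
    if total == 0:
        return 0.0  # all-zero region: there is nothing to sample
    idx = rng.choices(range(len(region)), weights=region, k=1)[0]
    return region[idx]


def stochastic_pool_1d(activations, size, rng=random):
    # Apply stochastic pooling over non-overlapping windows of a 1-D list.
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations) - size + 1, size)
    ]


rng = random.Random(0)  # seeded generator so runs are repeatable
print(stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], size=2, rng=rng))
```

Larger activations are chosen more often, but smaller non-zero ones still get selected occasionally, which is the source of the regularizing noise. No extra hyper-parameter is introduced beyond the pooling window size.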
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
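For reference, the link between Fisher information and estimator variance mentioned in the preceding abstract can be written out as follows. This is the standard scalar form of the Cramér-Rao bound for a single observation, not a formula quoted from the abstract; the symbols $\theta$, $\hat{\theta}$ and $I(\theta)$ are notation assumed here for illustration.

```latex
I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\left[\left(\frac{\partial}{\partial\theta}\log p(x;\theta)\right)^{2}\right],
\qquad
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)}
\quad \text{for any unbiased estimator } \hat{\theta}.
```

A parameter with large Fisher information therefore admits estimates with small variance, which is the sense in which it can be called "confident".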
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
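As a rough illustration of the "simple dropout based GD method for convex ERMs" described in the preceding abstract, the sketch below runs stochastic gradient descent on a logistic regression (a GLM) while dropping input features at random. The function names, the keep probability, the learning rate, the inverted-dropout rescaling, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import math
import random

def dropout_gd_logreg(X, y, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """SGD for logistic regression with dropout on the input features (illustrative)."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Keep each feature with probability keep_prob and rescale it
            # (inverted dropout, an assumption of this sketch).
            x_drop = [v / keep_prob if rng.random() < keep_prob else 0.0 for v in xi]
            z = sum(wj * xj for wj, xj in zip(w, x_drop))
            p = 1.0 / (1.0 + math.exp(-z))
            # The gradient of the logistic loss with respect to w is (p - y) * x.
            for j in range(d):
                w[j] -= lr * (p - yi) * x_drop[j]
    return w

def predict(w, x):
    """Classify with the full, undropped input."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z >= 0.0 else 0

# Hypothetical toy data: two linearly separable classes.
X = [[2.0, 1.0], [1.5, 1.0], [-2.0, -1.0], [-1.0, -1.5]]
y = [1, 1, 0, 0]
w = dropout_gd_logreg(X, y)
preds = [predict(w, x) for x in X]
```

Dropout is applied only during training; prediction uses every feature, which is why the kept features are rescaled by `1 / keep_prob` in the training loop.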
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
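The layer structure described in the preceding abstract (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched as below. The one-hot "nearest center" encoding, the squared Euclidean distance, the function name `mbn_layer`, and all parameter values are assumptions made for this illustration; the abstract itself does not specify them.

```python
import random

def mbn_layer(data, num_clusterings=3, k=2, seed=0):
    """Encode each point by its nearest center in several independent clusterings."""
    rng = random.Random(seed)
    # Each clustering independently draws its k centers from the data itself.
    centers = [rng.sample(data, k) for _ in range(num_clusterings)]

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    encoded = []
    for x in data:
        code = []
        for cs in centers:
            # Index of the closest center within this clustering.
            nearest = min(range(k), key=lambda i: sqdist(x, cs[i]))
            # One-hot indicator of the nearest center (assumed encoding).
            code.extend(1 if i == nearest else 0 for i in range(k))
        encoded.append(code)
    return encoded

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
codes = mbn_layer(points)
```

Stacking several such layers, each fed with the codes of the previous one, would give the multilayer variant; that stacking step is not shown here.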
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
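To make the distinction in the preceding abstract concrete, the sketch below applies batch normalization only to the input-to-hidden term of a plain tanh RNN step and leaves the hidden-to-hidden term untouched. The helper names, the absence of learned scale and shift parameters, the tanh nonlinearity, and the toy weights are assumptions for illustration only.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature across the mini-batch (no learned scale or shift)."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def rnn_step(x_batch, h_batch, W_x, W_h):
    """One RNN step with batch normalization on the input-to-hidden term only."""
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]  # hidden-to-hidden, not normalized
    return [[math.tanh(a + b) for a, b in zip(ia, ra)]
            for ia, ra in zip(in_term, rec_term)]

# Hypothetical mini-batch of three 2-D inputs and a 2-unit hidden state.
x_batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.9, 0.0], [0.0, 0.9]]
h_next = rnn_step(x_batch, h_batch, W_x, W_h)
```

In a full model the step would be applied at every time step, with the mini-batch statistics computed per step or shared across steps; that choice is left open here.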
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
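The idea of geometrically increasing the amount of data used per iteration, mentioned in the preceding abstract, can be illustrated with a simple schedule. The function name, the starting size, the growth factor, and the cap at the full dataset size are hypothetical choices for this sketch; the abstract does not give the actual schedule.

```python
import math

def geometric_sample_sizes(total, initial, growth, num_iters):
    """Return per-iteration sample sizes that grow geometrically, capped at total."""
    sizes = []
    size = float(initial)
    for _ in range(num_iters):
        sizes.append(min(total, int(math.ceil(size))))
        size *= growth  # multiply by a constant factor each outer iteration
    return sizes

sizes = geometric_sample_sizes(total=1000, initial=100, growth=2.0, num_iters=5)
print(sizes)  # [100, 200, 400, 800, 1000]
```

Early iterations touch only a small subset of the data, and later iterations approach the full dataset as the sizes reach the cap.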
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
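The stochastic rounding mentioned in the preceding abstract can be illustrated with a minimal sketch. It assumes a signed fixed-point format with `word_bits` total bits and `frac_bits` fractional bits; the function names, the default 16/8 split and the saturation behaviour are illustrative assumptions, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up = distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed word_bits range.

    The 16-bit word with 8 fractional bits is a hypothetical choice.
    """
    scale = 1 << frac_bits
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = round(stochastic_round(x, frac_bits, rng) * scale)
    return max(min_int, min(max_int, q)) / scale
```

Unlike round-to-nearest, updates smaller than half a grid step are not always lost: they survive with a probability proportional to their size, which is one common explanation for why this scheme can help in low-precision training.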
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
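The first direction in the preceding abstract, combining max and average pooling, can be sketched for a single pooling region. This is one plausible reading only: the names `mixed_pool` and `gated_pool`, the scalar mixing weight `a`, and the sigmoid gate are assumptions made for illustration, and in a real network `a` or `gate_weights` would be learned by gradient descent rather than passed in by hand.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one region.

    a is a mixing weight in [0, 1]: a = 1 gives max pooling,
    a = 0 gives average pooling.
    """
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def gated_pool(region, gate_weights):
    """Like mixed_pool, but the mixing weight depends on the region itself.

    gate_weights is a hypothetical parameter vector with one entry
    per element of the region.
    """
    score = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-score))  # sigmoid keeps a in (0, 1)
    return mixed_pool(region, a)
```

For the region `[1.0, 2.0, 3.0, 6.0]`, the maximum is 6.0 and the average is 3.0, so `mixed_pool` with `a = 0.5` returns 4.5, as does `gated_pool` with all-zero gate weights.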
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
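The stochastic pooling procedure described in the preceding abstract can be sketched for one pooling region. The sketch assumes non-negative activations (for example, after a rectifier) and returns 0.0 for an all-zero region; both are assumptions made here, and the function name `stochastic_pool` is illustrative.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from the region at random.

    Each activation is chosen with probability proportional to its
    value, i.e. a multinomial over the normalized activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        return 0.0  # all-zero region: nothing to sample from
    return rng.choices(region, weights=region, k=1)[0]
```

Larger activations are picked more often, so the output resembles max pooling on average, while the occasional selection of smaller activations injects noise during training.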
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
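The normalized $L_p$ norm described in the abstract above can be illustrated with a minimal sketch. It assumes the unit's inputs have already been projected to a list of pre-activations z, and the function name lp_unit is hypothetical rather than taken from the paper. It only shows how the order p moves the unit between average, root-mean-square, and max pooling of absolute values; learning p per unit, as the abstract proposes, is not shown.

```python
import math


def lp_unit(z, p):
    """Normalized L_p norm of pre-activations z: (mean(|z_i|^p))^(1/p).

    Hypothetical helper for illustration only; assumes p >= 1.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(z)
    largest = max(abs(v) for v in z)
    if largest == 0.0:
        return 0.0
    # Factor out the largest magnitude so that large p does not overflow.
    scaled_mean = sum((abs(v) / largest) ** p for v in z) / n
    return largest * scaled_mean ** (1.0 / p)


z = [3.0, -4.0, 0.0, 1.0]
print(lp_unit(z, 1))    # mean of absolute values
print(lp_unit(z, 2))    # root-mean-square
print(lp_unit(z, 200))  # approaches max |z_i| as p grows
```

With p = 1 the unit averages absolute values, with p = 2 it computes the root-mean-square, and as p grows it approaches the largest absolute input, which is the sense in which one unit generalizes several pooling operators.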
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
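The layer structure described in the multilayer bootstrap network abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-center assignment, the concatenated one-hot encoding, and the names `fit_layer`, `transform_layer`, `fit_transform`, `ks`, and `n_clusterings` are all assumptions made for illustration. Only the idea that each layer is a group of independent k-centers clusterings with randomly sampled data points as centers comes from the abstract.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_layer(data, n_clusterings, k, rng):
    """Build one layer: independent clusterings whose centers are
    k data points sampled at random (no optimization step)."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def transform_layer(layer, data):
    """Assumed encoding: for each clustering, a one-hot indicator of the
    nearest center; indicators of all clusterings are concatenated."""
    encoded = []
    for x in data:
        code = []
        for centers in layer:
            nearest = min(range(len(centers)),
                          key=lambda j: sq_dist(x, centers[j]))
            code.extend(1.0 if j == nearest else 0.0
                        for j in range(len(centers)))
        encoded.append(code)
    return encoded


def fit_transform(data, ks, n_clusterings=3, seed=0):
    """Stack layers; ks gives the number of centers per clustering in
    each layer (a hypothetical parameter name)."""
    rng = random.Random(seed)
    features = data
    for k in ks:
        layer = fit_layer(features, n_clusterings, k, rng)
        features = transform_layer(layer, features)
    return features


data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
codes = fit_transform(data, ks=[3, 2])
```

Each output code has `n_clusterings * ks[-1]` entries with exactly one active unit per clustering, so shrinking `k` from layer to layer reduces the dimensionality of the representation.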
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
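The variant discussed in the batch normalization abstract above, where only the input-to-hidden term is normalized, can be sketched as follows. This is an illustrative sketch, not the paper's code: the tanh nonlinearity, per-timestep mini-batch statistics, the absence of learned scale and shift parameters, and the names `batch_norm`, `rnn_step`, `W_x`, `W_h`, and `b` are assumptions.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def batch_norm(batch, eps=1e-5):
    """Standardize every feature with the mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def rnn_step(x_batch, h_batch, W_x, W_h, b):
    """One recurrent step: h_t = tanh(BN(W_x x_t) + W_h h_{t-1} + b).
    Only the input-to-hidden term is normalized; the hidden-to-hidden
    term is left untouched."""
    inp = batch_norm([matvec(W_x, x) for x in x_batch])
    rec = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(i + r + bias) for i, r, bias in zip(i_row, r_row, b)]
            for i_row, r_row in zip(inp, rec)]


normed = batch_norm([[1.0, 2.0], [3.0, 6.0]])

x_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 examples, 2 inputs
h_batch = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # initial hidden state
W_x = [[0.5, -0.5], [1.0, 2.0]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
h_next = rnn_step(x_batch, h_batch, W_x, W_h, b)
```

Keeping the recurrent term outside the normalization mirrors the abstract's finding that normalizing hidden-to-hidden transitions did not help training.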
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
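The Hessian-free abstract above relies on a Krylov subspace solver that never forms the Hessian explicitly. A minimal sketch of that idea, under stated assumptions: plain conjugate gradient driven by finite-difference Hessian-vector products on a toy two-variable quadratic. The names `hvp`, `conjugate_gradient` and `grad` are illustrative, and the L-BFGS preconditioning and geometric data sampling proposed in the abstract are not included.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def hvp(grad, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H v with central differences of the gradient."""
    g_plus = grad([wi + eps * vi for wi, vi in zip(w, v)])
    g_minus = grad([wi - eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0
    p = list(r)          # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def grad(w):
    """Gradient of the toy objective 0.5 w^T A w - b^T w, A = [[3, 1], [1, 2]], b = [1, 1]."""
    return [3.0 * w[0] + w[1] - 1.0, w[0] + 2.0 * w[1] - 1.0]


w = [0.0, 0.0]
g = grad(w)
# Newton step: solve H d = -g using Hessian-vector products only.
step = conjugate_gradient(lambda v: hvp(grad, w, v), [-gi for gi in g])
w_new = [wi + si for wi, si in zip(w, step)]
```

On this quadratic a single Newton step reaches the minimizer; for a real network the gradient would come from backpropagation over a data sample.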
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
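The stochastic rounding mentioned in the abstract above can be illustrated with a small sketch. It assumes a fixed-point grid with step 2**-frac_bits and rounds up with probability equal to the fractional remainder, so the rounded value is unbiased in expectation. The function name and the `frac_bits` parameter are illustrative, and saturation to a 16-bit word is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a grid of step 2**-frac_bits, rounding up with
    probability equal to the distance from the lower grid point."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


rng = random.Random(0)
x = 0.3
# With 2 fractional bits the grid step is 0.25, so 0.3 lands on 0.25 or 0.5.
samples = [stochastic_round(x, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), round(mean, 3))
```

Round-to-nearest would map 0.3 to 0.25 every time and lose the remainder; the stochastic variant keeps it on average, which is one plausible reading of why the rounding scheme matters for small gradient updates.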
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
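The first direction in the pooling abstract above, combining max and average pooling, can be sketched in its simplest form as a convex combination of the two. This is an assumption for illustration: the abstract does not state the exact form of the combination, and `alpha` is held fixed here, whereas a learned variant would train it by gradient descent. The names `mixed_pool` and `mixed_pool_1d` are hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one region.
    alpha = 1.0 gives max pooling, alpha = 0.0 gives average pooling."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * max(region) + (1.0 - alpha) * (sum(region) / len(region))


def mixed_pool_1d(activations, size, alpha):
    """Apply mixed pooling to non-overlapping windows of a 1-D list."""
    return [
        mixed_pool(activations[i:i + size], alpha)
        for i in range(0, len(activations) - size + 1, size)
    ]


pooled = mixed_pool_1d([1.0, 2.0, 3.0, 6.0], size=2, alpha=0.5)
print(pooled)
```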
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
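The stochastic pooling procedure described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: activations are taken to be non-negative (for example after a rectifier), the name stochastic_pool is hypothetical, and a region with no active units is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    so the region is assumed to hold non-negative values.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        return 0.0  # nothing active, so there is nothing to sample
    # random.choices draws one item with the given (unnormalized) weights,
    # which is a single multinomial draw over the region.
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch repeatable
region = [0.0, 2.0, 0.0, 6.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
```

Here the value 6.0 should be drawn about three times as often as 2.0, and the zero activations are never selected.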
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
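To make the pooling interpretation of the $L_p$ unit concrete, here is a minimal sketch of a normalized $L_p$ norm over a list of projections. It is an assumption-laden simplification: the name lp_unit is hypothetical, the order p is fixed rather than learned, and the projections z are given directly instead of being computed from the layer below.

```python
def lp_unit(z, p):
    """Normalized L_p norm of the projections z: (mean(|z_i|^p))^(1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not z:
        raise ValueError("z must not be empty")
    if p < 1:
        raise ValueError("p must be at least 1")
    return (sum(abs(v) ** p for v in z) / len(z)) ** (1.0 / p)


z = [3.0, -4.0]
mean_abs = lp_unit(z, 1)   # average pooling of |z|
rms = lp_unit(z, 2)        # root-mean-square pooling
near_max = lp_unit(z, 50)  # close to max(|z|) = 4.0
```

Varying p moves the unit smoothly between these pooling operators, which is the sense in which the abstract calls it a generalization of them.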
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
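The "simple dropout based GD method for convex ERMs" mentioned above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the convex loss is logistic regression, dropout is applied to input features with inverted scaling, and the function name `dropout_gd_logreg`, its hyperparameters, and the synthetic data are hypothetical.

```python
import numpy as np

def dropout_gd_logreg(X, y, keep_prob=0.8, lr=0.1, steps=500, seed=0):
    """Gradient descent on the logistic loss with dropout on input features.

    Illustrative assumption: a fresh Bernoulli mask is drawn at every step and
    kept features are rescaled by 1/keep_prob (inverted dropout).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        mask = rng.random((n, d)) < keep_prob      # keep each feature w.p. keep_prob
        X_drop = X * mask / keep_prob              # rescale so E[X_drop] = X
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))    # predicted probabilities
        grad = X_drop.T @ (p - y) / n              # gradient of the mean logistic loss
        w -= lr * grad
    return w

# Synthetic, hypothetical data: labels follow a fixed linear rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = (X @ w_true > 0).astype(float)

w = dropout_gd_logreg(X, y)
accuracy = np.mean((X @ w > 0) == (y == 1))
```

Because the mask is resampled at every step, each update sees a slightly perturbed copy of the data, which is the noise that the stability argument in the abstract refers to.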
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
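A single layer of the multilayer bootstrap network described above (a group of independent k-centers clusterings whose centers are randomly sampled data points) might be sketched as follows. The abstract does not say how points are assigned to centers or how the layer output is encoded, so the squared Euclidean distance and the one-hot encoding used here are assumptions, and `mbn_layer` is a hypothetical name.

```python
import numpy as np

def mbn_layer(X, n_clusterings, k, rng):
    """One bootstrap layer: n_clusterings independent k-centers clusterings.

    Each clustering draws k rows of X at random as centers and encodes every
    point as a one-hot indicator of its nearest center (assumed encoding).
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        centers = X[rng.choice(n, size=k, replace=False)]   # random data points
        # Squared Euclidean distance from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), nearest] = 1.0
        codes.append(one_hot)
    # Concatenate the indicators of all clusterings into one sparse code.
    return np.hstack(codes)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical toy data
H = mbn_layer(X, n_clusterings=10, k=8, rng=rng)
```

Stacking would feed `H` into another call of `mbn_layer`; the choice of `k` per layer is left open here because the abstract does not specify it.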
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
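To make the distinction between the two placements concrete, here is a minimal sketch of a recurrent step in which batch normalization is applied only to the input-to-hidden term. The tanh unit, the use of per-time-step mini-batch statistics, and the names `batch_norm` and `rnn_forward` are assumptions for illustration; the abstract does not describe the authors' exact formulation.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_forward(X, W_x, W_h, gamma, beta):
    """Run a tanh RNN over X of shape (time, batch, features).

    Only the input-to-hidden term X[t] @ W_x is normalized; the
    hidden-to-hidden term h @ W_h is left untouched.
    """
    T, B, _ = X.shape
    h = np.zeros((B, W_h.shape[0]))
    for t in range(T):
        h = np.tanh(batch_norm(X[t] @ W_x, gamma, beta) + h @ W_h)
    return h

rng = np.random.default_rng(0)
T, B, D, H = 6, 32, 4, 8               # hypothetical sizes
X = rng.normal(size=(T, B, D))
W_x = 0.1 * rng.normal(size=(D, H))
W_h = 0.1 * rng.normal(size=(H, H))
h_last = rnn_forward(X, W_x, W_h, gamma=np.ones(H), beta=np.zeros(H))
```

Here `beta` doubles as the bias of the recurrent unit, so no separate bias vector is needed.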
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
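The stochastic rounding scheme mentioned in the abstract above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the paper's implementation: the function names and the choice of 8 fractional bits are assumptions. A value is rounded up to the next fixed-point grid point with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation, whereas round-to-nearest discards any value below half a grid step.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up onto the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001  # hypothetical small update, below half a grid step of 1/256
    n = 10000
    mean_sr = sum(stochastic_round(update, 8, rng) for _ in range(n)) / n
    # Round-to-nearest loses the update entirely; the stochastic mean stays near it.
    print(round_to_nearest(update), mean_sr)
```

The comparison in the `__main__` block shows why the rounding scheme matters for small updates: deterministic rounding maps them to zero every time, while stochastic rounding preserves them on average.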
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
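The input-to-hidden variant described in the preceding abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (batch_norm, rnn_step_bn_input), the tanh nonlinearity, and the use of per-time-step mini-batch statistics in training mode only are all assumptions made for illustration.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature of z using mini-batch statistics (axis 0 is the batch)."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_step_bn_input(x_t, h_prev, W_x, W_h, gamma, beta):
    """One recurrent step (hypothetical helper).

    Only the input-to-hidden term x_t @ W_x is normalized; the
    hidden-to-hidden term h_prev @ W_h is left untouched. beta plays
    the role of the bias.
    """
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h)

# Usage on random data: batch of 32, 5 input features, 4 hidden units.
rng = np.random.default_rng(0)
x_t = rng.normal(size=(32, 5))
h_prev = np.zeros((32, 4))
W_x = rng.normal(size=(5, 4))
W_h = rng.normal(size=(4, 4))
h_t = rnn_step_bn_input(x_t, h_prev, W_x, W_h, np.ones(4), np.zeros(4))
```

At test time one would typically replace the mini-batch mean and variance with running averages; that bookkeeping is omitted here.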
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
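As an illustration of the stochastic rounding mentioned in the preceding abstract, here is a minimal NumPy sketch. It is an assumption-laden example rather than the paper's scheme: the function name stochastic_round_fixed_point, the split into frac_bits fractional bits within a signed word_bits word, and the saturation behaviour are all hypothetical choices. The key property is that a value is rounded up with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    Each value is rounded up with probability equal to its distance from
    the grid point below it (in grid units), and rounded down otherwise.
    Values outside the representable range are saturated.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Fractional remainder in [0, 1) is the probability of rounding up.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the range of a signed word_bits-bit integer.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale

# Usage: 0.3 lies between the grid points 0.25 and 0.3125 when frac_bits=4,
# and the average of many stochastic roundings stays close to 0.3.
rng = np.random.default_rng(0)
samples = stochastic_round_fixed_point(np.full(200000, 0.3), frac_bits=4, rng=rng)
```

Round-to-nearest would map every 0.3 to 0.3125 in this setting, which is the kind of systematic bias that stochastic rounding avoids.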
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al.,
2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
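The "simple dropout based GD method for convex ERMs" mentioned in the dropout abstract above can be illustrated roughly as follows. This is only a minimal sketch, not the algorithm analyzed in that paper: it assumes logistic regression as the convex GLM, per-example updates, and inverted dropout applied to the input features. The function names, `keep_prob`, `lr`, and `epochs` are all hypothetical choices made for illustration.

```python
import math
import random


def dropout_mask(dim, keep_prob, rng):
    """Inverted dropout mask: each entry is 1/keep_prob with probability
    keep_prob and 0 otherwise, so the mask has expectation 1."""
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0
            for _ in range(dim)]


def dropout_gd_logistic(data, labels, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Gradient descent on the logistic loss, with a fresh dropout mask
    applied to the input features at every update (assumed setup)."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    for _ in range(epochs):
        for x, y in zip(data, labels):
            mask = dropout_mask(len(x), keep_prob, rng)
            xd = [m * xi for m, xi in zip(mask, x)]      # dropped-out input
            z = sum(wi * xi for wi, xi in zip(w, xd))
            p = 1.0 / (1.0 + math.exp(-z))               # predicted probability
            # Gradient of the logistic loss w.r.t. w is (p - y) * xd.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, xd)]
    return w
```

At prediction time the learned weights would be used without a mask; because the mask is rescaled by `1/keep_prob`, no further correction is needed in this sketch.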
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
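One layer of the multilayer bootstrap network described above might be sketched as follows. The abstract only states that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points. The one-hot output encoding, the squared Euclidean distance, and every name below (`build_layer`, `encode`, `n_clusterings`, `k`) are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def build_layer(data, n_clusterings, k, rng):
    """Build one layer: n_clusterings independent clusterings, each defined
    by k centers sampled at random (without replacement) from the data."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def encode(x, layer):
    """Encode x as the concatenation of one-hot vectors, one per clustering,
    marking the nearest center in that clustering (assumed encoding)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        code.extend(1.0 if i == nearest else 0.0 for i in range(len(centers)))
    return code
```

Stacking would then amount to encoding the whole dataset with one layer and building the next layer on the encoded points.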
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
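The input-to-hidden variant mentioned above can be sketched in a few lines. This is an illustrative training-time sketch only: it assumes a plain tanh RNN cell, scalar `gamma` and `beta` shared across features, and no bias terms or running statistics. The helper names (`batch_norm`, `matmul_batch`, `rnn_step`) are hypothetical and not taken from the paper.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) using mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in range(dim)] for row in batch]


def matmul_batch(batch, weights):
    """Multiply each row vector of the batch by an (in_dim x out_dim) matrix."""
    return [[sum(x[i] * weights[i][j] for i in range(len(x)))
             for j in range(len(weights[0]))] for x in batch]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """h_t = tanh(BN(x_t W_xh) + h_{t-1} W_hh): only the input-to-hidden
    term is normalized; the hidden-to-hidden term is left untouched."""
    inp = batch_norm(matmul_batch(x_batch, w_xh))
    rec = matmul_batch(h_batch, w_hh)
    return [[math.tanh(a + b) for a, b in zip(ri, rr)]
            for ri, rr in zip(inp, rec)]
```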
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
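Stochastic rounding, as referred to in the limited-precision abstract above, rounds a value up or down to the neighbouring fixed-point grid points with probabilities chosen so that the result is unbiased in expectation. The sketch below is a minimal illustration under assumptions: the split of the 16 bits into 8 integer and 8 fractional bits, the saturation behaviour, and the function names are illustrative choices, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a grid with step 2**-frac_bits. The value is rounded up
    with probability equal to its fractional distance from the lower grid
    point, so the expected result equals x."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed range
    representable with int_bits + frac_bits bits (assumed 8 + 8 = 16)."""
    step = 2.0 ** -frac_bits
    max_val = 2.0 ** (int_bits - 1) - step
    min_val = -(2.0 ** (int_bits - 1))
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

In contrast to round-to-nearest, an update smaller than half a grid step is not always lost here: it is applied with a small probability, which keeps the accumulated weight changes correct on average.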
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
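The idea of learning a pooling function by combining max and average pooling, described in the preceding abstract, can be sketched as follows. This is one plausible reading, not the authors' implementation: the convex-combination form, the mixing weight `a`, and the function names are assumptions made for illustration.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling.
    The mixing weight a is a hypothetical learnable parameter.
    """
    mx = max(region)
    avg = sum(region) / len(region)
    return a * mx + (1.0 - a) * avg


def mixed_pool_grad_a(region):
    """Derivative of mixed_pool with respect to a: max(region) - mean(region)."""
    return max(region) - sum(region) / len(region)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    a = 0.5
    print(mixed_pool(region, a))
    # One illustrative gradient-descent step on a, given an upstream gradient.
    upstream_grad, lr = 1.0, 0.1
    a -= lr * upstream_grad * mixed_pool_grad_a(region)
    print(a)
```

Because the output is differentiable in `a`, the mixing weight can be trained by backpropagation together with the other network parameters.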
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
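The stochastic pooling procedure in the preceding abstract, which picks one activation per pooling region according to a multinomial distribution given by the activities in that region, can be sketched in plain Python. The function name, the assumption of non-negative activations (for example after a rectifier), and the handling of an all-zero region are choices made for this example only.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value.
    Assumes non-negative activations; returns 0.0 for an all-zero region
    (an illustrative convention, not taken from the abstract).
    """
    total = sum(region)
    if total <= 0:
        return 0.0
    probs = [v / total for v in region]
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 0.0, 3.0, 6.0]
    draws = [stochastic_pool(region, rng=rng) for _ in range(10)]
    # Larger activations are drawn more often; the zero is never drawn.
    print(draws)
```

Larger activations are selected more often but not always, which injects noise during training without adding any hyper-parameter.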
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
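As a rough illustration of the "dropout based GD method for convex ERMs" mentioned in the dropout abstract above, the following is a minimal NumPy sketch, not the authors' algorithm. It assumes logistic regression as the convex GLM and assumes dropout is applied to the input features with inverted scaling; the names dropout_gd_logistic, make_data and keep_prob are hypothetical and the data is synthetic.

```python
import numpy as np


def dropout_gd_logistic(X, y, keep_prob=0.8, lr=0.1, steps=500, seed=0):
    """Gradient descent on the logistic loss with dropout on input features.

    Assumption: each feature is dropped independently with probability
    1 - keep_prob at every step, and kept features are rescaled by
    1 / keep_prob so the expected input is unchanged.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        mask = rng.random((n, d)) < keep_prob      # fresh mask per step
        X_drop = X * mask / keep_prob              # inverted dropout
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))    # predicted probabilities
        grad = X_drop.T @ (p - y) / n              # gradient of mean logistic loss
        w -= lr * grad
    return w


def make_data(n=200, d=5, seed=1):
    """Synthetic, linearly separable toy data (purely illustrative)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = np.arange(1, d + 1, dtype=float)
    y = (X @ w_true > 0).astype(float)
    return X, y


X, y = make_data()
w = dropout_gd_logistic(X, y)
accuracy = np.mean((X @ w > 0) == (y == 1))
```

Because the mask is resampled at every step, no single training point or feature dominates an update, which is the informal intuition behind the stability claim; the sketch does not verify that claim.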
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
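The layer structure described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the one-hot encoding of the nearest center, the squared Euclidean distance, and the names bootstrap_layer, n_clusterings and k are assumptions, not details given in the abstract.

```python
import numpy as np


def bootstrap_layer(X, n_clusterings=10, k=4, seed=0):
    """One layer built from independent k-centers clusterings.

    Each clustering picks k distinct data points at random as its centers
    and encodes every input as a one-hot vector marking its nearest center
    (an assumed encoding). The outputs of all clusterings are concatenated.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        codes.append(np.eye(k)[nearest])
    return np.concatenate(codes, axis=1)


rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))                    # toy data, purely illustrative
H = bootstrap_layer(X, n_clusterings=10, k=4)   # sparse binary code per point
```

Stacking such layers would feed H into the next call in place of X; how the number of centers changes from layer to layer is not specified in the abstract and is left out here.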
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
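To make the distinction in the batch normalization abstract above concrete, here is a minimal NumPy sketch of a vanilla tanh RNN in which only the input-to-hidden term is batch-normalized and the hidden-to-hidden term is left untouched. It is a sketch under stated assumptions rather than the paper's model: statistics are computed separately at each time step over the mini-batch, only the training-mode forward pass is shown, and the names batch_norm, rnn_forward, gamma and beta are hypothetical.

```python
import numpy as np


def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature using mini-batch statistics (training mode)."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta


def rnn_forward(x_seq, W_xh, W_hh, b, gamma, beta):
    """Vanilla RNN forward pass with batch norm on the input-to-hidden term.

    x_seq has shape (time, batch, input_dim). The recurrent term h @ W_hh
    is deliberately not normalized.
    """
    T, B, _ = x_seq.shape
    h = np.zeros((B, W_hh.shape[0]))
    hs = []
    for t in range(T):
        in_term = batch_norm(x_seq[t] @ W_xh, gamma, beta)  # normalized
        h = np.tanh(in_term + h @ W_hh + b)                 # recurrent term as-is
        hs.append(h)
    return np.stack(hs)


rng = np.random.default_rng(0)
T, B, D, Hdim = 5, 8, 3, 4                      # toy sizes, purely illustrative
x_seq = rng.normal(size=(T, B, D))
W_xh = rng.normal(scale=0.5, size=(D, Hdim))
W_hh = rng.normal(scale=0.5, size=(Hdim, Hdim))
b = np.zeros(Hdim)
gamma, beta = np.ones(Hdim), np.zeros(Hdim)

hs = rnn_forward(x_seq, W_xh, W_hh, b, gamma, beta)
normed = batch_norm(x_seq[0] @ W_xh, gamma, beta)
```

With gamma set to one and beta to zero, each column of normed has approximately zero mean and unit variance over the mini-batch, which is the standardization the abstract refers to.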
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
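The stochastic rounding named in the preceding abstract can be illustrated with a minimal sketch. It assumes the common definition: a value is rounded to one of its two neighbouring fixed-point grid points, rounding up with probability equal to its fractional distance from the lower point, so the result is unbiased in expectation. The function name `stochastic_round`, the `frac_bits` parameter, and the use of Python's `random` module are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, and down otherwise.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    frac = scaled - lower           # distance to that point, in [0, 1)
    rounded = lower + (1 if rng.random() < frac else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    # With 2 fractional bits the grid step is 0.25, so 0.3 becomes 0.25 or 0.5.
    draws = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10)]
    print(draws)
```

Averaged over many draws the rounded values approach the original value, which is the property that distinguishes this scheme from round-to-nearest. Saturation to a fixed word length is omitted here for brevity.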
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
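The stochastic pooling procedure described in the preceding abstract can be sketched as follows. The sketch assumes non-negative activations (for example, outputs of rectifier units), non-overlapping square pooling regions, and a feature map given as a list of lists; within each region one activation is sampled with probability proportional to its value. The function names and the convention of returning 0.0 for an all-zero region are illustrative assumptions.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation with probability proportional to its value."""
    total = sum(activations)
    if total <= 0:
        # Assumed convention: an all-zero region pools to zero.
        return 0.0
    # random.choices draws from a multinomial defined by the weights.
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_2d(fmap, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool_region(region, rng))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    feature_map = [
        [1.0, 0.0, 0.5, 1.5],
        [0.0, 3.0, 0.0, 0.0],
        [2.0, 2.0, 0.0, 0.0],
        [2.0, 2.0, 0.0, 4.0],
    ]
    print(stochastic_pool_2d(feature_map, size=2, rng=random.Random(0)))
```

Because larger activations are more likely to be selected, the output resembles max pooling on average while still injecting noise during training.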
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
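The normalized $L_p$ norm that the abstract above describes as generalizing average, root-mean-square and max pooling can be sketched as follows. This is an illustrative reading of that description, assuming the unit receives a list of already-computed projections; the function name and the exact normalization (dividing by the number of inputs) are assumptions, not the authors' definition.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a list of projections.

    Computes ((1/N) * sum(|z_i| ** p)) ** (1/p).
    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [1.0, 2.0, 4.0]
    for order in (1, 2, 200):
        print(order, lp_unit(z, order))
```

Varying the order moves the unit smoothly between average-like and max-like pooling, which is the behaviour the abstract appeals to when it argues for learning a different order per unit.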
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
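For reference, the link between Fisher information and estimator variance invoked above can be written out for the scalar case. This is the standard textbook form of the Cramér-Rao bound; the symbols $\hat{\theta}$ (an unbiased estimator of $\theta$) and $I(\theta)$ (the Fisher information) are notation assumed here, not taken from the abstract.

$$\mathrm{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)}, \qquad I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log p(x;\theta)\right)^{2}\right]$$

A parameter with large $I(\theta)$ therefore admits low-variance unbiased estimates, which is the sense in which such a parameter can be called confident.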
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
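The layer structure described in the last abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) might be sketched roughly as follows. This is only an illustration under stated assumptions: the names `fit_layer` and `transform` are hypothetical, and the one-hot nearest-center encoding of a layer's output is an assumption, since the abstract does not say how a layer's output is represented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering is k data points sampled at random as centers."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def transform(point, layer):
    """Concatenate one-hot nearest-center codes from every clustering (assumed encoding)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


rng = random.Random(0)
data = [[rng.random(), rng.random()] for _ in range(20)]  # toy 2-D data
layer = fit_layer(data, num_clusterings=3, k=4, rng=rng)
codes = [transform(p, layer) for p in data]
```

Each point is mapped to a sparse binary vector of length `num_clusterings * k`; stacking several such layers would feed these codes into the next layer in place of `data`.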
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
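To make the input-to-hidden variant discussed above concrete, here is a minimal pure-Python sketch of a single tanh RNN step in which only the input-to-hidden term is standardized with mini-batch statistics. It is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the learned gain and shift parameters that batch normalization normally carries are omitted, and the weights are random toy values.

```python
import math
import random


def matmul(a, b):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batch_norm(rows, eps=1e-5):
    """Standardize each feature (column) with the mini-batch mean and variance."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]
    return [[(r[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for r in rows]


def rnn_step(x_t, h_prev, w_xh, w_hh, bias):
    """One tanh RNN step; batch norm is applied to the input-to-hidden term only."""
    xh = batch_norm(matmul(x_t, w_xh))  # normalized input-to-hidden contribution
    hh = matmul(h_prev, w_hh)           # hidden-to-hidden contribution, left untouched
    return [[math.tanh(a + b + c) for a, b, c in zip(r_x, r_h, bias)]
            for r_x, r_h in zip(xh, hh)]


random.seed(0)
batch, n_in, n_hid = 4, 3, 2
x_t = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(batch)]
h_prev = [[0.0] * n_hid for _ in range(batch)]
w_xh = [[random.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_in)]
w_hh = [[random.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_hid)]
bias = [0.0] * n_hid
h_t = rnn_step(x_t, h_prev, w_xh, w_hh, bias)
```

Leaving the hidden-to-hidden product unnormalized mirrors the abstract's observation that normalizing that transition did not help training.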
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
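The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step 2**-frac_bits and ignores word length, overflow and saturation; the function name stochastic_round and its parameters are hypothetical and not taken from the paper. A value is rounded up with probability equal to its fractional remainder, so the rounded value equals the input in expectation, whereas round-to-nearest always maps 0.3 to 0.25 on a grid with step 0.25.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    remainder on the grid, which makes the result unbiased in expectation.
    Word length, overflow and saturation are deliberately not modelled.
    """
    scale = 1 << frac_bits           # grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)
    remainder = scaled - lower       # lies in [0, 1)
    if rng.random() < remainder:
        lower += 1
    return lower / scale


rng = random.Random(0)               # fixed seed for repeatability
x = 0.3
samples = [stochastic_round(x, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)   # close to x on average
nearest = round(x * 4) / 4           # round-to-nearest on the same grid
print(sorted(set(samples)), mean, nearest)
```

Each sample is one of the two grid neighbours of x, while their average stays close to x; this is the property that lets small updates survive on average instead of always being rounded away.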
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
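The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a small sketch. It assumes the simplest possible combination, a convex mix controlled by a single coefficient; the name mixed_pool and the parameter alpha are hypothetical, and alpha is fixed here whereas the abstract describes learning the pooling function.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. In a trained network alpha would be a learned parameter; here
    it is passed in by hand (an assumption made for illustration).
    """
    if not region:
        raise ValueError("region must contain at least one activation")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


region = [1.0, 2.0, 3.0, 6.0]
as_max = mixed_pool(region, 1.0)   # same as max pooling
as_avg = mixed_pool(region, 0.0)   # same as average pooling
blended = mixed_pool(region, 0.5)  # halfway between the two
```

Because the output is differentiable in alpha, a coefficient like this could be trained by gradient descent alongside the other weights, which is one way to read the "learning a pooling function" phrasing.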
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
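The stochastic pooling procedure in the abstract above (pick one activation per pooling region according to a multinomial distribution given by the activities) can be sketched in a few lines. This is a minimal illustration, assuming non-negative activations such as those after a rectifier; the function name stochastic_pool and the handling of an all-zero region are assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its value (activations assumed non-negative)."""
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative (e.g. post-ReLU)")
    total = sum(region)
    if total == 0:
        # All-zero region: nothing to sample, so the pooled value is zero.
        return 0.0
    # random.choices draws index i with probability weights[i] / sum(weights).
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch repeatable
region = [0.0, 1.0, 3.0, 0.0]
samples = [stochastic_pool(region, rng) for _ in range(10000)]
frac_three = samples.count(3.0) / len(samples)  # expected near 3 / (1 + 3)
```

Zero activations are never selected, and larger activations are selected more often, so the procedure sits between max pooling (always the largest) and a uniform random choice.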
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
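The claim in the abstract above that the $L_p$ unit generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the normalized $L_p$ norm takes the form (mean of |v_i|^p)^(1/p); it leaves out the learned projections and the per-unit learned order $p$ that the abstract describes, and the name lp_pool is hypothetical.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|v|^p))^(1/p).

    Assumed form for illustration; p is passed in by hand rather than
    learned per unit.
    """
    if not values:
        raise ValueError("values must contain at least one input")
    if p < 1:
        raise ValueError("p must be at least 1")
    mean_power = sum(abs(v) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


acts = [0.5, -2.0, 1.0, 3.0]
mean_abs = lp_pool(acts, 1)    # p = 1: average of absolute values
rms = lp_pool(acts, 2)         # p = 2: root-mean-square
near_max = lp_pool(acts, 200)  # large p: approaches max(|v|)
```

As p grows, the largest magnitude dominates the sum, so the output moves from an average toward the maximum absolute input. That is the sense in which a single order parameter interpolates between the conventional pooling operators.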
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
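The input-to-hidden variant mentioned in the abstract above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`batch_norm`, `matvec_batch`, `rnn_step`), the tanh nonlinearity, and the omission of the learned scale and shift parameters are all assumptions made here for brevity. The sketch standardizes only the input-to-hidden term with mini-batch statistics and leaves the hidden-to-hidden term untouched.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) with mini-batch mean and variance.
    The learned scale/shift of full batch normalization is omitted (assumption)."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]

def matvec_batch(batch, weights):
    """Compute row @ weights for every row; weights has shape in_dim x out_dim."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in batch]

def rnn_step(x_batch, h_batch, w_in, w_rec):
    """One recurrent step with batch norm on the input-to-hidden term only."""
    in_term = batch_norm(matvec_batch(x_batch, w_in))  # normalized
    rec_term = matvec_batch(h_batch, w_rec)            # left unnormalized
    return [[math.tanh(a + b) for a, b in zip(row_in, row_rec)]
            for row_in, row_rec in zip(in_term, rec_term)]
```

Calling `rnn_step` once per time step, feeding the returned hidden states back in as `h_batch`, gives a plain recurrent loop in which only the input projection is normalized.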
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation performs better than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
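Stochastic rounding to a fixed-point grid, as referred to in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: the function name `stochastic_round`, the split of the 16-bit word into 8 fractional bits, and the saturation behaviour at the range limits are choices made here, not details given in the abstract. A value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded result is unbiased in expectation.

```python
import math
import random

def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with spacing 2**-frac_bits.
    frac_bits=8 is an illustrative choice, not taken from the abstract."""
    scale = 1 << frac_bits            # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed range of the word.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, lower)) / scale
```

Round-to-nearest would map every value smaller than half the grid spacing to zero, whereas averaging many stochastic roundings of such a value recovers it approximately, which is why small gradient updates are not systematically lost.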
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
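As a rough illustration of the normalized $L_p$ norm described in the abstract above, the following Python sketch computes it over a list of already-projected inputs. The function name lp_unit and the rescaling by the largest magnitude (used only to avoid overflow for large p) are assumptions made for this example, not details taken from the paper; the learned projections and the learned per-unit order p are left out.

```python
def lp_unit(projections, p):
    """Normalized L_p norm: (mean(|z_i|^p))^(1/p) over the given values.

    `projections` stands in for the outputs of the linear projections that
    feed one unit; how they are produced is outside this sketch.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if not projections:
        raise ValueError("projections must be non-empty")
    n = len(projections)
    # Factor out the largest magnitude so that large p does not overflow.
    m = max(abs(z) for z in projections)
    if m == 0.0:
        return 0.0
    mean_pow = sum((abs(z) / m) ** p for z in projections) / n
    return m * mean_pow ** (1.0 / p)


if __name__ == "__main__":
    z = [3.0, -4.0]
    print(lp_unit(z, 1))    # mean of absolute values
    print(lp_unit(z, 2))    # root-mean-square
    print(lp_unit(z, 200))  # approaches max(|z_i|) as p grows
```

With p = 1 the unit reduces to averaging absolute values, with p = 2 to root-mean-square pooling, and for large p it approaches max pooling over magnitudes, which matches the generalization claim in the abstract.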
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
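The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a rough sketch. The convex combination below, the function name `mixed_pool2d`, and the fixed mixing weight `a` are assumptions made for illustration only; in the abstract the combination is learned, and the exact strategies are not spelled out there.

```python
import numpy as np

def mixed_pool2d(x, a, size=2):
    """Blend max and average pooling over non-overlapping size x size regions.

    `a` is a hypothetical mixing weight in [0, 1]: a=1 gives max pooling,
    a=0 gives average pooling. Assumes both dimensions of the 2-D map `x`
    are divisible by `size`.
    """
    h, w = x.shape
    # Rearrange into (num_regions, size*size): one row per pooling region.
    regions = (x.reshape(h // size, size, w // size, size)
                .transpose(0, 2, 1, 3)
                .reshape(-1, size * size))
    pooled = a * regions.max(axis=1) + (1.0 - a) * regions.mean(axis=1)
    return pooled.reshape(h // size, w // size)

x = np.arange(16, dtype=float).reshape(4, 4)
half = mixed_pool2d(x, a=0.5)  # halfway between max and average pooling
```

In a trainable version, `a` would be a parameter updated by gradient descent together with the network weights; here it is held fixed to keep the sketch small.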
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
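The stochastic pooling procedure described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `stochastic_pool2d`, the restriction to non-overlapping regions, the assumption of non-negative activations, and the handling of all-zero regions are choices made here.

```python
import numpy as np

def stochastic_pool2d(x, size=2, rng=None):
    """Training-time stochastic pooling over non-overlapping size x size regions.

    Assumes `x` is a 2-D map of non-negative activations (e.g. after a ReLU)
    whose dimensions are both divisible by `size`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    # Rearrange into (num_regions, size*size): one row per pooling region.
    regions = (x.reshape(h // size, size, w // size, size)
                .transpose(0, 2, 1, 3)
                .reshape(-1, size * size))
    out = np.zeros(regions.shape[0], dtype=float)
    for i, region in enumerate(regions):
        total = region.sum()
        if total > 0:
            # Multinomial probabilities proportional to the activities.
            probs = region / total
            out[i] = rng.choice(region, p=probs)
        # An all-zero region has no valid distribution; its output stays 0.
    return out.reshape(h // size, w // size)

x = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 2.0],
              [0.0, 0.0, 2.0, 2.0]])
pooled = stochastic_pool2d(x, size=2, rng=np.random.default_rng(0))
```

Because larger activations are sampled more often, the result behaves like a noisy version of max pooling, which is where the regularizing effect comes from. The sketch covers only the sampling step used during training.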
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning and what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
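As a rough illustration of the normalized $L_p$ norm described above, the following NumPy sketch computes the activation of a single unit. The names `lp_unit`, `W`, and `c` are hypothetical and are not taken from the abstract, and the exact parameterization used in the paper may differ. The sketch only shows how the order $p$ moves the unit between average-like, root-mean-square, and max-like pooling.

```python
import numpy as np

def lp_unit(x, W, c, p):
    # Project the input onto N directions (rows of W) and shift by c.
    # W and c are assumed parameters for this sketch.
    z = np.abs(W @ x - c)
    # Normalized L_p norm: mean of |z|^p, followed by the p-th root.
    return float(np.mean(z ** p) ** (1.0 / p))

x = np.array([1.0, -2.0, 3.0])
W = np.eye(3)    # identity projections, chosen only for illustration
c = np.zeros(3)

avg = lp_unit(x, W, c, p=1)         # mean absolute value of the projections
rms = lp_unit(x, W, c, p=2)         # root-mean-square of the projections
near_max = lp_unit(x, W, c, p=100)  # approaches max(|z|) as p grows
```

With identity projections the three calls reduce to average, root-mean-square, and (approximately) max pooling over the absolute inputs, which is the generalization the abstract refers to.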
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine the performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
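The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a small sketch. The abstract does not spell out the exact scheme, so this uses the common unbiased form (round up with probability equal to the fractional remainder); the function names, the 8 fractional bits, and the omission of range saturation are all assumptions made for illustration.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the result equals x in
    expectation. (Hypothetical helper; overflow handling is omitted.)
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


rng = random.Random(0)
x = 0.001  # smaller than half a grid step (2**-8 is about 0.0039)
samples = [stochastic_round(x, 8, rng) for _ in range(20000)]
mean_stochastic = sum(samples) / len(samples)
```

Round-to-nearest maps this small value to zero every time, whereas the stochastic version returns either 0 or one grid step and preserves the value on average, which is one plausible reason small gradient updates survive at low precision.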
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
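As a rough illustration of combining max and average pooling, as described in the pooling abstract above, one simple possibility is a convex mixture of the two. The abstract does not give the formula, so the function below, its name, and the mixing weight `alpha` are assumptions; `alpha` is fixed here, whereas a learned variant would train it along with the other parameters.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 pure average pooling.
    (Hypothetical sketch; alpha is a fixed parameter here.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


region = [1.0, 2.0, 3.0, 6.0]
pooled_max = mixed_pool(region, 1.0)
pooled_avg = mixed_pool(region, 0.0)
pooled_mix = mixed_pool(region, 0.5)
```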
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
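The sampling step of the stochastic pooling abstract above can be sketched as follows: one activation is drawn from each pooling region with probability proportional to its value. This sketch assumes non-negative activations (for example, after a rectifier) and returns zero for an all-zero region; the function name and that fallback are assumptions, and test-time behaviour is not covered.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(region),
    i.e. a multinomial distribution given by the activities themselves.
    Assumes all activations are non-negative. (Hypothetical helper.)
    """
    total = sum(region)
    if total == 0:
        return 0.0  # assumed fallback when no unit in the region is active
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)
region = [0.0, 1.0, 3.0]
draws = [stochastic_pool(region, rng) for _ in range(10000)]
fraction_of_threes = draws.count(3.0) / len(draws)
```

In this example the value 3.0 should be picked about three times out of four, and the inactive unit is never selected.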
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
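The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is only an illustration under simplifying assumptions: it applies the norm directly to a list of inputs and omits the learned projections and any learned order $p$ that the abstract mentions. The function name lp_unit is hypothetical and not taken from the paper.

```python
import math

def lp_unit(xs, p):
    """Normalized L_p norm of a non-empty list: (mean(|x|**p)) ** (1/p).

    Simplified illustration: no learned projections, fixed order p >= 1.
    """
    mags = [abs(x) for x in xs]
    m = max(mags)
    # All-zero input, or p = infinity (the max-pooling limit).
    if m == 0.0 or math.isinf(p):
        return float(m)
    # Factor out the largest magnitude so large p does not overflow.
    mean_pow = sum((a / m) ** p for a in mags) / len(mags)
    return m * mean_pow ** (1.0 / p)

x = [3.0, -4.0]
avg_pool = lp_unit(x, 1)          # mean of absolute values
rms_pool = lp_unit(x, 2)          # root mean square
near_max = lp_unit(x, 1000)       # approaches max(|x|) as p grows
max_pool = lp_unit(x, math.inf)   # exact max of absolute values
```

With p = 1 the unit averages absolute values, with p = 2 it computes the root mean square, and as p grows the result approaches the largest absolute input, which is the sense in which the order interpolates between the pooling operators.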
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is either to train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
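The layer construction in the multilayer bootstrap network abstract above can be illustrated with a minimal sketch. The abstract only states that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; the Euclidean distance, the one-hot encoding of the nearest center, and the concatenation of the codes are assumptions made here for illustration, and the names `fit_layer`, `encode`, `n_clusterings` and `k` are hypothetical.

```python
import random


def fit_layer(data, n_clusterings, k, seed=0):
    """Build one layer: n_clusterings independent clusterings, each with
    k centers sampled at random (without replacement) from the data."""
    rng = random.Random(seed)
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(layer, point):
    """Assumed encoding: for each clustering, a one-hot indicator of the
    nearest center; the indicators are concatenated into one code."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(centers[i], point))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Toy data: three loose groups of 2-D points (made up for this sketch).
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.5), (8.8, 0.1)]
layer = fit_layer(data, n_clusterings=3, k=2)
code = encode(layer, (5.1, 5.0))
```

Stacking would feed the codes of one layer to the next as its input data; that step is left out to keep the sketch short.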
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
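The input-to-hidden variant mentioned in the batch normalization abstract above can be sketched as follows. This is a minimal training-time forward step under stated assumptions, not the authors' implementation: it uses a tanh unit, omits the learned scale and shift as well as the running statistics used at test time, and the names `batch_norm`, `rnn_step`, `w_xh` and `w_hh` are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using mini-batch statistics.
    batch: list of examples, each a list of feature values."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def matvec(matrix, vector):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_batch, h_batch, w_xh, w_hh, bias):
    """One step: h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1} + b).
    Only the input-to-hidden term is normalized."""
    input_term = batch_norm([matvec(w_xh, x) for x in x_batch])
    recurrent_term = [matvec(w_hh, h) for h in h_batch]
    return [[math.tanh(i + r + b) for i, r, b in zip(inp, rec, bias)]
            for inp, rec in zip(input_term, recurrent_term)]


# Toy weights and inputs (made up): hidden size 3, input size 2, batch of 4.
w_xh = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
w_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
bias = [0.0, 0.0, 0.0]
x_batch = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.5]]
h_batch = [[0.0, 0.0, 0.0] for _ in x_batch]
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh, bias)
normalized = batch_norm([matvec(w_xh, x) for x in x_batch])
```

Leaving the hidden-to-hidden product unnormalized mirrors the variant the abstract reports as the one that sped up convergence of the training criterion.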
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
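The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter and the omission of saturation to a fixed word length are assumptions made purely for illustration. The idea shown is that a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Hypothetical helper for illustration only. The value is rounded up
    with probability equal to its fractional distance from the lower
    grid point, so the expected result equals x. Saturation to a
    fixed word length is deliberately left out.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale
```

Averaging many calls such as `stochastic_round(0.3, frac_bits=2)` should give a value close to 0.3, whereas round-to-nearest on the same grid would always return 0.25.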
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
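The claim in the abstract above that depth lets a piecewise linear network re-use computation exponentially often can be illustrated with a standard toy construction. It is not taken from the paper: the names `tent`, `compose` and `count_linear_regions` are hypothetical. Each layer is a two-unit ReLU "tent" map that folds the interval [0, 1] onto itself, so composing it `depth` times doubles the number of linear pieces at every layer.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def tent(x):
    # One layer built from two ReLU units. On [0, 1] it equals 2x for
    # x <= 0.5 and 2 - 2x for x >= 0.5, folding the interval at 0.5.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def compose(x, depth):
    # Apply the same tent layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_regions(depth, grid_bits=12):
    """Count linear pieces of the depth-fold composition on [0, 1].

    Uses a dyadic grid so all arithmetic is exact in floating point;
    assumes depth <= grid_bits so that every breakpoint is a grid point.
    """
    n = 1 << grid_bits
    ys = [compose(i / n, depth) for i in range(n + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(n)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1
```

With two units per layer, the composition has 2**depth linear pieces, while the number of units grows only linearly with depth.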
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
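The stochastic pooling procedure described in the abstract above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `stochastic_pool` and `stochastic_pool2d` are hypothetical, the pooling regions are assumed to be non-overlapping squares, and activations are assumed to be non-negative (e.g., the output of rectifier units) so that they can be normalized into a multinomial distribution.

```python
import numpy as np

def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value.
    Assumes all activations are non-negative (e.g., post-ReLU).
    """
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total <= 0:
        # All-zero region: there is nothing to sample, so return 0.
        return 0.0
    probs = a / total                  # multinomial given by the activities
    idx = rng.choice(a.size, p=probs)  # pick one location at random
    return float(a[idx])

def stochastic_pool2d(x, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    h, w = x.shape
    out = np.empty((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            block = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = stochastic_pool(block, rng)
    return out

rng = np.random.default_rng(0)
x = np.array([[0, 0, 1, 2],
              [0, 3, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 5, 0]], dtype=float)
out = stochastic_pool2d(x, 2, rng)
print(out)
```

A region with a single non-zero activation always returns that activation, while a region with several non-zero activations returns one of them at random, with larger activations more likely to be chosen.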
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
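The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The snippet below is a rough sketch under stated assumptions, not the paper's unit: the name `lp_pool` is hypothetical, the inputs are assumed to be the already-computed projections feeding one unit, and the order `p` is fixed here rather than learned.

```python
import numpy as np

def lp_pool(z, p):
    """Normalized L_p norm of the inputs: (mean(|z_i|^p))^(1/p).

    p = 1 gives the average of absolute values, p = 2 gives the
    root-mean-square, and large p approaches the maximum absolute value.
    """
    z = np.abs(np.asarray(z, dtype=float))
    return float(np.mean(z ** p) ** (1.0 / p))

z = [1.0, 2.0, 4.0]
avg_like = lp_pool(z, 1)    # average of |z_i|
rms_like = lp_pool(z, 2)    # root-mean-square
max_like = lp_pool(z, 50)   # close to max(|z_i|) for large p
print(avg_like, rms_like, max_like)
```

Varying `p` between these values interpolates between the conventional pooling operators, which is why learning a separate order for each unit is a meaningful degree of freedom.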
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
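Stochastic rounding to a fixed-point grid, as referred to above, can be sketched in a few lines. The sketch assumes a signed fixed-point format with frac_bits fractional bits inside a word_bits-wide word and saturation on overflow; these parameters and the function name stochastic_round_fixed_point are hypothetical, and the abstract does not state how the 16 bits are split.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Round x to a signed fixed-point grid with step 2**-frac_bits.

    A value is rounded up with probability equal to its distance from
    the grid point below it (in units of the step), so the rounding is
    unbiased: the expected result equals x when no saturation occurs.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the representable signed range (an assumption).
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale

x = np.full(100_000, 0.3)
y = stochastic_round_fixed_point(x, frac_bits=2, rng=np.random.default_rng(0))
# With frac_bits=2 the grid step is 0.25, so each entry of y is 0.25 or 0.5,
# and the average of y stays close to 0.3.
```

Round-to-nearest would map every 0.3 to 0.25 here, losing the small increments entirely; keeping the rounding unbiased is one intuition for why the rounding scheme matters during low-precision training.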
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
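The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. The abstract does not spell out its two combination strategies, so the convex combination below, the function name mixed_pool, and the mixing proportion alpha are assumptions made for illustration only; in a trained network alpha would be a learned parameter rather than a fixed argument.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha is a hypothetical mixing proportion in [0, 1]:
    alpha = 1 gives pure max pooling, alpha = 0 pure average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    # Convex combination of the two conventional pooling outputs.
    return alpha * max_value + (1.0 - alpha) * avg_value
```

With alpha fixed at 0 or 1 the function reduces to the conventional operators, which is why a learned alpha can only match or extend them on the training objective.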
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
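The stochastic pooling procedure described in the abstract above, picking one activation per pooling region according to a multinomial distribution given by the activities, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name stochastic_pool, the requirement that activations be non-negative (as after a rectifier), and the choice to return zero for an all-zero region are assumptions.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. p_i = a_i / sum(a). Assumes non-negative activations.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    # random.choices normalises the weights, giving the multinomial draw.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded random.Random instance as rng makes the sampling reproducible. The abstract only describes the training-time sampling step, so that is all the sketch covers.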
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
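The phrase "uses mini-batch statistics to standardize features" in the abstract above can be illustrated with a minimal sketch. It assumes a mini-batch stored as a list of equal-length feature rows; the function name batch_norm and the eps value are hypothetical choices for illustration, and the learned gain and bias that batch normalization normally adds are left out. It is not the implementation used in the paper.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) over a mini-batch (rows = examples).

    Illustrative only: the learned scale and shift parameters are omitted.
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-feature mean and (biased) variance computed over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    # Subtract the mean and divide by the standard deviation, feature by feature.
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


if __name__ == "__main__":
    activations = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
    for row in batch_norm(activations):
        print(row)
```

In the input-to-hidden variant described above, a function like this would be applied to the input projection before it is added to the recurrent term; that wiring is not shown here.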
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism and low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
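The stochastic rounding mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name stochastic_round, the frac_bits parameter, and the choice of a grid with step 2**-frac_bits are all hypothetical, and saturation to a fixed word length is not modeled. A value is rounded up with probability equal to its fractional distance from the lower grid point, which makes the rounded value unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    x is rounded up with probability equal to its distance (in grid units)
    from the grid point below it, and rounded down otherwise.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so a long run of small updates could be lost entirely; the randomized rule keeps them on average.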
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
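The "simple dropout based GD method for convex ERMs" mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not spell out the algorithm, so everything here is an assumption: the GLM is taken to be logistic regression, dropout is applied to input features with inverted scaling, and the names `dropout_gd_logistic`, `keep_prob`, `lr`, and `steps` are hypothetical.

```python
import numpy as np

def dropout_gd_logistic(X, y, keep_prob=0.8, lr=0.1, steps=500, seed=0):
    """Gradient descent for logistic regression with feature dropout.

    Illustrative only: feature-level dropout with inverted scaling is an
    assumption, not necessarily the scheme analysed in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Drop each feature independently; rescale so E[X_drop] == X.
        mask = rng.random((n, d)) < keep_prob
        X_drop = X * mask / keep_prob
        # Logistic (convex) loss gradient on the perturbed inputs.
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))
        grad = X_drop.T @ (p - y) / n
        w -= lr * grad
    return w

# Usage on synthetic, linearly separable data (made up for illustration).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
y = (X @ w_true > 0).astype(float)
w_hat = dropout_gd_logistic(X, y)
train_accuracy = np.mean((X @ w_hat > 0) == (y == 1.0))
```

The random mask injects noise into every gradient step, which is the mechanism the abstract credits with a regularizing, stabilizing effect on convex problems.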
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
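As a rough illustration of the variant the abstract above reports as helpful, the sketch below normalizes only the input-to-hidden term of a vanilla tanh RNN step and leaves the hidden-to-hidden term untouched. It is a sketch under stated assumptions: statistics are computed per time step over the mini-batch, training-mode only, and the names `batch_norm`, `rnn_step`, `W_x`, `W_h`, `gamma`, and `beta` are hypothetical rather than taken from the paper.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics (training mode)."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One tanh RNN step with batch norm on the input-to-hidden term only."""
    input_term = batch_norm(x_t @ W_x, gamma, beta)  # normalized
    hidden_term = h_prev @ W_h                       # deliberately not normalized
    return np.tanh(input_term + hidden_term + b)

# Usage: run a short random sequence through the cell.
rng = np.random.default_rng(0)
batch, n_in, n_hid, n_steps = 32, 5, 8, 4
W_x = rng.normal(scale=0.5, size=(n_in, n_hid))
W_h = rng.normal(scale=0.5, size=(n_hid, n_hid))
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for _ in range(n_steps):
    x_t = rng.normal(size=(batch, n_in))
    h = rnn_step(x_t, h, W_x, W_h, b, gamma, beta)
```

At inference time a real implementation would replace the mini-batch statistics with running averages; that bookkeeping is omitted here.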
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
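The stochastic rounding idea mentioned in the preceding abstract can be illustrated with a minimal sketch. The function names, the default of 8 fractional bits, and the omission of saturation to a 16-bit range are assumptions made for illustration and are not taken from the abstract. The sketch rounds a value to a fixed-point grid, rounding up with probability equal to its fractional distance from the lower grid point, so that the rounded value is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so E[result] == x.
    Saturation to a fixed word length is deliberately left out.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # scaled - lower lies in [0, 1) and is the probability of rounding up.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up to the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale
```

With 8 fractional bits the grid spacing is 1/256, so an update of 0.001 is always lost under round-to-nearest, while stochastic rounding keeps it on average. This is one plausible reading of why the rounding scheme matters for small gradient updates.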
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
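As a rough illustration of the first direction in the preceding abstract, combining max and average pooling, here is a minimal 1-D sketch. The convex-combination form with a single mixing weight `alpha`, the function name, and the non-overlapping windows are assumptions made for illustration. The abstract does not specify the exact formula, and in a trained network `alpha` would be a learned parameter rather than an argument.

```python
def mixed_pool_1d(values, window, alpha):
    """Blend max and average pooling over non-overlapping windows.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 pure average pooling.
    This is a hypothetical helper; alpha stands in for a learned weight.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pooled = []
    # Trailing elements that do not fill a whole window are ignored.
    for start in range(0, len(values) - window + 1, window):
        region = values[start:start + window]
        region_max = max(region)
        region_avg = sum(region) / len(region)
        pooled.append(alpha * region_max + (1.0 - alpha) * region_avg)
    return pooled
```

Because the output is differentiable in `alpha`, a framework with automatic differentiation could update the mixing weight by gradient descent together with the other network parameters.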
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
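The stochastic pooling procedure described in the preceding abstract can be sketched for a single pooling region as follows. The function name, the requirement of non-negative activations (for example after a rectifier), and the handling of an all-zero region are assumptions made for illustration. The sketch draws one activation with probability proportional to its value, which is the multinomial distribution given by the activities in the region.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation a_i is chosen with probability a_i / sum(a).
    Assumes non-negative activations; an all-zero region returns 0.0.
    Hypothetical helper, not the authors' implementation.
    """
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    if sum(activations) == 0:
        return 0.0
    # random.choices accepts unnormalized weights and normalizes internally.
    return rng.choices(activations, weights=activations, k=1)[0]
```

Large activations are still the most likely to be selected, as in max pooling, but smaller ones are chosen occasionally, which is where the regularizing noise comes from.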
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, estimate its gradient, and determine the actor's policy, respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
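The "simple dropout based GD method for convex ERMs" mentioned above can be illustrated with a small sketch. It assumes logistic regression as the convex problem and inverted dropout applied to the input features; the function name `dropout_gd`, the hyperparameters, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def dropout_gd(xs, ys, keep_prob, lr, epochs, rng):
    """Full-batch gradient descent for logistic regression with feature dropout."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            # Inverted dropout: zero each feature with probability 1 - keep_prob
            # and rescale the surviving features by 1 / keep_prob.
            x_drop = [v / keep_prob if rng.random() < keep_prob else 0.0 for v in x]
            err = sigmoid(sum(wi * vi for wi, vi in zip(w, x_drop))) - y
            for j, v in enumerate(x_drop):
                grad[j] += err * v / len(xs)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy data (an assumption for illustration): the label depends only on feature 0.
rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
ys = [1.0 if x[0] > 0 else 0.0 for x in xs]
w = dropout_gd(xs, ys, keep_prob=0.8, lr=0.5, epochs=100, rng=rng)
```

Because each update sees a randomly masked copy of every training point, no single point's exact feature values dominate the fit, which is the intuition behind the stability claim; the sketch does not reproduce the paper's analysis or its privacy mechanism.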
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
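One layer of the multilayer bootstrap network described above, a group of independent k-centers clusterings whose centers are randomly sampled data points, can be sketched as follows. This is a minimal illustration that assumes squared Euclidean distance and a one-hot encoding of the nearest center; the names `fit_layer` and `transform` and the parameters `n_clusterings` and `k` are hypothetical, not the authors' code.

```python
import random


def fit_layer(data, n_clusterings, k, rng):
    """Build one layer: each clustering is k centers sampled from the data."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transform(layer, point):
    """Encode a point as concatenated one-hot nearest-center indicators."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))
        code.extend(1 if j == nearest else 0 for j in range(len(centers)))
    return code


# Toy data (an assumption for illustration): 100 Gaussian points in 5 dimensions.
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
layer = fit_layer(data, n_clusterings=4, k=3, rng=rng)
codes = [transform(layer, x) for x in data]
```

Each code has `n_clusterings * k` entries with exactly one active entry per clustering. A second layer could be built by calling `fit_layer` on `codes`, although the distance measure and layer sizes used for stacking here are assumptions of this sketch.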
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
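The input-to-hidden variant discussed above can be sketched in a few lines. The example assumes a vanilla tanh RNN, mini-batch statistics computed separately at each time step, and no learned scale or shift; the helper names and random weights are hypothetical and this is not the authors' implementation. Only the input term is standardized, so the update is h_t = tanh(W_h h_{t-1} + BN(W_x x_t)).

```python
import math
import random


def matvec_batch(batch, weights):
    """Apply a weight matrix (rows of length in_dim) to every row of a batch."""
    return [[sum(w * v for w, v in zip(w_row, row)) for w_row in weights]
            for row in batch]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of the mini-batch."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]


def rnn_step(x_t, h_prev, w_x, w_h):
    """One recurrent step; only the input-to-hidden term is batch-normalized."""
    input_term = batch_norm(matvec_batch(x_t, w_x))
    hidden_term = matvec_batch(h_prev, w_h)
    return [[math.tanh(a + b) for a, b in zip(row_in, row_h)]
            for row_in, row_h in zip(input_term, hidden_term)]


# Toy dimensions and random weights (assumptions for illustration).
rng = random.Random(0)
batch_size, in_dim, hid_dim, steps = 8, 3, 4, 5
w_x = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hid_dim)]
w_h = [[rng.gauss(0.0, 0.1) for _ in range(hid_dim)] for _ in range(hid_dim)]
h = [[0.0] * hid_dim for _ in range(batch_size)]
for _ in range(steps):
    x_t = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(batch_size)]
    h = rnn_step(x_t, h, w_x, w_h)
```

Leaving the hidden-to-hidden term unnormalized mirrors the abstract's finding that normalizing that transition did not help training.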
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
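As a rough illustration of the geometric sampling idea in the preceding abstract, the sketch below grows the amount of data used per iteration by a constant factor until the full set is reached. The function name, the starting size and the growth factor are hypothetical choices made for this example; the abstract does not specify them.

```python
import math


def geometric_sample_sizes(total, initial, growth, max_steps=100):
    """Return a geometrically increasing schedule of sample sizes.

    total   -- number of training examples available
    initial -- sample size used at the first iteration (assumed value)
    growth  -- multiplicative factor applied each iteration (assumed value)
    """
    sizes = []
    size = float(initial)
    for _ in range(max_steps):
        n = min(total, math.ceil(size))  # never exceed the full data set
        sizes.append(n)
        if n >= total:
            break  # the whole data set is in use, stop growing
        size *= growth
    return sizes
```

For instance, with 1000 examples, an initial size of 100 and a growth factor of 2, the schedule doubles each iteration and is capped at the full data set.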
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
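To make the role of the rounding scheme in the preceding abstract concrete, here is a minimal sketch of stochastic rounding onto a fixed-point grid, next to ordinary round-to-nearest. The function names and the default of 8 fractional bits are assumptions chosen for illustration, and saturation to a 16-bit word is left out for brevity.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a grid with spacing 2**-frac_bits, stochastically.

    x is rounded up with probability equal to its fractional distance
    from the lower grid point, so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale
```

The difference shows up when many updates smaller than half a grid step are accumulated: round-to-nearest discards every one of them and the weight never moves, whereas stochastic rounding keeps the accumulated sum correct on average.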
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
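One plausible reading of "combining of max and average pooling" in the preceding abstract is a convex combination of the two, with the mixing proportion either held as a single learned scalar or computed from the pooling region itself. The sketch below assumes that reading; the function names and the sigmoid gate are illustrative choices and are not taken from the abstract.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one region.

    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling.
    In a trained network a would be a learned parameter (assumption).
    """
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def gated_pool(region, gate_weights):
    """Like mixed_pool, but the mixing proportion depends on the input.

    The proportion is a sigmoid of the inner product between a
    (hypothetical) learned gate vector and the region's activations.
    """
    z = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, a)
```

With the region [1, 2, 3, 6], for example, the max is 6 and the average is 3, so an equal blend gives 4.5; an all-zero gate vector produces the same equal blend.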
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
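The stochastic pooling procedure described in the abstract above can be sketched in a few lines of NumPy. This is a sketch under assumptions rather than the authors' code: the function name stochastic_pool is hypothetical, activations are assumed non-negative (for example, after a rectifier) so that they can be normalized into a multinomial distribution, regions are non-overlapping k x k blocks, and an all-zero region is mapped to 0.

```python
import numpy as np

def stochastic_pool(x, k, rng):
    """Pick one activation per k x k region, with probability proportional to its value.

    Assumes x is a 2-D array of non-negative activations whose sides are
    divisible by k. rng is a numpy.random.Generator.
    """
    h, w = x.shape
    regions = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3).reshape(-1, k * k)
    out = np.empty(regions.shape[0])
    for n, r in enumerate(regions):
        total = r.sum()
        if total <= 0:
            out[n] = 0.0  # assumption: an inactive region pools to zero
        else:
            # Multinomial over the region, given by the normalized activities.
            out[n] = rng.choice(r, p=r / total)
    return out.reshape(h // k, w // k)

x = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [1.0, 1.0, 2.0, 2.0],
              [1.0, 1.0, 2.0, 2.0]])
print(stochastic_pool(x, 2, np.random.default_rng(0)))
```

The sketch covers only the training-time sampling step named in the abstract; it has no tunable hyper-parameters beyond the region size.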
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
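The claim in the abstract above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically with a short sketch. The exact parametrization is an assumption here, since the abstract does not give one: the hypothetical function lp_unit applies linear projections W x + b and returns (mean(|z_i|^p))^(1/p), with p supplied by hand instead of being learned per unit.

```python
import numpy as np

def lp_unit(x, W, b, p):
    """Normalized L_p norm over several linear projections of x.

    Assumed form: z = W @ x + b, output = (mean(|z_i| ** p)) ** (1 / p).
    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    z = W @ x + b
    return np.mean(np.abs(z) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
W = np.eye(2)       # identity projections, chosen only to keep the example readable
b = np.zeros(2)

for p in (1.0, 2.0, 200.0):
    print(p, lp_unit(x, W, b, p))
```

With these inputs the three orders give roughly the mean absolute value, the root-mean-square, and a value close to the maximum, which is the sense in which the order p interpolates between pooling operators.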
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
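The layer structure described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the output encoding, or how k changes across layers, so the Euclidean nearest-center rule, the one-hot encoding, and the names `bootstrap_layer` and `squared_distance` are assumptions made only for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bootstrap_layer(data, n_clusterings, k, rng):
    """Encode each point by its nearest center in several random clusterings.

    Assumption: each clustering contributes a one-hot block of length k,
    and the blocks are concatenated into one sparse code per point.
    """
    # Each clustering's centers are k data points sampled without replacement.
    all_centers = [rng.sample(data, k) for _ in range(n_clusterings)]
    encoded = []
    for point in data:
        code = []
        for centers in all_centers:
            nearest = min(range(k),
                          key=lambda i: squared_distance(point, centers[i]))
            code.extend(1 if i == nearest else 0 for i in range(k))
        encoded.append(code)
    return encoded

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
codes = bootstrap_layer(data, n_clusterings=4, k=5, rng=rng)
```

Stacking would feed `codes` into another call of `bootstrap_layer`; how the described method chooses the per-layer parameters is not stated in the abstract and is left out here.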
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
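As a rough illustration of the input-to-hidden variant mentioned in the batch normalization abstract above, the sketch below standardizes only the input projection of a plain tanh RNN with mini-batch statistics and leaves the recurrent term untouched. It is not the authors' implementation: the learnable scale and shift parameters are omitted, biases are dropped, and the helper names (`batch_norm`, `matmul`, `rnn_step`) and all sizes are hypothetical.

```python
import math
import random

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) using mini-batch mean and variance."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(n_features)] for row in batch]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def rnn_step(x_t, h_prev, w_xh, w_hh):
    """One tanh RNN step; only the input-to-hidden term is batch-normalized."""
    in_term = batch_norm(matmul(x_t, w_xh))
    rec_term = matmul(h_prev, w_hh)
    return [[math.tanh(i + r) for i, r in zip(irow, rrow)]
            for irow, rrow in zip(in_term, rec_term)]

random.seed(0)
batch_size, n_in, n_hid = 16, 4, 3
w_xh = [[random.gauss(0, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
w_hh = [[random.gauss(0, 0.5) for _ in range(n_hid)] for _ in range(n_hid)]
h = [[0.0] * n_hid for _ in range(batch_size)]
for _ in range(5):
    # Inputs with a non-zero mean, so the normalization has something to remove.
    x_t = [[random.gauss(3.0, 2.0) for _ in range(n_in)]
           for _ in range(batch_size)]
    h = rnn_step(x_t, h, w_xh, w_hh)
```

Normalizing the hidden-to-hidden term as well would mean wrapping `rec_term` in `batch_norm`, which is the variant the abstract reports as not helping training.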
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
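A dropout-based gradient method for a convex problem, as discussed above, might look like the following minimal sketch for least-squares linear regression. It is an illustration under stated assumptions, not the algorithm analyzed in the abstract: dropping input coordinates with inverted scaling, the squared loss, and the names `dropout_gd`, `mse`, `keep`, `lr` and `epochs` are all choices made here.

```python
import random


def dropout_gd(xs, ys, keep=0.8, lr=0.05, epochs=200, seed=0):
    """Stochastic gradient descent on squared loss with dropout on the inputs.

    Each coordinate of an example is kept with probability `keep` and
    rescaled by 1/keep (inverted dropout, an assumption of this sketch).
    """
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Randomly drop coordinates of this training example.
            xm = [xi / keep if rng.random() < keep else 0.0 for xi in x]
            # Prediction error on the masked input.
            err = sum(wi * xi for wi, xi in zip(w, xm)) - y
            # Gradient step for the loss 0.5 * err**2.
            w = [wi - lr * err * xi for wi, xi in zip(w, xm)]
    return w


def mse(w, xs, ys):
    """Mean squared error of the linear model w on the unmasked data."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
    return total / len(xs)
```

For uncorrelated inputs, the expected dropout objective in this setup equals the squared loss plus a data-dependent quadratic penalty on the weights, which is one way to see the regularizing effect the abstract refers to.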
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
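As a rough illustration of the variant described in the preceding abstract, the following is a minimal sketch of one tanh RNN step in which batch normalization is applied only to the input-to-hidden term. The function names (batch_norm, matvec_batch, rnn_step), the plain-list data layout, and the fixed gamma/beta defaults are assumptions made for this example, not details taken from the paper.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift."""
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma * (col[i] - mean) * inv_std + beta
    return out

def matvec_batch(batch, weights):
    """Multiply each row vector by `weights`, given as in_dim x out_dim."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in batch]

def rnn_step(x_batch, h_batch, w_in, w_hid):
    """One tanh RNN step; only the input-to-hidden term is normalized."""
    in_term = batch_norm(matvec_batch(x_batch, w_in))
    hid_term = matvec_batch(h_batch, w_hid)  # left unnormalized
    return [[math.tanh(a + b) for a, b in zip(r_in, r_hid)]
            for r_in, r_hid in zip(in_term, hid_term)]
```

The hidden-to-hidden term is deliberately left unnormalized here, mirroring the abstract's observation that normalizing that transition did not help training.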
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
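The geometric sampling idea in the preceding abstract can be pictured with a small schedule generator. This is only a sketch under assumed parameters: the function name geometric_sample_sizes, the starting size, and the growth factor are hypothetical and do not come from the paper, which does not state them here.

```python
import math

def geometric_sample_sizes(total, initial, growth, steps):
    """Return per-iteration sample sizes that grow geometrically.

    Each size is initial * growth**k, rounded up and capped at `total`
    (the full training set size).
    """
    sizes = []
    for k in range(steps):
        size = min(total, math.ceil(initial * growth ** k))
        sizes.append(size)
    return sizes

# Example with made-up numbers: start at 100 examples and double each step.
print(geometric_sample_sizes(1000, 100, 2.0, 5))  # [100, 200, 400, 800, 1000]
```

Early iterations then touch only a small fraction of the data, while later iterations use progressively more of it until the full set is reached.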
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
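Stochastic rounding, as mentioned in the preceding abstract, can be sketched in a few lines. The version below is an illustrative assumption rather than the paper's implementation: the function name stochastic_round_fixed, the default split of 8 fractional bits in a 16-bit signed word, and the saturation behaviour are choices made for this example.

```python
import math
import random

def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation for values inside the representable range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed `word_bits`-bit integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    clipped = max(min_int, min(max_int, lower))
    return clipped / scale
```

Unlike round-to-nearest, which always maps a small update below half a grid step to zero, this scheme preserves such updates on average, which is the usual intuition for why it helps low-precision training.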
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
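As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below blends the two with a mixing coefficient. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `mixed_pool`, `gated_pool`, `alpha` and `gate_weights` are hypothetical, the convex-combination form and the sigmoid gate are assumed readings of the "two strategies", and the parameters are fixed by hand here rather than learned.

```python
import math


def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. In a trained network alpha would be a learned parameter; here it
    is passed in by hand (assumption for illustration).
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


def gated_pool(region, gate_weights):
    """Blend max and average pooling with an input-dependent mixing weight.

    The mixing weight is a sigmoid of a linear score of the region, so it
    adapts to each region's contents. gate_weights is a hypothetical
    stand-in for learned parameters.
    """
    score = sum(w * x for w, x in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, alpha)


region = [1.0, 2.0, 3.0, 6.0]               # a flattened 2x2 pooling region
print(mixed_pool(region, 1.0))              # behaves like max pooling
print(mixed_pool(region, 0.0))              # behaves like average pooling
print(gated_pool(region, [0.0] * 4))        # zero gate score: equal blend
```

With all gate weights at zero the sigmoid evaluates to 0.5, so the gated variant reduces to an equal blend of the two pooling results.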
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
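The stochastic pooling procedure in the abstract above (pick one activation per pooling region, with probabilities given by the activities in that region) can be sketched in a few lines. This is a minimal sketch, assuming non-negative activations (for example after a rectifier) and non-overlapping square regions; the function names and the choice to return 0.0 for an all-zero region are illustrative assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool_region(activations, rng):
    """Sample one activation with probability proportional to its value.

    Assumes every activation is non-negative. An all-zero region has no
    valid distribution, so 0.0 is returned (assumption for illustration).
    """
    total = sum(activations)
    if total <= 0:
        return 0.0
    # random.Random.choices normalizes the weights, so this is a draw from
    # the multinomial with p_i = a_i / sum(a).
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_2d(feature_map, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool_region(region, rng))
        pooled.append(out_row)
    return pooled


rng = random.Random(0)  # seeded only to make the sketch repeatable
feature_map = [
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 0.0],
]
pooled = stochastic_pool_2d(feature_map, 2, rng)
print(pooled)
```

Larger activations are chosen more often, but smaller non-zero ones still have a chance of being picked, which is the source of the regularizing noise.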
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data.
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
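The input-to-hidden variant described in the preceding abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `batch_norm` and `rnn_step`, the tanh nonlinearity, the per-time-step mini-batch statistics, and the absence of running averages for test time are all assumptions made for the example.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    # Standardize each feature using statistics of the current mini-batch
    # (axis 0 is the batch axis), then apply a learned scale and shift.
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, gamma, beta):
    # Assumed form: only the input-to-hidden term x_t @ W_x is normalized;
    # the hidden-to-hidden term h_prev @ W_h is left untouched.
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h)

rng = np.random.default_rng(0)
batch, n_in, n_hid = 32, 5, 4
x_t = rng.normal(size=(batch, n_in))
h_prev = np.zeros((batch, n_hid))
W_x = rng.normal(size=(n_in, n_hid))
W_h = rng.normal(size=(n_hid, n_hid))
h_t = rnn_step(x_t, h_prev, W_x, W_h, np.ones(n_hid), np.zeros(n_hid))
```

With `gamma` set to ones and `beta` to zeros, the normalized input term has approximately zero mean and unit variance per feature across the mini-batch.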
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
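The stochastic rounding mentioned in the preceding abstract can be illustrated with a short sketch. It assumes the usual definition (round up with probability equal to the fractional remainder, so the result is unbiased in expectation) and a signed fixed-point format with saturation; the function name `stochastic_round` and the parameters `frac_bits` and `word_bits` are hypothetical and not taken from the abstract.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid using stochastic rounding."""
    rng = np.random.default_rng() if rng is None else rng
    eps = 2.0 ** -frac_bits                      # smallest representable step
    scaled = np.asarray(x, dtype=np.float64) / eps
    lower = np.floor(scaled)
    # Round up with probability equal to the distance from the lower grid
    # point, so that the expected value of the result equals x.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the range of a signed word_bits-wide integer.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q * eps

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), frac_bits=2, rng=rng)
```

With `frac_bits=2` the grid step is 0.25, so every sample is either 0.25 or 0.5, while the sample mean stays close to 0.3. Round-to-nearest would instead map every copy of 0.3 to 0.25.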
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
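The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a minimal sketch. The function name `quantize`, the choice of 8 fractional bits, and the saturation behavior are assumptions made for illustration; the abstract only states that a 16-bit fixed-point representation with stochastic rounding was used. The idea shown is that a value is rounded up with probability equal to its fractional remainder, so the rounded value is unbiased in expectation.

```python
import math
import random


def quantize(x, frac_bits=8, word_bits=16, rng=random):
    """Stochastically round x to signed fixed point.

    frac_bits and word_bits are illustrative assumptions, not values
    taken from the abstract.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # which makes the expected result equal to x (before saturation).
    if rng.random() < scaled - q:
        q += 1
    # Saturate to the representable signed range of the word.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, q))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [quantize(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each draw is 0.25 or 0.5; the average should be close to 0.3.
    print(sum(draws) / len(draws))
```

Round-to-nearest would map 0.3 to 0.25 every time with a step of 0.25, so small updates would be lost systematically; the stochastic variant preserves them on average.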
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
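As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below blends the two with a single mixing weight. The names `mixed_pool` and `pool2d`, the convex-combination form, and the fixed `alpha` argument are assumptions for illustration; the abstract describes the combination as learned and does not give its exact form, and no learning step is shown here.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one region.

    alpha = 1.0 gives max pooling, alpha = 0.0 gives average pooling.
    The convex-combination form is an illustrative assumption.
    """
    return alpha * max(region) + (1.0 - alpha) * (sum(region) / len(region))


def pool2d(image, size, alpha):
    """Apply mixed_pool over non-overlapping size x size windows
    of a 2-D list of numbers."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(mixed_pool(region, alpha))
        out.append(out_row)
    return out


if __name__ == "__main__":
    image = [[1, 2, 0, 0],
             [3, 4, 0, 8]]
    print(pool2d(image, size=2, alpha=0.5))
```

In a trainable version, `alpha` would be a parameter updated by gradient descent alongside the network weights rather than a constant passed by the caller.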
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
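The stochastic pooling procedure described in the abstract above (picking one activation per region according to a multinomial distribution given by the activities) can be sketched as follows. The function name `stochastic_pool`, the handling of an all-zero region, and the assumption that activations are non-negative (for example, after a rectifier) are illustrative choices not stated in the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region with probability
    proportional to its value.

    Assumes non-negative activations. Returning 0.0 for an all-zero
    region is an illustrative choice.
    """
    total = sum(region)
    if total == 0:
        return 0.0
    # random.choices normalizes the weights internally, so this draws
    # from the multinomial p_i = a_i / sum(a).
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [stochastic_pool([1.0, 3.0], rng) for _ in range(10000)]
    # The larger activation should be picked roughly 75% of the time.
    print(draws.count(3.0) / len(draws))
```

Unlike max pooling, non-maximal activations are still selected some of the time, which is the source of the regularizing noise during training.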
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
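As an illustration of the "simple dropout based GD method for convex ERMs" mentioned in the abstract above, here is a minimal sketch for logistic regression, one example of a convex GLM. It is not the authors' algorithm: the function names, the inverted-dropout rescaling, the learning rate, and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dropout_gd_logistic(xs, ys, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Full-batch gradient descent on the logistic loss with feature dropout.

    Assumption: each input feature is dropped independently with
    probability 1 - keep_prob, and survivors are rescaled by 1 / keep_prob
    (inverted dropout). The abstract does not specify these details.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            # Sample a fresh dropout mask for every example in every epoch.
            masked = [xi / keep_prob if rng.random() < keep_prob else 0.0
                      for xi in x]
            p = sigmoid(sum(wi * mi for wi, mi in zip(w, masked)))
            for j in range(dim):
                grad[j] += (p - y) * masked[j]
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def predict(w, x):
    """Predict with the full (undropped) input, as is usual at test time."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


# Hypothetical toy data: two linearly separable classes.
xs = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
ys = [0, 0, 1, 1]
w = dropout_gd_logistic(xs, ys)
```

Dropout is applied only during training; `predict` uses the unmasked input.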
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
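The layer structure described in the abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, and the one-hot encoding of the nearest center and the squared Euclidean distance are assumptions, since the abstract does not say how a layer's output is represented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_bootstrap_layer(data, num_clusterings, k, seed=0):
    """Build one layer: each clustering is k centers sampled from the data.

    The clusterings are drawn independently of one another, so the layer
    is a list of num_clusterings center sets.
    """
    rng = random.Random(seed)
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(layer, x):
    """Encode x as concatenated one-hot indicators of its nearest centers.

    Assumption: one-hot output per clustering; the abstract leaves the
    encoding unspecified.
    """
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(x, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Hypothetical toy data: six points in the plane.
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]]
layer = fit_bootstrap_layer(data, num_clusterings=3, k=2)
code = encode(layer, data[0])
```

Stacking several such layers, each fitted on the codes produced by the previous one, would give a multilayer version under the same assumptions.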
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
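To make the distinction in the abstract above concrete, the sketch below normalizes only the input-to-hidden term of a plain tanh RNN step, h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1}), and leaves the hidden-to-hidden term untouched. It is an assumption-laden illustration rather than the paper's model: the names are hypothetical, the learned scale and shift of batch normalization are omitted, and statistics come from the current mini-batch only.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature with mini-batch mean and variance.

    Assumption: no learned scale or shift parameters, for brevity.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)]
            for row in batch]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One recurrent step over a mini-batch.

    Only the input-to-hidden pre-activation is batch-normalized; the
    hidden-to-hidden term is added without normalization.
    """
    input_term = batch_norm([matvec(w_xh, x) for x in x_batch])
    hidden_term = [matvec(w_hh, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(inp, hid)]
            for inp, hid in zip(input_term, hidden_term)]


# Hypothetical example: batch of 2, input size 1, hidden size 2.
x_batch = [[1.0], [3.0]]
h_prev = [[0.0, 0.0], [0.0, 0.0]]
w_xh = [[1.0], [10.0]]
w_hh = [[0.0, 0.0], [0.0, 0.0]]
h_next = rnn_step(x_batch, h_prev, w_xh, w_hh)
```

Note that the two hidden units receive inputs on very different scales (weights 1 and 10) yet end up with the same normalized pre-activations, which is the standardizing effect the abstract refers to.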
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
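As a rough illustration of the stochastic rounding mentioned in the preceding abstract, the sketch below quantizes values to a signed fixed-point grid and rounds up with probability equal to the fractional remainder, so the rounding is unbiased in expectation. The function name, the `frac_bits`/`word_bits` parameters and the saturation behaviour are assumptions made for this example; the abstract itself does not specify them.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point using stochastic rounding.

    frac_bits and word_bits are illustrative names (assumptions), not
    taken from the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # which makes the expected quantized value equal to the input.
    round_up = rng.random(scaled.shape) < (scaled - floor)
    q = floor + round_up
    # Saturate to the representable signed range of the word.
    q_min, q_max = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(q, q_min, q_max) / scale
```

With `frac_bits=8`, the value 0.3 falls between the grid points 76/256 and 77/256; averaging many stochastically rounded copies recovers a value close to 0.3, whereas round-to-nearest would always return 77/256.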
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
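As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below blends the two over a single pooling region. The function name `mixed_pool` and the fixed mixing weight `alpha` are assumptions made for this example; the abstract describes learning the combination, which is not shown here.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 pure average pooling.
    The mixing weight is passed in as a fixed number here (an assumption of
    this sketch); in a learned variant it would be a trainable parameter.
    """
    if not region:
        raise ValueError("region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


region = [1.0, 3.0, 2.0, 6.0]
print(mixed_pool(region, 1.0))  # behaves like max pooling
print(mixed_pool(region, 0.0))  # behaves like average pooling
print(mixed_pool(region, 0.5))  # an even blend of the two
```

Keeping `alpha` in [0, 1] makes the output a convex combination, so it always lies between the average and the maximum of the region.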
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
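The stochastic pooling procedure described in the abstract above (pick one activation per pooling region, with probability given by the activities in that region) might be sketched as follows. This is a minimal pure-Python illustration for a single region; the name `stochastic_pool`, the handling of an all-zero region, and the requirement of non-negative activations are assumptions of the sketch, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution formed by normalizing the activities.
    Activations are assumed non-negative (e.g. outputs of rectifier units).
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption of this sketch: an all-zero region pools to zero.
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
region = [0.0, 1.0, 3.0, 6.0]
samples = [stochastic_pool(region, rng) for _ in range(5)]
print(samples)  # each entry is one of the activations in the region
```

Larger activations are picked more often, but smaller non-zero ones still get selected occasionally, which is the source of the regularizing noise.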
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning and what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
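To make the pooling interpretation of the $L_p$ unit in the abstract above concrete, the sketch below computes a normalized $L_p$ norm over a small group of inputs and shows how different orders $p$ relate to average, root-mean-square and max pooling. It is a simplified illustration: the name `lp_unit` is hypothetical, the inputs are taken as given rather than produced by learned projections, and $p$ is fixed rather than learned.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i| ** p)) ** (1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not values:
        raise ValueError("values must not be empty")
    if p < 1:
        raise ValueError("this sketch only handles p >= 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


inputs = [1.0, -2.0, 3.0]
print(lp_unit(inputs, 1))   # mean of absolute values
print(lp_unit(inputs, 2))   # root-mean-square
print(lp_unit(inputs, 50))  # close to max(|x_i|)
```

Because the norm is normalized by the group size, the output is non-decreasing in $p$ and never exceeds the largest absolute input.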
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
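The abstract above describes batch normalization as using mini-batch statistics to standardize features, applied to the input-to-hidden transition of an RNN. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The function names, the scalar `gamma` and `beta` parameters, the `eps` constant and the tanh step are all assumptions made for illustration.

```python
import math


def batch_standardize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature column with mini-batch mean and variance.

    `batch` is a list of rows (one row per example). `gamma` and `beta`
    are a scalar gain and shift, assumed here for simplicity.
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dims)
        ]
        for row in batch
    ]


def rnn_step(input_proj, hidden_proj):
    """One hypothetical step: h_t = tanh(BN(W_x x_t) + W_h h_{t-1}).

    Both arguments are already-projected mini-batches of equal shape;
    only the input-to-hidden term is normalized.
    """
    normalized = batch_standardize(input_proj)
    return [
        [math.tanh(a + b) for a, b in zip(x_row, h_row)]
        for x_row, h_row in zip(normalized, hidden_proj)
    ]
```

Normalizing only the input projection leaves the recurrent term untouched, which matches the variant the abstract reports as speeding up convergence of the training criterion.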
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
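The sampling idea in the abstract above, geometrically increasing the amount of data used per iteration, can be sketched as a simple schedule. This is an illustration only: the function name, the `start_size` and `growth` parameters, and the choice to cap at the full dataset are assumptions, since the abstract does not give the actual schedule.

```python
import math


def geometric_sample_sizes(total, start_size, growth=2.0):
    """Return sample sizes that grow by a constant factor up to `total`.

    `start_size` and `growth` are hypothetical knobs; the final entry
    is always the full dataset size.
    """
    if total < 1 or start_size < 1:
        raise ValueError("total and start_size must be positive")
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1")
    sizes = []
    size = start_size
    while size < total:
        sizes.append(size)
        # Multiply by the growth factor, rounding up to a whole example.
        size = math.ceil(size * growth)
    sizes.append(total)
    return sizes
```

Early iterations then run on a small subset, and later ones approach the full dataset as the sizes saturate at `total`.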
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
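Stochastic rounding, named in the abstract above, rounds a value up or down to a neighbouring fixed-point level with probability proportional to its distance from the other level, so the rounded value is unbiased in expectation. The sketch below illustrates this in plain Python. The function name, the `frac_bits`/`word_bits` split and the saturation behaviour are assumptions for illustration, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, otherwise down. Results saturate at the
    limits of a `word_bits`-wide signed integer (an assumed choice).
    """
    step = 2.0 ** -frac_bits
    scaled = x / step
    level = math.floor(scaled)
    # Round up with probability (scaled - floor(scaled)).
    if rng.random() < scaled - level:
        level += 1
    max_level = 2 ** (word_bits - 1) - 1
    min_level = -(2 ** (word_bits - 1))
    level = max(min_level, min(max_level, level))
    return level * step
```

Averaged over many draws, `stochastic_round(0.3, frac_bits=4)` stays close to 0.3 even though each draw is either 0.25 or 0.3125, whereas round-to-nearest would always return 0.3125.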
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
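The abstract above mentions learning a pooling function by combining max and average pooling, but does not spell out its two strategies. The following is a minimal sketch under stated assumptions: `mixed_pool` blends the two with a fixed mixing proportion `a`, and `gated_pool` makes that proportion depend on the input through a sigmoid gate. The function names, the `weights` parameter, and the choice of a sigmoid are illustrative assumptions, not the authors' definitions. In a trained network `a` and `weights` would be learned by gradient descent; here they are passed in by hand.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a mixing proportion in [0, 1] (hypothetical name):
    a = 1 reduces to max pooling, a = 0 to average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


def gated_pool(region, weights):
    """Input-dependent blend: the mixing proportion is sigmoid(weights . region).

    `weights` is a hypothetical gating vector with one entry per input.
    """
    score = sum(w * x for w, x in zip(weights, region))
    a = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, a)
```

With a zero gating vector the gate outputs 0.5, so `gated_pool` returns the midpoint between the maximum and the average of the region.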
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
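The stochastic pooling procedure described in the abstract above can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes the activations in a region are non-negative (for example, outputs of rectifier units), treats their normalized values as the multinomial probabilities, and falls back to 0.0 when every activation is zero. The function name and the fallback are assumptions.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. according to a multinomial distribution over the region.
    Assumes all activations are non-negative; `rng` is a random.Random.
    """
    if any(x < 0 for x in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # Assumed convention: an all-zero region pools to zero.
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


# Usage: pool a small region with a seeded generator for reproducibility.
rng = random.Random(0)
pooled = stochastic_pool([0.0, 1.5, 0.5, 2.0], rng)
```

Because sampling is proportional to activation, larger activations are chosen more often but smaller non-zero ones still get picked occasionally, which is where the regularizing noise comes from.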
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
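As a rough illustration of the normalized $L_p$ norm described in the abstract above, the sketch below computes $(\frac{1}{N}\sum_i |z_i - c_i|^p)^{1/p}$ over a group of projections $z_i$. The function name `lp_unit`, the optional centers $c_i$, and the sample inputs are assumptions made for this example; it only shows how the order $p$ interpolates between mean-absolute, root-mean-square and max pooling, and is not the authors' implementation.

```python
import math


def lp_unit(projections, p, centers=None):
    """Normalized L_p norm of a group of projections.

    Computes (mean_i |z_i - c_i|**p) ** (1/p); p = inf gives the maximum.
    The centers c_i default to zero (an assumption made for this sketch).
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if centers is None:
        centers = [0.0] * len(projections)
    magnitudes = [abs(z - c) for z, c in zip(projections, centers)]
    if math.isinf(p):
        # Limit of the normalized norm as p grows: max pooling over magnitudes.
        return max(magnitudes)
    return (sum(m ** p for m in magnitudes) / len(magnitudes)) ** (1.0 / p)


z = [3.0, -4.0]
print(lp_unit(z, 1))             # mean absolute value
print(lp_unit(z, 2))             # root-mean-square
print(lp_unit(z, float("inf")))  # max of absolute values
```

Learning a separate order $p$ per unit, as the abstract proposes, would amount to treating `p` as a trainable parameter rather than a fixed argument.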
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
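The link between Fisher information and estimator variance mentioned in the abstract above can be made concrete with a small worked example. The sketch below uses a single Bernoulli parameter $\theta$ observed through $n$ i.i.d. draws, where the Fisher information is $n / (\theta(1-\theta))$ and the Cram\'{e}r-Rao bound on the variance of any unbiased estimate is its inverse. The function names and the values of $\theta$ and $n$ are assumptions chosen for illustration; this is a generic textbook case, not the CIF, CD-CIF or IP procedures themselves.

```python
from fractions import Fraction


def bernoulli_fisher_information(theta, n):
    """Fisher information about theta carried by n i.i.d. Bernoulli(theta) draws."""
    return n / (theta * (1 - theta))


def sample_mean_variance(theta, n):
    """Exact variance of the sample mean of n i.i.d. Bernoulli(theta) draws."""
    return theta * (1 - theta) / n


theta, n = Fraction(1, 4), 100
info = bernoulli_fisher_information(theta, n)
cramer_rao_bound = 1 / info  # lower bound on the variance of unbiased estimates

# The sample mean is unbiased and attains the bound exactly in this model.
print(info, cramer_rao_bound, sample_mean_variance(theta, n))
```

In this reading, a parameter with larger Fisher information admits a tighter variance bound, which is the sense in which its estimate can be called more confident.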
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
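The input-to-hidden variant mentioned in the batch-normalization abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the helper names (`batch_norm`, `matvec`, `rnn_step`), the tanh nonlinearity, and the omission of the learned gain and bias that batch normalization normally carries. Only the term computed from the input is standardized over the mini-batch; the hidden-to-hidden term is left untouched.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature over a mini-batch (a list of equal-length rows)."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(w_row, vec)) for w_row in weights]


def rnn_step(x_batch, h_batch, w_x, w_h):
    """One tanh-RNN step with batch norm on the input-to-hidden term only."""
    # Hypothetical simplification: no learned gain/bias after normalization.
    input_term = batch_norm([matvec(w_x, x) for x in x_batch])
    hidden_term = [matvec(w_h, h) for h in h_batch]
    return [
        [math.tanh(a + b) for a, b in zip(i_row, h_row)]
        for i_row, h_row in zip(input_term, hidden_term)
    ]
```

Calling `rnn_step` once per time step, feeding the returned hidden states back in as `h_batch`, gives an unrolled recurrence in which the normalization statistics are recomputed per step from the current mini-batch.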
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
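Stochastic rounding, as referred to in the limited-precision abstract above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`stochastic_round`, `to_fixed_point`) are hypothetical, and the 8/8 split between integer and fractional bits is an arbitrary choice for a 16-bit word. A value is rounded up to the next grid point with probability equal to its fractional distance from the lower grid point, which makes the rounding unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Rounds up with probability equal to the distance from the lower grid
    point (in grid units), so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def to_fixed_point(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed fixed-point range.

    The int_bits/frac_bits split is an assumption made for this sketch.
    """
    step = 2.0 ** -frac_bits
    max_val = 2.0 ** (int_bits - 1) - step
    min_val = -(2.0 ** (int_bits - 1))
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

With round-to-nearest, an update smaller than half a grid step is always lost; with the scheme above it survives with small probability, so many small updates still accumulate on average.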
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
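The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes the combination is a convex blend controlled by a single scalar; the names `mixed_pool`, `pool_1d` and `alpha` are hypothetical, and `alpha` is passed in by hand here rather than learned as the abstract describes.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain
    average pooling. In a trained network alpha would be a learned
    parameter; here it is an ordinary argument (illustrative assumption).
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


def pool_1d(signal, size, alpha):
    """Apply mixed_pool over non-overlapping windows of length `size`.

    A trailing partial window, if any, is dropped.
    """
    return [mixed_pool(signal[i:i + size], alpha)
            for i in range(0, len(signal) - size + 1, size)]
```

Calling `pool_1d(signal, 2, 0.5)` averages the max-pooled and average-pooled value of each window of two elements; a real layer would apply the same blend per 2-D region and update `alpha` by gradient descent.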
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
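The stochastic pooling procedure in the abstract above can be sketched for a single pooling region as follows. This is only an illustration under stated assumptions: activations are taken to be non-negative (for example, outputs of rectifier units), the multinomial probabilities are taken to be the activations divided by their sum, and an all-zero region is mapped to 0.0. The function name `stochastic_pool` and the `rng` parameter are hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value, i.e. according to a multinomial distribution given by the
    activities in the region. Assumes non-negative activations.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # No activity in the region: nothing to sample (assumed convention).
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible; with the region `[1.0, 3.0]`, the value 3.0 should be selected roughly three times out of four.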
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
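As a rough illustration of the pooling interpretation of the $L_p$ unit described above, the following Python sketch computes a normalized $L_p$ norm over a group of already-computed projections. The function name `lp_unit` and the exact normalization (dividing by the group size before taking the $p$-th root) are assumptions made for this example and are not taken from the abstract; the learned projections and the learned order $p$ are left out.

```python
import math


def lp_unit(projections, p):
    """Normalized L_p norm of a group of projections (hypothetical helper).

    Computes (mean(|a_i|^p))^(1/p). The projections a_i are assumed to be
    computed elsewhere, e.g. as linear projections of the layer below.
    """
    if not projections:
        raise ValueError("projections must be non-empty")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Factor out the largest magnitude so large p does not overflow.
    largest = max(abs(a) for a in projections)
    if largest == 0.0:
        return 0.0
    total = sum((abs(a) / largest) ** p for a in projections)
    return largest * (total / len(projections)) ** (1.0 / p)


group = [1.0, -2.0, 3.0]
mean_abs = lp_unit(group, 1)    # p = 1: average of absolute values
rms = lp_unit(group, 2)         # p = 2: root-mean-square
near_max = lp_unit(group, 200)  # large p: approaches max |a_i|
```

Under these assumptions, p = 1 gives the mean of the absolute values (which matches average pooling only for non-negative inputs), p = 2 gives root-mean-square pooling, and increasing p moves the output toward max pooling over the magnitudes.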
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
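The input-to-hidden variant of batch normalization described in the preceding abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tanh nonlinearity, and the use of training-mode mini-batch statistics only (no running averages for inference) are assumptions made here for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift.

    Training-mode statistics only; running averages for inference are omitted.
    """
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One vanilla RNN step (hypothetical helper).

    Only the input-to-hidden term x_t @ W_x is normalized; the
    hidden-to-hidden term h_prev @ W_h is left untouched.
    """
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h + b)
```

Here x_t has shape (batch, input_dim) and h_prev has shape (batch, hidden_dim); gamma and beta are per-hidden-unit vectors.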
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
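The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step `2**-frac_bits` and rounds up with probability equal to the fractional remainder, which is one common definition of stochastic rounding; the function name and the `frac_bits` parameter are hypothetical, and saturation to a 16-bit range is deliberately left out.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a fixed-point grid with step 2**-frac_bits (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional remainder, so the
    # rounded value equals the input in expectation.
    round_up = rng.random(scaled.shape) < (scaled - floor)
    return (floor + round_up) / scale
```

Values already on the grid are returned unchanged, while an off-grid value such as 0.3 with `frac_bits=2` lands on 0.25 or 0.5, with an average close to 0.3 over many draws.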
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
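The first direction in the pooling abstract above, combining max and average pooling, can be sketched as a convex mix of the two. This is only one plausible reading of the abstract: the mixing weight `alpha` is a fixed argument here, whereas the abstract describes learning the pooling function, and the function name and non-overlapping square windows are assumptions made for illustration.

```python
import numpy as np

def mixed_pool(fmap, alpha, size=2):
    """Blend max and average pooling over non-overlapping size x size windows."""
    h, w = fmap.shape
    # Drop any ragged border, then expose each window on axes 1 and 3.
    blocks = fmap[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    max_pool = blocks.max(axis=(1, 3))
    avg_pool = blocks.mean(axis=(1, 3))
    # alpha = 1 recovers max pooling; alpha = 0 recovers average pooling.
    return alpha * max_pool + (1.0 - alpha) * avg_pool
```

In a trainable version, `alpha` would be a parameter updated by backpropagation together with the rest of the network.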
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
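The stochastic pooling procedure in the abstract above can be sketched as follows. The sketch assumes non-negative activations (for example, after a ReLU) and non-overlapping square pooling regions, and it returns 0 for an all-zero region; the function name and these conventions are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def stochastic_pool(fmap, size=2, rng=None):
    """Sample one activation per pooling region, with probability
    proportional to the activation values (assumed non-negative)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = fmap.shape
    out = np.zeros((h // size, w // size), dtype=np.float64)
    for i in range(h // size):
        for j in range(w // size):
            region = fmap[i * size:(i + 1) * size,
                          j * size:(j + 1) * size].ravel()
            total = region.sum()
            if total <= 0:
                continue  # all-zero region: leave the output at 0
            # Multinomial over the region, given by the normalized activities.
            out[i, j] = rng.choice(region, p=region / total)
    return out
```

Larger activations are picked more often, but smaller ones still get selected occasionally, which is the source of the regularizing noise during training.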
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
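The claim in the $L_p$ abstract above, that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling, can be checked with a minimal sketch. The function name lp_unit and the input values are hypothetical, and the sketch assumes the projections have already been computed; it is not the authors' implementation and omits the learned projections and the learned order $p$.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of pre-computed projection values: (mean(|z|^p))^(1/p).

    Hypothetical helper for illustration only; assumes p >= 1.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


z = [3.0, -4.0]              # made-up projection outputs
avg = lp_unit(z, 1)          # p = 1: mean absolute value (average pooling of |z|)
rms = lp_unit(z, 2)          # p = 2: root-mean-square pooling
near_max = lp_unit(z, 50)    # large p: approaches max(|z|), i.e. max pooling
```

Under these assumptions, p = 1 and p = 2 reproduce the average and root-mean-square of the absolute values, and the result moves toward the largest absolute value as p grows.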
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
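The stochastic rounding idea mentioned in the limited-precision abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round`, the split into `word_bits` and `frac_bits`, and the saturation behaviour are assumptions chosen for illustration. A value is rounded up to the next fixed-point grid point with probability equal to its fractional remainder, so the rounded value equals the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with stochastic rounding.

    Hypothetical helper: word_bits is the total width, frac_bits the
    number of fractional bits. Values outside the range saturate.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower
    # grid point, which makes the result unbiased in expectation.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; the average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates would be lost systematically; the randomized version preserves them on average.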
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
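The first direction named in the pooling abstract above, combining max and average pooling, can be illustrated with a small sketch. The abstract does not give the exact form of the combination, so the convex mixture below, the function names `mixed_pool` and `mixed_pool_grad`, and the mixing weight `mix` are assumptions made for illustration only.

```python
def mixed_pool(region, mix):
    """Blend max and average pooling over one pooling region.

    `mix` is a weight in [0, 1]: 1.0 gives pure max pooling,
    0.0 gives pure average pooling. (Assumed form, not from the abstract.)
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    average = sum(region) / len(region)
    return mix * max(region) + (1.0 - mix) * average


def mixed_pool_grad(region):
    """Derivative of mixed_pool with respect to `mix`.

    This is what a gradient-based learner would use to adapt the blend.
    """
    return max(region) - sum(region) / len(region)
```

Because the output is differentiable in `mix`, the weight could in principle be trained by gradient descent along with the rest of the network, which is one way a pooling function can be "learned".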
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
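The stochastic pooling procedure described in the abstract above can be sketched in a few lines. This is a minimal pure-Python illustration for a single flattened pooling region. It assumes non-negative activations (for example, after a rectifier). The name `stochastic_pool` and the choice to return 0 for an all-zero region are assumptions of this sketch, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. from a multinomial distribution given by the normalized activities.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # Assumption: a region with no active unit pools to zero.
        return 0.0
    # random.choices accepts unnormalized weights and draws one index.
    index = rng.choices(range(len(region)), weights=region, k=1)[0]
    return region[index]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when checking that larger activations are indeed selected more often.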
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
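The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically with a short sketch. The function below takes the projections as already computed and only applies the normalized norm. The name `lp_unit` and the fixed (not learned) order `p` are assumptions for illustration; the abstract describes learning a separate order for each unit.

```python
import math


def lp_unit(projections, p):
    """Normalized L_p norm: (mean(|x_i| ** p)) ** (1 / p), for p >= 1.

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not projections:
        raise ValueError("projections must be non-empty")
    if p < 1:
        raise ValueError("p must be at least 1")
    mean_power = sum(abs(x) ** p for x in projections) / len(projections)
    return mean_power ** (1.0 / p)


# Hypothetical inputs, chosen only to show the three regimes.
values = [1.0, -2.0, 4.0]
average_like = lp_unit(values, 1)   # mean of absolute values
rms_like = lp_unit(values, 2)       # root-mean-square
max_like = lp_unit(values, 200)     # close to max(|x_i|) = 4
```

Very large orders can overflow floating-point arithmetic for large inputs, so a practical implementation would likely rescale the inputs by their maximum absolute value before raising them to the power `p`.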
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
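The "randomly dropping a few nodes" step mentioned in the dropout abstract above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the method analysed in that abstract: the function name `dropout`, the `keep_prob` parameter, and the rescaling by `1 / keep_prob` (the common "inverted dropout" convention) are all choices made here for the example.

```python
import numpy as np

def dropout(h, keep_prob, rng):
    """Zero each activation independently with probability 1 - keep_prob.

    Surviving activations are divided by keep_prob so the expected value of
    the output equals the input (the "inverted dropout" convention, assumed
    here for illustration).
    """
    mask = rng.random(h.shape) < keep_prob  # True where the node is kept
    return h * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones(100_000)                    # hypothetical hidden-layer activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
unchanged = dropout(hidden, keep_prob=1.0, rng=rng)
```

With `keep_prob=1.0` no node is dropped, and for smaller values the mean activation is preserved in expectation while roughly a `1 - keep_prob` fraction of the entries becomes zero.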
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
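The phrase "uses mini-batch statistics to standardize features" in the batch normalization abstract above can be made concrete with a short NumPy sketch. It is a simplified illustration under stated assumptions: the function name `batch_norm` and the `eps` constant are chosen here, and the learned scale and shift parameters used in practice are omitted.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Standardize each feature (column) of a mini-batch.

    x has shape (batch_size, num_features). The mean and variance are
    computed over the batch dimension only; eps guards against division
    by zero. Learned scale/shift parameters are left out of this sketch.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # hypothetical input-to-hidden pre-activations
normed = batch_norm(batch)
```

After the call, every column of `normed` has approximately zero mean and unit variance regardless of the scale of the incoming mini-batch.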
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
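As a rough illustration of the stochastic rounding idea mentioned in the fixed-point training abstract above, the following is a minimal sketch, not the authors' implementation. It assumes a signed fixed-point format described by a hypothetical word length `word_bits` and fractional length `frac_bits` (names chosen here for illustration), rounds a value up or down with probability proportional to its proximity to each neighbouring grid point, and saturates on overflow.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the rounding is unbiased in expectation.
    Parameter names and the saturation behaviour are assumptions made
    for this sketch.
    """
    eps = 2.0 ** -frac_bits              # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)           # grid point at or below x
    if rng.random() < scaled - lower:    # round up with prob. = remainder
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    clipped = max(min_int, min(max_int, lower))
    return clipped * eps
```

Averaged over many draws, the rounded values approach the original value, which is the property that distinguishes this scheme from round-to-nearest.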
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also are ill-suited to parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
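The first direction described above, combining max and average pooling, can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `mixed_pool` and `pool_1d` are hypothetical, the mixing proportion `a` is passed in as a fixed number rather than learned, and the input is a flat list of activations rather than a feature map.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is an assumed mixing proportion in [0, 1]:
    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


def pool_1d(values, size, a):
    """Apply mixed pooling to non-overlapping windows of length `size`."""
    return [
        mixed_pool(values[i:i + size], a)
        for i in range(0, len(values) - size + 1, size)
    ]


activations = [1.0, 3.0, 2.0, 6.0]
print(pool_1d(activations, 2, 1.0))  # behaves like max pooling
print(pool_1d(activations, 2, 0.0))  # behaves like average pooling
print(pool_1d(activations, 2, 0.5))  # an even blend of the two
```

In a trainable version, `a` would be a parameter updated by gradient descent together with the network weights; it is held fixed here only to keep the sketch short.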
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
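The stochastic pooling procedure described above can be sketched in a few lines. This is a minimal sketch under assumptions that go beyond the abstract: activations are taken to be non-negative (for example, after a rectifier), an all-zero region returns 0.0, and the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution given by the activities in the region.
    Assumes non-negative activations.
    """
    if any(v < 0 for v in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumed convention: nothing is active, so the pooled value is zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
region = [0.5, 1.5, 0.0, 2.0]
samples = [stochastic_pool(region, rng) for _ in range(5)]
print(samples)  # each entry is one of the non-zero activations in `region`
```

Because larger activations are sampled more often, the procedure behaves like a noisy version of max pooling, which is where the regularizing effect comes from.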
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout and related methods such as DropConnect are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
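As a rough illustration of the normalized $L_p$ norm mentioned in the abstract above, the following Python sketch computes $(\frac{1}{N}\sum_i |z_i|^p)^{1/p}$ over a group of already-computed projections $z_i$. The function name `lp_unit`, the use of plain lists, and the restriction to $p \geq 1$ are assumptions made for this example; the exact parametrization in the paper may differ. It is only meant to show how average, root-mean-square and max pooling (here taken over absolute values) arise as special cases of the order $p$.

```python
import math


def lp_unit(projections, p):
    """Normalized L_p norm of a group of inputs: (mean(|z_i|^p))^(1/p).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not projections:
        raise ValueError("projections must be non-empty")
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    magnitudes = [abs(z) for z in projections]
    largest = max(magnitudes)
    # p = infinity is the limiting case: max pooling over absolute values.
    if math.isinf(p) or largest == 0.0:
        return largest
    # Factor out the largest magnitude so large p does not overflow.
    mean_pow = sum((m / largest) ** p for m in magnitudes) / len(magnitudes)
    return largest * mean_pow ** (1.0 / p)


z = [1.0, -2.0, 3.0]
avg_abs = lp_unit(z, 1)          # average pooling of |z_i|
rms = lp_unit(z, 2)              # root-mean-square pooling
max_abs = lp_unit(z, math.inf)   # max pooling of |z_i|
```

With `p = 1` the unit averages the absolute values, with `p = 2` it gives the root mean square, and as `p` grows the result approaches the largest absolute value in the group.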
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
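As a rough illustration of a dropout based gradient step for a convex ERM, the sketch below applies one update to a linear least-squares model while randomly dropping input features. It is not the algorithm analyzed in the abstract above: the function name `dropout_sgd_step`, the squared loss, the inverted-dropout scaling by `1 / keep_prob`, and the default hyperparameters are all assumptions made for the example.

```python
import random


def dropout_sgd_step(w, x, y, lr=0.1, keep_prob=0.5, rng=random):
    """One gradient step on 0.5 * (w . x_drop - y)**2 with feature dropout.

    Hypothetical helper: each input feature is kept with probability
    keep_prob and rescaled by 1 / keep_prob (inverted dropout), so the
    dropped input equals the original input in expectation.
    """
    # Sample an independent keep/drop decision per feature.
    mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0 for _ in x]
    x_drop = [m * xi for m, xi in zip(mask, x)]

    # Prediction and residual under the dropped input.
    pred = sum(wi * xi for wi, xi in zip(w, x_drop))
    err = pred - y

    # The gradient of the squared loss with respect to w is err * x_drop,
    # so the weights of dropped features are left unchanged in this step.
    return [wi - lr * err * xi for wi, xi in zip(w, x_drop)]
```

With `keep_prob=1.0` the step reduces to an ordinary stochastic gradient step on the squared loss, which is a convenient sanity check.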
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
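To make "uses mini-batch statistics to standardize features" concrete, here is a minimal sketch of the standardization step only. The function name `batch_standardize` and the stabilizer `eps` are assumptions for this example; the learned scale and shift parameters of full batch normalization, and any RNN-specific handling discussed in the abstract above, are deliberately left out.

```python
import math


def batch_standardize(batch, eps=1e-5):
    """Standardize each feature column of a mini-batch.

    batch is a list of equal-length feature lists. Every column is shifted
    to zero mean and scaled to (approximately) unit variance using the
    statistics of this mini-batch only. eps is an assumed small constant
    that avoids division by zero for constant columns.
    """
    n = len(batch)
    dim = len(batch[0])

    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]

    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]
```

For input-to-hidden normalization in an RNN, a call like this would be applied to the input projection at each time step; that usage is an interpretation of the abstract, not a detail it states.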
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
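The stochastic rounding mentioned above can be sketched as follows: a real value is rounded to one of its two neighbouring fixed-point grid values, rounding up with probability equal to its fractional distance from the lower neighbour, so the result is unbiased in expectation. The function name `stochastic_round`, the choice of `frac_bits`, and the omission of saturation to a 16-bit range are assumptions made for this illustration, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    position between the two nearest grid points, and down otherwise,
    so the expected result equals x. Saturation to a fixed word length
    (for example 16 bits) is not modelled in this sketch.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid index at or below x
    p_up = scaled - lower           # fractional part, in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale
```

Values that already lie on the grid have `p_up` equal to zero and are returned unchanged, whereas round-to-nearest would always map a small gradient update below half a grid step to zero.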
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
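The first direction above, combining max and average pooling, can be sketched for a single pooling region. This is an illustration under assumptions, not the authors' implementation: `mixed_pool` assumes a convex combination with a mixing proportion `a` in [0, 1], and `gated_pool` assumes the proportion is produced by a sigmoid of a weighted sum of the region's values. Both names and the `weights` parameter are hypothetical, and the training of `a` or `weights` by backpropagation is not shown.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one region with mixing proportion `a`."""
    # a = 1 gives plain max pooling, a = 0 gives plain average pooling.
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def gated_pool(region, weights):
    """Let the region's own values choose the mixing proportion (assumed sigmoid gate)."""
    score = sum(w * x for w, x in zip(weights, region))
    gate = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, gate)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(region, 0.5))
    print(gated_pool(region, [0.0, 0.0, 0.0, 0.0]))
```

In a real network the mixing proportion or gate weights would be parameters learned jointly with the filters; here they are passed in by hand to keep the sketch self-contained.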
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
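The stochastic pooling procedure just described can be sketched for one pooling region. The sketch assumes non-negative activations (for example, the output of rectifier units) and sampling probabilities proportional to the activations; the handling of an all-zero region is a choice made for this illustration, and the function name is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region of non-negative activations."""
    total = sum(region)
    if total == 0:
        # Assumption of this sketch: an all-zero region pools to zero.
        return 0.0
    # Multinomial probabilities given by the activities within the region.
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.0, 1.0, 3.0, 0.0]
    samples = [stochastic_pool(region, rng) for _ in range(10)]
    print(samples)
```

Larger activations are picked more often, but smaller non-zero ones are still selected occasionally, which is the source of the regularizing noise during training.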
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
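The normalized $L_p$ norm described in the abstract above can be illustrated with a minimal NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name lp_unit, the projection matrix W, the bias b, and the use of the mean as the normalization are all hypothetical choices made here. It shows how the order p moves the unit between average pooling of magnitudes (p = 1), root-mean-square pooling (p = 2), and something close to max pooling (large p).

```python
import numpy as np

def lp_unit(x, W, b, p):
    """Normalized L_p norm over N linear projections of x.

    Assumed (hypothetical) form: (mean_i |w_i . x + b_i|^p)^(1/p),
    with W of shape (N, d), b of shape (N,), and p >= 1.
    """
    z = np.abs(W @ x + b)              # magnitudes of the N projections
    return np.mean(z ** p) ** (1.0 / p)

# Toy input: identity projections, so the projections equal x itself.
x = np.array([1.0, -2.0, 3.0])
W = np.eye(3)
b = np.zeros(3)

mean_abs = lp_unit(x, W, b, p=1)    # average of |x_i|
rms = lp_unit(x, W, b, p=2)         # root-mean-square of x
near_max = lp_unit(x, W, b, p=200)  # approaches max |x_i| as p grows
```

With identity projections the three calls reduce to the familiar pooling operators applied to |x|, which is the sense in which the abstract calls the unit a generalization of average, root-mean-square and max pooling.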
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing full exploitation of the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
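As an illustration of the input-to-hidden variant described in the preceding abstract, the following is a minimal sketch, not the authors' implementation. It standardizes only the input-to-hidden term of a plain tanh RNN step using mini-batch statistics and leaves the hidden-to-hidden term untouched. The function names, the list-of-lists layout, and the omission of the learned scale and shift parameters are all assumptions made for brevity.

```python
import math

def batch_standardize(batch, eps=1e-5):
    """Standardize each feature across the mini-batch (zero mean, unit variance)."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]

def matvec_batch(batch, weights):
    """Multiply every row of the batch by an (in_dim x out_dim) weight matrix."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in batch]

def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One tanh RNN step with batch normalization on the input-to-hidden term only."""
    x_term = batch_standardize(matvec_batch(x_batch, w_xh))
    h_term = matvec_batch(h_batch, w_hh)  # hidden-to-hidden path is not normalized
    return [[math.tanh(a + b) for a, b in zip(x_row, h_row)]
            for x_row, h_row in zip(x_term, h_term)]

# Hypothetical toy batch: 3 examples, 2 input features, 2 hidden units.
x_batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
h_batch = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
h_next = rnn_step(x_batch, h_batch, identity, zeros)
```

A full batch normalization layer would also apply a learned per-feature scale and shift and track running statistics for use at test time; both are left out here.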
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
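The stochastic rounding idea mentioned in the preceding abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `to_fixed_point`, the 8/8 split of the 16-bit word into integer and fractional bits, and the saturation behaviour are all assumptions chosen for illustration. A value is rounded up to the next grid point with probability equal to its fractional distance above the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def to_fixed_point(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round x onto a signed fixed-point grid.

    Hypothetical helper: the grid spacing is 2**-frac_bits and the
    representable range is that of a signed (int_bits + frac_bits)-bit word.
    """
    eps = 2.0 ** -frac_bits            # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)         # index of the grid point just below x
    # Round up with probability equal to the fractional part, so that
    # E[result] == x for any x inside the representable range.
    if rng.random() < scaled - lower:
        lower += 1
    value = lower * eps
    # Saturate instead of wrapping around on overflow (an assumption).
    hi = 2.0 ** (int_bits - 1) - eps
    lo = -(2.0 ** (int_bits - 1))
    return min(max(value, lo), hi)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [to_fixed_point(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    print(sorted(set(samples)))          # only the two neighbouring grid points
    print(sum(samples) / len(samples))   # close to 0.3 on average
```

With round-to-nearest, 0.3 on a grid of spacing 0.25 would always become 0.25; with the stochastic scheme the small residual survives on average, which is the property the abstract credits for trainability at low precision.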
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. Standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
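The first direction in the preceding abstract, combining max and average pooling, can be pictured with a small sketch. The abstract does not spell out its two combination strategies, so what follows is only one plausible reading and an assumption: a convex combination with a single mixing weight `alpha`, held fixed here rather than learned. The names `mixed_pool` and `mixed_pool_grad_alpha` are hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 plain average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg


def mixed_pool_grad_alpha(region):
    """Derivative of mixed_pool with respect to alpha.

    This is the quantity a gradient-based learner would need in order to
    adapt the mixing weight instead of fixing it by hand.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    return max(region) - sum(region) / len(region)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

Because the output is linear in `alpha`, the gradient is simply the gap between the max and the mean of the region, which is one way to see why such a scheme would add very few parameters.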
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
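The stochastic pooling procedure described in the preceding abstract can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the activations in a region are non-negative (for example, outputs of rectifier units) and that the multinomial probabilities are the activations normalised by their sum. The function name `stochastic_pool` and the handling of an all-zero region are assumptions.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    so larger activations are selected more often but not always.
    """
    if any(a < 0 for a in region):
        raise ValueError("this sketch assumes non-negative activations")
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample, return zero.
        return 0.0
    # random.choices normalises the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [stochastic_pool([1.0, 3.0], rng=rng) for _ in range(10000)]
    print(draws.count(3.0) / len(draws))  # roughly 0.75
```

The sampling itself needs no tuning knob, which is consistent with the abstract's description of the approach as hyper-parameter free.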
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
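As a brief worked note on the Fisher information argument above (standard textbook statements, not taken from the abstract itself): for a scalar parameter $\theta$ and any unbiased estimator $\hat{\theta}$ built from $n$ independent samples, the Cramér-Rao bound reads

$$\mathrm{Var}(\hat{\theta}) \;\geq\; \frac{1}{n\, I(\theta)}, \qquad I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \theta} \log p(x; \theta)\right)^{2}\right].$$

For a single binary variable with $P(x = 1) = \theta$, this gives $I(\theta) = \frac{1}{\theta(1-\theta)}$, and the sample mean attains the bound with variance $\frac{\theta(1-\theta)}{n}$. In this sense a parameter with large Fisher information is one that can be estimated with small variance, which is the notion of "confident" used informally here.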
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
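A minimal sketch of what a "dropout based GD method for convex ERMs" could look like, assuming logistic regression as the GLM and inverted dropout applied to the input features. The function names, the toy data, and the hyperparameters (`keep_prob`, `lr`, `epochs`) are illustrative assumptions, not the algorithm analysed in the abstract:

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dropout_gd(xs, ys, keep_prob=0.8, lr=0.5, epochs=200, seed=0):
    # Full-batch gradient descent on the logistic loss, with a fresh
    # dropout mask drawn for every example at every step (an assumption).
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            # Inverted dropout: zero each feature with probability
            # 1 - keep_prob and rescale the survivors by 1 / keep_prob.
            masked = [xi / keep_prob if rng.random() < keep_prob else 0.0
                      for xi in x]
            p = sigmoid(sum(wi * mi for wi, mi in zip(w, masked)))
            for j in range(dim):
                grad[j] += (p - y) * masked[j]
        w = [wi - lr * gj / len(xs) for wi, gj in zip(w, grad)]
    return w

def predict(w, x):
    # Prediction uses the unmasked input.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Toy data (hypothetical): the label is 1 when the first feature is positive.
xs = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.3],
      [-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.0]]
ys = [1, 1, 1, 0, 0, 0]
w = dropout_gd(xs, ys)
```

Because the loss is convex in `w`, the random masks act only as noise on the gradient; the stability and privacy claims above concern how little `w` changes when a single training point is replaced, which this sketch does not itself verify.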
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
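One layer of such a bootstrap network can be sketched as follows, under assumptions the abstract does not spell out: squared Euclidean distance for the nearest-center assignment, and a one-hot output per clustering concatenated into the layer's code. All names (`fit_layer`, `transform`, `n_clusterings`, `k`) are hypothetical:

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def fit_layer(data, n_clusterings, k, rng):
    # Each clustering is simply k data points sampled without replacement
    # and used as centers; the clusterings are drawn independently.
    return [rng.sample(data, k) for _ in range(n_clusterings)]

def transform(layer, x):
    # Concatenate one one-hot block per clustering, marking the center
    # nearest to x (the one-hot encoding is an assumption of this sketch).
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda j: sq_dist(x, centers[j]))
        code.extend(1.0 if j == nearest else 0.0
                    for j in range(len(centers)))
    return code

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = fit_layer(data, n_clusterings=4, k=5, rng=rng)
codes = [transform(layer, x) for x in data]
```

Stacking would then amount to calling `fit_layer` again on `codes`; no optimization is involved in building a layer, since the centers are only sampled.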
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
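To make the distinction between the two placements concrete, here is a rough sketch of a single recurrent step in which only the input-to-hidden term is batch-normalized. It assumes a plain tanh RNN, statistics computed over the mini-batch at the current time step, and no learned scale or shift; the weights and inputs are made-up values, and none of this is claimed to match the exact variant studied in the abstract:

```python
import math

def matvec(W, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def batch_norm(rows, eps=1e-5):
    # Standardize each column (feature) with the mean and variance
    # computed over the mini-batch dimension.
    n = len(rows)
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        for i in range(n):
            out[i][j] = (rows[i][j] - mean) / math.sqrt(var + eps)
    return out

def rnn_step(x_batch, h_batch, W_x, W_h):
    # h_t = tanh(BN(W_x x_t) + W_h h_{t-1}); the hidden-to-hidden term
    # is deliberately left unnormalized.
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(i_row, r_row)]
            for i_row, r_row in zip(in_term, rec_term)]

# Hypothetical sizes: 2 inputs, 3 hidden units, mini-batch of 4.
W_x = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
W_h = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
x_batch = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.5]]
h_batch = [[0.0, 0.0, 0.0] for _ in x_batch]
h_next = rnn_step(x_batch, h_batch, W_x, W_h)
normed = batch_norm([matvec(W_x, x) for x in x_batch])
```

Normalizing `rec_term` as well would give the hidden-to-hidden variant that the abstract reports as unhelpful.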
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
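The stochastic rounding idea mentioned in the low-precision training abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `frac_bits` parameter and the choice of 8 fractional bits are assumptions made only for illustration. The sketch rounds a value up to the next fixed-point grid point with probability equal to its fractional distance from the lower grid point, so the rounded value is unbiased in expectation, whereas round-to-nearest always discards updates smaller than half the grid spacing.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the expected result equals x.
    frac_bits and rng are illustrative parameters, not from the source.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest onto the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.001  # smaller than half the grid spacing 2**-8
    n = 100_000
    mean_sr = sum(stochastic_round(x, 8, rng) for _ in range(n)) / n
    print("round-to-nearest:", round_to_nearest(x))  # always 0.0
    print("stochastic mean :", mean_sr)  # near 0.001 on average
```

With round-to-nearest, a small weight update such as 0.001 is lost every time; with stochastic rounding it survives on average, which is one plausible intuition for why the rounding scheme matters during training.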
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
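The pooling abstract above mentions combining max and average pooling. One plausible reading, sketched here under stated assumptions, is a convex combination of the two with a mixing weight. The names `mixed_pool`, `mixed_pool_2d` and `alpha` are hypothetical, and `alpha` is a fixed number in this sketch, whereas the abstract describes the combination as learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives max pooling and alpha = 0.0 gives average pooling.
    alpha is a fixed, hypothetical mixing weight in this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


def mixed_pool_2d(image, size, alpha):
    """Apply mixed_pool over non-overlapping size x size windows of a 2-D list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            # Flatten the current window into a list of activations.
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(mixed_pool(window, alpha))
        out.append(out_row)
    return out


if __name__ == "__main__":
    feature_map = [[1, 2, 0, 0],
                   [3, 4, 0, 8]]
    print(mixed_pool_2d(feature_map, 2, 0.5))  # [[3.25, 5.0]]
```

In a trainable version, `alpha` would become a parameter updated by gradient descent alongside the network weights; that step is omitted here.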
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
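As a rough illustration of the dropout-based gradient descent for convex ERMs mentioned in the abstract above, the following sketch trains a logistic regression (a simple GLM) with input dropout. The toy dataset, the learning rate, the keep probability, and all function names are assumptions made for illustration only; they are not taken from the paper, and the sketch does not include the differentially private variant.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(t):
    # Stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def logistic_loss(w, data):
    """Average logistic loss of weights w on (x, y) pairs, evaluated without dropout."""
    total = 0.0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += softplus(-z) if y == 1 else softplus(z)
    return total / len(data)

def dropout_gd(data, dim, steps=200, lr=0.1, keep_prob=0.5, seed=0):
    """Full-batch gradient descent where each input feature is randomly dropped."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in data:
            # Inverted dropout: zero a feature with probability 1 - keep_prob,
            # and rescale the surviving features by 1 / keep_prob.
            x_drop = [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x_drop))) - y
            for j in range(dim):
                grad[j] += err * x_drop[j] / len(data)
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w

# Hypothetical toy data: label 1 for points with positive coordinates, 0 otherwise.
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-2.0, -1.0], 0), ([-1.0, -2.0], 0)]
w = dropout_gd(data, dim=2)
```

Comparing `logistic_loss(w, data)` with the loss at the zero vector gives a quick check that the dropout updates still reduce the undropped training loss on this toy problem.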
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
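The layer structure described in the multilayer bootstrap network abstract above can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance and a one-hot indicator of the nearest center as the output of each clustering; those choices, along with the function names and parameter values, are assumptions and are not confirmed by the abstract.

```python
import random

def build_layer(data, n_clusterings, k, rng):
    """Build one layer: each clustering is k centers sampled at random from the data."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]

def encode(x, layer):
    """Concatenate, over all clusterings, a one-hot indicator of the nearest center."""
    code = []
    for centers in layer:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        nearest = dists.index(min(dists))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code

rng = random.Random(0)
# Hypothetical 2-D toy data.
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = build_layer(data, n_clusterings=4, k=3, rng=rng)
codes = [encode(x, layer) for x in data]
```

Under these assumptions, a deeper network could be formed by passing `codes` to another `build_layer` call, so that each layer clusters the representation produced by the one below it.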
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
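To make the input-to-hidden variant from the abstract above concrete, here is a small sketch of one tanh RNN step in which only the input projection is standardized with mini-batch statistics. The weights, batch values, and function names are hypothetical, and the learned scale and shift parameters that batch normalization normally includes are omitted for brevity.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of the mini-batch."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(dim)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n for j in range(dim)]
    return [[(row[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(dim)]
            for row in batch]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def rnn_step(x_batch, h_batch, W_xh, W_hh):
    """One tanh RNN step; only the input-to-hidden term is batch-normalized."""
    xh = batch_norm([matvec(W_xh, x) for x in x_batch])
    hh = [matvec(W_hh, h) for h in h_batch]  # hidden-to-hidden term left as-is
    return [[math.tanh(a + b) for a, b in zip(xr, hr)] for xr, hr in zip(xh, hh)]

# Hypothetical mini-batch of three 2-D inputs and a 3-unit hidden state.
x_batch = [[1.0, 2.0], [3.0, 0.0], [-1.0, 4.0]]
h_batch = [[0.0, 0.0, 0.0] for _ in x_batch]
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.0] * 3 for _ in range(3)]
h_next = rnn_step(x_batch, h_batch, W_xh, W_hh)
```

Normalizing `matvec(W_hh, h)` in the same way would correspond to the hidden-to-hidden variant that the abstract reports as unhelpful.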
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
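The last abstract above relies on stochastic rounding to a 16-bit fixed-point format. A minimal sketch of one common form of stochastic rounding follows: a value is rounded up to the next representable number with probability equal to its fractional remainder, so the rounding is unbiased in expectation. The function name `stochastic_round`, the split into 8 fractional bits, and the saturation behavior are assumptions for illustration, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    Hypothetical helper: rounds up with probability equal to the fractional
    remainder, and saturates to the signed `word_bits`-bit range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the distance above the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, rng=rng) for _ in range(10000)]
    # The sample mean should be close to 0.3 even though 0.3 is not representable.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map every small update below half a grid step to zero, whereas this scheme keeps such updates in expectation, which is one plausible reading of why the rounding scheme matters during training.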
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
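The layer construction described in the multilayer bootstrap network abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bootstrap_layer`, the squared Euclidean distance, and the one-hot output encoding are all assumptions made for the example.

```python
import random


def bootstrap_layer(data, n_clusterings, k, rng):
    """Build one layer made of independent k-centers clusterings.

    Each clustering uses k data points sampled at random as its centers
    (no optimization of the centers is performed). The returned function
    encodes a point as the concatenation of one-hot vectors, one per
    clustering, marking the nearest center (an assumed encoding).
    """
    centers = [rng.sample(data, k) for _ in range(n_clusterings)]

    def encode(x):
        code = []
        for group in centers:
            # Squared Euclidean distance to every center in this clustering.
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in group]
            nearest = dists.index(min(dists))
            code.extend(1 if i == nearest else 0 for i in range(k))
        return code

    return encode


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
    layer = bootstrap_layer(points, n_clusterings=3, k=2, rng=random.Random(0))
    print(layer(points[0]))
```

Stacking would feed the encoded outputs of one such layer into the next as its data; how `k` changes from layer to layer is left out of this sketch.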
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
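As a concrete reading of "uses mini-batch statistics to standardize features" in the abstract above, here is a minimal pure-Python sketch of the batch normalization transform. The function name and the scalar `gamma`, `beta`, and `eps` parameters are assumptions for illustration. In an RNN, the input-to-hidden variant would apply this to the input projection at each time step, which is not shown here.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature column of a mini-batch.

    `batch` is a list of equal-length rows (one row per example).
    Each column is shifted to zero mean and scaled to unit variance
    using statistics of this mini-batch only, then rescaled by gamma
    and shifted by beta.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased estimate
        scale = gamma / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = scale * (column[i] - mean) + beta
    return out


if __name__ == "__main__":
    print(batch_normalize([[1.0, 10.0], [3.0, 30.0]]))
```

At test time an implementation would typically use running averages of the mean and variance instead of per-batch statistics. That bookkeeping is omitted here.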
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
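To make the rounding scheme mentioned in the abstract above concrete, here is a minimal sketch of stochastic rounding onto a fixed-point grid. The function name, the `frac_bits`/`word_bits` parameters, and the 8-bit fractional split of a 16-bit word are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with step 2**-frac_bits.

    x is rounded up to the next grid point with probability equal to
    its fractional distance from the grid point below, so the rounded
    value equals x in expectation. Values outside the representable
    range of a signed `word_bits`-bit word saturate.
    """
    step = 2.0 ** -frac_bits
    scaled = x / step
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional part.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the signed integer range of the word.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower * step


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    print(sum(samples) / len(samples))  # close to 0.3 on average
```

Round-to-nearest would map 0.3 to 0.25 every time on this coarse grid, whereas the stochastic scheme returns 0.25 or 0.5 with probabilities that keep the average at 0.3. This zero-bias property is what lets small updates survive accumulation at low precision.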
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
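The dropout abstract above describes randomly dropping nodes of a hidden layer during training. A minimal sketch of that masking step follows, in plain Python. The function name `dropout`, its parameters, and the "inverted" rescaling of surviving units are illustrative assumptions, not details taken from the abstract.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each unit independently with probability drop_prob.

    Assumption (not from the abstract): survivors are rescaled by
    1 / (1 - drop_prob), the common "inverted dropout" convention, so the
    expected value of every unit is unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Keep a unit when the uniform draw falls below keep_prob.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


if __name__ == "__main__":
    hidden = [0.2, 1.5, -0.7, 0.9]
    print(dropout(hidden, drop_prob=0.5, rng=random.Random(0)))
```

Passing the random generator explicitly keeps the sketch reproducible; at test time one would simply skip the call.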
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
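The batch normalization abstract above relies on standardizing features with mini-batch statistics. The sketch below shows that standardization step alone, in plain Python. The function name `batch_norm` and the `gamma`, `beta`, and `eps` parameters are illustrative assumptions. In the setting the abstract describes, a step like this would be applied to the input-to-hidden pre-activations of the RNN, which are not modelled here.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) of a mini-batch given as a list of rows.

    gamma and beta are the usual scale and shift parameters, and eps guards
    against division by zero. All three are illustrative defaults.
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [
        [gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
         for j in range(dims)]
        for row in batch
    ]


if __name__ == "__main__":
    print(batch_norm([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))
```

After the call each column has mean close to 0 and variance close to 1, regardless of the original scale of the feature.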
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
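The limited-precision abstract above depends on stochastic rounding to a fixed-point grid. A minimal sketch of one common form of that rounding rule follows, in plain Python. The function name `stochastic_round`, the split into `total_bits` and `frac_bits`, and the saturation behaviour are illustrative assumptions, not the configuration used in the abstract's experiments.

```python
import math
import random


def stochastic_round(x, frac_bits, rng, total_bits=16):
    """Round x to a signed fixed-point value with frac_bits fractional bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point, so the result is unbiased in expectation. Results
    outside the representable range saturate (an assumed overflow policy).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the signed integer range of a total_bits-wide word.
    max_int = (1 << (total_bits - 1)) - 1
    min_int = -(1 << (total_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10)]
    print(samples)  # every entry is either 0.25 or 0.5
```

Round-to-nearest would map 0.3 to 0.25 every time. The stochastic rule returns 0.5 about one time in five, so small updates are preserved on average instead of being rounded away.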
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
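The claim in the abstract above that depth lets a piecewise linear network re-use computation exponentially often can be illustrated with a small sketch. It uses the textbook "tent map" folding example, which is an assumption made here for illustration and not the paper's own construction; all function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def tent(x):
    """One layer with two ReLU units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    """Compose the tent layer `depth` times (a network with 2 * depth ReLUs)."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth, grid_bits=10):
    """Count linear pieces of deep_tent on [0, 1] from slopes on a dyadic grid.

    Breakpoints sit at multiples of 2**-depth, so a grid with spacing
    2**-grid_bits resolves them exactly whenever depth <= grid_bits.
    """
    if depth > grid_bits:
        raise ValueError("grid too coarse for this depth")
    n = 1 << grid_bits
    slopes = [
        (deep_tent((i + 1) / n, depth) - deep_tent(i / n, depth)) * n
        for i in range(n)
    ]
    # A new linear piece starts wherever the slope changes between cells.
    return 1 + sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)


for depth in (1, 2, 3, 5):
    print(depth, count_linear_pieces(depth))
```

Each added layer costs two more units but doubles the number of linear pieces, whereas a single hidden layer with the same total number of ReLU units on a scalar input can produce at most one more piece than it has units.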
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
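The first direction in the pooling abstract above, combining max and average pooling, can be sketched as a convex combination of the two. This is a minimal illustration with a fixed mixing proportion `a`; in the abstract's setting the combination is learned, which is omitted here, and the function names are hypothetical.

```python
def mixed_pool(region, a):
    """Return a * max(region) + (1 - a) * mean(region) for one pooling region.

    `a` in [0, 1] is the mixing proportion: a = 1 gives max pooling and
    a = 0 gives average pooling. Here it is a fixed argument rather than
    a learned parameter.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def mixed_pool2d(image, size, a):
    """Apply mixed_pool over non-overlapping size x size windows of a 2D list."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row_out = []
        for c in range(0, cols - size + 1, size):
            region = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row_out.append(mixed_pool(region, a))
        pooled.append(row_out)
    return pooled


image = [
    [1, 2, 0, 0],
    [3, 6, 0, 4],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
]
print(mixed_pool2d(image, size=2, a=0.5))
```

Because the output is differentiable in `a`, the proportion could in principle be trained by gradient descent alongside the network weights.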
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
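The stochastic pooling procedure described in the last abstract above (pick one activation per pooling region from a multinomial distribution given by the activities) can be sketched as follows. The sketch assumes non-negative activations, for example after a rectifier, and covers only the training-time sampling step; `stochastic_pool` is a hypothetical name.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(region),
    so larger activations are picked more often but not always.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    if sum(region) == 0:
        return 0.0  # nothing is active in this region
    # random.choices draws proportionally to the given weights.
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)
region = [1.0, 3.0, 0.0, 0.0]
draws = [stochastic_pool(region, rng) for _ in range(10000)]
frac_large = draws.count(3.0) / len(draws)
print(frac_large)  # close to 3 / (1 + 3) = 0.75
```

Unlike max pooling, which would always return 3.0 for this region, the sampled output occasionally passes through the weaker activation, which is the source of the regularizing noise.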
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
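The normalized $L_p$ norm described in the abstract above can be illustrated with a short NumPy sketch. This is only one plausible reading of the description, not the authors' implementation: the function name lp_unit, the weight matrix W, the bias b, and the choice to normalize by the number of projections are all assumptions made for illustration.

```python
import numpy as np


def lp_unit(x, W, b, p):
    """Hypothetical L_p unit: normalized L_p norm over linear projections of x.

    x: input vector of shape (d,)
    W: projection matrix of shape (n, d)  (assumed parameterization)
    b: bias vector of shape (n,)          (assumed parameterization)
    p: order of the norm, p >= 1
    """
    z = W @ x + b                          # n linear projections of the input
    # Mean of |z|^p, then the p-th root: a norm normalized by the number of projections.
    return np.mean(np.abs(z) ** p) ** (1.0 / p)


# With identity projections the unit reduces to familiar pooling operators.
x = np.array([3.0, -4.0])
W = np.eye(2)
b = np.zeros(2)

avg_pool = lp_unit(x, W, b, p=1)      # average of absolute values
rms_pool = lp_unit(x, W, b, p=2)      # root-mean-square
near_max = lp_unit(x, W, b, p=200)    # approaches max |z_i| as p grows
```

Under these assumptions, p=1 gives the mean absolute value, p=2 gives the root-mean-square, and large p approaches max pooling over absolute values, which matches the claim that the unit generalizes these pooling operators.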
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
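The layer structure described in the multilayer bootstrap network abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name bootstrap_layer, the use of squared Euclidean distance, and the one-hot nearest-center output encoding are all hypothetical choices; the abstract itself only says that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings, k, rng):
    """One layer: a group of independent k-centers clusterings.

    The centers of each clustering are k randomly sampled rows of X.
    ASSUMPTION: each clustering outputs a one-hot indicator of the
    nearest center, and the indicators are concatenated.
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Randomly sample k data points to serve as centers.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), d2.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    return np.concatenate(codes, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # toy data, not from the paper
H = bootstrap_layer(X, n_clusterings=4, k=3, rng=rng)
```

Stacking would feed H into another call of the same function; how k changes from layer to layer is not specified in the abstract and is left out here.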
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
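As a rough illustration of the input-to-hidden variant mentioned in the abstract above, the sketch below standardizes only the input projection with mini-batch statistics and leaves the hidden-to-hidden term untouched. The names batch_norm and rnn_step, the tanh nonlinearity, the per-time-step statistics, and the toy sizes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    # Standardize each feature using statistics of the current mini-batch.
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, gamma, beta):
    # Batch normalization is applied to the input-to-hidden term only;
    # the hidden-to-hidden term is added without normalization.
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h)

rng = np.random.default_rng(0)
batch, n_in, n_hid = 8, 4, 6          # hypothetical toy sizes
W_x = rng.normal(size=(n_in, n_hid))
W_h = rng.normal(size=(n_hid, n_hid)) * 0.1
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for x_t in rng.normal(size=(5, batch, n_in)):   # 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, gamma, beta)
```

At test time a real implementation would typically replace the mini-batch statistics with running averages; that detail is omitted here.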
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
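The geometric sampling idea in the Hessian-free abstract above can be pictured with a small schedule like the one below. It is a sketch under assumptions rather than the paper's algorithm: the function name geometric_sample_sizes, the starting size, and the growth factor are hypothetical, and the abstract does not say how the sampled subset is chosen or how the rate is set.

```python
import math

def geometric_sample_sizes(n_total, n_start, growth, n_iters):
    """Number of training examples used at each outer iteration.

    The size grows geometrically (n_start * growth**k) and is capped
    at the full dataset size. All parameter values are illustrative.
    """
    sizes = []
    for k in range(n_iters):
        sizes.append(min(n_total, math.ceil(n_start * growth ** k)))
    return sizes

sizes = geometric_sample_sizes(n_total=1000, n_start=100, growth=2.0, n_iters=6)
print(sizes)
```

Early iterations then compute gradients and Krylov subspace iterations on a small subset, and later iterations see progressively more of the data until the full set is used.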
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
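As a rough illustration of the stochastic rounding idea in the limited-precision abstract above, here is a minimal sketch in Python. The function name `stochastic_round`, the `frac_bits` parameter, and the signed 16-bit saturation are assumptions made for this example, not details taken from the paper. A value is rounded up to the next fixed-point grid step with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    remainder on the grid and down otherwise, so E[result] == x for
    values inside the representable range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower); exact grid points never move.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to a signed 16-bit integer range (an assumption of this sketch).
    lower = max(-(1 << 15), min((1 << 15) - 1, lower))
    return lower / scale
```

Averaging many stochastically rounded copies of the same value recovers that value, which deterministic round-to-nearest cannot do for updates smaller than half a grid step.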
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
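The first direction mentioned in the pooling abstract above, combining max and average pooling, can be sketched for a single pooling region as follows. This is only an illustrative sketch: the name `mixed_pool` and the fixed mixing weight `alpha` are assumptions of this example, whereas the abstract describes the combination as learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one region (hypothetical helper).

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. Here alpha is a fixed argument; in a trained network it would
    be a learned parameter.
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average
```

For example, `mixed_pool([1.0, 3.0], 0.5)` blends the max (3.0) and the average (2.0) equally.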
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
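The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as below. This is a minimal sketch under stated assumptions: activations are taken to be non-negative (for example, post-ReLU), the function name `stochastic_pool` is hypothetical, and returning 0.0 for an all-zero region is a choice made for this example.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region (hypothetical helper).

    Each activation a_i is picked with probability a_i / sum(region),
    i.e. a multinomial distribution given by the activities themselves.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Larger activations are chosen more often, but smaller non-zero activations still have a chance of being selected, which is the source of the regularizing noise.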
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
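As a rough illustration of the stochastic rounding idea in the preceding abstract, the sketch below rounds a real value onto a signed fixed-point grid, rounding up with probability equal to the fractional remainder so that the result is unbiased in expectation. The function name, the 8 fractional bits and the saturation behavior are assumptions chosen for the example; the abstract itself only specifies 16-bit fixed-point numbers with stochastic rounding.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    Hypothetical helper: frac_bits and the saturation rule are assumptions.
    """
    step = 2.0 ** -frac_bits
    scaled = x / step
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder, so the
    # expected result equals x (as long as no saturation occurs).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed integer with word_bits bits.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * step


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, rng=rng) for _ in range(20000)]
    print(sorted(set(samples)))          # the two neighboring grid points
    print(sum(samples) / len(samples))   # close to 0.3 on average
```

Round-to-nearest would map 0.3 to the same grid point every time, so the rounding error never averages out; with the stochastic rule the error has zero mean across many updates.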
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
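The claim in the preceding abstract that depth lets a piecewise linear network re-use computation exponentially often can be seen in a toy one-dimensional case. The sketch below is a hypothetical construction, not taken from the paper: each layer uses two rectifier units to fold the interval [0, 1] onto itself, and stacking such layers doubles the number of linear pieces with every layer.

```python
def relu(z):
    return max(0.0, z)


def tent_layer(x):
    # Two ReLU units that fold [0, 1] onto itself:
    # 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def count_linear_pieces(depth, grid=4096):
    """Count linear pieces of `depth` stacked tent layers on [0, 1].

    Assumes 2**depth <= grid so every breakpoint falls on a grid point.
    """
    ys = [i / grid for i in range(grid + 1)]
    for _ in range(depth):
        ys = [tent_layer(y) for y in ys]
    # Slope on each grid cell; a new piece starts wherever the slope changes.
    slopes = [(ys[i + 1] - ys[i]) * grid for i in range(grid)]
    return 1 + sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)


if __name__ == "__main__":
    for depth in range(1, 6):
        print(depth, count_linear_pieces(depth))
```

Each layer adds only two units, yet the number of pieces grows as 2**depth; a single hidden layer of rectifiers needs roughly one unit per breakpoint to represent the same function.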
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
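As a rough illustration of the first direction named in the pooling abstract above (combining max and average pooling), the sketch below blends the two with a single mixing weight. The function name `mixed_pool` and the parameter `alpha` are hypothetical, and a fixed convex combination is only an assumed form; the abstract does not specify how the combination is parametrized or learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    Assumption: a convex combination alpha * max + (1 - alpha) * mean,
    with alpha in [0, 1]. In a learned setting alpha would be a trainable
    parameter; here it is just passed in.
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average
```

Setting `alpha` to 1 recovers max pooling and setting it to 0 recovers average pooling, so both conventional operators are special cases of this sketch.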
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
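The stochastic pooling procedure in the abstract above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, activations are assumed to be non-negative (for example after a rectifier), and the multinomial probabilities are assumed to be the activations normalized by their sum within each region.

```python
import random


def stochastic_pool_region(activations, rng):
    """Sample one activation from a single pooling region.

    Assumption: activations are non-negative, and each one is picked with
    probability proportional to its value.
    """
    total = sum(activations)
    if total <= 0:
        # All-zero region: there is nothing to sample, so return 0.
        return 0.0
    # random.Random.choices accepts unnormalized weights.
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_1d(row, size, rng=None):
    """Apply stochastic pooling over non-overlapping windows of a 1-D list."""
    rng = rng or random.Random()
    return [
        stochastic_pool_region(row[i:i + size], rng)
        for i in range(0, len(row) - size + 1, size)
    ]
```

Because sampling needs no tuning constant, the sketch has no hyper-parameter beyond the pooling size, which matches the abstract's description of the approach as hyper-parameter free.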
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
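The normalized $L_p$ norm at the heart of the $L_p$ unit abstract above can be illustrated with a short Python sketch. The function name `lp_unit` is hypothetical, and the sketch assumes the simplest form, ((1/N) * sum |x_i|^p)^(1/p) with p >= 1, applied directly to a list of inputs; the learned projections and learned orders described in the abstract are left out.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of a list of inputs.

    Assumption: the unit computes ((1/N) * sum(|x_i| ** p)) ** (1/p)
    for p >= 1, with no projection or centering of the inputs.
    """
    if not inputs:
        raise ValueError("inputs must be non-empty")
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)
```

Under this form, p = 1 gives the average of absolute values, p = 2 gives the root-mean-square, and large p approaches the maximum absolute value, which is the sense in which the unit generalizes average, root-mean-square and max pooling.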
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
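The input-to-hidden variant discussed in the batch normalization abstract can be sketched as a single recurrent step. This is only an illustration under assumptions not taken from the abstract: a tanh nonlinearity, no bias or learned scale and shift, and statistics computed over the mini-batch at one time step. The helper names are hypothetical.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) using the mini-batch mean and variance."""
    n, dims = len(batch), len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (col[i] - mean) / math.sqrt(var + eps)
    return out

def matmul(batch, weights):
    """Multiply a (batch x in) list of rows by an (in x out) weight matrix."""
    return [[sum(r * w for r, w in zip(row, col)) for col in zip(*weights)]
            for row in batch]

def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One step: h_t = tanh(BN(x_t W_xh) + h_{t-1} W_hh)."""
    normed_input = batch_norm(matmul(x_batch, w_xh))  # input-to-hidden term only
    recurrent = matmul(h_batch, w_hh)                 # hidden-to-hidden left unnormalized
    return [[math.tanh(a + b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(normed_input, recurrent)]
```

Only the input projection is normalized, which matches the variant the abstract reports as speeding up convergence; the recurrent term is left untouched.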
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
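The geometric sampling idea in the Hessian-free abstract can be pictured as a schedule of sample sizes. The sketch below is an assumption-laden illustration, not the authors' algorithm: the initial size, the growth factor and the cap at the full dataset size are hypothetical parameters, and the abstract does not say how samples are drawn.

```python
import math

def geometric_sample_sizes(total, initial, growth, iterations):
    """Return per-iteration sample sizes that grow geometrically, capped at `total`."""
    sizes = []
    for t in range(iterations):
        # Size at iteration t is initial * growth**t, rounded up and capped.
        sizes.append(min(total, math.ceil(initial * growth ** t)))
    return sizes
```

With total=1000, initial=100 and growth=2.0, five iterations use 100, 200, 400, 800 and then all 1000 points.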
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
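Stochastic rounding, as named in the limited-precision abstract, can be sketched in a few lines: a value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, which makes the rounding unbiased in expectation. The split of a 16-bit word into 8 integer and 8 fractional bits and the saturation behavior are assumptions made for illustration; the function names are hypothetical.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability
    equal to the fractional remainder, so the expected result equals x."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # The chance of rounding up equals the distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale

def to_fixed_point(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed fixed-point range."""
    hi = 2 ** (int_bits - 1) - 2 ** -frac_bits
    lo = -(2 ** (int_bits - 1))
    return min(hi, max(lo, stochastic_round(x, frac_bits, rng)))
```

Round-to-nearest would map every update smaller than half a grid step to zero; the stochastic variant lets such updates survive on average, which is one plausible reading of why the rounding scheme matters during training.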
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
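The input-to-hidden variant described in the preceding abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain tanh RNN cell, nested Python lists as matrices, and hypothetical helper names (`batch_norm`, `matmul`, `rnn_step`); the scale `gamma` and shift `beta` are fixed scalars here, whereas in practice they would be learned per feature.

```python
import math

def batch_norm(rows, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each column of `rows` (batch x features) using mini-batch statistics."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[gamma * (v - m) / math.sqrt(var + eps) + beta
             for v, m, var in zip(row, means, variances)]
            for row in rows]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rnn_step(x_t, h_prev, w_x, w_h):
    """One tanh RNN step in which only the input-to-hidden term is batch-normalized."""
    in_term = batch_norm(matmul(x_t, w_x))   # normalized input-to-hidden transition
    rec_term = matmul(h_prev, w_h)           # hidden-to-hidden transition left as-is
    return [[math.tanh(a + b) for a, b in zip(row_in, row_rec)]
            for row_in, row_rec in zip(in_term, rec_term)]

# Toy usage: batch of 3 sequences, 2 input features, 2 hidden units.
x_t = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
h_prev = [[0.0, 0.0], [0.1, -0.1], [0.2, 0.3]]
w_x = [[0.5, -0.2], [0.1, 0.4]]
w_h = [[0.3, 0.0], [0.0, 0.3]]
h_t = rnn_step(x_t, h_prev, w_x, w_h)
```

Normalizing only `in_term` keeps the recurrent path untouched, which matches the abstract's observation that normalizing the hidden-to-hidden transition did not help.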
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
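The stochastic rounding mentioned in the preceding abstract can be illustrated with a short sketch. It assumes a signed 16-bit fixed-point format with a hypothetical split of 8 fractional bits and a hypothetical function name `stochastic_round`; the abstract itself does not fix these details. A value is rounded up to the next grid point with probability equal to its fractional distance from the lower grid point, so the result equals the input in expectation.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a signed 16-bit fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point (in grid units), so the expected result equals x as long
    as x lies inside the representable range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower                      # in [0, 1)
    code = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the signed 16-bit integer range.
    code = max(-(1 << 15), min((1 << 15) - 1, code))
    return code / scale

# A small update of 0.001 is below half the grid spacing (1/256), so
# round-to-nearest would always return 0.0; stochastic rounding keeps
# the update in expectation.
rng = random.Random(0)
samples = [stochastic_round(0.001, 8, rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

This is the property that matters for training: small gradient contributions are not systematically lost, they are preserved on average across many updates.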
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
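One simple way to picture "combining max and average pooling" from the preceding abstract is a convex blend of the two operations. The sketch below is a minimal illustration under assumptions: the function names `mixed_pool` and `pool_1d` are hypothetical, the mixing coefficient `a` is passed in as a fixed number rather than learned, and the abstract does not spell out the exact form of its two combination strategies.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is an assumed mixing coefficient in [0, 1]:
    a = 1 gives max pooling, a = 0 gives average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing coefficient must lie in [0, 1]")
    if not region:
        raise ValueError("pooling region must not be empty")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * avg_value


def pool_1d(signal, size, a):
    """Apply mixed pooling to non-overlapping windows of a 1-D signal."""
    return [
        mixed_pool(signal[start:start + size], a)
        for start in range(0, len(signal) - size + 1, size)
    ]


if __name__ == "__main__":
    print(pool_1d([1.0, 3.0, 2.0, 6.0], size=2, a=0.5))
```

In a trainable version, `a` would be a parameter updated by gradient descent; the derivative of the blended output with respect to `a` is simply the difference between the max and the average of the region.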
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
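The stochastic pooling procedure described in the preceding abstract (pick one activation per pooling region, with probabilities given by the activities in that region) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, ReLU outputs), and returning 0.0 for an all-zero region is an assumption made here to keep the sampling well defined.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution given by the normalized activities.
    """
    if any(value < 0 for value in region):
        raise ValueError("activations are assumed to be non-negative")
    if sum(region) == 0:
        # Assumed convention: an inactive region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]
    draws = [stochastic_pool(region, rng) for _ in range(10000)]
    # The larger activation should be picked about three times in four.
    print(draws.count(3.0) / len(draws))
```

Because larger activations are picked more often but not always, the procedure behaves like a noisy version of max pooling, which is where the regularizing effect comes from.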
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
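As a rough illustration of the multilayer bootstrap network layer described just above, the following is a minimal sketch and not the authors' implementation. It assumes that each clustering emits a one-hot indicator of the nearest center and that the layer output is the concatenation of these indicators; the abstract does not state this encoding. The names bootstrap_layer, encode and sq_dist are hypothetical.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def bootstrap_layer(data, num_clusterings, k, rng):
    # One layer: each independent clustering keeps k randomly sampled data points as centers.
    return [rng.sample(data, k) for _ in range(num_clusterings)]

def encode(layer, x):
    # Assumption: the layer output concatenates one-hot nearest-center indicators.
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        code.extend(1 if j == nearest else 0 for j in range(len(centers)))
    return code

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = bootstrap_layer(data, num_clusterings=4, k=3, rng=rng)
code = encode(layer, data[0])
```

Stacking would feed the codes of one layer to the next as its data; the sketch stops at a single layer.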
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
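The input-to-hidden variant mentioned just above can be pictured as h_t = tanh(BN(W_x x_t) + W_h h_{t-1}), where only the input term is standardized with mini-batch statistics. The following is a minimal sketch under that assumption, not the authors' code; it omits the learned scale and shift parameters that batch normalization normally carries, and the names matvec, batch_norm and rnn_step are hypothetical.

```python
import math

def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def batch_norm(rows, eps=1e-5):
    # Standardize each feature (column) with the mini-batch mean and variance.
    n = len(rows)
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        for i in range(n):
            out[i][j] = (rows[i][j] - mean) / math.sqrt(var + eps)
    return out

def rnn_step(W_x, W_h, x_batch, h_batch):
    # Only the input-to-hidden term is normalized; the recurrent term is left as is.
    normed = batch_norm([matvec(W_x, x) for x in x_batch])
    recurrent = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(n_row, r_row)]
            for n_row, r_row in zip(normed, recurrent)]

W_x = [[0.5, -1.0], [2.0, 0.3]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
x_batch = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]
h_batch = [[0.0, 0.0] for _ in x_batch]
h_next = rnn_step(W_x, W_h, x_batch, h_batch)
normed = batch_norm([matvec(W_x, x) for x in x_batch])
```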
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
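The stochastic rounding named in the limited-precision abstract above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter and the choice of a plain fixed-point grid without word-length saturation are assumptions made for the example.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits (illustrative).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the rounding is unbiased in
    expectation. Saturation to a finite word length is not modelled here.
    """
    scale = 2 ** frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale
```

Unlike round-to-nearest, a small update that is below half a grid step still has a non-zero chance of changing the stored value, which is one plausible reading of why the rounding scheme matters during training.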
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
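The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a small sketch. The abstract does not spell out the combination rule, so the convex blend below, the name `mixed_pool` and the mixing weight `alpha` are assumptions for illustration; in a real network `alpha` would be a learned parameter rather than a fixed argument.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region (illustrative).

    alpha = 1.0 gives pure max pooling, alpha = 0.0 pure average pooling.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg
```

For example, `mixed_pool([1, 2, 3, 6], 0.5)` sits halfway between the maximum of the region and its mean.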
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
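The stochastic pooling procedure described in the abstract above, picking one activation per pooling region according to a multinomial distribution given by the activities, can be sketched as follows. This is a hedged illustration rather than the authors' code: the name `stochastic_pool` is hypothetical, activations are assumed non-negative (for instance after a rectifier), and the all-zero case is handled by returning 0.0 as a convenience.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region (illustrative).

    Each activation is chosen with probability proportional to its value,
    so larger activations are picked more often but not always.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # nothing active in this region
    return rng.choices(region, weights=region, k=1)[0]
```

With a region `[1.0, 3.0]` the larger activation is selected about three times out of four, in contrast to max pooling, which would select it every time.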
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al.,
2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
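The dropout abstract above refers to randomly dropping nodes of a one-hidden-layer network. A minimal Python sketch of that single step might look as follows. The rescaling of the surviving units (often called inverted dropout), the function name `dropout`, and the sample activations are assumptions made for illustration and are not taken from the paper.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each hidden activation independently with probability drop_prob.

    Surviving activations are divided by the keep probability so that the
    expected value of each unit is unchanged. That rescaling is an assumption
    of this sketch; the abstract only says that nodes are dropped at random.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Hypothetical hidden-layer activations for one training example.
rng = random.Random(0)
hidden = [0.5, -1.2, 3.0, 0.7, 2.1, -0.4]
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

Each entry of `dropped` is either zero or the original activation divided by the keep probability; a fresh mask would be drawn at every gradient step.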
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
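The multilayer bootstrap network abstract above describes each layer as a group of mutually independent k-centers clusterings whose centers are randomly sampled data points. One layer could be sketched in Python roughly as below. The nearest-center assignment, the one-hot output encoding, the squared Euclidean distance, and all function names are assumptions of this sketch; the abstract itself does not specify them.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_layer(data, num_clusterings, k, rng):
    """Build one layer: every clustering gets k centers sampled from the data."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(point, layer):
    """Concatenate one one-hot code per clustering, marking the nearest center."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Hypothetical toy data: 50 two-dimensional Gaussian points.
rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = fit_layer(data, num_clusterings=4, k=3, rng=rng)
codes = [encode(p, layer) for p in data]
```

Under these assumptions every point is mapped to a sparse binary vector of length `num_clusterings * k`, and stacking further layers would mean calling `fit_layer` on `codes` in place of `data`.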
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
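The batch normalization abstract above distinguishes normalizing the input-to-hidden transition of an RNN from normalizing the hidden-to-hidden transition. The following Python sketch shows one recurrent step with mini-batch standardization applied to the input-to-hidden term only. It is an illustration under stated assumptions, not the authors' implementation: the learned scale and shift parameters of batch normalization are omitted, the tanh nonlinearity is assumed, and every name and number is hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of the mini-batch."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]


def matvec_batch(batch, weights):
    """Multiply each row vector by an (in_dim x out_dim) weight matrix."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in batch]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One tanh RNN step; only the input-to-hidden term is batch normalized."""
    input_term = batch_norm(matvec_batch(x_batch, w_xh))
    hidden_term = matvec_batch(h_batch, w_hh)  # deliberately left unnormalized
    return [[math.tanh(a + b) for a, b in zip(row_i, row_h)]
            for row_i, row_h in zip(input_term, hidden_term)]


# Hypothetical mini-batch of three examples with two input features.
x_batch = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
w_xh = [[0.5, -1.0], [1.0, 0.25]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
normalized = batch_norm(matvec_batch(x_batch, w_xh))
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh)
```

Because the statistics are computed across the mini-batch at a single time step, this sketch says nothing about how they should be shared or tracked across time steps.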
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
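As a rough illustration of the geometrically increasing sampling mentioned in the Hessian-free abstract above, the following Python sketch builds a schedule of sample sizes that grows by a constant factor until the full dataset is used. The function name `geometric_sample_sizes` and the parameters `initial` and `growth` are hypothetical; the abstract does not state the starting size or growth factor, so the values shown are assumptions for illustration only.

```python
import math


def geometric_sample_sizes(total, initial, growth):
    """Return sample sizes that grow geometrically and are capped at `total`.

    `initial` and `growth` are illustrative assumptions, not values
    taken from the abstract.
    """
    if not (0 < initial <= total) or growth <= 1.0:
        raise ValueError("need 0 < initial <= total and growth > 1")
    sizes = []
    size = float(initial)
    while True:
        n = min(total, math.ceil(size))  # never exceed the dataset size
        sizes.append(n)
        if n >= total:                   # stop once all data is in use
            break
        size *= growth                   # geometric increase per step
    return sizes


if __name__ == "__main__":
    print(geometric_sample_sizes(total=1000, initial=100, growth=2.0))
```

Early iterations under such a schedule touch only a small fraction of the data, which is one plausible way the cost of gradient and Krylov iterations could be reduced at the start of training.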
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
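To make the stochastic rounding idea in the limited-precision abstract above concrete, here is a minimal Python sketch. It rounds a real value to a fixed-point grid, rounding up with probability equal to the fractional distance from the lower grid point, so the result is unbiased in expectation. The function names and the split of the 16-bit word into 8 fractional bits are assumptions made for illustration; the abstract only states that a 16-bit fixed-point representation was used.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, otherwise rounded down.
    """
    eps = 2.0 ** -frac_bits
    scaled = x / eps
    lower = math.floor(scaled)
    p_up = scaled - lower          # in [0, 1)
    if rng.random() < p_up:
        lower += 1
    return lower * eps


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed fixed-point range.

    word_bits=16 follows the abstract; frac_bits=8 is an assumed split.
    """
    eps = 2.0 ** -frac_bits
    lo = -(2 ** (word_bits - 1)) * eps
    hi = (2 ** (word_bits - 1) - 1) * eps
    return min(hi, max(lo, stochastic_round(x, frac_bits, rng)))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5, but the average stays close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this coarse grid, so small updates would be lost systematically; the randomized scheme preserves them on average.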
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
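As a concrete instance of the link between Fisher information and the Cramér-Rao bound mentioned above, here is the standard textbook case of a single Bernoulli parameter $\theta$ estimated from $n$ independent samples. This worked example is an illustration added here and is not taken from the abstract; the symbols $\theta$, $\hat{\theta}$, and $n$ are assumptions introduced for it.

```latex
p(x;\theta) = \theta^{x}(1-\theta)^{1-x}, \qquad x \in \{0, 1\}

I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log p(x;\theta)\right)^{2}\right]
          = \frac{1}{\theta} + \frac{1}{1-\theta}
          = \frac{1}{\theta(1-\theta)}

\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{n\,I(\theta)} = \frac{\theta(1-\theta)}{n}
```

The sample mean attains this bound, so a parameter with large Fisher information is one that can be estimated with small variance, which is the sense of "confident" used by the CIF principle.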
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
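The "simple dropout based GD method for convex ERMs" mentioned above can be illustrated with a minimal NumPy sketch. It assumes logistic regression as the convex loss, dropout applied to input features with inverted scaling, and full-batch gradient descent; none of these choices are stated in the abstract, and the function name `dropout_gd_logreg` and its parameters are hypothetical.

```python
import numpy as np


def dropout_gd_logreg(X, y, keep_prob=0.8, lr=0.1, steps=200, seed=0):
    """Gradient descent on logistic loss with feature dropout (illustrative sketch).

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    Assumption: dropout masks individual features, rescaled by 1 / keep_prob.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Sample a fresh dropout mask for every example at every step.
        mask = rng.random((n, d)) < keep_prob
        X_drop = X * mask / keep_prob  # inverted dropout keeps E[X_drop] = X
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))  # predicted probabilities
        grad = X_drop.T @ (p - y) / n  # gradient of the mean logistic loss
        w -= lr * grad
    return w


def predict(X, w):
    """Predict labels without dropout, as is usual at test time."""
    return (X @ w > 0).astype(float)
```

For example, on synthetic linearly separable data, `predict(X, dropout_gd_logreg(X, y))` should recover most labels; the sketch says nothing about the stability or privacy guarantees analysed in the paper.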
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
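The layer structure described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched in NumPy. Several details are assumptions made only for illustration: squared Euclidean distance, a one-hot code per clustering, concatenating the codes as the next layer's input, and a decreasing number of centers per layer. The names `bootstrap_layer` and `multilayer_bootstrap` are hypothetical.

```python
import numpy as np


def bootstrap_layer(X, n_clusterings, k, rng):
    """One layer: n_clusterings independent k-centers clusterings.

    Each clustering picks k random rows of X as centers and encodes every
    row of X as a one-hot vector marking its nearest center (assumed encoding).
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center: (n, k).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), dist.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    # Concatenate the codes of all clusterings into the layer output.
    return np.concatenate(codes, axis=1)


def multilayer_bootstrap(X, ks=(8, 4, 2), n_clusterings=5, seed=0):
    """Stack layers, feeding each layer's output to the next one."""
    rng = np.random.default_rng(seed)
    H = X
    for k in ks:
        H = bootstrap_layer(H, n_clusterings, k, rng)
    return H
```

With `ks=(8, 4, 2)` and five clusterings per layer, the final representation has `5 * 2 = 10` columns, and every row contains exactly five ones, one per clustering.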
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
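To make the distinction between input-to-hidden and hidden-to-hidden transitions concrete, here is a minimal NumPy sketch of a tanh RNN in which only the input-to-hidden term is batch-normalized. It is an illustration under stated assumptions, not the authors' implementation: statistics are computed per time step over the mini-batch, no running averages are kept for test time, and the names `batch_norm`, `rnn_step_bn`, and `run_rnn_bn` are hypothetical.

```python
import numpy as np


def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta


def rnn_step_bn(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One tanh RNN step with batch norm on the input-to-hidden term only.

    x_t: (batch, input_dim), h_prev: (batch, hidden_dim).
    The hidden-to-hidden term h_prev @ W_h is left unnormalized.
    """
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h + b)


def run_rnn_bn(xs, W_x, W_h, b, gamma, beta):
    """Run the RNN over xs of shape (time, batch, input_dim); return the last state."""
    h = np.zeros((xs.shape[1], W_h.shape[0]))
    for x_t in xs:
        h = rnn_step_bn(x_t, h, W_x, W_h, b, gamma, beta)
    return h
```

Normalizing `x_t @ W_x` does not depend on the recurrent state, so in practice it could be computed for all time steps before the recurrence is unrolled.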
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
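The geometric sampling idea in the preceding abstract can be sketched as a schedule of sample sizes. This is only an illustration under assumed values: the abstract does not state the initial sample size or the growth factor, so `initial`, `growth` and the function name `geometric_sample_sizes` are hypothetical.

```python
import math


def geometric_sample_sizes(total, initial, growth):
    """Return sample sizes initial, initial*growth, ... capped at total.

    `initial` and `growth` are illustrative assumptions, not values
    reported in the abstract.
    """
    if not (0 < initial <= total) or growth <= 1.0:
        raise ValueError("need 0 < initial <= total and growth > 1")
    sizes = []
    k = 0
    while True:
        size = min(total, math.ceil(initial * growth ** k))
        sizes.append(size)
        if size == total:  # the full dataset is in use; stop growing
            return sizes
        k += 1
```

With 1000 training examples, an initial sample of 100 and a growth factor of 2, the early outer iterations would use 100, 200, 400 and 800 examples before switching to the full set.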
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
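Stochastic rounding, as named in the preceding abstract, can be sketched as follows. The sketch assumes a signed fixed-point format with `word_bits` total bits and `frac_bits` fractional bits and saturates on overflow; the split between integer and fractional bits and the function name `stochastic_round` are assumptions, not details given in the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation (away from the saturation limits).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower          # in [0, 1)
    q = lower + 1 if rng.random() < p_up else lower
    # Saturate to the representable signed range (assumed behaviour).
    q_min, q_max = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Unlike round-to-nearest, an update smaller than half a grid step is not always lost: it is applied occasionally, with a probability proportional to its size.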
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
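The claim that depth lets a piecewise linear network reuse computation exponentially often can be illustrated with a standard toy example. It is not the construction from the preceding abstract: it composes a "tent" map built from two ReLU units and counts the linear pieces of the result on [0, 1]. The function names and the dyadic grid are choices made for this sketch.

```python
def relu(x):
    return max(0.0, x)


def tent(x):
    """Two ReLU units: equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    """Apply the tent layer `depth` times (a depth-`depth` ReLU network)."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth, grid=1024):
    """Count linear pieces on [0, 1] by detecting slope changes.

    The grid is dyadic so that every breakpoint (a multiple of
    2**-depth) falls exactly on a grid point; use depth <= log2(grid).
    """
    xs = [k / grid for k in range(grid + 1)]
    ys = [deep_tent(x, depth) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * grid for i in range(grid)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
```

Each added layer uses only two more units but doubles the number of linear pieces, so `count_linear_pieces(depth)` grows as 2**depth while the unit count grows linearly.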
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
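The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes the simplest possible combination, a convex blend with one mixing weight; the name mixed_pool and the parameter alpha are hypothetical, and in a real network the weight would be learned rather than passed in by hand.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha is a hypothetical mixing weight in [0, 1]:
    alpha = 1 gives max pooling, alpha = 0 gives average pooling.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    region = [float(v) for v in region]
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


# Illustrative activations from a single pooling region.
region = [1.0, 2.0, 3.0, 6.0]
for alpha in (1.0, 0.0, 0.5):
    print(alpha, mixed_pool(region, alpha))
```

Because the output is a differentiable function of alpha, a weight of this kind could be trained by gradient descent alongside the other parameters, which is one plausible reading of "learning a pooling function".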
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
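The stochastic pooling procedure described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes non-negative activations (for example, rectifier outputs) so that they can be normalized into sampling probabilities, and the function name stochastic_pool and the all-zero fallback are assumptions made for the example.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region at random.

    Each activation is drawn with probability proportional to its value,
    i.e. from a multinomial distribution given by the activities in the
    region. Activations are assumed to be non-negative.
    """
    total = sum(region)
    if total == 0:
        # Assumption for this sketch: an all-zero region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
region = [0.0, 1.0, 3.0, 0.0]
samples = [stochastic_pool(region, rng) for _ in range(10)]
print(samples)  # mostly 3.0, sometimes 1.0, never 0.0
```

Zero activations are never selected, and larger activations are selected more often, so the procedure behaves like a noisy version of max pooling during training.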
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
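As a rough illustration of the "randomly dropping a few nodes" step described in the dropout abstract above, the following is a minimal sketch, not the authors' algorithm. The function name `dropout`, the inverted-dropout rescaling by 1/keep_prob, and the toy activation vector are all assumptions introduced here for illustration.

```python
import random


def dropout(activations, drop_prob, rng=random):
    """Zero each activation independently with probability drop_prob.

    Assumption: "inverted" dropout, where survivors are scaled by
    1 / keep_prob so the expected activation is unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]


# Toy usage: a hidden layer of 10000 units that all output 1.0.
rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)  # stays close to 1.0 on average
```

With drop_prob=0.5 each unit is either zeroed or doubled, so the mean activation is preserved in expectation while roughly half of the nodes are removed on any given pass.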
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
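The phrase "uses mini-batch statistics to standardize features" in the abstract above can be sketched as follows. This is only an assumed minimal form of batch normalization (no learned scale or shift, no running averages); the function name `batch_norm`, the `eps` constant and the toy batch are hypothetical and chosen for illustration.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature column using the mini-batch mean and variance.

    batch is a list of equal-length feature rows. eps is a small constant
    (an assumed value) that guards against division by zero.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)]
            for row in batch]


# Toy usage: two features on very different scales.
toy_batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(toy_batch)
```

After normalization both columns have approximately zero mean and unit variance, regardless of their original scale. In the recurrent setting the abstract discusses, the same operation would be applied to the input-to-hidden pre-activations at each time step.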
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
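To make the role of the rounding scheme in the low-precision abstract above more concrete, here is a small sketch contrasting round-to-nearest with stochastic rounding on a fixed-point grid. It is an illustration under assumptions, not the paper's implementation: the function names, the choice of 8 fractional bits, and the sample value 0.001 are all hypothetical.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x up or down to the fixed-point grid at random.

    The probability of rounding up equals the fractional distance from the
    lower grid point, so the rounded value equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


# Toy usage: a small update that is below half of the grid spacing (1/256).
rng = random.Random(0)
small_update = 0.001
nearest = round_to_nearest(small_update)
samples = [stochastic_round(small_update, rng=rng) for _ in range(20000)]
average = sum(samples) / len(samples)
```

Round-to-nearest maps the small update to exactly zero every time, so repeated small gradient contributions would be lost, whereas the stochastically rounded values average out close to the true value. This is one plausible reading of why the rounding scheme matters during training.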
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
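One way to read "combining max and average pooling" in the abstract above is a convex combination of the two results for each pooling window. The sketch below shows only the forward computation under that assumption. The mixing weight `alpha` is fixed by hand here, whereas the abstract describes it as learned, and the function names `pool2d` and `mixed_pool` are hypothetical.

```python
def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window of a 2D list."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def mixed_pool(alpha):
    """Return a window reducer: alpha * max + (1 - alpha) * average.

    alpha = 1 gives max pooling and alpha = 0 gives average pooling.
    Treating alpha as a fixed scalar is an assumption of this sketch.
    """
    def reduce_fn(window):
        return alpha * max(window) + (1 - alpha) * sum(window) / len(window)
    return reduce_fn


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    print(pool2d(image, 2, mixed_pool(0.5)))
```

Because the mixed output is differentiable in `alpha`, a training framework could update it by gradient descent along with the other weights; that step is left out here.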
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
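The stochastic pooling procedure in the abstract above can be sketched for the training-time forward pass as follows. The sketch assumes non-negative activations (for example, after a rectifier) and assumes the multinomial probabilities are the activations divided by their sum within each region. The abstract says only that the distribution is "given by the activities", so that normalization, the handling of all-zero regions, and the function names are assumptions.

```python
import random


def stochastic_pool_region(activations, rng):
    """Sample one activation with probability proportional to its value."""
    if any(a < 0 for a in activations):
        raise ValueError("this sketch assumes non-negative activations")
    if sum(activations) == 0:
        # Assumed convention: an all-zero region pools to zero.
        return 0.0
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool2d(image, size, rng):
    """Apply stochastic pooling to non-overlapping size x size windows."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(stochastic_pool_region(window, rng))
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    image = [[0.0, 2.0, 1.0, 1.0],
             [0.0, 0.0, 1.0, 5.0],
             [0.0, 0.0, 3.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
    print(stochastic_pool2d(image, 2, rng))
```

Larger activations are picked more often, but smaller non-zero ones still get through occasionally, which is the source of the regularizing noise.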
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data.
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
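For reference, the Cramér-Rao bound invoked in the CIF abstract above can be written for a scalar parameter in its standard textbook form. The symbols $\hat{\theta}$ (an unbiased estimator) and $I(\theta)$ (the Fisher information) are notation chosen here for illustration, not taken from the abstract.

```latex
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\,\log p(x;\theta)\right)^{2}\right]
```

A parameter with large Fisher information therefore admits unbiased estimates with small variance, which appears to be the sense in which the abstract calls such a parameter "confident".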
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
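The dropout abstract above refers to "a simple dropout based GD method for convex ERMs" without spelling it out. A minimal sketch of one common reading follows: stochastic gradient descent on logistic regression (a convex GLM) with input features dropped at random. The function name `dropout_sgd_logistic`, the inverted-dropout rescaling, and all hyperparameter values are assumptions made for illustration, not details from the paper.

```python
import math
import random


def dropout_sgd_logistic(data, labels, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Hypothetical sketch: SGD for logistic regression with feature dropout.

    data   -- list of feature vectors (lists of floats)
    labels -- list of 0/1 targets
    """
    rng = random.Random(seed)
    dim = len(data[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(data, labels):
            # Keep each feature with probability keep_prob and rescale the
            # survivors so the expected input is unchanged (inverted dropout).
            mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0
                    for _ in range(dim)]
            xd = [xi * m for xi, m in zip(x, mask)]
            z = sum(wi * xi for wi, xi in zip(w, xd))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the logistic loss with respect to w is (p - y) * xd.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, xd)]
    return w


# Toy data: the first feature indicates class 1, the second class 0.
toy_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_y = [1, 1, 0, 0]
weights = dropout_sgd_logistic(toy_x, toy_y)
```

On this toy set the learned weight on the first feature ends up positive and the weight on the second negative. The sketch does not attempt to reproduce the stability or privacy analysis described in the abstract.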
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
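The distinction the abstract above draws between input-to-hidden and hidden-to-hidden transitions can be sketched as follows. This is a minimal pure-Python illustration, assuming a plain tanh RNN cell without biases and training-mode statistics computed per time step. The helper names `batch_norm`, `matvec_batch`, and `rnn_step` are hypothetical, and the paper's exact formulation may differ.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature across the mini-batch, then scale and shift."""
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = gamma * (col[i] - mean) / math.sqrt(var + eps) + beta
    return out


def matvec_batch(batch, weights):
    """Multiply every row vector in the batch by an (in_dim x out_dim) matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in batch]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """h_t = tanh(BN(x_t W_xh) + h_{t-1} W_hh).

    Only the input-to-hidden term is normalized; the recurrent
    (hidden-to-hidden) term is left untouched.
    """
    in_term = batch_norm(matvec_batch(x_batch, w_xh))
    rec_term = matvec_batch(h_batch, w_hh)
    return [[math.tanh(a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(in_term, rec_term)]


# One step on a batch of two scalar inputs with a zero initial state.
h_next = rnn_step([[1.0], [3.0]], [[0.0], [0.0]], [[2.0]], [[0.5]])
```

At inference time one would normally replace the mini-batch statistics with running averages; that bookkeeping is omitted here.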
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
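The stochastic rounding mentioned in the preceding abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `frac_bits` parameter, and the omission of saturation to a fixed word length are all assumptions made for illustration. The sketch rounds a value to a fixed-point grid, rounding up with probability equal to the fractional distance above the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the expected result is x.
    Saturation to a fixed word length is omitted in this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up to the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001  # smaller than half the grid spacing 2**-8
    samples = [stochastic_round(update, 8, rng) for _ in range(20000)]
    print("round to nearest:", round_to_nearest(update))
    print("mean of stochastic rounding:", sum(samples) / len(samples))
```

With round-to-nearest, an update smaller than half the grid spacing is always lost, whereas the stochastically rounded update survives on average, which is one intuition for why the rounding scheme matters during low-precision training.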
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
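The first direction in the preceding abstract, combining max and average pooling, can be sketched as follows. This is one plausible reading rather than the authors' formulation: the names `mixed_pool` and `gated_pool`, the scalar mixing proportion, and the sigmoid gate are assumptions for illustration, and in practice the mixing parameters would be learned by backpropagation rather than set by hand.

```python
import math


def mixed_pool(region, mix):
    """Convex combination of max and average pooling over one region.

    `mix` in [0, 1] is a hypothetical mixing proportion:
    1.0 gives max pooling, 0.0 gives average pooling.
    """
    return mix * max(region) + (1.0 - mix) * sum(region) / len(region)


def gated_pool(region, gate_weights):
    """Variant where the mixing proportion depends on the region itself.

    The proportion is a sigmoid of a dot product between hypothetical
    gate weights and the region's activations.
    """
    z = sum(w * x for w, x in zip(gate_weights, region))
    mix = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, mix)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(region, 1.0))   # pure max pooling
    print(mixed_pool(region, 0.0))   # pure average pooling
    print(gated_pool(region, [0.0, 0.0, 0.0, 0.0]))  # gate at 0.5
```

A fixed proportion applies the same blend everywhere, while the gated variant lets the blend respond to the content of each pooling region.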
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
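The stochastic pooling procedure described in the preceding abstract can be sketched for a one-dimensional input. This is a minimal illustration under stated assumptions, not the authors' code: the helper names are hypothetical, activations are assumed non-negative (for example, after a rectifier), and an all-zero region is mapped to zero.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. a multinomial given by the activities within the region.
    Assumes non-negative activations.
    """
    total = sum(activations)
    if total == 0:
        return 0.0  # nothing to sample from an all-zero region
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_1d(row, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of `size`."""
    return [
        stochastic_pool_region(row[i:i + size], rng)
        for i in range(0, len(row) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    row = [1.0, 3.0, 0.0, 0.0, 2.0, 2.0]
    print(stochastic_pool_1d(row, 2, rng))
```

In the first window, the activation 3.0 is selected about three times out of four, so strong activations dominate while weaker ones are still occasionally propagated.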
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
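The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the projections of the lower-layer units have already been computed and omits any learned offsets; the function name lp_unit and the sample values are hypothetical and are not taken from the paper. With p = 1 the normalized norm is the average of absolute values, with p = 2 it is the root-mean-square, and for large p it approaches the maximum absolute value.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of already-computed projection values.

    Hypothetical helper for illustration only; assumes p >= 1 and a
    non-empty list of floats.
    """
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


# Illustrative projection values (made up for this sketch).
z = [1.0, -2.0, 3.0, -4.0]

mean_abs = lp_unit(z, 1)    # average pooling over absolute values
rms = lp_unit(z, 2)         # root-mean-square pooling
near_max = lp_unit(z, 64)   # moves toward max(|z|) as p grows
```

Because of the 1/N normalization, the large-p value stays slightly below the true maximum for any finite p.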
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
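The dropout-based gradient descent for convex ERMs mentioned in the preceding abstract can be illustrated with a minimal sketch. It assumes logistic regression (one example of a GLM) with inverted dropout applied to the input features; the function names, keep probability, learning rate, and epoch count are illustrative assumptions, and this is not the algorithm analyzed in the paper.

```python
import math
import random


def dropout_sgd_logistic(data, labels, keep_prob=0.8, lr=0.1, epochs=50, seed=0):
    """Fit logistic-regression weights with SGD, dropping input features at random."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(data, labels):
            # Inverted dropout: zero each feature with probability 1 - keep_prob
            # and rescale the survivors so the expected input is unchanged.
            mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0
                    for _ in range(dim)]
            xd = [xi * m for xi, m in zip(x, mask)]
            z = sum(wi * xi for wi, xi in zip(w, xd))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the logistic loss with respect to w is (p - y) * xd.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, xd)]
    return w


def predict(w, x):
    """Predict class 1 when the linear score is positive (no dropout at test time)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# Toy, linearly separable data (hypothetical values chosen for illustration).
X = [[1.0, 0.0], [2.0, 0.5], [-1.0, 0.0], [-2.0, -0.5]]
Y = [1, 1, 0, 0]
weights = dropout_sgd_logistic(X, Y)
```

Dropout is applied only during training here; predictions use the full input, which is the usual convention when the training-time mask is rescaled by 1 / keep_prob.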
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
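To make the input-to-hidden variant from the preceding abstract concrete, here is a minimal sketch of one recurrent step in which mini-batch statistics standardize only the input-to-hidden term, while the hidden-to-hidden term is left unnormalized. It assumes a plain tanh RNN and omits the learned scale and shift parameters that batch normalization normally carries; all names (`batch_norm`, `rnn_step`, `w_xh`, `w_hh`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) using mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def rnn_step(x_t, h_prev, w_xh, w_hh):
    """One tanh RNN step: h_t = tanh(BN(x_t W_xh) + h_prev W_hh)."""
    xh = batch_norm(matmul(x_t, w_xh))   # normalized input-to-hidden term
    hh = matmul(h_prev, w_hh)            # hidden-to-hidden term, not normalized
    return [[math.tanh(a + b) for a, b in zip(row_x, row_h)]
            for row_x, row_h in zip(xh, hh)]


# A batch of three 2-dimensional inputs (hypothetical values).
x_t = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
h_prev = [[0.0, 0.0] for _ in x_t]
w_xh = [[0.5, -0.2], [0.1, 0.3]]
w_hh = [[1.0, 0.0], [0.0, 1.0]]
h_t = rnn_step(x_t, h_prev, w_xh, w_hh)
```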
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
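
The stochastic rounding idea in the abstract above can be illustrated with a short sketch. This is not the authors' implementation: the function names, the signed word layout (`int_bits` integer bits plus `frac_bits` fractional bits) and the saturation rule are assumptions made for illustration. The point it shows is that rounding up with probability equal to the fractional remainder makes the rounded value correct on average, which round-to-nearest does not guarantee.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def to_fixed_point(x, int_bits, frac_bits, rng):
    """Stochastically round x, then saturate to a signed fixed-point range.

    Assumed (hypothetical) format: int_bits integer bits including sign,
    frac_bits fractional bits.
    """
    step = 1.0 / (1 << frac_bits)
    upper = 2 ** (int_bits - 1) - step
    lower = -(2 ** (int_bits - 1))
    return min(upper, max(lower, stochastic_round(x, frac_bits, rng)))


rng = random.Random(0)
# 0.3 lies between the grid points 0.25 and 0.5 when frac_bits = 2.
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
saturated = to_fixed_point(1000.0, 8, 8, rng)
```

Averaged over many draws, `mean` stays close to 0.3 even though every individual sample is either 0.25 or 0.5; deterministic round-to-nearest would return 0.25 every time.
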
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
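
The first direction in the pooling abstract above, combining max and average pooling, can be sketched over a single pooling region. The abstract does not give the formulas, so the two forms below are assumptions for illustration: a "mixed" form with one mixing proportion `a`, and a "gated" form where the proportion is a sigmoid of a weighted sum of the region's values. The names `mixed_pool`, `gated_pool` and `gate_weights` are hypothetical, and in a real network `a` and `gate_weights` would be learned rather than fixed.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling with mixing proportion a in [0, 1]."""
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


def gated_pool(region, gate_weights):
    """Choose the mixing proportion from the region itself.

    Assumed form: a = sigmoid(dot(gate_weights, region)), so the blend
    responds to the values being pooled.
    """
    z = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, a)


region = [1.0, 2.0, 3.0, 6.0]          # one flattened 2x2 pooling region
pure_max = mixed_pool(region, 1.0)      # reduces to max pooling
pure_avg = mixed_pool(region, 0.0)      # reduces to average pooling
half_mix = mixed_pool(region, 0.5)
gated = gated_pool(region, [0.0, 0.0, 0.0, 0.0])  # zero gate gives a = 0.5
```

Setting `a` to 1 or 0 recovers plain max or average pooling, which is why a learned blend can only match or extend the two conventional operations on the training objective.
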
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
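The stochastic pooling procedure described in the abstract above can be sketched as follows. This is an illustration only: it assumes non-negative activations (for example, outputs of rectifier units), a single flat pooling region, and a uniform fallback when every activation is zero. The function names are hypothetical.

```python
import random


def pooling_probabilities(region):
    """Normalize the activations in a region into multinomial probabilities."""
    total = sum(region)
    if total == 0:
        # Assumption: fall back to a uniform choice when nothing is active.
        return [1.0 / len(region)] * len(region)
    return [a / total for a in region]


def stochastic_pool(region, rng):
    """Pick one activation from the region, with probability proportional to its value."""
    probs = pooling_probabilities(region)
    return rng.choices(region, weights=probs, k=1)[0]


rng = random.Random(0)
region = [1.0, 1.0, 2.0]
print(pooling_probabilities(region))  # [0.25, 0.25, 0.5]
print(stochastic_pool(region, rng))   # one of the three activations
```

Larger activations are picked more often, but smaller ones still get through occasionally, which is where the regularizing effect comes from.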
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
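The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically with a small sketch. It assumes the projections from the layer below are already computed and passed in as a plain list, and it leaves out any learned offsets or a learned order $p$; the name `lp_unit` is hypothetical.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a group of projected inputs.

    Computes (mean(|z_i| ** p)) ** (1 / p). Treating p as a fixed argument
    is a simplification; the abstract describes learning it per unit.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    mean_power = sum(abs(z) ** p for z in projections) / len(projections)
    return mean_power ** (1.0 / p)


z = [1.0, 2.0, 3.0]
print(lp_unit(z, 1))    # average of absolute values
print(lp_unit(z, 2))    # root-mean-square
print(lp_unit(z, 200))  # approaches max(|z_i|) as p grows
```

Varying `p` between these cases moves the unit smoothly from average-like to max-like behavior, which is the property the abstract relies on when it argues for learning a separate order per unit.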
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
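The link between Fisher information and estimator variance that the Confident-Information-First abstract relies on can be written out as a short worked example. This is the standard scalar Cramér-Rao bound for n i.i.d. samples, not the paper's own derivation; the Bernoulli case is added here purely as an illustration.

```latex
% Fisher information of a scalar parameter \theta (single observation)
I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}
            \left[ \left( \frac{\partial}{\partial \theta} \log p(x;\theta) \right)^{2} \right]

% Cramer-Rao bound for any unbiased estimator \hat{\theta} built from n i.i.d. samples
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{n \, I(\theta)}

% Illustration: Bernoulli variable with P(x = 1) = p
I(p) = \frac{1}{p(1-p)}
\quad\Longrightarrow\quad
\operatorname{Var}(\hat{p}) \;\ge\; \frac{p(1-p)}{n}
% The sample mean attains this bound, so a parameter with large Fisher
% information is one that can be estimated with small variance.
```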
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
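As a rough illustration of what a "dropout based GD method for convex ERMs" might look like, the sketch below runs gradient descent on logistic regression while randomly dropping input features at each step. The abstract does not specify the algorithm, so the inverted-dropout scaling, the `keep_prob`, `lr` and `epochs` values, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def dropout_gd_logreg(X, y, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Gradient descent on logistic loss with dropout applied to the inputs.

    Hypothetical sketch: each step draws a fresh feature mask and rescales
    the kept features by 1/keep_prob (inverted dropout).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mask = rng.random(X.shape) < keep_prob      # True where a feature is kept
        X_drop = X * mask / keep_prob               # keeps E[X_drop] equal to X
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))     # predicted probabilities
        grad = X_drop.T @ (p - y) / len(y)          # gradient of the mean logistic loss
        w -= lr * grad
    return w

# Synthetic, linearly separable data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = (X @ w_true > 0).astype(float)

w = dropout_gd_logreg(X, y)
accuracy = ((X @ w > 0) == (y == 1)).mean()         # training accuracy without dropout
```

At prediction time the full, unmasked features are used; because of the 1/keep_prob rescaling no further weight adjustment is needed.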
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
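The layer structure described in the multilayer bootstrap network abstract (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched roughly as follows. The abstract does not say how a clustering encodes its input, so the one-hot nearest-center encoding, the concatenation across clusterings, and the name `bootstrap_layer` are assumptions of this example.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings, k, rng):
    """Encode X with several independent k-centers clusterings.

    Each clustering draws k rows of X at random as its centers. Every input
    is mapped to a one-hot indicator of its nearest center (assumed
    encoding), and the indicators of all clusterings are concatenated.
    """
    codes = []
    for _ in range(n_clusterings):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        codes.append(np.eye(k)[d2.argmin(axis=1)])
    return np.concatenate(codes, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                            # toy data, illustrative only
H1 = bootstrap_layer(X, n_clusterings=20, k=30, rng=rng)  # first layer
H2 = bootstrap_layer(H1, n_clusterings=20, k=10, rng=rng) # second layer, fewer centers
```

Stacking layers with a shrinking `k` is one plausible way to build the network; the schedule used here is arbitrary.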
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
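To make the distinction between input-to-hidden and hidden-to-hidden normalization concrete, here is a minimal NumPy sketch of a tanh RNN step that batch-normalizes only the input-to-hidden term. It is an illustration under assumptions, not the paper's implementation: the function names, layer sizes, and the use of per-step mini-batch statistics (with no running averages for test time) are all choices made for this example.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_xh, W_hh, b, gamma, beta):
    """One tanh RNN step with batch normalization on the input-to-hidden term only."""
    input_term = batch_norm(x_t @ W_xh, gamma, beta)  # normalized
    recurrent_term = h_prev @ W_hh                    # left unnormalized
    return np.tanh(input_term + recurrent_term + b)

# Toy dimensions and random data (illustrative only).
rng = np.random.default_rng(0)
batch, n_in, n_hid, steps = 8, 5, 4, 3
W_xh = rng.normal(scale=0.1, size=(n_in, n_hid))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

xs = rng.normal(size=(steps, batch, n_in))
h = np.zeros((batch, n_hid))
for x_t in xs:
    h = rnn_step(x_t, h, W_xh, W_hh, b, gamma, beta)
```

Moving the `batch_norm` call onto `recurrent_term` instead would give the hidden-to-hidden variant that the abstract reports as unhelpful.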
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
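The stochastic rounding idea mentioned in the abstract above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the function name stochastic_round, the split of the 16-bit word into 8 fractional bits, and the saturation behavior are all assumptions chosen for the example.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation as long as no saturation occurs.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer (assumed).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale
```

With round-to-nearest, an update smaller than half the grid step is always rounded to zero; with this scheme it survives with a probability proportional to its size, so many small updates still accumulate on average.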
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
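The link between Fisher information and estimator variance invoked in the CIF abstract above can be made concrete with a standard textbook case. This worked example uses a single Bernoulli parameter and is an illustration only; it is not taken from the abstract. For $n$ i.i.d. samples $X_1, \dots, X_n \sim \mathrm{Bernoulli}(\theta)$:

```latex
\begin{align*}
I(\theta) &= \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log P(X;\theta)\right)^{2}\right]
           = \frac{1}{\theta(1-\theta)}, \\
\operatorname{Var}(\hat{\theta}) &\ge \frac{1}{n\,I(\theta)} = \frac{\theta(1-\theta)}{n}
           \quad\text{for any unbiased estimator } \hat{\theta}, \\
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) &= \frac{\theta(1-\theta)}{n}.
\end{align*}
```

The sample mean attains the bound, so a parameter with large Fisher information can be estimated with small variance, which is the sense in which the CIF principle treats such a parameter as confidently estimated.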
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
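To make the idea of a dropout based GD method for a convex ERM more concrete, here is a minimal Python sketch. It applies dropout to the input features while running stochastic gradient descent on a squared loss. This is a generic illustration under stated assumptions, not the algorithm analysed in the abstract above: the function name `dropout_sgd`, the inverted-dropout rescaling, the squared loss, and the toy data are all hypothetical choices.

```python
import math
import random


def dropout_sgd(data, dim, keep_prob=0.8, lr=0.05, epochs=300, seed=0):
    """SGD on 0.5 * (w . x - y)^2 with dropout applied to input features."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            # Drop each feature independently; rescale the survivors so the
            # masked input equals x in expectation (inverted dropout).
            x_drop = [xi / keep_prob if rng.random() < keep_prob else 0.0
                      for xi in x]
            err = sum(wi * xi for wi, xi in zip(w, x_drop)) - y
            # Gradient of the squared loss with respect to w is err * x_drop.
            w = [wi - lr * err * xi for wi, xi in zip(w, x_drop)]
    return w


# Hypothetical noiseless data generated by y = 2 * x0 - 1 * x1.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]

w_plain = dropout_sgd(data, dim=2, keep_prob=1.0)    # no dropout: plain SGD
w_dropout = dropout_sgd(data, dim=2, keep_prob=0.8)  # dropout-regularized run
```

With `keep_prob=1.0` the mask is never applied and the routine reduces to ordinary SGD, which recovers the generating weights on this noiseless data. The dropout run injects feature noise at every step, so its weights generally differ from the plain solution.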
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
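As a rough illustration of the mini-batch standardization described in the abstract above, the NumPy sketch below applies batch normalization only to the input-to-hidden term of a single recurrent step. The plain tanh cell, the learned scale and shift (`gamma`, `beta`), the constant `eps`, and the function names are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_xh, W_hh, b, gamma, beta):
    """One tanh RNN step; only the input-to-hidden term is batch-normalized."""
    return np.tanh(batch_norm(x_t @ W_xh, gamma, beta) + h_prev @ W_hh + b)

# Hypothetical sizes, chosen only to exercise the functions.
rng = np.random.default_rng(0)
batch, n_in, n_hid = 32, 8, 16
x_t = rng.normal(size=(batch, n_in))
h_prev = np.zeros((batch, n_hid))
W_xh = rng.normal(scale=0.1, size=(n_in, n_hid))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
gamma, beta, b = np.ones(n_hid), np.zeros(n_hid), np.zeros(n_hid)

h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b, gamma, beta)
```

The hidden-to-hidden term `h_prev @ W_hh` is deliberately left unnormalized here, mirroring the variant the abstract reports as the more helpful one.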
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
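The stochastic rounding idea in the abstract above can be sketched in a few lines of NumPy: a value is rounded up to the next fixed-point grid point with probability equal to its fractional distance from the grid point below, so the rounding is unbiased in expectation. The function name, the `frac_bits` / `word_bits` parameters, and the saturation behaviour are assumptions made for this illustration, not the paper's exact number format.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    Each value is rounded up with probability equal to its fractional
    distance above the lower grid point, and rounded down otherwise.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the range representable in word_bits (assumed two's complement).
    q_min, q_max = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(q, q_min, q_max) / scale

# With frac_bits=2 the grid step is 0.25, so 0.3 lands on 0.25 or 0.5.
rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
rounded = stochastic_round_fixed_point(x, frac_bits=2, rng=rng)
```

Round-to-nearest would map every 0.3 to 0.25 and lose the remainder each time; with stochastic rounding the average of `rounded` stays close to 0.3, which is the property that lets small updates survive on a coarse grid.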
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit parallel computation well. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
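One way to read the first direction in the pooling abstract above (combining max and average pooling) is as a weighted blend of the two operations. The following is a minimal sketch under that assumption only; the function name `mixed_pool` and the mixing weight `alpha` are hypothetical, and `alpha` is fixed here rather than learned as the abstract describes.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    Assumption (not stated in the abstract): the combination is the
    convex mix alpha * max + (1 - alpha) * mean, with alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mean_value = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * mean_value


# alpha = 1 recovers max pooling, alpha = 0 recovers average pooling.
region = [1.0, 2.0, 3.0, 6.0]
print(mixed_pool(region, 1.0), mixed_pool(region, 0.0), mixed_pool(region, 0.5))
```

In a trainable layer, `alpha` would be a parameter updated by gradient descent; that part is omitted from this sketch.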
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
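The stochastic pooling procedure in the abstract above (pick one activation per pooling region, with probabilities given by the activities in that region) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, activations are assumed non-negative (for example, after a rectifier), regions are assumed non-overlapping, and returning 0 for an all-zero region is an assumption made here.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation with probability proportional to its value.

    Assumes every value in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        # Assumption: with no activity in the region, the pooled value is 0.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_2d(feature_map, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(stochastic_pool(region, rng))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
print(stochastic_pool_2d(feature_map, size=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the sampling reproducible; larger activations are picked more often, but smaller non-zero ones can still be selected, which is the source of the regularizing noise.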
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
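The normalized $L_p$ norm described in the abstract above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function name lp_unit, the projection matrix W, the offset vector c, and the exact form (mean of |W x - c|^p, raised to 1/p) are hypothetical choices made here to show how one unit can interpolate between average-like, root-mean-square and max-like pooling as p changes.

```python
import numpy as np


def lp_unit(x, W, c, p):
    """Hypothetical normalized L_p unit (illustrative form, not from the paper's code).

    x : input vector, shape (d,)
    W : projection matrix, shape (n, d); each row is one linear projection
    c : offset vector, shape (n,)
    p : order of the norm, p >= 1
    """
    z = W @ x - c                                  # n projected signals
    return np.mean(np.abs(z) ** p) ** (1.0 / p)    # normalized L_p norm


# Two projections that simply pass the inputs 3 and 4 through.
x = np.array([3.0, 4.0])
W = np.eye(2)
c = np.zeros(2)

mean_abs = lp_unit(x, W, c, p=1)     # mean of absolute values
rms = lp_unit(x, W, c, p=2)          # root-mean-square
near_max = lp_unit(x, W, c, p=200)   # approaches max(|z|) as p grows
```

With p = 1 the unit returns the mean absolute projection, with p = 2 the root-mean-square, and for large p the value approaches the largest absolute projection, which is the sense in which the unit generalizes the pooling operators listed in the abstract.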
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
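As a rough illustration of the "simple dropout based GD method for convex ERMs" mentioned above, the following sketch takes one stochastic gradient step on a squared loss while randomly masking input features. The squared loss, the inverted-dropout rescaling by 1/keep_prob, and the names dropout_sgd_step, keep_prob and lr are assumptions made for this example; they are not taken from the abstract and need not match the authors' algorithm.

```python
import random


def dropout_sgd_step(w, x, y, lr=0.1, keep_prob=0.5, rng=random):
    """One SGD step on 0.5 * (w.x - y)^2 with a random feature mask.

    Hypothetical helper: each input feature is kept with probability
    keep_prob and rescaled by 1 / keep_prob (inverted dropout), so the
    masked input equals x in expectation.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0 for _ in x]
    x_drop = [m * xi for m, xi in zip(mask, x)]
    pred = sum(wi * xi for wi, xi in zip(w, x_drop))
    err = pred - y
    # Gradient of the squared loss with respect to w, using the masked input.
    return [wi - lr * err * xi for wi, xi in zip(w, x_drop)]
```

With keep_prob=1.0 no feature is dropped and the step reduces to plain SGD on the squared loss, which gives a simple sanity check for the update rule.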
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
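To make the input-to-hidden variant described above concrete, here is a minimal sketch of one recurrent step in which only the input projection is standardized with mini-batch statistics. The tanh nonlinearity, the fixed scale and shift (gamma, beta), the precomputed projections, and the function names batch_norm and rnn_step are assumptions for illustration only, not details given in the abstract.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over a mini-batch (list of equal-length rows)."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dims)
        ]
        for row in batch
    ]


def rnn_step(x_proj_batch, h_proj_batch):
    """Compute h_t = tanh(BN(W_x x_t) + W_h h_{t-1}) for a mini-batch.

    Both arguments hold precomputed projections (W_x x_t and W_h h_{t-1});
    only the input-to-hidden term is batch-normalized.
    """
    normed = batch_norm(x_proj_batch)
    return [
        [math.tanh(a + b) for a, b in zip(nx, hp)]
        for nx, hp in zip(normed, h_proj_batch)
    ]
```

The hidden-to-hidden term is deliberately left unnormalized here, mirroring the observation in the abstract that normalizing that transition did not help training.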
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
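The idea of geometrically increasing the amount of data used per iteration, mentioned above, can be sketched as a simple schedule. The starting size, the growth factor, the cap at the full dataset, and the name geometric_sample_sizes are hypothetical choices for this example; the abstract does not specify them.

```python
def geometric_sample_sizes(initial, growth, total):
    """Yield per-iteration sample counts that grow geometrically up to total.

    Hypothetical schedule: start with `initial` examples, multiply by
    `growth` after every iteration, and stop once the full dataset of
    `total` examples is in use.
    """
    if initial < 1 or growth <= 1.0 or total < 1:
        raise ValueError("need initial >= 1, growth > 1 and total >= 1")
    size = float(initial)
    while True:
        n = min(total, int(size))
        yield n
        if n >= total:
            return
        size *= growth
```

For example, list(geometric_sample_sizes(100, 2.0, 1000)) gives [100, 200, 400, 800, 1000], so early iterations touch only a small fraction of the data.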
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
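Stochastic rounding to a fixed-point grid, as referred to above, can be sketched in a few lines. The split of the 16-bit word into 8 fractional bits, the saturation behaviour, and the name stochastic_round_fixed are assumptions made for this illustration; the abstract only states that a 16-bit fixed-point representation with stochastic rounding was used.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the grid and down otherwise, so the rounding is unbiased
    in expectation (before saturation).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    if rng.random() < scaled - q:
        q += 1
    # Saturate to the range representable by a signed word of word_bits bits.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, q))
    return q / scale
```

Unlike round-to-nearest, small updates that fall below the grid spacing are not always lost: they survive with a probability proportional to their size, which is the property that matters when accumulating many small gradient steps.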
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et 
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
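For reference, the link between Fisher information and estimator variance invoked above is the standard Cramér-Rao bound. A minimal statement for a scalar parameter is sketched here; the symbols $\theta$, $\hat{\theta}$, $I(\theta)$ and $p(x;\theta)$ are our own notation, not taken from the abstract.

```latex
\mathrm{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\log p(x;\theta)\right)^{2}\right]
```

Here $\hat{\theta}$ is any unbiased estimator of $\theta$. A parameter with large Fisher information can thus be estimated with small variance, which is presumably the sense in which CIF treats it as "confident".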
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
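As a rough illustration of the input-to-hidden variant described above, the sketch below standardizes the input projection with mini-batch statistics and leaves the recurrent term un-normalized. It is a minimal pure-Python sketch under stated assumptions: the tanh cell, the scalar `gamma`/`beta`/`eps` parameters, and the names `batch_norm`, `matvec` and `rnn_step` are all hypothetical choices for illustration, not details given in the abstract.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature across the mini-batch (a list of equal-length rows)."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in range(dims)]
            for row in batch]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_batch, h_batch, W_x, W_h):
    """One step: h_t = tanh(BN(W_x x_t) + W_h h_{t-1}).

    Only the input-to-hidden term is batch-normalized (assumed cell form).
    """
    pre_x = batch_norm([matvec(W_x, x) for x in x_batch])
    pre_h = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(px, ph)]
            for px, ph in zip(pre_x, pre_h)]

# Tiny mini-batch of three 2-dimensional inputs with a zero initial state.
x_batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_batch = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
W_x = [[0.5, -0.5], [1.0, 1.0]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
h_next = rnn_step(x_batch, h_batch, W_x, W_h)
```

At test time one would normally replace the mini-batch statistics with running averages; that bookkeeping is omitted here to keep the sketch short.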
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
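The stochastic rounding mentioned in the preceding abstract can be illustrated with a minimal NumPy sketch. It assumes a fixed-point grid with step 2**-frac_bits and rounds up with probability equal to the fractional remainder, so the rounded value is unbiased in expectation. The function name, the `frac_bits` parameter, and the omission of word-length saturation are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a grid with step 2**-frac_bits (hypothetical helper).

    A value is rounded up with probability equal to its distance from the
    grid point below it (in grid units), so the expected result equals x.
    Saturation to a fixed word length is intentionally not modeled.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Fractional remainder in [0, 1) is the probability of rounding up.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) / scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = stochastic_round(np.full(10, 0.3), frac_bits=2, rng=rng)
    # Each sample is one of the two neighbouring grid points, 0.25 or 0.5.
    print(samples)
```

Round-to-nearest would map every 0.3 to 0.25 on this grid, while the stochastic scheme preserves the value on average, which is the property the abstract credits for successful low-precision training.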
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
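The stochastic pooling procedure described in the preceding abstract can be sketched for a single pooling region as follows. This is a minimal illustration that assumes non-negative activations (for example, after a rectifier) and samples one activation with probability proportional to its value. The function name and the choice to return zero for an all-zero region are assumptions made for this sketch, not details confirmed by the abstract.

```python
import numpy as np


def stochastic_pool_region(activations, rng=None):
    """Sample one activation from a pooling region (hypothetical helper).

    The sampling distribution is multinomial with probabilities given by
    the activations normalized to sum to one. Activations are assumed to
    be non-negative.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(activations, dtype=np.float64).ravel()
    total = a.sum()
    if total <= 0:
        # Assumption: a region with no positive activity pools to zero.
        return 0.0
    probs = a / total
    idx = rng.choice(a.size, p=probs)
    return float(a[idx])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    region = np.array([[1.0, 0.0], [0.0, 3.0]])
    # 3.0 is drawn about three times as often as 1.0; zeros are never drawn.
    print([stochastic_pool_region(region, rng) for _ in range(8)])
```

Unlike max pooling, which would always return 3.0 for this region, the sampled output occasionally passes the smaller activation through, which is the source of the regularizing noise at training time.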
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
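The claim in the abstract above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked with a small sketch. It assumes the unit computes (mean of |z_i|^p)^(1/p) over already-computed projections z_i; the abstract does not give the exact form, so this is an assumption. The function name is hypothetical, and learned projections and learned orders are left out.

```python
import math


def lp_unit(projections, p):
    """Normalized L_p norm of a list of projection values: (mean |z|^p)^(1/p)."""
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


z = [3.0, -4.0]
avg_like = lp_unit(z, 1)    # mean of absolute values
rms_like = lp_unit(z, 2)    # root-mean-square
max_like = lp_unit(z, 200)  # approaches max |z| as p grows
print(avg_like, rms_like, max_like)
```

With p = 1 the unit averages the absolute values, with p = 2 it gives the root-mean-square, and for large p it approaches the largest absolute value, matching the three pooling operators named in the abstract.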
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
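For reference, the Cramér-Rao bound that links Fisher information to estimator variance can be written as below for a scalar parameter $\theta$ and an unbiased estimator $\hat{\theta}$. This is the standard textbook form; the notation is an assumption made here and is not taken from the abstract.

```latex
\mathrm{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}
\left[ \left( \frac{\partial}{\partial \theta} \log p(x;\theta) \right)^{2} \right]
```

A parameter with large Fisher information can therefore, in principle, be estimated with small variance, which is one way to read "confident" in the CIF principle.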
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
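A minimal sketch of a dropout-based gradient descent method for a convex ERM, using logistic regression as the example GLM. The function names, the inverted-dropout rescaling, the hyperparameters, and the toy data are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dropout_gd_logistic(xs, ys, keep_prob=0.8, lr=0.5, epochs=200, seed=0):
    """Full-batch gradient descent on the logistic loss with dropout on input features.

    Hypothetical helper: each feature is kept with probability keep_prob and
    rescaled by 1 / keep_prob (inverted dropout), then a plain GD step is taken.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            # Fresh dropout mask for every example at every step.
            masked = [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]
            p = sigmoid(sum(wi * mi for wi, mi in zip(w, masked)))
            for j in range(dim):
                grad[j] += (p - y) * masked[j]
        w = [wi - lr * gj / len(xs) for wi, gj in zip(w, grad)]
    return w


def predict(w, x):
    # Dropout is switched off at prediction time.
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


# Toy, linearly separable data (assumed for illustration only).
toy_xs = [[2.0, 1.0], [1.5, 2.0], [1.0, 1.0], [-1.0, -2.0], [-2.0, -1.5], [-1.0, -1.0]]
toy_ys = [1, 1, 1, 0, 0, 0]
toy_w = dropout_gd_logistic(toy_xs, toy_ys)
print([predict(toy_w, x) for x in toy_xs])
```

The random mask injects noise into every gradient step, which is the mechanism the abstract credits with a stabilizing, regularizer-like effect; the sketch does not attempt to reproduce the stability or privacy guarantees.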
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
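A rough sketch of one layer of a multilayer bootstrap network as described above: a group of independent k-centers clusterings whose centers are randomly sampled data points. The function names, the Euclidean distance, the one-hot encoding of the nearest center, and the layer sizes are assumptions made for illustration and are not taken from the abstract.

```python
import random


def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_bootstrap_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering is k data points sampled at random as centers."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(layer, x):
    """Concatenate one one-hot code per clustering, marking the nearest center."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(x, centers[i]))
        code.extend(1.0 if i == nearest else 0.0 for i in range(len(centers)))
    return code


# Stack two layers on toy 2-D data (sizes are arbitrary assumptions).
rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer1 = fit_bootstrap_layer(data, num_clusterings=4, k=8, rng=rng)
hidden = [encode(layer1, x) for x in data]
layer2 = fit_bootstrap_layer(hidden, num_clusterings=4, k=3, rng=rng)
output = [encode(layer2, h) for h in hidden]
print(len(hidden[0]), len(output[0]))
```

No optimization is involved in building a layer: the only "training" is random sampling of centers, which is why prediction cost (many nearest-center searches) dominates and motivates compressing the network afterwards.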
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
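A small sketch of what applying batch normalization only to the input-to-hidden transition of a tanh RNN could look like. The function names and shapes are assumptions, and the learnable gain and shift parameters that batch normalization normally carries are omitted for brevity.

```python
import math


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def batch_norm(rows, eps=1e-5):
    """Standardize each feature (column) with the mini-batch mean and variance."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dim)]
    return [[(r[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
            for r in rows]


def rnn_step(x_batch, h_batch, W_x, W_h, b):
    """One step h_t = tanh(BN(W_x x_t) + W_h h_{t-1} + b).

    Only the input-to-hidden term is normalized; the hidden-to-hidden term is
    left untouched.
    """
    input_term = batch_norm([matvec(W_x, x) for x in x_batch])
    hidden_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(i + r + bj) for i, r, bj in zip(it, ht, b)]
            for it, ht in zip(input_term, hidden_term)]
```

Normalizing `W_x x_t` uses statistics computed across the mini-batch at a single time step; how those statistics should be shared across time steps is a design choice this sketch does not address.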
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
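The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stochastic_round` and the parameters `frac_bits` and `word_bits` are hypothetical, and the sketch assumes the common definition in which a value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, so that the rounded value is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    frac_bits and word_bits are illustrative parameters (assumptions),
    not values taken from the abstract.
    """
    eps = 2.0 ** -frac_bits          # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # which makes the result equal to x in expectation.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps
```

With `frac_bits=2` the step is 0.25, so 0.3 is rounded to either 0.25 or 0.5, and the average over many draws approaches 0.3. Round-to-nearest would instead always return 0.25 and silently discard the small remainder.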
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
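The claim above that depth lets a piecewise linear network re-use computation exponentially often can be illustrated with a small sketch. It uses a standard textbook construction, not the paper's own analysis: a "tent" map built from two ReLU units folds the interval [0, 1] onto itself, and composing it `depth` times yields 2^depth linear pieces. All function names here are hypothetical.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def tent(x):
    # One layer with two ReLU units: equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    # Compose the same folding layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(fn, grid_size=1024):
    """Count linear pieces of fn on [0, 1] by counting slope changes on a grid.

    Only reliable when every breakpoint lies on the grid, which holds for
    deep_tent with depth <= 10 when grid_size is 1024 (an assumption of
    this sketch, not a general-purpose method).
    """
    ys = [fn(i / grid_size) for i in range(grid_size + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(grid_size)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
```

For example, `count_linear_pieces(lambda x: deep_tent(x, 3))` counts the pieces of a three-layer composition. The layer count grows linearly while the piece count doubles with each layer.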
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data.
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
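To make the distinction in the batch normalization abstract above concrete, here is a minimal sketch of one recurrent step in which only the input-to-hidden term is standardized with mini-batch statistics, while the hidden-to-hidden term is left untouched. The tanh cell, the omission of the learnable scale and shift parameters, and the names batch_norm, matvec and rnn_step are simplifying assumptions for illustration, not the authors' implementation.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) using mini-batch statistics."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)]
            for row in batch]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_batch, h_batch, W_x, W_h, b):
    """One tanh RNN step over a mini-batch.

    Only W_x @ x is batch-normalized; W_h @ h is used as is.
    ASSUMPTION: learnable gain/shift parameters are omitted.
    """
    x_terms = batch_norm([matvec(W_x, x) for x in x_batch])
    h_terms = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + c + bias)
             for a, c, bias in zip(x_row, h_row, b)]
            for x_row, h_row in zip(x_terms, h_terms)]


normed = batch_norm([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
h_next = rnn_step([[1.0], [2.0]],
                  [[0.0, 0.0], [0.0, 0.0]],
                  [[1.0], [-1.0]],
                  [[0.0, 0.0], [0.0, 0.0]],
                  [0.0, 0.0])
```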
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
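The stochastic rounding mentioned in the limited-precision abstract above can be sketched in a few lines. A value is rounded up to the next fixed-point grid step with probability equal to its fractional remainder, which makes the rounding unbiased in expectation. The split of the 16-bit word into integer and fractional bits (frac_bits), the saturation behaviour, and the function name stochastic_round_fixed are assumptions for illustration; the abstract does not specify them.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    ASSUMPTION: frac_bits and the saturating clamp are illustrative
    choices, not values taken from the paper.
    """
    scale = 1 << frac_bits            # grid step is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected rounded value equals the input.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    quantized = max(min_int, min(max_int, lower))
    return quantized / scale


random.seed(0)
samples = [stochastic_round_fixed(0.3, frac_bits=2) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

With a grid step of 0.25, each sample is either 0.25 or 0.5, while the average over many samples stays close to 0.3. Round-to-nearest would instead always return 0.25 and lose the remainder.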
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
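As a rough illustration of the first direction in the pooling abstract above, one way to combine max and average pooling is a weighted mix of the two outputs. The abstract does not state the formula, so the convex-combination form, the function name `mixed_pool`, and the mixing weight `a` are assumptions made for this sketch; in a learned version `a` would be a trained parameter rather than a constant.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a hypothetical mixing weight in [0, 1]:
    a = 1 gives plain max pooling, a = 0 gives plain average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight must lie in [0, 1]")
    max_out = max(region)
    avg_out = sum(region) / len(region)
    return a * max_out + (1.0 - a) * avg_out


def mixed_pool_1d(activations, size, a):
    """Apply mixed_pool to non-overlapping windows of a 1-D sequence."""
    return [mixed_pool(activations[i:i + size], a)
            for i in range(0, len(activations) - size + 1, size)]


halfway = mixed_pool_1d([1.0, 3.0, 4.0, 0.0], size=2, a=0.5)
```

With `a` fixed at 0.5 each output sits halfway between the max and the mean of its window; making `a` a trainable parameter is what would let the network adapt the pooling behaviour.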
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
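The stochastic pooling procedure described in the abstract above can be sketched in a few lines: within each pooling region, one activation is sampled with probability proportional to its value. This is a minimal sketch, not the authors' implementation. It assumes non-negative activations (for example, outputs of rectifier units), non-overlapping 1-D regions, and an all-zero region mapping to 0.0; the function names are hypothetical.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a region, weighted by the activations themselves.

    Assumes every value in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from (assumed convention).
        return 0.0
    # random.choices normalizes the weights, so this draws from the
    # multinomial distribution given by the activities in the region.
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(activations, size, rng):
    """Apply stochastic_pool to non-overlapping windows of a 1-D sequence."""
    return [stochastic_pool(activations[i:i + size], rng)
            for i in range(0, len(activations) - size + 1, size)]


rng = random.Random(0)
pooled = stochastic_pool_1d([0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 3.0, 3.0], size=2, rng=rng)
```

The example input is chosen so that each region has only one possible outcome (a zero activation has zero probability of being picked), which makes the result independent of the random seed. At test time the original method would replace sampling with a deterministic aggregate; that step is not shown here.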
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
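The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes that "normalized $L_p$ norm" means the $p$-th root of the mean of the $p$-th powers of the absolute projection values; the abstract does not spell out the exact formula, and the learned projections and per-unit orders are left out here. The names `lp_unit` and `projections` are hypothetical and chosen only for illustration.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a group of projection values.

    Assumed form: ((1/N) * sum_i |z_i|**p) ** (1/p), with p >= 1.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(projections)
    if n == 0:
        raise ValueError("need at least one projection value")
    # Factor out the largest magnitude so that large p does not overflow.
    m = max(abs(z) for z in projections)
    if m == 0.0:
        return 0.0
    mean_pow = sum((abs(z) / m) ** p for z in projections) / n
    return m * mean_pow ** (1.0 / p)


z = [3.0, -4.0, 0.0, 1.0]
avg = lp_unit(z, 1)         # mean of absolute values
rms = lp_unit(z, 2)         # root-mean-square
near_max = lp_unit(z, 200)  # approaches max(|z_i|) as p grows
```

Under this assumption, p = 1 gives the mean of the absolute values, p = 2 gives the root-mean-square, and large p approaches the largest magnitude in the group, which is the sense in which a single unit covers average, root-mean-square and max pooling.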
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
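The geometrically increasing sampling idea in the Hessian-free abstract above can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' algorithm: the function name `geometric_sample_sizes`, the starting size, and the growth factor are all hypothetical, and the abstract does not say which values were used or how the samples were drawn.

```python
import math


def geometric_sample_sizes(total, initial, growth):
    """Return per-iteration sample sizes that grow geometrically up to `total`.

    All three parameters are illustrative assumptions:
      total   -- number of training examples available
      initial -- sample size used at the first iteration
      growth  -- multiplicative factor applied after each iteration (> 1)
    """
    if not 0 < initial <= total:
        raise ValueError("initial must be in (0, total]")
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1")

    sizes = []
    size = float(initial)
    while True:
        # Never request more data than exists.
        n = min(total, math.ceil(size))
        sizes.append(n)
        if n == total:
            break
        size *= growth
    return sizes


if __name__ == "__main__":
    # Early iterations touch little data; later ones use the full set.
    print(geometric_sample_sizes(total=1000, initial=100, growth=2.0))
```

Early iterations are cheap because they see only a fraction of the data, while later iterations, where accurate gradient and curvature estimates presumably matter more, use the full set.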
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
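The stochastic rounding scheme mentioned in the low-precision training abstract above can be sketched in a few lines. This is a minimal illustration of the general technique, not the paper's implementation: the function names and the `frac_bits` parameter are hypothetical, and saturation to a fixed word length (such as the 16-bit format the abstract mentions) is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point (in grid units), so the result equals x in expectation.
    Assumption: no overflow handling; a real fixed-point format would also
    clip to its representable range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up to the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001  # smaller than half a grid step (1/512)
    n = 20000
    mean_sr = sum(stochastic_round(update, 8, rng) for _ in range(n)) / n
    # Round-to-nearest always discards this update; stochastic rounding
    # keeps it on average.
    print(round_to_nearest(update, 8), mean_sr)
```

The comparison shows why the rounding scheme can matter during training: an update smaller than half the grid step is always lost under round-to-nearest, whereas stochastic rounding preserves it in expectation.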
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
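A minimal sketch of the first direction named in the pooling abstract above, combining max and average pooling. The specific blend `alpha * max + (1 - alpha) * avg`, the fixed scalar `alpha`, and the helper names `mixed_pool` and `pool_2d` are assumptions for illustration; the abstract says the combination is learned and does not give its exact form.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 pure average pooling.
    In a trained network alpha would be a learned parameter; here it is fixed.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


def pool_2d(feature_map, size, alpha):
    """Apply mixed_pool over non-overlapping size x size windows of a 2D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(mixed_pool(window, alpha))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    fmap = [[1.0, 2.0, 0.0, 0.0],
            [3.0, 6.0, 0.0, 4.0]]
    print(pool_2d(fmap, size=2, alpha=0.5))
```

Setting `alpha` to 0 or 1 recovers the two conventional pooling operations, which is why a blend of this kind can only add a small number of parameters per pooling layer.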
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
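The stochastic pooling procedure described in the abstract above can be sketched as follows: within one pooling region, a single activation is drawn with probability proportional to its value. This is a sketch under stated assumptions, not the authors' implementation: activations are assumed non-negative (for example after a ReLU), an all-zero region is assumed to pool to 0.0, and the name `stochastic_pool` and its `rng` parameter are hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability a_i / sum(a), i.e. a
    multinomial distribution given by the activities in the region.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    region = [1.0, 3.0]
    draws = [stochastic_pool(region, rng) for _ in range(10000)]
    # The larger activation should be picked roughly 75% of the time.
    print(draws.count(3.0) / len(draws))
```

Because sampling needs no tunable constants, the sketch is consistent with the abstract's remark that the approach is hyper-parameter free.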
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
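The input-to-hidden variant described in the preceding abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the helper names (batch_norm, matvec_batch, rnn_step), the tanh nonlinearity, the per-time-step statistics, and the omission of a learned gain and shift are all assumptions made for the example.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature with mini-batch mean and variance.

    batch is a list of equal-length feature lists (one list per example).
    The learned gain/shift usually paired with batch norm is omitted here.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


def matvec_batch(batch, weights):
    """Multiply every row of batch by weights (shape: in_dim x out_dim)."""
    columns = list(zip(*weights))
    return [
        [sum(x * w for x, w in zip(row, col)) for col in columns] for row in batch
    ]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One tanh RNN step; only the input-to-hidden term is batch-normalized."""
    x_term = batch_norm(matvec_batch(x_batch, w_xh))
    h_term = matvec_batch(h_batch, w_hh)  # hidden-to-hidden left unnormalized
    return [
        [math.tanh(a + b) for a, b in zip(x_row, h_row)]
        for x_row, h_row in zip(x_term, h_term)
    ]
```

Calling rnn_step once per time step with the previous hidden states would unroll the recurrence; whether statistics are shared across time steps is a design choice this sketch does not settle.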
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
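The stochastic rounding named in the preceding abstract can be illustrated with a minimal sketch. The abstract does not spell out the rounding rule, so this assumes the usual unbiased definition: round to one of the two nearest fixed-point grid values, choosing the upper one with probability equal to the fractional distance from the lower one. The function name `stochastic_round` and the `frac_bits` parameter are hypothetical, and saturation to a fixed word length is left out.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (assumed fixed-point grid).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased.
    Overflow/saturation is deliberately ignored in this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise keep the floor.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale

# Usage: 0.3 on a grid with step 0.25 lands on 0.25 or 0.5,
# and the average over many draws stays close to 0.3.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time and lose the remainder; the randomized rule keeps it in expectation, which is one plausible reading of why the rounding scheme matters for low-precision training.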
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
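One simple way to combine max and average pooling, as the preceding abstract describes in its first direction, is a convex mix of the two. This is only a sketch under assumptions: the abstract does not give the formula, so the mixing form, the function name `mixed_pool`, and the scalar `alpha` (which would be a learned parameter in a real network) are all hypothetical.

```python
def mixed_pool(region, alpha):
    """Convex combination of max and average pooling over one region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain
    average pooling. In a trained network alpha would be learned;
    here it is passed in by hand (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg

# Usage: max is 6.0 and the average is 3.0 for this region.
region = [1.0, 2.0, 3.0, 6.0]
half_mix = mixed_pool(region, 0.5)
```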
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
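The stochastic pooling procedure in the preceding abstract can be sketched for a single pooling region. The assumptions here are that activations are non-negative (for example after a rectifier) and that the multinomial probabilities are the activations normalized by their sum; the function name `stochastic_pool` and the handling of an all-zero region are hypothetical choices for this illustration.

```python
import random

def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value, which assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # all-zero region: nothing to sample from
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]

# Usage: with activations 1.0 and 3.0, the value 3.0 should be
# drawn about three times out of four; zeros are never drawn.
rng = random.Random(0)
region = [0.0, 1.0, 3.0, 0.0]
draws = [stochastic_pool(region, rng) for _ in range(10000)]
frac_three = draws.count(3.0) / len(draws)
```

Because the draw depends only on the activations themselves, the sketch has no tunable parameter, which matches the abstract's description of the approach as hyper-parameter free.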
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
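The stochastic rounding idea in the limited-precision abstract above can be illustrated with a minimal sketch. It assumes the usual definition (round up with probability equal to the fractional distance from the lower grid point). The function names, the 8 fractional bits and the toy update size are hypothetical choices for illustration, and word-length saturation is not modelled.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation. (Hypothetical helper, not taken from the abstract.)
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up onto the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def accumulate_small_updates(n_steps=10000, update=0.001, frac_bits=8, seed=0):
    """Add a tiny update repeatedly, re-quantizing after every step."""
    rng = random.Random(seed)
    w_nearest = 0.0
    w_stochastic = 0.0
    for _ in range(n_steps):
        w_nearest = round_to_nearest(w_nearest + update, frac_bits)
        w_stochastic = stochastic_round(w_stochastic + update, frac_bits, rng)
    return w_nearest, w_stochastic
```

With these toy settings the update is smaller than half a grid step, so round-to-nearest discards every update and the weight never moves, while the stochastically rounded weight tracks the true sum on average.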
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
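The claim in the abstract above, that depth lets a piecewise linear network re-use computation exponentially often, can be made concrete with a small sketch. The "tent" layer below is a standard textbook construction rather than anything stated in the abstract, and all names are hypothetical. Each layer uses two ReLU units to fold the interval [0, 1] onto itself, so stacking the layer doubles the number of linear regions.

```python
def relu(x):
    return max(0.0, x)


def tent_layer(x):
    """Two ReLU units whose weighted sum folds [0, 1] onto itself."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    """Compose the same folding layer `depth` times."""
    for _ in range(depth):
        x = tent_layer(x)
    return x


def count_linear_regions(depth, grid_bits=10):
    """Count maximal intervals of constant slope on [0, 1].

    A dyadic grid is used so that every breakpoint of the network
    (for depth <= grid_bits) falls exactly on a grid point and the
    finite-difference slopes are exact.
    """
    n = 1 << grid_bits
    values = [deep_tent(i / n, depth) for i in range(n + 1)]
    slopes = [(values[i + 1] - values[i]) * n for i in range(n)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
```

Calling count_linear_regions for increasing depth shows the region count doubling with each added layer, although the number of units grows only linearly.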
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
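The first direction named in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name mixed_pool and the mixing weight a are hypothetical, and a is a fixed scalar here, whereas the abstract describes learning the combination.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    a is an assumed mixing weight in [0, 1]:
    a = 1.0 reduces to max pooling, a = 0.0 to average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    avg = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * avg
```

In a trainable network the weight a would be a parameter updated by gradient descent; that part is left out of this sketch.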
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
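The stochastic pooling procedure described in the abstract above (pick one activation per pooling region according to a multinomial distribution given by the activities) can be sketched in a few lines. This is an illustration only: the name stochastic_pool is hypothetical, the activations are assumed to be non-negative (for example, outputs of rectified units), and the handling of an all-zero region is an assumption, not something the abstract specifies.

```python
import random

def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value.
    Assumes all values in `region` are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumed convention: an all-zero region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded random.Random instance as rng makes the sampling reproducible, which is convenient when checking the sketch.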
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
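The claim in the $L_p$ abstract above, that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling, can be checked with a small sketch. The function name lp_unit is hypothetical, the inputs z are assumed to be the already-projected signals, and p is a fixed number here although the abstract argues for learning a separate order per unit.

```python
def lp_unit(z, p):
    """Normalized L_p norm of the projected inputs z.

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)
```

For z = [1.0, 2.0, 3.0], p = 1 returns the plain average of the absolute values, and raising p moves the output towards 3.0, the largest entry.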
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
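For reference, the link between Fisher information and estimator variance invoked in the abstract above is the standard Cramér-Rao bound. A sketch for a scalar parameter follows; the symbols $\theta$, $\hat{\theta}$ and $I(\theta)$ are notation assumed here rather than taken from the abstract, and $\hat{\theta}$ is assumed unbiased.

```latex
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \theta} \log p(x;\theta)\right)^{2}\right]
```

Under this reading, a parameter with large Fisher information can be estimated with small variance, which is the sense in which its estimate is "confident".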
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
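The layer structure described in the abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the output encoding, so squared Euclidean distance and one-hot codes are assumptions made here, and the function names `kcenters_one_hot` and `bootstrap_layer` are hypothetical.

```python
import random


def kcenters_one_hot(data, k, rng):
    """One k-centers clustering: sample k data points as centers, then
    encode every point as a one-hot vector marking its nearest center.
    Squared Euclidean distance is an assumption of this sketch."""
    centers = rng.sample(data, k)
    codes = []
    for x in data:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        nearest = dists.index(min(dists))
        codes.append([1 if j == nearest else 0 for j in range(k)])
    return codes


def bootstrap_layer(data, k, num_clusterings, seed=0):
    """One layer: run several independent clusterings and concatenate
    their one-hot codes into a single sparse feature vector per point."""
    rng = random.Random(seed)
    per_clustering = [kcenters_one_hot(data, k, rng)
                      for _ in range(num_clusterings)]
    return [sum((codes[i] for codes in per_clustering), [])
            for i in range(len(data))]


# Toy data: three loose groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1),
          (4.9, 5.0), (9.0, 0.5), (9.2, 0.4)]
features = bootstrap_layer(points, k=3, num_clusterings=4)
```

Each row of `features` has `k * num_clusterings` entries with exactly one active entry per clustering. Stacking layers would feed `features` into another call with a different `k`; how `k` changes across layers is not stated in the abstract and is left open here.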
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
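As an illustration of the input-to-hidden variant mentioned in the abstract above, the following minimal sketch standardizes only the input projection with mini-batch statistics and leaves the hidden-to-hidden term untouched. It is not the authors' implementation: the tanh recurrence, the omission of learned scale and shift parameters, and the names `batch_norm`, `rnn_step`, `w_xh` and `w_hh` are all assumptions made for the example.

```python
import math


def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of the mini-batch."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n
                 for d in range(dims)]
    return [[(row[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dims)] for row in batch]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One recurrent step: h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1}).
    Only the input-to-hidden projection is normalized."""
    x_proj = batch_norm([matvec(w_xh, x) for x in x_batch])
    h_proj = [matvec(w_hh, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(xp, hp)]
            for xp, hp in zip(x_proj, h_proj)]


# Mini-batch of two 2-D inputs, two hidden units, zero initial state.
x_batch = [[1.0, 10.0], [3.0, 30.0]]
h_batch = [[0.0, 0.0], [0.0, 0.0]]
w_xh = [[1.0, 0.0], [0.0, 1.0]]
w_hh = [[0.5, 0.0], [0.0, 0.5]]
normed = batch_norm(x_batch)
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh)
```

After normalization each feature column of `normed` has zero mean across the mini-batch, regardless of the very different scales of the two input features.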
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
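Stochastic rounding, which the limited-precision abstract identifies as the key to 16-bit fixed-point training, can be illustrated in a few lines. The sketch below is a minimal scalar version under assumptions made here: the split into 8 fractional bits within a 16-bit signed word, the saturation behavior, and the function name `stochastic_round` are illustrative choices, not the paper's exact format.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with stochastic rounding.

    The value is rounded up with probability equal to its distance from the
    lower grid point, so the rounded result is unbiased in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    rounded = lower + (1 if rng.random() < scaled - lower else 0)
    # Saturate to the representable range of a signed word (an assumption).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, rounded)) / scale
```

Unlike round-to-nearest, which would always map 0.3 to 0.25 on a grid with spacing 0.25, this scheme returns 0.5 about one time in five, so small updates are preserved on average rather than systematically lost.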
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
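The claim that depth lets a piecewise linear network reuse computation exponentially often can be checked on a toy case. The sketch below is a standard textbook-style illustration chosen here, not the construction from the paper: a two-unit ReLU layer implements a tent map that folds [0, 1] onto itself, and composing it `depth` times yields 2^depth linear pieces. The function names are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def tent(x):
    """Two ReLU units folding [0, 1]: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    """Compose the folding layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(f, steps=1024):
    """Count intervals of constant slope on [0, 1] using a dyadic grid.

    The grid is fine enough that every breakpoint of deep_tent (for small
    depth) falls exactly on a grid point, so the slopes are computed exactly.
    """
    h = 1.0 / steps
    slopes = [(f((i + 1) * h) - f(i * h)) / h for i in range(steps)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
```

Each added layer uses only two more units yet doubles the number of linear regions, whereas a single hidden layer of ReLUs on a one-dimensional input gains at most one region per unit.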
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
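As an illustration of the first direction in the pooling abstract above (combining max and average pooling), here is a minimal Python sketch. It assumes the combination is a convex mix with a single coefficient `a` in [0, 1]; the function name `mixed_pool` and the fixed (rather than learned) coefficient are assumptions made for this example, not details taken from the abstract.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    a: mixing coefficient in [0, 1]; a=1 is pure max pooling and
       a=0 is pure average pooling. In a trained network this value
       would be a learned parameter; here it is passed in by hand.
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * avg_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(window, a))
```

Because the output is differentiable in `a`, the coefficient could in principle be trained by gradient descent along with the other weights, which is one way to read "learning a pooling function".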
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
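The stochastic pooling procedure described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration for a single pooling region, assuming non-negative activations (for example, after a rectifier); the function name `stochastic_pool` and the all-zero fallback are assumptions of this sketch, not details from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation a_i is selected with probability a_i / sum(a),
    i.e. a multinomial distribution given by the activities within
    the region. Activations are assumed to be non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption: with no positive activity there is nothing to
        # sample from, so the pooled value is taken to be zero.
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs draw the same samples
    window = [0.5, 0.0, 1.5, 2.0]
    print([stochastic_pool(window, rng) for _ in range(5)])
```

Larger activations are chosen more often, but smaller non-zero ones still get picked occasionally, which is where the regularizing noise comes from.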
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
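The stochastic rounding idea in the preceding abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_round`, the 8 fractional bits, and the saturating clamp to a signed 16-bit word are all assumptions chosen for the example. A value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded value is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    Hypothetical helper for illustration only. Rounds up with probability
    equal to the fractional remainder, then saturates to the word size.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    frac = scaled - lower  # in [0, 1): distance above the lower grid point
    rounded = lower + (1 if rng.random() < frac else 0)

    # Saturate to the representable signed integer range (assumed behaviour).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    rounded = max(min_int, min(max_int, rounded))
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates could be lost systematically; the randomized version preserves them on average.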
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
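The abstract above mentions combining max and average pooling without giving a formula. The sketch below shows one plausible reading, assumed here for illustration only: a convex combination of the two pooled outputs controlled by a mixing weight. The function name `mixed_pool2d` and the parameter `alpha` are hypothetical, and `alpha` is a fixed number in this sketch, whereas the abstract describes the pooling function as learned.

```python
import numpy as np

def mixed_pool2d(x, size, alpha):
    """Blend max and average pooling over non-overlapping size x size windows.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average pooling.
    Rows and columns that do not fill a complete window are dropped.
    """
    x = np.asarray(x, dtype=float)
    h, w = x.shape
    # Crop to a multiple of the window size, then expose each window on axes 1 and 3.
    blocks = x[:h - h % size, :w - w % size].reshape(h // size, size, w // size, size)
    max_pooled = blocks.max(axis=(1, 3))
    avg_pooled = blocks.mean(axis=(1, 3))
    return alpha * max_pooled + (1.0 - alpha) * avg_pooled

feature_map = np.arange(16).reshape(4, 4)
print(mixed_pool2d(feature_map, size=2, alpha=0.5))
```

In a trainable setting `alpha` would be a parameter updated by gradient descent alongside the network weights; that part is left out of this sketch.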
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
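The stochastic pooling procedure described in the abstract above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes non-negative activations (for example after a rectifier), non-overlapping square regions, and training-time sampling only. The names `stochastic_pool_region` and `stochastic_pool2d` are hypothetical.

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Pick one activation from a region with probability proportional to its value.

    Assumes all activations are non-negative. An all-zero region returns 0.0.
    """
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total == 0.0:
        return 0.0
    probs = a / total                      # multinomial over the region's activities
    idx = rng.choice(a.size, p=probs)      # sample one location
    return float(a[idx])

def stochastic_pool2d(x, size, rng):
    """Apply stochastic pooling over non-overlapping size x size windows."""
    x = np.asarray(x, dtype=float)
    rows, cols = x.shape[0] // size, x.shape[1] // size
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = stochastic_pool_region(window, rng)
    return out

rng = np.random.default_rng(0)
feature_map = np.array([[0.0, 0.0, 1.0, 2.0],
                        [0.0, 3.0, 0.0, 1.0]])
print(stochastic_pool2d(feature_map, size=2, rng=rng))
```

Larger activations are chosen more often, yet smaller non-zero ones can still be selected. That randomness is the source of the regularizing effect the abstract refers to.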
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
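The input-to-hidden variant described in the batch normalization abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a plain tanh RNN, and the names `batch_norm`, `matvec`, `rnn_step_bn_input`, `gamma`, `beta` and `eps` are hypothetical. Only the input projection is standardized with mini-batch statistics; the hidden-to-hidden term is left untouched.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature with mini-batch mean and variance.

    batch is a list of equal-length lists (examples x features).
    gamma and beta are scalar scale/shift parameters (an assumption here).
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dim)
        ]
        for row in batch
    ]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def rnn_step_bn_input(x_batch, h_batch, w_xh, w_hh, bias):
    """One step h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1} + b) for a mini-batch."""
    # Normalize only the input-to-hidden projection across the mini-batch.
    x_proj = batch_norm([matvec(w_xh, x) for x in x_batch])
    # The hidden-to-hidden projection is used as is.
    h_proj = [matvec(w_hh, h) for h in h_batch]
    return [
        [math.tanh(xp + hp + b) for xp, hp, b in zip(x_row, h_row, bias)]
        for x_row, h_row in zip(x_proj, h_proj)
    ]
```

Calling `rnn_step_bn_input` once per time step, feeding each output back in as `h_batch`, gives the recurrent loop; at test time one would normally swap the mini-batch statistics for running averages, which this sketch omits.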
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
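The stochastic rounding mentioned in the preceding abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the split of the 16-bit word into 8 fractional bits, and the saturation behaviour are all assumptions made for the example. A value is rounded up to the next representable multiple of 2^-frac_bits with probability equal to its fractional distance from the lower multiple, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with
    probability proportional to the distance from the lower multiple."""
    eps = 2.0 ** -frac_bits          # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    p_up = scaled - lower            # in [0, 1)
    step = 1 if rng.random() < p_up else 0
    return (lower + step) * eps


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed range of a
    hypothetical word_bits-wide format with frac_bits fractional bits."""
    eps = 2.0 ** -frac_bits
    max_val = 2.0 ** (word_bits - frac_bits - 1) - eps
    min_val = -(2.0 ** (word_bits - frac_bits - 1))
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

With two fractional bits, for instance, 0.3 is rounded to either 0.25 or 0.5, and the average over many draws stays close to 0.3, whereas round-to-nearest would always return 0.25.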
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
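One way to picture the "combining of max and average pooling" described in the preceding abstract is a convex combination of the two with a mixing weight. The sketch below is an assumption made for illustration: the abstract does not give the formula, and the names `mixed_pool`, `mixed_pool_grad_alpha` and `alpha` are hypothetical. It operates on a single pooling region given as a flat list of activations.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one region.

    alpha = 1.0 gives max pooling, alpha = 0.0 gives average pooling.
    The convex-combination form is an assumption for this sketch.
    """
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


def mixed_pool_grad_alpha(region):
    """Derivative of mixed_pool with respect to alpha, which is what a
    gradient-based learner would use to adapt the mixing weight."""
    return max(region) - sum(region) / len(region)
```

Because the output is linear in `alpha`, its gradient is simply the gap between the max and the average of the region, so the mixing weight could be trained alongside the other parameters.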
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
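The stochastic pooling procedure in the preceding abstract can be sketched for a single pooling region as below. This is a minimal illustration under stated assumptions: activations are taken to be non-negative (for example after a rectifier), the multinomial probabilities are taken to be the activations normalized by their sum, and the function name `stochastic_pool` and the all-zero fallback are hypothetical choices for the example.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from the region, with probability
    proportional to its (non-negative) value."""
    total = sum(region)
    if total <= 0:
        # No positive activity in the region; return 0.0 by convention
        # (an assumption of this sketch).
        return 0.0
    threshold = rng.random() * total   # uniform in [0, total)
    cumulative = 0.0
    for activation in region:
        cumulative += activation
        if threshold < cumulative:
            return activation
    # Guard against floating-point round-off in the running sum.
    return max(region)
```

For the region `[1.0, 3.0]`, the value 3.0 is selected about three times out of four, and zero activations are never selected when a positive one is present.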
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
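The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the simplest reading of a "normalized $L_p$ norm", namely $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ applied directly to a list of input signals; the learned projections and per-unit learned orders mentioned in the abstract are left out, and the names lp_unit and signals are hypothetical, chosen only for illustration.

```python
import math


def lp_unit(inputs, p):
    """Normalized L_p norm of a list of signals: (mean(|x_i|^p))^(1/p).

    Simplifying assumption: the inputs are used as given, without the
    learned projections described in the abstract.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    largest = max(abs(x) for x in inputs)
    if largest == 0 or math.isinf(p):
        # All-zero input, or the limiting case p -> infinity (max pooling).
        return float(largest)
    # Factor out the largest magnitude so large orders do not overflow.
    mean_pow = sum((abs(x) / largest) ** p for x in inputs) / len(inputs)
    return largest * mean_pow ** (1.0 / p)


signals = [3.0, -4.0, 0.0, 1.0]
avg_abs = lp_unit(signals, 1)     # average of absolute values
rms = lp_unit(signals, 2)         # root-mean-square
near_max = lp_unit(signals, 200)  # approaches max(|x_i|) as p grows
```

With p = 1 the unit reduces to average pooling of absolute values, with p = 2 to root-mean-square pooling, and for large p it approaches max pooling, which is the sense in which a single parameterized unit covers the three conventional operators.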
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
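As a rough illustration of the "simple dropout based GD method for convex ERMs" mentioned in the preceding abstract, the sketch below runs stochastic gradient descent on a least-squares objective while randomly dropping input coordinates. The abstract does not specify where dropout is applied or how survivors are scaled; dropping input features with inverted scaling, the function names, and all hyperparameters here are assumptions made for illustration only.

```python
import random


def dropout_sgd(data, keep_prob=0.8, lr=0.02, steps=2000, seed=0):
    """SGD on squared loss with dropout on the inputs (assumed variant)."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        x, y = rng.choice(data)
        # Keep each coordinate with probability keep_prob and rescale the
        # survivors so the dropped input is unbiased for x.
        x_drop = [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]
        err = sum(wi * xi for wi, xi in zip(w, x_drop)) - y
        # Gradient step for 0.5 * err**2 with respect to w.
        w = [wi - lr * err * xi for wi, xi in zip(w, x_drop)]
    return w


def mean_squared_error(w, data):
    total = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(data)


# Synthetic, noise-free linear data (hypothetical; not from the abstract).
data_rng = random.Random(1)
true_w = [2.0, -1.0]
data = []
for _ in range(200):
    x = [data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

w = dropout_sgd(data)
initial_loss = mean_squared_error([0.0, 0.0], data)
final_loss = mean_squared_error(w, data)
```

In expectation, input dropout on a linear model adds a ridge-like penalty to the squared loss, so the learned weights are shrunk toward zero relative to `true_w`; this is one informal way to read the abstract's description of dropout as a regularizer.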
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
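The layer structure described in the preceding abstract (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched in a few lines. The abstract does not say how a layer's output is formed; the one-hot nearest-center encoding, the squared Euclidean distance, and the names `build_layer` and `encode` are assumptions for illustration, not the authors' implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_layer(data, num_clusterings, k, rng):
    """Each clustering is just k centers drawn at random from the data."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(layer, x):
    """Concatenate one-hot nearest-center indicators from every clustering."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        code.extend(1 if j == nearest else 0 for j in range(len(centers)))
    return code


rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = build_layer(data, num_clusterings=4, k=5, rng=rng)
code = encode(layer, data[0])  # sparse binary code of length 4 * 5
```

Stacking would feed the codes of one layer into the next as its input data; because no center is ever optimized, building a layer only costs the random sampling.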
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
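To make the distinction in the preceding abstract concrete, the sketch below applies batch normalization only to the input-to-hidden term of a plain tanh RNN step and leaves the hidden-to-hidden term untouched. It is a simplified reading, not the paper's method: statistics are computed per time step over the mini-batch, the learned gain and shift of full batch normalization are omitted, and all names and sizes are hypothetical.

```python
import math
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        for i in range(n):
            out[i][j] = (col[i] - mean) / math.sqrt(var + eps)
    return out


def rnn_step(x_batch, h_batch, W_x, W_h, b):
    """h_t = tanh(BN(W_x x_t) + W_h h_{t-1} + b); only the input term is normalized."""
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]
    return [
        [math.tanh(i + r + bj) for i, r, bj in zip(it, rt, b)]
        for it, rt in zip(in_term, rec_term)
    ]


rng = random.Random(0)
batch, n_in, n_hid = 8, 3, 4
W_x = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hid)]
W_h = [[rng.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_hid)]
b = [0.0] * n_hid
x_batch = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(batch)]
h_batch = [[0.0] * n_hid for _ in range(batch)]

normed = batch_norm([matvec(W_x, x) for x in x_batch])
h_next = rnn_step(x_batch, h_batch, W_x, W_h, b)
```

Normalizing the recurrent term as well would mean wrapping `rec_term` in `batch_norm`, which is the hidden-to-hidden variant the abstract reports as unhelpful.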
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
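The stochastic rounding named in the low-precision training abstract above can be illustrated with a minimal Python sketch. It assumes the usual definition: a value is rounded to one of its two neighbouring fixed-point grid points, and the probability of rounding up equals its fractional distance from the lower one, so the rounding is unbiased in expectation. The function name `stochastic_round` and the `frac_bits` parameter are hypothetical, and the sketch does not model the 16-bit word width or saturation discussed in the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the expected result equals x.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    prob_up = scaled - lower        # in [0, 1); 0 when x is already on the grid
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic scheme preserves the value on average, which is the property the abstract credits for stable low-precision training.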
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
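The stochastic pooling procedure in the abstract above picks one activation per pooling region, with probabilities given by the activities in that region. A minimal sketch of the training-time sampling step follows, assuming non-negative activations (for example, after a rectifier); the function name `stochastic_pool` and the all-zero fallback are illustrative assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution over the region's activities.
    Assumes every value in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        # Assumed fallback: with no active unit, the pooled output is zero.
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which helps when comparing against deterministic max pooling.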
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
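The normalized $L_p$ norm at the heart of the $L_p$ unit above can be sketched in a few lines. This is a simplified illustration that assumes the projections `z` of the lower-layer units are already computed and that the order `p` is a fixed number of at least 1; the function name `lp_unit` is hypothetical, and the learned projections and learned per-unit orders mentioned in the abstract are left out.

```python
import math


def lp_unit(z, p):
    """Normalized L_p norm of the projections z: (mean(|z_i|^p))^(1/p).

    p=1 gives the mean absolute value, p=2 the root-mean-square,
    and large p approaches max(|z_i|), matching the pooling operators
    the unit is said to generalize.
    """
    if p < 1:
        raise ValueError("order p is assumed to be at least 1")
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)
```

For example, `lp_unit([3.0, -4.0], 1.0)` averages the absolute values, while raising `p` moves the output toward the largest magnitude in `z`.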
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
================================================ FILE: 2017/data/heart.csv ================================================ sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd 160,12,5.73,23.11,Present,49,25.3,97.2,52,1 144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1 118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0 170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1 134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1 132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0 142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0 114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1 114,0,3.83,19.4,Present,49,24.86,2.49,29,0 132,0,5.8,30.96,Present,69,30.11,0,53,1 206,6,2.95,32.27,Absent,72,26.81,56.06,60,1 134,14.1,4.44,22.39,Present,65,23.09,0,40,1 118,0,1.88,10.05,Absent,59,21.57,0,17,0 132,0,1.87,17.21,Absent,49,23.63,0.97,15,0 112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0 117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0 120,7.5,15.33,22,Absent,60,25.31,34.49,49,0 146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1 158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1 124,14,6.23,35.96,Present,45,30.09,0,59,1 106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1 132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0 150,0.3,6.38,33.99,Present,62,24.64,0,50,0 138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0 142,18.2,4.34,24.38,Absent,61,26.19,0,50,0 124,4,12.42,31.29,Present,54,23.23,2.06,42,1 118,6,9.65,33.91,Absent,60,38.8,0,48,0 145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1 144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0 146,0,6.62,25.69,Absent,60,28.07,8.23,63,1 136,2.52,3.95,25.63,Absent,51,21.86,0,45,1 158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1 122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1 126,8.75,6.53,34.02,Absent,49,30.25,0,41,1 148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0 122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1 140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0 110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0 130,0,2.82,19.63,Present,70,24.86,0,29,0 136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1 
118,0.28,5.8,33.7,Present,60,30.98,0,41,1 144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0 120,0,1.07,16.02,Absent,47,22.15,0,15,0 130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1 114,0,2.99,9.74,Absent,54,46.58,0,17,0 128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0 162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1 116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1 114,0,1.94,11.02,Absent,54,20.17,38.98,16,0 126,3.8,3.88,31.79,Absent,57,30.53,0,30,0 122,0,5.75,30.9,Present,46,29.01,4.11,42,0 134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0 152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1 134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1 156,3,1.82,27.55,Absent,60,23.91,54,53,0 152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0 118,0,2.99,16.17,Absent,49,23.83,3.22,28,0 126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1 103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0 121,0.8,5.29,18.95,Present,47,22.51,0,61,0 142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0 138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0 152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0 140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0 130,0,1.82,10.45,Absent,57,22.07,2.06,17,0 136,7.36,2.19,28.11,Present,61,25,61.71,54,0 124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0 112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0 118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0 122,0,3.37,16.1,Absent,67,21.06,0,32,1 118,0,3.67,12.13,Absent,51,19.15,0.6,15,0 130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0 130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0 126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0 128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0 136,0,4.12,17.42,Absent,52,21.66,12.86,40,0 134,0,5.9,30.84,Absent,49,29.16,0,55,0 140,0.6,5.56,33.39,Present,58,27.19,0,55,1 168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1 108,0.4,5.91,22.92,Present,57,25.72,72,39,0 114,3,7.04,22.64,Present,55,22.59,0,45,1 140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1 148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1 148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1 128,0,2.43,13.15,Present,63,20.75,0,17,0 
130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0 126,10.5,4.49,17.33,Absent,67,19.37,0,49,1 140,0,5.08,27.33,Present,41,27.83,1.25,38,0 126,0.9,5.64,17.78,Present,55,21.94,0,41,0 122,0.72,4.04,32.38,Absent,34,28.34,0,55,0 116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0 120,3.7,4.02,39.66,Absent,61,30.57,0,64,1 143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0 118,4,3.95,18.96,Absent,54,25.15,8.33,49,1 194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0 134,3,4.37,23.07,Absent,56,20.54,9.65,62,0 138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0 136,0,5,27.58,Present,49,27.59,1.47,39,0 122,3.2,11.32,35.36,Present,55,27.07,0,51,1 164,12,3.91,19.59,Absent,51,23.44,19.75,39,0 136,8,7.85,23.81,Present,51,22.69,2.78,50,0 166,0.07,4.03,29.29,Absent,53,28.37,0,27,0 118,0,4.34,30.12,Present,52,32.18,3.91,46,0 128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0 118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0 158,3.6,2.97,30.11,Absent,63,26.64,108,64,0 108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1 170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1 118,1,5.76,22.1,Absent,62,23.48,7.71,42,0 124,0,3.04,17.33,Absent,49,22.04,0,18,0 114,0,8.01,21.64,Absent,66,25.51,2.49,16,0 168,9,8.53,24.48,Present,69,26.18,4.63,54,1 134,2,3.66,14.69,Absent,52,21.03,2.06,37,0 174,0,8.46,35.1,Present,35,25.27,0,61,1 116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1 128,0,10.58,31.81,Present,46,28.41,14.66,48,0 140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1 154,0.7,5.91,25,Absent,13,20.6,0,42,0 150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1 130,0,3.92,25.55,Absent,68,28.02,0.68,27,0 128,2,6.13,21.31,Absent,66,22.86,11.83,60,0 120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0 120,0,5.01,26.13,Absent,64,26.21,12.24,33,0 138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1 153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0 123,8.6,11.17,35.28,Present,70,33.14,0,59,1 148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0 136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0 134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1 152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0 
158,13.5,5.04,30.79,Absent,54,24.79,21.5,62,0 132,2,3.08,35.39,Absent,45,31.44,79.82,58,1 134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1 142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0 134,6,3.3,28.45,Absent,65,26.09,58.11,40,0 122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1 116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0 128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0 120,0,3.68,12.24,Absent,51,20.52,0.51,20,0 124,0,3.95,36.35,Present,59,32.83,9.59,54,0 160,14,5.9,37.12,Absent,58,33.87,3.52,54,1 130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1 128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0 130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0 109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0 144,0,3.84,18.72,Absent,56,22.1,4.8,40,0 118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0 136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1 136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1 124,15.5,5.05,24.06,Absent,46,23.22,0,61,1 148,6,6.49,26.47,Absent,48,24.7,0,55,0 128,6.6,3.58,20.71,Absent,55,24.15,0,52,0 122,0.28,4.19,19.97,Absent,61,25.63,0,24,0 108,0,2.74,11.17,Absent,53,22.61,0.95,20,0 124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1 138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1 127,0,2.81,15.7,Absent,42,22.03,1.03,17,0 174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0 122,0,3.05,23.51,Absent,46,25.81,0,38,0 144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1 126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0 208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1 138,0,2.68,17.04,Absent,42,22.16,0,16,0 148,0,3.84,17.26,Absent,70,20,0,21,0 122,0,3.08,16.3,Absent,43,22.13,0,16,0 132,7,3.2,23.26,Absent,77,23.64,23.14,49,0 110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1 160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1 126,0.54,4.39,21.13,Present,45,25.99,0,25,0 162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0 194,2.55,6.89,33.88,Present,69,29.33,0,41,0 118,0.75,2.58,20.25,Absent,59,24.46,0,32,0 124,0,4.79,34.71,Absent,49,26.09,9.26,47,0 160,0,2.42,34.46,Absent,48,29.83,1.03,61,0 128,0,2.51,29.35,Present,53,22.05,1.37,62,0 
122,4,5.24,27.89,Present,45,26.52,0,61,1 132,2,2.7,21.57,Present,50,27.95,9.26,37,0 120,0,2.42,16.66,Absent,46,20.16,0,17,0 128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0 108,15,4.91,34.65,Absent,41,27.96,14.4,56,0 166,0,4.31,34.27,Absent,45,30.14,13.27,56,0 152,0,6.06,41.05,Present,51,40.34,0,51,0 170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1 156,4,2.05,19.48,Present,50,21.48,27.77,39,1 116,8,6.73,28.81,Present,41,26.74,40.94,48,1 122,4.4,3.18,11.59,Present,59,21.94,0,33,1 150,20,6.4,35.04,Absent,53,28.88,8.33,63,0 129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0 134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0 126,0,5.98,29.06,Present,56,25.39,11.52,64,1 142,0,3.72,25.68,Absent,48,24.37,5.25,40,1 128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1 102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1 130,0,4.89,25.98,Absent,72,30.42,14.71,23,0 138,0.05,2.79,10.35,Absent,46,21.62,0,18,0 138,0,1.96,11.82,Present,54,22.01,8.13,21,0 128,0,3.09,20.57,Absent,54,25.63,0.51,17,0 162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0 160,3,9.19,26.47,Present,39,28.25,14.4,54,1 148,0,4.66,24.39,Absent,50,25.26,4.03,27,0 124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0 136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1 134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0 128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0 122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0 152,3,4.64,31.29,Absent,41,29.34,4.53,40,0 162,0,5.09,24.6,Present,64,26.71,3.81,18,0 124,4,6.65,30.84,Present,54,28.4,33.51,60,0 136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0 136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0 134,0.05,8.03,27.95,Absent,48,26.88,0,60,0 122,1,5.88,34.81,Present,69,31.27,15.94,40,1 116,3,3.05,30.31,Absent,41,23.63,0.86,44,0 132,0,0.98,21.39,Absent,62,26.75,0,53,0 134,0,2.4,21.11,Absent,57,22.45,1.37,18,0 160,7.77,8.07,34.8,Absent,64,31.15,0,62,1 180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1 124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0 114,0,4.97,9.69,Absent,26,22.6,0,25,0 208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0 
138,0,3.14,12,Absent,54,20.28,0,16,0 164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1 144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0 136,7.5,7.39,28.04,Present,50,25.01,0,45,1 132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0 143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0 112,4.46,7.18,26.25,Present,69,27.29,0,32,1 134,10,3.79,34.72,Absent,42,28.33,28.8,52,1 138,2,5.11,31.4,Present,49,27.25,2.06,64,1 188,0,5.47,32.44,Present,71,28.99,7.41,50,1 110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1 136,13.2,7.18,35.95,Absent,48,29.19,0,62,0 130,1.75,5.46,34.34,Absent,53,29.42,0,58,1 122,0,3.76,24.59,Absent,56,24.36,0,30,0 138,0,3.24,27.68,Absent,60,25.7,88.66,29,0 130,18,4.13,27.43,Absent,54,27.44,0,51,1 126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0 176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0 122,0,5.49,19.56,Absent,57,23.12,14.02,27,0 124,0,3.23,9.64,Absent,59,22.7,0,16,0 140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1 128,6,4.37,22.98,Present,50,26.01,0,47,0 190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0 144,0.76,10.53,35.66,Absent,63,34.35,0,55,1 126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1 128,0,2.63,23.88,Absent,45,21.59,6.54,57,0 136,0.4,3.91,21.1,Present,63,22.3,0,56,1 158,4,4.18,28.61,Present,42,25.11,0,60,0 160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0 124,6,5.21,33.02,Present,64,29.37,7.61,58,1 158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0 128,0,6.34,11.87,Absent,57,23.14,0,17,0 166,3,3.82,26.75,Absent,45,20.86,0,63,1 146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0 161,9,4.65,15.16,Present,58,23.76,43.2,46,0 164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1 146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1 142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0 138,12,5.13,28.34,Absent,59,24.49,32.81,58,1 154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0 118,0,2.39,12.13,Absent,49,18.46,0.26,17,1 124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0 124,1.04,2.84,16.42,Present,46,20.17,0,61,0 136,5,4.19,23.99,Present,68,27.8,25.86,35,0 132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1 
118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0 118,0.12,4.16,9.37,Absent,57,19.61,0,17,0 134,12,4.96,29.79,Absent,53,24.86,8.23,57,0 114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0 136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1 130,0,4.16,39.43,Present,46,30.01,0,55,1 136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1 136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0 154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0 108,0.8,2.47,17.53,Absent,47,22.18,0,55,1 136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1 174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1 124,4.25,8.22,30.77,Absent,56,25.8,0,43,0 114,0,2.63,9.69,Absent,45,17.89,0,16,0 118,0.12,3.26,12.26,Absent,55,22.65,0,16,0 106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1 146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0 206,0,4.17,33.23,Absent,69,27.36,6.17,50,1 134,3,3.17,17.91,Absent,35,26.37,15.12,27,0 148,15,4.98,36.94,Present,72,31.83,66.27,41,1 126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0 134,0,3.69,13.92,Absent,43,27.66,0,19,0 134,0.02,2.8,18.84,Absent,45,24.82,0,17,0 123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0 112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1 112,0,1.71,15.96,Absent,42,22.03,3.5,16,0 101,0.48,7.26,13,Absent,50,19.82,5.19,16,0 150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0 170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1 134,0,5.63,29.12,Absent,68,32.33,2.02,34,0 142,0,4.19,18.04,Absent,56,23.65,20.78,42,1 132,0.1,3.28,10.73,Absent,73,20.42,0,17,0 136,0,2.28,18.14,Absent,55,22.59,0,17,0 132,12,4.51,21.93,Absent,61,26.07,64.8,46,1 166,4.1,4,34.3,Present,32,29.51,8.23,53,0 138,0,3.96,24.7,Present,53,23.8,0,45,0 138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1 170,0,3.12,37.15,Absent,47,35.42,0,53,0 128,0,8.41,28.82,Present,60,26.86,0,59,1 136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0 128,0,3.22,26.55,Present,39,26.59,16.71,49,0 150,14.4,5.04,26.52,Present,60,28.84,0,45,0 132,8.4,3.57,13.68,Absent,42,18.75,15.43,59,1 142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0 130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0 
174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1 114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0 162,1.5,2.46,19.39,Present,49,24.32,0,59,1 174,0,3.27,35.4,Absent,58,37.71,24.95,44,0 190,5.15,6.03,36.59,Absent,42,30.31,72,50,0 154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0 124,0,2.28,24.86,Present,50,22.24,8.26,38,0 114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0 168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1 142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0 154,0,4.81,28.11,Present,56,25.67,75.77,59,0 146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0 166,6,3.02,29.3,Absent,35,24.38,38.06,61,0 140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1 136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0 156,0,3.47,21.1,Absent,73,28.4,0,36,1 132,0,6.63,29.58,Present,37,29.41,2.57,62,0 128,0,2.98,12.59,Absent,65,20.74,2.06,19,0 106,5.6,3.2,12.3,Absent,49,20.29,0,39,0 144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0 154,0.31,2.33,16.48,Absent,33,24,11.83,17,0 126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0 134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1 152,19.45,4.22,29.81,Absent,28,23.95,0,59,1 146,1.35,6.39,34.21,Absent,51,26.43,0,59,1 162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0 130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1 138,6,7.24,37.05,Absent,38,28.69,0,59,0 148,0,5.32,26.71,Present,52,32.21,32.78,27,0 124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0 118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0 116,4.28,7.02,19.99,Present,68,23.31,0,52,1 162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1 138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0 137,1.2,3.14,23.87,Absent,66,24.13,45,37,0 198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1 154,4.5,4.75,23.52,Present,43,25.76,0,53,1 128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0 130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1 162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0 120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0 136,3.99,2.58,16.38,Present,53,22.41,27.67,36,0 176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1 134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1 
122,1.7,5.28,32.23,Present,51,24.08,0,54,0 134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1 134,0,2.43,22.24,Absent,52,26.49,41.66,24,0 136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0 132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0 152,1.68,3.58,25.43,Absent,50,27.03,0,32,0 132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1 124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0 140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0 166,0.6,2.42,34.03,Present,53,26.96,54,60,0 156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1 132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0 150,0,4.99,27.73,Absent,57,30.92,8.33,24,0 134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0 126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0 148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0 148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1 132,6,5.97,25.73,Present,66,24.18,145.29,41,0 128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0 128,5.16,4.9,31.35,Present,57,26.42,0,64,0 140,0,2.4,27.89,Present,70,30.74,144,29,0 126,0,5.29,27.64,Absent,25,27.62,2.06,45,0 114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0 118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0 126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0 154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1 112,1.44,2.71,22.92,Absent,59,24.81,0,52,0 140,8,4.42,33.15,Present,47,32.77,66.86,44,0 140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1 128,2.6,4.94,21.36,Absent,61,21.3,0,31,0 126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0 160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1 144,0,4.17,29.63,Present,52,21.83,0,59,0 148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1 146,0,4.92,18.53,Absent,57,24.2,34.97,26,0 164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1 130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1 154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1 178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0 180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0 134,12.5,2.73,39.35,Absent,48,35.58,0,48,0 142,0,3.54,16.64,Absent,58,25.97,8.36,27,0 162,7,7.67,34.34,Present,33,30.77,0,62,0 218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1 
126,8.75,6.06,32.72,Present,33,27,62.43,55,1 126,0,3.57,26.01,Absent,61,26.3,7.97,47,0 134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0 132,0,4.17,36.57,Absent,57,30.61,18,49,0 178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1 208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1 160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0 116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0 180,25.01,3.7,38.11,Present,57,30.54,0,61,1 200,19.2,4.43,40.6,Present,55,32.04,36,60,1 112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0 120,0,3.1,26.97,Absent,41,24.8,0,16,0 178,20,9.78,33.55,Absent,37,27.29,2.88,62,1 166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0 164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1 216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1 146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0 134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1 158,16,5.56,29.35,Absent,36,25.92,58.32,60,0 176,0,3.14,31.04,Present,45,30.18,4.63,45,0 132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0 126,0,4.55,29.18,Absent,48,24.94,36,41,0 120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0 174,0,3.86,21.73,Absent,42,23.37,0,63,0 150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1 176,6,3.98,17.2,Present,52,21.07,4.11,61,1 142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1 132,0,3.3,21.61,Absent,42,24.92,32.61,33,0 142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0 146,1.16,2.28,34.53,Absent,50,28.71,45,49,0 132,7.2,3.65,17.16,Present,56,23.25,0,34,0 120,0,3.57,23.22,Absent,58,27.2,0,32,0 118,0,3.89,15.96,Absent,65,20.18,0,16,0 108,0,1.43,26.26,Absent,42,19.38,0,16,0 136,0,4,19.06,Absent,40,21.94,2.06,16,0 120,0,2.46,13.39,Absent,47,22.01,0.51,18,0 132,0,3.55,8.66,Present,61,18.5,3.87,16,0 136,0,1.77,20.37,Absent,45,21.51,2.06,16,0 138,0,1.86,18.35,Present,59,25.38,6.51,17,0 138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0 130,1.22,3.3,13.65,Absent,50,21.4,3.81,31,0 130,4,2.4,17.42,Absent,60,22.05,0,40,0 110,0,7.14,28.28,Absent,57,29,0,32,0 120,0,3.98,13.19,Present,47,21.89,0,16,0 166,6,8.8,37.89,Absent,39,28.7,43.2,52,0 
134,0.57,4.75,23.07,Absent,67,26.33,0,37,0 142,3,3.69,25.1,Absent,60,30.08,38.88,27,0 136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0 142,0,4.32,25.22,Absent,47,28.92,6.53,34,1 130,0,1.88,12.51,Present,52,20.28,0,17,0 124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0 144,4,5.03,25.78,Present,57,27.55,90,48,1 136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0 120,0,2.77,13.35,Absent,67,23.37,1.03,18,0 154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0 124,1.6,7.22,39.68,Present,36,31.5,0,51,1 146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1 128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1 170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0 214,0.4,5.98,31.72,Absent,64,28.45,0,58,0 182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1 108,3,1.59,15.23,Absent,40,20.09,26.64,55,0 118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0 132,0,4.82,33.41,Present,62,14.7,0,46,1 ================================================ FILE: 2017/data/heart.txt ================================================ "sbp" "tobacco" "ldl" "adiposity" "famhist" "typea" "obesity" "alcohol" "age" "chd" 160 12 5.73 23.11 "Present" 49 25.3 97.2 52 1 144 0.01 4.41 28.61 "Absent" 55 28.87 2.06 63 1 118 0.08 3.48 32.28 "Present" 52 29.14 3.81 46 0 170 7.5 6.41 38.03 "Present" 51 31.99 24.26 58 1 134 13.6 3.5 27.78 "Present" 60 25.99 57.34 49 1 132 6.2 6.47 36.21 "Present" 62 30.77 14.14 45 0 142 4.05 3.38 16.2 "Absent" 59 20.81 2.62 38 0 114 4.08 4.59 14.6 "Present" 62 23.11 6.72 58 1 114 0 3.83 19.4 "Present" 49 24.86 2.49 29 0 132 0 5.8 30.96 "Present" 69 30.11 0 53 1 206 6 2.95 32.27 "Absent" 72 26.81 56.06 60 1 134 14.1 4.44 22.39 "Present" 65 23.09 0 40 1 118 0 1.88 10.05 "Absent" 59 21.57 0 17 0 132 0 1.87 17.21 "Absent" 49 23.63 0.97 15 0 112 9.65 2.29 17.2 "Present" 54 23.53 0.68 53 0 117 1.53 2.44 28.95 "Present" 35 25.89 30.03 46 0 120 7.5 15.33 22 "Absent" 60 25.31 34.49 49 0 146 10.5 8.29 35.36 "Present" 78 32.73 13.89 53 1 158 2.6 7.46 34.07 "Present" 61 29.3 53.28 62 1 124 14 6.23 35.96 "Present" 45 30.09 0 59 1 106 1.61 1.74 
12.32 "Absent" 74 20.92 13.37 20 1 132 7.9 2.85 26.5 "Present" 51 26.16 25.71 44 0 150 0.3 6.38 33.99 "Present" 62 24.64 0 50 0 138 0.6 3.81 28.66 "Absent" 54 28.7 1.46 58 0 142 18.2 4.34 24.38 "Absent" 61 26.19 0 50 0 124 4 12.42 31.29 "Present" 54 23.23 2.06 42 1 118 6 9.65 33.91 "Absent" 60 38.8 0 48 0 145 9.1 5.24 27.55 "Absent" 59 20.96 21.6 61 1 144 4.09 5.55 31.4 "Present" 60 29.43 5.55 56 0 146 0 6.62 25.69 "Absent" 60 28.07 8.23 63 1 136 2.52 3.95 25.63 "Absent" 51 21.86 0 45 1 158 1.02 6.33 23.88 "Absent" 66 22.13 24.99 46 1 122 6.6 5.58 35.95 "Present" 53 28.07 12.55 59 1 126 8.75 6.53 34.02 "Absent" 49 30.25 0 41 1 148 5.5 7.1 25.31 "Absent" 56 29.84 3.6 48 0 122 4.26 4.44 13.04 "Absent" 57 19.49 48.99 28 1 140 3.9 7.32 25.05 "Absent" 47 27.36 36.77 32 0 110 4.64 4.55 30.46 "Absent" 48 30.9 15.22 46 0 130 0 2.82 19.63 "Present" 70 24.86 0 29 0 136 11.2 5.81 31.85 "Present" 75 27.68 22.94 58 1 118 0.28 5.8 33.7 "Present" 60 30.98 0 41 1 144 0.04 3.38 23.61 "Absent" 30 23.75 4.66 30 0 120 0 1.07 16.02 "Absent" 47 22.15 0 15 0 130 2.61 2.72 22.99 "Present" 51 26.29 13.37 51 1 114 0 2.99 9.74 "Absent" 54 46.58 0 17 0 128 4.65 3.31 22.74 "Absent" 62 22.95 0.51 48 0 162 7.4 8.55 24.65 "Present" 64 25.71 5.86 58 1 116 1.91 7.56 26.45 "Present" 52 30.01 3.6 33 1 114 0 1.94 11.02 "Absent" 54 20.17 38.98 16 0 126 3.8 3.88 31.79 "Absent" 57 30.53 0 30 0 122 0 5.75 30.9 "Present" 46 29.01 4.11 42 0 134 2.5 3.66 30.9 "Absent" 52 27.19 23.66 49 0 152 0.9 9.12 30.23 "Absent" 56 28.64 0.37 42 1 134 8.08 1.55 17.5 "Present" 56 22.65 66.65 31 1 156 3 1.82 27.55 "Absent" 60 23.91 54 53 0 152 5.99 7.99 32.48 "Absent" 45 26.57 100.32 48 0 118 0 2.99 16.17 "Absent" 49 23.83 3.22 28 0 126 5.1 2.96 26.5 "Absent" 55 25.52 12.34 38 1 103 0.03 4.21 18.96 "Absent" 48 22.94 2.62 18 0 121 0.8 5.29 18.95 "Present" 47 22.51 0 61 0 142 0.28 1.8 21.03 "Absent" 57 23.65 2.93 33 0 138 1.15 5.09 27.87 "Present" 61 25.65 2.34 44 0 152 10.1 4.71 24.65 "Present" 65 26.21 24.53 57 0 140 0.45 
4.3 24.33 "Absent" 41 27.23 10.08 38 0 130 0 1.82 10.45 "Absent" 57 22.07 2.06 17 0 136 7.36 2.19 28.11 "Present" 61 25 61.71 54 0 124 4.82 3.24 21.1 "Present" 48 28.49 8.42 30 0 112 0.41 1.88 10.29 "Absent" 39 22.08 20.98 27 0 118 4.46 7.27 29.13 "Present" 48 29.01 11.11 33 0 122 0 3.37 16.1 "Absent" 67 21.06 0 32 1 118 0 3.67 12.13 "Absent" 51 19.15 0.6 15 0 130 1.72 2.66 10.38 "Absent" 68 17.81 11.1 26 0 130 5.6 3.37 24.8 "Absent" 58 25.76 43.2 36 0 126 0.09 5.03 13.27 "Present" 50 17.75 4.63 20 0 128 0.4 6.17 26.35 "Absent" 64 27.86 11.11 34 0 136 0 4.12 17.42 "Absent" 52 21.66 12.86 40 0 134 0 5.9 30.84 "Absent" 49 29.16 0 55 0 140 0.6 5.56 33.39 "Present" 58 27.19 0 55 1 168 4.5 6.68 28.47 "Absent" 43 24.25 24.38 56 1 108 0.4 5.91 22.92 "Present" 57 25.72 72 39 0 114 3 7.04 22.64 "Present" 55 22.59 0 45 1 140 8.14 4.93 42.49 "Absent" 53 45.72 6.43 53 1 148 4.8 6.09 36.55 "Present" 63 25.44 0.88 55 1 148 12.2 3.79 34.15 "Absent" 57 26.38 14.4 57 1 128 0 2.43 13.15 "Present" 63 20.75 0 17 0 130 0.56 3.3 30.86 "Absent" 49 27.52 33.33 45 0 126 10.5 4.49 17.33 "Absent" 67 19.37 0 49 1 140 0 5.08 27.33 "Present" 41 27.83 1.25 38 0 126 0.9 5.64 17.78 "Present" 55 21.94 0 41 0 122 0.72 4.04 32.38 "Absent" 34 28.34 0 55 0 116 1.03 2.83 10.85 "Absent" 45 21.59 1.75 21 0 120 3.7 4.02 39.66 "Absent" 61 30.57 0 64 1 143 0.46 2.4 22.87 "Absent" 62 29.17 15.43 29 0 118 4 3.95 18.96 "Absent" 54 25.15 8.33 49 1 194 1.7 6.32 33.67 "Absent" 47 30.16 0.19 56 0 134 3 4.37 23.07 "Absent" 56 20.54 9.65 62 0 138 2.16 4.9 24.83 "Present" 39 26.06 28.29 29 0 136 0 5 27.58 "Present" 49 27.59 1.47 39 0 122 3.2 11.32 35.36 "Present" 55 27.07 0 51 1 164 12 3.91 19.59 "Absent" 51 23.44 19.75 39 0 136 8 7.85 23.81 "Present" 51 22.69 2.78 50 0 166 0.07 4.03 29.29 "Absent" 53 28.37 0 27 0 118 0 4.34 30.12 "Present" 52 32.18 3.91 46 0 128 0.42 4.6 26.68 "Absent" 41 30.97 10.33 31 0 118 1.5 5.38 25.84 "Absent" 64 28.63 3.89 29 0 158 3.6 2.97 30.11 "Absent" 63 26.64 108 64 0 108 1.5 4.33 24.99 
"Absent" 66 22.29 21.6 61 1 170 7.6 5.5 37.83 "Present" 42 37.41 6.17 54 1 118 1 5.76 22.1 "Absent" 62 23.48 7.71 42 0 124 0 3.04 17.33 "Absent" 49 22.04 0 18 0 114 0 8.01 21.64 "Absent" 66 25.51 2.49 16 0 168 9 8.53 24.48 "Present" 69 26.18 4.63 54 1 134 2 3.66 14.69 "Absent" 52 21.03 2.06 37 0 174 0 8.46 35.1 "Present" 35 25.27 0 61 1 116 31.2 3.17 14.99 "Absent" 47 19.4 49.06 59 1 128 0 10.58 31.81 "Present" 46 28.41 14.66 48 0 140 4.5 4.59 18.01 "Absent" 63 21.91 22.09 32 1 154 0.7 5.91 25 "Absent" 13 20.6 0 42 0 150 3.5 6.99 25.39 "Present" 50 23.35 23.48 61 1 130 0 3.92 25.55 "Absent" 68 28.02 0.68 27 0 128 2 6.13 21.31 "Absent" 66 22.86 11.83 60 0 120 1.4 6.25 20.47 "Absent" 60 25.85 8.51 28 0 120 0 5.01 26.13 "Absent" 64 26.21 12.24 33 0 138 4.5 2.85 30.11 "Absent" 55 24.78 24.89 56 1 153 7.8 3.96 25.73 "Absent" 54 25.91 27.03 45 0 123 8.6 11.17 35.28 "Present" 70 33.14 0 59 1 148 4.04 3.99 20.69 "Absent" 60 27.78 1.75 28 0 136 3.96 2.76 30.28 "Present" 50 34.42 18.51 38 0 134 8.8 7.41 26.84 "Absent" 35 29.44 29.52 60 1 152 12.18 4.04 37.83 "Present" 63 34.57 4.17 64 0 158 13.5 5.04 30.79 "Absent" 54 24.79 21.5 62 0 132 2 3.08 35.39 "Absent" 45 31.44 79.82 58 1 134 1.5 3.73 21.53 "Absent" 41 24.7 11.11 30 1 142 7.44 5.52 33.97 "Absent" 47 29.29 24.27 54 0 134 6 3.3 28.45 "Absent" 65 26.09 58.11 40 0 122 4.18 9.05 29.27 "Present" 44 24.05 19.34 52 1 116 2.7 3.69 13.52 "Absent" 55 21.13 18.51 32 0 128 0.5 3.7 12.81 "Present" 66 21.25 22.73 28 0 120 0 3.68 12.24 "Absent" 51 20.52 0.51 20 0 124 0 3.95 36.35 "Present" 59 32.83 9.59 54 0 160 14 5.9 37.12 "Absent" 58 33.87 3.52 54 1 130 2.78 4.89 9.39 "Present" 63 19.3 17.47 25 1 128 2.8 5.53 14.29 "Absent" 64 24.97 0.51 38 0 130 4.5 5.86 37.43 "Absent" 61 31.21 32.3 58 0 109 1.2 6.14 29.26 "Absent" 47 24.72 10.46 40 0 144 0 3.84 18.72 "Absent" 56 22.1 4.8 40 0 118 1.05 3.16 12.98 "Present" 46 22.09 16.35 31 0 136 3.46 6.38 32.25 "Present" 43 28.73 3.13 43 1 136 1.5 6.06 26.54 "Absent" 54 29.38 14.5 33 1 124 15.5 
5.05 24.06 "Absent" 46 23.22 0 61 1 148 6 6.49 26.47 "Absent" 48 24.7 0 55 0 128 6.6 3.58 20.71 "Absent" 55 24.15 0 52 0 122 0.28 4.19 19.97 "Absent" 61 25.63 0 24 0 108 0 2.74 11.17 "Absent" 53 22.61 0.95 20 0 124 3.04 4.8 19.52 "Present" 60 21.78 147.19 41 1 138 8.8 3.12 22.41 "Present" 63 23.33 120.03 55 1 127 0 2.81 15.7 "Absent" 42 22.03 1.03 17 0 174 9.45 5.13 35.54 "Absent" 55 30.71 59.79 53 0 122 0 3.05 23.51 "Absent" 46 25.81 0 38 0 144 6.75 5.45 29.81 "Absent" 53 25.62 26.23 43 1 126 1.8 6.22 19.71 "Absent" 65 24.81 0.69 31 0 208 27.4 3.12 26.63 "Absent" 66 27.45 33.07 62 1 138 0 2.68 17.04 "Absent" 42 22.16 0 16 0 148 0 3.84 17.26 "Absent" 70 20 0 21 0 122 0 3.08 16.3 "Absent" 43 22.13 0 16 0 132 7 3.2 23.26 "Absent" 77 23.64 23.14 49 0 110 12.16 4.99 28.56 "Absent" 44 27.14 21.6 55 1 160 1.52 8.12 29.3 "Present" 54 25.87 12.86 43 1 126 0.54 4.39 21.13 "Present" 45 25.99 0 25 0 162 5.3 7.95 33.58 "Present" 58 36.06 8.23 48 0 194 2.55 6.89 33.88 "Present" 69 29.33 0 41 0 118 0.75 2.58 20.25 "Absent" 59 24.46 0 32 0 124 0 4.79 34.71 "Absent" 49 26.09 9.26 47 0 160 0 2.42 34.46 "Absent" 48 29.83 1.03 61 0 128 0 2.51 29.35 "Present" 53 22.05 1.37 62 0 122 4 5.24 27.89 "Present" 45 26.52 0 61 1 132 2 2.7 21.57 "Present" 50 27.95 9.26 37 0 120 0 2.42 16.66 "Absent" 46 20.16 0 17 0 128 0.04 8.22 28.17 "Absent" 65 26.24 11.73 24 0 108 15 4.91 34.65 "Absent" 41 27.96 14.4 56 0 166 0 4.31 34.27 "Absent" 45 30.14 13.27 56 0 152 0 6.06 41.05 "Present" 51 40.34 0 51 0 170 4.2 4.67 35.45 "Present" 50 27.14 7.92 60 1 156 4 2.05 19.48 "Present" 50 21.48 27.77 39 1 116 8 6.73 28.81 "Present" 41 26.74 40.94 48 1 122 4.4 3.18 11.59 "Present" 59 21.94 0 33 1 150 20 6.4 35.04 "Absent" 53 28.88 8.33 63 0 129 2.15 5.17 27.57 "Absent" 52 25.42 2.06 39 0 134 4.8 6.58 29.89 "Present" 55 24.73 23.66 63 0 126 0 5.98 29.06 "Present" 56 25.39 11.52 64 1 142 0 3.72 25.68 "Absent" 48 24.37 5.25 40 1 128 0.7 4.9 37.42 "Present" 72 35.94 3.09 49 1 102 0.4 3.41 17.22 "Present" 56 23.59 
2.06 39 1 130 0 4.89 25.98 "Absent" 72 30.42 14.71 23 0 138 0.05 2.79 10.35 "Absent" 46 21.62 0 18 0 138 0 1.96 11.82 "Present" 54 22.01 8.13 21 0 128 0 3.09 20.57 "Absent" 54 25.63 0.51 17 0 162 2.92 3.63 31.33 "Absent" 62 31.59 18.51 42 0 160 3 9.19 26.47 "Present" 39 28.25 14.4 54 1 148 0 4.66 24.39 "Absent" 50 25.26 4.03 27 0 124 0.16 2.44 16.67 "Absent" 65 24.58 74.91 23 0 136 3.15 4.37 20.22 "Present" 59 25.12 47.16 31 1 134 2.75 5.51 26.17 "Absent" 57 29.87 8.33 33 0 128 0.73 3.97 23.52 "Absent" 54 23.81 19.2 64 0 122 3.2 3.59 22.49 "Present" 45 24.96 36.17 58 0 152 3 4.64 31.29 "Absent" 41 29.34 4.53 40 0 162 0 5.09 24.6 "Present" 64 26.71 3.81 18 0 124 4 6.65 30.84 "Present" 54 28.4 33.51 60 0 136 5.8 5.9 27.55 "Absent" 65 25.71 14.4 59 0 136 8.8 4.26 32.03 "Present" 52 31.44 34.35 60 0 134 0.05 8.03 27.95 "Absent" 48 26.88 0 60 0 122 1 5.88 34.81 "Present" 69 31.27 15.94 40 1 116 3 3.05 30.31 "Absent" 41 23.63 0.86 44 0 132 0 0.98 21.39 "Absent" 62 26.75 0 53 0 134 0 2.4 21.11 "Absent" 57 22.45 1.37 18 0 160 7.77 8.07 34.8 "Absent" 64 31.15 0 62 1 180 0.52 4.23 16.38 "Absent" 55 22.56 14.77 45 1 124 0.81 6.16 11.61 "Absent" 35 21.47 10.49 26 0 114 0 4.97 9.69 "Absent" 26 22.6 0 25 0 208 7.4 7.41 32.03 "Absent" 50 27.62 7.85 57 0 138 0 3.14 12 "Absent" 54 20.28 0 16 0 164 0.5 6.95 39.64 "Present" 47 41.76 3.81 46 1 144 2.4 8.13 35.61 "Absent" 46 27.38 13.37 60 0 136 7.5 7.39 28.04 "Present" 50 25.01 0 45 1 132 7.28 3.52 12.33 "Absent" 60 19.48 2.06 56 0 143 5.04 4.86 23.59 "Absent" 58 24.69 18.72 42 0 112 4.46 7.18 26.25 "Present" 69 27.29 0 32 1 134 10 3.79 34.72 "Absent" 42 28.33 28.8 52 1 138 2 5.11 31.4 "Present" 49 27.25 2.06 64 1 188 0 5.47 32.44 "Present" 71 28.99 7.41 50 1 110 2.35 3.36 26.72 "Present" 54 26.08 109.8 58 1 136 13.2 7.18 35.95 "Absent" 48 29.19 0 62 0 130 1.75 5.46 34.34 "Absent" 53 29.42 0 58 1 122 0 3.76 24.59 "Absent" 56 24.36 0 30 0 138 0 3.24 27.68 "Absent" 60 25.7 88.66 29 0 130 18 4.13 27.43 "Absent" 54 27.44 0 51 1 126 5.5 
3.78 34.15 "Absent" 55 28.85 3.18 61 0 176 5.76 4.89 26.1 "Present" 46 27.3 19.44 57 0 122 0 5.49 19.56 "Absent" 57 23.12 14.02 27 0 124 0 3.23 9.64 "Absent" 59 22.7 0 16 0 140 5.2 3.58 29.26 "Absent" 70 27.29 20.17 45 1 128 6 4.37 22.98 "Present" 50 26.01 0 47 0 190 4.18 5.05 24.83 "Absent" 45 26.09 82.85 41 0 144 0.76 10.53 35.66 "Absent" 63 34.35 0 55 1 126 4.6 7.4 31.99 "Present" 57 28.67 0.37 60 1 128 0 2.63 23.88 "Absent" 45 21.59 6.54 57 0 136 0.4 3.91 21.1 "Present" 63 22.3 0 56 1 158 4 4.18 28.61 "Present" 42 25.11 0 60 0 160 0.6 6.94 30.53 "Absent" 36 25.68 1.42 64 0 124 6 5.21 33.02 "Present" 64 29.37 7.61 58 1 158 6.17 8.12 30.75 "Absent" 46 27.84 92.62 48 0 128 0 6.34 11.87 "Absent" 57 23.14 0 17 0 166 3 3.82 26.75 "Absent" 45 20.86 0 63 1 146 7.5 7.21 25.93 "Present" 55 22.51 0.51 42 0 161 9 4.65 15.16 "Present" 58 23.76 43.2 46 0 164 13.02 6.26 29.38 "Present" 47 22.75 37.03 54 1 146 5.08 7.03 27.41 "Present" 63 36.46 24.48 37 1 142 4.48 3.57 19.75 "Present" 51 23.54 3.29 49 0 138 12 5.13 28.34 "Absent" 59 24.49 32.81 58 1 154 1.8 7.13 34.04 "Present" 52 35.51 39.36 44 0 118 0 2.39 12.13 "Absent" 49 18.46 0.26 17 1 124 0.61 2.69 17.15 "Present" 61 22.76 11.55 20 0 124 1.04 2.84 16.42 "Present" 46 20.17 0 61 0 136 5 4.19 23.99 "Present" 68 27.8 25.86 35 0 132 9.9 4.63 27.86 "Present" 46 23.39 0.51 52 1 118 0.12 1.96 20.31 "Absent" 37 20.01 2.42 18 0 118 0.12 4.16 9.37 "Absent" 57 19.61 0 17 0 134 12 4.96 29.79 "Absent" 53 24.86 8.23 57 0 114 0.1 3.95 15.89 "Present" 57 20.31 17.14 16 0 136 6.8 7.84 30.74 "Present" 58 26.2 23.66 45 1 130 0 4.16 39.43 "Present" 46 30.01 0 55 1 136 2.2 4.16 38.02 "Absent" 65 37.24 4.11 41 1 136 1.36 3.16 14.97 "Present" 56 24.98 7.3 24 0 154 4.2 5.59 25.02 "Absent" 58 25.02 1.54 43 0 108 0.8 2.47 17.53 "Absent" 47 22.18 0 55 1 136 8.8 4.69 36.07 "Present" 38 26.56 2.78 63 1 174 2.02 6.57 31.9 "Present" 50 28.75 11.83 64 1 124 4.25 8.22 30.77 "Absent" 56 25.8 0 43 0 114 0 2.63 9.69 "Absent" 45 17.89 0 16 0 118 0.12 3.26 
12.26 "Absent" 55 22.65 0 16 0 106 1.08 4.37 26.08 "Absent" 67 24.07 17.74 28 1 146 3.6 3.51 22.67 "Absent" 51 22.29 43.71 42 0 206 0 4.17 33.23 "Absent" 69 27.36 6.17 50 1 134 3 3.17 17.91 "Absent" 35 26.37 15.12 27 0 148 15 4.98 36.94 "Present" 72 31.83 66.27 41 1 126 0.21 3.95 15.11 "Absent" 61 22.17 2.42 17 0 134 0 3.69 13.92 "Absent" 43 27.66 0 19 0 134 0.02 2.8 18.84 "Absent" 45 24.82 0 17 0 123 0.05 4.61 13.69 "Absent" 51 23.23 2.78 16 0 112 0.6 5.28 25.71 "Absent" 55 27.02 27.77 38 1 112 0 1.71 15.96 "Absent" 42 22.03 3.5 16 0 101 0.48 7.26 13 "Absent" 50 19.82 5.19 16 0 150 0.18 4.14 14.4 "Absent" 53 23.43 7.71 44 0 170 2.6 7.22 28.69 "Present" 71 27.87 37.65 56 1 134 0 5.63 29.12 "Absent" 68 32.33 2.02 34 0 142 0 4.19 18.04 "Absent" 56 23.65 20.78 42 1 132 0.1 3.28 10.73 "Absent" 73 20.42 0 17 0 136 0 2.28 18.14 "Absent" 55 22.59 0 17 0 132 12 4.51 21.93 "Absent" 61 26.07 64.8 46 1 166 4.1 4 34.3 "Present" 32 29.51 8.23 53 0 138 0 3.96 24.7 "Present" 53 23.8 0 45 0 138 2.27 6.41 29.07 "Absent" 58 30.22 2.93 32 1 170 0 3.12 37.15 "Absent" 47 35.42 0 53 0 128 0 8.41 28.82 "Present" 60 26.86 0 59 1 136 1.2 2.78 7.12 "Absent" 52 22.51 3.41 27 0 128 0 3.22 26.55 "Present" 39 26.59 16.71 49 0 150 14.4 5.04 26.52 "Present" 60 28.84 0 45 0 132 8.4 3.57 13.68 "Absent" 42 18.75 15.43 59 1 142 2.4 2.55 23.89 "Absent" 54 26.09 59.14 37 0 130 0.05 2.44 28.25 "Present" 67 30.86 40.32 34 0 174 3.5 5.26 21.97 "Present" 36 22.04 8.33 59 1 114 9.6 2.51 29.18 "Absent" 49 25.67 40.63 46 0 162 1.5 2.46 19.39 "Present" 49 24.32 0 59 1 174 0 3.27 35.4 "Absent" 58 37.71 24.95 44 0 190 5.15 6.03 36.59 "Absent" 42 30.31 72 50 0 154 1.4 1.72 18.86 "Absent" 58 22.67 43.2 59 0 124 0 2.28 24.86 "Present" 50 22.24 8.26 38 0 114 1.2 3.98 14.9 "Absent" 49 23.79 25.82 26 0 168 11.4 5.08 26.66 "Present" 56 27.04 2.61 59 1 142 3.72 4.24 32.57 "Absent" 52 24.98 7.61 51 0 154 0 4.81 28.11 "Present" 56 25.67 75.77 59 0 146 4.36 4.31 18.44 "Present" 47 24.72 10.8 38 0 166 6 3.02 29.3 "Absent" 
35 24.38 38.06 61 0 140 8.6 3.9 32.16 "Present" 52 28.51 11.11 64 1 136 1.7 3.53 20.13 "Absent" 56 19.44 14.4 55 0 156 0 3.47 21.1 "Absent" 73 28.4 0 36 1 132 0 6.63 29.58 "Present" 37 29.41 2.57 62 0 128 0 2.98 12.59 "Absent" 65 20.74 2.06 19 0 106 5.6 3.2 12.3 "Absent" 49 20.29 0 39 0 144 0.4 4.64 30.09 "Absent" 30 27.39 0.74 55 0 154 0.31 2.33 16.48 "Absent" 33 24 11.83 17 0 126 3.1 2.01 32.97 "Present" 56 28.63 26.74 45 0 134 6.4 8.49 37.25 "Present" 56 28.94 10.49 51 1 152 19.45 4.22 29.81 "Absent" 28 23.95 0 59 1 146 1.35 6.39 34.21 "Absent" 51 26.43 0 59 1 162 6.94 4.55 33.36 "Present" 52 27.09 32.06 43 0 130 7.28 3.56 23.29 "Present" 20 26.8 51.87 58 1 138 6 7.24 37.05 "Absent" 38 28.69 0 59 0 148 0 5.32 26.71 "Present" 52 32.21 32.78 27 0 124 4.2 2.94 27.59 "Absent" 50 30.31 85.06 30 0 118 1.62 9.01 21.7 "Absent" 59 25.89 21.19 40 0 116 4.28 7.02 19.99 "Present" 68 23.31 0 52 1 162 6.3 5.73 22.61 "Present" 46 20.43 62.54 53 1 138 0.87 1.87 15.89 "Absent" 44 26.76 42.99 31 0 137 1.2 3.14 23.87 "Absent" 66 24.13 45 37 0 198 0.52 11.89 27.68 "Present" 48 28.4 78.99 26 1 154 4.5 4.75 23.52 "Present" 43 25.76 0 53 1 128 5.4 2.36 12.98 "Absent" 51 18.36 6.69 61 0 130 0.08 5.59 25.42 "Present" 50 24.98 6.27 43 1 162 5.6 4.24 22.53 "Absent" 29 22.91 5.66 60 0 120 10.5 2.7 29.87 "Present" 54 24.5 16.46 49 0 136 3.99 2.58 16.38 "Present" 53 22.41 27.67 36 0 176 1.2 8.28 36.16 "Present" 42 27.81 11.6 58 1 134 11.79 4.01 26.57 "Present" 38 21.79 38.88 61 1 122 1.7 5.28 32.23 "Present" 51 24.08 0 54 0 134 0.9 3.18 23.66 "Present" 52 23.26 27.36 58 1 134 0 2.43 22.24 "Absent" 52 26.49 41.66 24 0 136 6.6 6.08 32.74 "Absent" 64 33.28 2.72 49 0 132 4.05 5.15 26.51 "Present" 31 26.67 16.3 50 0 152 1.68 3.58 25.43 "Absent" 50 27.03 0 32 0 132 12.3 5.96 32.79 "Present" 57 30.12 21.5 62 1 124 0.4 3.67 25.76 "Absent" 43 28.08 20.57 34 0 140 4.2 2.91 28.83 "Present" 43 24.7 47.52 48 0 166 0.6 2.42 34.03 "Present" 53 26.96 54 60 0 156 3.02 5.35 25.72 "Present" 53 25.22 28.11 52 1 
132 0.72 4.37 19.54 "Absent" 48 26.11 49.37 28 0 150 0 4.99 27.73 "Absent" 57 30.92 8.33 24 0 134 0.12 3.4 21.18 "Present" 33 26.27 14.21 30 0 126 3.4 4.87 15.16 "Present" 65 22.01 11.11 38 0 148 0.5 5.97 32.88 "Absent" 54 29.27 6.43 42 0 148 8.2 7.75 34.46 "Present" 46 26.53 6.04 64 1 132 6 5.97 25.73 "Present" 66 24.18 145.29 41 0 128 1.6 5.41 29.3 "Absent" 68 29.38 23.97 32 0 128 5.16 4.9 31.35 "Present" 57 26.42 0 64 0 140 0 2.4 27.89 "Present" 70 30.74 144 29 0 126 0 5.29 27.64 "Absent" 25 27.62 2.06 45 0 114 3.6 4.16 22.58 "Absent" 60 24.49 65.31 31 0 118 1.25 4.69 31.58 "Present" 52 27.16 4.11 53 0 126 0.96 4.99 29.74 "Absent" 66 33.35 58.32 38 0 154 4.5 4.68 39.97 "Absent" 61 33.17 1.54 64 1 112 1.44 2.71 22.92 "Absent" 59 24.81 0 52 0 140 8 4.42 33.15 "Present" 47 32.77 66.86 44 0 140 1.68 11.41 29.54 "Present" 74 30.75 2.06 38 1 128 2.6 4.94 21.36 "Absent" 61 21.3 0 31 0 126 19.6 6.03 34.99 "Absent" 49 26.99 55.89 44 0 160 4.2 6.76 37.99 "Present" 61 32.91 3.09 54 1 144 0 4.17 29.63 "Present" 52 21.83 0 59 0 148 4.5 10.49 33.27 "Absent" 50 25.92 2.06 53 1 146 0 4.92 18.53 "Absent" 57 24.2 34.97 26 0 164 5.6 3.17 30.98 "Present" 44 25.99 43.2 53 1 130 0.54 3.63 22.03 "Present" 69 24.34 12.86 39 1 154 2.4 5.63 42.17 "Present" 59 35.07 12.86 50 1 178 0.95 4.75 21.06 "Absent" 49 23.74 24.69 61 0 180 3.57 3.57 36.1 "Absent" 36 26.7 19.95 64 0 134 12.5 2.73 39.35 "Absent" 48 35.58 0 48 0 142 0 3.54 16.64 "Absent" 58 25.97 8.36 27 0 162 7 7.67 34.34 "Present" 33 30.77 0 62 0 218 11.2 2.77 30.79 "Absent" 38 24.86 90.93 48 1 126 8.75 6.06 32.72 "Present" 33 27 62.43 55 1 126 0 3.57 26.01 "Absent" 61 26.3 7.97 47 0 134 6.1 4.77 26.08 "Absent" 47 23.82 1.03 49 0 132 0 4.17 36.57 "Absent" 57 30.61 18 49 0 178 5.5 3.79 23.92 "Present" 45 21.26 6.17 62 1 208 5.04 5.19 20.71 "Present" 52 25.12 24.27 58 1 160 1.15 10.19 39.71 "Absent" 31 31.65 20.52 57 0 116 2.38 5.67 29.01 "Present" 54 27.26 15.77 51 0 180 25.01 3.7 38.11 "Present" 57 30.54 0 61 1 200 19.2 4.43 40.6 
"Present" 55 32.04 36 60 1 112 4.2 3.58 27.14 "Absent" 52 26.83 2.06 40 0 120 0 3.1 26.97 "Absent" 41 24.8 0 16 0 178 20 9.78 33.55 "Absent" 37 27.29 2.88 62 1 166 0.8 5.63 36.21 "Absent" 50 34.72 28.8 60 0 164 8.2 14.16 36.85 "Absent" 52 28.5 17.02 55 1 216 0.92 2.66 19.85 "Present" 49 20.58 0.51 63 1 146 6.4 5.62 33.05 "Present" 57 31.03 0.74 46 0 134 1.1 3.54 20.41 "Present" 58 24.54 39.91 39 1 158 16 5.56 29.35 "Absent" 36 25.92 58.32 60 0 176 0 3.14 31.04 "Present" 45 30.18 4.63 45 0 132 2.8 4.79 20.47 "Present" 50 22.15 11.73 48 0 126 0 4.55 29.18 "Absent" 48 24.94 36 41 0 120 5.5 3.51 23.23 "Absent" 46 22.4 90.31 43 0 174 0 3.86 21.73 "Absent" 42 23.37 0 63 0 150 13.8 5.1 29.45 "Present" 52 27.92 77.76 55 1 176 6 3.98 17.2 "Present" 52 21.07 4.11 61 1 142 2.2 3.29 22.7 "Absent" 44 23.66 5.66 42 1 132 0 3.3 21.61 "Absent" 42 24.92 32.61 33 0 142 1.32 7.63 29.98 "Present" 57 31.16 72.93 33 0 146 1.16 2.28 34.53 "Absent" 50 28.71 45 49 0 132 7.2 3.65 17.16 "Present" 56 23.25 0 34 0 120 0 3.57 23.22 "Absent" 58 27.2 0 32 0 118 0 3.89 15.96 "Absent" 65 20.18 0 16 0 108 0 1.43 26.26 "Absent" 42 19.38 0 16 0 136 0 4 19.06 "Absent" 40 21.94 2.06 16 0 120 0 2.46 13.39 "Absent" 47 22.01 0.51 18 0 132 0 3.55 8.66 "Present" 61 18.5 3.87 16 0 136 0 1.77 20.37 "Absent" 45 21.51 2.06 16 0 138 0 1.86 18.35 "Present" 59 25.38 6.51 17 0 138 0.06 4.15 20.66 "Absent" 49 22.59 2.49 16 0 130 1.22 3.3 13.65 "Absent" 50 21.4 3.81 31 0 130 4 2.4 17.42 "Absent" 60 22.05 0 40 0 110 0 7.14 28.28 "Absent" 57 29 0 32 0 120 0 3.98 13.19 "Present" 47 21.89 0 16 0 166 6 8.8 37.89 "Absent" 39 28.7 43.2 52 0 134 0.57 4.75 23.07 "Absent" 67 26.33 0 37 0 142 3 3.69 25.1 "Absent" 60 30.08 38.88 27 0 136 2.8 2.53 9.28 "Present" 61 20.7 4.55 25 0 142 0 4.32 25.22 "Absent" 47 28.92 6.53 34 1 130 0 1.88 12.51 "Present" 52 20.28 0 17 0 124 1.8 3.74 16.64 "Present" 42 22.26 10.49 20 0 144 4 5.03 25.78 "Present" 57 27.55 90 48 1 136 1.81 3.31 6.74 "Absent" 63 19.57 24.94 24 0 120 0 2.77 13.35 "Absent" 
67 23.37 1.03 18 0 154 5.53 3.2 28.81 "Present" 61 26.15 42.79 42 0 124 1.6 7.22 39.68 "Present" 36 31.5 0 51 1 146 0.64 4.82 28.02 "Absent" 60 28.11 8.23 39 1 128 2.24 2.83 26.48 "Absent" 48 23.96 47.42 27 1 170 0.4 4.11 42.06 "Present" 56 33.1 2.06 57 0 214 0.4 5.98 31.72 "Absent" 64 28.45 0 58 0 182 4.2 4.41 32.1 "Absent" 52 28.61 18.72 52 1 108 3 1.59 15.23 "Absent" 40 20.09 26.64 55 0 118 5.4 11.61 30.79 "Absent" 64 27.35 23.97 40 0 132 0 4.82 33.41 "Present" 62 14.7 0 46 1
================================================
FILE: 2017/examples/02_feed_dict.py
================================================
""" Example to demonstrate the use of feed_dict
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: feed_dict with placeholder
# create a placeholder of type float 32-bit, value is a vector of 3 elements
a = tf.placeholder(tf.float32, shape=[3])

# create a constant of type float 32-bit, value is a vector of 3 elements
b = tf.constant([5, 5, 5], tf.float32)

# use the placeholder as you would a constant
c = a + b # short for tf.add(a, b)

with tf.Session() as sess:
    # print(sess.run(c)) # InvalidArgumentError because a doesn’t have any value

    # feed [1, 2, 3] to placeholder a via the dict {a: [1, 2, 3]}
    # fetch value of c
    print(sess.run(c, {a: [1, 2, 3]})) # >> [6. 7. 8.]

# Example 2: feed_dict with ordinary tensors (any feedable tensor, not only placeholders)
a = tf.add(2, 5)
b = tf.multiply(a, 3)

with tf.Session() as sess:
    # define a dictionary that says to replace the value of 'a' with 15
    replace_dict = {a: 15}

    # Run the session, passing in 'replace_dict' as the value to 'feed_dict'
    print(sess.run(b, feed_dict=replace_dict)) # >> 45
================================================
FILE: 2017/examples/02_lazy_loading.py
================================================
""" Example to demonstrate how the graph definition gets
bloated because of lazy loading
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

########################################
## NORMAL LOADING                     ##
## print out a graph with 1 Add node  ##
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')
z = tf.add(x, y)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs/l2', sess.graph)
    for _ in range(10):
        sess.run(z)
    print(tf.get_default_graph().as_graph_def())
    writer.close()

########################################
## LAZY LOADING                       ##
## print out a graph with 10 Add nodes##
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs/l2', sess.graph)
    for _ in range(10):
        # tf.add is called inside the loop, so a new Add node is created on every iteration
        sess.run(tf.add(x, y))
    print(tf.get_default_graph().as_graph_def())
    writer.close()
================================================
FILE: 2017/examples/02_simple_tf.py
================================================
""" Some simple TensorFlow's ops
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf

a = tf.constant(2)
b = tf.constant(3)
x = tf.add(a, b)
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    print(sess.run(x))
writer.close() # close the writer when you’re done using it

a = tf.constant([2, 2], name='a')
b = tf.constant([[0, 1], [2, 3]], name='b')
x = tf.multiply(a, b, name='dot_product') # element-wise product with broadcasting
with tf.Session() as sess:
    print(sess.run(x))
# >> [[0 2]
#     [4 6]]

# tf.zeros(shape, dtype=tf.float32, name=None)
# creates a tensor of the given shape with all elements set to zero (when run in a session)

x = tf.zeros([2, 3], tf.int32)
y = tf.zeros_like(x, optimize=True)
print(y)
print(tf.get_default_graph().as_graph_def())
with tf.Session() as sess:
    y = sess.run(y)

with tf.Session() as sess:
    print(sess.run(tf.linspace(10.0, 13.0, 4)))
    print(sess.run(tf.range(5)))
for i in np.arange(5):
    print(i)

samples = tf.multinomial(tf.constant([[1., 3., 1]]), 5)

with tf.Session() as sess:
    for _ in range(10):
        print(sess.run(samples))

t_0 = 19
x = tf.zeros_like(t_0) # ==> 0
y = tf.ones_like(t_0) # ==> 1

with tf.Session() as sess:
    print(sess.run([x, y]))

t_1 = ['apple', 'peach', 'banana']
x = tf.zeros_like(t_1) # ==> ['' '' '']
# the next line is left commented out because it raises at graph-construction time
# y = tf.ones_like(t_1) # ==> TypeError: Expected string, got 1 of type 'int' instead.

t_2 = [[True, False, False],
       [False, False, True],
       [False, True, False]]
x = tf.zeros_like(t_2) # ==> 3x3 tensor, all elements are False
y = tf.ones_like(t_2) # ==> 3x3 tensor, all elements are True
with tf.Session() as sess:
    print(sess.run([x, y]))

with tf.variable_scope('meh') as scope:
    a = tf.get_variable('a', [10])
    b = tf.get_variable('b', [100])

writer = tf.summary.FileWriter('test', tf.get_default_graph())

x = tf.Variable(2.0)
y = 2.0 * (x ** 3)
z = 3.0 + y ** 2
grad_z = tf.gradients(z, [x, y])
with tf.Session() as sess:
    sess.run(x.initializer)
    print(sess.run(grad_z))
================================================
FILE: 2017/examples/02_variables.py
================================================
""" Example to demonstrate the ops of tf.Variables() """
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: how to run assign op
W = tf.Variable(10)
assign_op = W.assign(100)

with tf.Session() as sess:
    sess.run(W.initializer)
    print(W.eval()) # >> 10
    print(sess.run(assign_op)) # >> 100

# Example 2: tricky example
# create a variable whose original value is 2
my_var = tf.Variable(2, name="my_var")

# assign 2 * my_var to my_var and run the op my_var_times_two
my_var_times_two = my_var.assign(2 * my_var)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(my_var_times_two)) # >> 4
    print(sess.run(my_var_times_two)) # >> 8
    print(sess.run(my_var_times_two)) # >> 16

# Example 3: each session maintains its own copy of variables
W = tf.Variable(10)
sess1 = tf.Session()
sess2 = tf.Session()

# You have to initialize W at each session
sess1.run(W.initializer)
sess2.run(W.initializer)

print(sess1.run(W.assign_add(10))) # >> 20
print(sess2.run(W.assign_sub(2))) # >> 8

print(sess1.run(W.assign_add(100))) # >> 120
print(sess2.run(W.assign_sub(50))) # >> -42

sess1.close()
sess2.close()
================================================
FILE: 2017/examples/03_linear_regression_sol.py
================================================
""" Simple linear regression example in TensorFlow
This program tries to predict the number of thefts from
the number of fires in the city of Chicago
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import xlrd

import utils

DATA_FILE = 'data/fire_theft.xls'

# Step 1: read in data from the .xls file
book = xlrd.open_workbook(DATA_FILE, encoding_override="utf-8")
sheet = book.sheet_by_index(0)
data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])
n_samples = sheet.nrows - 1

# Step 2: create placeholders for input X (number of fires) and label Y (number of thefts)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')

# Step 3: create weight and bias, initialized to 0
w = tf.Variable(0.0, name='weights')
b = tf.Variable(0.0, name='bias')

# Step 4: build model to predict Y
Y_predicted = X * w + b

# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# loss = utils.huber_loss(Y, Y_predicted)

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())

    writer = tf.summary.FileWriter('./graphs/linear_reg', sess.graph)

    # Step 8: train the model
    for i in range(50): # train the model for 50 epochs
        total_loss = 0
        for x, y in data:
            # Session runs the training op (optimizer) and fetches the value of loss
            _, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y})
            total_loss += l
        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    writer.close()

    # Step 9: output the values of w and b
    w, b = sess.run([w, b])

# plot the results
X, Y = data.T[0], data.T[1]
plt.plot(X, Y, 'bo', label='Real data')
plt.plot(X, X * w + b, 'r', label='Predicted data')
plt.legend()
plt.show()
================================================
FILE: 2017/examples/03_linear_regression_starter.py
================================================
""" Simple linear regression example in TensorFlow
This program tries to predict the number of thefts from
the number of fires in the city of Chicago
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import xlrd

import utils

DATA_FILE = 'data/fire_theft.xls'

# Phase 1: Assemble the graph
# Step 1: read in data from the .xls file
book = xlrd.open_workbook(DATA_FILE, encoding_override='utf-8')
sheet = book.sheet_by_index(0)
data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])
n_samples = sheet.nrows - 1

# Step 2: create placeholders for input X (number of fires) and label Y (number of thefts)
# Both have the type float32

# Step 3: create weight and bias, initialized to 0
# name your variables w and b

# Step 4: predict Y (number of thefts) from the number of fires
# name your variable Y_predicted

# Step 5: use the square error as the loss function
# name your variable loss

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss

# Phase 2: Train our model
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    # TO - DO

    # Step 8: train the model
    for i in range(50): # run 50 epochs
        total_loss = 0
        for x, y in data:
            # Session runs optimizer to minimize loss and fetch the value of loss.
            # Name the received value as l
            # TO DO: write sess.run()
            total_loss += l
        print("Epoch {0}: {1}".format(i, total_loss/n_samples))

# plot the results
# X, Y = data.T[0], data.T[1]
# plt.plot(X, Y, 'bo', label='Real data')
# plt.plot(X, X * w + b, 'r', label='Predicted data')
# plt.legend()
# plt.show()
================================================
FILE: 2017/examples/03_logistic_regression_mnist_sol.py
================================================
""" Simple logistic regression model to solve OCR task
with MNIST in TensorFlow
MNIST dataset: yann.lecun.com/exdb/mnist/
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import time

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
mnist = input_data.read_data_sets('/data/mnist', one_hot=True)

# Step 2: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# there are 10 classes for each image, corresponding to digits 0 - 9.
# each label is a one-hot vector.
X = tf.placeholder(tf.float32, [batch_size, 784], name='X_placeholder')
Y = tf.placeholder(tf.int32, [batch_size, 10], name='Y_placeholder')

# Step 3: create weights and bias
# w is initialized to random variables with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w = tf.Variable(tf.random_normal(shape=[784, 10], stddev=0.01), name='weights')
b = tf.Variable(tf.zeros([1, 10]), name="bias")

# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
logits = tf.matmul(X, w) + b

# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')
loss = tf.reduce_mean(entropy) # computes the mean over all the examples in the batch

# Step 6: define training op
# using the Adam optimizer with learning rate of 0.01 to minimize loss
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

with tf.Session() as sess:
    # to visualize using TensorBoard
    writer = tf.summary.FileWriter('./graphs/logistic_reg', sess.graph)

    start_time = time.time()
    sess.run(tf.global_variables_initializer())
    n_batches = int(mnist.train.num_examples/batch_size)
    for i in range(n_epochs): # train the model n_epochs times
        total_loss = 0

        for _ in range(n_batches):
            X_batch, Y_batch = mnist.train.next_batch(batch_size)
            _, loss_batch = sess.run([optimizer, loss], feed_dict={X: X_batch, Y:Y_batch})
            total_loss += loss_batch
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))

    print('Total time: {0} seconds'.format(time.time() - start_time))

    print('Optimization Finished!') # should be around 0.35 after 25 epochs

    # test the model
    preds = tf.nn.softmax(logits)
    correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) # need numpy.count_nonzero(boolarr) :(

    n_batches = int(mnist.test.num_examples/batch_size)
    total_correct_preds = 0

    for i in range(n_batches):
        X_batch, Y_batch = mnist.test.next_batch(batch_size)
        # fetch the tensor itself (not a one-element list) so the result is a scalar
        accuracy_batch = sess.run(accuracy, feed_dict={X: X_batch, Y:Y_batch})
        total_correct_preds += accuracy_batch

    print('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))

    writer.close()
================================================
FILE: 2017/examples/03_logistic_regression_mnist_starter.py
================================================
""" Starter code for logistic regression
model to solve OCR task with MNIST in TensorFlow
MNIST dataset: yann.lecun.com/exdb/mnist/
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import time

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 10

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
mnist = input_data.read_data_sets('/data/mnist', one_hot=True)

# Step 2: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# there are 10 classes for each image, corresponding to digits 0 - 9.
# Features are of the type float, and labels are of the type int

# Step 3: create weights and bias
# weights and biases are initialized to 0
# shape of w depends on the dimension of X and Y so that Y = X * w + b
# shape of b depends on Y

# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
# to get the probability distribution of possible label of the image
# DO NOT DO SOFTMAX HERE

# Step 5: define loss function
# use cross entropy loss of the real labels with the softmax of logits
# use the method:
# tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)
# then use tf.reduce_mean to get the mean loss of the batch

# Step 6: define training op
# using gradient descent to minimize loss

with tf.Session() as sess:
    start_time = time.time()
    sess.run(tf.global_variables_initializer())
    n_batches = int(mnist.train.num_examples/batch_size)
    for i in range(n_epochs): # train the model n_epochs times
        total_loss = 0

        for _ in range(n_batches):
            X_batch, Y_batch = mnist.train.next_batch(batch_size)
            # TO-DO: run optimizer + fetch loss_batch
            #
            #
            total_loss += loss_batch
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))

    print('Total time: {0} seconds'.format(time.time() - start_time))

    print('Optimization Finished!') # should be around 0.35 after 25 epochs

    # test the model
    preds = tf.nn.softmax(logits)
    correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))
    accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) # need numpy.count_nonzero(boolarr) :(

    n_batches = int(mnist.test.num_examples/batch_size)
    total_correct_preds = 0

    for i in range(n_batches):
        X_batch, Y_batch = mnist.test.next_batch(batch_size)
        # fetch the tensor itself (not a one-element list) so the result is a scalar
        accuracy_batch = sess.run(accuracy, feed_dict={X: X_batch, Y:Y_batch})
        total_correct_preds += accuracy_batch

    print('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))
================================================
FILE: 2017/examples/04_word2vec_no_frills.py
================================================
""" The no frills implementation of word2vec skip-gram model using NCE loss.
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

from process_data import process_data

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64 # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 10000
SKIP_STEP = 2000 # how many steps to skip before reporting the loss

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    with tf.name_scope('data'):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
        target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')

    # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU
    # Step 2: define weights.
    # In word2vec, it's actually the weights that we care about
    with tf.name_scope('embedding_matrix'):
        embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0),
                                   name='embed_matrix')

    # Step 3: define the inference
    with tf.name_scope('loss'):
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

        # Step 4: construct variables for NCE loss
        nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                                     stddev=1.0 / (EMBED_SIZE ** 0.5)),
                                 name='nce_weight')
        nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

        # define loss function to be NCE loss function
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                             biases=nce_bias,
                                             labels=target_words,
                                             inputs=embed,
                                             num_sampled=NUM_SAMPLED,
                                             num_classes=VOCAB_SIZE), name='loss')

    # Step 5: define optimizer
    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('./graphs/no_frills/', sess.graph)
        for index in range(NUM_TRAIN_STEPS):
            centers, targets = next(batch_gen)
            loss_batch, _ = sess.run([loss, optimizer],
                                     feed_dict={center_words: centers, target_words: targets})
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        writer.close()

def main():
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    word2vec(batch_gen)

if __name__ == '__main__':
    main()
================================================
FILE: 2017/examples/04_word2vec_starter.py
================================================
""" The no frills implementation of word2vec skip-gram model using NCE loss.
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

from process_data import process_data

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64 # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 20000
SKIP_STEP = 2000 # how many steps to skip before reporting the loss

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    # center_words have to be int to work on embedding lookup

    # TO DO

    # Step 2: define weights. In word2vec, it's actually the weights that we care about
    # vocab size x embed size
    # initialized to random uniform -1 to 1

    # TO DO

    # Step 3: define the inference
    # get the embed of input words using tf.nn.embedding_lookup
    # embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

    # TO DO

    # Step 4: construct variables for NCE loss
    # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
    # nce_weight (vocab size x embed size), initialized to truncated_normal stddev=1.0 / (EMBED_SIZE ** 0.5)
    # bias: vocab size, initialized to 0

    # TO DO

    # define loss function to be NCE loss function
    # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
    # need to get the mean across the batch
    # note: you should use embedding of center words for inputs, not center words themselves

    # TO DO

    # Step 5: define optimizer

    # TO DO

    with tf.Session() as sess:
        # TO DO: initialize variables

        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('./graphs/no_frills/', sess.graph)
        for index in range(NUM_TRAIN_STEPS):
            centers, targets = next(batch_gen)
            # TO DO: create feed_dict, run optimizer, fetch loss_batch

            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        writer.close()

def main():
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    word2vec(batch_gen)

if __name__ == '__main__':
    main()
================================================
FILE: 2017/examples/04_word2vec_visualize.py
================================================
""" word2vec with NCE loss
and code to visualize the embeddings on TensorBoard
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

from process_data import process_data
import utils

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64 # Number of negative examples to sample.
LEARNING_RATE = 1.0 NUM_TRAIN_STEPS = 100000 WEIGHTS_FLD = 'processed/' SKIP_STEP = 2000 class SkipGramModel: """ Build the graph for word2vec model """ def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate): self.vocab_size = vocab_size self.embed_size = embed_size self.batch_size = batch_size self.num_sampled = num_sampled self.lr = learning_rate self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') def _create_placeholders(self): """ Step 1: define the placeholders for input and output """ with tf.name_scope("data"): self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size], name='center_words') self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1], name='target_words') def _create_embedding(self): """ Step 2: define weights. In word2vec, it's actually the weights that we care about """ # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU with tf.device('/cpu:0'): with tf.name_scope("embed"): self.embed_matrix = tf.Variable(tf.random_uniform([self.vocab_size, self.embed_size], -1.0, 1.0), name='embed_matrix') def _create_loss(self): """ Step 3 + 4: define the model + the loss function """ with tf.device('/cpu:0'): with tf.name_scope("loss"): # Step 3: define the inference embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed') # Step 4: define loss function # construct variables for NCE loss nce_weight = tf.Variable(tf.truncated_normal([self.vocab_size, self.embed_size], stddev=1.0 / (self.embed_size ** 0.5)), name='nce_weight') nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias') # define loss function to be NCE loss function self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, biases=nce_bias, labels=self.target_words, inputs=embed, num_sampled=self.num_sampled, num_classes=self.vocab_size), name='loss') def _create_optimizer(self): """ Step 5: define optimizer """ with tf.device('/cpu:0'): 
self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, global_step=self.global_step) def _create_summaries(self): with tf.name_scope("summaries"): tf.summary.scalar("loss", self.loss) tf.summary.histogram("histogram loss", self.loss) # because we have several summaries, we should merge them all # into one op to make it easier to manage self.summary_op = tf.summary.merge_all() def build_graph(self): """ Build the graph for our model """ self._create_placeholders() self._create_embedding() self._create_loss() self._create_optimizer() self._create_summaries() def train_model(model, batch_gen, num_train_steps, weights_fld): saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias initial_step = 0 utils.make_dir('checkpoints') with tf.Session() as sess: sess.run(tf.global_variables_initializer()) ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint')) # if that checkpoint exists, restore from checkpoint if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps writer = tf.summary.FileWriter('improved_graph/lr' + str(LEARNING_RATE), sess.graph) initial_step = model.global_step.eval() for index in range(initial_step, initial_step + num_train_steps): centers, targets = next(batch_gen) feed_dict={model.center_words: centers, model.target_words: targets} loss_batch, _, summary = sess.run([model.loss, model.optimizer, model.summary_op], feed_dict=feed_dict) writer.add_summary(summary, global_step=index) total_loss += loss_batch if (index + 1) % SKIP_STEP == 0: print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP)) total_loss = 0.0 saver.save(sess, 'checkpoints/skip-gram', index) #################### # code to visualize the embeddings.
uncomment the below to visualize embeddings # run "tensorboard --logdir='processed'" to see the embeddings # final_embed_matrix = sess.run(model.embed_matrix) # # it has to be a variable. constants don't work here. you can't reuse model.embed_matrix # embedding_var = tf.Variable(final_embed_matrix[:1000], name='embedding') # sess.run(embedding_var.initializer) # config = projector.ProjectorConfig() # summary_writer = tf.summary.FileWriter('processed') # # add embedding to the config file # embedding = config.embeddings.add() # embedding.tensor_name = embedding_var.name # # link this tensor to its metadata file, in this case the first 1000 words of vocab # embedding.metadata_path = 'processed/vocab_1000.tsv' # # saves a configuration file that TensorBoard will read during startup. # projector.visualize_embeddings(summary_writer, config) # saver_embed = tf.train.Saver([embedding_var]) # saver_embed.save(sess, 'processed/model3.ckpt', 1) def main(): model = SkipGramModel(VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE) model.build_graph() batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW) train_model(model, batch_gen, NUM_TRAIN_STEPS, WEIGHTS_FLD) if __name__ == '__main__': main() ================================================ FILE: 2017/examples/05_csv_reader.py ================================================ """ Some people tried to use TextLineReader for assignment 1 but seem to have problems getting it to work, so here is a short script demonstrating the use of CSV reader on the heart dataset. Note that the heart dataset is originally in txt so I first converted it to csv to take advantage of the already laid out columns. You can download heart.csv in the data folder.
Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import sys sys.path.append('..') import tensorflow as tf DATA_PATH = 'data/heart.csv' BATCH_SIZE = 2 N_FEATURES = 9 def batch_generator(filenames): """ filenames is the list of files you want to read from. In this case, it contains only heart.csv """ filename_queue = tf.train.string_input_producer(filenames) reader = tf.TextLineReader(skip_header_lines=1) # skip the first line in the file _, value = reader.read(filename_queue) # record_defaults are the default values in case some of our columns are empty # This is also to tell tensorflow the format of our data (the type of the decode result) # for this dataset, out of 9 feature columns, # 8 of them are floats (some are integers, but to make our features homogeneous, # we consider them floats), and 1 is string (at position 5) # the last column, which corresponds to the label, is an integer record_defaults = [[1.0] for _ in range(N_FEATURES)] record_defaults[4] = [''] record_defaults.append([1]) # decode one line of the file into its 10 columns content = tf.decode_csv(value, record_defaults=record_defaults) # convert the 5th column (present/absent) to the binary values 0 and 1 content[4] = tf.cond(tf.equal(content[4], tf.constant('Present')), lambda: tf.constant(1.0), lambda: tf.constant(0.0)) # pack all 9 features into a tensor features = tf.stack(content[:N_FEATURES]) # assign the last column to label label = content[-1] # minimum number of elements in the queue after a dequeue, used to ensure # that the samples are sufficiently mixed # I think 10 times the BATCH_SIZE is sufficient min_after_dequeue = 10 * BATCH_SIZE # the maximum number of elements in the queue capacity = 20 * BATCH_SIZE # shuffle the data to generate BATCH_SIZE sample pairs data_batch, label_batch = tf.train.shuffle_batch([features, label], batch_size=BATCH_SIZE, capacity=capacity, min_after_dequeue=min_after_dequeue) 
return data_batch, label_batch def generate_batches(data_batch, label_batch): with tf.Session() as sess: coord = tf.train.Coordinator() threads = tf.train.start_queue_runners(coord=coord) for _ in range(10): # generate 10 batches features, labels = sess.run([data_batch, label_batch]) print(features) coord.request_stop() coord.join(threads) def main(): data_batch, label_batch = batch_generator([DATA_PATH]) generate_batches(data_batch, label_batch) if __name__ == '__main__': main() ================================================ FILE: 2017/examples/05_randomization.py ================================================ """ Examples to demonstrate ops level randomization Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import tensorflow as tf # Example 1: session is the thing that keeps track of random state c = tf.random_uniform([], -10, 10, seed=2) with tf.Session() as sess: print(sess.run(c)) # >> 3.57493 print(sess.run(c)) # >> -5.97319 # Example 2: each new session will start the random state all over again. c = tf.random_uniform([], -10, 10, seed=2) with tf.Session() as sess: print(sess.run(c)) # >> 3.57493 with tf.Session() as sess: print(sess.run(c)) # >> 3.57493 # Example 3: with operation level random seed, each op keeps its own seed. 
c = tf.random_uniform([], -10, 10, seed=2) d = tf.random_uniform([], -10, 10, seed=2) with tf.Session() as sess: print(sess.run(c)) # >> 3.57493 print(sess.run(d)) # >> 3.57493 # Example 4: graph level random seed tf.set_random_seed(2) c = tf.random_uniform([], -10, 10) d = tf.random_uniform([], -10, 10) with tf.Session() as sess: print(sess.run(c)) # >> 9.12393 print(sess.run(d)) # >> -4.53404 ================================================ FILE: 2017/examples/07_basic_filters.py ================================================ """ Simple examples of convolution to do some basic filters Also demonstrates the use of TensorFlow data readers. We will use some popular filters for our image. It seems to be working with grayscale images, but not with rgb images. It's probably because I didn't choose the right kernels for rgb images. kernels for rgb images have dimensions 3 x 3 x 3 x 3 kernels for grayscale images have dimensions 3 x 3 x 1 x 1 Note: When you call tf.train.string_input_producer, a tf.train.QueueRunner is added to the graph, which must be run using e.g. tf.train.start_queue_runners() else your session will run into deadlock and your program will crash. And to run QueueRunner, you need a coordinator to close to your queue for you. Without coordinator, your threads will keep on running outside session and you will have the error: ERROR:tensorflow:Exception in QueueRunner: Attempted to use a closed Session. 
Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import sys sys.path.append('..') from matplotlib import gridspec as gridspec from matplotlib import pyplot as plt import tensorflow as tf import kernels FILENAME = 'data/friday.jpg' def read_one_image(filename): """ This is just to demonstrate how to open an image in TensorFlow, but it's actually a lot easier to use Pillow """ filename_queue = tf.train.string_input_producer([filename]) image_reader = tf.WholeFileReader() _, image_file = image_reader.read(filename_queue) image = tf.image.decode_jpeg(image_file, channels=3) image = tf.cast(image, tf.float32) / 256.0 # cast to float to make conv2d work return image def convolve(image, kernels, rgb=True, strides=[1, 3, 3, 1], padding='SAME'): images = [image[0]] for i, kernel in enumerate(kernels): filtered_image = tf.nn.conv2d(image, kernel, strides=strides, padding=padding)[0] if i == 2: filtered_image = tf.minimum(tf.nn.relu(filtered_image), 255) images.append(filtered_image) return images def get_real_images(images): with tf.Session() as sess: coord = tf.train.Coordinator() threads = tf.train.start_queue_runners(coord=coord) images = sess.run(images) coord.request_stop() coord.join(threads) return images def show_images(images, rgb=True): gs = gridspec.GridSpec(1, len(images)) for i, image in enumerate(images): plt.subplot(gs[0, i]) if rgb: plt.imshow(image) else: image = image.reshape(image.shape[0], image.shape[1]) plt.imshow(image, cmap='gray') plt.axis('off') plt.show() def main(): rgb = False if rgb: kernels_list = [kernels.BLUR_FILTER_RGB, kernels.SHARPEN_FILTER_RGB, kernels.EDGE_FILTER_RGB, kernels.TOP_SOBEL_RGB, kernels.EMBOSS_FILTER_RGB] else: kernels_list = [kernels.BLUR_FILTER, kernels.SHARPEN_FILTER, kernels.EDGE_FILTER, kernels.TOP_SOBEL, kernels.EMBOSS_FILTER] image = read_one_image(FILENAME) if not rgb: image = 
tf.image.rgb_to_grayscale(image) image = tf.expand_dims(image, 0) # to make it into a batch of 1 element images = convolve(image, kernels_list, rgb) images = get_real_images(images) show_images(images, rgb) if __name__ == '__main__': main() ================================================ FILE: 2017/examples/07_convnet_mnist.py ================================================ """ Using convolutional net on MNIST dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/) Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import tensorflow as tf from tensorflow.contrib import layers from tensorflow.examples.tutorials.mnist import input_data import utils N_CLASSES = 10 # Step 1: Read in data # using TF Learn's built in function to load MNIST data to the folder data/mnist mnist = input_data.read_data_sets("/data/mnist", one_hot=True) # Step 2: Define parameters for the model LEARNING_RATE = 0.001 BATCH_SIZE = 128 SKIP_STEP = 10 DROPOUT = 0.75 N_EPOCHS = 1 # Step 3: create placeholders for features and labels # each image in the MNIST data is of shape 28*28 = 784 # therefore, each image is represented with a 1x784 tensor # We'll be doing dropout for hidden layer so we'll need a placeholder # for the dropout probability too # Use None for shape so we can change the batch_size once we've built the graph with tf.name_scope('data'): X = tf.placeholder(tf.float32, [None, 784], name="X_placeholder") Y = tf.placeholder(tf.float32, [None, 10], name="Y_placeholder") dropout = tf.placeholder(tf.float32, name='dropout') # Step 4 + 5: create weights + do inference # the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') with
tf.variable_scope('conv1') as scope: # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d images = tf.reshape(X, shape=[-1, 28, 28, 1]) kernel = tf.get_variable('kernel', [5, 5, 1, 32], initializer=tf.truncated_normal_initializer()) biases = tf.get_variable('biases', [32], initializer=tf.random_normal_initializer()) conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME') conv1 = tf.nn.relu(conv + biases, name=scope.name) # output is of dimension BATCH_SIZE x 28 x 28 x 32 conv1 = layers.conv2d(images, 32, 5, 1, activation_fn=tf.nn.relu, padding='SAME') with tf.variable_scope('pool1') as scope: pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') # output is of dimension BATCH_SIZE x 14 x 14 x 32 with tf.variable_scope('conv2') as scope: # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64 kernel = tf.get_variable('kernels', [5, 5, 32, 64], initializer=tf.truncated_normal_initializer()) biases = tf.get_variable('biases', [64], initializer=tf.random_normal_initializer()) conv = tf.nn.conv2d(pool1, kernel, strides=[1, 1, 1, 1], padding='SAME') conv2 = tf.nn.relu(conv + biases, name=scope.name) # output is of dimension BATCH_SIZE x 14 x 14 x 64 # layers.conv2d(images, 64, 5, 1, activation_fn=tf.nn.relu, padding='SAME') with tf.variable_scope('pool2') as scope: # similar to pool1 pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') # output is of dimension BATCH_SIZE x 7 x 7 x 64 with tf.variable_scope('fc') as scope: # use weight of dimension 7 * 7 * 64 x 1024 input_features = 7 * 7 * 64 w = tf.get_variable('weights', [input_features, 1024], initializer=tf.truncated_normal_initializer()) b = tf.get_variable('biases', [1024], initializer=tf.constant_initializer(0.0)) # reshape pool2 to 2 dimensional pool2 = tf.reshape(pool2, [-1, input_features]) fc = tf.nn.relu(tf.matmul(pool2, w) + b, name='relu') # pool2 = layers.flatten(pool2) # 
fc = layers.fully_connected(pool2, 1024, tf.nn.relu) fc = tf.nn.dropout(fc, dropout, name='relu_dropout') with tf.variable_scope('softmax_linear') as scope: w = tf.get_variable('weights', [1024, N_CLASSES], initializer=tf.truncated_normal_initializer()) b = tf.get_variable('biases', [N_CLASSES], initializer=tf.random_normal_initializer()) logits = tf.matmul(fc, w) + b # Step 6: define loss function # use softmax cross entropy with logits as the loss function # compute mean cross entropy, softmax is applied internally with tf.name_scope('loss'): entropy = tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits) loss = tf.reduce_mean(entropy, name='loss') with tf.name_scope('summaries'): tf.summary.scalar('loss', loss) tf.summary.histogram('histogram loss', loss) summary_op = tf.summary.merge_all() # Step 7: define training op # using gradient descent with learning rate of LEARNING_RATE to minimize cost optimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss, global_step=global_step) utils.make_dir('checkpoints') utils.make_dir('checkpoints/convnet_mnist') with tf.Session() as sess: sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() # to visualize using TensorBoard writer = tf.summary.FileWriter('./graphs/convnet', sess.graph) ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint')) # if that checkpoint exists, restore from checkpoint if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) initial_step = global_step.eval() start_time = time.time() n_batches = int(mnist.train.num_examples / BATCH_SIZE) total_loss = 0.0 for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE) _, loss_batch, summary = sess.run([optimizer, loss, summary_op], feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) writer.add_summary(summary, global_step=index) total_loss += loss_batch if (index + 1) % 
SKIP_STEP == 0: print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP)) total_loss = 0.0 saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index) print("Optimization Finished!") # should be around 0.35 after 25 epochs print("Total time: {0} seconds".format(time.time() - start_time)) # test the model n_batches = int(mnist.test.num_examples/BATCH_SIZE) total_correct_preds = 0 for i in range(n_batches): X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE) # evaluation only: fetch loss and logits without running the optimizer, so the model is not trained on test data loss_batch, logits_batch = sess.run([loss, logits], feed_dict={X: X_batch, Y:Y_batch, dropout: 1.0}) preds = tf.nn.softmax(logits_batch) correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1)) accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) total_correct_preds += sess.run(accuracy) print("Accuracy {0}".format(total_correct_preds/mnist.test.num_examples)) ================================================ FILE: 2017/examples/07_convnet_mnist_starter.py ================================================ """ Using convolutional net on MNIST dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/) Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import tensorflow as tf from tensorflow.examples.tutorials.mnist import input_data import utils N_CLASSES = 10 # Step 1: Read in data # using TF Learn's built in function to load MNIST data to the folder data/mnist mnist = input_data.read_data_sets("/data/mnist", one_hot=True) # Step 2: Define parameters for the model LEARNING_RATE = 0.001 BATCH_SIZE = 128 SKIP_STEP = 10 DROPOUT = 0.75 N_EPOCHS = 1 # Step 3: create placeholders for features and labels # each image in the MNIST data is of shape 28*28 = 784 # therefore, each image is represented with a 1x784 tensor # 
We'll be doing dropout for hidden layer so we'll need a placeholder # for the dropout probability too # Use None for shape so we can change the batch_size once we've built the graph with tf.name_scope('data'): X = tf.placeholder(tf.float32, [None, 784], name="X_placeholder") Y = tf.placeholder(tf.float32, [None, 10], name="Y_placeholder") dropout = tf.placeholder(tf.float32, name='dropout') # Step 4 + 5: create weights + do inference # the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') utils.make_dir('checkpoints') utils.make_dir('checkpoints/convnet_mnist') with tf.variable_scope('conv1') as scope: # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d # use the dynamic dimension -1 images = tf.reshape(X, shape=[-1, 28, 28, 1]) # TO DO # create kernel variable of dimension [5, 5, 1, 32] # use tf.truncated_normal_initializer() # TO DO # create biases variable of dimension [32] # use tf.constant_initializer(0.0) # TO DO # apply tf.nn.conv2d. 
strides [1, 1, 1, 1], padding is 'SAME' # TO DO # apply relu on the sum of convolution output and biases # TO DO # output is of dimension BATCH_SIZE x 28 x 28 x 32 with tf.variable_scope('pool1') as scope: # apply max pool with ksize [1, 2, 2, 1], and strides [1, 2, 2, 1], padding 'SAME' # TO DO # output is of dimension BATCH_SIZE x 14 x 14 x 32 with tf.variable_scope('conv2') as scope: # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64 kernel = tf.get_variable('kernels', [5, 5, 32, 64], initializer=tf.truncated_normal_initializer()) biases = tf.get_variable('biases', [64], initializer=tf.random_normal_initializer()) conv = tf.nn.conv2d(pool1, kernel, strides=[1, 1, 1, 1], padding='SAME') conv2 = tf.nn.relu(conv + biases, name=scope.name) # output is of dimension BATCH_SIZE x 14 x 14 x 64 with tf.variable_scope('pool2') as scope: # similar to pool1 pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') # output is of dimension BATCH_SIZE x 7 x 7 x 64 with tf.variable_scope('fc') as scope: # use weight of dimension 7 * 7 * 64 x 1024 input_features = 7 * 7 * 64 # create weights and biases # TO DO # reshape pool2 to 2 dimensional pool2 = tf.reshape(pool2, [-1, input_features]) # apply relu on matmul of pool2 and w + b fc = tf.nn.relu(tf.matmul(pool2, w) + b, name='relu') # TO DO # apply dropout fc = tf.nn.dropout(fc, dropout, name='relu_dropout') with tf.variable_scope('softmax_linear') as scope: # this you should know. 
get logits without softmax # you need to create weights and biases # TO DO # Step 6: define loss function # use softmax cross entropy with logits as the loss function # compute mean cross entropy, softmax is applied internally with tf.name_scope('loss'): # you should know how to do this too # TO DO # Step 7: define training op # using gradient descent with learning rate of LEARNING_RATE to minimize cost # don't forget to pass in global_step # TO DO with tf.Session() as sess: sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() # to visualize using TensorBoard writer = tf.summary.FileWriter('./my_graph/mnist', sess.graph) ##### You have to create folders to store checkpoints ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint')) # if that checkpoint exists, restore from checkpoint if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) initial_step = global_step.eval() start_time = time.time() n_batches = int(mnist.train.num_examples / BATCH_SIZE) total_loss = 0.0 for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE) _, loss_batch = sess.run([optimizer, loss], feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) total_loss += loss_batch if (index + 1) % SKIP_STEP == 0: print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP)) total_loss = 0.0 saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index) print("Optimization Finished!") # should be around 0.35 after 25 epochs print("Total time: {0} seconds".format(time.time() - start_time)) # test the model n_batches = int(mnist.test.num_examples/BATCH_SIZE) total_correct_preds = 0 for i in range(n_batches): X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE) # evaluation only: do not run the optimizer on test data, and disable dropout by keeping every unit loss_batch, logits_batch = sess.run([loss, logits], feed_dict={X: X_batch, Y:Y_batch, dropout: 1.0}) preds = tf.nn.softmax(logits_batch) 
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1)) accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) total_correct_preds += sess.run(accuracy) print("Accuracy {0}".format(total_correct_preds/mnist.test.num_examples)) ================================================ FILE: 2017/examples/09_queue_example.py ================================================ """ Example to demonstrate how to use queues Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import numpy as np import tensorflow as tf N_SAMPLES = 1000 NUM_THREADS = 4 # Generating some simple data # create 1000 random samples, each a 1D array of 4 values drawn from a normal distribution # (as written, the expression below gives mean 1 and standard deviation 10) data = 10 * np.random.randn(N_SAMPLES, 4) + 1 # create 1000 random labels of 0 and 1 target = np.random.randint(0, 2, size=N_SAMPLES) queue = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32], shapes=[[4], []]) enqueue_op = queue.enqueue_many([data, target]) data_sample, label_sample = queue.dequeue() # create ops that do something with data_sample and label_sample # create NUM_THREADS threads to do enqueue qr = tf.train.QueueRunner(queue, [enqueue_op] * NUM_THREADS) with tf.Session() as sess: # create a coordinator, launch the queue runner threads. coord = tf.train.Coordinator() enqueue_threads = qr.create_threads(sess, coord=coord, start=True) try: for step in range(100): # do 100 iterations if coord.should_stop(): break data_batch, label_batch = sess.run([data_sample, label_sample]) print(data_batch) print(label_batch) except Exception as e: coord.request_stop(e) finally: coord.request_stop() coord.join(enqueue_threads) ================================================ FILE: 2017/examples/09_tfrecord_example.py ================================================ """ Examples to demonstrate how to write an image file to a TFRecord, and how to read a TFRecord file using TFRecordReader.
Author: Chip Huyen Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research" cs20si.stanford.edu """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import sys sys.path.append('..') from PIL import Image import numpy as np import matplotlib.pyplot as plt import tensorflow as tf # image supposed to have shape: 480 x 640 x 3 = 921600 IMAGE_PATH = 'data/' def _int64_feature(value): return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) def _bytes_feature(value): return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) def get_image_binary(filename): """ You can read in the image using tensorflow too, but it's a drag since you have to create graphs. It's much easier using Pillow and NumPy """ image = Image.open(filename) image = np.asarray(image, np.uint8) shape = np.array(image.shape, np.int32) return shape.tobytes(), image.tobytes() # convert image to raw data bytes in the array. def write_to_tfrecord(label, shape, binary_image, tfrecord_file): """ This example is to write a sample to TFRecord file. If you want to write more samples, just use a loop. """ writer = tf.python_io.TFRecordWriter(tfrecord_file) # write label, shape, and image content to the TFRecord file example = tf.train.Example(features=tf.train.Features(feature={ 'label': _int64_feature(label), 'shape': _bytes_feature(shape), 'image': _bytes_feature(binary_image) })) writer.write(example.SerializeToString()) writer.close() def write_tfrecord(label, image_file, tfrecord_file): shape, binary_image = get_image_binary(image_file) write_to_tfrecord(label, shape, binary_image, tfrecord_file) def read_from_tfrecord(filenames): tfrecord_file_queue = tf.train.string_input_producer(filenames, name='queue') reader = tf.TFRecordReader() _, tfrecord_serialized = reader.read(tfrecord_file_queue) # label and image are stored as bytes but could be stored as # int64 or float64 values in a serialized tf.Example protobuf. 
tfrecord_features = tf.parse_single_example(tfrecord_serialized, features={ 'label': tf.FixedLenFeature([], tf.int64), 'shape': tf.FixedLenFeature([], tf.string), 'image': tf.FixedLenFeature([], tf.string), }, name='features') # image was saved as uint8, so we have to decode as uint8. image = tf.decode_raw(tfrecord_features['image'], tf.uint8) shape = tf.decode_raw(tfrecord_features['shape'], tf.int32) # the image tensor is flattened out, so we have to reconstruct the shape image = tf.reshape(image, shape) label = tfrecord_features['label'] return label, shape, image def read_tfrecord(tfrecord_file): label, shape, image = read_from_tfrecord([tfrecord_file]) with tf.Session() as sess: coord = tf.train.Coordinator() threads = tf.train.start_queue_runners(coord=coord) label, image, shape = sess.run([label, image, shape]) coord.request_stop() coord.join(threads) print(label) print(shape) plt.imshow(image) plt.show() def main(): # assume the image has the label Chihuahua, which corresponds to class number 1 label = 1 image_file = IMAGE_PATH + 'friday.jpg' tfrecord_file = IMAGE_PATH + 'friday.tfrecord' write_tfrecord(label, image_file, tfrecord_file) read_tfrecord(tfrecord_file) if __name__ == '__main__': main() ================================================ FILE: 2017/examples/11_char_rnn_gist.py ================================================ """ A clean, no_frills character-level generative language model. 
Created by Danijar Hafner (danijar.com), edited by Chip Huyen for the class CS 20SI: "TensorFlow for Deep Learning Research" Based on Andrej Karpathy's blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import sys sys.path.append('..') import time import tensorflow as tf import utils DATA_PATH = 'data/arvix_abstracts.txt' HIDDEN_SIZE = 200 BATCH_SIZE = 64 NUM_STEPS = 50 SKIP_STEP = 40 TEMPRATURE = 0.7 LR = 0.003 LEN_GENERATED = 300 def vocab_encode(text, vocab): return [vocab.index(x) + 1 for x in text if x in vocab] def vocab_decode(array, vocab): return ''.join([vocab[x - 1] for x in array]) def read_data(filename, vocab, window=NUM_STEPS, overlap=NUM_STEPS//2): for text in open(filename): text = vocab_encode(text, vocab) for start in range(0, len(text) - window, overlap): chunk = text[start: start + window] chunk += [0] * (window - len(chunk)) yield chunk def read_batch(stream, batch_size=BATCH_SIZE): batch = [] for element in stream: batch.append(element) if len(batch) == batch_size: yield batch batch = [] yield batch def create_rnn(seq, hidden_size=HIDDEN_SIZE): cell = tf.contrib.rnn.GRUCell(hidden_size) in_state = tf.placeholder_with_default( cell.zero_state(tf.shape(seq)[0], tf.float32), [None, hidden_size]) # this line to calculate the real length of seq # all seq are padded to be of the same length which is NUM_STEPS length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1) output, out_state = tf.nn.dynamic_rnn(cell, seq, length, in_state) return output, in_state, out_state def create_model(seq, temp, vocab, hidden=HIDDEN_SIZE): seq = tf.one_hot(seq, len(vocab)) output, in_state, out_state = create_rnn(seq, hidden) # fully_connected is syntactic sugar for tf.matmul(w, output) + b # it will create w and b for us logits = tf.contrib.layers.fully_connected(output, len(vocab), None) loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=logits[:, :-1], labels=seq[:, 1:])) # sample the 
    # next character from Maxwell-Boltzmann Distribution with temperature temp
    # it works equally well without tf.exp
    sample = tf.multinomial(tf.exp(logits[:, -1] / temp), 1)[:, 0]
    return loss, sample, in_state, out_state

def training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state):
    saver = tf.train.Saver()
    start = time.time()
    with tf.Session() as sess:
        writer = tf.summary.FileWriter('graphs/gist', sess.graph)
        sess.run(tf.global_variables_initializer())

        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/arvix/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)

        iteration = global_step.eval()
        for batch in read_batch(read_data(DATA_PATH, vocab)):
            batch_loss, _ = sess.run([loss, optimizer], {seq: batch})
            if (iteration + 1) % SKIP_STEP == 0:
                print('Iter {}. \n Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start))
                online_inference(sess, vocab, seq, sample, temp, in_state, out_state)
                start = time.time()
                saver.save(sess, 'checkpoints/arvix/char-rnn', iteration)
            iteration += 1

def online_inference(sess, vocab, seq, sample, temp, in_state, out_state, seed='T'):
    """ Generate sequence one character at a time, based on the previous character
    """
    sentence = seed
    state = None
    for _ in range(LEN_GENERATED):
        batch = [vocab_encode(sentence[-1], vocab)]
        feed = {seq: batch, temp: TEMPRATURE}
        # for the first decoder step, the state is None
        if state is not None:
            feed.update({in_state: state})
        index, state = sess.run([sample, out_state], feed)
        sentence += vocab_decode(index, vocab)
    print(sentence)

def main():
    vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}")
    seq = tf.placeholder(tf.int32, [None, None])
    temp = tf.placeholder(tf.float32)
    loss, sample, in_state, out_state = create_model(seq, temp, vocab)
    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
    optimizer = tf.train.AdamOptimizer(LR).minimize(loss,
                                                    global_step=global_step)
    utils.make_dir('checkpoints')
    utils.make_dir('checkpoints/arvix')
    training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state)

if __name__ == '__main__':
    main()

================================================
FILE: 2017/examples/autoencoder/autoencoder.py
================================================
import tensorflow as tf

from layers import *

def encoder(input):
    # Create a conv network with 3 conv layers and 1 FC layer
    # Conv 1: filter: [3, 3, 1], stride: [2, 2], relu
    # Conv 2: filter: [3, 3, 8], stride: [2, 2], relu
    # Conv 3: filter: [3, 3, 8], stride: [2, 2], relu
    # FC: output_dim: 100, no non-linearity
    raise NotImplementedError

def decoder(input):
    # Create a deconv network with 1 FC layer and 3 deconv layers
    # FC: output dim: 128, relu
    # Reshape to [batch_size, 4, 4, 8]
    # Deconv 1: filter: [3, 3, 8], stride: [2, 2], relu
    # Deconv 2: filter: [8, 8, 1], stride: [2, 2], padding: valid, relu
    # Deconv 3: filter: [7, 7, 1], stride: [1, 1], padding: valid, sigmoid
    raise NotImplementedError

def autoencoder(input_shape):
    # Define placeholder with input shape

    # Define variable scope for autoencoder
    with tf.variable_scope('autoencoder') as scope:
        # Pass input to encoder to obtain encoding

        # Pass encoding into decoder to obtain reconstructed image

        # Return input image (placeholder) and reconstructed image
        pass

================================================
FILE: 2017/examples/autoencoder/layer_utils.py
================================================
import tensorflow as tf

def get_deconv2d_output_dims(input_dims, filter_dims, stride_dims, padding):
    # Returns the height and width of the output of a deconvolution layer.
    batch_size, input_h, input_w, num_channels_in = input_dims
    filter_h, filter_w, num_channels_out = filter_dims
    stride_h, stride_w = stride_dims

    # Compute the height in the output, based on the padding.
    if padding == 'SAME':
        out_h = input_h * stride_h
    elif padding == 'VALID':
        out_h = (input_h - 1) * stride_h + filter_h

    # Compute the width in the output, based on the padding.
    if padding == 'SAME':
        out_w = input_w * stride_w
    elif padding == 'VALID':
        out_w = (input_w - 1) * stride_w + filter_w

    return [batch_size, out_h, out_w, num_channels_out]

================================================
FILE: 2017/examples/autoencoder/layers.py
================================================
import tensorflow as tf

from layer_utils import get_deconv2d_output_dims

def conv(input, name, filter_dims, stride_dims, padding='SAME',
         non_linear_fn=tf.nn.relu):
    input_dims = input.get_shape().as_list()
    assert(len(input_dims) == 4) # batch_size, height, width, num_channels_in
    assert(len(filter_dims) == 3) # height, width and num_channels out
    assert(len(stride_dims) == 2) # stride height and width

    num_channels_in = input_dims[-1]
    filter_h, filter_w, num_channels_out = filter_dims
    stride_h, stride_w = stride_dims

    # Define a variable scope for the conv layer
    with tf.variable_scope(name) as scope:
        # Create filter weight variable

        # Create bias variable

        # Define the convolution flow graph

        # Add bias to conv output

        # Apply non-linearity (if asked) and return output
        pass

def deconv(input, name, filter_dims, stride_dims, padding='SAME',
           non_linear_fn=tf.nn.relu):
    input_dims = input.get_shape().as_list()
    assert(len(input_dims) == 4) # batch_size, height, width, num_channels_in
    assert(len(filter_dims) == 3) # height, width and num_channels out
    assert(len(stride_dims) == 2) # stride height and width

    num_channels_in = input_dims[-1]
    filter_h, filter_w, num_channels_out = filter_dims
    stride_h, stride_w = stride_dims
    # Let's step into this function
    output_dims = get_deconv2d_output_dims(input_dims,
                                           filter_dims,
                                           stride_dims,
                                           padding)

    # Define a variable scope for the deconv layer
    with tf.variable_scope(name) as scope:
        # Create filter weight variable
        # Note that num_channels_out and in positions are
        # flipped for deconv.

        # Create bias variable

        # Define the deconv flow graph

        # Add bias to deconv output

        # Apply non-linearity (if asked) and return output
        pass

def max_pool(input, name, filter_dims, stride_dims, padding='SAME'):
    assert(len(filter_dims) == 2) # filter height and width
    assert(len(stride_dims) == 2) # stride height and width

    filter_h, filter_w = filter_dims
    stride_h, stride_w = stride_dims

    # Define the max pool flow graph and return output
    pass

def fc(input, name, out_dim, non_linear_fn=tf.nn.relu):
    assert(type(out_dim) == int)

    # Define a variable scope for the FC layer
    with tf.variable_scope(name) as scope:
        input_dims = input.get_shape().as_list()
        # the input to the fc layer should be flattened
        if len(input_dims) == 4:
            # e.g. the output of a conv layer
            batch_size, input_h, input_w, num_channels = input_dims
            # ignore the batch dimension
            in_dim = input_h * input_w * num_channels
            flat_input = tf.reshape(input, [batch_size, in_dim])
        else:
            in_dim = input_dims[-1]
            flat_input = input

        # Create weight variable

        # Create bias variable

        # Define FC flow graph

        # Apply non-linearity (if asked) and return output
        pass

================================================
FILE: 2017/examples/autoencoder/train.py
================================================
import tensorflow as tf

from utils import *
from autoencoder import *

batch_size = 100
batch_shape = (batch_size, 28, 28, 1)
num_visualize = 10

lr = 0.01
num_epochs = 50

def calculate_loss(original, reconstructed):
    # tf.sub was removed in TensorFlow 1.0; tf.subtract is its replacement
    return tf.div(tf.reduce_sum(tf.square(tf.subtract(reconstructed,
                                                      original))),
                  tf.constant(float(batch_size)))

def train(dataset):
    input_image, reconstructed_image = autoencoder(batch_shape)
    loss = calculate_loss(input_image, reconstructed_image)
    optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss)

    init = tf.global_variables_initializer()
    with tf.Session() as session:
        session.run(init)

        dataset_size = len(dataset.train.images)
        print("Dataset size: {}".format(dataset_size))
        num_iters = (num_epochs *
                     dataset_size) // batch_size
        print("Num iters: {}".format(num_iters))
        for step in range(num_iters):
            input_batch = get_next_batch(dataset.train, batch_size)
            loss_val, _ = session.run([loss, optimizer],
                                      feed_dict={input_image: input_batch})
            if step % 1000 == 0:
                print("Loss at step {} : {}".format(step, loss_val))

        test_batch = get_next_batch(dataset.test, batch_size)
        reconstruction = session.run(reconstructed_image,
                                     feed_dict={input_image: test_batch})
        visualize(test_batch, reconstruction, num_visualize)

if __name__ == '__main__':
    dataset = load_dataset()
    train(dataset)

================================================
FILE: 2017/examples/autoencoder/utils.py
================================================
import os
import sys

import tensorflow
import numpy as np
import matplotlib
matplotlib.use('TKAgg')
from matplotlib import pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data

mnist_image_shape = [28, 28, 1]

def load_dataset():
    return input_data.read_data_sets('MNIST_data')

def get_next_batch(dataset, batch_size):
    # dataset should be mnist.(train/val/test)
    batch, _ = dataset.next_batch(batch_size)
    batch_shape = [batch_size] + mnist_image_shape
    return np.reshape(batch, batch_shape)

def visualize(_original, _reconstructions, num_visualize):
    vis_folder = './vis/'
    if not os.path.exists(vis_folder):
        os.makedirs(vis_folder)

    original = _original[:num_visualize]
    reconstructions = _reconstructions[:num_visualize]

    count = 1
    for (orig, rec) in zip(original, reconstructions):
        orig = np.reshape(orig, (mnist_image_shape[0],
                                 mnist_image_shape[1]))
        rec = np.reshape(rec, (mnist_image_shape[0],
                               mnist_image_shape[1]))
        f, ax = plt.subplots(1,2)
        ax[0].imshow(orig, cmap='gray')
        ax[1].imshow(rec, cmap='gray')
        plt.savefig(vis_folder + "test_%d.png" % count)
        count += 1

================================================
FILE: 2017/examples/cgru/README.md
================================================
These are the files used to explain convolutional GRU (CGRU) by Lukasz Kaiser at Google Brain.
The accompanying slides can be found at http://web.stanford.edu/class/cs20si/lectures/slides_12.pdf

================================================
FILE: 2017/examples/cgru/custom_getter.py
================================================
# From [github]/tensorflow/python/kernel_tests/variable_scope_test.py
def testGetterThatCreatesTwoVariablesAndSumsThem(self):

  def custom_getter(getter, name, *args, **kwargs):
    g_0 = getter("%s/0" % name, *args, **kwargs)
    g_1 = getter("%s/1" % name, *args, **kwargs)
    with tf.name_scope("custom_getter"):
      return g_0 + g_1  # or g_0 * const / ||g_0|| or anything you want

  with variable_scope.variable_scope("scope", custom_getter=custom_getter):
    v = variable_scope.get_variable("v", [1, 2, 3])
    # Or a full model if you wish. OO layers are ok.

  self.assertEqual([1, 2, 3], v.get_shape())
  true_vars = variables_lib.trainable_variables()
  self.assertEqual(2, len(true_vars))
  self.assertEqual("scope/v/0:0", true_vars[0].name)
  self.assertEqual("scope/v/1:0", true_vars[1].name)
  self.assertEqual("custom_getter/add:0", v.name)

  with self.test_session() as sess:
    variables_lib.global_variables_initializer().run()
    np_vars, np_v = sess.run([true_vars, v])
    self.assertAllClose(np_v, sum(np_vars))

================================================
FILE: 2017/examples/cgru/data_reader.py
================================================
def examples_queue(data_sources, data_fields_to_features, training,
                   data_items_to_decoders=None, data_items_to_decode=None):
  """Construct a queue of training or evaluation examples.

  This function will create a reader from files given by data_sources,
  then enqueue the tf.Examples from these files, shuffling if training
  is true, and finally parse these tf.Examples to tensors.
  The dictionary data_fields_to_features for an image dataset can be this:

  data_fields_to_features = {
    'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
    'image/format': tf.FixedLenFeature((), tf.string, default_value='raw'),
    'image/class/label': tf.FixedLenFeature(
       [1], tf.int64, default_value=tf.zeros([1], dtype=tf.int64)),
  }

  and for a simple algorithmic dataset with variable-length data it is this:

  data_fields_to_features = {
    'inputs': tf.VarLenFeature(tf.int64),
    'targets': tf.VarLenFeature(tf.int64),
  }

  The data_items_to_decoders dictionary argument can be left as None if there
  is no decoding to be performed. But, e.g. for images, it should be set so that
  the images are decoded from the features, e.g., like this for MNIST:

  data_items_to_decoders = {
    'image': tfexample_decoder.Image(
      image_key = 'image/encoded',
      format_key = 'image/format',
      shape=[28, 28],
      channels=1),
    'label': tfexample_decoder.Tensor('image/class/label'),
  }

  These arguments are compatible with the use of tf.contrib.slim.data module,
  see there for more documentation.

  Args:
    data_sources: a list or tuple of sources from which the data will be read,
      for example [/path/to/train@128, /path/to/train2*, /tmp/.../train3*]
    data_fields_to_features: a dictionary from data fields in the data sources
      to features, such as tf.VarLenFeature(tf.int64), see above for examples.
    training: a Boolean, whether to read for training or evaluation.
    data_items_to_decoders: a dictionary mapping data items (that will be
      in the returned result) to decoders that will decode them using features
      defined in data_fields_to_features; see above for examples. By default
      (if this is None), we grab the tensor from every feature.
    data_items_to_decode: a subset of data items that will be decoded;
      by default (if this is None), we decode all items.

  Returns:
    A dictionary mapping each data_field to a corresponding 1D int64 tensor
    read from the created queue.
  Raises:
    ValueError: if no files are found with the provided data_prefix or no data
      fields were provided.
  """
  with tf.name_scope("examples_queue"):
    # Read serialized examples using slim parallel_reader.
    _, example_serialized = tf.contrib.slim.parallel_reader.parallel_read(
        data_sources, tf.TFRecordReader, shuffle=training,
        num_readers=4 if training else 1)

    if data_items_to_decoders is None:
      data_items_to_decoders = {
          field: tf.contrib.slim.tfexample_decoder.Tensor(field)
          for field in data_fields_to_features
      }

    decoder = tf.contrib.slim.tfexample_decoder.TFExampleDecoder(
        data_fields_to_features, data_items_to_decoders)

    if data_items_to_decode is None:
      data_items_to_decode = data_items_to_decoders.keys()

    decoded = decoder.decode(example_serialized, items=data_items_to_decode)
    return {field: tensor
            for (field, tensor) in zip(data_items_to_decode, decoded)}


def batch_examples(examples, batch_size, bucket_boundaries=None):
  """Given a queue of examples, create batches of examples with similar lengths.

  We assume that examples is a dictionary with string keys and tensor values,
  possibly coming from a queue, e.g., constructed by examples_queue above.
  Each tensor in examples is assumed to be 1D. We will put tensors of similar
  length into batches together. We return a dictionary with the same keys as
  examples, and with values being batches of size batch_size. If elements have
  different lengths, they are padded with 0s. This function is based on
  tf.contrib.training.bucket_by_sequence_length so see there for details.

  For example, if examples is a queue containing [1, 2, 3] and [4], then
  this function with batch_size=2 will return a batch [[1, 2, 3], [4, 0, 0]].

  Args:
    examples: a dictionary with string keys and 1D tensor values.
    batch_size: a python integer or a scalar int32 tensor.
    bucket_boundaries: a list of integers for the boundaries that will be
      used for bucketing; see tf.contrib.training.bucket_by_sequence_length
      for more details; if None, we create a default set of buckets.

  Returns:
    A dictionary with the same keys as examples and with values being batches
    of examples padded with 0s, i.e., [batch_size x length] tensors.
  """
  # Create default buckets if none were provided.
  if bucket_boundaries is None:
    # Small buckets -- go in steps of 8 until 64.
    small_buckets = [8 * (i + 1) for i in xrange(8)]
    # Medium buckets -- go in steps of 32 until 256.
    medium_buckets = [32 * (i + 3) for i in xrange(6)]
    # Large buckets -- go in steps of 128 until maximum of 1024.
    large_buckets = [128 * (i + 3) for i in xrange(6)]
    # By default use the above 20 bucket boundaries (21 queues in total).
    bucket_boundaries = small_buckets + medium_buckets + large_buckets
  with tf.name_scope("batch_examples"):
    # The queue to bucket on will be chosen based on maximum length.
    max_length = 0
    for v in examples.values():
      # We assume 0-th dimension is the length.
      max_length = tf.maximum(max_length, tf.shape(v)[0])
    (_, outputs) = tf.contrib.training.bucket_by_sequence_length(
        max_length, examples, batch_size, bucket_boundaries,
        capacity=2 * batch_size, dynamic_pad=True)
    return outputs

================================================
FILE: 2017/examples/cgru/my_layers.py
================================================
def saturating_sigmoid(x):
  """Saturating sigmoid: 1.2 * sigmoid(x) - 0.1 cut to [0, 1]."""
  with tf.name_scope("saturating_sigmoid", [x]):
    y = tf.sigmoid(x)
    return tf.minimum(1.0, tf.maximum(0.0, 1.2 * y - 0.1))


def embedding(x, vocab_size, dense_size, name=None, reuse=None):
  """Embed x of type int64 into dense vectors, reducing to max 4 dimensions."""
  with tf.variable_scope(name, default_name="embedding",
                         values=[x], reuse=reuse):
    embedding_var = tf.get_variable("kernel", [vocab_size, dense_size])
    return tf.gather(embedding_var, x)


def conv_gru(x, kernel_size, filters, padding="same", dilation_rate=1,
             name=None, reuse=None):
  """Convolutional GRU in 1 dimension."""

  # Let's make a shorthand for conv call first.
  def do_conv(args, name, bias_start, padding):
    return tf.layers.conv1d(args, filters, kernel_size,
                            padding=padding, dilation_rate=dilation_rate,
                            bias_initializer=tf.constant_initializer(bias_start),
                            name=name)

  # Here comes the GRU gate.
  with tf.variable_scope(name, default_name="conv_gru",
                         values=[x], reuse=reuse):
    reset = saturating_sigmoid(do_conv(x, "reset", 1.0, padding))
    gate = saturating_sigmoid(do_conv(x, "gate", 1.0, padding))
    candidate = tf.tanh(do_conv(reset * x, "candidate", 0.0, padding))
    return gate * x + (1 - gate) * candidate

================================================
FILE: 2017/examples/cgru/neural_gpu_v3.py
================================================
def neural_gpu(features, hparams, name=None):
  """The core Neural GPU."""
  with tf.variable_scope(name, "neural_gpu"):
    inputs = features["inputs"]
    emb_inputs = common_layers.embedding(
        inputs, hparams.vocab_size, hparams.hidden_size)

    def step(state, inp):
      x = tf.nn.dropout(state, 1.0 - hparams.dropout)
      for layer in xrange(hparams.num_hidden_layers):
        x = common_layers.conv_gru(
            x, hparams.kernel_size, hparams.hidden_size, name="cgru_%d" % layer)
      return tf.where(inp == 0, state, x)  # No-op where inp is just padding=0.

    final = tf.foldl(step, tf.transpose(inputs, [1, 0]),
                     initializer=emb_inputs,
                     parallel_iterations=1, swap_memory=True)
    return common_layers.conv(final, hparams.vocab_size, 3, padding="same")


def mixed_curriculum(inputs, hparams):
  """Mixed curriculum: skip short sequences, but only with some probability."""
  with tf.name_scope("mixed_curriculum"):
    inputs_length = tf.to_float(tf.shape(inputs)[1])
    used_length = tf.cond(tf.less(tf.random_uniform([]),
                                  hparams.curriculum_mixing_probability),
                          lambda: tf.constant(0.0),
                          lambda: inputs_length)
    step = tf.to_float(tf.contrib.framework.get_global_step())
    relative_step = step / hparams.curriculum_lengths_per_step
    return used_length - hparams.curriculum_min_length > relative_step


def neural_gpu_curriculum(features, hparams, mode):
  """The Neural GPU model with curriculum."""
  with tf.name_scope("neural_gpu_with_curriculum"):
    inputs = features["inputs"]
    is_training = mode == tf.contrib.learn.ModeKeys.TRAIN
    should_skip = tf.logical_and(is_training, mixed_curriculum(inputs, hparams))
    final_shape = tf.concat([tf.shape(inputs),
                             tf.constant([hparams.vocab_size])], axis=0)
    outputs = tf.cond(should_skip,
                      lambda: tf.zeros(final_shape),
                      lambda: neural_gpu(features, hparams))
    return outputs, should_skip


def basic_params1():
  """A set of basic hyperparameters."""
  return tf.HParams(batch_size=32,
                    num_hidden_layers=4,
                    kernel_size=3,
                    hidden_size=64,
                    vocab_size=256,
                    dropout=0.2,
                    clip_grad_norm=2.0,
                    initializer="orthogonal",
                    initializer_gain=1.5,
                    label_smoothing=0.1,
                    optimizer="Adam",
                    optimizer_adam_epsilon=1e-4,
                    optimizer_momentum_momentum=0.9,
                    max_train_length=512,
                    learning_rate_decay_scheme="none",
                    learning_rate_warmup_steps=100,
                    learning_rate=0.1)


def curriculum_params1():
  """Set of hyperparameters with curriculum settings."""
  hparams = common_hparams.basic_params1()
  hparams.add_hparam("curriculum_mixing_probability", 0.1)
  hparams.add_hparam("curriculum_lengths_per_step", 1000.0)
  hparams.add_hparam("curriculum_min_length", 10)
  return hparams

================================================
FILE: 2017/examples/data/arvix_abstracts.txt
================================================
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing.
Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. 
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. 
First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. 
We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second or- der optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. 
In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. 
A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. 
In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimization with non-convex loss functions and on training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR).
We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe that the rounding scheme plays a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network.
This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions.
In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. 
We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). 
Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge.
This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators.
The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure.
This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem.
However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not fit well with parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
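As a rough illustration of the label and score-function moment idea in the abstract directly above, the sketch below works under assumptions that are not stated in that abstract: a standard Gaussian input (whose first-order score function is simply $S(x) = x$), a single tanh hidden unit, and a plain Monte Carlo estimate. By Stein's identity, $E[y \cdot x] = E[\tanh'(w \cdot x)] \, w$, so the empirical moment should point in the direction of the weight vector. The names `estimate_moment` and `cosine` are hypothetical helpers for this example and are not taken from any paper's code.

```python
import math
import random


def estimate_moment(w, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[y * S(x)] for y = tanh(w . x).

    Assumption: x ~ N(0, I), for which the first-order score function
    S(x) = -grad log p(x) equals x itself.
    """
    rng = random.Random(seed)
    d = len(w)
    moment = [0.0] * d
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(d):
            moment[i] += y * x[i] / n_samples
    return moment


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


# Hypothetical ground-truth weight vector for the single hidden unit.
w_true = [0.6, -0.8, 0.0]
m = estimate_moment(w_true)
print("estimated moment:", m)
print("cosine with true weights:", cosine(m, w_true))
```

With more than one hidden unit the same moment becomes a weighted combination of the first-layer weight vectors, which is where a factorization step of the kind the abstract mentions would be needed; that step is not shown here.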
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks.
Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits.
The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation.
The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. 
The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent.
With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. 
The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. 
These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. 
One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
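The dropout-based gradient descent for convex ERMs mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a least-squares linear model, dropout applied to input coordinates with inverted rescaling, and plain per-example SGD; the function name `dropout_gd` and all hyperparameters are hypothetical and are not taken from the paper.

```python
import random


def dropout_gd(xs, ys, keep_prob=0.8, lr=0.05, epochs=200, seed=0):
    """Per-example SGD on squared loss with dropout on the input coordinates.

    Assumption: inverted dropout (kept coordinates are scaled by 1/keep_prob),
    so the dropped input equals the original input in expectation.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Keep each coordinate with probability keep_prob, else zero it.
            mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0
                    for _ in range(dim)]
            x_drop = [m * xi for m, xi in zip(mask, x)]
            # Gradient of 0.5 * (w . x_drop - y)^2 with respect to w.
            err = sum(wi * xi for wi, xi in zip(w, x_drop)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x_drop)]
    return w
```

With `keep_prob=1.0` the mask is all ones and the routine reduces to ordinary SGD; with `keep_prob < 1` the random masking adds a data-dependent penalty on the weights in expectation, which is one way to read the "stabilizer" role described above.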
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
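As a rough illustration of the multilayer bootstrap network layer described in the abstract above, the sketch below builds a single layer: several independent k-centers clusterings whose centers are data points sampled at random. Encoding each clustering's result as a one-hot indicator of the nearest center, and concatenating those indicators, is an assumption made here for illustration; the name `mbn_layer` and its parameters are hypothetical.

```python
import random


def mbn_layer(data, num_clusterings=3, k=2, seed=0):
    """Encode each point with one layer of independent k-centers clusterings.

    Centers are sampled from the data without any optimization. Assumption:
    each clustering contributes a one-hot code of the nearest center.
    """
    rng = random.Random(seed)
    # One independent set of k randomly sampled centers per clustering.
    clusterings = [rng.sample(data, k) for _ in range(num_clusterings)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    encoded = []
    for x in data:
        code = []
        for centers in clusterings:
            dists = [sq_dist(x, c) for c in centers]
            nearest = dists.index(min(dists))
            code.extend(1 if j == nearest else 0 for j in range(k))
        encoded.append(code)
    return encoded
```

Stacking would feed the returned codes into another call of the same function; how many clusterings and centers to use per layer is left open here.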
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
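To make the input-to-hidden variant from the abstract above concrete, here is a minimal sketch of one recurrent step over a mini-batch in which only the input term is standardized with mini-batch statistics. It assumes a plain tanh RNN and omits the learned scale and shift parameters that batch normalization usually carries; the names `batch_norm`, `rnn_step`, `W_x`, `W_h` and `b` are illustrative rather than taken from the paper.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(dim)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
           for j in range(dim)]
    return [[(row[j] - mean[j]) / math.sqrt(var[j] + eps)
             for j in range(dim)] for row in batch]


def rnn_step(x_batch, h_batch, W_x, W_h, b):
    """Compute h_t = tanh(BN(W_x x_t) + W_h h_{t-1} + b) for a mini-batch.

    Only the input-to-hidden term is normalized; the hidden-to-hidden
    term is left untouched.
    """
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(i + r + bj) for i, r, bj in zip(ins, rec, b)]
            for ins, rec in zip(in_term, rec_term)]
```

Normalizing the hidden-to-hidden term instead would mean wrapping `rec_term` in `batch_norm`, which is the variant the abstract reports as unhelpful.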
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
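The stochastic rounding mentioned in the limited-precision abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `stochastic_round`, the default of 8 fractional bits, and the lack of any overflow or saturation handling are all choices made for the example. A value is rounded up with probability equal to its fractional distance from the grid point below it, so the rounded value equals the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, unbiased in expectation.

    Hypothetical helper for illustration; no overflow/saturation handling.
    """
    scale = 1 << frac_bits        # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)    # grid point at or below x
    p_up = scaled - lower         # fractional part, in [0, 1)
    # Round up with probability equal to the fractional part.
    if rng.random() < p_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=4, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.3125; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.3125 every time on this 4-bit grid, which introduces a systematic bias; the randomized version trades that bias for variance.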
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of 
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. We then evaluate the performance of the algorithm experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
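The "randomly dropping a few nodes" step in the dropout abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `dropout`, the `drop_prob` parameter, and the 1/(1 - drop_prob) rescaling of surviving units (the common "inverted dropout" convention) are assumptions made here for illustration only.

```python
import random


def dropout(activations, drop_prob=0.5, rng=random):
    """Zero each activation independently with probability drop_prob.

    Surviving activations are divided by (1 - drop_prob) so that the
    expected value of every unit is unchanged. This rescaling is an
    assumed convention, not something stated in the abstract.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # A unit is kept when the uniform draw is at least drop_prob.
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]


if __name__ == "__main__":
    hidden = [0.2, -1.3, 0.7, 0.0, 2.1]
    print(dropout(hidden, drop_prob=0.4, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the mask reproducible; with `drop_prob=0.0` the input is returned unchanged.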
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
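The stochastic rounding mentioned in the limited-precision abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `stochastic_round`, the split into `frac_bits` fractional bits within a signed `word_bits`-bit word, and the saturation behaviour on overflow are all hypothetical choices made here.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with frac_bits fractional bits.

    x is rounded down or up to the nearest multiple of 2**-frac_bits; the
    probability of rounding up equals the fractional remainder, so the
    rounding is unbiased in expectation. Values outside the representable
    range are saturated (an assumption of this sketch).
    """
    eps = 2.0 ** -frac_bits              # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    p_up = scaled - lower                # distance to the lower grid point
    steps = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the signed word_bits-bit integer range.
    max_steps = 2 ** (word_bits - 1) - 1
    min_steps = -(2 ** (word_bits - 1))
    steps = max(min_steps, min(max_steps, steps))
    return steps * eps


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, rng=rng) for _ in range(10000)]
    # The sample mean should be close to 0.3, unlike round-to-nearest.
    print(sum(samples) / len(samples))
```

Round-to-nearest would always map 0.3 to 77/256, whereas the stochastic scheme alternates between 76/256 and 77/256 so that small updates are preserved on average.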
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
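The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the "normalized $L_p$ norm" means (mean(|z_i|^p))^(1/p) over the projections z_i feeding the unit; the function name lp_unit and the omission of learned projections and centering terms are simplifications made here for illustration, not the authors' implementation.

```python
import numpy as np

def lp_unit(z, p):
    """Normalized L_p norm over the last axis: (mean(|z|^p))^(1/p).

    z: array of projections feeding one unit (hypothetical input).
    p: order of the norm, assumed p >= 1.
    """
    z = np.abs(np.asarray(z, dtype=float))
    return np.mean(z ** p, axis=-1) ** (1.0 / p)

z = np.array([3.0, 4.0])
avg_like = lp_unit(z, 1)    # p = 1: average of absolute values
rms_like = lp_unit(z, 2)    # p = 2: root-mean-square
max_like = lp_unit(z, 50)   # large p: approaches the maximum, here 4
print(avg_like, rms_like, max_like)
```

Under this assumed definition, varying p moves the unit between average pooling (p = 1), root-mean-square pooling (p = 2) and, in the limit of large p, max pooling.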
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
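The "simple dropout based GD method for convex ERMs" mentioned in the abstract above is not spelled out there. The following is only a minimal sketch of one common formulation, assumed here for illustration: stochastic gradient descent on a least-squares loss where each input feature is dropped independently and the survivors are rescaled. The function name `dropout_sgd_least_squares` and all of its parameters are hypothetical and are not taken from the paper.

```python
import random


def dropout_sgd_least_squares(xs, ys, keep_prob=0.8, lr=0.1, epochs=300, seed=0):
    """SGD on squared error with dropout applied to the input features.

    Assumption (not from the abstract): each feature is kept with
    probability keep_prob and rescaled by 1 / keep_prob, so the masked
    input equals the original input in expectation.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Sample a fresh dropout mask for every training example.
            mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]
            x_drop = [m * xi / keep_prob for m, xi in zip(mask, x)]
            # Gradient step on 0.5 * (w . x_drop - y) ** 2.
            err = sum(wi * xi for wi, xi in zip(w, x_drop)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x_drop)]
    return w
```

With `keep_prob=1.0` the mask is always all ones and the routine reduces to plain SGD, which gives a convenient baseline for comparing against the dropout runs.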
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
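As a rough illustration of the batch normalization step the abstract above refers to (standardizing features with mini-batch statistics), here is a small sketch in plain Python. It shows only the training-time normalization of one mini-batch; in the RNN setting described above it would be applied to the input-to-hidden term. The function name `batch_norm`, the `eps` value, and the list-of-lists data layout are assumptions made for this example.

```python
import math


def batch_norm(batch, gamma=None, beta=None, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift.

    batch is a list of examples, each a list of floats of equal length.
    gamma and beta are the per-feature scale and shift; they default to
    1 and 0, which leaves the standardized values unchanged.
    """
    n = len(batch)
    dim = len(batch[0])
    gamma = gamma if gamma is not None else [1.0] * dim
    beta = beta if beta is not None else [0.0] * dim
    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(dim)
        ]
        for row in batch
    ]
```

After normalization every feature column has mean close to 0 and variance close to 1 regardless of its original scale, which is the property the abstract credits with faster convergence.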
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
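As an illustration of the stochastic rounding idea in the abstract above, the following is a minimal sketch rather than the authors' implementation. It rounds a real value onto a fixed-point grid, rounding up with probability equal to the fractional remainder, so the result is unbiased in expectation. The function names and the `frac_bits` parameter are assumptions made for this example, and saturation to a 16-bit range is left out.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a grid of step 2**-frac_bits, rounding up at random."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so the expected value of the result equals x.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(round_to_nearest(0.3, frac_bits=2), mean)
```

With a grid step of 0.25, round-to-nearest always maps 0.3 to 0.25, whereas the stochastic version returns 0.25 or 0.5 and its average stays close to 0.3. Small updates are therefore not systematically lost.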
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
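One way to picture the first direction in the abstract above, combining max and average pooling, is a convex mix of the two. The sketch below is only an assumed form for illustration, not necessarily either of the paper's two strategies. The mixing proportion `alpha` is a hypothetical parameter that would be learned during training; here it is simply passed in.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling and alpha = 0.0 gives plain
    average pooling. In a trained network alpha would be a learned
    parameter (an assumption made for this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


region = [1.0, 2.0, 3.0, 6.0]
print(mixed_pool(region, 0.0), mixed_pool(region, 0.5), mixed_pool(region, 1.0))
```

For this region the maximum is 6.0 and the average is 3.0, so the three calls interpolate between average pooling and max pooling.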
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
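The stochastic pooling procedure described in the abstract above can be sketched as follows, as a minimal illustration under stated assumptions rather than the authors' code. Each pooling region returns one of its activations, sampled from a multinomial distribution whose probabilities are the activations normalised by their sum. The sketch assumes non-negative activations (for example after a rectifier) and returns 0.0 for an all-zero region, a convention chosen only for this example.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    The probability of picking each activation is proportional to its
    value, so activations are assumed to be non-negative.
    """
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)
region = [0.0, 1.0, 3.0]
picks = [stochastic_pool(region, rng) for _ in range(10000)]
share_of_largest = picks.count(3.0) / len(picks)
print(share_of_largest)
```

In this region the activation 3.0 has probability 3/4 of being selected and 1.0 has probability 1/4, so large activations dominate without the pooling being deterministic.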
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) 
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
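The layer structure of the multilayer bootstrap network described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name bootstrap_layer, the use of Euclidean distance, and the one-hot encoding of the nearest center are all choices made here for the sketch; the abstract itself only states that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings, k, rng):
    """One layer: several independent k-centers clusterings of the rows of X.

    Assumption (not from the abstract): each clustering encodes a point as a
    one-hot indicator of its nearest center, and the codes are concatenated.
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are k data points sampled at random, as the abstract states.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        codes.append(np.eye(k)[nearest])  # one-hot code, shape (n, k)
    return np.concatenate(codes, axis=1)  # shape (n, n_clusterings * k)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                            # toy data
H1 = bootstrap_layer(X, n_clusterings=20, k=16, rng=rng)  # first layer
H2 = bootstrap_layer(H1, n_clusterings=20, k=8, rng=rng)  # smaller k deeper
```

Stacking layers with a decreasing k is one plausible way to obtain progressively coarser representations; the schedule used here (16, then 8) is hypothetical.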
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
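As a rough illustration of the variant discussed above, the following sketch applies batch normalization to the input-to-hidden term of a plain tanh RNN while leaving the hidden-to-hidden term untouched. It is a minimal NumPy forward pass under assumptions made here (the names batch_norm and rnn_step, the layer sizes, and the use of per-step mini-batch statistics are illustrative), not the authors' code.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_xh, W_hh, b, gamma, beta):
    """One tanh RNN step with batch norm on the input-to-hidden term only."""
    input_term = batch_norm(x_t @ W_xh, gamma, beta)  # normalized
    recurrent_term = h_prev @ W_hh                    # left unnormalized
    return np.tanh(input_term + recurrent_term + b)

rng = np.random.default_rng(0)
batch, n_in, n_hid, steps = 32, 8, 16, 5   # hypothetical sizes
W_xh = rng.normal(scale=0.1, size=(n_in, n_hid))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for _ in range(steps):
    x_t = rng.normal(size=(batch, n_in))   # toy input for this time step
    h = rnn_step(x_t, h, W_xh, W_hh, b, gamma, beta)
```

At test time one would normally replace the mini-batch statistics with running averages; that bookkeeping is omitted here to keep the sketch short.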
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
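The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter, and the omission of saturation to a 16-bit word are all assumptions made for illustration. The idea shown is only that a value is rounded up with probability equal to its fractional remainder on the fixed-point grid, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its distance from the
    lower grid point, so the expected result equals x. Saturation to a fixed
    word length is deliberately left out of this sketch.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their mean should be close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest on the same grid would always map 0.3 to 0.25, which is the kind of systematic bias that stochastic rounding avoids.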
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
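The "simple dropout based GD method for convex ERMs" mentioned in the dropout abstract above can be illustrated with a minimal sketch. It assumes logistic regression as the convex model, per-example (stochastic) updates, and inverted dropout applied to the input features; the function names, toy data, and hyperparameters are hypothetical and the paper's exact algorithm may differ.

```python
import math
import random

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def dropout_sgd(data, labels, drop_prob=0.5, lr=0.1, epochs=200, seed=0):
    """Logistic regression trained with feature dropout (illustrative only)."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    keep = 1.0 - drop_prob
    for _ in range(epochs):
        for x, y in zip(data, labels):
            # Drop each feature independently and rescale the kept ones
            # so the expected input is unchanged (inverted dropout).
            x_tilde = [xi / keep if rng.random() < keep else 0.0 for xi in x]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_tilde)))
            # Gradient of the logistic loss with respect to w is (p - y) * x_tilde.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_tilde)]
    return w

# Toy, linearly separable data (an assumption for illustration).
data = [[1.0, 2.0], [2.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]]
labels = [1, 1, 0, 0]
w = dropout_sgd(data, labels)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in data]
```

Dropout is applied only during training; predictions use the full, unmasked inputs.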
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
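One layer of the multilayer bootstrap network described in the abstract above can be sketched roughly as follows: several independent k-centers clusterings, each with centers drawn at random from the data, and each contributing a one-hot code of the nearest center. The function names, the squared Euclidean distance, the toy data, and the parameter values are assumptions for illustration, not details taken from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_bootstrap_layer(data, num_clusterings, k, seed=0):
    """Build one layer: each clustering keeps k randomly sampled data points as centers."""
    rng = random.Random(seed)
    return [rng.sample(data, k) for _ in range(num_clusterings)]

def encode(point, layer):
    """Concatenate, over all clusterings, a one-hot code of the nearest center."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code

# Toy data (hypothetical): three loose groups in the plane.
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0], [9.0, 0.5], [8.8, 0.4]]
layer = fit_bootstrap_layer(data, num_clusterings=3, k=2)
codes = [encode(p, layer) for p in data]
```

Stacking such layers, with the codes of one layer serving as input to the next, would give the multilayer structure; that part is omitted here.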
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
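To make the input-to-hidden variant in the abstract above concrete, here is a minimal sketch of one recurrent step in which only the input term is standardized with mini-batch statistics while the hidden-to-hidden term is left untouched. The tanh unit, the weight values, and the omission of the learned scale and shift parameters are simplifying assumptions, not the paper's exact setup.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mini-batch mean and variance."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for row in batch]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_batch, h_batch, W_x, W_h):
    """h_t = tanh(BN(W_x x_t) + W_h h_{t-1}); only the input term is normalized."""
    input_term = batch_norm([matvec(W_x, x) for x in x_batch])
    hidden_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(i_row, h_row)]
            for i_row, h_row in zip(input_term, hidden_term)]

# Hypothetical weights and a mini-batch of three two-dimensional inputs.
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.1], [-0.1, 0.0]]
x_batch = [[1.0, 2.0], [3.0, 0.0], [-1.0, 1.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
h_next = rnn_step(x_batch, h_batch, W_x, W_h)
normalized = batch_norm([matvec(W_x, x) for x in x_batch])
```

A full implementation would add the learned scale and shift and would keep running statistics for use at test time.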
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
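The stochastic rounding mentioned above can be sketched as follows: a value is rounded to one of its two neighbouring fixed-point grid values, rounding up with probability equal to its fractional distance from the lower neighbour, so the result is unbiased in expectation. The function name `stochastic_round`, the split into `frac_bits` and `word_bits`, and the saturation behaviour are illustrative assumptions, not details taken from the abstract.

```python
import math
import random

def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid using stochastic rounding."""
    eps = 2.0 ** -frac_bits          # spacing of the fixed-point grid
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps

# With 2 fractional bits the grid spacing is 0.25, so 0.3 rounds to either
# 0.25 or 0.5; the average over many draws should stay close to 0.3.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time and lose the remainder. The stochastic variant preserves it on average, which may be why small gradient updates survive at low precision.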
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 
2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition task. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
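To make the distinction between the two placements concrete, here is a minimal sketch of a recurrent step in which only the input-to-hidden term is batch-normalized. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`batch_norm`, `affine`, `rnn_step`), the tanh nonlinearity, the absence of bias terms and the scalar `gamma`/`beta` are all choices made for this example.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dims)
        ]
        for row in batch
    ]


def affine(batch, weights):
    """Multiply each row vector in the batch by a weight matrix (list of rows)."""
    n_out = len(weights[0])
    return [
        [sum(x * w_row[k] for x, w_row in zip(row, weights)) for k in range(n_out)]
        for row in batch
    ]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One step: h_t = tanh(BN(x_t W_xh) + h_{t-1} W_hh).

    Only the input-to-hidden term is normalized; the hidden-to-hidden
    term is passed through unchanged.
    """
    input_term = batch_norm(affine(x_batch, w_xh))
    hidden_term = affine(h_batch, w_hh)
    return [
        [math.tanh(a + b) for a, b in zip(in_row, hid_row)]
        for in_row, hid_row in zip(input_term, hidden_term)
    ]
```

Moving the `batch_norm` call onto `hidden_term` instead would give the hidden-to-hidden variant that the abstract reports as unhelpful.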
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
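The stochastic rounding idea mentioned above can be illustrated with a short sketch. This is a minimal example under assumptions, not the scheme used in the paper: the function names, the split of the 16-bit word into 8 fractional bits, and the saturation behaviour on overflow are all hypothetical choices made here.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the rounding is unbiased
    in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits range.

    The 16/8 bit split and the saturation on overflow are assumptions
    of this sketch.
    """
    scale = 1 << frac_bits
    max_q = (1 << (word_bits - 1)) - 1
    min_q = -(1 << (word_bits - 1))
    q = round(stochastic_round(x, frac_bits, rng) * scale)
    q = max(min_q, min(max_q, q))
    return q / scale
```

Unlike round-to-nearest, which would map every small update below half a grid step to zero, this scheme lets such updates survive on average, which is one intuition for why the rounding scheme matters during training.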
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
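A minimal sketch of the first direction named in the pooling abstract above, combining max and average pooling. The abstract does not give the exact form, so the convex combination below is an assumption, and the names `mixed_pool_region` and `mix` are hypothetical. The mixing weight is a fixed argument here; in a learned setting it would be a trainable parameter.

```python
def mixed_pool_region(values, mix):
    """Blend max and average pooling over one pooling region.

    `mix` is a hypothetical mixing weight in [0, 1]: 1.0 gives pure max
    pooling and 0.0 gives pure average pooling. The convex combination is
    an illustrative assumption, not a form confirmed by the abstract.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    average = sum(values) / len(values)
    return mix * max(values) + (1.0 - mix) * average


region = [1.0, 2.0, 3.0, 6.0]
print(mixed_pool_region(region, 1.0))  # pure max pooling
print(mixed_pool_region(region, 0.0))  # pure average pooling
print(mixed_pool_region(region, 0.5))  # equal blend of the two
```

Setting `mix` to either endpoint recovers one of the two conventional operations, which is why a blend of this kind can be dropped in wherever max or average pooling is used.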
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
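A minimal sketch of the stochastic pooling step described in the abstract above: within each pooling region, one activation is sampled with probability proportional to its value. It assumes non-negative activations (for example, after a rectifier) and non-overlapping square regions; the function names and the all-zero fallback are hypothetical choices made for illustration.

```python
import random


def stochastic_pool_region(activations, rng):
    """Sample one activation from a pooling region.

    Each activation a_i is picked with probability a_i / sum(a).
    Assumes non-negative activations; an all-zero region returns 0.0
    (a fallback chosen here for illustration).
    """
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    if sum(activations) == 0:
        return 0.0
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_2d(feature_map, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool_region(region, rng))
        pooled.append(out_row)
    return pooled


rng = random.Random(0)  # seeded so repeated runs sample the same values
feature_map = [[0, 0, 1, 2],
               [0, 3, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 5, 0]]
print(stochastic_pool_2d(feature_map, 2, rng))
```

Regions with a single non-zero activation always return that activation, while regions with several non-zero activations vary from call to call, which is the source of the regularizing noise at training time.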
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) 
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
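As a small illustration of the "randomly dropping a few nodes" step described in the dropout abstract above, here is a minimal sketch in plain Python. It uses the common "inverted dropout" scaling, which is an assumption of this sketch and not something the abstract specifies; the function name `dropout` and the parameter `drop_prob` are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation independently with probability drop_prob.

    Surviving activations are divided by (1 - drop_prob) so that the
    expected value of every unit is unchanged ("inverted dropout", an
    assumption of this sketch).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    hidden = [1.0] * 10
    print(dropout(hidden, 0.2, rng))
```

Because of the rescaling, the mean activation over many units stays close to its value without dropout, which is why no extra correction is needed at test time under this convention.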
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
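The batch-normalized RNN abstract above distinguishes between normalizing the input-to-hidden and the hidden-to-hidden transitions. The following plain-Python sketch shows one way the input-to-hidden variant could look for a vanilla tanh RNN. It is an illustration under assumptions, not the authors' implementation: it uses training-time mini-batch statistics only (no running averages), scalar `gamma` and `beta`, no bias terms, and the helper names `batch_norm`, `matmul` and `rnn_step` are hypothetical.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) using mini-batch mean and variance."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in range(dims)]
            for row in batch]


def matmul(batch, weights):
    """Multiply an (n x d) batch by a (d x k) weight matrix."""
    return [[sum(row[i] * weights[i][j] for i in range(len(weights)))
             for j in range(len(weights[0]))]
            for row in batch]


def rnn_step(x_t, h_prev, w_xh, w_hh):
    """One step h_t = tanh(BN(x_t W_xh) + h_prev W_hh).

    Only the input-to-hidden term is normalized; the recurrent term is
    left untouched.
    """
    input_term = batch_norm(matmul(x_t, w_xh))
    recurrent_term = matmul(h_prev, w_hh)
    return [[math.tanh(a + b) for a, b in zip(row_in, row_rec)]
            for row_in, row_rec in zip(input_term, recurrent_term)]


if __name__ == "__main__":
    x_t = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]   # batch of 3, 2 features
    h_prev = [[0.0, 0.0] for _ in x_t]            # initial hidden state
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(rnn_step(x_t, h_prev, identity, identity))
```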
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
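The Hessian-free abstract above mentions a sampling scheme that geometrically increases the amount of data used per iteration. The abstract gives no constants, so the sketch below only illustrates the general shape of such a schedule; the starting fraction, the growth factor and the function name `geometric_sample_sizes` are hypothetical choices made for this example.

```python
import math


def geometric_sample_sizes(total, initial_fraction=0.125, growth=2.0):
    """Return per-iteration sample sizes that grow geometrically up to `total`.

    `initial_fraction` and `growth` are illustrative values, not taken
    from the abstract.
    """
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1")
    if not 0.0 < initial_fraction <= 1.0:
        raise ValueError("initial_fraction must be in (0, 1]")
    sizes = []
    size = max(1, math.ceil(total * initial_fraction))
    while size < total:
        sizes.append(size)
        size = math.ceil(size * growth)
    sizes.append(total)  # the final iterations use the full dataset
    return sizes


if __name__ == "__main__":
    print(geometric_sample_sizes(1000))
```

Early iterations touch only a small subset of the data, so they are cheap, while later iterations see the full set once the schedule saturates.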
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
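To make the role of the rounding scheme in the limited-precision abstract above more concrete, here is a minimal plain-Python sketch that contrasts round-to-nearest with stochastic rounding on a fixed-point grid. The number of fractional bits, the function names and the toy update loop are assumptions for illustration; the sketch does not model the 16-bit word width (no saturation) or the hardware described in the abstract.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a neighbouring multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased:
    E[stochastic_round(x)] == x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    return (lower + (1 if rng.random() < prob_up else 0)) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001            # smaller than half the grid step of 1/256
    w_nearest = w_stochastic = 0.0
    for _ in range(10000):
        w_nearest = round_to_nearest(w_nearest + update)
        w_stochastic = stochastic_round(w_stochastic + update, rng=rng)
    # Round-to-nearest discards every small update; stochastic rounding
    # accumulates them on average (the exact sum would be 10.0).
    print(w_nearest, w_stochastic)
```

The toy loop shows why the rounding mode matters: updates smaller than half a grid step vanish entirely under round-to-nearest, while stochastic rounding preserves them in expectation.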
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
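One way to picture the first direction in the pooling abstract above (combining max and average pooling) is a convex mix of the two. The sketch below assumes a single mixing proportion `alpha` in [0, 1] applied to one pooling region; the function name and this particular parametrization are illustrative assumptions, and in a trained network `alpha` would be learned rather than fixed by hand.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one region (hypothetical helper).

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain
    average pooling, and values in between interpolate linearly.
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


# Usage on a flattened 2x2 pooling window: max is 6, mean is 3.
window = [1.0, 2.0, 3.0, 6.0]
as_max = mixed_pool(window, 1.0)    # 6.0
as_avg = mixed_pool(window, 0.0)    # 3.0
halfway = mixed_pool(window, 0.5)   # 4.5
```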
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
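The stochastic pooling procedure in the abstract above can be sketched for a single pooling region as follows. The sketch assumes the activations are non-negative (for example, outputs of rectifier units) and uses each activation's share of the region total as its selection probability; the function name and the handling of an all-zero region are assumptions made for illustration.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region (hypothetical helper).

    Each activation is picked with probability proportional to its
    value, i.e. a multinomial distribution given by the activities in
    the region. Activations are assumed to be non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0  # assumption: an all-zero region pools to zero
    return rng.choices(region, weights=region, k=1)[0]


# Usage: in the region [0, 1, 3] the value 3 is picked about 75% of the
# time, 1 about 25% of the time, and 0 never.
rng = random.Random(0)
draws = [stochastic_pool([0.0, 1.0, 3.0], rng=rng) for _ in range(10000)]
share_of_three = draws.count(3.0) / len(draws)
```

Unlike max pooling, which always forwards the largest activation, this lets weaker activations pass through occasionally, which is where the regularizing effect comes from.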
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
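As a rough illustration of the dropout based GD method for convex ERMs discussed in the preceding abstract, the sketch below trains a logistic regression model while randomly dropping input features at each step. The toy data, the keep probability, the learning rate, and the function name `dropout_gd_logreg` are all assumptions made for illustration; this is not the authors' algorithm, and it does not reproduce their stability or privacy analysis.

```python
import numpy as np

def dropout_gd_logreg(X, y, keep_prob=0.8, lr=0.1, steps=500, seed=0):
    """Gradient descent on the logistic loss with input dropout (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Drop each feature independently; rescale so the expected input is unchanged.
        mask = rng.random((n, d)) < keep_prob
        X_drop = X * mask / keep_prob
        # Gradient of the mean logistic loss for labels in {0, 1}.
        p = 1.0 / (1.0 + np.exp(-X_drop @ w))
        grad = X_drop.T @ (p - y) / n
        w -= lr * grad
    return w

# Toy, linearly separable data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = (X @ true_w > 0).astype(float)

w = dropout_gd_logreg(X, y)
accuracy = np.mean((X @ w > 0) == (y == 1))
print(accuracy)
```

The mask is resampled at every step, so each update sees a different randomly thinned copy of the inputs; predictions at test time use the full, undropped features.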
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
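The layer structure described in the preceding abstract (a group of mutually independent k-centers clusterings whose centers are randomly sampled data points) can be sketched roughly as follows. The one-hot encoding of the nearest center, the Euclidean distance, the parameter values, and the function name `bootstrap_layer` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings=10, k=5, seed=0):
    """One layer: several independent k-centers clusterings with random centers."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are k data points sampled at random (without replacement).
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # One-hot code of the nearest center (an assumed encoding).
        codes.append(np.eye(k)[nearest])
    # Concatenate the codes of all clusterings into one sparse representation.
    return np.concatenate(codes, axis=1)

X = np.random.default_rng(1).normal(size=(100, 8))
H = bootstrap_layer(X)
print(H.shape)  # (100, 50)
```

Stacking would feed `H` into another call of `bootstrap_layer`, typically with a smaller `k`; that choice is likewise an assumption here.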
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
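To make the input-to-hidden variant mentioned in the preceding abstract concrete, here is a minimal NumPy sketch of a vanilla RNN step in which batch normalization is applied only to the input-to-hidden term. The tanh nonlinearity, the per-time-step mini-batch statistics, the layer sizes, and the names `batch_norm` and `rnn_step` are assumptions for illustration, not the authors' exact setup.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    # Batch normalization is applied to the input-to-hidden term only;
    # the hidden-to-hidden term is left unnormalized.
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h + b)

rng = np.random.default_rng(0)
batch, n_in, n_hid, T = 16, 4, 8, 5
W_x = rng.normal(scale=0.1, size=(n_in, n_hid))
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

xs = rng.normal(size=(T, batch, n_in))  # (time, batch, features)
h = np.zeros((batch, n_hid))
for x_t in xs:
    h = rnn_step(x_t, h, W_x, W_h, b, gamma, beta)
print(h.shape)
```

Because the normalized term depends only on the inputs, it can be computed for all time steps before the recurrence is unrolled.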
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
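The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter, and the choice of a plain fixed-point grid with step `2**-frac_bits` are assumptions made here for illustration. The idea shown is that a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Hypothetical helper for illustration: x is rounded up with
    probability equal to its fractional distance from the lower grid
    point, and rounded down otherwise.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    # With frac_bits=2 the grid step is 0.25, so 0.3 lands on 0.25 or 0.5.
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    print("mean of stochastically rounded samples:", sum(samples) / len(samples))
```

Round-to-nearest would always map 0.3 to 0.25 on this grid, whereas the stochastic scheme returns 0.25 or 0.5 in proportions that average out close to 0.3 over many draws.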
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
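One way to read "combining of max and average pooling" in the abstract above is as a weighted blend of the two operations. The sketch below assumes that reading; it is not taken from the paper. The names `mixed_pool` and `pool2d` are hypothetical, and the mixing weight `alpha` is fixed by hand here, whereas the abstract describes pooling functions that are learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 reduces to max pooling and alpha = 0.0 to average
    pooling. In this sketch alpha is a fixed number chosen by the caller.
    """
    region_max = max(region)
    region_avg = sum(region) / len(region)
    return alpha * region_max + (1.0 - alpha) * region_avg


def pool2d(image, size, alpha):
    """Apply mixed_pool over non-overlapping size x size windows.

    image is a list of equal-length rows of numbers; windows that would
    run past the right or bottom edge are dropped.
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [image[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(mixed_pool(region, alpha))
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    image = [
        [1.0, 2.0, 0.0, 0.0],
        [3.0, 4.0, 0.0, 8.0],
    ]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, pool2d(image, size=2, alpha=alpha))
```

Keeping `alpha` as an explicit argument makes the two conventional operations visible as the endpoints of a single family, which is the property a learned variant would exploit.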
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
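The sampling step described in the abstract above (picking one activation per pooling region from a multinomial distribution given by the activities in that region) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, outputs of rectifier units), and returning 0.0 for an all-zero region is a convention chosen here.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its
    value. Assumes all activations in the region are non-negative.
    """
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from.
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]
    draws = [stochastic_pool(region, rng=rng) for _ in range(20000)]
    # The larger activation should be picked in roughly 3 out of 4 draws.
    print("fraction of draws equal to 3.0:", draws.count(3.0) / len(draws))
```

Unlike max pooling, smaller activations in a region still have a chance of being passed forward, which is the source of the regularizing noise.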
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
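The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is an illustration only: the function name lp_pool is hypothetical, the learned projections mentioned in the abstract are omitted, and the normalization is assumed to be division by the number of inputs before taking the p-th root.

```python
import math


def lp_pool(values, p):
    """Normalized L_p norm of a list of inputs: (mean(|v|^p))^(1/p).

    p = 1 gives the average of absolute values, p = 2 gives the
    root-mean-square, and p = infinity gives the maximum absolute value.
    """
    if math.isinf(p):
        # Limit of the normalized L_p norm as p grows without bound.
        return max(abs(v) for v in values)
    mean_power = sum(abs(v) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    inputs = [3.0, 4.0]  # hypothetical pre-pooling signals
    for order in (1, 2, 50, math.inf):
        print(order, lp_pool(inputs, order))
```

As the order grows, the output moves from the average toward the largest input, so a unit that learns its own order can sit anywhere between average pooling and max pooling.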
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
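As a rough illustration of the layer described in the multilayer bootstrap network abstract above, the sketch below builds one layer from several independent clusterings whose centers are randomly sampled data points. The function name `mbn_layer`, the squared Euclidean distance, and the one-hot encoding of the nearest center are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def mbn_layer(X, n_clusterings, k, rng):
    """One hypothetical bootstrap layer: a group of independent k-centers clusterings.

    Each clustering samples k data points as its centers and encodes every
    input as a one-hot indicator of its nearest center (an assumed encoding).
    """
    outputs = []
    for _ in range(n_clusterings):
        # Centers are randomly sampled rows of X, drawn without replacement.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # One-hot code of the nearest center for each point.
        outputs.append(np.eye(k)[dists.argmin(axis=1)])
    # Concatenate the codes of all clusterings into one sparse representation.
    return np.hstack(outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
H = mbn_layer(X, n_clusterings=4, k=5, rng=rng)
print(H.shape)
```

Stacking several such layers, each fed with the output of the previous one, would give a multilayer variant; how the number of centers changes from layer to layer is not stated in the abstract and is left out here.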
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
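The input-to-hidden variant mentioned in the batch normalization abstract above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the tanh recurrence, the names `batch_norm` and `rnn_step`, and the parameter shapes are all chosen for the example.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One step of a plain tanh RNN (assumed cell type).

    Only the input-to-hidden term x_t @ W_x is batch-normalized;
    the hidden-to-hidden term h_prev @ W_h is left unnormalized.
    """
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h + b)

rng = np.random.default_rng(0)
batch, n_in, n_hid = 16, 4, 6
x_t = rng.normal(size=(batch, n_in))
h = np.zeros((batch, n_hid))
W_x = rng.normal(size=(n_in, n_hid))
W_h = rng.normal(size=(n_hid, n_hid))
h = rnn_step(x_t, h, W_x, W_h, np.zeros(n_hid), np.ones(n_hid), np.zeros(n_hid))
print(h.shape)
```

At test time a real implementation would typically replace the mini-batch mean and variance with running averages collected during training; that bookkeeping is omitted here.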
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
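The stochastic rounding mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name stochastic_round, the split of the 16-bit word into frac_bits fractional bits, and the saturation behaviour are all assumptions made for the example. A value is rounded up to the next fixed-point grid step with probability equal to its distance from the lower step, so the result is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed point using stochastic rounding.

    frac_bits and word_bits are illustrative assumptions, not values
    taken from the abstract.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    q = lower + (1 if rng.random() < scaled - lower else 0)
    # Saturate to the range of a signed word_bits-wide integer.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, q))
    return q / scale
```

With frac_bits=2 the grid step is 0.25, so 0.3 becomes 0.25 about 80% of the time and 0.5 about 20% of the time, and the average over many calls stays close to 0.3. Round-to-nearest would always return 0.25, which is why small updates can vanish under that scheme.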
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
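The pooling abstract above mentions combining max and average pooling but does not state how. As a minimal sketch, assuming the combination is a convex blend controlled by a scalar `alpha` (a hypothetical name; in a trained model it would be learned, here it is simply passed in), one pooling region could be reduced like this:

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    Assumption: the blend is alpha * max + (1 - alpha) * mean.
    alpha = 1.0 reduces to max pooling, alpha = 0.0 to average pooling.
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mean = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * mean


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

The tree-structured fusion of learned pooling filters mentioned in the same abstract is not covered by this sketch.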
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
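The stochastic pooling procedure described above can be sketched in a few lines. This is an illustrative sketch only, assuming non-negative activations (for example after a rectifier), sampling probabilities proportional to the activations in each region, and an all-zero region pooling to zero; the function names are hypothetical.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation from a region, with probability
    proportional to its value (multinomial over the region)."""
    if any(a < 0 for a in activations):
        raise ValueError("activations are assumed to be non-negative")
    if sum(activations) <= 0:
        # Assumption: a region with no active unit pools to zero.
        return 0.0
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_1d(row, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of a list."""
    return [
        stochastic_pool_region(row[i:i + size], rng)
        for i in range(0, len(row) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], 2, rng))
```

Passing a seeded `random.Random` instance makes the sampling reproducible; at test time a deterministic rule would replace the sampling, which this sketch does not show.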
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
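To make the pooling interpretation of the $L_p$ unit concrete, here is a minimal sketch of a normalized $L_p$ norm over a list of projection values. It assumes the projections have already been computed and that the order `p` is fixed and at least 1; the learned projections and learned per-unit orders described in the abstract are not modeled, and `lp_unit` is a hypothetical name.

```python
def lp_unit(projections, p):
    """Normalized L_p norm: ((1/N) * sum(|z_i| ** p)) ** (1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not projections:
        raise ValueError("projections must be non-empty")
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [3.0, -4.0]
    for p in (1, 2, 200):
        print(p, lp_unit(z, p))
```

Raising `p` moves the output from an average-like response toward a max-like response, which is the sense in which one unit interpolates between the conventional pooling operators.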
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
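The input-to-hidden variant mentioned in the abstract just above can be sketched in a few lines of NumPy. This is only a minimal forward-pass illustration under assumptions of our own: a vanilla tanh RNN, statistics computed separately at each time step, and hypothetical names such as `rnn_step`. It is not the authors' implementation.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature using statistics of the current mini-batch."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """Vanilla tanh RNN step with batch norm on the input-to-hidden term only."""
    input_term = batch_norm(x_t @ W_x, gamma, beta)  # normalized
    hidden_term = h_prev @ W_h                       # left unnormalized
    return np.tanh(input_term + hidden_term + b)

rng = np.random.default_rng(0)
batch, n_in, n_hid, steps = 32, 8, 16, 5
W_x = rng.normal(scale=0.1, size=(n_in, n_hid))
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for t in range(steps):
    x_t = rng.normal(size=(batch, n_in))  # toy input for time step t
    h = rnn_step(x_t, h, W_x, W_h, b, gamma, beta)
```

Normalizing `h_prev @ W_h` as well would give the hidden-to-hidden variant that the abstract reports as unhelpful.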
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
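The stochastic rounding scheme mentioned in the abstract just above can be illustrated with a short NumPy sketch. The function name `stochastic_round_fixed`, the choice of fractional bits, and the saturation behaviour are assumptions for illustration; the abstract does not give these details. The idea shown is only the core one: round up with probability equal to the fractional distance from the lower grid point, so that the rounding is unbiased in expectation.

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits, word_bits=16, rng=None):
    """Round x to a signed fixed-point grid with stochastic rounding.

    A value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the rounding is unbiased
    in expectation. Results are saturated to the word_bits range.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability (scaled - lower), down otherwise.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the representable signed integer range.
    q_min, q_max = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(q, q_min, q_max) / scale

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
q = stochastic_round_fixed(x, frac_bits=2, rng=rng)  # grid step is 0.25
```

With a grid step of 0.25, round-to-nearest would map every 0.3 to 0.25, whereas the stochastically rounded values are a mix of 0.25 and 0.5 whose average stays close to 0.3.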
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
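As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below blends the two with a single mixing proportion. The function name `mixed_pool`, the parameter `alpha`, and the convex-combination form are assumptions made for this example; the abstract does not specify how the combination is parametrized, and in a real network `alpha` would presumably be learned rather than passed in.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha is a hypothetical mixing proportion in [0, 1]:
    alpha = 1 gives max pooling, alpha = 0 gives average pooling.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return alpha * max_value + (1.0 - alpha) * mean_value
```

For example, `mixed_pool([1.0, 2.0, 3.0], 0.5)` sits halfway between the average (2.0) and the max (3.0) of the region.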
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
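The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is a minimal sketch, not the authors' implementation: the name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for instance, rectified), and returning 0.0 for an all-zero region is a choice made here for the example.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. from a multinomial distribution given by the activities in the
    region. Assumes all activations are non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total <= 0:
        # All activations are zero: nothing to sample (assumed convention).
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible; larger activations are selected more often, but smaller non-zero ones are occasionally chosen, which is the source of the regularizing noise.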
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al.
(2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
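The normalized $L_p$ norm computed by the $L_p$ unit in the abstract above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `lp_unit`, the arguments `W`, `b` and `p`, and the choice to treat the order `p` as a fixed number rather than a learned parameter are all assumptions made for illustration. The sketch only shows how the same unit moves from average pooling of absolute values (p = 1) through root-mean-square pooling (p = 2) toward max pooling as p grows.

```python
import numpy as np

def lp_unit(x, W, b, p):
    """Normalized L_p norm over N linear projections of the input x.

    Hypothetical helper: W has shape (N, d), b has shape (N,), p >= 1.
    """
    z = W @ x + b                              # N projections of the layer below
    return np.mean(np.abs(z) ** p) ** (1.0 / p)

# Illustrative input: two projections that simply pass the inputs through.
x = np.array([3.0, -4.0])
W = np.eye(2)
b = np.zeros(2)

avg_pool = lp_unit(x, W, b, 1.0)     # mean of absolute values
rms_pool = lp_unit(x, W, b, 2.0)     # root-mean-square
near_max = lp_unit(x, W, b, 200.0)   # approaches max(|z|) for large p
```

A model along the lines of the abstract would learn a separate order for each unit; here `p` is passed in by hand to keep the example short.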
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
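The single layer described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and encoding each clustering's output as a one-hot indicator of the nearest center is an assumption, since the abstract does not say how a layer's output is represented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_bootstrap_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering is k centers sampled at random from the data."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def apply_bootstrap_layer(layer, point):
    """Encode a point as concatenated one-hot codes (assumed encoding),
    one code per clustering, marking the nearest center."""
    code = []
    for centers in layer:
        dists = [squared_distance(point, c) for c in centers]
        nearest = dists.index(min(dists))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Toy usage on random 2-D points (hypothetical data, not from the paper).
rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = fit_bootstrap_layer(data, num_clusterings=4, k=3, rng=rng)
features = [apply_bootstrap_layer(layer, p) for p in data]
```

Stacking would then feed `features` into the next layer as its input data; the number of clusterings and the value of k per layer are free choices in this sketch.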
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
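The geometric sampling idea in the Hessian-free abstract above (using geometrically more data at each outer iteration) could be illustrated with a schedule like the one below. This is a minimal sketch, not the authors' algorithm: the abstract gives neither the starting sample size nor the growth factor, so `initial` and `growth` are hypothetical parameters, as is the function name.

```python
import math


def geometric_sample_sizes(total, initial, growth):
    """Yield per-iteration sample sizes that grow by a constant factor
    until the full dataset size `total` is reached."""
    if initial <= 0 or growth <= 1.0:
        raise ValueError("initial must be positive and growth must exceed 1")
    size = float(initial)
    while size < total:
        yield int(math.ceil(size))
        size *= growth
    # The final iterations use the whole dataset.
    yield total


# Hypothetical numbers: 1000 training examples, start at 100, double each time.
schedule = list(geometric_sample_sizes(total=1000, initial=100, growth=2.0))
# schedule == [100, 200, 400, 800, 1000]
```

Early iterations are cheap because gradients and Krylov iterations see only a small subset, while later iterations approach the full-data computation.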
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
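The stochastic rounding mentioned in the limited-precision abstract above can be sketched as follows. This is an illustrative version under stated assumptions, not the paper's implementation: the abstract specifies only a 16-bit fixed-point representation, so the split into fractional bits (`frac_bits`), the saturation behaviour, and the function name are assumptions made for the example.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result equals x in expectation.
    """
    step = 2.0 ** -frac_bits
    scaled = x / step
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word (assumed overflow handling).
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * step


# With a step of 0.25, the value 0.3 rounds to 0.25 or 0.5 at random.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time and systematically lose the remainder, whereas the average of the stochastic results stays close to 0.3. That unbiasedness is the usual argument for why small gradient updates survive in low precision.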
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
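As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below blends the two with a scalar weight over a single pooling region. The function name `mixed_pool` and the parameter `a` are hypothetical, and the weight is passed in as a fixed number here; the abstract states that the combination is learned but does not say how, so that part is left out.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling.
    In a trained network `a` would be a learned parameter (assumption).
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val

region = [1.0, 3.0, 2.0, 6.0]
print(mixed_pool(region, 1.0))  # pure max pooling
print(mixed_pool(region, 0.0))  # pure average pooling
print(mixed_pool(region, 0.5))  # equal blend of the two
```

Because the output is a differentiable function of `a`, the mixing weight could in principle be updated by gradient descent alongside the other network weights.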
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
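The stochastic pooling procedure described in the last abstract above can be sketched for a single pooling region as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, after a ReLU), and the sampling probabilities are assumed to be the activations normalized by their sum.

```python
import random

def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. a multinomial distribution over the region's activities.
    """
    if any(v < 0 for v in region):
        raise ValueError("activations are assumed non-negative")
    if sum(region) == 0:
        return 0.0  # all-zero region: nothing to sample from
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]

rng = random.Random(0)  # seeded generator for reproducible sampling
region = [0.5, 0.0, 2.0, 1.5]
samples = [stochastic_pool(region, rng) for _ in range(5)]
print(samples)
```

Larger activations are selected more often, but smaller non-zero ones still get picked occasionally, which is the source of the regularizing noise. A zero activation is never selected unless the whole region is zero.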
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and makes it difficult to find appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
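The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the "normalized $L_p$ norm" means ((1/N) * sum |z_i|^p)^(1/p) applied to already-projected inputs z; the function name lp_unit and the sample values are hypothetical, and the learned projections and learned orders from the abstract are left out.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p), for p >= 1.

    Assumption: 'values' are the already-projected inputs to the unit.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(values)
    # Factor out the largest magnitude so large p does not overflow.
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    mean_pow = sum((abs(v) / m) ** p for v in values) / n
    return m * mean_pow ** (1.0 / p)


z = [1.0, -2.0, 3.0, -4.0]
avg_abs = lp_unit(z, 1)     # p = 1: average of the absolute values
rms = lp_unit(z, 2)         # p = 2: root-mean-square
near_max = lp_unit(z, 200)  # large p: approaches max |z_i|
print(avg_abs, rms, near_max)
```

With p = 1 the unit reduces to average pooling of magnitudes, with p = 2 to root-mean-square pooling, and as p grows it approaches max pooling, which is the sense in which one order parameter spans the three operators.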
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
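The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. The function name `mixed_pool` and the fixed mixing proportion `a` are assumptions made for illustration; the abstract describes a learned combination and does not give the exact form used.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a hypothetical mixing proportion in [0, 1]:
    a = 1 gives plain max pooling, a = 0 gives plain average pooling.
    In a trained network `a` would be a learned parameter; it is fixed here.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(region, a))
```

Because the output is a convex combination of the two conventional operators, it always lies between the average and the maximum of the region.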
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
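The stochastic pooling procedure described in the abstract above can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example after a rectifier); the function names are hypothetical, and the probability-weighted average in `expected_pool` is an assumed deterministic counterpart that the abstract itself does not describe.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. from a multinomial distribution given by the activities in the region.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # nothing is active, so the pooled value is zero
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


def expected_pool(region):
    """Probability-weighted average of the region (assumed deterministic variant)."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * (a / total) for a in region)


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 2.0, 3.0, 0.0]
    print([stochastic_pool(region, rng) for _ in range(5)])
    print(expected_pool(region))
```

Larger activations are selected more often, but smaller non-zero activations still have a chance of being propagated, which is the source of the regularizing noise.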
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et 
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
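The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is a simplified illustration, assuming the unit computes (mean of |x_i|^p)^(1/p) over its inputs; the learned projections and per-unit learned orders mentioned in the abstract are omitted, and the name `lp_unit` is hypothetical.

```python
import math


def lp_unit(inputs, p):
    """Normalized L_p norm: (mean of |x_i| ** p) ** (1 / p), for p >= 1.

    p = 1 gives the average of absolute values,
    p = 2 gives the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    mean_power = sum(abs(x) ** p for x in inputs) / len(inputs)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    x = [3.0, -4.0]
    print("p=1  :", lp_unit(x, 1))    # average of |x_i|
    print("p=2  :", lp_unit(x, 2))    # root-mean-square
    print("p=200:", lp_unit(x, 200))  # close to max |x_i|
    print("max  :", max(abs(v) for v in x))
```

Very large orders can overflow floating-point arithmetic for inputs with large magnitude, so a practical implementation would likely rescale by the maximum absolute value first.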
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
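The geometric sampling idea mentioned above can be illustrated with a simple schedule. The abstract gives no starting size or growth factor, so `n_start`, `growth`, and the function name `geometric_sample_sizes` are hypothetical; this shows only the general shape of such a schedule, not the authors' algorithm.

```python
import math

def geometric_sample_sizes(n_total, n_start, growth, n_iters):
    """Number of training samples to use at each outer iteration.

    The sample size is multiplied by `growth` every iteration and
    capped at the full dataset size `n_total`.
    """
    sizes = []
    for k in range(n_iters):
        sizes.append(min(n_total, math.ceil(n_start * growth ** k)))
    return sizes

schedule = geometric_sample_sizes(n_total=1000, n_start=100,
                                  growth=2.0, n_iters=5)
```

Early iterations are then cheap, and later ones see the whole dataset once the cap is reached.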
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
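Stochastic rounding, as referred to above, rounds a value up or down to a neighboring representable number with probability proportional to its proximity, so the result is unbiased in expectation. The sketch below simulates this in floating point for a signed fixed-point format; the function name `stochastic_round_fixed`, the saturation behavior, and the chosen bit widths are assumptions for illustration, not details taken from the abstract.

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits, word_bits=16, rng=None):
    """Round x to a signed fixed-point grid with stochastic rounding.

    The grid spacing is 2**-frac_bits. A value is rounded up with
    probability equal to its fractional remainder on that grid, and
    down otherwise. Assumption: out-of-range values saturate.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability (scaled - floor), which lies in [0, 1).
    round_up = rng.random(scaled.shape) < (scaled - floor)
    q = floor + round_up
    # Saturate to the representable signed integer range.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale

rng = np.random.default_rng(0)
samples = stochastic_round_fixed(np.full(100_000, 0.3), frac_bits=2, rng=rng)
```

With `frac_bits=2` each sample is either 0.25 or 0.5, yet their average stays close to 0.3, whereas round-to-nearest would always give 0.25.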
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
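The claim that depth lets a piecewise linear network reuse computation exponentially often can be seen in a one-dimensional toy case. The sketch below is a standard illustration and not the construction from the abstract: a two-unit ReLU layer implements a "tent" map on [0, 1], and composing it `depth` times yields 2**depth linear regions. The names `tent` and `count_linear_regions` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    """Two ReLU units that fold [0, 1] onto itself: 2x, then 2 - 2x."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_regions(depth, grid_bits=12):
    """Count linear pieces of the depth-fold composition on [0, 1].

    A dyadic grid is used so that every breakpoint k / 2**depth falls
    exactly on a grid point (requires depth <= grid_bits).
    """
    x = np.linspace(0.0, 1.0, 2 ** grid_bits + 1)
    y = x.copy()
    for _ in range(depth):
        y = tent(y)
    slopes = np.diff(y) / np.diff(x)
    # A new region starts wherever the slope changes.
    return int(np.count_nonzero(np.diff(slopes) != 0)) + 1

regions = [count_linear_regions(d) for d in (1, 2, 3, 5)]
```

Each layer adds only two units, while the number of regions doubles.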
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
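The idea of a dropout based GD method for a convex ERM, mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the algorithm from that work: it assumes a squared loss on a linear model, drops input features independently, and rescales the survivors (inverted dropout). The names dropout_gd and mean_squared_error and all default parameter values are hypothetical, chosen only for illustration.

```python
import random


def dropout_gd(xs, ys, drop_prob=0.2, lr=0.1, epochs=200, seed=0):
    """Full-batch gradient descent on squared loss with feature dropout.

    Assumption: a linear model w.x; each feature is dropped independently
    with probability drop_prob and kept features are divided by the keep
    probability so the masked input is unbiased.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    keep = 1.0 - drop_prob
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            # Fresh dropout mask for every example at every epoch.
            masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
            err = sum(wi * mi for wi, mi in zip(w, masked)) - y
            for j in range(dim):
                grad[j] += err * masked[j]
        # Average the gradient over the batch and take one step.
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def mean_squared_error(w, xs, ys):
    """Mean squared error of the linear model without any dropout."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(xs)
```

In this sketch the random masks add a data-dependent penalty on the weights in expectation, so the learned weights are shrunk relative to the unregularized least-squares solution. That shrinkage is the regularizing effect the abstract refers to, shown here only for this simplified setting.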
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
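The stochastic rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. It is not the implementation from that work, only one common way of writing the idea: a value is rounded to the next grid point up with probability equal to its fractional distance from the grid point below, so the rounded value is unbiased in expectation. The function name stochastic_round and the parameters frac_bits and word_bits are hypothetical, and the saturation behaviour at the range limits is an assumption.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    Assumptions: the grid spacing is 2**-frac_bits, the stored integer is
    word_bits wide (two's complement), and out-of-range values saturate.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the remainder in [0, 1).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale
```

Rounding to nearest would map every small update below half a grid step to zero. With stochastic rounding such updates survive on average, which is the usual intuition for why the choice of rounding scheme matters in low-precision training.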
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
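The input-to-hidden variant described in the preceding abstract can be illustrated with a minimal sketch. It assumes a vanilla tanh RNN, per-step mini-batch statistics, and the update h_t = tanh(BN(W_x x_t) + W_h h_{t-1}); the abstract does not specify these details, and the function names (`matvec`, `batch_norm`, `rnn_step`) are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift."""
    n = len(batch)
    dim = len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(dim)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
           for j in range(dim)]
    return [[gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
             for j in range(dim)]
            for row in batch]


def rnn_step(x_batch, h_batch, W_x, W_h, gamma, beta):
    """One recurrent step; only the input-to-hidden path is normalized.

    Assumed form: h_t = tanh(BN(W_x x_t) + W_h h_{t-1}).
    """
    in_proj = batch_norm([matvec(W_x, x) for x in x_batch], gamma, beta)
    rec_proj = [matvec(W_h, h) for h in h_batch]  # left unnormalized
    return [[math.tanh(a + r) for a, r in zip(a_row, r_row)]
            for a_row, r_row in zip(in_proj, rec_proj)]
```

Leaving the recurrent projection unnormalized mirrors the abstract's observation that normalizing hidden-to-hidden transitions did not help; at test time one would normally replace the mini-batch statistics with running averages, which this sketch omits.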
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) 
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
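The link between Fisher information and estimator variance invoked by the CIF principle above can be written out explicitly. The block below states the standard scalar Cramér-Rao bound and works it through for Bernoulli data, a natural case for binary distributions; the symbols $\hat{\theta}$, $I(\theta)$ and $\bar{x}$ are notation chosen here for illustration and do not come from the abstract.

```latex
% Cramér-Rao bound for an unbiased estimator \hat{\theta} of a scalar parameter \theta
\mathrm{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log p(x;\theta)\right)^{2}\right].

% Worked example: n i.i.d. Bernoulli(\theta) observations with sample mean \bar{x}
I_n(\theta) = \frac{n}{\theta(1-\theta)},
\qquad
\mathrm{Var}(\bar{x}) = \frac{\theta(1-\theta)}{n} = \frac{1}{I_n(\theta)}.
```

In this example the sample mean attains the bound, so a parameter with large Fisher information is one that can be estimated with small variance, which is the sense of "confident" used above.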
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
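To make the idea of a "dropout based GD method for convex ERMs" concrete, here is a minimal sketch that assumes one common form of dropout for a generalized linear model: logistic regression trained by stochastic gradient descent, where each input feature is independently zeroed at every step and the kept features are rescaled. The abstract does not specify this exact procedure; the function name `dropout_sgd_logreg`, the rescaling by `keep_prob`, the toy data and all hyperparameters are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropout_sgd_logreg(xs, ys, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """SGD on the logistic loss with feature dropout (assumed variant)."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Drop each feature independently; rescale the kept ones so the
            # expected input equals the original input.
            x_drop = [xi / keep_prob if rng.random() < keep_prob else 0.0
                      for xi in x]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_drop)))
            # Gradient of the logistic loss w.r.t. w is (p - y) * x_drop.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_drop)]
    return w

# Toy data (hypothetical): both features are positive for label 1
# and negative for label 0.
xs = [[2.0, 1.0], [-1.5, -0.5], [1.0, 2.0], [-2.0, -1.0]]
ys = [1, 0, 1, 0]
w = dropout_sgd_logreg(xs, ys)
# At prediction time no features are dropped.
preds = [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5 else 0
         for x in xs]
```

Because the loss stays convex in `w` for every dropout mask, each step is an ordinary convex SGD update on a randomly perturbed input, which is the setting in which the stability claims above are made.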
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
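One layer of the multilayer bootstrap network described above can be sketched in a few lines. The sketch assumes that each clustering encodes a point as a one-hot indicator of its nearest center and that the layer output concatenates these codes; the abstract only states that a layer is a group of independent k-centers clusterings with randomly sampled centers. The names `kcenters_encode` and `mbn_layer`, the squared Euclidean distance and the toy data are illustrative assumptions, not the authors' implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kcenters_encode(data, centers):
    """One-hot encode each point by the index of its nearest center."""
    codes = []
    for point in data:
        nearest = min(range(len(centers)),
                      key=lambda j: sq_dist(point, centers[j]))
        code = [0] * len(centers)
        code[nearest] = 1
        codes.append(code)
    return codes

def mbn_layer(data, num_clusterings, k, rng):
    """Concatenate the codes of several independent k-centers clusterings."""
    outputs = [[] for _ in data]
    for _ in range(num_clusterings):
        # The centers of each clustering are randomly sampled data points.
        centers = rng.sample(data, k)
        for out, code in zip(outputs, kcenters_encode(data, centers)):
            out.extend(code)
    return outputs

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
features = mbn_layer(data, num_clusterings=3, k=4, rng=rng)
```

Each output row has `num_clusterings * k` entries with exactly one active entry per clustering; stacking such layers, with the output of one used as the input of the next, gives the multilayer structure.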
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
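The distinction drawn above between normalizing the input-to-hidden and the hidden-to-hidden transitions can be illustrated with a single recurrent step. The sketch below assumes a plain tanh RNN and applies mini-batch standardization only to the input-to-hidden term; the learned scale and shift of full batch normalization, the bias term, and the names `batch_norm`, `matvec_batch` and `rnn_step` are simplifications and assumptions made for this example.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature over the mini-batch (rows are examples)."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n
                 for d in range(dims)]
    return [[(row[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dims)] for row in batch]

def matvec_batch(batch, weights):
    """Apply a weight matrix (out_dim x in_dim) to every row of the batch."""
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in weights]
            for row in batch]

def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """h_t = tanh(BN(W_xh x_t) + W_hh h_{t-1}); only the input term is normalized."""
    input_term = batch_norm(matvec_batch(x_batch, w_xh))
    hidden_term = matvec_batch(h_batch, w_hh)  # left unnormalized
    return [[math.tanh(a + b) for a, b in zip(i_row, h_row)]
            for i_row, h_row in zip(input_term, hidden_term)]

# Toy mini-batch of three examples with two input features (hypothetical values).
x_batch = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
w_xh = [[0.5, -0.2], [0.1, 0.3]]
w_hh = [[0.9, 0.0], [0.0, 0.9]]
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh)
```

Normalizing the hidden-to-hidden term as well would amount to wrapping `hidden_term` in `batch_norm`, the variant reported above as not helping training.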
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
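The stochastic rounding idea in the preceding abstract can be illustrated with a minimal sketch. The abstract does not give the exact scheme, so the following assumes the common formulation: a value is rounded to one of its two neighbouring fixed-point grid points, rounding up with probability equal to its fractional distance from the lower one, which makes the rounding unbiased in expectation. The function names, the 8 fractional bits, and the saturation behaviour are illustrative assumptions, not details from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (assumed grid spacing).

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits-wide range."""
    scale = 1 << frac_bits
    max_val = ((1 << (word_bits - 1)) - 1) / scale
    min_val = -(1 << (word_bits - 1)) / scale
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [to_fixed_point(0.3, rng=rng) for _ in range(20000)]
    # The sample mean should be close to 0.3, even though 0.3 is not
    # representable on the 2**-8 grid.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to the same grid point every time, so small updates below half the grid spacing would be lost; under the assumptions above, stochastic rounding preserves them on average.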
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
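The first direction in the preceding abstract, combining max and average pooling, can be sketched for a single pooling region. The abstract does not state the combination rule, so this assumes the simplest reading: a convex blend with a mixing proportion `a` in [0, 1] that could be learned by gradient descent. The names `mixed_pool`, `mixed_pool_grad_a`, and `a` are hypothetical and used only for illustration.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    a = 1.0 reduces to max pooling and a = 0.0 to average pooling
    (assumed convex-combination form).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be in [0, 1]")
    return a * max(region) + (1.0 - a) * sum(region) / len(region)


def mixed_pool_grad_a(region):
    """Derivative of mixed_pool with respect to a.

    This is the quantity a gradient-descent update of a would use.
    """
    return max(region) - sum(region) / len(region)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(region, a))
```

Because the blend is linear in `a`, its gradient is simply the gap between the max and the mean of the region, so the mixing proportion moves toward max pooling whenever a larger pooled value reduces the loss.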
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
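The stochastic pooling procedure in the preceding abstract can be sketched as follows. This is an illustration only: it assumes non-negative activations (for example, outputs of rectifier units), returns zero for an all-zero region, and uses the hypothetical name `stochastic_pool`. It covers only the training-time sampling step stated in the abstract.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities
    within the region. Assumes non-negative activations.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption of this sketch: an all-zero region pools to zero.
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded generator so runs are repeatable
sample = stochastic_pool([0.5, 1.5, 2.0], rng)
```

Larger activations are picked more often, but smaller non-zero activations are still selected occasionally, which is where the regularizing noise comes from.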
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
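The normalized $L_p$ norm at the core of the $L_p$ unit can be sketched in a few lines. This is a simplified illustration: it takes the already-projected input signals as given, uses a fixed order `p` rather than a learned one, and the name `lp_unit` is hypothetical.

```python
import math


def lp_unit(inputs, p):
    """Normalized L_p norm of the inputs: ((1/N) * sum(|x_i|^p)) ** (1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches max(|x_i|).
    """
    if not inputs:
        raise ValueError("inputs must be non-empty")
    if p < 1:
        raise ValueError("this sketch assumes an order p >= 1")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


signals = [3.0, -4.0]
average_like = lp_unit(signals, 1)   # mean of absolute values
rms_like = lp_unit(signals, 2)       # root-mean-square
max_like = lp_unit(signals, 200)     # close to max(|x_i|) = 4
```

Sweeping `p` from 1 upward moves the output smoothly from average-style pooling toward max-style pooling, which is the sense in which the unit generalizes the conventional operators named in the abstract.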
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and the integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding unsupervised learning that is both effective and efficient. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art results on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
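As a rough illustration of the multilayer bootstrap network abstract above, the following Python sketch builds one layer as a group of independent k-centers clusterings whose centers are randomly sampled data points. The one-hot nearest-center encoding, the squared Euclidean distance, the toy data, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_layer(data, num_clusterings, k, rng):
    """One layer: each clustering is k centers sampled at random from the data."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(point, layer):
    """Concatenate one one-hot nearest-center indicator per clustering (assumed encoding)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [[rng.random(), rng.random()] for _ in range(50)]  # hypothetical toy data
layer = build_layer(data, num_clusterings=4, k=3, rng=rng)
code = encode(data[0], layer)  # 4 clusterings x 3 centers = 12 binary features
```

Stacking would feed the codes of one layer into the next; that step and the compression network mentioned in the abstract are left out of this sketch.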
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
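The input-to-hidden variant described in the abstract above can be sketched in plain Python: the input-to-hidden pre-activations are standardized with mini-batch statistics, while the hidden-to-hidden term is left unnormalized. The scalar-weight toy RNN, the tanh nonlinearity, the epsilon value, and every name below are illustrative assumptions rather than details from the abstract, and the learned scale and shift that batch normalization usually carries are omitted.

```python
import math


def batch_standardize(values, eps=1e-5):
    """Standardize one feature across a mini-batch using its mean and variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One step of a toy RNN with one input feature and one hidden unit per example."""
    # Normalize only the input-to-hidden contribution.
    input_term = batch_standardize([w_xh * x for x in x_batch])
    # The hidden-to-hidden contribution is added without normalization.
    return [math.tanh(i + w_hh * h) for i, h in zip(input_term, h_batch)]


x_batch = [0.5, 1.5, 2.5, 3.5]  # hypothetical mini-batch of scalar inputs
h_batch = [0.0, 0.0, 0.0, 0.0]  # initial hidden states
normalized = batch_standardize([2.0 * x for x in x_batch])
h_next = rnn_step(x_batch, h_batch, w_xh=2.0, w_hh=0.5)
```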
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
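The stochastic rounding idea in the preceding abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round`, the signed fixed-point layout, and the default split of a 16-bit word into 8 fractional bits are assumptions made here for illustration. A value is rounded up to the next representable fixed-point number with probability equal to its fractional distance from the lower neighbor, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, total_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    Hypothetical helper for illustration: frac_bits and total_bits describe
    an assumed fixed-point format, not one taken from the source text.
    """
    scale = 1 << frac_bits                 # number of steps per unit
    scaled = x * scale
    lower = math.floor(scaled)             # nearest representable value below
    p_up = scaled - lower                  # distance to lower neighbor, in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)

    # Saturate to the range of a signed total_bits-wide integer.
    max_int = (1 << (total_bits - 1)) - 1
    min_int = -(1 << (total_bits - 1))
    rounded = max(min_int, min(max_int, rounded))
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # With 2 fractional bits, 0.3 lies between 0.25 and 0.5; the sample mean
    # should be close to 0.3 because the rounding is unbiased.
    print(sorted(set(draws)), sum(draws) / len(draws))
```

Round-to-nearest would always map 0.3 to 0.25 in this format, so small updates would be lost systematically; the stochastic variant preserves them on average.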
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
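The first direction named above, combining max and average pooling, can be illustrated with a fixed mixing weight. This is only a sketch under stated assumptions: the function name `mixed_pool` and the scalar `mix` are hypothetical, and the abstract describes a combination that is learned rather than set by hand.

```python
def mixed_pool(region, mix):
    """Blend max pooling and average pooling over one pooling region.

    `mix` is an illustrative, hand-set weight in [0, 1]:
    mix = 1.0 gives pure max pooling, mix = 0.0 gives pure average pooling.
    A learned version would treat `mix` as a trainable parameter.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    return mix * max_value + (1.0 - mix) * mean_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for mix in (0.0, 0.5, 1.0):
        print(mix, mixed_pool(region, mix))
```

Setting `mix` to 0 or 1 recovers the two conventional operations, which is why a blend of this kind can be swapped in wherever average or max pooling is used.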
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
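The stochastic pooling procedure described above (pick one activation per pooling region, with probability given by the activities in that region) can be sketched as follows. The function names are hypothetical, and the sketch assumes non-negative activations (for example, after a rectifier) and falls back to a uniform choice when every activation in the region is zero; neither detail is stated in the abstract.

```python
import random


def pooling_probabilities(region):
    """Normalize non-negative activations into a multinomial distribution."""
    if not region:
        raise ValueError("pooling region must not be empty")
    if any(a < 0 for a in region):
        raise ValueError("this sketch assumes non-negative activations")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region is sampled uniformly.
        return [1.0 / len(region)] * len(region)
    return [a / total for a in region]


def stochastic_pool(region, rng):
    """Sample one activation from the region according to its own activity."""
    probs = pooling_probabilities(region)
    index = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[index]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 2.0, 1.0]
    print(pooling_probabilities(region))
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are chosen more often, but smaller ones still get through occasionally, which is the source of the regularizing noise.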
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
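The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is illustrative only: `lp_unit` is a hypothetical name, its inputs stand in for the already-projected signals, and the order `p` is passed in by hand rather than learned per unit as the abstract proposes.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p).

    `values` stands in for the projected inputs to the unit (assumption:
    any learned projection has already been applied by the caller).
    p = 1 averages the absolute values, p = 2 gives the root mean square,
    and large p approaches the maximum absolute value.
    """
    if not values:
        raise ValueError("values must not be empty")
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    mean_power = sum(abs(v) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    signals = [1.0, -2.0, 4.0]
    for order in (1, 2, 8, 200):
        print(order, lp_unit(signals, order))
```

As the order grows the output moves from the mean absolute value toward the largest magnitude, which is the sense in which a single unit interpolates between the conventional pooling operators.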
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
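As a rough illustration of direction (1) in the pooling abstract above, the sketch below combines max and average pooling over a single pooling region. The abstract does not spell out the two combination strategies, so both forms here are assumptions: a fixed convex mix with a hypothetical scalar `alpha`, and a gated variant in which `alpha` is the sigmoid of a linear function of the region with hypothetical weights `gate_weights`. In a real network these parameters would be learned by gradient descent; here they are simply passed in.

```python
import math


def mixed_pool(region, alpha):
    """Convex combination of max and average pooling over one region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. `alpha` is an illustrative parameter, not a name from the source.
    """
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg


def gated_pool(region, gate_weights):
    """Mixed pooling whose mixing weight depends on the region's contents.

    Assumption: the gate is a sigmoid of a dot product between the region
    and `gate_weights` (a hypothetical learned vector of the same length).
    """
    score = sum(w * x for w, x in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, alpha)


region = [1.0, 2.0, 3.0, 6.0]
as_max = mixed_pool(region, 1.0)      # behaves like max pooling
as_avg = mixed_pool(region, 0.0)      # behaves like average pooling
half = mixed_pool(region, 0.5)        # halfway between the two
gated = gated_pool(region, [0.0] * 4) # zero gate weights give alpha = 0.5
```

With all gate weights at zero the gated variant reduces to an even mix, which makes it a convenient neutral starting point in this sketch.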
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
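The stochastic pooling procedure in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming non-negative activations (for example after a rectifier), non-overlapping square regions, and selection probabilities proportional to each activation; the function names and the handling of all-zero regions are choices made for this sketch, not details taken from the source.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a region, with probability proportional
    to its value (a multinomial over the normalized activities).

    Assumes every value in `region` is non-negative.
    """
    if sum(region) == 0:
        return 0.0  # all-zero region: nothing to sample, return zero
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_2d(fmap, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions
    of a 2-D feature map given as a list of rows."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(stochastic_pool(region, rng))
        pooled.append(out_row)
    return pooled


rng = random.Random(0)  # seeded only to make the sketch repeatable
fmap = [[1.0, 0.0, 2.0, 0.0],
        [0.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 5.0],
        [0.0, 0.0, 0.0, 0.0]]
pooled = stochastic_pool_2d(fmap, 2, rng)
```

In the top-left region the output is 1.0 or 3.0 (chosen with probability 1/4 and 3/4 respectively), while regions with a single non-zero activation always return that activation.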
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data.
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
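As a rough illustration of the normalized $L_p$ norm described in the abstract above, the following minimal Python sketch computes it for a small group of already-projected inputs. The function name lp_unit and the example values are hypothetical and are not taken from the paper; the learned projections and the learning of the order $p$ are left out. It only shows how one formula gives average pooling of absolute values at p = 1, root-mean-square pooling at p = 2, and approaches max pooling as p grows.

```python
import math


def lp_unit(projections, p):
    """Normalized L_p norm of a group of projected inputs.

    Computes (mean(|z_i| ** p)) ** (1 / p). This is an illustrative
    sketch: `projections` stands in for the signals a unit receives
    from the layer below, and `p` is a fixed order (p >= 1).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not projections:
        raise ValueError("projections must be non-empty")
    magnitudes = [abs(z) for z in projections]
    largest = max(magnitudes)
    if largest == 0.0:
        return 0.0
    # Factor out the largest magnitude so that large p does not overflow.
    mean_power = sum((m / largest) ** p for m in magnitudes) / len(magnitudes)
    return largest * mean_power ** (1.0 / p)


z = [3.0, -4.0]
average_pool = lp_unit(z, 1)    # mean of absolute values
rms_pool = lp_unit(z, 2)        # root-mean-square
near_max_pool = lp_unit(z, 200) # close to max(|z_i|) for large p
```

With the example group above, the three calls correspond to the average, root-mean-square and (approximately) max pooling operators that the abstract lists as special cases.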
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
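As a rough illustration of the stochastic rounding mentioned in the abstract above, the following sketch rounds real values onto a signed fixed-point grid so that the rounding is unbiased in expectation. The function name, the default of 8 fractional bits, and the saturation behaviour are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid with stochastic rounding.

    Hypothetical helper: frac_bits and word_bits are illustrative choices.
    A value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the expected result equals the input
    (for inputs inside the representable range).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    scale = 2.0 ** frac_bits            # grid spacing is 1 / scale
    scaled = x * scale
    lower = np.floor(scaled)            # nearest grid point at or below x
    prob_up = scaled - lower            # in [0, 1): chance of rounding up
    rounded = lower + (rng.random(x.shape) < prob_up)
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    rounded = np.clip(rounded, min_int, max_int)
    return rounded / scale

# Usage: averaging many stochastic roundings of the same value recovers it
# approximately, which round-to-nearest does not do for off-grid values.
samples = stochastic_round_fixed_point(np.full(100_000, 0.3),
                                       rng=np.random.default_rng(0))
mean_of_samples = samples.mean()
```

Round-to-nearest would map every copy of 0.3 to the same grid point, so small gradient updates below half the grid spacing would be lost entirely; the stochastic variant keeps them on average.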
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
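One way to read "combining of max and average pooling" in the preceding abstract is a convex blend of the two. The sketch below is an illustration under that assumption only: the mixing weight `alpha` is passed in as a fixed number, whereas the abstract describes learning the pooling function, and the names `mixed_pool` and `mixed_pool_1d` are hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives max pooling, alpha = 0.0 gives average pooling.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    values = list(region)
    max_val = max(values)
    avg_val = sum(values) / len(values)
    return alpha * max_val + (1.0 - alpha) * avg_val


def mixed_pool_1d(signal, width, alpha):
    """Apply mixed pooling to non-overlapping windows of a 1-D signal."""
    return [mixed_pool(signal[i:i + width], alpha)
            for i in range(0, len(signal) - width + 1, width)]
```

Because the blend is differentiable in `alpha`, the weight could in principle be trained by gradient descent alongside the other network parameters.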
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
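The stochastic pooling procedure in the preceding abstract, which picks one activation per pooling region from a multinomial distribution given by the activities, can be sketched as follows. This is a minimal illustration that assumes non-negative activations (for example, after a rectifier); the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. from a multinomial distribution over the region's activities.
    Assumes non-negative activations (for example, after a ReLU).
    """
    values = list(region)
    if any(v < 0 for v in values):
        raise ValueError("activations must be non-negative")
    total = sum(values)
    if total == 0:
        return 0.0                   # all-zero region: nothing to sample
    return rng.choices(values, weights=values, k=1)[0]
```

Larger activations are selected more often, but smaller ones still get through occasionally, which is the source of the regularizing noise.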
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. 
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
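As a small numerical illustration of the Fisher-information argument in the abstract above, the sketch below computes the Fisher information of a single Bernoulli parameter and the resulting Cramér-Rao lower bound on the variance of an unbiased estimate from n independent samples. The function names are hypothetical, and the Bernoulli case is chosen here only because it is the simplest binary distribution; this is not the paper's CIF procedure.

```python
def bernoulli_fisher_information(p):
    """Fisher information of one Bernoulli(p) observation: 1 / (p * (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return 1.0 / (p * (1.0 - p))


def cramer_rao_bound(p, n):
    """Lower bound on the variance of any unbiased estimate of p from n i.i.d. samples."""
    if n <= 0:
        raise ValueError("n must be a positive sample count")
    # Fisher information adds over independent samples, so the bound is 1 / (n * I(p)).
    return 1.0 / (n * bernoulli_fisher_information(p))
```

Parameters with large Fisher information have a small variance bound, which is the sense in which the abstract calls their estimates "confident".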
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
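To make the phrase "a simple dropout based GD method for convex ERMs" concrete, here is a minimal sketch for logistic regression in which input features are dropped at random on every gradient step. The function names, the inverted-dropout rescaling, and the hyperparameter defaults are assumptions made for illustration; the abstract does not specify them, and this is not the authors' algorithm or their private variant.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Mean logistic loss for labels y in {0, 1}, computed in a numerically stable way."""
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def dropout_gd(X, y, keep_prob=0.8, lr=0.1, steps=200, seed=0):
    """Gradient descent on the logistic loss with a fresh feature-dropout mask per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Keep each feature entry with probability keep_prob (assumed mask scheme).
        mask = rng.random((n, d)) < keep_prob
        # Inverted dropout: rescale so the masked features match X in expectation.
        X_drop = X * mask / keep_prob
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))
        grad = X_drop.T @ (p - y) / n
        w -= lr * grad
    return w
```

Setting `keep_prob=1.0` recovers plain gradient descent, which gives a simple baseline for comparing the two on the same data.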
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
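The layer described in the last abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched as follows. The function name, the use of Euclidean distance, and the one-hot encoding of the nearest center are assumptions for illustration, since the abstract does not state how a clustering's output is represented.

```python
import numpy as np


def bootstrap_layer(X, num_clusterings=10, k=5, seed=0):
    """Encode each row of X by its nearest center in several random k-centers clusterings."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    codes = []
    for _ in range(num_clusterings):
        # Centers are k data points sampled at random, as the abstract describes.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Assumed output format: one-hot indicator of the nearest center.
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), nearest] = 1.0
        codes.append(one_hot)
    # Concatenate the independent clusterings into one sparse representation.
    return np.concatenate(codes, axis=1)
```

Stacking several such layers, each fed the output of the previous one, would give a multilayer version under the same assumptions.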
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
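A minimal sketch of the variant the abstract above reports as helpful, batch normalization applied only to the input-to-hidden term of a vanilla RNN step, might look like this. The function and parameter names, the tanh nonlinearity, and the use of per-time-step mini-batch statistics at training time are assumptions; the abstract does not give these details.

```python
import numpy as np


def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta


def rnn_step_bn_input(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One RNN step with batch normalization on the input-to-hidden term only."""
    # Normalize W_x x_t across the batch; the recurrent term is left untouched.
    input_term = batch_norm(x_t @ W_x, gamma, beta)
    return np.tanh(input_term + h_prev @ W_h + b)
```

Leaving `h_prev @ W_h` unnormalized mirrors the abstract's finding that normalizing the hidden-to-hidden transition did not help.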
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
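The idea of geometrically increasing the amount of data used per iteration, mentioned in the abstract above, can be illustrated with a simple schedule generator. The function name, the starting size, and the growth factor are hypothetical; the abstract does not state the actual schedule, so this is only a sketch of what "geometric" growth means here.

```python
def geometric_sample_sizes(n_total, initial=1000, growth=2.0, max_iters=20):
    """Return per-iteration sample sizes that grow geometrically and cap at n_total."""
    if n_total <= 0 or initial <= 0 or growth <= 1.0:
        raise ValueError("need n_total > 0, initial > 0 and growth > 1")
    sizes = []
    size = float(initial)
    for _ in range(max_iters):
        # Never request more examples than the dataset contains.
        sizes.append(min(n_total, int(size)))
        if sizes[-1] == n_total:
            break  # the full dataset is in use, so the schedule stops growing
        size *= growth
    return sizes
```

For example, `geometric_sample_sizes(10000, initial=1000, growth=2.0)` yields `[1000, 2000, 4000, 8000, 10000]`, so early iterations are cheap and only the last ones touch all the data.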
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
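The layer described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name bootstrap_layer, the squared Euclidean distance, and the one-hot encoding of the nearest center are all assumptions made for the example.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings=10, k=5, seed=0):
    """One hypothetical bootstrap layer: several independent k-centers clusterings.

    Assumptions (not stated in the abstract): squared Euclidean distance and
    a one-hot code for the nearest center of each clustering.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are k data points sampled at random, without replacement.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared distance from every point to every center: shape (n, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), nearest] = 1.0
        codes.append(one_hot)
    # Concatenate the codes of all clusterings into one sparse representation.
    return np.concatenate(codes, axis=1)

X = np.random.default_rng(1).normal(size=(100, 8))
H = bootstrap_layer(X, n_clusterings=10, k=5)
```

Under these assumptions, each row of H contains exactly one active unit per clustering, and stacking such layers would feed H into the next layer in place of X.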
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
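As a rough illustration of the input-to-hidden variant mentioned in the abstract above, the sketch below applies batch normalization only to the input projection of a plain tanh RNN step and leaves the hidden-to-hidden term untouched. It is a minimal sketch under stated assumptions, not the paper's code: the names batch_norm and rnn_step are hypothetical, statistics are computed over the mini-batch at a single time step, and bias terms are folded into beta.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    # Standardize each feature using mini-batch statistics (training mode only).
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, gamma, beta):
    # Normalize only the input-to-hidden term; the recurrent term is left as-is.
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h)

rng = np.random.default_rng(0)
batch, n_in, n_hid = 32, 4, 6           # hypothetical sizes for the example
x_t = rng.normal(size=(batch, n_in))    # inputs at one time step
h_prev = np.zeros((batch, n_hid))       # initial hidden state
W_x = rng.normal(size=(n_in, n_hid))
W_h = 0.1 * rng.normal(size=(n_hid, n_hid))
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

z = batch_norm(x_t @ W_x, gamma, beta)  # normalized input projection
h_t = rnn_step(x_t, h_prev, W_x, W_h, gamma, beta)
```

At test time an implementation would typically replace the mini-batch statistics with running averages; that part is omitted here.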
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
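The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with `frac_bits` fractional bits inside a `word_bits`-wide word and saturation at the range limits; the function name, parameter names and the saturation choice are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    The grid spacing is 2**-frac_bits. x is rounded up with probability equal
    to its fractional distance from the lower grid point, so the rounded value
    equals x in expectation (as long as x lies inside the representable range).
    Hypothetical helper: names and the saturation behaviour are assumptions.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional part of the scaled value.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale
```

With `frac_bits=2` the grid spacing is 0.25, so 0.3 is rounded to either 0.25 or 0.5, and the average over many draws approaches 0.3. Round-to-nearest would always return 0.25, which is how small updates can be lost systematically at low precision.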
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
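The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not spell out the scheme, so the following assumes the common definition: a value is rounded up to the next representable fixed-point step with probability equal to its fractional remainder, and down otherwise, so that the result is unbiased in expectation. The function name, the 8-bit fractional part, and the saturation behaviour are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value, rounding up or down at random.

    frac_bits and word_bits are hypothetical defaults chosen for illustration.
    """
    eps = 2.0 ** -frac_bits          # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder, so the
    # result equals x in expectation (as long as no saturation occurs).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed integer that is word_bits wide.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps


if __name__ == "__main__":
    rng = random.Random(0)
    # With 2 fractional bits the grid step is 0.25, so 0.3 becomes 0.25 or 0.5.
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    print(sum(samples) / len(samples))  # the average stays close to 0.3
```

Round-to-nearest would map 0.3 to 0.25 every time and lose the residual 0.05 on each update; the randomized version keeps that information on average, which is one plausible reading of why the rounding scheme matters during training.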
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
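The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. The abstract does not give the exact form of the combination, so the convex mix below, the function name `mixed_pool`, and the parameter name `alpha` are assumptions made for illustration only. In a trained network the mixing proportion would be learned; here it is passed in by hand.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    `alpha` is a hypothetical mixing proportion in [0, 1]:
    alpha = 1 gives plain max pooling, alpha = 0 gives plain average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

Because the output is differentiable in `alpha`, a proportion of this kind could in principle be updated by gradient descent alongside the other weights.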
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
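The stochastic pooling procedure described in the last abstract above can be sketched for a single pooling region as follows. This is only an illustration under stated assumptions: activations are taken to be non-negative (for example after a rectifier), the probability of picking an activation is its value divided by the sum over the region, and the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation a_i is chosen with probability a_i / sum_k a_k, i.e. a
    multinomial distribution given by the activities within the region.
    Assumes non-negative activations; an all-zero region returns 0.0.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0
    probs = [a / total for a in region]
    # random.choices draws one element according to the given weights.
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    random.seed(0)
    region = [0.5, 1.5, 3.0, 0.0]
    print([stochastic_pool(region) for _ in range(5)])
```

Larger activations are picked more often, but smaller non-zero activations still get selected occasionally, which is the source of the regularizing noise.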
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.
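The idea of combining max and average pooling, mentioned in the pooling abstract above, can be sketched for a single pooling region. This is a minimal illustration, not the authors' implementation: the abstract describes learning the combination, whereas here the mixing proportion `mix` is a hypothetical fixed argument, and the function name is made up for the example.

```python
def mixed_pool_region(activations, mix):
    """Blend max and average pooling over one pooling region.

    `mix` is a hypothetical mixing proportion in [0, 1]:
    1.0 gives plain max pooling, 0.0 gives plain average pooling.
    In a trained network this value would be learned, not fixed.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_value = max(activations)
    avg_value = sum(activations) / len(activations)
    return mix * max_value + (1.0 - mix) * avg_value


region = [1.0, 2.0, 3.0, 6.0]
for mix in (0.0, 0.5, 1.0):
    print(mix, mixed_pool_region(region, mix))
```

Because the output is a convex combination, it always lies between the average and the maximum of the region, so the two conventional pooling operations are recovered as the end points.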
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
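The stochastic pooling procedure described in the last abstract above can be sketched in a few lines of plain Python. This is a rough training-time illustration under stated assumptions, not the authors' code: activations are assumed non-negative (for example, ReLU outputs), pooling regions are assumed non-overlapping, an all-zero region is assumed to pool to zero, and the function names are hypothetical.

```python
import random


def stochastic_pool_region(activations, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. a multinomial distribution given by the activities in the region.
    """
    total = sum(activations)
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_2d(feature_map, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(stochastic_pool_region(region, rng))
        pooled.append(pooled_row)
    return pooled


rng = random.Random(0)  # seeded only to make the sketch repeatable
feature_map = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 5.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
print(stochastic_pool_2d(feature_map, 2, rng))
```

Larger activations are chosen more often, so the operation behaves like a randomized version of max pooling and needs no extra hyper-parameter beyond the region size.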
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data.
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of
Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
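The normalized $L_p$ norm described in the abstract above can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the details here are assumptions: the function name `lp_unit`, the per-projection offsets `centers`, the use of absolute values, and the normalization by the number of projections are all hypothetical choices made for illustration, not the authors' definition. Under these assumptions, p = 1 gives the average of the absolute projections, p = 2 gives the root-mean-square, and p = infinity gives the maximum, matching the pooling operators the abstract lists.

```python
import math


def lp_unit(x, weights, centers, p):
    """Normalized L_p norm over N linear projections of the input x.

    Illustrative assumption: the unit computes
        (1/N * sum_i |w_i . x - c_i|^p) ** (1/p)
    weights: list of N weight vectors, centers: list of N offsets, p >= 1.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    # Project the input onto each weight vector and subtract its offset.
    projections = [
        sum(w_k * x_k for w_k, x_k in zip(w, x)) - c
        for w, c in zip(weights, centers)
    ]
    magnitudes = [abs(z) for z in projections]
    if math.isinf(p):
        # Limit p -> infinity: the normalized norm reduces to max pooling.
        return max(magnitudes)
    mean_power = sum(m ** p for m in magnitudes) / len(magnitudes)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    x = [3, 4]
    identity = [[1, 0], [0, 1]]  # two projections that pick out each input
    for order in (1, 2, math.inf):
        print(order, lp_unit(x, identity, [0, 0], order))
```

In a trainable model the order p would be a learned parameter per unit, as the abstract argues; this sketch takes it as a fixed argument.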
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
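The stochastic rounding mentioned in the preceding abstract can be illustrated with a small sketch. It assumes the common definition, in which a value is rounded up to the next fixed-point grid value with probability equal to its fractional distance from the lower grid value. The function names and the `frac_bits` parameter are hypothetical, and saturation to a 16-bit word is left out for brevity.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Assumption: round up with probability equal to the fractional distance
    from the lower grid point, so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale
```

With `frac_bits=2`, the value 0.3 is rounded to either 0.25 or 0.5, and the average over many draws approaches 0.3, whereas round-to-nearest always returns 0.25 and so loses the small residual every time.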
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
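One way to picture the first direction in the preceding abstract, combining max and average pooling, is a convex combination of the two over a single pooling region. This is only a sketch under that assumption: the abstract does not spell out the two strategies, the function name is hypothetical, and the mixing proportion `mix`, which would be learned in practice, is passed in as a fixed number here.

```python
def mixed_pool_region(region, mix):
    """Blend max and average pooling over one pooling region.

    Assumption: `mix` in [0, 1] is the weight on max pooling and
    (1 - mix) the weight on average pooling.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return mix * max_value + (1.0 - mix) * avg_value
```

Setting `mix` to 1.0 or 0.0 recovers plain max or average pooling, so the conventional operations are special cases of the blended one.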
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
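The stochastic pooling procedure described in the preceding abstract can be sketched as follows. The sketch assumes non-negative activations (for example, after a rectifier), normalizes them within each region into multinomial probabilities, and samples one activation accordingly. The function names and the handling of an all-zero region are assumptions made for illustration.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation with probability proportional to its value.

    Assumption: activations are non-negative; an all-zero region yields 0.0.
    """
    total = sum(activations)
    if total <= 0:
        return 0.0
    threshold = rng.random() * total
    cumulative = 0.0
    for value in activations:
        cumulative += value
        if threshold < cumulative:
            return value
    return activations[-1]  # guard against floating-point round-off


def stochastic_pool_2d(feature_map, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for i in range(rows):
        pooled_row = []
        for j in range(cols):
            region = [
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            ]
            pooled_row.append(stochastic_pool_region(region, rng))
        pooled.append(pooled_row)
    return pooled
```

For a region `[0.0, 1.0, 3.0]`, the value 3.0 is drawn about three quarters of the time and 1.0 about one quarter, so strong activations dominate without the smaller ones being discarded entirely, as they would be under max pooling.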
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
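The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name lp_unit and the sample values are hypothetical, the unit is taken to be (mean of |x_i|^p)^(1/p) over raw inputs, and the projections and the learning of the order $p$ described in the abstract are omitted.

```python
import math


def lp_unit(inputs, p):
    """Normalized L_p norm of a group of inputs: (mean(|x_i|^p))^(1/p).

    Hypothetical helper for illustration only; assumes p >= 1.
    p = math.inf is treated as the limiting case, max(|x_i|).
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if math.isinf(p):
        return max(abs(x) for x in inputs)
    mean_power = sum(abs(x) ** p for x in inputs) / len(inputs)
    return mean_power ** (1.0 / p)


# Hypothetical input signals for a single unit.
signals = [3.0, -4.0, 0.0, 1.0]

average_like = lp_unit(signals, 1)      # mean of absolute values
rms_like = lp_unit(signals, 2)          # root-mean-square
large_p = lp_unit(signals, 64)          # moves toward the largest |x_i|
max_like = lp_unit(signals, math.inf)   # limiting case: max of |x_i|
```

With p = 1 the sketch reduces to average pooling of absolute values, with p = 2 to root-mean-square pooling, and as p grows the output approaches max pooling over absolute values, which is the sense in which the order controls the shape of the unit's response.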
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
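The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common definition in which a value is rounded up to the next fixed-point grid value with probability equal to its fractional remainder, so that the rounding is unbiased in expectation. The function names, the 16-bit word length split into 8 fractional bits, and the saturation behaviour are illustrative assumptions, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid and rounded down otherwise, so the
    expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    frac = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < frac else 0)
    return rounded / scale


def to_fixed_point(x, total_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed fixed-point range.

    total_bits and frac_bits are hypothetical defaults chosen for this sketch.
    """
    scale = 1 << frac_bits
    max_val = ((1 << (total_bits - 1)) - 1) / scale
    min_val = -(1 << (total_bits - 1)) / scale
    return min(max_val, max(min_val, stochastic_round(x, frac_bits, rng)))
```

Averaging many stochastic roundings of the same value recovers that value, which is why small gradient updates are not systematically lost the way they are with round-to-nearest.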
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
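The dropout-based gradient descent for convex ERMs described in the abstract above can be illustrated with a minimal sketch. It assumes logistic regression as the generalized linear model and inverted dropout applied to input features; the function names, `keep_prob`, the learning rate, and the synthetic data are all hypothetical choices for illustration, not the authors' algorithm or settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_gd_logistic(X, y, keep_prob=0.8, lr=0.5, steps=500, seed=0):
    """Full-batch gradient descent on the logistic loss with feature dropout.

    Assumption: each step samples a fresh Bernoulli mask over the input
    features and rescales by 1/keep_prob (inverted dropout).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        mask = rng.random((n, d)) < keep_prob      # keep each feature w.p. keep_prob
        X_drop = X * mask / keep_prob              # rescale so the expectation equals X
        grad = X_drop.T @ (sigmoid(X_drop @ w) - y) / n
        w -= lr * grad
    return w

# Synthetic, linearly separable data (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = (X @ w_true > 0).astype(float)

w = dropout_gd_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
```

Because the mask is resampled at every step, the noise acts on the loss much like a data-dependent regularizer, which is the stabilizing role the abstract attributes to dropout in the convex case.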
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
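The layer structure described in the multilayer bootstrap network abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched as follows. This is a sketch under stated assumptions: the abstract does not specify the distance measure or the output encoding, so squared Euclidean distance and one-hot nearest-center codes are assumed here, and `bootstrap_layer` and its parameters are hypothetical names.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings, k, rng):
    """One layer: several independent clusterings with randomly sampled centers.

    Assumptions: squared Euclidean distance, and each clustering emits a
    one-hot code of the nearest center; the codes are concatenated.
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are k data points drawn at random, with no optimization.
        centers = X[rng.choice(n, size=k, replace=False)]
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dist2.argmin(axis=1)
        codes.append(np.eye(k)[nearest])           # one-hot code per clustering
    return np.concatenate(codes, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                        # toy data (hypothetical)
H1 = bootstrap_layer(X, n_clusterings=4, k=5, rng=rng)
H2 = bootstrap_layer(H1, n_clusterings=4, k=3, rng=rng)  # stack a second layer
```

Stacking layers with a shrinking `k`, as in the last line, is only one plausible schedule; the abstract does not state how the layer sizes are chosen.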
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
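The variant discussed in the abstract above, batch normalization applied to the input-to-hidden transition only, can be sketched for a plain tanh RNN. This is an illustrative sketch, not the authors' implementation: the cell type, the per-step mini-batch statistics, and all names (`batch_norm`, `rnn_step`, `gamma`, `beta`) are assumptions.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One tanh RNN step with batch norm on the input-to-hidden term only."""
    input_term = batch_norm(x_t @ W_x, gamma, beta)  # normalized
    recurrent_term = h_prev @ W_h                    # hidden-to-hidden left as-is
    return np.tanh(input_term + recurrent_term + b)

rng = np.random.default_rng(0)
batch, n_in, n_hid, steps = 8, 5, 4, 3
W_x = rng.normal(size=(n_in, n_hid))
W_h = rng.normal(size=(n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for _ in range(steps):
    x_t = rng.normal(size=(batch, n_in))             # toy input at this time step
    h = rnn_step(x_t, h, W_x, W_h, b, gamma, beta)
```

Only the forward pass is shown; at test time a real system would typically replace the mini-batch statistics with running averages, which is omitted here.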
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
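The stochastic rounding mentioned in the fixed-point abstract above can be illustrated with a short sketch. It assumes a signed fixed-point format in which `frac_bits` of the `word_bits` are fractional; the split of 8 fractional bits out of 16, and the helper names `stochastic_round` and `to_fixed`, are hypothetical choices for illustration rather than details from the paper.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower                      # in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale

def to_fixed(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed word_bits range."""
    scale = 1 << frac_bits
    q = round(stochastic_round(x, frac_bits, rng) * scale)
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, q)) / scale

# Usage: with 2 fractional bits, 0.3 lands on 0.25 or 0.5.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)       # close to 0.3 on average
```

With the same 2 fractional bits, round-to-nearest would map 0.3 to 0.25 every time, so small updates would be lost systematically; stochastic rounding keeps them in expectation.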
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
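The idea of combining max and average pooling described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the combination, so the convex mix below, the function name `mixed_pool`, and the mixing weight `alpha` are assumptions made only for illustration; in a trained network `alpha` would be a learned parameter rather than a constant.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    `alpha` is a hypothetical mixing weight in [0, 1]:
    alpha = 1 gives plain max pooling, alpha = 0 gives plain average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

Keeping the blend a single scalar per pooling layer adds only one parameter, which is consistent with the abstract's remark about a very modest increase in the number of model parameters.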
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
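The stochastic pooling procedure described above can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example after a rectifier) and a one-dimensional pooling region; the function name `stochastic_pool` and the zero fallback for an all-zero region are assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. from a multinomial distribution given by the activities in the
    region. Activations are assumed to be non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    region = [0.5, 1.5, 0.0, 2.0]
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Because larger activations are selected more often, the procedure stays close to max pooling while still letting smaller activations through, which is where the regularizing effect comes from.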
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
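To make the phrase "randomly dropping a few nodes of a one-hidden layer neural network" concrete, here is a minimal sketch in NumPy. It is not the authors' code: the function name `forward_with_dropout`, the ReLU activation, the `keep_prob` parameter and the "inverted dropout" rescaling by `1 / keep_prob` are all assumptions made for illustration.

```python
import numpy as np

def forward_with_dropout(x, W1, W2, keep_prob=0.8, rng=None, train=True):
    """Forward pass of a one-hidden-layer network with dropout on hidden units.

    Hypothetical helper: ReLU and inverted-dropout scaling are illustrative
    choices, not details taken from the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = np.maximum(0.0, x @ W1)          # hidden activations (ReLU)
    if train:
        # Keep each hidden unit independently with probability keep_prob.
        mask = rng.random(h.shape) < keep_prob
        # Rescale so the expected activation matches the no-dropout case.
        h = h * mask / keep_prob
    return h @ W2

# Usage: a batch of 4 inputs, 3 features, 5 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
y_train = forward_with_dropout(x, W1, W2, keep_prob=0.8, rng=rng)
y_eval = forward_with_dropout(x, W1, W2, train=False)
```

At evaluation time (`train=False`) no units are dropped, so the output is deterministic.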
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
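As an illustration of "batch normalization applied to the input-to-hidden transitions", the following NumPy sketch normalizes only the input projection of a vanilla RNN step and leaves the hidden-to-hidden term untouched. This is one plausible reading of the abstract, not the authors' implementation; the names `batch_norm` and `rnn_step_bn_input`, the tanh nonlinearity and the parameter shapes are assumptions.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature using mini-batch statistics, then scale and shift."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_step_bn_input(x_t, h_prev, W_x, W_h, gamma, beta):
    """One RNN step: h_t = tanh(BN(x_t W_x) + h_{t-1} W_h).

    Only the input-to-hidden term is normalized (hypothetical sketch).
    """
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h)

# Usage: batch of 8 sequences, 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
x_t = rng.normal(size=(8, 3))
h_prev = np.zeros((8, 4))
W_x = rng.normal(size=(3, 4))
W_h = rng.normal(size=(4, 4))
h_t = rnn_step_bn_input(x_t, h_prev, W_x, W_h, np.ones(4), np.zeros(4))
```

The sketch uses the statistics of the current mini-batch only; how statistics are handled across time steps and at test time is left out.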
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence of acoustic signals, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
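Stochastic rounding to a fixed-point grid can be sketched in a few lines of NumPy. In this sketch a value is rounded up with probability equal to its fractional distance above the lower grid point, so the rounding is unbiased in expectation. The function name `stochastic_round_fixed_point`, the default of 8 fractional bits within a 16-bit signed word, and the saturation behaviour are assumptions for illustration, not details taken from the abstract.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point with stochastic rounding (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the representable range of a signed word_bits integer.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale

# Usage: with 2 fractional bits the grid step is 0.25, so 0.3 becomes
# either 0.25 or 0.5, and the average over many draws stays close to 0.3.
rng = np.random.default_rng(0)
samples = stochastic_round_fixed_point(np.full(100_000, 0.3), frac_bits=2, rng=rng)
print(np.unique(samples), samples.mean())
```

Round-to-nearest would map every 0.3 to 0.25 and lose the remainder each time; the stochastic variant preserves it on average, which is the property that matters when many small updates are accumulated.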
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
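The idea of crafting adversarial samples from the input-to-output mapping can be sketched on a toy model. The code below is a simplification under stated assumptions and is not the algorithm of the abstract above: it assumes a linear classifier with logits `W @ x + b` (so the Jacobian of the logits with respect to the input is simply `W`), features bounded above by `high`, and a hypothetical helper name `greedy_targeted_perturbation`. At each step it raises the single feature that most favours the target class over the current prediction.

```python
import numpy as np

def greedy_targeted_perturbation(x, W, b, target, max_changes=None, high=1.0):
    """Greedily raise input features until a linear classifier predicts `target`.

    Assumption: logits = W @ x + b, so d(logits)/dx = W.
    Returns the perturbed input and the list of modified feature indices.
    """
    x_adv = np.array(x, dtype=float)
    n = x_adv.size
    budget = n if max_changes is None else max_changes
    touched = np.zeros(n, dtype=bool)
    changed = []
    for _ in range(budget):
        pred = int(np.argmax(W @ x_adv + b))
        if pred == target:
            break
        # Gain in (target logit - predicted logit) from setting each feature to `high`.
        saliency = (W[target] - W[pred]) * (high - x_adv)
        saliency[touched] = -np.inf          # never modify a feature twice
        i = int(np.argmax(saliency))
        if saliency[i] <= 0:
            break                            # no remaining feature helps
        x_adv[i] = high
        touched[i] = True
        changed.append(i)
    return x_adv, changed

W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.zeros(2)
x_adv, changed = greedy_targeted_perturbation([1.0, 0.0, 0.0], W, b, target=1)
print(changed, int(np.argmax(W @ x_adv + b)))
```

The caller should re-check the prediction on `x_adv`, since the loop can stop without reaching the target when the budget runs out or no feature helps.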
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
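One of the directions named in the abstract above, combining max and average pooling, can be sketched as follows. This is a minimal illustration under assumptions: the mixing proportion `alpha` is a fixed scalar here (the abstract describes learning the combination, which this sketch does not do), pooling regions are non-overlapping 2x2 blocks, and the function name `mixed_pool2x2` is hypothetical.

```python
import numpy as np

def mixed_pool2x2(feature_map, alpha):
    """Blend max and average pooling over non-overlapping 2x2 regions.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 plain average pooling.
    """
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # blocks[i, a, j, b] == feature_map[2 * i + a, 2 * j + b]
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    max_pooled = blocks.max(axis=(1, 3))
    avg_pooled = blocks.mean(axis=(1, 3))
    return alpha * max_pooled + (1.0 - alpha) * avg_pooled

fm = np.arange(16, dtype=float).reshape(4, 4)
print(mixed_pool2x2(fm, alpha=0.5))
# [[ 3.75  5.75]
#  [11.75 13.75]]
```

In a trainable setting `alpha` would be a parameter updated by gradient descent, which is straightforward because the output is linear in `alpha`.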
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
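The stochastic pooling procedure described in the last abstract above can be sketched in a few lines. The sketch assumes non-negative activations (for example, outputs of a rectifier), non-overlapping 2x2 regions, and a training-time forward pass only; the function name `stochastic_pool2x2` and the choice to output 0 for an all-zero region are assumptions made for illustration.

```python
import numpy as np

def stochastic_pool2x2(feature_map, rng):
    """Sample one activation per 2x2 region with probability proportional to its value.

    Assumes non-negative activations; an all-zero region yields 0.
    """
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # regions[i, j] holds the four activations of pooling region (i, j).
    regions = (feature_map.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    out = np.zeros((h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            region = regions[i, j]
            total = region.sum()
            if total > 0:
                # Multinomial draw: p_k = a_k / sum of activations in the region.
                out[i, j] = rng.choice(region, p=region / total)
    return out

rng = np.random.default_rng(0)
fm = np.array([[1.0, 0.0, 0.0, 3.0],
               [0.0, 2.0, 0.0, 0.0]])
print(stochastic_pool2x2(fm, rng))
```

In the example, the left region returns 1.0 or 2.0 at random (with probabilities 1/3 and 2/3), while the right region always returns 3.0 because it has a single non-zero activation.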
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature, it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased.
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
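As an illustration of the multilayer bootstrap network described above, the following is a minimal sketch of a single layer, not the authors' implementation. It assumes that each clustering encodes a point as a one-hot indicator of its nearest center and that the layer output is the concatenation of these codes; the function name `bootstrap_layer` and its parameters are hypothetical.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings=3, k=4, rng=None):
    """One layer: independent k-centers clusterings with randomly sampled centers.

    Assumption (not stated in the abstract): each clustering emits a one-hot
    code for the nearest center, and the codes are concatenated.
    """
    rng = np.random.default_rng() if rng is None else rng
    codes = []
    for _ in range(n_clusterings):
        # Centers are k data points drawn at random, without replacement.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # One-hot indicator of the nearest center for each point.
        codes.append(np.eye(k)[dists.argmin(axis=1)])
    return np.hstack(codes)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
H = bootstrap_layer(X, n_clusterings=3, k=4, rng=rng)
```

Stacking several such layers, each fed the output of the previous one, would give a multilayer variant under the same assumptions.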
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
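To make the distinction between input-to-hidden and hidden-to-hidden normalization concrete, here is a minimal NumPy sketch of one recurrent step in which only the input-to-hidden term is batch-normalized. It assumes a plain tanh RNN with learned scale `gamma` and shift `beta`; the function names and shapes are illustrative choices, not taken from the paper.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then rescale."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, gamma, beta):
    """One tanh RNN step with batch norm on the input-to-hidden term only."""
    input_term = batch_norm(x_t @ W_x, gamma, beta)  # normalized
    recurrent_term = h_prev @ W_h                    # left unnormalized
    return np.tanh(input_term + recurrent_term)

rng = np.random.default_rng(0)
batch, n_in, n_hid = 32, 5, 8
x_t = rng.normal(size=(batch, n_in))
h_prev = np.zeros((batch, n_hid))
W_x = rng.normal(size=(n_in, n_hid))
W_h = rng.normal(size=(n_hid, n_hid))
gamma, beta = np.ones(n_hid), np.zeros(n_hid)
h_t = rnn_step(x_t, h_prev, W_x, W_h, gamma, beta)
normalized = batch_norm(x_t @ W_x, gamma, beta)
```

Statistics here are computed over the batch axis at a single time step; how statistics are shared across time steps is a design choice this sketch leaves open.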
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
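The stochastic rounding mentioned in the low-precision training abstract above can be illustrated with a minimal sketch. It assumes the usual definition: a value is rounded to one of its two neighbouring fixed-point grid values, and the probability of rounding up equals its fractional distance from the lower one, so the result is unbiased on average. The function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and saturation to a 16-bit range is omitted for brevity.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, choosing up or down at random.

    Assumption: the probability of rounding up equals the fractional
    distance of x from the lower grid point, so the expected result is x.
    """
    eps = 2.0 ** -frac_bits          # spacing of the fixed-point grid
    scaled = x / eps
    lower = math.floor(scaled)       # index of the grid point just below x
    p_up = scaled - lower            # in [0, 1); 0 when x is already on the grid
    step = 1 if rng.random() < p_up else 0
    return (lower + step) * eps


if __name__ == "__main__":
    random.seed(0)
    draws = [stochastic_round(0.3, frac_bits=2) for _ in range(10000)]
    # Individual draws are 0.25 or 0.5; their average should be close to 0.3.
    print(sum(draws) / len(draws))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates would be lost systematically; the randomized choice keeps them in expectation.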
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
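As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below blends the two over a single pooling region. It assumes the simplest possible combination, a convex mix with one weight; the name `mixed_pool` and the parameter `a` are hypothetical, and `a` is fixed here, whereas the abstract describes learning the combination.

```python
def mixed_pool(region, a=0.5):
    """Blend max and average pooling over one region.

    Assumption: a single mixing weight `a` in [0, 1];
    a = 1 gives plain max pooling, a = 0 gives plain average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return a * max_part + (1.0 - a) * avg_part


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(region, a))
```

In a trainable layer, `a` would be a parameter updated by gradient descent along with the network weights; that part is not shown.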
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
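The stochastic pooling procedure described in the abstract above can be sketched in a few lines. The sketch assumes non-negative activations (for example, after a rectifier), so that the activities within a region can be normalized into a multinomial distribution; the function names `stochastic_pool` and `stochastic_pool2d`, and the choice to return 0 for an all-zero region, are assumptions for illustration only.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a region with probability proportional to its value.

    Assumption: all activations are non-negative. An all-zero region
    has no valid distribution, so 0.0 is returned in that case.
    """
    total = sum(region)
    if total <= 0:
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool2d(fmap, size=2, rng=random):
    """Apply stochastic_pool to non-overlapping size x size windows of a 2-D list."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row_out = []
        for c in range(0, cols - size + 1, size):
            window = [fmap[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            row_out.append(stochastic_pool(window, rng))
        pooled.append(row_out)
    return pooled


if __name__ == "__main__":
    random.seed(0)
    fmap = [
        [1.0, 0.0, 2.0, 2.0],
        [0.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 5.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    print(stochastic_pool2d(fmap))
```

Larger activations are picked more often, but smaller non-zero ones are still selected occasionally, which is the source of the regularizing noise.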
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
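The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a minimal NumPy sketch. The function name, the `frac_bits` and `word_bits` parameters, and the saturation behaviour are assumptions made for illustration; the abstract does not specify the exact fixed-point format. The idea shown is only that a value is rounded up with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import numpy as np


def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid (hypothetical format).

    frac_bits: number of fractional bits, so the grid step is 2**-frac_bits.
    word_bits: total word length, used only to saturate the result.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # which makes the expected rounded value equal to the input.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the representable signed integer range.
    q_min = -(2 ** (word_bits - 1))
    q_max = 2 ** (word_bits - 1) - 1
    return np.clip(q, q_min, q_max) / scale


rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
# With 2 fractional bits the grid step is 0.25, so 0.3 maps to 0.25 or 0.5.
rounded = stochastic_round_fixed_point(x, frac_bits=2, rng=rng)
print(np.unique(rounded), rounded.mean())
```

Round-to-nearest would map every 0.3 to 0.25 and lose the remainder systematically, whereas the average of the stochastically rounded values stays close to 0.3.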
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
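One simple way to read "combining of max and average pooling" in the abstract above is a convex mix of the two operations. The sketch below assumes that form, with a mixing weight `alpha` that is fixed here but would be learned during training; the exact parametrisation is not stated in the abstract, and the helper names are hypothetical.

```python
import numpy as np


def pool_regions(x, size):
    """Split a 2-D array into non-overlapping size x size regions.

    Returns an array of shape (h // size, w // size, size * size).
    """
    h, w = x.shape
    if h % size or w % size:
        raise ValueError("input height and width must be multiples of size")
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    return blocks.reshape(h // size, w // size, size * size)


def mixed_pool(x, size=2, alpha=0.5):
    """Assumed mixing rule: alpha * max-pool + (1 - alpha) * average-pool."""
    regions = pool_regions(x, size)
    return alpha * regions.max(axis=-1) + (1.0 - alpha) * regions.mean(axis=-1)


x = np.arange(16, dtype=np.float64).reshape(4, 4)
print(mixed_pool(x, alpha=1.0))  # plain max pooling
print(mixed_pool(x, alpha=0.0))  # plain average pooling
print(mixed_pool(x, alpha=0.5))  # halfway between the two
```

Setting `alpha` to 1 or 0 recovers the two conventional operations, so a learned weight can only match or move between them.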
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
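The stochastic pooling procedure described in the abstract above can be sketched as follows. This is a minimal illustration and assumes non-negative activations (for example, after a rectifier) so that they can be normalized into a multinomial distribution. The helper names are hypothetical, and `expected_pool` is a deterministic counterpart added here for illustration, not something the abstract describes.

```python
import random


def pooling_probabilities(region):
    """Normalize non-negative activations into sampling probabilities."""
    if not region:
        raise ValueError("pooling region must not be empty")
    if any(v < 0 for v in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # All activations are zero: fall back to a uniform distribution.
        return [1.0 / len(region)] * len(region)
    return [v / total for v in region]


def stochastic_pool(region, rng):
    """Sample one activation, with probability proportional to its value."""
    probs = pooling_probabilities(region)
    return rng.choices(region, weights=probs, k=1)[0]


def expected_pool(region):
    """Probability-weighted average of the region (illustrative only)."""
    probs = pooling_probabilities(region)
    return sum(p * v for p, v in zip(probs, region))
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible across runs.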
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
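The normalized $L_p$ norm at the core of the unit described above can be sketched in a few lines. This is a minimal illustration that assumes the projections of the lower layer have already been computed and that the order satisfies p >= 1; the name `lp_unit` is hypothetical, and the learned projections and learned per-unit order mentioned in the abstract are left out.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a list of already-projected inputs.

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value, which is how
    the unit generalizes average, RMS and max pooling.
    """
    if not projections:
        raise ValueError("projections must not be empty")
    if p < 1:
        raise ValueError("order p is assumed to be at least 1")
    mean_power = sum(abs(z) ** p for z in projections) / len(projections)
    return mean_power ** (1.0 / p)
```

Evaluating the same inputs at increasing orders shows the output moving from the mean absolute value toward the largest absolute value.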
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
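The layer structure described in the multilayer bootstrap network abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`fit_layer`, `encode`, `sq_dist`), the squared Euclidean distance, and the one-hot nearest-center encoding are all assumptions made for illustration. Only the idea that each clustering's centers are randomly sampled data points comes from the abstract.

```python
import random

def fit_layer(data, n_clusterings, k, rng):
    """Build one layer: each clustering's centers are k randomly sampled data points."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(point, layer):
    """Concatenate one one-hot code per clustering; the nearest center gets the 1.

    The one-hot encoding is an assumption of this sketch, not stated in the abstract.
    """
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code

rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]  # toy data
layer = fit_layer(data, n_clusterings=4, k=5, rng=rng)
codes = [encode(p, layer) for p in data]  # each code has 4 * 5 = 20 entries
```

Stacking would feed `codes` into another `fit_layer` call; how the paper sets the number of clusterings and centers per layer is not given in the abstract, so the values here are placeholders.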
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
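As a rough illustration of the input-to-hidden variant discussed in the batch normalization abstract above, the following sketch standardizes only the input-to-hidden term of a tanh RNN step using mini-batch statistics. It is a simplified assumption-laden example, not the paper's model: the function names, the omission of bias and of the learnable scale and shift parameters, and the toy dimensions are all choices made here for brevity.

```python
import math
import random

def batch_norm(rows, eps=1e-5):
    """Standardize each column of a mini-batch to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
            for row in rows]

def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rnn_step(x_t, h_prev, w_x, w_h):
    """One tanh RNN step; only the input-to-hidden term is batch-normalized."""
    input_term = batch_norm(matmul(x_t, w_x))
    hidden_term = matmul(h_prev, w_h)  # hidden-to-hidden term left unnormalized
    return [[math.tanh(i + h) for i, h in zip(ri, rh)]
            for ri, rh in zip(input_term, hidden_term)]

random.seed(0)
batch, n_in, n_hid = 8, 5, 4  # toy sizes, chosen arbitrarily
w_x = [[random.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_in)]
w_h = [[random.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_hid)]
h = [[0.0] * n_hid for _ in range(batch)]
for _ in range(3):  # unroll three time steps on random inputs
    x_t = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(batch)]
    h = rnn_step(x_t, h, w_x, w_h)
```

Normalizing `hidden_term` as well would correspond to the hidden-to-hidden variant that the abstract reports as unhelpful.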
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
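The stochastic rounding scheme named in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter, and the choice to ignore overflow and saturation of the 16-bit word are all assumptions made here.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation. Overflow handling is intentionally left out.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    prob_up = scaled - lower        # distance to that grid point, in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale
```

Unlike round-to-nearest, which would always map a small update to the same grid point, this rule preserves small values on average, which is one plausible reading of why the rounding scheme matters during training.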
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the multilayer bootstrap network with a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
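As a rough illustration of the input-to-hidden variant described in the abstract above, the following minimal sketch standardizes only the input contribution of a vanilla RNN step across the mini-batch and leaves the hidden-to-hidden term untouched. The function names (`batch_norm`, `matvec_batch`, `rnn_step`), the tanh nonlinearity, the fixed scale and shift, and the toy weights are assumptions made for this example; the abstract does not specify them, so this should not be read as the authors' exact formulation.

```python
import math

def batch_norm(rows, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature across the mini-batch (rows = examples)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]
    return [[gamma * (r[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in range(dims)] for r in rows]

def matvec_batch(rows, W):
    """Multiply a (batch, in) list of rows by an (in, out) weight matrix."""
    return [[sum(r[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for r in rows]

def rnn_step(x_t, h_prev, W_x, W_h):
    """h_t = tanh(BN(x_t W_x) + h_prev W_h); only the input term is normalized."""
    a = batch_norm(matvec_batch(x_t, W_x))   # normalized input-to-hidden term
    b = matvec_batch(h_prev, W_h)            # hidden-to-hidden term, left as-is
    return [[math.tanh(ai + bi) for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy mini-batch of 3 examples with 2 input features and 2 hidden units (hypothetical values).
x_t = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
h_prev = [[0.0, 0.0] for _ in x_t]
W_x = [[0.5, -1.0], [1.0, 0.25]]
W_h = [[0.1, 0.0], [0.0, 0.1]]

normed = batch_norm(matvec_batch(x_t, W_x))
h_t = rnn_step(x_t, h_prev, W_x, W_h)
```

At test time a real implementation would typically use running statistics instead of mini-batch statistics; that detail is omitted here.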
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
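The stochastic rounding idea in the abstract above can be sketched in a few lines. This is a minimal sketch, assuming a fixed-point grid with spacing eps = 2^-frac_bits and ignoring saturation at the ends of the representable range; the function name `stochastic_round`, the `frac_bits` parameter, and the sample values are hypothetical and are not taken from the paper. A value is rounded up with probability equal to its fractional distance above the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of eps = 2**-frac_bits.

    The value is rounded up with probability (x - floor_grid(x)) / eps and
    down otherwise, so the expected result equals x (no saturation handling).
    """
    eps = 2.0 ** -frac_bits
    scaled = x / eps
    lower = math.floor(scaled)
    p_up = scaled - lower            # fractional part, in [0, 1)
    step = 1 if rng.random() < p_up else 0
    return (lower + step) * eps

# With 4 fractional bits the grid spacing is 1/16, so 0.3 lies between 0.25 and 0.3125.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=4, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would always map 0.3 to 0.3125 on this grid, whereas the stochastic scheme returns 0.25 part of the time so that the average stays close to 0.3.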
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
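The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a minimal sketch. This is only one plausible reading and not the authors' implementation: the mixing weight `alpha` is fixed here purely for illustration, whereas the abstract describes the combination as learned, and the names `mixed_pool` and `pool_1d` are hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha is an assumed mixing weight: 1.0 gives pure max pooling,
    0.0 gives pure average pooling. In a trained network it would be
    a learned parameter rather than a constant.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


def pool_1d(signal, size, alpha):
    """Apply mixed_pool to non-overlapping windows of a 1-D signal."""
    return [mixed_pool(signal[i:i + size], alpha)
            for i in range(0, len(signal) - size + 1, size)]


if __name__ == "__main__":
    print(pool_1d([1.0, 3.0, 2.0, 6.0], size=2, alpha=0.5))
```

Setting `alpha` to 0 or 1 recovers the two conventional operations, which makes the blended form a drop-in generalization of either.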
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
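The stochastic pooling procedure described in the abstract above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes non-negative activations (for example, after a rectifier), handles the all-zero region by returning 0.0 as a convention chosen here, and the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its own
    value, i.e. a multinomial over the normalized activities.
    Assumes all activations are non-negative.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # No activity in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([stochastic_pool([1.0, 3.0], rng) for _ in range(5)])
```

Larger activations are picked more often, but smaller ones still get through occasionally, which is the source of the regularizing noise.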
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called the $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
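The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the unit reduces to a normalized $L_p$ norm, (mean of |z_i|^p)^(1/p), over already-projected inputs z; the function name `lp_unit` and the omission of learned projections, centers, and a learned order p are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def lp_unit(z, p):
    """Normalized L_p norm of the projected inputs z: (mean(|z_i|**p))**(1/p).

    Hypothetical helper for illustration; p is a fixed order here,
    whereas the abstract describes learning a different order per unit.
    """
    z = np.abs(np.asarray(z, dtype=float))
    m = z.max()
    if m == 0.0:
        return 0.0
    # Factor out the maximum so that large p does not overflow.
    return float(m * np.mean((z / m) ** p) ** (1.0 / p))

z = [1.0, -2.0, 3.0]
avg_pool = lp_unit(z, 1)    # mean of absolute values
rms_pool = lp_unit(z, 2)    # root-mean-square
near_max = lp_unit(z, 200)  # approaches max(|z_i|) as p grows
```

Under these assumptions, p = 1 gives average pooling of absolute values, p = 2 gives root-mean-square pooling, and large p approaches max pooling, which is the sense in which the unit generalizes those operators.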
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
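One way to combine max and average pooling, as the preceding abstract describes, is a convex mix of the two. The sketch below is only an illustration under stated assumptions: the mixing formula alpha * max + (1 - alpha) * average, the name mixed_pool_2d, and the fixed alpha are hypothetical (in a learned variant alpha would be a trainable parameter), and the abstract does not specify the exact form.

```python
def mixed_pool_2d(image, size=2, alpha=0.5):
    """Pool non-overlapping size x size windows of a 2D list.

    Each output is alpha * max(window) + (1 - alpha) * mean(window).
    alpha=1 gives max pooling, alpha=0 gives average pooling.
    (The mixing rule and fixed alpha are illustrative assumptions.)
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            avg = sum(window) / len(window)
            pooled_row.append(alpha * max(window) + (1 - alpha) * avg)
        pooled.append(pooled_row)
    return pooled
```

Setting alpha to 1 or 0 recovers the two conventional operations, which makes the mix easy to sanity-check against existing pooling code.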
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
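The stochastic pooling procedure in the preceding abstract (pick one activation per pooling region, with probabilities given by the activities in that region) can be sketched as follows. This is a training-time illustration only, assuming non-negative activations (for example after a rectifier); the function name and the all-zero fallback are assumptions, not details from the abstract.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial over the normalized activities. Assumes all
    activations are non-negative; returns 0.0 if the region is all zeros
    (an illustrative fallback).
    """
    total = sum(activations)
    if total <= 0:
        return 0.0
    return rng.choices(activations, weights=activations, k=1)[0]
```

For a region such as [1.0, 3.0], the value 3.0 should be returned about three times out of four, so strong activations dominate without always winning as they do under max pooling.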
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al.,
2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
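The "randomly dropping a few nodes" step in the dropout abstract above can be illustrated with a minimal sketch. It assumes the common inverted-dropout convention (kept activations are rescaled by 1/keep probability), which the abstract itself does not specify; the function name `dropout` and the sample values are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob.

    Assumption: survivors are rescaled by 1 / (1 - drop_prob) so the
    expected value of each unit is unchanged (inverted dropout).
    """
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                  # fixed seed for repeatability
hidden = [0.5, -1.2, 0.3, 2.0]          # hypothetical hidden-layer outputs
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

Each entry of `dropped` is either zero or the original activation divided by the keep probability.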
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
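A single layer of the multilayer bootstrap network described above might be sketched as follows. The abstract states only that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; encoding each point as concatenated one-hot nearest-center indicators is an assumption made here for illustration, and the names `k_centers_layer` and `squared_distance` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_centers_layer(data, num_clusterings, k, rng):
    """Encode data with several independent k-centers clusterings.

    Centers of each clustering are k data points sampled at random.
    Assumption: each point is represented by a one-hot vector marking
    its nearest center, concatenated over all clusterings.
    """
    centers = [rng.sample(data, k) for _ in range(num_clusterings)]
    encoded = []
    for point in data:
        code = []
        for clustering in centers:
            nearest = min(range(k),
                          key=lambda j: squared_distance(point, clustering[j]))
            code.extend(1.0 if j == nearest else 0.0 for j in range(k))
        encoded.append(code)
    return encoded


rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]  # toy 2-D points
codes = k_centers_layer(data, num_clusterings=3, k=4, rng=rng)
```

Stacking several such layers, each taking the previous layer's codes as input, would give the multilayer structure; that stacking is not shown here.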
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
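The phrase "uses mini-batch statistics to standardize features" in the abstract above can be made concrete with a small sketch. It shows only the standardization step; the learned scale and shift parameters of full batch normalization are omitted, and the stabilizing constant `eps` and the function name `batch_normalize` are assumptions for illustration.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of the mini-batch.

    batch is a list of equal-length feature vectors. eps is an assumed small
    constant that avoids division by zero for constant features.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)]
            for row in batch]


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy mini-batch, 2 features
normalized = batch_normalize(batch)
```

After the call, every feature column of `normalized` has approximately zero mean and unit variance, regardless of the original scale.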
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
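The idea of geometrically increasing the amount of data used per iteration, mentioned in the Hessian-free abstract above, could look roughly like the schedule below. The abstract gives no concrete values, so the starting size, growth factor, and cap at the full dataset size are all illustrative assumptions, as is the function name `geometric_sample_sizes`.

```python
def geometric_sample_sizes(initial, growth, total, max_iters):
    """Return per-iteration sample sizes that grow by a constant factor.

    Sizes are capped at total, the full dataset size. All four parameters
    are hypothetical knobs, not values taken from the abstract.
    """
    sizes = []
    size = float(initial)
    for _ in range(max_iters):
        sizes.append(min(total, int(round(size))))
        size *= growth
    return sizes


schedule = geometric_sample_sizes(initial=100, growth=2.0,
                                  total=1000, max_iters=6)
# schedule == [100, 200, 400, 800, 1000, 1000]
```

Early iterations then work on small, cheap subsets, and later iterations see most or all of the data.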
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
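The normalized $L_p$ norm computed by such a unit can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the paper's exact parametrization: it takes the unit to compute ((1/N) * sum_i |z_i|^p)^(1/p) over N linear projections z_i of the input, uses a fixed order p instead of a learned one, and omits any bias or centering terms. The function name lp_unit and its arguments are hypothetical.

```python
def lp_unit(inputs, weights, p):
    """Normalized L_p norm over linear projections of the inputs.

    inputs  -- list of floats from the layer below
    weights -- one row of projection weights per projection (assumed layout)
    p       -- order of the norm, p >= 1 (fixed here; assumed learnable in practice)
    """
    # Each projection z_i is the dot product of one weight row with the inputs.
    projections = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    n = len(projections)
    # Normalized L_p norm: ((1/N) * sum |z_i|^p)^(1/p).
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    x = [3.0, -4.0]
    for order in (1, 2, 50):
        print(order, lp_unit(x, identity, order))
```

Under these assumptions, p = 1 gives the average of the absolute projections, p = 2 gives the root-mean-square, and large p approaches the maximum absolute projection, which is the sense in which the unit generalizes average, root-mean-square and max pooling.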
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units, revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNNs counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train to a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function that to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
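The dropout-based gradient descent for convex ERMs described in the preceding abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a squared loss on a linear model, inverted dropout applied to the input features, and full-batch updates. The names dropout and dropout_gd, and all default values, are hypothetical.

```python
import random


def dropout(x, keep_prob, rng):
    """Inverted dropout (an assumed variant): zero each feature with
    probability 1 - keep_prob and rescale the survivors by 1 / keep_prob."""
    return [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]


def dropout_gd(X, y, keep_prob=0.8, lr=0.1, steps=500, seed=0):
    """Full-batch gradient descent on a squared loss for a linear model,
    drawing a fresh dropout mask for every example at every step."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            xd = dropout(xi, keep_prob, rng)
            err = sum(wj * xj for wj, xj in zip(w, xd)) - yi
            for j, xj in enumerate(xd):
                grad[j] += err * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [2.0, 3.0, 5.0]
    print("no dropout:", dropout_gd(X, y, keep_prob=1.0))
    print("with dropout:", dropout_gd(X, y, keep_prob=0.8))
```

With keep_prob set to 1.0 the mask is the identity and the loop reduces to plain gradient descent, which makes it easy to compare the regularized and unregularized solutions on the same data.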
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
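One layer of the multilayer bootstrap network described in the preceding abstract can be sketched as follows. The abstract states only that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; the one-hot nearest-center encoding, the squared Euclidean distance, and the names sq_dist, fit_layer and encode are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit_layer(data, n_clusterings, k, rng):
    """Build one layer: each clustering is just k data points drawn at
    random (without replacement) to serve as its centers."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def encode(layer, x):
    """Concatenate the one-hot nearest-center codes from every clustering
    (the one-hot output is an assumption, not stated in the abstract)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(centers[i], x))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    layer = fit_layer(data, n_clusterings=3, k=2, rng=rng)
    print(encode(layer, [0.9, 0.1]))
```

Stacking layers would feed the codes of one layer to fit_layer as the data for the next, typically with a smaller k; that part is left out of the sketch.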
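The variant reported as helpful in the preceding abstract, batch normalization on the input-to-hidden transitions only, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a tanh recurrent unit, no bias, and no learned gain or shift in the normalization. The names batch_norm, matvec and rnn_step are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature with mini-batch mean and variance
    (learned gain and shift parameters are omitted in this sketch)."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def rnn_step(x_batch, h_batch, W_x, W_h):
    """h_t = tanh(BN(W_x x_t) + W_h h_{t-1}): normalization is applied to
    the input-to-hidden term only, not to the hidden-to-hidden term."""
    in_proj = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_proj = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(ip, rp)]
            for ip, rp in zip(in_proj, rec_proj)]


if __name__ == "__main__":
    x_batch = [[1.0], [3.0]]
    h_batch = [[0.0, 0.0], [0.0, 0.0]]
    W_x = [[1.0], [2.0]]
    W_h = [[0.0, 0.0], [0.0, 0.0]]
    print(rnn_step(x_batch, h_batch, W_x, W_h))
```

Keeping the recurrent term outside the normalization mirrors the abstract's finding that normalizing hidden-to-hidden transitions did not help training.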
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
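The stochastic rounding mentioned in the preceding abstract can be illustrated with a minimal sketch. The abstract does not specify the fixed-point format, so the function name `stochastic_round`, the split into `frac_bits` fractional bits, and the saturation behaviour are all assumptions made for illustration only. The idea shown is that a value is rounded up with probability equal to its fractional remainder on the grid, so the rounded value equals the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using probabilistic rounding.

    frac_bits and word_bits are illustrative assumptions, not values
    taken from the abstract (apart from the 16-bit word length).
    """
    scale = 1 << frac_bits              # grid step is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder, so the
    # result is unbiased in expectation (before saturation).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable range of a signed word_bits integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale
```

With these assumed defaults the grid step is 1/256. An update of 0.001 is smaller than half a grid step, so round-to-nearest would always map it to zero, whereas this sketch returns 1/256 with probability about 0.256 and zero otherwise, which preserves small updates on average.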
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
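To make the stochastic rounding idea in the abstract above concrete, here is a minimal sketch. The abstract only states that 16-bit fixed-point numbers are used with stochastic rounding; the split into 8 fractional bits, the saturation behaviour, and the function names (stochastic_round, round_to_nearest) are assumptions chosen for illustration. The point of the demo is that a small update which round-to-nearest always maps to zero survives on average under stochastic rounding.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with `frac_bits` fractional bits.

    x is rounded up with probability equal to its fractional distance from
    the lower grid point, so the rounding is unbiased for in-range values.
    (frac_bits=8 is an assumed layout, not taken from the abstract.)
    """
    scale = 1 << frac_bits                  # grid step is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the leftover fraction in [0, 1).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the signed range of a `word_bits`-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, lower)) / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic baseline: round half up on the same grid."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001   # below half a grid step (1/512), so nearest-rounding loses it
    n = 100_000
    mean_sr = sum(stochastic_round(update, rng=rng) for _ in range(n)) / n
    print("round-to-nearest:", round_to_nearest(update))
    print("stochastic mean :", mean_sr)
```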
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
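As an illustration of direction (1) in the pooling abstract above, the following sketch mixes max and average pooling over a single pooling region. The function name `mixed_pool` and the scalar mixing weight `alpha` are assumptions made for this example; the abstract does not give the parametrization, and in a real network `alpha` would be learned by gradient descent rather than passed in by hand.

```python
import numpy as np


def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. alpha is a hypothetical mixing weight, assumed to lie in [0, 1].
    """
    x = np.asarray(region, dtype=float)
    return float(alpha * x.max() + (1.0 - alpha) * x.mean())


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

With `alpha` fixed at 0 or 1 the function reduces to the two conventional operators, so the learned case can only match or interpolate between them on a given region.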
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
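The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is an illustrative sketch, not the authors' implementation: the name `stochastic_pool` is hypothetical, the activations are assumed to be non-negative (for example, outputs of rectifier units), and the sampling probabilities are assumed to be the activations divided by their sum.

```python
import numpy as np


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region (training-time behaviour).

    Each activation is picked with probability proportional to its value,
    i.e. from a multinomial distribution over the region. Assumes the
    activations are non-negative.
    """
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total == 0.0:
        # No active unit in the region: nothing to sample from.
        return 0.0
    probs = a / total  # assumed normalization: p_i = a_i / sum_j a_j
    return float(rng.choice(a, p=probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    region = [[0.0, 1.0], [2.0, 5.0]]
    # Larger activations are drawn more often, but not always.
    print([stochastic_pool(region, rng) for _ in range(8)])
```

Because the output is always one of the activations in the region, the procedure introduces no extra hyper-parameters, which matches the abstract's description of the method as hyper-parameter free.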
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout and related methods such as DropConnect are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
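The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the unit computes the normalized norm $(\frac{1}{N}\sum_i |w_i^\top x + b_i|^p)^{1/p}$ over $N$ linear projections; the function name `lp_unit`, the toy weights and the fixed orders are hypothetical choices for illustration, not the authors' implementation (which also learns $p$ per unit).

```python
import numpy as np

def lp_unit(x, W, b, p):
    """Normalized L_p norm over N linear projections of x (illustrative only)."""
    z = W @ x + b                                 # N projections of the input
    return np.mean(np.abs(z) ** p) ** (1.0 / p)   # (1/N * sum |z_i|^p)^(1/p)

# Toy input and projection weights (assumed values).
x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.zeros(3)

mean_abs = lp_unit(x, W, b, 1.0)    # p = 1: average of absolute projections
rms = lp_unit(x, W, b, 2.0)         # p = 2: root-mean-square pooling
near_max = lp_unit(x, W, b, 200.0)  # large p: approaches max |z_i|
```

With these toy values the projections are 1, -2 and -1, so the three calls behave like average, root-mean-square and (approximately) max pooling over their absolute values.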
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
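The mini-batch standardization that batch normalization relies on can be sketched in a few lines. This is a minimal pure-Python illustration, not the implementation used in the work summarized above: the function name `batch_norm` and the constant `eps` are hypothetical, and the learned scale and shift parameters of full batch normalization are deliberately left out.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) of a mini-batch given as a list of rows.

    Assumption: no learned scale/shift parameters; eps only guards against
    division by zero when a feature has zero variance in the batch.
    """
    n = len(batch)
    dim = len(batch[0])
    # Per-feature mean and (biased) variance computed over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


# Two features on very different scales end up with comparable ranges.
normalized = batch_norm([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
print(normalized)
```

For a recurrent network, the same function would be applied to the input-to-hidden pre-activations at each time step; how the statistics are shared across time steps is a design choice this sketch does not address.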
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
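Stochastic rounding to a fixed-point grid, as referred to above, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the scheme used in that work: the function name `stochastic_round_fixed`, the split of a 16-bit word into 8 fractional bits, and the saturating behaviour at the range limits are all hypothetical choices made here.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with stochastic rounding.

    Assumptions: word_bits total bits, frac_bits of them fractional, and
    out-of-range values saturate at the representable limits.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder, so the
    # rounded value equals x in expectation (within the representable range).
    if rng.random() < scaled - lower:
        lower += 1
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, lower)) / scale


# 0.3 is not representable with 8 fractional bits; each call returns one of
# its two neighbours on the grid, 76/256 or 77/256.
print(stochastic_round_fixed(0.3))
```

Round-to-nearest would always map 0.3 to 77/256, so small updates below half a grid step would be lost every time; with stochastic rounding they survive on average, which is the intuition for why the rounding scheme matters during training.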
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
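One simple way to combine max and average pooling, as discussed in the preceding abstract on generalized pooling, is a convex combination of the two. The abstract does not spell out the exact form, so the sketch below is an assumption: the function name `mixed_pool` and the mixing proportion `alpha` are hypothetical, and `alpha` is a fixed argument here rather than a learned parameter.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    `alpha` is a hypothetical mixing proportion in [0, 1]:
    alpha = 1 gives max pooling, alpha = 0 gives average pooling.
    In a learned variant, alpha would be trained by gradient descent.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val

if __name__ == "__main__":
    window = [1.0, 3.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(window, alpha))
```

Because the output is differentiable with respect to `alpha`, the proportion could in principle be trained alongside the network weights.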
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
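The stochastic pooling procedure described in the preceding abstract can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example, outputs of a rectifier) and non-overlapping one-dimensional pooling windows. The names `stochastic_pool` and `stochastic_pool_1d`, and the choice to return zero for an all-zero region, are assumptions made for the example rather than details taken from the abstract.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its
    value, i.e. from a multinomial distribution given by the activities
    in the region. Assumes non-negative activations.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    probs = [a / total for a in region]
    # random.choices draws indices according to the given weights.
    idx = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[idx]

def stochastic_pool_1d(activations, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of `size`."""
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations) - size + 1, size)
    ]

if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for repeatable sampling
    print(stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], 2, rng))
```

Larger activations are more likely to be selected, but smaller non-zero ones are occasionally picked as well, which is the source of the regularizing noise during training.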
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 
2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
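As a rough illustration of the $L_p$ unit described above, the following Python sketch computes a normalized $L_p$ norm over several linear projections of an input. The function name lp_unit, the weights, and the example input are hypothetical and chosen only for illustration; the unit in the paper may carry further learned parameters (for example a learned order $p$ per unit) that this sketch leaves out.

```python
import math


def lp_unit(x, weights, biases, p):
    """Normalized L_p norm over N linear projections of x (illustrative sketch).

    x       -- list of input values
    weights -- list of N weight vectors, one per projection (hypothetical values)
    biases  -- list of N biases, one per projection
    p       -- order of the norm, assumed here to satisfy p >= 1
    """
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    # Each projection is a plain affine map of the input.
    z = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
         for w, b in zip(weights, biases)]
    n = len(z)
    # Normalized L_p norm: ((1/N) * sum_i |z_i|^p)^(1/p)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)


# Hypothetical example: three projections of a two-dimensional input.
x = [1.0, -2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

mean_abs = lp_unit(x, weights, biases, 1)    # p = 1: mean of |z_i|
rms = lp_unit(x, weights, biases, 2)         # p = 2: root-mean-square of z_i
near_max = lp_unit(x, weights, biases, 50)   # large p: approaches max |z_i|
```

With p = 1 the unit reduces to the average of the absolute projections, with p = 2 to their root-mean-square, and as p grows the value approaches the largest absolute projection, which matches the pooling interpretations listed in the abstract.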
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
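The layer structure described in the multilayer bootstrap network abstract above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `bootstrap_layer`, the Euclidean metric, the one-hot encoding of the nearest center, and the per-layer schedule of centers are all hypothetical choices made here; the abstract itself specifies only that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points.

```python
import numpy as np

def bootstrap_layer(data, n_clusterings, k, rng):
    """One layer: n_clusterings independent k-centers clusterings.

    Each clustering draws k data points at random as its centers and
    (assumption) encodes every input as a one-hot indicator of its
    nearest center. The encodings are concatenated.
    """
    n = data.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are randomly sampled data points, as in the abstract.
        centers = data[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distances, shape (n, k). Assumed metric.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), d2.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    return np.concatenate(codes, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # toy data, not from the paper
H = X
for k in (50, 20, 5):            # hypothetical schedule of centers per layer
    H = bootstrap_layer(H, n_clusterings=8, k=k, rng=rng)
```

After the loop, `H` holds one 5-way indicator per clustering for each of the 200 points. The sketch encodes only the data it was built from; handling new points would require storing the sampled centers of every layer.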
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
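To make the distinction drawn in the preceding abstract concrete, the following NumPy sketch applies batch normalization to the input-to-hidden term of a plain tanh RNN and leaves the hidden-to-hidden term untouched. It is a minimal sketch under assumptions, not the authors' model: the names `batch_norm` and `rnn_step`, the layer sizes, the random toy inputs, and the choice to compute statistics separately at every time step are all hypothetical, and only training-mode (mini-batch) statistics are shown.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature with mini-batch statistics, then scale and shift."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One tanh RNN step; only the input-to-hidden term is normalized."""
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h + b)

rng = np.random.default_rng(0)
batch, n_in, n_hid, steps = 32, 5, 4, 6          # toy sizes (assumed)
W_x = rng.normal(scale=0.1, size=(n_in, n_hid))
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)
gamma, beta = np.ones(n_hid), np.zeros(n_hid)

h = np.zeros((batch, n_hid))
for _ in range(steps):
    x_t = rng.normal(size=(batch, n_in))         # stand-in for real inputs
    h = rnn_step(x_t, h, W_x, W_h, b, gamma, beta)

# Normalized input-to-hidden activations for the last time step.
normed = batch_norm(x_t @ W_x, gamma, beta)
```

With `gamma` set to one and `beta` to zero, each column of `normed` has approximately zero mean and unit variance across the mini-batch, which is the standardization the abstract refers to.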
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
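The abstract above names stochastic rounding but does not define it. The following is a minimal sketch under one common definition, assumed here rather than taken from the text: a value is rounded to one of its two neighbouring fixed-point grid points, and the probability of rounding up equals its fractional distance above the lower point, so the result is unbiased in expectation. The function names, the 8 fractional bits and the saturation behaviour are illustrative choices.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point (an assumed, common definition).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise keep the floor.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, lower)) / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up onto the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    tiny_update = 0.001  # smaller than half the grid spacing of 1/256
    n = 100_000
    mean_sr = sum(stochastic_round(tiny_update, rng=rng) for _ in range(n)) / n
    print("round-to-nearest:", round_to_nearest(tiny_update))
    print("stochastic mean :", mean_sr)
```

In this sketch, an update smaller than half the grid spacing is always lost under round-to-nearest, whereas under stochastic rounding it survives on average, which is one plausible reading of why the rounding scheme matters during training.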
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
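The pooling abstract above mentions two strategies for combining max and average pooling without spelling them out. As a rough sketch, assume they are a convex combination with a single mixing weight and an input-dependent (gated) variant in which the weight is a sigmoid of a linear function of the region. The names `mixed_pool`, `gated_pool`, `pool2d`, `alpha` and `gate_weights` are hypothetical, and in a real network the mixing parameters would be learned by gradient descent rather than set by hand.

```python
import math


def mixed_pool(region, alpha):
    """Blend max and average pooling over one region, alpha in [0, 1]."""
    return alpha * max(region) + (1.0 - alpha) * sum(region) / len(region)


def gated_pool(region, gate_weights):
    """Pick the blend per region: alpha = sigmoid(gate_weights . region)."""
    score = sum(w * x for w, x in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, alpha)


def pool2d(image, size, pool_fn):
    """Apply pool_fn to non-overlapping size x size regions of a 2D list."""
    out = []
    for r in range(0, len(image) - size + 1, size):
        row = []
        for c in range(0, len(image[0]) - size + 1, size):
            region = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(pool_fn(region))
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[1, 2, 5, 6],
             [3, 4, 7, 8],
             [9, 10, 13, 14],
             [11, 12, 15, 16]]
    print(pool2d(image, 2, lambda reg: mixed_pool(reg, 0.5)))
    print(pool2d(image, 2, lambda reg: gated_pool(reg, [0.0] * 4)))
```

Setting `alpha` to 1 or 0 recovers plain max or average pooling, so under these assumptions the conventional operations are special cases of the blended one.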
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
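The stochastic pooling procedure described above can be sketched as follows: within each pooling region, one activation is sampled with probability proportional to its value. The sketch assumes non-negative activations (for example, after a rectifier) and falls back to a uniform choice when the whole region is zero; both points, along with the function names, are assumptions made for illustration and are not stated in the abstract.

```python
import random


def pooling_probs(region):
    """Multinomial probabilities proportional to the activations."""
    total = sum(region)
    if total == 0:
        # Assumed fallback: every activation is zero, so any choice gives 0.
        return [1.0 / len(region)] * len(region)
    return [a / total for a in region]


def stochastic_pool(region, rng=random):
    """Sample one activation from the region according to pooling_probs."""
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    probs = pooling_probs(region)
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0, 0.0, 4.0]
    print("probabilities:", pooling_probs(region))
    print("samples      :", [stochastic_pool(region, rng) for _ in range(8)])
```

Because larger activations are picked more often but not always, the layer behaves like a noisy version of max pooling during training, which is consistent with its use as a regularizer.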
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
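The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes that "normalized $L_p$ norm" means the mean of the p-th powers of the absolute inputs, raised to the power 1/p; the function name `lp_unit` is hypothetical, the learned projections that feed the unit are taken as given inputs, and none of this is the authors' implementation.

```python
def lp_unit(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|v|^p))^(1/p).

    Assumption: "normalized" is read as dividing by the group size.
    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if not values:
        raise ValueError("values must be non-empty")
    # Factor out the largest magnitude so that large p does not overflow.
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return 0.0
    mean_pow = sum((abs(v) / scale) ** p for v in values) / len(values)
    return scale * mean_pow ** (1.0 / p)


if __name__ == "__main__":
    group = [1.0, -2.0, 3.0]
    for order in (1, 2, 8, 200):
        print(order, lp_unit(group, order))
```

Running the loop shows the output moving from the average magnitude toward the largest magnitude as the order grows, which is the sense in which one unit can interpolate between average, root-mean-square and max pooling.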
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
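The dropout-based gradient descent for convex ERMs described in the abstract above can be illustrated with a minimal sketch. It assumes a logistic-regression GLM, dropout applied to the input features, and illustrative values for `keep_prob`, `lr`, and `steps`; the function names are hypothetical and none of this is taken from the paper itself.

```python
import numpy as np

def dropout_gd_logistic(X, y, keep_prob=0.8, lr=0.1, steps=200, seed=0):
    """Gradient descent on the logistic loss with dropout on input features.

    Assumption: features are dropped independently and rescaled by 1/keep_prob
    (inverted dropout) so that the expected input is unchanged.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Sample a fresh dropout mask for every example and feature.
        mask = rng.random((n, d)) < keep_prob
        X_drop = X * mask / keep_prob
        # Predicted probabilities under the current weights.
        p = 1.0 / (1.0 + np.exp(-(X_drop @ w)))
        # Gradient of the mean logistic loss on the dropped-out inputs.
        grad = X_drop.T @ (p - y) / n
        w -= lr * grad
    return w

def logistic_loss(X, y, w):
    """Mean negative log-likelihood for labels y in {0, 1}."""
    z = X @ w
    return np.mean(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))
```

Because the mask is resampled at every step, each update sees a perturbed copy of the training set, which is one informal way to read the "stabilizer" behaviour the abstract refers to.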
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
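One layer of the multilayer bootstrap network described in the abstract above can be sketched as follows. The abstract only states that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; the one-hot encoding of the nearest center, the Euclidean distance, and the name `bootstrap_layer` are assumptions made for illustration.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings=10, k=5, seed=0):
    """Encode X with a group of independent k-centers clusterings.

    Each clustering draws k data points at random as its centers and maps
    every input to a one-hot indicator of its nearest center (assumed
    encoding). The indicators of all clusterings are concatenated.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are randomly sampled data points, without replacement.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        codes.append(np.eye(k)[nearest])
    return np.concatenate(codes, axis=1)
```

Under these assumptions a deeper network would feed the output of one call into the next, for example `bootstrap_layer(bootstrap_layer(X, k=20), k=10)`.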
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
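The variant discussed in the abstract above, batch normalization applied only to the input-to-hidden transition of an RNN, can be sketched in NumPy. This is a forward pass only, with assumptions labeled: a plain tanh RNN, statistics computed per time step over the mini-batch, and hypothetical names `batch_norm` and `rnn_forward_bn_input`.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature using mini-batch statistics, then scale and shift."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_forward_bn_input(x_seq, W_x, W_h, b, gamma, beta):
    """Run a tanh RNN over x_seq with shape (T, batch, d_in).

    Only the input-to-hidden term x_t @ W_x is batch-normalized; the
    hidden-to-hidden term h @ W_h is left untouched.
    Returns the hidden states with shape (T, batch, d_h).
    """
    T, batch, _ = x_seq.shape
    h = np.zeros((batch, W_h.shape[0]))
    states = []
    for t in range(T):
        # Assumption: normalization statistics are computed separately per time step.
        a_x = batch_norm(x_seq[t] @ W_x, gamma, beta)
        h = np.tanh(a_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)
```

Keeping the recurrent term outside the normalization mirrors the abstract's observation that normalizing hidden-to-hidden transitions did not help training.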
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
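The stochastic rounding mentioned in the abstract above can be sketched in a few lines. The abstract does not spell out the rounding rule, so this sketch assumes the common definition: a value is rounded up to the next fixed-point grid value with probability equal to its fractional distance from the lower grid value, which makes the rounding unbiased in expectation. The function name `stochastic_round_fixed` and the parameters `frac_bits` and `word_bits` are hypothetical and chosen only for illustration.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid using stochastic rounding.

    Assumed format (illustrative only): word_bits total bits, of which
    frac_bits are fractional, so the grid step is 2**-frac_bits.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)            # lower grid point, as an integer
    # Round up with probability equal to the fractional remainder,
    # so that the expected result equals x.
    if rng.random() < scaled - q:
        q += 1
    # Saturate to the representable signed range.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time with a 2-bit fraction, so small updates would be lost systematically; the stochastic rule preserves them on average.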
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
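The "randomly dropping a few nodes" step mentioned in the dropout abstract above can be illustrated with a minimal sketch. This is not the algorithm analyzed in that paper: the function name `dropout`, the 0.5 drop probability, and the rescaling of surviving activations by 1 / keep_prob ("inverted dropout") are all assumptions made for illustration.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation independently with probability drop_prob.

    Survivors are rescaled by 1 / keep_prob so the expected value of each
    unit is unchanged (an assumed convention, not taken from the abstract).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Hypothetical hidden-layer activations, used only for this example.
rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

Each entry of `dropped` is either zero or the original activation divided by the keep probability; passing a seeded `random.Random` keeps the mask reproducible.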
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
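The phrase "uses mini-batch statistics to standardize features" in the batch normalization abstract above can be made concrete with a small sketch. It covers only the standardization step for a feedforward mini-batch; the learned scale and shift parameters, the running averages used at test time, and the recurrent variants studied in that paper are omitted. The function name `batch_standardize` and the value of `eps` are assumptions for illustration.

```python
import math


def batch_standardize(batch, eps=1e-5):
    """Standardize each feature (column) using the mini-batch mean and variance.

    batch is a list of equal-length feature rows. eps guards against division
    by zero for constant features (an assumed value, not from the abstract).
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)]
            for row in batch]


# Hypothetical mini-batch of three examples with two features each.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_standardize(batch)
```

After this step each column of `normalized` has mean approximately zero and variance approximately one, regardless of the original feature scale.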
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
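The stochastic rounding mentioned in the abstract above can be illustrated with a short sketch. The abstract does not spell out the rounding rule, so this assumes the common definition: a value is rounded up to the next fixed-point grid value with probability equal to its fractional distance from the lower grid value, which makes the rounding unbiased in expectation. The function name `stochastic_round` and the parameters `frac_bits` and `word_bits` are hypothetical names chosen for this example.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with `frac_bits` fractional bits.

    x is rounded up with probability equal to its fractional distance
    from the lower grid point, so the expected result equals x
    (as long as x lies inside the representable range).
    """
    scale = 1 << frac_bits  # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up is the fractional part of the scaled value.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-bit integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, lower)) / scale


# With 2 fractional bits the grid spacing is 0.25, so 0.3 rounds to
# either 0.25 or 0.5; the average over many draws stays close to 0.3.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time, so small updates of this size would be systematically lost; the randomized rule preserves them on average, which is one plausible reading of why the abstract singles out the rounding scheme.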
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition, and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
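The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a small sketch. This is only one plausible reading and not the authors' formulation: the function names and the single mixing weight `a` are assumptions made for illustration, and `a` is passed in by hand here, whereas in a learned pooling layer it would be a parameter trained by gradient descent.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a hypothetical mixing weight in [0, 1]:
    a = 1 gives plain max pooling, a = 0 gives plain average pooling.
    """
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


def mixed_pool_1d(activations, size, a):
    """Apply mixed_pool to non-overlapping windows of length `size`."""
    return [
        mixed_pool(activations[i:i + size], a)
        for i in range(0, len(activations) - size + 1, size)
    ]


if __name__ == "__main__":
    # Two windows of size 2, blended half-and-half between max and mean.
    print(mixed_pool_1d([1.0, 3.0, 0.0, 4.0], size=2, a=0.5))
```

Keeping the blend a convex combination means the output always lies between the average and the maximum of the region, so both conventional pooling operations are recovered as special cases.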
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
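The stochastic pooling procedure described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes activations are non-negative (for example, outputs of rectifier units), treats each activation's share of the region total as its sampling probability, and uses hypothetical function names. The training-time sampling step is all that is shown.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial over the region given by the activities themselves.
    Assumes all activations are non-negative.
    """
    if sum(region) <= 0:
        # No positive activity to sample from; return zero for the region.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(activations, size, rng=random):
    """Apply stochastic_pool to non-overlapping windows of length `size`."""
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator so runs are repeatable
    print(stochastic_pool_1d([1.0, 2.0, 0.5, 4.0], size=2, rng=rng))
```

Because larger activations are sampled more often, the procedure behaves like max pooling on average while still occasionally passing through smaller responses, which is where the regularizing effect comes from.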
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data.
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
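The batch-normalized RNN idea in the abstract above can be illustrated with a minimal sketch. It assumes a plain tanh RNN, normalization statistics computed per time step over the mini-batch, and fixed scale and shift values; the helper names (`batch_norm`, `matvec_batch`, `rnn_step`) are hypothetical and this is not the authors' implementation.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift."""
    n = len(batch)
    dim = len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma * (col[i] - mean) * inv_std + beta
    return out


def matvec_batch(batch, weights):
    """Multiply each row vector in batch by an (in_dim x out_dim) matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in batch]


def rnn_step(x_t, h_prev, w_xh, w_hh):
    """One tanh RNN step with batch norm on the input-to-hidden term only."""
    in_term = batch_norm(matvec_batch(x_t, w_xh))  # normalized (assumed placement)
    rec_term = matvec_batch(h_prev, w_hh)          # hidden-to-hidden left as-is
    return [[math.tanh(a + b) for a, b in zip(r_in, r_rec)]
            for r_in, r_rec in zip(in_term, rec_term)]
```

Only the input-to-hidden product is normalized here, mirroring the variant the abstract reports as helping convergence; a trained model would also learn `gamma` and `beta` and would need separate statistics at test time.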
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
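The geometric sampling idea in the Hessian-free abstract above can be pictured with a small schedule generator. The abstract does not give the starting size or growth factor, so `n_start`, `growth`, and the function name `geometric_sample_sizes` are assumptions made purely for illustration.

```python
def geometric_sample_sizes(n_total, n_start, growth=2.0):
    """Yield per-iteration sample sizes that grow geometrically.

    Hypothetical schedule: start at n_start examples, multiply by growth
    each outer iteration, and finish with the full set of n_total examples.
    """
    if n_start < 1 or growth <= 1.0:
        raise ValueError("need n_start >= 1 and growth > 1")
    size = float(n_start)
    while size < n_total:
        yield int(size)
        size *= growth
    yield n_total  # final iteration uses all of the data
```

For example, `list(geometric_sample_sizes(1000, 100))` gives `[100, 200, 400, 800, 1000]`, so early iterations touch only a fraction of the data and later ones use it all.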
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
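A minimal sketch of why the rounding scheme matters in fixed-point training, assuming a toy format with `frac_bits` fractional bits and no saturation (the function names, the bit width and the update size are illustrative choices, not taken from the abstract). Stochastic rounding rounds up with probability equal to the fractional remainder, so small updates survive in expectation, whereas round-to-nearest discards any update smaller than half a quantization step.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    Rounds up with probability equal to the fractional remainder,
    so the rounded value equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    return (lower + (1 if rng.random() < prob_up else 0)) / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up to a multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def accumulate(round_fn, update=0.001, steps=1000):
    """Apply the same small update repeatedly, rounding after every step."""
    w = 0.0
    for _ in range(steps):
        w = round_fn(w + update)
    return w
```

With 8 fractional bits the quantization step is 1/256, so an update of 0.001 is always rounded away by `round_to_nearest`, while `accumulate(stochastic_round)` drifts toward the full-precision total of about 1.0.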
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
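A minimal sketch of the first direction described above, combining max and average pooling, assuming a single scalar mixing proportion `alpha` per pooling layer (the name `alpha` and the scalar form are illustrative assumptions; the abstract mentions two combination strategies without detailing them, and in a trained network `alpha` would be learned rather than passed in).

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling and alpha = 0.0 gives plain
    average pooling; values in between interpolate linearly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not region:
        raise ValueError("region must be non-empty")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average
```

For the region `[1, 2, 3, 6]` the max is 6 and the average is 3, so `mixed_pool([1, 2, 3, 6], 0.5)` sits halfway between them at 4.5.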
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
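A minimal sketch of the stochastic pooling step described above, assuming non-negative activations (for example, after a rectifier) and a one-dimensional pooling region given as a Python list. The function names are illustrative. The abstract does not say what happens at test time, so the probability-weighted average in `expected_pool` is an assumption shown only for contrast.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from the region at random.

    Each activation is chosen with probability proportional to its
    value, i.e. from a multinomial over the normalized activations.
    Assumes all activations are non-negative.
    """
    if sum(region) == 0:
        return 0.0  # no activity in the region: nothing to sample
    return rng.choices(region, weights=region, k=1)[0]


def expected_pool(region):
    """Deterministic probability-weighted average (assumed test-time rule)."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total
```

For the region `[0, 1, 3]` the sampling probabilities are 0, 1/4 and 3/4, so `stochastic_pool` returns 3 about three times out of four and never returns 0, while `expected_pool` returns 2.5.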
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. 
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
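As a rough illustration of the $L_p$ unit described in the preceding abstract, the sketch below computes a normalized $L_p$ norm over several linear projections of an input. The name lp_unit, its arguments, and the exact normalization (mean of |z|^p followed by the p-th root) are assumptions made for illustration only; the abstract does not state the formula, and the unit in the paper may differ in detail (for example, the order $p$ is learned per unit there).

```python
def lp_unit(x, weights, biases, p):
    """Hypothetical sketch: normalized L_p norm over linear projections of x.

    x       -- list of input values
    weights -- list of weight vectors, one per projection
    biases  -- list of biases, one per projection
    p       -- order of the norm (assumed >= 1)
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    # One linear projection z_i = w_i . x + b_i per weight vector.
    projections = [
        sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    # Normalize by the number of projections before taking the p-th root.
    mean_power = sum(abs(z) ** p for z in projections) / len(projections)
    return mean_power ** (1.0 / p)
```

Under these assumptions, p = 1 gives the mean absolute projection, p = 2 gives the root-mean-square, and large p approaches the maximum absolute projection, which matches the abstract's remark that the unit generalizes average, root-mean-square and max pooling.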
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
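As a rough illustration of the "simple dropout based GD method for convex ERMs" mentioned in the preceding abstract, the sketch below runs per-example gradient steps on a logistic loss while randomly dropping input features. It is not the authors' algorithm: the function name `dropout_sgd`, the inverted-dropout rescaling by `1 / keep_prob`, the per-example (stochastic) updates, and all hyperparameter defaults are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dropout_sgd(xs, ys, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Hypothetical helper: logistic regression trained with feature dropout.

    Each update samples a fresh Bernoulli mask over the input features,
    rescales the kept features by 1 / keep_prob (an assumed convention),
    and takes one gradient step on the logistic loss of the masked input.
    """
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in x]
            x_drop = [xi * m / keep_prob for xi, m in zip(x, mask)]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_drop)))
            # Gradient of the logistic loss with respect to w is (p - y) * x_drop.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_drop)]
    return w


# Toy, linearly separable data (made up for the example).
xs = [[1.0, 0.5], [2.0, 1.0], [-1.0, -0.5], [-2.0, -1.0]]
ys = [1, 1, 0, 0]
weights = dropout_sgd(xs, ys)
predictions = [sigmoid(sum(w * xi for w, xi in zip(weights, x))) for x in xs]
```

At prediction time the full, unmasked input is used, which is why the sketch rescales the kept features during training.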
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
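The layer structure described in the preceding abstract (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched as follows. The abstract does not say how a point is encoded by a layer; the one-hot nearest-center encoding used here, and the helper names `bootstrap_layer` and `encode`, are assumptions for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bootstrap_layer(data, n_clusterings, k, rng):
    """Build one layer: each clustering keeps k randomly sampled data points as centers."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def encode(point, layer):
    """Concatenate one one-hot nearest-center indicator per clustering (assumed encoding)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Toy data: two loose groups of 2-D points (made up for the example).
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
rng = random.Random(0)
layer = bootstrap_layer(data, n_clusterings=3, k=2, rng=rng)
codes = [encode(point, layer) for point in data]
```

Under this assumed encoding, stacking layers would mean feeding `codes` to `bootstrap_layer` again as the next layer's data.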
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition task. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
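To make the distinction in the preceding abstract concrete, here is a minimal sketch of one vanilla RNN step in which only the input-to-hidden term is standardized with mini-batch statistics, while the hidden-to-hidden term is left untouched. The tanh cell, the absence of learned scale and shift parameters, and the helper names are simplifying assumptions, not details taken from the paper.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) using the mini-batch mean and variance."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def matmul(batch, weights):
    """Multiply an (n x d) batch by a (d x h) weight matrix."""
    return [[sum(row[k] * weights[k][j] for k in range(len(weights)))
             for j in range(len(weights[0]))] for row in batch]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One recurrent step: h_new = tanh(BN(x W_xh) + h W_hh)."""
    x_term = batch_norm(matmul(x_batch, w_xh))  # normalized input-to-hidden term
    h_term = matmul(h_batch, w_hh)              # hidden-to-hidden term, not normalized
    return [[math.tanh(a + b) for a, b in zip(x_row, h_row)]
            for x_row, h_row in zip(x_term, h_term)]


# Toy mini-batch of three 2-D inputs with a zero initial hidden state.
x_batch = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
w_xh = [[0.5, -0.3], [0.1, 0.2]]
w_hh = [[0.9, 0.0], [0.0, 0.9]]
normalized = batch_norm(x_batch)
h_next = rnn_step(x_batch, h_batch, w_xh, w_hh)
```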
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
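As an illustration of the stochastic rounding mentioned in the abstract above, the following is a minimal sketch rather than the authors' implementation. It assumes a signed fixed-point format with a 16-bit word and 8 fractional bits; the function name `stochastic_round` and the parameters `frac_bits` and `word_bits` are hypothetical choices made for this example.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits and word_bits are illustrative assumptions, not values
    taken from the abstract (which only states a 16-bit representation).
    """
    eps = 2.0 ** -frac_bits          # spacing of the fixed-point grid
    scaled = x / eps
    lower = math.floor(scaled)       # nearest grid index at or below x
    # Round up with probability equal to the fractional remainder, so the
    # rounded value equals x in expectation (before saturation).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-bit integer.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps
```

A value that already lies on the grid is returned unchanged, while a value between two grid points is mapped to one of its two neighbours at random, so that averaging many rounded copies recovers the original value. Round-to-nearest, by contrast, would always map small updates below half a grid step to zero.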
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
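To make the idea of combining max and average pooling in the abstract above concrete, here is a small sketch over a single pooling region. The abstract does not spell out the two combination strategies, so the forms used here are assumptions for illustration: a fixed mixing proportion `a`, and a proportion produced by a sigmoid gate on the region's values. The names `mixed_pool`, `gated_pool` and `weights` are hypothetical; in a real network the proportion or gate weights would be learned by gradient descent.

```python
import math


def mixed_pool(region, a):
    """Convex combination of max and average pooling over one region.

    a = 1 gives plain max pooling and a = 0 gives plain average pooling.
    """
    return a * max(region) + (1.0 - a) * sum(region) / len(region)


def gated_pool(region, weights):
    """Like mixed_pool, but the mixing proportion depends on the region.

    The proportion is a sigmoid of a weighted sum of the region's values
    (an assumed form of gating; weights would be learned in practice).
    """
    score = sum(w * x for w, x in zip(weights, region))
    a = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, a)
```

For the region `[1, 2, 3, 6]`, `mixed_pool` interpolates between the average (3.0) and the maximum (6), and `gated_pool` with all-zero weights falls back to an equal mix of the two.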
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
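The stochastic pooling procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: activations are assumed non-negative (for example, after a rectifier), the function name `stochastic_pool` is hypothetical, and the handling of an all-zero region is a choice made here rather than something the abstract specifies.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its
    value, i.e. according to a multinomial distribution given by the
    normalized activities in the region.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    if sum(region) == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

A call such as `stochastic_pool([0.5, 1.5, 2.0], random.Random(0))` returns one of the three activations, with the largest being the most likely choice.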
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
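To make the "normalized $L_p$ norm" in the abstract above concrete, here is a minimal sketch of the activation applied to a list of already-computed projections. The function name `lp_unit` is hypothetical, the projections are assumed to be given (the learned projection weights and the learning of $p$ itself are omitted), and the normalization by the number of inputs is our reading of "normalized", not a detail confirmed by the abstract.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a group of projected inputs.

    p=1 gives the mean absolute value, p=2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not projections:
        raise ValueError("need at least one projection")
    if p < 1:
        raise ValueError("order p is assumed to be at least 1")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)
```

Varying `p` across units moves each one between average-like and max-like pooling, which is the generalization the abstract refers to.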
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
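As a brief aside on the Confident-Information-First abstract above, the connection it mentions between Fisher information and the variance of unbiased estimates is the standard Cramér-Rao bound. For a scalar parameter $\theta$ and any unbiased estimator $\hat{\theta}$ (the symbols $\hat{\theta}$, $I(\theta)$ and $p(x;\theta)$ are notation assumed here, not taken from the abstract), it reads $$\mathrm{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}, \qquad I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log p(x;\theta)\right)^{2}\right].$$ A parameter with large Fisher information can therefore, in principle, be estimated with small variance, which is the sense in which its estimate is "confident".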
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
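The single bootstrap layer described in the preceding abstract (a group of independent k-centers clusterings whose centers are randomly sampled data points) might be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the squared Euclidean distance, and the one-hot nearest-center encoding are hypothetical choices made here, not details given in the abstract.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_bootstrap_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering is k centers sampled at random from the data."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(point, layer):
    """Concatenate, over all clusterings, a one-hot indicator of the nearest center.

    The one-hot output is an assumption made for this sketch.
    """
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = fit_bootstrap_layer(data, num_clusterings=4, k=3, rng=rng)
code = encode(data[0], layer)
print(len(code), sum(code))
```

With 4 clusterings of 3 centers each, every point is mapped to a 12-dimensional sparse code containing exactly one active unit per clustering. Stacking several such layers, each fitted on the codes of the previous one, would be the natural extension.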
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
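To make the input-to-hidden variant from the preceding abstract concrete, here is a minimal sketch of one step of a plain tanh RNN in which only the input-to-hidden term is standardized with mini-batch statistics. It is an illustration under assumptions, not the authors' implementation: the function names, the tiny example weights, and the omission of the learned scale and shift parameters and of bias terms are all simplifications made here.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) with the mini-batch mean and variance."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]


def matmul(batch, weights):
    """Multiply an (n x d) batch by a (d x h) weight matrix."""
    return [[sum(row[k] * weights[k][j] for k in range(len(weights)))
             for j in range(len(weights[0]))] for row in batch]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """Compute h_t = tanh(BN(x_t W_xh) + h_{t-1} W_hh).

    Only the input-to-hidden term is normalized; the hidden-to-hidden
    term is left untouched.
    """
    input_term = batch_norm(matmul(x_batch, w_xh))
    hidden_term = matmul(h_batch, w_hh)
    return [[math.tanh(a + b) for a, b in zip(row_in, row_hid)]
            for row_in, row_hid in zip(input_term, hidden_term)]


# Hypothetical mini-batch of 3 examples with 2 input features and 2 hidden units.
x_batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_prev = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w_xh = [[0.5, -0.5], [0.25, 0.75]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
h_next = rnn_step(x_batch, h_prev, w_xh, w_hh)
print(h_next)
```

Normalizing the hidden-to-hidden term instead would mean wrapping `hidden_term` in `batch_norm`, which is the variant the abstract reports as unhelpful.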
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
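The stochastic rounding scheme named in the limited-precision abstract above can be sketched as follows. This is a minimal illustration under assumptions: a signed fixed-point format with total_bits bits of which frac_bits are fractional, and saturation on overflow. The function name, the default bit split and the saturation behaviour are illustrative and not taken from the paper.

```python
import math
import random

def stochastic_round(x, frac_bits=8, total_bits=16, rng=random):
    """Round x to a signed fixed-point value with frac_bits fractional bits.

    The value is rounded up with probability equal to its fractional
    remainder (in units of 2**-frac_bits), so the result is unbiased:
    its expectation equals x whenever x is inside the representable range.
    Out-of-range values saturate (an assumed overflow policy)."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower          # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    max_int = (1 << (total_bits - 1)) - 1
    min_int = -(1 << (total_bits - 1))
    rounded = max(min_int, min(max_int, rounded))
    return rounded / scale

# Averaging many stochastic roundings recovers the original value,
# whereas round-to-nearest would always return the same grid point.
random.seed(0)
samples = [stochastic_round(0.3) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
```

The point of the sketch is the unbiasedness: small updates that round-to-nearest would discard entirely still contribute in expectation, which is one plausible reading of why the abstract reports that the rounding scheme matters during training.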
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased.
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks.
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
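The standardization step that the preceding abstract refers to ("uses mini-batch statistics to standardize features") can be sketched as below. It assumes the usual form of batch normalization without the learned gain and shift; the name batch_norm and the eps value are illustrative, and where the transform is placed inside the RNN (for example on the input-to-hidden term only) is left to the caller.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature column of a mini-batch.

    batch is a list of equal-length rows (one row per example). Learned
    scale and shift parameters are omitted in this sketch.
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)]
            for row in batch]


if __name__ == "__main__":
    for row in batch_norm([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]):
        print(row)
```

After the transform each column has mean zero and variance close to one, regardless of the original scale of the feature.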
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
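The idea of geometrically increasing the amount of data used per iteration, mentioned in the preceding abstract, can be illustrated with a simple schedule. The abstract does not give the actual rule, so the starting size, the growth factor, and the function name geometric_sample_sizes are all assumptions made for this sketch.

```python
def geometric_sample_sizes(total, initial, growth):
    """Return per-iteration sample sizes that grow geometrically.

    The schedule starts at `initial`, multiplies by `growth` each step,
    and ends with the full dataset size `total`. Hypothetical helper;
    the paper's actual sampling rule is not specified in the abstract.
    """
    if initial <= 0 or growth <= 1.0:
        raise ValueError("need initial > 0 and growth > 1")
    sizes = []
    size = float(initial)
    while size < total:
        sizes.append(int(size))
        size *= growth
    sizes.append(total)  # final iterations use all of the data
    return sizes


if __name__ == "__main__":
    print(geometric_sample_sizes(1000, 100, 2.0))
```

Early iterations, where a rough gradient is usually sufficient, touch only a small fraction of the data, and the number of iterations needed to reach the full set grows only logarithmically with the dataset size.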
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
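Stochastic rounding to a fixed-point grid, as referred to in the preceding abstract, can be sketched as below. This is the standard textbook definition: round up with probability equal to the fractional distance above the lower grid point. The function names and the frac_bits parameter are illustrative, and saturation to a fixed word length such as 16 bits is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x onto a grid with spacing 2**-frac_bits, without bias.

    x is rounded up with probability equal to its fractional distance
    above the lower grid point, so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    round_up = rng.random() < (scaled - lower)
    return (lower + (1 if round_up else 0)) / scale


def round_to_nearest(x, frac_bits):
    """Conventional rounding to the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
    print("nearest:", round_to_nearest(0.3, 2))
    print("mean of stochastic roundings:", sum(samples) / len(samples))
```

Round-to-nearest maps every small update below half a grid step to zero, whereas stochastic rounding preserves such updates on average, which is one common explanation for why it helps low-precision training.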
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
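The claim in the preceding abstract that depth lets a piecewise linear network re-use computation exponentially often can be illustrated with a one-dimensional toy example. The folding map |2x - 1| on [0, 1], built from two rectifier units, is an assumption chosen for this sketch rather than a construction taken from the abstract, and the function names are hypothetical.

```python
def relu(z):
    return max(z, 0.0)


def fold(x):
    # |2x - 1| written with two rectifier units: relu(z) + relu(-z).
    z = 2.0 * x - 1.0
    return relu(z) + relu(-z)


def deep_fold(x, depth):
    """Apply the folding layer `depth` times."""
    for _ in range(depth):
        x = fold(x)
    return x


def count_linear_regions(depth, grid_bits=12):
    """Count linear pieces of deep_fold on [0, 1] using a dyadic grid.

    Valid for depth <= grid_bits, since every breakpoint k / 2**depth
    then lies exactly on a grid point and all arithmetic is exact.
    """
    n = 1 << grid_bits
    values = [deep_fold(i / n, depth) for i in range(n + 1)]
    slopes = [values[i + 1] - values[i] for i in range(n)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1


if __name__ == "__main__":
    for depth in range(1, 6):
        print(depth, count_linear_regions(depth))
```

Each extra layer adds only two units but doubles the number of linear pieces, whereas a single hidden layer with k rectifier units on a one-dimensional input can produce at most k + 1 pieces.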
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
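The first direction in the pooling abstract above, combining max and average pooling, can be illustrated with a small sketch. This is only one plausible reading under stated assumptions, not the authors' implementation: the function name mixed_pool and the mixing weight alpha are hypothetical, and alpha is held fixed here, whereas the abstract describes a pooling function that is learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    alpha:  hypothetical mixing weight in [0, 1]; 1.0 gives pure max pooling,
            0.0 gives pure average pooling. Fixed here for illustration only.
    """
    if not region:
        raise ValueError("region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(window, a))
```

In a trainable version, alpha would be a parameter updated by gradient descent together with the network weights; that step is omitted here.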
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
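The stochastic pooling procedure in the abstract above (pick one activation per pooling region, with probabilities given by the activities in that region) can be sketched as follows. This is a minimal illustration under assumptions: activations are taken to be non-negative (for example after a rectifier), windows are non-overlapping, an all-zero region is mapped to 0.0, and the helper names stochastic_pool and stochastic_pool2d are hypothetical. It covers only the training-time sampling step.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation with probability proportional to its value.

    Assumes every activation in region is non-negative.
    """
    total = sum(region)
    if total <= 0:
        return 0.0  # assumption: an all-zero region pools to zero
    # random.choices draws from a multinomial defined by the weights
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool2d(fmap, size, rng):
    """Apply stochastic pooling over non-overlapping size x size windows."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(stochastic_pool(region, rng))
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    feature_map = [
        [0.0, 0.0, 1.0, 2.0],
        [0.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 5.0, 5.0],
        [0.0, 0.0, 5.0, 5.0],
    ]
    print(stochastic_pool2d(feature_map, 2, rng))
```

Larger activations are more likely to be selected, but smaller non-zero ones still get a chance, which is the source of the regularizing noise.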
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of 
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
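As an illustrative sketch only (not the authors' implementation), the normalized $L_p$ norm described in the preceding abstract can be written in a few lines of Python. The function name lp_unit, the plain-list input, and the fixed order p are assumptions made for this example; in the setting the abstract describes, the inputs would be learned projections of units in the layer below, and the order may itself be learned per unit.

```python
def lp_unit(projections, p):
    """Normalized L_p norm: (mean(|z_i|^p))^(1/p) for p >= 1.

    Hypothetical helper for illustration; `projections` stands in for the
    projected inputs z_i that the unit pools over.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not projections:
        raise ValueError("projections must be non-empty")
    magnitudes = [abs(z) for z in projections]
    largest = max(magnitudes)
    if largest == 0.0:
        return 0.0
    # Factor out the largest magnitude so that large p does not overflow.
    scaled_mean = sum((m / largest) ** p for m in magnitudes) / len(magnitudes)
    return largest * scaled_mean ** (1.0 / p)


z = [3.0, -4.0]
print(lp_unit(z, 1))    # mean absolute value (average pooling of |z|)
print(lp_unit(z, 2))    # root-mean-square pooling
print(lp_unit(z, 200))  # approaches max pooling of |z| as p grows
```

With p = 1 the sketch reduces to averaging the magnitudes, with p = 2 to root-mean-square pooling, and for large p it approaches the largest magnitude, which matches the abstract's remark that the unit generalizes average, root-mean-square and max pooling.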
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
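The idea of combining max and average pooling, described in the preceding abstract, can be sketched for a single pooling region as follows. This is a minimal illustration under stated assumptions: the function name `mixed_pool` and the fixed mixing coefficient `alpha` are hypothetical, and in the abstract's setting the combination is learned rather than set by hand.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 gives pure average
    pooling, and values in between interpolate linearly.
    """
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


# Hypothetical 2x2 pooling region, flattened to a list.
region = [1.0, 3.0, 2.0, 6.0]
as_max = mixed_pool(region, alpha=1.0)    # same as max(region)
as_avg = mixed_pool(region, alpha=0.0)    # same as the mean of region
blended = mixed_pool(region, alpha=0.5)   # halfway between the two
```

Treating `alpha` as a trainable parameter (for example one per layer or per channel) would be one way to let gradient descent choose the blend, but that choice is an assumption of this sketch.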
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
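The stochastic pooling procedure in the preceding abstract, which picks one activation per pooling region according to a multinomial distribution given by the activities, can be sketched as below. The function name `stochastic_pool`, the handling of all-zero regions, and the demo values are assumptions for illustration; the sketch also assumes non-negative activations, such as those produced by a rectifier.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(region),
    so larger activations are picked more often. Activations are
    assumed to be non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


# Hypothetical demo: 3.0 should be picked about 75% of the time,
# 1.0 about 25% of the time, and 0.0 never.
rng = random.Random(0)
draws = [stochastic_pool([1.0, 0.0, 3.0], rng) for _ in range(10000)]
frac_large = draws.count(3.0) / len(draws)
```

Because the sampling probabilities come directly from the activations, the procedure introduces no extra hyper-parameter, which matches the abstract's description of the approach as hyper-parameter free.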
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) 
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
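The multilayer bootstrap network described just above can be illustrated with a minimal sketch of a single layer. The abstract only states that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; the Euclidean distance, the one-hot output encoding, and the names mbn_layer, n_clusterings and k are assumptions made here for illustration, not the authors' implementation.

```python
import numpy as np

def mbn_layer(X, n_clusterings, k, rng):
    """One bootstrap layer (illustrative sketch).

    Each clustering samples k data points as centers and encodes every
    input as a one-hot vector marking its nearest center (assumed encoding).
    """
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # Centers are randomly sampled data points, as in the abstract.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distances between points and centers, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), d2.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    # Concatenate the codes of all independent clusterings.
    return np.concatenate(codes, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
H = mbn_layer(X, n_clusterings=3, k=5, rng=rng)
```

Stacking several such layers, each fed with the previous layer's output, would give a multilayer variant under the same assumptions.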
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
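To make the input-to-hidden variant from the abstract above concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass in which only the input-to-hidden term is standardized with mini-batch statistics. The tanh cell, the absence of learned scale and shift parameters, and the names batch_norm and rnn_forward are assumptions for illustration, not the paper's exact model.

```python
import numpy as np

def batch_norm(a, eps=1e-5):
    """Standardize each feature with mini-batch statistics (no learned scale/shift)."""
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def rnn_forward(xs, W_x, W_h, b):
    """Run a tanh RNN over xs of shape (T, batch, n_in).

    Batch normalization is applied to the input-to-hidden term only;
    the hidden-to-hidden term is left untouched.
    """
    h = np.zeros((xs.shape[1], W_h.shape[0]))
    for x_t in xs:
        h = np.tanh(batch_norm(x_t @ W_x) + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 8, 3))          # 6 time steps, batch of 8, 3 inputs
W_x = rng.normal(size=(3, 4))
W_h = rng.normal(size=(4, 4)) * 0.1
b = np.zeros(4)
h_final = rnn_forward(xs, W_x, W_h, b)
```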
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
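The geometric sampling idea in the abstract above can be pictured as a schedule of sample sizes that grows by a constant factor per outer iteration until the full training set is used. The sketch below only illustrates such a schedule; the function name geometric_sample_sizes, the starting size, and the growth factor are hypothetical choices, and the abstract does not specify how the authors set them.

```python
import math

def geometric_sample_sizes(n_total, n_start, growth):
    """Return sample sizes that grow geometrically and end at n_total.

    n_start and growth are illustrative parameters, not values from the paper.
    """
    if n_start <= 0 or growth <= 1.0:
        raise ValueError("n_start must be positive and growth must exceed 1")
    sizes = []
    n = float(n_start)
    while n < n_total:
        sizes.append(int(math.ceil(n)))
        n *= growth
    # The last iteration uses the whole data set.
    sizes.append(n_total)
    return sizes

schedule = geometric_sample_sizes(n_total=1000, n_start=100, growth=2.0)
```

With these example values the schedule is 100, 200, 400, 800 and then the full 1000 samples.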
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
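As an illustration of the stochastic rounding mentioned in the preceding abstract, here is a minimal sketch in plain Python. It assumes one common definition: a value is rounded to one of its two neighbouring fixed-point grid points, and rounded up with probability equal to its fractional distance from the lower one, so the result is unbiased in expectation. The function name stochastic_round and the frac_bits parameter are hypothetical, and word-length saturation is not modelled; the abstract itself does not spell out these details.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    Assumed scheme: round up with probability equal to the fractional
    distance from the lower grid point, otherwise round down.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # index of the grid point at or below x
    prob_up = scaled - lower        # in [0, 1); 0 when x is already on the grid
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time, whereas the randomized scheme preserves small values on average, which is one plausible reason the rounding scheme matters for low-precision training.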
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
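To make the idea of combining max and average pooling from the preceding abstract concrete, here is a minimal sketch for a single pooling region. It assumes the simplest form of combination, a convex mix controlled by a weight alpha; the abstract mentions two combination strategies without defining them, so this form, the name mixed_pool, and the alpha parameter are illustrative assumptions. In a trained network alpha would be learned rather than passed in by hand.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    Assumed form: alpha * max + (1 - alpha) * mean, with alpha in [0, 1].
    alpha = 1 recovers max pooling and alpha = 0 recovers average pooling.
    """
    if not region:
        raise ValueError("region must contain at least one activation")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mean_value = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * mean_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```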
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
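The stochastic pooling procedure described in the preceding abstract can be sketched for a single pooling region as follows. The sketch assumes non-negative activations (for example, after a rectifier) and normalizes them into multinomial probabilities; returning 0.0 for an all-zero region is an assumption made here for illustration, as is the function name stochastic_pool.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. p_i = a_i / sum(a). Activations are assumed to be non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total <= 0:
        return 0.0  # assumption: an all-zero region pools to zero
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.0, 1.0, 3.0, 0.0]
    samples = [stochastic_pool(region, rng) for _ in range(10000)]
    # Roughly three quarters of the samples should be 3.0.
    print(samples.count(3.0) / len(samples))
```

Unlike max pooling, which would always return 3.0 here, the sampled output occasionally passes the smaller activation through, which is the source of the regularizing noise.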
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding and vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
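For reference, the link between Fisher information and estimator variance that the abstract above invokes can be written out for the simplest case. The scalar setting and the symbols $\theta$, $\hat{\theta}$ and $I(\theta)$ are assumptions chosen here for illustration and are not notation taken from the paper: for any unbiased estimator $\hat{\theta}$ of a scalar parameter $\theta$ under a density $p(x;\theta)$,

```latex
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}_{x \sim p(\cdot\,;\theta)}\!\left[\left(\frac{\partial}{\partial \theta}\log p(x;\theta)\right)^{2}\right].
```

A parameter with large Fisher information therefore admits low-variance estimates, which is the sense in which it can be called "confident".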
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
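As a rough illustration of what a "dropout based GD method" for a convex ERM could look like, the sketch below trains a logistic regression (a GLM) with stochastic gradient steps on randomly masked inputs. It is only a sketch under stated assumptions: masking input features with rescaling (inverted dropout), the function names, the learning rate and the toy data are all hypothetical choices made here, not the algorithm analyzed in the abstract above.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dropout_sgd_logreg(xs, ys, keep_prob=0.8, lr=0.1, epochs=200, seed=0):
    """Logistic regression trained by SGD with dropout on the input features.

    Assumption: each feature is kept with probability keep_prob and the
    survivors are rescaled by 1 / keep_prob (inverted dropout).
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0
                    for _ in range(dim)]
            xd = [xi * mi for xi, mi in zip(x, mask)]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, xd)))
            # Gradient of the logistic loss on the masked input is (p - y) * xd.
            for j in range(dim):
                w[j] -= lr * (p - y) * xd[j]
    return w


def predict(w, x):
    # No dropout at prediction time.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# Hypothetical toy data: the label is 1 when both features are positive.
toy_xs = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
toy_ys = [1, 1, 0, 0]
toy_w = dropout_sgd_logreg(toy_xs, toy_ys)
```

On this toy set every update moves the weights in the same direction, so `predict(toy_w, x)` recovers the training labels; the stability and privacy properties claimed in the abstract are not demonstrated by the sketch.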
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
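The layer structure described in the last abstract above (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be sketched in a few lines. The encoding step is an assumption made here for illustration: each clustering emits a one-hot indicator of the nearest center, and the indicators are concatenated. The function names and parameters are hypothetical as well.

```python
import random


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bootstrap_layer(points, num_clusterings=3, k=2, seed=0):
    """One layer: several independent k-centers clusterings.

    The centers of each clustering are k data points sampled at random.
    Assumption: a point is encoded by concatenating, over all clusterings,
    the one-hot indicator of its nearest center.
    """
    rng = random.Random(seed)
    clusterings = [rng.sample(points, k) for _ in range(num_clusterings)]
    encoded = []
    for p in points:
        code = []
        for centers in clusterings:
            dists = [sq_dist(p, c) for c in centers]
            nearest = dists.index(min(dists))
            code.extend(1 if i == nearest else 0 for i in range(k))
        encoded.append(code)
    return encoded


# Hypothetical toy data: two well-separated groups in the plane.
toy_points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
toy_codes = bootstrap_layer(toy_points)
```

Stacking would feed `toy_codes` into another call of `bootstrap_layer`; how the number of centers changes from layer to layer is not stated in the abstract and is left out here.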
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
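To make the distinction in the abstract above concrete, the sketch below applies batch normalization only to the input-to-hidden term of a plain tanh RNN step and leaves the hidden-to-hidden term untouched. It is a minimal sketch under assumptions made here: statistics are computed over the mini-batch at a single time step, there is no learned scale or shift, and all names (`batch_norm`, `rnn_step`, `W_x`, `W_h`) are hypothetical rather than taken from the paper.

```python
import math


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def batch_norm(batch, eps=1e-5):
    """Standardize each feature across the mini-batch (no learned scale/shift)."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]


def rnn_step(x_batch, h_batch, W_x, W_h):
    """One tanh RNN step: h_t = tanh(BN(W_x x_t) + W_h h_{t-1})."""
    # Normalize only the input-to-hidden contribution.
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    # The hidden-to-hidden contribution is used as is.
    rec_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(i_row, r_row)]
            for i_row, r_row in zip(in_term, rec_term)]


# Hypothetical mini-batch of three 2-dimensional inputs and zero initial states.
toy_x = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
toy_h = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_W_x = [[0.5, -0.2], [0.1, 0.3]]
toy_W_h = [[0.9, 0.0], [0.0, 0.9]]
toy_next_h = rnn_step(toy_x, toy_h, toy_W_x, toy_W_h)
```

A full implementation would also need to decide how statistics are shared across time steps and what to use at test time; the abstract does not specify either, so the sketch leaves both out.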
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
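The stochastic rounding mentioned in the fixed-point abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `frac_bits` parameter and the choice of 8 fractional bits are assumptions made here, and the sketch ignores word length, saturation and overflow. A value is rounded up to the next grid point with probability equal to its fractional remainder, so the rounded value equals the input in expectation, whereas round-to-nearest silently discards anything smaller than half a grid step.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto the grid of multiples of 2**-frac_bits.

    The value is rounded up with probability equal to its distance from
    the lower grid point (measured in grid steps), which makes the result
    unbiased in expectation. Illustrative only: no saturation or overflow.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up onto the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001  # smaller than half a grid step (2**-8 is about 0.0039)
    n = 20000
    mean_sr = sum(stochastic_round(update, 8, rng) for _ in range(n)) / n
    print("round to nearest:", round_to_nearest(update))
    print("mean of stochastic rounding:", mean_sr)
```

Averaged over many draws, the stochastically rounded value stays close to the small update, which is one plausible reading of why the rounding scheme matters when weight updates fall below the fixed-point resolution.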
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
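The claim above that depth lets a piecewise linear network re-use computation exponentially often can be illustrated with a toy one-dimensional example. The construction below is chosen here for illustration and is not taken from the abstract: each hypothetical "layer" `fold` uses two ReLU units to compute |2x - 1| on [0, 1], which folds the interval onto itself, and composing it `depth` times yields 2**depth linear pieces from only 2 * depth units. The helper names and the grid size are assumptions.

```python
def relu(z):
    return max(0.0, z)


def fold(x):
    """One layer built from two ReLU units: computes |2x - 1| on [0, 1]."""
    return relu(2.0 * x - 1.0) + relu(1.0 - 2.0 * x)


def deep_fold(x, depth):
    """Compose the folding layer `depth` times."""
    for _ in range(depth):
        x = fold(x)
    return x


def count_linear_pieces(depth, grid=4096):
    """Count linear pieces of deep_fold on [0, 1] by detecting slope changes.

    The breakpoints are dyadic rationals j / 2**depth, so a power-of-two
    grid with grid >= 2**depth places every breakpoint exactly on a sample
    point and the arithmetic is exact in floating point.
    """
    ys = [deep_fold(i / grid, depth) for i in range(grid + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(grid)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1


if __name__ == "__main__":
    for depth in range(1, 6):
        print(depth, count_linear_pieces(depth))
```

A single hidden layer of ReLU units on a scalar input can add at most one breakpoint per unit, so matching 2**depth pieces with a shallow network of this kind would need a number of units that grows exponentially in `depth`.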
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
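The stochastic pooling procedure described in the abstract above can be illustrated with a minimal sketch. It assumes non-negative activations (for example, after a rectifier) and one-dimensional, non-overlapping pooling regions; the function names `stochastic_pool` and `stochastic_pool_1d` are hypothetical and are not taken from the paper, and only the training-time sampling step is shown.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution given by the activities in the region.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # No active unit in this region: return zero (an assumption of this sketch).
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(activations, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of length `size`."""
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations) - size + 1, size)
    ]
```

For instance, calling `stochastic_pool_1d([0, 0, 3, 0, 1, 2, 0, 0], 4)` always returns 3 for the first window, since it holds the only non-zero activation, and returns either 1 or 2 for the second window, with 2 being twice as likely.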
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
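As a rough illustration of the normalized $L_p$ norm mentioned in the abstract above, the following sketch computes $(\frac{1}{N}\sum_i |z_i|^p)^{1/p}$ over a list of projections $z_i$. This exact form, the function name `lp_unit`, and the omission of any learned projection weights or centering terms are assumptions made for the example, not details stated in the abstract.

```python
def lp_unit(z, p):
    """Normalized L_p norm of the projections z (assumed form, see lead-in).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    n = len(z)
    return (sum(abs(float(v)) ** p for v in z) / n) ** (1.0 / p)
```

With this form, `lp_unit([1, -3], 1)` is the average of the absolute values, `lp_unit([3, 4], 2)` is the root-mean-square, and a large order such as `lp_unit([1, 2, 4], 200)` comes close to the max-pooling value of 4, which matches the abstract's remark that the unit generalizes average, root-mean-square and max pooling.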
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
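The batch-normalization idea described in the abstract above can be illustrated with a small sketch. It standardizes each feature with mini-batch statistics and then feeds only the normalized input-to-hidden term into a tanh recurrent step. The function names batch_norm and rnn_step, the scalar gamma and beta, and the plain-list data layout are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature column of `batch` using mini-batch statistics.

    `batch` is a list of equal-length rows (one row per example).
    `gamma` and `beta` are scalar scale and shift values (an assumption;
    per-feature parameters are more common in practice).
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dims)
        ]
        for row in batch
    ]


def rnn_step(wx_normed, h_prev, w_hh):
    """One recurrent step for a single example.

    Computes h_t = tanh(BN(W_x x_t) + W_h h_{t-1}), where `wx_normed` is the
    already-normalized input-to-hidden term. The hidden-to-hidden term is
    deliberately left unnormalized.
    """
    hidden = len(h_prev)
    return [
        math.tanh(
            wx_normed[i] + sum(w_hh[i][k] * h_prev[k] for k in range(hidden))
        )
        for i in range(hidden)
    ]
```

A caller would compute the input projection for every example in the mini-batch, pass the resulting rows through batch_norm, and then call rnn_step once per example with its normalized row.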
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
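The stochastic rounding scheme mentioned in the abstract above can be sketched in a few lines. A value is rounded to one of its two neighbouring fixed-point grid points, and the upper neighbour is chosen with probability equal to the fractional remainder, so the result is unbiased in expectation. The function name stochastic_round, the default of 8 fractional bits in a 16-bit signed word, and the saturation behaviour are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round `x` to a signed fixed-point grid with `frac_bits` fractional bits.

    The grid spacing is eps = 2 ** -frac_bits. The value is rounded down to
    the nearest grid point, then bumped up by one step with probability
    (x - floor) / eps, which makes the expected result equal to `x`
    whenever no saturation occurs.
    """
    eps = 2.0 ** -frac_bits
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed integer with `word_bits` bits
    # (an assumed overflow policy).
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower * eps
```

With frac_bits=2 the grid spacing is 0.25, so repeated calls on 0.3 return either 0.25 or 0.5 and average out close to 0.3, whereas round-to-nearest would always return 0.25 and silently discard the small difference.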
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
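The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked with a small sketch. This is only an illustration under stated assumptions: the function name lp_pool is hypothetical, the learned projections and the learned order $p$ described in the abstract are left out, and the inputs are taken to be the already-projected values.

```python
import math

def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p), for p >= 1.

    Hypothetical helper for illustration; not the paper's full unit.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    largest = max(abs(v) for v in values)
    if math.isinf(p) or largest == 0.0:
        # The limit p -> infinity is max pooling over |v|.
        return largest
    n = len(values)
    # Factor out the largest magnitude so large p does not overflow.
    scaled = sum((abs(v) / largest) ** p for v in values) / n
    return largest * scaled ** (1.0 / p)

values = [1.0, -2.0, 3.0, -4.0]
mean_abs = lp_pool(values, 1)      # p = 1: average of absolute values
rms = lp_pool(values, 2)           # p = 2: root-mean-square
near_max = lp_pool(values, 200)    # large p: approaches max |v|
```

With p = 1 the result is the mean absolute value, with p = 2 the root-mean-square, and as p grows the result approaches the largest magnitude, which is the sense in which one order parameter interpolates between the pooling operators.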
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm uses only 5% of the computations (multiplications) required by traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
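The standardization step that the preceding abstract refers to ("uses mini-batch statistics to standardize features") can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `batch_norm`, the scalar `gamma` and `beta` parameters, and the `eps` default are assumptions made for the example, and the sketch covers only the per-feature normalization, not its placement inside an RNN.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature column of `batch` with mini-batch statistics.

    `batch` is a list of equal-length rows (one row per example).
    `gamma`, `beta` and `eps` are illustrative scalars, not values
    taken from the abstract.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased (divide-by-n) variance over the mini-batch.
        var = sum((v - mean) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            # Shift to zero mean, scale to unit variance, then rescale.
            out[i][j] = gamma * (column[i] - mean) * inv_std + beta
    return out


if __name__ == "__main__":
    normalized = batch_norm([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
    for row in normalized:
        print(row)
```

In the setting the abstract describes, a function of this kind would be applied to the input-to-hidden pre-activations at each time step; how the statistics are shared across time steps is a design choice the abstract does not specify.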
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
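The stochastic rounding scheme mentioned in the preceding abstract can be illustrated with a short sketch. A value is rounded to one of its two neighbouring fixed-point grid points, rounding up with probability equal to its fractional distance from the lower neighbour, so the result is unbiased in expectation. The function name `stochastic_round_fixed`, the split of a 16-bit word into `frac_bits` fractional bits, and the saturation behaviour are assumptions made for this example, not details taken from the abstract.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Stochastically round `x` onto a signed fixed-point grid.

    The grid spacing is 2**-frac_bits. `frac_bits` and `word_bits`
    are hypothetical parameters chosen for illustration.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so the expected value of the result equals x (before saturation).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Every sample is 0.25 or 0.5; their average is close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates would be lost systematically; the randomized rule keeps them on average, which is the property the abstract credits for training with 16-bit fixed-point numbers.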
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
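The idea of combining max and average pooling in the preceding abstract can be illustrated with a small sketch. It assumes one simple form of combination, a convex blend alpha * max + (1 - alpha) * average over non-overlapping windows, with alpha held fixed rather than learned. The function name mixed_pool2d and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def mixed_pool2d(x, size=2, alpha=0.5):
    """Blend max and average pooling over non-overlapping size x size windows.

    alpha=1.0 reduces to max pooling and alpha=0.0 to average pooling.
    Assumption: alpha is a fixed scalar here; a learned variant would
    treat it as a trainable parameter.
    """
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must tile evenly"
    # After this reshape, axes 1 and 3 index positions inside each window.
    windows = x.reshape(h // size, size, w // size, size)
    max_out = windows.max(axis=(1, 3))
    avg_out = windows.mean(axis=(1, 3))
    return alpha * max_out + (1.0 - alpha) * avg_out

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
blended = mixed_pool2d(x, size=2, alpha=0.5)
print(blended)
```

With alpha = 0.5 the top-left window [1, 2, 3, 4] yields 0.5 * 4 + 0.5 * 2.5 = 3.25, halfway between its max and its mean.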
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
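The stochastic pooling procedure in the preceding abstract can be sketched as follows. The sketch assumes non-negative activations (for example after a ReLU), non-overlapping square regions, and sampling probabilities obtained by dividing each activation by the sum over its region. The function name stochastic_pool2d and the handling of all-zero regions are illustrative choices, not details from the paper.

```python
import numpy as np

def stochastic_pool2d(x, size=2, rng=None):
    """Sample one activation per pooling region, with probability
    proportional to the activation's value within that region."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must tile evenly"
    assert np.all(x >= 0), "activations are assumed non-negative"
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            region = x[i * size:(i + 1) * size, j * size:(j + 1) * size].ravel()
            total = region.sum()
            if total == 0:
                continue  # assumption: an all-zero region pools to 0
            probs = region / total  # multinomial probabilities for this region
            out[i, j] = rng.choice(region, p=probs)
    return out

x = np.array([[0., 0., 1., 2.],
              [0., 5., 3., 4.],
              [0., 0., 7., 0.],
              [0., 0., 0., 0.]])
pooled = stochastic_pool2d(x, size=2, rng=np.random.default_rng(0))
print(pooled)
```

Regions with a single non-zero activation always return that activation, while the top-right region returns 1, 2, 3 or 4 with probabilities 0.1, 0.2, 0.3 and 0.4. This sketch covers the training-time sampling only.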
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
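The stochastic rounding idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with a hypothetical `frac_bits` fractional bits; the function names and the value of `frac_bits` are illustrative choices, not taken from the paper. A value is rounded up with probability equal to its distance from the lower grid point, so the rounded value is unbiased in expectation, whereas round-to-nearest silently drops updates smaller than half a grid step.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a grid of step 2**-frac_bits (frac_bits is hypothetical).

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up is the distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-half-up onto the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale
```

With `frac_bits=8` the grid step is 1/256, so a small value such as 0.001 always becomes 0.0 under round-to-nearest, while the average of many stochastic roundings stays close to 0.001.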
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks.
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
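One way to read "combining of max and average pooling" in the abstract above is as a convex combination of the two pooled values. The sketch below assumes that reading; the function name `mixed_pool` and the mixing coefficient `a` are hypothetical, and `a` is passed in as a plain number rather than learned.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    a: mixing coefficient in [0, 1] (hypothetical; fixed here, not learned).
       a = 1 gives max pooling, a = 0 gives average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * avg_value
```

For the region `[1, 2, 3, 6]` the maximum is 6 and the average is 3, so `a = 0.5` yields 4.5.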
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
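The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region. This is a minimal illustration that assumes non-negative activations (for example, after a rectifier); the function name `stochastic_pool` and the handling of an all-zero region are assumptions made here, not details from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution given by the normalized activities.
    Assumes all activations are non-negative.
    """
    if any(v < 0 for v in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumed convention: an all-zero region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

For the region `[0.0, 1.0, 3.0]` the zero activation is never selected, and 3.0 is selected about three times as often as 1.0.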
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
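The layer structure described in the multilayer bootstrap network abstract above can be illustrated with a minimal sketch. It only shows one layer built from independent k-centers clusterings whose centers are randomly sampled data points. The function name `bootstrap_layer`, the squared Euclidean distance, and the one-hot output encoding are assumptions made for illustration, not details taken from the abstract.

```python
import numpy as np

def bootstrap_layer(X, n_clusterings, k, rng):
    """One layer: independent k-centers clusterings with randomly sampled centers."""
    n = X.shape[0]
    codes = []
    for _ in range(n_clusterings):
        # The centers of each clustering are k data points drawn at random.
        centers = X[rng.choice(n, size=k, replace=False)]
        # Squared Euclidean distance from every point to every center (assumed metric).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # One-hot indicator of the nearest center (assumed output encoding).
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), d2.argmin(axis=1)] = 1.0
        codes.append(one_hot)
    # The layer output concatenates the codes of all clusterings.
    return np.concatenate(codes, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # toy data, purely illustrative
H = bootstrap_layer(X, n_clusterings=3, k=5, rng=rng)
```

Stacking would feed `H` into another call of the same function; how the number of centers changes from layer to layer is not specified in the abstract and is left out here.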
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
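As a rough illustration of the input-to-hidden variant mentioned in the abstract above, the sketch below standardizes only the term computed from the input with mini-batch statistics and leaves the hidden-to-hidden term untouched. The tanh cell, the per-time-step statistics, the absence of a learned gain and shift, and all names (`batch_norm`, `rnn_forward`, `W_x`, `W_h`) are assumptions for illustration rather than the authors' exact setup.

```python
import numpy as np

def batch_norm(a, eps=1e-5):
    """Standardize each feature using statistics of the current mini-batch."""
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return (a - mean) / np.sqrt(var + eps)

def rnn_forward(xs, W_x, W_h, b):
    """Plain tanh RNN with batch normalization on the input-to-hidden term only.

    xs has shape (time, batch, input_dim).
    """
    T, B, _ = xs.shape
    h = np.zeros((B, W_h.shape[0]))
    hs = []
    for t in range(T):
        # Normalized input-to-hidden term, unnormalized hidden-to-hidden term.
        h = np.tanh(batch_norm(xs[t] @ W_x) + h @ W_h + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 8, 3))       # toy sequence batch
W_x = rng.normal(size=(3, 4))
W_h = 0.1 * rng.normal(size=(4, 4))
b = np.zeros(4)
hs = rnn_forward(xs, W_x, W_h, b)
normed = batch_norm(xs[0] @ W_x)      # normalized pre-activation at t = 0
```

At test time a real implementation would need stored statistics instead of mini-batch ones; that part is omitted.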
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
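The stochastic rounding named in the limited-precision abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the choice of 8 fractional bits, and the omission of saturation to a 16-bit word are all assumptions made for the example. The point it shows is that rounding up with probability equal to the fractional remainder keeps the rounded value unbiased in expectation, whereas round-to-nearest silently discards values smaller than half a quantization step.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Round x to the nearest multiple of 2**-frac_bits (ties rounded up)."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a neighbouring multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, and down otherwise, so the
    expected result equals x. Saturation to a finite word length is
    deliberately left out of this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    remainder = scaled - lower  # lies in [0, 1)
    if rng.random() < remainder:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    small_update = 0.001  # smaller than half a step of 1/256
    samples = [stochastic_round(small_update, 8, rng) for _ in range(20000)]
    print("round to nearest:", round_to_nearest(small_update, 8))
    print("mean of stochastic roundings:", sum(samples) / len(samples))
```

With round-to-nearest, a small gradient update like the one in the demo is always rounded to zero, while the stochastic variant preserves it on average, which is one plausible reading of why the abstract reports the rounding scheme as crucial.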
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
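The pooling abstract above mentions combining max and average pooling without giving a formula. The following is a minimal sketch of one simple way to do that, a convex combination of the two, and is not necessarily either of the two strategies the authors describe. The function name `mixed_pool` and the mixing proportion `a` are hypothetical, chosen for illustration only.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a hypothetical mixing proportion in [0, 1]: a=1 gives pure max
    pooling, a=0 gives pure average pooling. In a trained network it would
    be a learned parameter; here it is simply passed in.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


region = [1.0, 2.0, 3.0, 6.0]
blended = mixed_pool(region, 0.5)  # halfway between the max and the average
```

Setting `a` to 0 or 1 recovers the two conventional operators, so a learned `a` can only match or refine them on the training objective.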
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
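The stochastic pooling procedure described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes non-negative activations (for example after a rectifier), and the names `stochastic_pool` and `expected_pool` are hypothetical. The probability-weighted average in `expected_pool` is included only as the expectation of the sampling step.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Assumes non-negative activations; each activation is picked with
    probability proportional to its own value (a multinomial over the region).
    """
    total = sum(region)
    if total == 0:
        return 0.0  # every activation is zero, so there is nothing to sample
    return rng.choices(region, weights=region, k=1)[0]


def expected_pool(region):
    """Probability-weighted average: the expectation of stochastic_pool."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total


rng = random.Random(0)  # fixed seed so the run is repeatable
region = [0.0, 1.0, 3.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
```

Because the weights are the activations themselves, a zero activation is never selected and larger activations are selected more often, which is what distinguishes this from uniform random sampling.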
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
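The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' unit: it takes already-computed projection values as input, omits any learned weights or offsets, and the name `lp_unit` is hypothetical.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a list of projection values.

    Computes (mean(|z|^p))^(1/p). With p=1 this is the mean absolute value,
    with p=2 the root mean square, and for large p it approaches max(|z|).
    """
    if p < 1:
        raise ValueError("order p is assumed to be at least 1 in this sketch")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


mean_abs = lp_unit([1.0, -3.0], 1)          # average of absolute values
rms = lp_unit([3.0, 4.0], 2)                # root mean square
near_max = lp_unit([1.0, 2.0, 4.0], 200)    # close to the largest magnitude
```

Treating `p` as a per-unit parameter, as the abstract proposes, lets each unit sit anywhere between average-like and max-like behavior.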
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
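The dropout step referred to in the preceding abstract (randomly dropping nodes of a hidden layer) can be illustrated with a minimal sketch. This is not the authors' algorithm or analysis; it assumes the common "inverted dropout" convention, in which surviving units are rescaled so the expected activation is unchanged, and the names `dropout`, `drop_prob`, and `hidden` are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout (an assumed convention, not taken from the abstract):
    zero each unit with probability drop_prob and rescale the survivors
    by 1 / (1 - drop_prob) so the expected value of each unit is preserved."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                 # fixed seed so the run is repeatable
hidden = [1.0] * 10000                 # toy hidden-layer activations
dropped = dropout(hidden, drop_prob=0.2, rng=rng)
mean_after = sum(dropped) / len(dropped)  # stays close to the original mean of 1.0
```

Rescaling at training time means the layer can be used unchanged at test time, which is one reason this variant is widely used.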
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
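As an illustration of the input-to-hidden variant described in the preceding abstract, the following is a minimal sketch of one tanh RNN step in which only the input-to-hidden term is standardized with mini-batch statistics. It is a simplification under stated assumptions, not the authors' implementation: the learned scale and shift parameters and the bias are omitted, and the helper names `batch_norm`, `matmul`, and `rnn_step` and all weights are hypothetical.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) using the mini-batch mean and variance.
    The learned scale/shift of full batch normalization is omitted for brevity."""
    n, n_feat = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_feat)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_feat)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(n_feat)] for row in batch]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def rnn_step(x_t, h_prev, w_xh, w_hh):
    """One tanh RNN step: normalize the input-to-hidden term only,
    leaving the hidden-to-hidden term untouched."""
    input_term = batch_norm(matmul(x_t, w_xh))
    hidden_term = matmul(h_prev, w_hh)
    return [[math.tanh(i + h) for i, h in zip(i_row, h_row)]
            for i_row, h_row in zip(input_term, hidden_term)]

x_t = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]     # mini-batch of 3 examples, 2 inputs
h_prev = [[0.0, 0.0] for _ in range(3)]        # initial hidden state, 2 units
w_xh = [[0.5, -1.0], [1.0, 0.5]]               # hypothetical input-to-hidden weights
w_hh = [[0.1, 0.0], [0.0, 0.1]]                # hypothetical hidden-to-hidden weights
h_t = rnn_step(x_t, h_prev, w_xh, w_hh)
```

Normalizing only `matmul(x_t, w_xh)` keeps the recurrent path unchanged, which matches the variant the abstract reports as converging faster.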
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
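To make the rounding scheme concrete, here is a small sketch of stochastic rounding to a signed fixed-point grid: a value is rounded up with probability equal to its fractional remainder on the grid, so the result is unbiased in expectation. The split of the 16-bit word into 8 fractional bits and the saturation behaviour are assumptions made for illustration, and stochastic_round is a hypothetical name.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    Assumption: frac_bits of the word hold the fraction, and out-of-range
    values saturate to the largest or smallest representable number.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance above the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    quantized = max(min_int, min(max_int, lower))
    return quantized / scale
```

Averaged over many draws, the rounded values approach the original value, which is the property that round-to-nearest lacks when updates are smaller than half a grid step.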
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy. Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks.
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase, up to exponentially, the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
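The idea of combining max and average pooling described in the preceding abstract can be illustrated with a minimal sketch. The abstract does not spell out the two combination strategies, so the forms below are assumptions made for illustration: a fixed mixing proportion `a`, and a proportion computed per region from hypothetical `gate_weights`. In a real network these quantities would be learned by gradient descent; here they are plain arguments.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is an assumed mixing proportion in [0, 1]:
    a = 1 gives max pooling, a = 0 gives average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def gated_pool(region, gate_weights):
    """Variant where the mixing proportion depends on the region itself.

    The proportion is the sigmoid of a dot product between the region and
    a hypothetical weight vector, so it can differ from region to region.
    """
    score = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, a)
```

With `region = [1.0, 2.0, 3.0, 6.0]`, `mixed_pool(region, 1.0)` reduces to max pooling and `mixed_pool(region, 0.0)` to average pooling; intermediate values interpolate between the two.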
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
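The stochastic pooling procedure described in the preceding abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: activations are assumed non-negative (for example, after a rectifier), the multinomial probabilities are taken to be each activation divided by the region's sum, and an all-zero region is assumed to pool to zero. The function name and the `rng` parameter are hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. from a multinomial distribution given by the normalized activities.
    """
    if any(x < 0 for x in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible. Larger activations are chosen more often, but smaller non-zero activations still have a chance of being selected, which is the source of the regularizing noise.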
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel.
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
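The first direction mentioned in the pooling abstract above, combining max and average pooling, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the mixing coefficient alpha is a fixed number passed in by the caller, whereas the abstract describes it as learned, and the function name mixed_pool is hypothetical rather than taken from the paper.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 reduces to max pooling and alpha = 0.0 to average pooling.
    Assumption: alpha is a fixed scalar here; in a trained network it would
    be a parameter updated by gradient descent.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * mean_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

Because the output is a convex combination of the two conventional operators, it is differentiable with respect to alpha, which is what would allow the mixing proportion to be learned alongside the other weights.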
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
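The stochastic pooling procedure in the abstract above can be sketched as follows. The sketch assumes that activations in a pooling region are nonnegative (for example, outputs of rectifier units) and that the multinomial probabilities are the activations divided by their sum; the handling of an all-zero region and the function name stochastic_pool are assumptions made for illustration, not details given in the abstract.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value.
    Assumption: activations are nonnegative; an all-zero region returns 0.0.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be nonnegative")
    total = sum(region)
    if total == 0:
        return 0.0
    probabilities = [a / total for a in region]
    # random.Random.choices draws one element according to the given weights.
    return rng.choices(region, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    region = [0.0, 1.0, 3.0]
    samples = [stochastic_pool(region, rng) for _ in range(10)]
    print(samples)
```

Larger activations are selected more often, but smaller nonzero activations are still selected occasionally, which is the source of the regularizing noise. This sketch covers only the training-time sampling step.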
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR-10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
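As a rough illustration of the pooling interpretation of the $L_p$ unit described above, the following is a minimal sketch, not the authors' implementation. It assumes the unit is simply the normalized $L_p$ norm of N linear projections of the input with a fixed order p; the function name lp_unit and the arguments W, b and p are hypothetical, and details of the published unit such as learned orders are omitted.

```python
import numpy as np

def lp_unit(x, W, b, p):
    """Normalized L_p norm over N linear projections of x (illustrative only).

    x: input vector, shape (d,)
    W: projection matrix, shape (N, d)  (hypothetical parameter name)
    b: bias vector, shape (N,)          (hypothetical parameter name)
    p: order of the norm, p >= 1, fixed here rather than learned
    """
    z = W @ x + b                                   # the N projections
    return float(np.mean(np.abs(z) ** p) ** (1.0 / p))

# Two projections with values 3 and -4.
x = np.array([3.0, -4.0])
W = np.eye(2)
b = np.zeros(2)

avg_like = lp_unit(x, W, b, p=1)     # mean of absolute values
rms_like = lp_unit(x, W, b, p=2)     # root-mean-square
max_like = lp_unit(x, W, b, p=200)   # approaches max |z_i| as p grows
```

Under these assumptions, p = 1 gives the average of the absolute projections, p = 2 gives their root-mean-square, and a large p approaches max pooling over the absolute projections.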
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
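The input-to-hidden variant described in the preceding abstract can be illustrated with a minimal sketch. It assumes a plain tanh RNN, statistics computed over the mini-batch at a single time step, and no learned scale or shift; the names `batch_norm`, `matvec`, `rnn_step`, `w_in` and `w_rec` are hypothetical and are not taken from the paper.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature using mini-batch mean and variance."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def matvec(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def rnn_step(x_batch, h_batch, w_in, w_rec):
    """One recurrent step: normalize only the input-to-hidden projection."""
    in_proj = batch_norm([matvec(w_in, x) for x in x_batch])
    # The hidden-to-hidden projection is deliberately left unnormalized.
    rec_proj = [matvec(w_rec, h) for h in h_batch]
    return [
        [math.tanh(a + b) for a, b in zip(i_row, r_row)]
        for i_row, r_row in zip(in_proj, rec_proj)
    ]
```

Calling `rnn_step` once per time step with the previous hidden states gives a recurrent pass in which only the input projection is standardized.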
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a WER of 3.47 on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
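The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step 2**-frac_bits and rounds up with probability equal to the fractional distance from the lower grid point, so the rounded value is unbiased in expectation. The function name and parameters are hypothetical, and saturation to a fixed word length is left out.

```python
import math
import random


def stochastic_round(x, frac_bits, rng=random):
    """Round x onto a grid of step 2**-frac_bits (illustrative sketch).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, and down otherwise.
    Overflow/saturation handling is intentionally omitted.
    """
    scale = 1 << frac_bits           # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)       # nearest grid point at or below x
    prob_up = scaled - lower         # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale
```

Averaged over many draws the result approaches the original value, which is the property that round-to-nearest lacks for small updates.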
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
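The first direction in the pooling abstract above, combining max and average pooling, can be sketched as follows. The abstract does not spell out the two strategies, so both forms here are assumptions made for illustration: a fixed (learnable) mixing proportion `a`, and a gate whose proportion is a sigmoid of the region's contents. All names are hypothetical.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is the mixing proportion in [0, 1]: a=1 gives max pooling,
    a=0 gives average pooling. In a network `a` would be learned.
    """
    return a * max(region) + (1.0 - a) * sum(region) / len(region)


def gated_pool(region, gate_weights):
    """Same blend, but the proportion depends on the region itself.

    The gate is a sigmoid of a weighted sum of the region's values;
    `gate_weights` stands in for learned parameters.
    """
    z = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, a)
```

With `a` fixed at either extreme the sketch reduces to the conventional operations, which makes it easy to compare against a plain max or average pooling baseline.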
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
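The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region. This is a minimal illustration, assuming non-negative activations (for example, after a rectifier) and normalising them into multinomial probabilities; the function name and the all-zero fallback are assumptions rather than details from the source.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its
    value, i.e. a multinomial over the normalised activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Illustrative fallback: no activity in the region.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Larger activations are chosen more often, but smaller non-zero ones still get picked occasionally, which is where the regularising noise comes from.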
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time.
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
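As a rough illustration of the normalized $L_p$ norm mentioned in the abstract above, the following Python sketch pools a list of already-projected activations. It is not the authors' implementation: the function name lp_pool and the sample values are hypothetical, and the learned projections and the per-unit learning of $p$ are left out. It only shows how the order $p$ moves the result from the mean of absolute values ($p = 1$) through root-mean-square ($p = 2$) towards the maximum magnitude (large $p$).

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|**p)) ** (1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(values)
    # Factor out the largest magnitude so that large p does not overflow.
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    return m * (sum((abs(v) / m) ** p for v in values) / n) ** (1.0 / p)


# Hypothetical activations, assumed to be the outputs of the projections.
acts = [1.0, -2.0, 3.0, -4.0]

mean_abs = lp_pool(acts, 1)    # average of absolute values
rms = lp_pool(acts, 2)         # root-mean-square
near_max = lp_pool(acts, 200)  # approaches max(|v|) as p grows

print(mean_abs, rms, near_max)
```

Because of the 1/N normalization, the value for large $p$ approaches the maximum magnitude from below rather than equalling it at any finite order.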
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each individual training point while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
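The single layer of the multilayer bootstrap network described in the last abstract above can be illustrated with a short sketch. It is not the authors' implementation: the function names, the toy data, and in particular the one-hot output encoding (one indicator per clustering marking the nearest center) are assumptions made here for illustration. The abstract itself only states that a layer is a group of independent k-centers clusterings whose centers are randomly sampled data points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_bootstrap_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering's centers are k randomly sampled data points."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def transform(layer, point):
    """Encode a point as concatenated one-hot codes (assumed encoding).

    For every clustering in the layer, the entry of the nearest center is 1
    and all other entries are 0.
    """
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Toy usage: 50 random 2-D points, one layer of 4 clusterings with 5 centers each.
rng = random.Random(0)
data = [[rng.random(), rng.random()] for _ in range(50)]
layer = fit_bootstrap_layer(data, num_clusterings=4, k=5, rng=rng)
codes = [transform(layer, p) for p in data]
```

Stacking would feed `codes` into the next layer in place of `data`; that step, and the compression by a neural network mentioned in the abstract, are left out of this sketch.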
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
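As a rough illustration of the variant discussed in the last abstract above, the sketch below applies batch normalization only to the input-to-hidden term of a plain tanh RNN step and leaves the hidden-to-hidden term unnormalized. It is a simplified assumption-laden sketch rather than the paper's model: the function names and toy weights are hypothetical, and the learned scale and shift parameters that batch normalization normally carries are omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def batch_norm(batch, eps=1e-5):
    """Standardize every feature with the mean and variance of the mini-batch."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]


def rnn_step(x_batch, h_batch, w_in, w_rec, bias):
    """One tanh RNN step with batch norm on the input-to-hidden term only."""
    in_term = batch_norm([matvec(w_in, x) for x in x_batch])  # normalized
    rec_term = [matvec(w_rec, h) for h in h_batch]            # left as is
    return [[math.tanh(a + r + b) for a, r, b in zip(a_row, r_row, bias)]
            for a_row, r_row in zip(in_term, rec_term)]


# Toy usage: batch of 3 sequences, 2 input features, 2 hidden units.
x_batch = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
h_batch = [[0.0, 0.0] for _ in range(3)]
w_in = [[1.0, 0.5], [-0.5, 1.0]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]
bias = [0.0, 0.0]
h_next = rnn_step(x_batch, h_batch, w_in, w_rec, bias)
```

Keeping the recurrent term outside `batch_norm` mirrors the abstract's finding that normalizing hidden-to-hidden transitions did not help, while normalizing the input-to-hidden term sped up convergence of the training criterion.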
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
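The stochastic rounding idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_round`, the split into `word_bits` and `frac_bits`, and the saturation behaviour are assumptions made for the example. A value is rounded up to the next fixed-point grid value with probability equal to its fractional distance from the lower grid value, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with spacing 2**-frac_bits.

    Hypothetical helper: the grid layout and saturation are assumptions.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the distance to the lower grid point.
    p_up = scaled - lower
    rounded = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    rounded = max(min_int, min(max_int, rounded))
    return rounded / scale
```

Averaging many calls on the same input approaches the input itself, which is the property that distinguishes this scheme from round-to-nearest.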
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
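The stochastic pooling procedure described above can be sketched for a single pooling region as follows. This is a minimal illustration, assuming non-negative activations (for example after a rectifier) and taking the multinomial probabilities to be the activations divided by their sum. The function name and the handling of an all-zero region are assumptions made for the example.

```python
import random


def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. p_i = a_i / sum(a). Activations must be non-negative.
    """
    if any(x < 0 for x in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0  # assumed convention: an all-zero region pools to zero
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
region = [1.0, 3.0]
samples = [stochastic_pool(region, rng) for _ in range(10000)]
# The larger activation should be picked roughly 75% of the time.
print(samples.count(3.0) / len(samples))
```

Because larger activations are more likely to be chosen but smaller ones are not excluded, the procedure behaves like a noisy version of max pooling.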
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) 
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
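The normalized $L_p$ norm at the core of the $L_p$ unit can be illustrated with a short sketch. It assumes the unit receives already-computed projections $z_1, \dots, z_N$ and returns $\left(\frac{1}{N}\sum_i |z_i|^p\right)^{1/p}$ with $p \ge 1$. The function name and the example values are illustrative, and any learned centering or projection weights are left out.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a list of projections.

    p = 1 gives the mean absolute value, p = 2 the root mean square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("this sketch assumes an order p >= 1")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


z = [1.0, -2.0, 3.0]
print(lp_unit(z, 1))    # average of absolute values
print(lp_unit(z, 2))    # root mean square
print(lp_unit(z, 200))  # close to max(|z|) = 3
```

Varying the single parameter `p` moves the unit between average-like and max-like behavior, which is the sense in which it generalizes the conventional pooling operators named in the abstract.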
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
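
The dropout-based gradient descent for convex ERMs mentioned in the abstract above can be illustrated with a minimal sketch. It assumes least-squares linear regression as the GLM, dropout applied to input features, and inverted-dropout rescaling; the function names and hyperparameters (dropout_sgd_linear, keep_prob, lr) are hypothetical and are not taken from the paper.

```python
import random

def dropout_sgd_linear(data, keep_prob=0.8, lr=0.05, epochs=200, seed=0):
    """SGD for least-squares linear regression with dropout on the inputs.

    data is a list of (features, target) pairs. Hypothetical helper for
    illustration only; not the algorithm analysed in the paper.
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            # Keep each feature with probability keep_prob and rescale the
            # survivors so the dropped input is unbiased (inverted dropout).
            mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0
                    for _ in range(dim)]
            xd = [xi * m for xi, m in zip(x, mask)]
            err = sum(wi * xi for wi, xi in zip(w, xd)) - y
            # Gradient step on the squared error of the dropped-out example.
            w = [wi - lr * err * xi for wi, xi in zip(w, xd)]
    return w

# Toy data generated by y = 2*x1 - x2.
points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
data = [(list(p), 2.0 * p[0] - p[1]) for p in points]
w_plain = dropout_sgd_linear(data, keep_prob=1.0)    # no dropout
w_dropout = dropout_sgd_linear(data, keep_prob=0.8)  # with dropout
```

With keep_prob=1.0 the mask is all ones and the routine reduces to plain SGD. With keep_prob below 1 the expected objective gains a data-dependent penalty on the weights, which is one informal way to read the "regularizer" remark in the abstract.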
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
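
A single layer of the multilayer bootstrap network described in the abstract above might be sketched as follows. The abstract only states that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; the one-hot nearest-center encoding, the squared Euclidean distance, and all names here (fit_bootstrap_layer, encode) are assumptions made for illustration.

```python
import random

def fit_bootstrap_layer(points, num_clusterings=3, k=2, seed=0):
    """Build one layer: each clustering is k centers sampled from the data."""
    rng = random.Random(seed)
    return [rng.sample(points, k) for _ in range(num_clusterings)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def encode(layer, x):
    """Concatenate, over clusterings, a one-hot code of the nearest center.

    The one-hot encoding is an assumption, not stated in the abstract.
    """
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda j: sq_dist(x, centers[j]))
        code.extend(1 if j == nearest else 0 for j in range(len(centers)))
    return code

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
layer = fit_bootstrap_layer(points, num_clusterings=3, k=2)
code = encode(layer, points[0])
```

Stacking several such layers, each fitted on the codes produced by the previous one, would give the multilayer structure; that stacking step is omitted here.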
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks . In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
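
The input-to-hidden variant discussed in the abstract above can be sketched in a few lines. This is a minimal illustration assuming a tanh recurrent unit and batch normalization without the learnable gain and bias; the names batch_norm, matvec and rnn_step are hypothetical, not the authors' code.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature using its mean and variance over the batch."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (col[i] - mean) / math.sqrt(var + eps)
    return out

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def rnn_step(x_batch, h_batch, W_x, W_h):
    """h_t = tanh(BN(W_x x_t) + W_h h_{t-1}).

    Batch normalization is applied to the input-to-hidden term only;
    the hidden-to-hidden term is left unnormalized.
    """
    in_term = batch_norm([matvec(W_x, x) for x in x_batch])
    rec_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(ia, rb)]
            for ia, rb in zip(in_term, rec_term)]

normed = batch_norm([[1.0, 2.0], [3.0, 6.0]])
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.1], [0.2, 0.0]]
h_next = rnn_step([[1.0, 2.0], [3.0, 6.0]], [[0.0, 0.0], [0.1, -0.1]], W_x, W_h)
```

Normalizing only the W_x x_t term keeps the mini-batch statistics independent of the recurrent state, which is one plausible reason this variant is easier to apply than normalizing the hidden-to-hidden transition.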
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
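The stochastic rounding mentioned in the limited-precision abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the split of the 16-bit word into 8 fractional bits, and the saturation on overflow are all assumptions made for illustration. The idea shown is that a value is rounded up to the next fixed-point grid point with probability equal to its fractional remainder, so the rounded value is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed point (hypothetical helper).

    x is rounded up to the next grid point with probability equal to its
    distance from the lower grid point, measured in grid steps.
    """
    scaled = x * (1 << frac_bits)        # express x in units of the grid step
    lower = math.floor(scaled)           # grid point at or below x
    if rng.random() < scaled - lower:    # round up with probability = remainder
        lower += 1
    # Saturate to the signed word range (assumed overflow behaviour).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / (1 << frac_bits)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    print(sorted(set(samples)))          # the two neighbouring grid points
    print(sum(samples) / len(samples))   # averages out close to 0.3
```

With round-to-nearest, 0.3 on a grid of step 0.25 would always become 0.25; with the stochastic scheme the average over many roundings stays near 0.3, which is the property that matters when small gradient updates are accumulated.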
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of 
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
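As a rough illustration of the input-to-hidden variant described above, the following NumPy sketch standardizes only the input-to-hidden pre-activations with mini-batch statistics and leaves the hidden-to-hidden term untouched. The function names, the `gamma`/`beta` parameters, the tanh cell, and the use of separate statistics at every time step are assumptions made for illustration, not details taken from the abstract.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature of `a` (batch, features) with mini-batch statistics."""
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_step(x_t, h_prev, W_x, W_h, b, gamma, beta):
    """One tanh RNN step; only the input-to-hidden term is normalized."""
    return np.tanh(batch_norm(x_t @ W_x, gamma, beta) + h_prev @ W_h + b)

def run_rnn(x, W_x, W_h, b, gamma, beta):
    """Unroll over x of shape (time, batch, input_dim); return the final hidden state."""
    h = np.zeros((x.shape[1], W_h.shape[0]))
    for x_t in x:
        h = rnn_step(x_t, h, W_x, W_h, b, gamma, beta)
    return h
```

Normalizing the hidden-to-hidden term as well would amount to wrapping `h_prev @ W_h` in a second `batch_norm` call, which is the variant the abstract reports as unhelpful.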
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
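Stochastic rounding to a fixed-point grid, as referred to above, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the split into `frac_bits` fractional bits within a signed `word_bits`-wide word, and the saturation behaviour are hypothetical choices, not details taken from the abstract.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, word_bits=16, rng=None):
    """Round x to signed fixed point with `frac_bits` fractional bits.

    A value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the rounding is unbiased in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability (scaled - floor), otherwise round down.
    round_up = rng.random(scaled.shape) < (scaled - floor)
    q = floor + round_up
    # Saturate to the representable range of a signed word_bits integer.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale
```

Unlike round-to-nearest, updates smaller than half a grid step are not always lost here: they survive with a probability proportional to their size, which is the usual intuition for why this scheme helps low-precision training.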
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et 
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
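The normalized $L_p$ norm described above can be illustrated with a small NumPy sketch. This is only one plausible reading of the abstract, not the authors' implementation: the function name `lp_unit`, the use of plain linear projections `W @ x + b`, and the choice to normalize by the number of projections are all assumptions made for illustration. The example values show how the order `p` moves the unit between average-like pooling (p = 1), root-mean-square pooling (p = 2), and max-like pooling (large p).

```python
import numpy as np


def lp_unit(x, W, b, p):
    """Hypothetical L_p unit: normalized L_p norm over N linear projections of x.

    Assumes p >= 1 and that "normalized" means dividing the sum by N
    (i.e. taking the mean) before the 1/p root.
    """
    z = np.abs(W @ x + b)               # magnitudes of the N projections
    return np.mean(z ** p) ** (1.0 / p)


# Toy input with three projections whose magnitudes are [1, 2, 1].
x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.zeros(3)

mean_abs = lp_unit(x, W, b, p=1.0)    # average of the magnitudes
rms = lp_unit(x, W, b, p=2.0)         # root-mean-square of the magnitudes
near_max = lp_unit(x, W, b, p=50.0)   # approaches the largest magnitude
```

In this sketch `p` is a fixed argument; the abstract describes learning a separate order for each unit, which would make `p` a trainable parameter instead.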
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
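As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the sketch below uses a convex combination with a mixing weight. The abstract does not spell out the exact form, so the formula, the names `mixed_pool` and `pool_1d`, and the fixed (rather than learned) `alpha` are all assumptions made for this example.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    Hypothetical form: alpha * max + (1 - alpha) * mean, so alpha = 1
    gives max pooling and alpha = 0 gives average pooling. In a learned
    variant, alpha would be a trainable parameter; here it is fixed.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    values = list(region)
    max_val = max(values)
    avg_val = sum(values) / len(values)
    return alpha * max_val + (1.0 - alpha) * avg_val


def pool_1d(signal, size, alpha):
    """Apply mixed_pool to non-overlapping windows of a 1-D signal."""
    return [
        mixed_pool(signal[i:i + size], alpha)
        for i in range(0, len(signal) - size + 1, size)
    ]


if __name__ == "__main__":
    print(pool_1d([1.0, 3.0, 2.0, 6.0], size=2, alpha=0.5))
```

Because the output is differentiable with respect to `alpha`, a framework with automatic differentiation could in principle learn the mixing weight along with the other parameters.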
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
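The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is an illustrative sketch only: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and returning 0.0 for an all-zero region is a choice made for this example.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value, i.e. from a multinomial distribution given by the activities
    within the region. Assumes non-negative activations.
    """
    values = list(region)
    if any(v < 0 for v in values):
        raise ValueError("activations must be non-negative")
    total = sum(values)
    if total == 0:
        return 0.0                    # nothing is active in this region
    # random.choices picks values[i] with probability values[i] / total.
    return rng.choices(values, weights=values, k=1)[0]


if __name__ == "__main__":
    random.seed(0)
    draws = [stochastic_pool([1.0, 0.0, 3.0]) for _ in range(10000)]
    # Roughly three quarters of the draws should be the activation 3.0.
    print(draws.count(3.0) / len(draws))
```

Unlike max pooling, smaller activations still have a chance of being selected, which is where the regularizing noise comes from.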
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
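As a rough illustration of the multilayer bootstrap network layer described in the abstract just above, the following pure-Python sketch builds a single layer from several independent clusterings whose centers are randomly sampled data points, and encodes an input as the concatenation of one-hot nearest-center indicators. The function names (`build_layer`, `encode`), the squared Euclidean distance, the one-hot output encoding, and the parameters `n_clusterings` and `k` are assumptions made for illustration only; the abstract does not specify them.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_layer(data, n_clusterings, k, seed=0):
    """Build one layer: each clustering is k randomly sampled data points used as centers."""
    rng = random.Random(seed)
    return [rng.sample(data, k) for _ in range(n_clusterings)]


def encode(layer, x):
    """Concatenate one one-hot block per clustering, marking the center nearest to x."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(centers[i], x))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


# Toy data: three loose groups of 2-D points (hypothetical values).
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]]
layer = build_layer(data, n_clusterings=3, k=2)
code = encode(layer, [5.0, 5.1])
```

Stacking would feed the sparse `code` vectors of one layer into the next layer as its data; that step is omitted here to keep the sketch short.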
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
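To make the input-to-hidden variant in the abstract just above concrete, here is a minimal pure-Python sketch of one tanh RNN step in which only the input term is standardized with mini-batch statistics, while the hidden-to-hidden term is left untouched. The helper names (`batch_norm`, `matvec_batch`, `rnn_step`), the tanh nonlinearity, the toy weights, and the omission of the learned gain and shift parameters are all assumptions made for illustration; the abstract gives no implementation details.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature of a mini-batch to zero mean and (near) unit variance."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def matvec_batch(W, batch):
    """Apply the weight matrix W (a list of rows) to every vector in the batch."""
    return [[sum(w * x for w, x in zip(w_row, vec)) for w_row in W] for vec in batch]


def rnn_step(W_x, W_h, x_batch, h_batch):
    """One tanh RNN step with batch normalization on the input-to-hidden term only."""
    input_term = batch_norm(matvec_batch(W_x, x_batch))  # normalized over the mini-batch
    hidden_term = matvec_batch(W_h, h_batch)             # deliberately not normalized
    return [
        [math.tanh(a + b) for a, b in zip(in_row, hid_row)]
        for in_row, hid_row in zip(input_term, hidden_term)
    ]


# Toy example: batch of 4 inputs with 3 features, 2 hidden units (hypothetical values).
W_x = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
x_batch = [[1.0, 2.0, 3.0], [0.0, -1.0, 2.0], [4.0, 0.5, -2.0], [-3.0, 1.5, 0.0]]
h_batch = [[0.0, 0.0] for _ in x_batch]
h_next = rnn_step(W_x, W_h, x_batch, h_batch)
```

In a full sequence model the same step would be applied at every time step, with `h_next` passed back in as `h_batch`.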
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
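The stochastic rounding mentioned in the preceding abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the split into 8 fractional bits, and the saturation behaviour are assumptions chosen for the example. The idea shown is that a value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point (measured in grid steps), otherwise rounded down.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower          # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits-wide range.

    The 16/8 bit split is a hypothetical choice for illustration only.
    """
    scale = 1 << frac_bits
    max_val = ((1 << (word_bits - 1)) - 1) / scale
    min_val = -(1 << (word_bits - 1)) / scale
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

Averaging many stochastically rounded copies of the same value recovers the value itself, which round-to-nearest cannot do for inputs smaller than half a grid step.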
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
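The claim in the preceding abstract that depth lets a piecewise linear network re-use computation exponentially often can be illustrated with a small worked example. The sketch below is an assumption-laden illustration and is not taken from the abstract: it uses the well-known "tent map" built from two ReLU units and counts the linear pieces of its repeated composition on [0, 1]. The function names and the grid-based counting method are hypothetical.

```python
def relu(x):
    return max(x, 0.0)


def tent(x):
    """One layer with two ReLU units: rises from 0 to 1 on [0, 0.5], falls back to 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def compose(x, depth):
    """Apply the tent layer `depth` times (a network with 2 * depth ReLU units)."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth, grid_bits=12):
    """Count linear pieces on [0, 1] by detecting slope changes on a dyadic grid.

    The grid points are exact binary fractions, so all arithmetic is exact as
    long as depth <= grid_bits and every breakpoint falls on a grid point.
    """
    n = 1 << grid_bits
    values = [compose(i / n, depth) for i in range(n + 1)]
    slopes = [values[i + 1] - values[i] for i in range(n)]
    pieces = 1
    for prev, cur in zip(slopes, slopes[1:]):
        if cur != prev:
            pieces += 1
    return pieces
```

Each added layer folds the interval onto itself and doubles the number of pieces, so 2 * depth units give 2**depth pieces, whereas a single hidden layer with m ReLU units on a scalar input has at most m + 1 pieces.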
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
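One way to picture the "combining of max and average pooling" mentioned in the abstract above is a convex blend of the two operators over a single pooling region. The sketch below is only an illustration under stated assumptions: the function name `mixed_pool` and the mixing weight `mix` are hypothetical, and the weight is a fixed argument here, whereas the abstract describes it as learned.

```python
def mixed_pool(region, mix):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    mix:    hypothetical mixing weight in [0, 1]; 1.0 gives pure max
            pooling and 0.0 gives pure average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    average = sum(region) / len(region)
    return mix * max(region) + (1.0 - mix) * average


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for mix in (0.0, 0.5, 1.0):
        print(mix, mixed_pool(window, mix))
```

In a trainable version the mixing weight would be a parameter updated by gradient descent; that part is not shown here.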
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
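The stochastic pooling procedure in the abstract above (pick one activation per pooling region, with probability proportional to its activity) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `stochastic_pool` and `stochastic_pool_1d` are hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and an all-zero region is assumed to pool to 0.0.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability a_i / sum(a), i.e. a
    multinomial distribution given by the activities in the region.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(values, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of a 1-D list."""
    return [
        stochastic_pool(values[start:start + size], rng)
        for start in range(0, len(values), size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(stochastic_pool_1d([0.0, 1.0, 3.0, 0.0, 2.0, 2.0], size=3, rng=rng))
```

Because larger activations are sampled more often, the output tends to follow max pooling while still letting weaker activations through occasionally, which is where the regularizing effect comes from.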
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
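The normalized $L_p$ norm that the abstract above describes, and its link to average, root-mean-square and max pooling, can be made concrete with a small sketch. The form used here, (1/N * sum_i |x_i - c_i|^p)^(1/p), is an assumption about what "normalized" means; the function name `lp_unit` and the optional `centers` offsets are hypothetical, and the order `p` is a fixed argument rather than a learned parameter.

```python
def lp_unit(inputs, p, centers=None):
    """Normalized L_p norm of a group of inputs.

    Computes ((1/N) * sum_i |x_i - c_i|**p) ** (1/p).
    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    if p < 1:
        raise ValueError("the order p is assumed to be at least 1")
    if centers is None:
        centers = [0.0] * len(inputs)
    mean_power = sum(abs(x - c) ** p for x, c in zip(inputs, centers)) / len(inputs)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    signals = [3.0, 4.0]
    for order in (1, 2, 50):
        print(order, lp_unit(signals, order))
```

Sweeping `p` from 1 upward moves the unit smoothly from average-like to max-like behavior, which is the sense in which it generalizes the conventional pooling operators.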
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product as the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased.
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks.
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This not only helps us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggests that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
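For reference, the link between Fisher information and estimator variance mentioned in the preceding abstract can be written out for the scalar case. This is the standard textbook statement of the bound, not a formula taken from the paper; the symbols $\theta$, $\hat{\theta}$, $p(x;\theta)$ and $I(\theta)$ are notation assumed here for illustration.

```latex
% Cramer-Rao bound for an unbiased estimator \hat{\theta} of a scalar parameter \theta
\mathrm{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) \;=\; \mathbb{E}_{x \sim p(x;\theta)}\!\left[\left(\frac{\partial}{\partial \theta} \log p(x;\theta)\right)^{2}\right]
```

A parameter with large Fisher information therefore admits estimates with small variance, which is the sense in which it can be called confident.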
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
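The layer described in the preceding abstract (a group of independent k-centers clusterings whose centers are randomly sampled data points) can be illustrated with a rough Python sketch. It is not the authors' implementation: the function name `bootstrap_layer`, the squared Euclidean distance, the one-hot output coding, and the default sizes are all assumptions made here for illustration.

```python
import random


def bootstrap_layer(points, n_clusterings=5, k=4, seed=0):
    """One hypothetical layer: several independent k-centers clusterings.

    Each clustering picks k data points at random as its centers and
    encodes every input as a one-hot vector marking its nearest center.
    The codes of all clusterings are concatenated per input.
    """
    rng = random.Random(seed)
    codes = [[] for _ in points]
    for _ in range(n_clusterings):
        # Centers are randomly sampled data points, as in the abstract.
        centers = rng.sample(points, k)
        for code, p in zip(codes, points):
            # Squared Euclidean distance is an assumption of this sketch.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            nearest = dists.index(min(dists))
            code.extend(1 if j == nearest else 0 for j in range(k))
    return codes


# Toy data: 20 points in 3 dimensions.
data_rng = random.Random(1)
points = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
codes = bootstrap_layer(points)
```

Each input is mapped to a sparse binary code of length `n_clusterings * k`; stacking several such layers, each fed with the previous layer's codes, would give a multilayer network in the spirit of the description.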
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
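The input-to-hidden variant described in the preceding abstract can be illustrated with a small sketch. It is not the authors' implementation: the tanh recurrence, the function names, and the omission of the learned scale and shift parameters are all assumptions made to keep the example short. Only the input term is standardized with mini-batch statistics; the hidden-to-hidden term is left untouched.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of the mini-batch."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n for j in range(d)]
    return [[(row[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(d)]
            for row in batch]

def matvec_batch(batch, W):
    """Apply W (one row per output unit) to every example in the batch."""
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in W]
            for row in batch]

def rnn_step(x_t, h_prev, W_x, W_h):
    """Assumed recurrence: h_t = tanh(BN(W_x x_t) + W_h h_prev)."""
    a_x = batch_norm(matvec_batch(x_t, W_x))   # normalized input-to-hidden term
    a_h = matvec_batch(h_prev, W_h)            # hidden-to-hidden term, not normalized
    return [[math.tanh(ax + ah) for ax, ah in zip(row_x, row_h)]
            for row_x, row_h in zip(a_x, a_h)]
```

Calling `rnn_step` once per time step, feeding the returned hidden states back in as `h_prev`, gives the full recurrence. In this sketch the statistics are computed separately at each step.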
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
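The stochastic rounding scheme mentioned in the preceding abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the split of the 16-bit word into integer and fractional bits, and the saturation behaviour at the range limits are assumptions. A value is rounded up to the next grid point with probability equal to its fractional remainder, so the rounded value equals the input in expectation.

```python
import math
import random

def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid with `frac_bits` fractional bits.

    The default bit split (8 fractional bits in a 16-bit word) is an
    illustrative assumption, not a value taken from the abstract.
    """
    eps = 2.0 ** -frac_bits            # spacing between representable values
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed integer with `word_bits` bits.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps
```

Averaging many calls on the same input recovers the input up to sampling noise, which is the property that distinguishes this scheme from round-to-nearest, where small updates below half the grid spacing are always lost.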
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features.
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computational power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
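As a rough illustration of the first direction in the pooling abstract above (combining max and average pooling), the following sketch blends the two over a single pooling region. The function name `mixed_pool`, the mixing proportion `a`, and the convex-combination form are assumptions made for illustration; the abstract does not give the exact formula, and in the approach it describes the combination is learned rather than fixed by hand.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations.
    a: hypothetical mixing proportion in [0, 1];
       a = 1 reduces to max pooling, a = 0 to average pooling.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return a * max_val + (1.0 - a) * avg_val
```

Setting `a` to 0 or 1 recovers the two conventional operators, which makes the sketch easy to check against an existing pooling layer.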
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
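The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is a minimal sketch, not the authors' implementation: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, rectifier outputs) so that they can be normalized into multinomial probabilities, and returning 0.0 for an all-zero region is an assumption the abstract does not address.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its
    value, i.e. a multinomial distribution given by the activities
    within the region. Assumes all activations are non-negative.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if any(v < 0 for v in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when comparing against deterministic max or average pooling.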
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm, we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
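As a rough illustration of the normalized $L_p$ norm described in the $L_p$ unit abstract above, the following minimal sketch shows how one formula interpolates between average, root-mean-square and max pooling as the order $p$ grows. It is a simplification under stated assumptions, not the authors' implementation: the function name `lp_unit` is hypothetical, the learned projections and the learned order are omitted, and the restriction to p >= 1 is an assumption made here.

```python
import math

def lp_unit(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|v|^p))^(1/p).

    Hypothetical helper for illustration only; the learned projections
    and the learned order p of the full unit are left out.
    """
    if p < 1:
        raise ValueError("p must be >= 1 (assumption made for this sketch)")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

group = [3.0, -4.0]
mean_abs = lp_unit(group, 1)    # p = 1: average of absolute values
rms = lp_unit(group, 2)         # p = 2: root-mean-square
near_max = lp_unit(group, 50)   # large p: approaches max(|v|) from below
```

With a small order the unit behaves like average pooling over magnitudes, and with a large order it approaches max pooling, which is the sense in which the abstract calls it a generalization of the conventional pooling operators.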
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
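As a rough illustration of the preceding abstract, the sketch below standardizes features with mini-batch statistics and applies that normalization only to the input-to-hidden term of a tanh RNN step. It is a minimal pure-Python sketch, not the authors' implementation: the names `batch_norm` and `rnn_step`, the scalar `gamma`/`beta`, the `eps` value, and the assumption that `x_proj` (W_x x_t) and `h_proj` (W_h h_{t-1}) are precomputed per example are all hypothetical choices made for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature column of `batch` (a list of equal-length rows)
    with the mini-batch mean and variance, then scale by gamma and shift by beta."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [
            gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
            for j in range(dim)
        ]
        for row in batch
    ]


def rnn_step(x_proj, h_proj):
    """One tanh RNN step over a mini-batch.

    x_proj: input-to-hidden pre-activations, one row per example (assumed precomputed).
    h_proj: hidden-to-hidden pre-activations, same shape (assumed precomputed).
    Only the input-to-hidden term is normalized; the recurrent term is left as-is.
    """
    normed = batch_norm(x_proj)
    return [
        [math.tanh(a + b) for a, b in zip(n_row, h_row)]
        for n_row, h_row in zip(normed, h_proj)
    ]
```

Calling `rnn_step(x_proj, h_proj)` at every time step uses the statistics of the current mini-batch only; how statistics are shared across time steps or estimated at test time is left open here.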
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
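To make the role of the rounding scheme in the preceding abstract concrete, here is a minimal sketch of stochastic rounding onto a fixed-point grid. It is an illustration under stated assumptions, not the paper's implementation: the split of the 16-bit word into integer and fractional bits (`frac_bits=8`), the saturation behaviour, and the function names `stochastic_round` and `to_fixed16` are all hypothetical.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point (in grid units), so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), down otherwise.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed16(x, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed 16-bit range.

    The 16-bit word layout (frac_bits fractional bits) is an assumption.
    """
    scale = 1 << frac_bits
    q = round(stochastic_round(x, frac_bits, rng) * scale)
    q = max(-(1 << 15), min((1 << 15) - 1, q))
    return q / scale
```

With `frac_bits=8` the grid spacing is 1/256, so an update of 0.001 always becomes 0 under round-to-nearest, whereas averaging many calls to `stochastic_round(0.001)` recovers roughly 0.001. Small gradient contributions therefore survive on average, which is one intuition for why the rounding scheme matters at low precision.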
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
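The first pooling direction described above, a combination of max and average pooling, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the combination is a convex mix `a * max + (1 - a) * mean` with a single scalar proportion `a` that is fixed here rather than learned, and the names `pool_regions` and `mixed_pool` are hypothetical.

```python
def pool_regions(image, size):
    """Yield rows of non-overlapping size x size regions of a 2-D list, each flattened."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, size):
        row_regions = []
        for c in range(0, cols - size + 1, size):
            region = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row_regions.append(region)
        yield row_regions


def mixed_pool(image, size=2, a=0.5):
    """Assumed mixing rule: a * max + (1 - a) * mean over each pooling region."""
    out = []
    for row_regions in pool_regions(image, size):
        out.append([a * max(reg) + (1.0 - a) * (sum(reg) / len(reg))
                    for reg in row_regions])
    return out


image = [
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 4.0, 0.0, 0.0],
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 1.0, 3.0],
]
max_only = mixed_pool(image, a=1.0)   # reduces to ordinary max pooling
avg_only = mixed_pool(image, a=0.0)   # reduces to ordinary average pooling
half = mixed_pool(image, a=0.5)       # equal blend of the two
```

Setting `a` to 1 or 0 recovers conventional max or average pooling, which is why a learned `a` can only match or extend the two standard choices in this sketch.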
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
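The stochastic pooling procedure described above can be sketched for a single pooling region as follows. This is a hedged illustration only: it assumes non-negative activations (for example after a rectifier), takes the multinomial probabilities to be each activation divided by the region's sum, and returns 0 for an all-zero region, which is a choice made here rather than something stated in the abstract. The function name is hypothetical.

```python
import random


def stochastic_pool_region(activations, rng):
    """Pick one activation with probability proportional to its value.

    Assumes every activation is non-negative. An all-zero region has no
    valid distribution, so this sketch returns 0.0 in that case.
    """
    total = sum(activations)
    if total == 0:
        return 0.0
    # random.Random.choices draws with replacement using the given weights.
    return rng.choices(activations, weights=activations, k=1)[0]


rng = random.Random(0)  # seeded for reproducibility
region = [1.0, 3.0]
draws = [stochastic_pool_region(region, rng) for _ in range(10000)]
fraction_of_larger = draws.count(3.0) / len(draws)  # expected to be near 0.75
```

Larger activations are selected more often but not always, which is the source of the regularizing noise during training.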
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al.,
2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
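The pooling interpretation of the $L_p$ unit described above can be checked numerically. The following is a minimal sketch, assuming the normalized $L_p$ norm takes the form (mean of |x_i|^p)^(1/p) over the unit's inputs; the learned projections and learned order $p$ mentioned in the abstract are omitted, and the function name lp_unit is hypothetical rather than taken from the paper.

```python
def lp_unit(xs, p):
    """Normalized L_p norm of a list of numbers: (mean(|x|^p))^(1/p).

    Illustrative only: the inputs stand in for the projected signals
    that the unit would receive from the layer below.
    """
    return (sum(abs(x) ** p for x in xs) / len(xs)) ** (1.0 / p)

inputs = [1.0, -2.0, 3.0, -4.0]

avg_abs = lp_unit(inputs, 1)      # p = 1: average of absolute values
rms = lp_unit(inputs, 2)          # p = 2: root-mean-square
near_max = lp_unit(inputs, 200)   # large p: approaches max(|x|)

print(avg_abs, rms, near_max)
```

With p = 1 the unit reduces to average pooling of magnitudes, with p = 2 to root-mean-square pooling, and as p grows the result approaches the largest magnitude, which is the sense in which a single order parameter interpolates between these pooling operators. Very large p combined with large inputs would overflow in floating point, so a practical version would rescale by the maximum magnitude first.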
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
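A minimal sketch of one layer of the multilayer bootstrap network described in the preceding abstract may help make the construction concrete. The abstract only states that each layer is a group of independent k-centers clusterings whose centers are randomly sampled data points; encoding each point as concatenated one-hot nearest-center indicators, the squared Euclidean distance, and the names bootstrap_layer and encode are assumptions made here for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bootstrap_layer(data, num_clusterings, k, rng):
    """Build one layer: every clustering keeps k randomly sampled data points as centers."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(point, layer):
    """Assumed encoding: concatenate a one-hot nearest-center indicator per clustering."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code


rng = random.Random(0)
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
layer = bootstrap_layer(data, num_clusterings=3, k=2, rng=rng)
code = encode(data[0], layer)
```

Stacking several such layers, each fed the codes of the previous one, would give the multilayer structure; the pseudo-supervised compression step mentioned in the abstract is not sketched here.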
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
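As a rough illustration of the input-to-hidden variant discussed in the preceding abstract, the sketch below standardizes only the input term of a plain tanh RNN step using mini-batch statistics. The helper names (batch_norm, matvec, rnn_step), the scalar gamma and beta, the omission of a bias term, and the use of plain Python lists are all assumptions for readability; this is not the authors' implementation.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) with mini-batch mean and variance."""
    n = len(batch)
    dims = range(len(batch[0]))
    means = [sum(row[j] for row in batch) / n for j in dims]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in dims]
    return [[gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in dims] for row in batch]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_batch, h_batch, w_x, w_h):
    """One tanh RNN step where only the input-to-hidden term is batch-normalized."""
    input_term = batch_norm([matvec(w_x, x) for x in x_batch])
    hidden_term = [matvec(w_h, h) for h in h_batch]  # left unnormalized
    return [[math.tanh(a + b) for a, b in zip(row_in, row_hid)]
            for row_in, row_hid in zip(input_term, hidden_term)]


normalized = batch_norm([[1.0, 10.0], [3.0, 30.0]])
new_hidden = rnn_step([[1.0, 2.0], [3.0, 4.0]], [[0.0], [0.0]],
                      w_x=[[1.0, 0.0]], w_h=[[0.5]])
```

Keeping the hidden-to-hidden term outside the normalization mirrors the abstract's finding that normalizing that transition did not help training.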
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
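The stochastic rounding mentioned in the abstract just above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with `word_bits` total bits and `frac_bits` fractional bits; the function name `stochastic_round`, its parameters and the 8-bit fractional split are hypothetical choices for illustration, not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the rounding is unbiased in expectation.
    The 16-bit word / 8 fractional bits default is an assumption.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), which lies in [0, 1).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic scheme preserves the value on average, which is one plausible reading of why the rounding scheme matters during training.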
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
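The first direction named in the pooling abstract just above, combining max and average pooling, can be sketched as follows. The abstract does not state how the two are combined; the convex combination with a mixing weight `a` shown here is an assumption, and the name `mixed_pool` is hypothetical. In a learned variant, `a` would be a trainable parameter rather than a fixed argument.

```python
def mixed_pool(region, a):
    """Blend max pooling and average pooling over one pooling region.

    Assumed form: a * max(region) + (1 - a) * mean(region), with the
    mixing weight a in [0, 1]. a = 1 gives max pooling, a = 0 gives
    average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight a must lie in [0, 1]")
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


if __name__ == "__main__":
    region = [1.0, 3.0]
    print(mixed_pool(region, 1.0))  # pure max pooling
    print(mixed_pool(region, 0.0))  # pure average pooling
    print(mixed_pool(region, 0.5))  # equal blend of the two
```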
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
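The stochastic pooling procedure described in the abstract just above, picking one activation per pooling region from a multinomial distribution given by the activities in that region, can be sketched as below. The sketch assumes non-negative activations (for example, after a rectifier) and returns zero for an all-zero region; the function name `stochastic_pool` and both of those conventions are illustrative assumptions.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its
    value, i.e. a multinomial over the normalized activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        # No active unit in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]
    draws = [stochastic_pool(region, rng=rng) for _ in range(10000)]
    # The activation 3.0 should be picked roughly three times out of four.
    print(draws.count(3.0) / len(draws))
```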
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
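As a rough illustration of the "normalized $L_p$ norm" described in the abstract above, the following minimal sketch computes (mean_i |x_i - c_i|^p)^(1/p) over a list of already-projected inputs. The function name lp_unit, the per-input centers c_i, and the choice to treat the projections as given inputs are assumptions made for illustration; the abstract does not specify them. It only aims to show how p = 1, p = 2 and large p relate to average, root-mean-square and max pooling for non-negative inputs.

```python
import math


def lp_unit(inputs, centers, p):
    """Normalized L_p norm of centered inputs (hypothetical helper).

    Computes (mean_i |x_i - c_i|^p)^(1/p). The centers and the exact
    normalization are illustrative assumptions, not taken from the abstract.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if len(inputs) != len(centers) or not inputs:
        raise ValueError("inputs and centers must be non-empty and equal length")
    total = sum(abs(x - c) ** p for x, c in zip(inputs, centers))
    return (total / len(inputs)) ** (1.0 / p)


if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    zeros = [0, 0, 0, 0]
    print(lp_unit(xs, zeros, 1))    # mean of absolute values
    print(lp_unit(xs, zeros, 2))    # root-mean-square
    print(lp_unit(xs, zeros, 100))  # approaches max(xs) as p grows
```

With non-negative inputs and zero centers, p = 1 reduces to the mean and p = 2 to the root-mean-square, while increasing p moves the output toward the largest input, which is the sense in which such a unit generalizes the pooling operators the abstract lists.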
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each individual training point while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
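As an illustration of the layer structure described in the multilayer bootstrap network abstract above, the following is a minimal sketch of a single layer, not the authors' implementation. It assumes squared Euclidean distance and a one-hot code per clustering, neither of which is stated in the abstract; the names build_layer and encode are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_layer(data, num_clusterings, k, rng):
    """One layer: independent clusterings whose centers are random data points."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]

def encode(point, layer):
    """Concatenate one one-hot code per clustering (assumed output format)."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        code.extend(1 if i == nearest else 0 for i in range(len(centers)))
    return code

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = build_layer(data, num_clusterings=3, k=4, rng=rng)
codes = [encode(point, layer) for point in data]
```

Each input is mapped to a sparse binary vector of length num_clusterings * k with exactly one active entry per clustering; stacking several such layers would feed these codes into the next layer.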
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
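To make the input-to-hidden variant from the batch normalization abstract above concrete, here is a minimal pure-Python sketch of one tanh RNN step, not the authors' code. It assumes a plain tanh cell and omits biases and the learned scale and shift parameters for brevity; batch_norm, matmul and rnn_step are hypothetical helper names.

```python
import math
import random

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) with mini-batch mean and variance."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]

def matmul(a, b):
    """Multiply a (rows x k) list of lists by b (k x cols)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def rnn_step(x_t, h_prev, w_x, w_h):
    """One tanh RNN step; only the input-to-hidden term is normalized."""
    input_term = batch_norm(matmul(x_t, w_x))   # normalized
    hidden_term = matmul(h_prev, w_h)           # left unnormalized
    return [[math.tanh(i + h) for i, h in zip(i_row, h_row)]
            for i_row, h_row in zip(input_term, hidden_term)]

random.seed(0)
batch_size, n_in, n_hid = 8, 5, 4
x_t = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(batch_size)]
h_prev = [[0.0] * n_hid for _ in range(batch_size)]
w_x = [[random.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_in)]
w_h = [[random.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_hid)]
h_t = rnn_step(x_t, h_prev, w_x, w_h)
normalized = batch_norm(matmul(x_t, w_x))
```

The hidden-to-hidden product is deliberately left unnormalized, mirroring the variant the abstract reports as converging faster.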
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
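As an illustration of the stochastic rounding idea in the preceding abstract, here is a minimal sketch in plain Python. The function name `stochastic_round`, the split into 8 fractional bits within a 16-bit signed word, and the saturation behavior are all assumptions made for this example; the abstract does not specify them.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits and word_bits are illustrative choices, not values
    taken from the abstract.
    """
    scale = 1 << frac_bits          # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so the rounded value equals x in expectation.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range representable by a signed word.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, lower)) / scale
```

Unlike round-to-nearest, this scheme is unbiased: averaging many roundings of the same value recovers that value, which is one plausible reason small updates are not systematically lost at low precision.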
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. 
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
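As a rough illustration of the first direction above (combining max and average pooling), the sketch below blends the two with a single mixing weight. It is only one simple form such a combination could take, not the authors' implementation: the function name `mixed_pool` and the parameter `alpha` are hypothetical, and `alpha` is fixed by hand here rather than learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. `alpha` is a hypothetical mixing weight; in a learned pooling
    function it would be a trainable parameter.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average
```

For the region `[1.0, 2.0, 6.0]`, the maximum is 6.0 and the average is 3.0, so intermediate values of `alpha` interpolate between the two conventional pooling results.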
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
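The stochastic pooling procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, activations are assumed to be non-negative (for example, rectifier outputs), and an all-zero region is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. from a multinomial distribution given by the activities in the
    region. Activations are assumed to be non-negative.
    """
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region pools to zero.
        return 0.0
    threshold = rng.random() * total
    cumulative = 0.0
    for activation in region:
        cumulative += activation
        if threshold < cumulative:
            return activation
    # Guard against floating-point round-off on the last element.
    return region[-1]


def expected_pool(region):
    """Expected value of stochastic_pool: sum of p_i * a_i with p_i = a_i / total."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total
```

Because the sampling probabilities come directly from the activations, the sketch has no tunable hyper-parameter, which matches the claim that the approach is hyper-parameter free.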
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et
al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
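To make the pooling interpretation of the $L_p$ unit concrete, the sketch below computes a normalized $L_p$ norm, $(\frac{1}{N}\sum_i |z_i|^p)^{1/p}$, over a list of already-computed projections $z_i$. This is a minimal illustration, not the authors' implementation: the function name `lp_unit` is hypothetical, the projections are taken as given inputs, and the order $p$ is passed in by hand rather than learned.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of the inputs: (mean(|z_i| ** p)) ** (1 / p).

    p = 1 gives the average of the absolute values, p = 2 the
    root-mean-square, and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(projections)
    # Factor out the largest magnitude so that large p does not overflow.
    scale = max(abs(z) for z in projections)
    if scale == 0:
        return 0.0
    mean_power = sum((abs(z) / scale) ** p for z in projections) / n
    return scale * mean_power ** (1.0 / p)
```

For the inputs `[1.0, -2.0, 3.0]`, `p = 1` yields the mean absolute value 2.0, `p = 2` yields the root-mean-square, and increasing `p` moves the output toward the largest magnitude, 3.0, which is the sense in which a single unit spans average, root-mean-square and max pooling.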
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
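The idea of geometrically increasing the amount of data used per iteration, mentioned in the Hessian-free abstract above, can be illustrated with a small schedule generator. This is a sketch under stated assumptions: the function name `geometric_sample_sizes`, the starting size, and the growth factor are hypothetical, since the abstract does not give the actual values or the rule for choosing them.

```python
def geometric_sample_sizes(n_total, n_start, growth):
    """Return sample sizes that grow geometrically until the full set is used.

    n_total: number of available training examples.
    n_start: hypothetical number of examples used in the first iteration.
    growth:  hypothetical multiplicative factor (> 1) applied each iteration.
    """
    if n_start <= 0 or growth <= 1.0:
        raise ValueError("n_start must be positive and growth must exceed 1")
    sizes = []
    size = float(n_start)
    while size < n_total:
        sizes.append(int(size))
        size *= growth
    sizes.append(n_total)  # the final iterations use all of the data
    return sizes

# Example: 1000 examples, starting from 100 and doubling each iteration.
print(geometric_sample_sizes(1000, 100, 2.0))  # [100, 200, 400, 800, 1000]
```

Early iterations then compute gradients and Krylov subspace products on small subsets, and later iterations use progressively more data.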
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
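Stochastic rounding to a fixed-point grid, as referred to in the abstract above, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `stochastic_round_fixed_point`, the split of the 16-bit word into 8 fractional bits, and the saturation behaviour are assumptions chosen for the example.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid with stochastic rounding.

    A value lying between two grid points is rounded up with probability
    equal to its fractional distance from the lower point, so the rounding
    is unbiased in expectation. frac_bits and word_bits are assumed values.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    rounded = lower + (rng.random(scaled.shape) < (scaled - lower))
    # Saturate to the representable range of a signed word_bits integer.
    lo = -(2 ** (word_bits - 1))
    hi = 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale
```

Unlike round-to-nearest, this scheme does not systematically discard updates smaller than half a grid step: averaged over many steps, small gradient contributions are preserved in expectation.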
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
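The stochastic pooling procedure described in the preceding abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `stochastic_pool` is hypothetical, a pooling region is represented as a flat list of activations, and the activations are assumed to be non-negative (for example, taken after a rectifier) so that they can be normalized into sampling probabilities.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. p_i = a_i / sum(a). Assumes non-negative activations (an
    assumption of this sketch, e.g. values taken after a rectifier).
    """
    acts = [float(a) for a in region]
    if any(a < 0 for a in acts):
        raise ValueError("activations must be non-negative")
    total = sum(acts)
    if total == 0.0:
        # Every activation is zero, so the pooled value is zero as well.
        return 0.0
    # random.choices samples with the given (unnormalized) weights.
    return rng.choices(acts, weights=acts, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_pool([0.5, 1.5, 0.0, 2.0], rng))
```

Passing a seeded `random.Random` instance keeps the sampling reproducible. Test-time behaviour (how the samples are aggregated into a single deterministic output) is not covered by this sketch.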
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of
(Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
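The layer structure described in the multilayer bootstrap network abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract says that each clustering's centers are randomly sampled data points and that the clusterings in a layer are independent, but the squared Euclidean distance, the one-hot nearest-center output, and the names `build_layer` and `encode` are hypothetical choices made here, not details taken from the paper.

```python
import random


def build_layer(data, num_clusterings, k, rng):
    """Build one layer: each clustering is k data points sampled at random as centers."""
    return [rng.sample(data, k) for _ in range(num_clusterings)]


def encode(point, layer):
    """Encode a point as concatenated one-hot nearest-center indicators (assumed encoding)."""
    code = []
    for centers in layer:
        # Squared Euclidean distance to every center of this clustering (assumed metric).
        dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
        nearest = dists.index(min(dists))
        code.extend(1.0 if i == nearest else 0.0 for i in range(len(centers)))
    return code


rng = random.Random(0)
data = [[rng.random(), rng.random()] for _ in range(50)]
layer = build_layer(data, num_clusterings=4, k=5, rng=rng)
codes = [encode(p, layer) for p in data]
```

Stacking would then repeat the same two steps on `codes`, typically with a smaller `k` per layer; that schedule is likewise an assumption of this sketch.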
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
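As a rough illustration of the input-to-hidden variant mentioned in the abstract above, the sketch below standardizes only the input term of a plain tanh RNN step using mini-batch statistics. It is a minimal pure-Python sketch, not the authors' implementation: the learned scale and shift parameters of batch normalization are omitted, biases are left out, and the helper names `batch_norm`, `matmul`, and `rnn_step` are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature column using the mean and variance of the mini-batch."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


def matmul(rows, w):
    """Multiply a batch of row vectors by a weight matrix given as nested lists."""
    return [
        [sum(r[i] * w[i][j] for i in range(len(r))) for j in range(len(w[0]))]
        for r in rows
    ]


def rnn_step(x_batch, h_batch, w_xh, w_hh):
    """One tanh RNN step; only the input-to-hidden term is batch-normalized."""
    x_term = batch_norm(matmul(x_batch, w_xh))
    h_term = matmul(h_batch, w_hh)  # hidden-to-hidden term is left unnormalized
    return [
        [math.tanh(a + b) for a, b in zip(xr, hr)]
        for xr, hr in zip(x_term, h_term)
    ]
```

Normalizing `h_term` in the same way would give the hidden-to-hidden variant that the abstract reports as unhelpful.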
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
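The stochastic rounding idea in the limited-precision abstract above can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract specifies 16-bit fixed-point numbers with stochastic rounding, while the split into 8 fractional bits, the saturation behaviour, and the function names `stochastic_round` and `to_fixed_point` are hypothetical choices for illustration only.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional remainder
    on the fixed-point grid, so the rounding is unbiased in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed word_bits range (assumed)."""
    scale = 1 << frac_bits
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = int(round(stochastic_round(x, frac_bits, rng) * scale))
    return max(min_int, min(max_int, q)) / scale
```

Round-to-nearest would map every small update below half a grid step to zero, whereas the stochastic scheme keeps such updates alive on average, which is one plausible reading of why the rounding scheme matters during training.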
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
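As a rough illustration of the first direction named in the pooling abstract above (combining max and average pooling), the following Python sketch blends the two with a mixing proportion. The function names `mixed_pool` and `gated_pool`, the sigmoid gate, and the `gate_weights` parameter are assumptions made for illustration; the abstract only states that two strategies of combining max and average pooling are learned, so this is one plausible reading rather than the authors' formulation. In a real network the mixing proportion or gate weights would be learned by gradient descent.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling with a mixing proportion a in [0, 1].

    a = 1.0 gives plain max pooling, a = 0.0 gives plain average pooling.
    """
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


def gated_pool(region, gate_weights):
    """Pick the mixing proportion from the region itself (hypothetical gate).

    The proportion is a sigmoid of a weighted sum of the activations, so
    different regions can lean toward max or toward average pooling.
    """
    score = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, a)
```

For the region `[1.0, 2.0, 3.0, 6.0]`, `mixed_pool` moves from the average (3.0) to the maximum (6.0) as `a` goes from 0 to 1.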
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
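The stochastic pooling procedure in the abstract above can be sketched in a few lines of plain Python: within one pooling region, an activation is picked with probability proportional to its value, which is a multinomial distribution given by the activities in the region. The names `stochastic_pool` and `stochastic_pool_1d`, the `rng` parameter, the restriction to non-negative activations, and the choice to return 0.0 for an all-zero region are assumptions made for illustration, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Activation a_i is chosen with probability a_i / sum(region).
    Assumes non-negative activations (for example, after a rectifier).
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(signal, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of a 1-D list."""
    return [
        stochastic_pool(signal[i:i + size], rng)
        for i in range(0, len(signal) - size + 1, size)
    ]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible. This sketch covers only the training-time sampling step described in the abstract.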
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called the $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
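The claim in the $L_p$ unit abstract above, that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `lp_unit` is hypothetical, the inputs are assumed to be already-computed projection values, and the learned projections and learned order $p$ described in the abstract are left out.

```python
def lp_unit(values, p):
    """Normalized L_p norm of a group of inputs: ((1/N) * sum(|v|^p)) ** (1/p).

    `values` stands in for the projections the unit pools over (an assumption
    of this sketch); `p` is the order and must be at least 1.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(values)
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return 0.0
    # Factor out the largest magnitude so that large p stays numerically stable.
    total = sum((abs(v) / scale) ** p for v in values)
    return scale * (total / n) ** (1.0 / p)


group = [1.0, -2.0, 3.0, -4.0]
mean_abs = lp_unit(group, 1)    # p = 1: average of absolute values
rms = lp_unit(group, 2)         # p = 2: root-mean-square
near_max = lp_unit(group, 200)  # large p: approaches the largest magnitude
```

With p = 1 the unit reduces to the mean of absolute values, with p = 2 to the root-mean-square, and as p grows it approaches max pooling over magnitudes.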
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
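As a rough illustration of the multilayer bootstrap network described in the abstract above, the sketch below builds a single layer: each clustering draws its centers at random from the data, and each point is encoded by concatenating the one-hot index of its nearest center in every clustering. The function names, the squared Euclidean distance, the one-hot encoding, and the parameter values are assumptions made for illustration; the abstract does not specify them.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_layer(data, n_clusterings, k, rng):
    """One layer: n_clusterings independent clusterings, each with k centers
    sampled at random (without replacement) from the data points."""
    return [rng.sample(data, k) for _ in range(n_clusterings)]

def encode(point, layer):
    """Concatenate one one-hot nearest-center indicator per clustering."""
    code = []
    for centers in layer:
        nearest = min(range(len(centers)),
                      key=lambda j: sq_dist(point, centers[j]))
        code.extend(1 if j == nearest else 0 for j in range(len(centers)))
    return code

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
layer = build_layer(data, n_clusterings=4, k=5, rng=rng)
codes = [encode(p, layer) for p in data]  # sparse codes for the next layer
```

Stacking would feed `codes` into another `build_layer` call, typically with a smaller `k`; that schedule is likewise an assumption here.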
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
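To make the input-to-hidden variant in the abstract above concrete, here is a minimal sketch of a tanh RNN step in which only the input projection is batch-normalized, i.e. h_t = tanh(BN(W_x x_t) + W_h h_{t-1}). It is an illustration under assumptions, not the authors' implementation: the learned gain and shift are fixed to 1 and 0, statistics are computed per time step over the mini-batch, biases are omitted, and all names and sizes are hypothetical.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def batch_norm(batch, eps=1e-5):
    """Standardize each feature across the mini-batch (gain 1, shift 0)."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n
                 for d in range(dims)]
    return [[(row[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dims)] for row in batch]

def rnn_step(x_batch, h_batch, W_x, W_h):
    """One step; only the input-to-hidden term is normalized."""
    input_term = batch_norm([matvec(W_x, x) for x in x_batch])
    recurrent_term = [matvec(W_h, h) for h in h_batch]
    return [[math.tanh(a + b) for a, b in zip(i_row, r_row)]
            for i_row, r_row in zip(input_term, recurrent_term)]

rng = random.Random(0)
batch_size, n_in, n_hid, n_steps = 8, 3, 4, 5
W_x = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hid)]
W_h = [[rng.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_hid)]
h = [[0.0] * n_hid for _ in range(batch_size)]
for _ in range(n_steps):
    x = [[rng.gauss(5, 2) for _ in range(n_in)] for _ in range(batch_size)]
    h = rnn_step(x, h, W_x, W_h)
```

The hidden-to-hidden term `W_h h_{t-1}` is left unnormalized, matching the variant the abstract reports as helpful for convergence.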
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data.
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
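The stochastic rounding named in the fixed-point training abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with a configurable number of fractional bits and a 16-bit word; the function name `stochastic_round` and its parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the expected result equals x
    (as long as x lies inside the representable range).
    Assumption: signed word of `word_bits` bits with saturation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)

    # Saturate to the signed integer range of the word.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    rounded = max(min_int, min(max_int, rounded))
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average stays close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic variant is unbiased on average, which is the property the abstract credits for stable low-precision training.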
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function.
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
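The first direction in the generalized pooling abstract above, combining max and average pooling, can be sketched as follows. This assumes one simple form of the combination, a convex mix with a single scalar `alpha` that would be learned in practice but is passed in here; the function name `mixed_pool_2x2` and the fixed 2x2 non-overlapping window are illustrative choices, not details from the abstract.

```python
def mixed_pool_2x2(feature_map, alpha):
    """Pool non-overlapping 2x2 windows of a 2D list of numbers.

    Each output is alpha * max(window) + (1 - alpha) * mean(window).
    alpha = 1 recovers max pooling and alpha = 0 recovers average pooling.
    Assumption: alpha is a plain scalar; a trainable version would treat
    it as a parameter updated by gradient descent.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            max_val = max(window)
            avg_val = sum(window) / 4
            out_row.append(alpha * max_val + (1 - alpha) * avg_val)
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    fm = [[1, 2, 0, 0],
          [3, 4, 0, 8]]
    print(mixed_pool_2x2(fm, alpha=0.5))
```

Because the output is linear in `alpha`, its gradient with respect to `alpha` is simply the difference between the max and the mean of each window, which is what makes the mixing weight easy to learn.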
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
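The stochastic pooling procedure in the abstract above picks one activation per pooling region from a multinomial distribution given by the activities in that region. A minimal sketch, assuming non-negative activations (for example, after a rectifier) and a pooling region already flattened into a list; the name `stochastic_pool` and the all-zero fallback are assumptions made for illustration.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Activation a_i is chosen with probability a_i / sum(region), so
    larger activations are picked more often but smaller ones still
    have a chance. Assumption: all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # No positive activity in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0, 0.0, 0.0]
    draws = [stochastic_pool(region, rng=rng) for _ in range(10000)]
    # 3.0 should be drawn roughly three times as often as 1.0.
    print(draws.count(3.0) / len(draws))
```

This sketch covers only the training-time sampling step; how the pooled value is computed at test time is not stated in the abstract and is left out here.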
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013)
into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. 
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools.
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning.
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
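The stochastic rounding idea in the low-precision abstract above can be illustrated with a minimal sketch. The function name `stochastic_round_fixed`, the default split of 8 fractional bits inside a 16-bit signed word, and the saturation behaviour are assumptions made for illustration; the abstract does not specify them. A value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits and word_bits are illustrative defaults, not values taken
    from the abstract.
    """
    scale = 1 << frac_bits            # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected result equals x (before saturation).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range representable in a signed word.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    clipped = max(min_int, min(max_int, lower))
    return clipped / scale


def mean_of_rounded(x, n, frac_bits, seed=0):
    """Average n stochastic roundings of x; approaches x as n grows."""
    rng = random.Random(seed)
    total = sum(stochastic_round_fixed(x, frac_bits, rng=rng) for _ in range(n))
    return total / n
```

Round-to-nearest would map every occurrence of 0.3 on a grid of spacing 0.25 to 0.25, so small updates below half the grid spacing would be lost systematically; with stochastic rounding they survive on average.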
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models.
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning.
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance.
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations.
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
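As a rough illustration of what "combining of max and average pooling" in the abstract above could look like, the sketch below blends the two over a single pooling region. The abstract does not spell out its two strategies, so both functions here are assumptions: `mixed_pool` uses one fixed mixing proportion `a`, and `gated_pool` derives the proportion from the region itself through a sigmoid of a weighted sum. In a real network `a` and `weights` would be learned by gradient descent; here they are plain arguments.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling: a * max + (1 - a) * mean.

    `a` in [0, 1] is a hypothetical mixing proportion; a = 1 gives
    max pooling and a = 0 gives average pooling.
    """
    mx = max(region)
    avg = sum(region) / len(region)
    return a * mx + (1.0 - a) * avg


def gated_pool(region, weights):
    """Same blend, but the proportion is computed from the region.

    `weights` is a hypothetical gating vector with one entry per
    activation in the region.
    """
    z = sum(w * x for w, x in zip(weights, region))
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid keeps a in (0, 1)
    return mixed_pool(region, a)
```

With all-zero gating weights the gate sits at 0.5, so `gated_pool` starts as an even blend of max and average pooling and can move toward either as the weights change.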
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
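The stochastic pooling step described above can be sketched for a single pooling region as follows. This is an illustrative sketch, not the authors' implementation: it assumes non-negative activations (for example after a rectifier), and the handling of an all-zero region is an assumption made here so the function is always defined.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value, i.e. from a multinomial given by the normalized activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    probs = [a / total for a in region]
    # random.choices draws one element according to the given weights.
    return rng.choices(region, weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.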
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. 
al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. 
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. 
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). 
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. 
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
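The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is illustrative only: it takes the projection values as given (the learned projections and any learned order $p$ are left out), and the name `lp_unit` is an assumption made here.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a list of projection values.

    p = 1 gives the mean absolute value, p = 2 gives the
    root-mean-square, and large p approaches the maximum absolute value.
    Assumes p >= 1 and a non-empty list.
    """
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)
```

Evaluating `lp_unit([1.0, 2.0, 3.0], p)` for increasing `p` moves the output from the mean of the absolute values toward the largest one.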
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. 
Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. 
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. 
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. 
Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. 
Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. 
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets. 
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1. Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
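The layer structure described in the multilayer bootstrap network abstract above can be sketched roughly as follows. This is only an illustrative reading of that one sentence, not the authors' implementation: the function names fit_bootstrap_layer and encode are hypothetical, and the Euclidean distance, the one-hot output coding, and the smaller k in the second layer are assumptions that the abstract does not state.

```python
import numpy as np

def encode(X, centers):
    """Map each row of X to concatenated one-hot codes, one per clustering."""
    codes = []
    for C in centers:
        # Squared Euclidean distance to every center (assumed metric): shape (n, k).
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        onehot = np.zeros_like(d)
        onehot[np.arange(len(X)), d.argmin(axis=1)] = 1.0
        codes.append(onehot)
    return np.hstack(codes)

def fit_bootstrap_layer(X, n_clusterings, k, rng):
    """One layer: independent k-centers clusterings whose centers are
    k data points sampled at random from X (no optimisation step)."""
    centers = [X[rng.choice(len(X), size=k, replace=False)]
               for _ in range(n_clusterings)]
    return centers, encode(X, centers)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    # Stack two layers; shrinking k from 20 to 5 is an assumed schedule.
    centers1, H1 = fit_bootstrap_layer(X, n_clusterings=10, k=20, rng=rng)
    centers2, H2 = fit_bootstrap_layer(H1, n_clusterings=10, k=5, rng=rng)
    print(H1.shape, H2.shape)
```

Each row of a layer's output has exactly one active unit per clustering, so the representation is sparse and its width is n_clusterings times k.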
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
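As a rough illustration of the input-to-hidden variant mentioned in the batch normalization abstract above, the sketch below normalizes only the term W_x x_t and leaves the recurrent term untouched. It is a minimal forward pass under stated assumptions, not the paper's setup: the names batch_norm and rnn_forward are hypothetical, statistics are computed per time step over the mini-batch, the nonlinearity is tanh, and no running averages for test time are kept.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def rnn_forward(xs, W_x, W_h, b, gamma, beta):
    """xs has shape (T, B, D). Batch normalization is applied to the
    input-to-hidden term only; the hidden-to-hidden term is left as is."""
    T, B, _ = xs.shape
    h = np.zeros((B, W_h.shape[0]))
    hs = []
    for t in range(T):
        h = np.tanh(batch_norm(xs[t] @ W_x, gamma, beta) + h @ W_h + b)
        hs.append(h)
    return np.stack(hs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, B, D, H = 6, 16, 3, 8
    xs = rng.normal(size=(T, B, D))
    hs = rnn_forward(xs, rng.normal(size=(D, H)), 0.1 * rng.normal(size=(H, H)),
                     np.zeros(H), np.ones(H), np.zeros(H))
    print(hs.shape)
```

Normalizing the input term outside the recurrence means the same transformation could be applied to all time steps at once; the per-step loop here is kept only for readability.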
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. Poor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration. Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning. We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings. Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline. Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset. Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows. 
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them. Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations. Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds. We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models. Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. 
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance. We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU). Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
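Stochastic rounding to a fixed-point grid, as named in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stochastic_round_fixed`, the split into `frac_bits` fractional bits within a `word_bits`-wide signed word, and the saturation behaviour are all hypothetical choices made here.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a multiple of 2**-frac_bits using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, so the result equals x in
    expectation (as long as no saturation occurs).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    q = lower + (1 if rng.random() < scaled - lower else 0)
    # Saturate to the signed integer range of the assumed word width.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Unlike round-to-nearest, which always maps a small update to the same grid point, this scheme is unbiased, so updates smaller than the grid spacing still have an effect on average.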
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers. 
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically. Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users. We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. 
Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs. We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive. Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. 
In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters. In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. 
Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. 
The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains. The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. 
Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy. Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks. We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent. A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement. A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. 
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks. Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks.
Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units. We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters. 
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention. We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation. 
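The stochastic pooling procedure described in the abstract just above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the activations are assumed to be non-negative (for example after a rectifier), the pooling regions are non-overlapping 1-D windows, and the function names `stochastic_pool` and `stochastic_pool_1d` are hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial over the region given by the activities themselves.
    Assumes every value in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # all-zero region: there is nothing to sample
    # random.choices normalises the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(activations, size, rng=random):
    """Apply stochastic pooling over non-overlapping windows of `size`."""
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], 2, rng))
```

Because the sampling weights are the activations themselves, no extra hyper-parameter is introduced, which matches the abstract's description of the approach as hyper-parameter free.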
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes. We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly. Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. 
Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. 
Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise. We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results.
We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark. Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units. Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network.
Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN". Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection. Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. 
Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks. In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. 
The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far. Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature.
The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions. Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. We present a probabilistic variant of the recently introduced maxout unit. 
The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN). This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet. The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs).
This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP. Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling. Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. 
We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first.
Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper. In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN). 
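The normalized $L_p$ norm mentioned in the abstract just above can be illustrated with a short sketch. It assumes the unit's inputs have already been projected from the layer below, and it uses the plain form (mean of |z_i|^p)^(1/p); the function name `lp_unit` is hypothetical, and any learned projections, offsets, or per-unit learning of p from the paper are left out.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a group of projected inputs.

    Computes (mean(|z_i| ** p)) ** (1 / p). With p = 1 this is the average
    of absolute values, with p = 2 the root-mean-square, and as p grows it
    approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("p is assumed to be at least 1 in this sketch")
    if not projections:
        raise ValueError("at least one projected input is required")
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [1.0, -2.0, 4.0]
    for p in (1, 2, 8, 64):
        print(p, lp_unit(z, p))
```

Sweeping p in this way shows how a single unit interpolates between average-like and max-like pooling, which is the generalization the abstract refers to.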
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate. Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance.
Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally. We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models.
Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly. One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning.
Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data. Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. 
The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases. Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments. Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently.
Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations. We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset. In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation.
Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions. Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement. Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction.
The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology. In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator.
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors. Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning. This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm. We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage. Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100. We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches.
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance. Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms. Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work. Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments. 
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model. In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data. Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application. We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks. Artificial neural networks are simple and efficient machine learning tools. 
Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1 Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction that each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval. 
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles. Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial. 
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5. We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. 
Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data. We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012). 
================================================ FILE: 2017/examples/data/heart.csv ================================================ sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd 160,12,5.73,23.11,Present,49,25.3,97.2,52,1 144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1 118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0 170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1 134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1 132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0 142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0 114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1 114,0,3.83,19.4,Present,49,24.86,2.49,29,0 132,0,5.8,30.96,Present,69,30.11,0,53,1 206,6,2.95,32.27,Absent,72,26.81,56.06,60,1 134,14.1,4.44,22.39,Present,65,23.09,0,40,1 118,0,1.88,10.05,Absent,59,21.57,0,17,0 132,0,1.87,17.21,Absent,49,23.63,0.97,15,0 112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0 117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0 120,7.5,15.33,22,Absent,60,25.31,34.49,49,0 146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1 158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1 124,14,6.23,35.96,Present,45,30.09,0,59,1 106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1 132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0 150,0.3,6.38,33.99,Present,62,24.64,0,50,0 138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0 142,18.2,4.34,24.38,Absent,61,26.19,0,50,0 124,4,12.42,31.29,Present,54,23.23,2.06,42,1 118,6,9.65,33.91,Absent,60,38.8,0,48,0 145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1 144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0 146,0,6.62,25.69,Absent,60,28.07,8.23,63,1 136,2.52,3.95,25.63,Absent,51,21.86,0,45,1 158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1 122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1 126,8.75,6.53,34.02,Absent,49,30.25,0,41,1 148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0 122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1 140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0 110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0 130,0,2.82,19.63,Present,70,24.86,0,29,0 136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1 
118,0.28,5.8,33.7,Present,60,30.98,0,41,1 144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0 120,0,1.07,16.02,Absent,47,22.15,0,15,0 130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1 114,0,2.99,9.74,Absent,54,46.58,0,17,0 128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0 162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1 116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1 114,0,1.94,11.02,Absent,54,20.17,38.98,16,0 126,3.8,3.88,31.79,Absent,57,30.53,0,30,0 122,0,5.75,30.9,Present,46,29.01,4.11,42,0 134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0 152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1 134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1 156,3,1.82,27.55,Absent,60,23.91,54,53,0 152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0 118,0,2.99,16.17,Absent,49,23.83,3.22,28,0 126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1 103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0 121,0.8,5.29,18.95,Present,47,22.51,0,61,0 142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0 138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0 152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0 140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0 130,0,1.82,10.45,Absent,57,22.07,2.06,17,0 136,7.36,2.19,28.11,Present,61,25,61.71,54,0 124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0 112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0 118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0 122,0,3.37,16.1,Absent,67,21.06,0,32,1 118,0,3.67,12.13,Absent,51,19.15,0.6,15,0 130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0 130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0 126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0 128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0 136,0,4.12,17.42,Absent,52,21.66,12.86,40,0 134,0,5.9,30.84,Absent,49,29.16,0,55,0 140,0.6,5.56,33.39,Present,58,27.19,0,55,1 168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1 108,0.4,5.91,22.92,Present,57,25.72,72,39,0 114,3,7.04,22.64,Present,55,22.59,0,45,1 140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1 148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1 148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1 128,0,2.43,13.15,Present,63,20.75,0,17,0 
130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0 126,10.5,4.49,17.33,Absent,67,19.37,0,49,1 140,0,5.08,27.33,Present,41,27.83,1.25,38,0 126,0.9,5.64,17.78,Present,55,21.94,0,41,0 122,0.72,4.04,32.38,Absent,34,28.34,0,55,0 116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0 120,3.7,4.02,39.66,Absent,61,30.57,0,64,1 143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0 118,4,3.95,18.96,Absent,54,25.15,8.33,49,1 194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0 134,3,4.37,23.07,Absent,56,20.54,9.65,62,0 138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0 136,0,5,27.58,Present,49,27.59,1.47,39,0 122,3.2,11.32,35.36,Present,55,27.07,0,51,1 164,12,3.91,19.59,Absent,51,23.44,19.75,39,0 136,8,7.85,23.81,Present,51,22.69,2.78,50,0 166,0.07,4.03,29.29,Absent,53,28.37,0,27,0 118,0,4.34,30.12,Present,52,32.18,3.91,46,0 128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0 118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0 158,3.6,2.97,30.11,Absent,63,26.64,108,64,0 108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1 170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1 118,1,5.76,22.1,Absent,62,23.48,7.71,42,0 124,0,3.04,17.33,Absent,49,22.04,0,18,0 114,0,8.01,21.64,Absent,66,25.51,2.49,16,0 168,9,8.53,24.48,Present,69,26.18,4.63,54,1 134,2,3.66,14.69,Absent,52,21.03,2.06,37,0 174,0,8.46,35.1,Present,35,25.27,0,61,1 116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1 128,0,10.58,31.81,Present,46,28.41,14.66,48,0 140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1 154,0.7,5.91,25,Absent,13,20.6,0,42,0 150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1 130,0,3.92,25.55,Absent,68,28.02,0.68,27,0 128,2,6.13,21.31,Absent,66,22.86,11.83,60,0 120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0 120,0,5.01,26.13,Absent,64,26.21,12.24,33,0 138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1 153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0 123,8.6,11.17,35.28,Present,70,33.14,0,59,1 148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0 136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0 134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1 152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0 
158,13.5,5.04,30.79,Absent,54,24.79,21.5,62,0 132,2,3.08,35.39,Absent,45,31.44,79.82,58,1 134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1 142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0 134,6,3.3,28.45,Absent,65,26.09,58.11,40,0 122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1 116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0 128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0 120,0,3.68,12.24,Absent,51,20.52,0.51,20,0 124,0,3.95,36.35,Present,59,32.83,9.59,54,0 160,14,5.9,37.12,Absent,58,33.87,3.52,54,1 130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1 128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0 130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0 109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0 144,0,3.84,18.72,Absent,56,22.1,4.8,40,0 118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0 136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1 136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1 124,15.5,5.05,24.06,Absent,46,23.22,0,61,1 148,6,6.49,26.47,Absent,48,24.7,0,55,0 128,6.6,3.58,20.71,Absent,55,24.15,0,52,0 122,0.28,4.19,19.97,Absent,61,25.63,0,24,0 108,0,2.74,11.17,Absent,53,22.61,0.95,20,0 124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1 138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1 127,0,2.81,15.7,Absent,42,22.03,1.03,17,0 174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0 122,0,3.05,23.51,Absent,46,25.81,0,38,0 144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1 126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0 208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1 138,0,2.68,17.04,Absent,42,22.16,0,16,0 148,0,3.84,17.26,Absent,70,20,0,21,0 122,0,3.08,16.3,Absent,43,22.13,0,16,0 132,7,3.2,23.26,Absent,77,23.64,23.14,49,0 110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1 160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1 126,0.54,4.39,21.13,Present,45,25.99,0,25,0 162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0 194,2.55,6.89,33.88,Present,69,29.33,0,41,0 118,0.75,2.58,20.25,Absent,59,24.46,0,32,0 124,0,4.79,34.71,Absent,49,26.09,9.26,47,0 160,0,2.42,34.46,Absent,48,29.83,1.03,61,0 128,0,2.51,29.35,Present,53,22.05,1.37,62,0 
122,4,5.24,27.89,Present,45,26.52,0,61,1 132,2,2.7,21.57,Present,50,27.95,9.26,37,0 120,0,2.42,16.66,Absent,46,20.16,0,17,0 128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0 108,15,4.91,34.65,Absent,41,27.96,14.4,56,0 166,0,4.31,34.27,Absent,45,30.14,13.27,56,0 152,0,6.06,41.05,Present,51,40.34,0,51,0 170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1 156,4,2.05,19.48,Present,50,21.48,27.77,39,1 116,8,6.73,28.81,Present,41,26.74,40.94,48,1 122,4.4,3.18,11.59,Present,59,21.94,0,33,1 150,20,6.4,35.04,Absent,53,28.88,8.33,63,0 129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0 134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0 126,0,5.98,29.06,Present,56,25.39,11.52,64,1 142,0,3.72,25.68,Absent,48,24.37,5.25,40,1 128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1 102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1 130,0,4.89,25.98,Absent,72,30.42,14.71,23,0 138,0.05,2.79,10.35,Absent,46,21.62,0,18,0 138,0,1.96,11.82,Present,54,22.01,8.13,21,0 128,0,3.09,20.57,Absent,54,25.63,0.51,17,0 162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0 160,3,9.19,26.47,Present,39,28.25,14.4,54,1 148,0,4.66,24.39,Absent,50,25.26,4.03,27,0 124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0 136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1 134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0 128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0 122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0 152,3,4.64,31.29,Absent,41,29.34,4.53,40,0 162,0,5.09,24.6,Present,64,26.71,3.81,18,0 124,4,6.65,30.84,Present,54,28.4,33.51,60,0 136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0 136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0 134,0.05,8.03,27.95,Absent,48,26.88,0,60,0 122,1,5.88,34.81,Present,69,31.27,15.94,40,1 116,3,3.05,30.31,Absent,41,23.63,0.86,44,0 132,0,0.98,21.39,Absent,62,26.75,0,53,0 134,0,2.4,21.11,Absent,57,22.45,1.37,18,0 160,7.77,8.07,34.8,Absent,64,31.15,0,62,1 180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1 124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0 114,0,4.97,9.69,Absent,26,22.6,0,25,0 208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0 
138,0,3.14,12,Absent,54,20.28,0,16,0 164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1 144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0 136,7.5,7.39,28.04,Present,50,25.01,0,45,1 132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0 143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0 112,4.46,7.18,26.25,Present,69,27.29,0,32,1 134,10,3.79,34.72,Absent,42,28.33,28.8,52,1 138,2,5.11,31.4,Present,49,27.25,2.06,64,1 188,0,5.47,32.44,Present,71,28.99,7.41,50,1 110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1 136,13.2,7.18,35.95,Absent,48,29.19,0,62,0 130,1.75,5.46,34.34,Absent,53,29.42,0,58,1 122,0,3.76,24.59,Absent,56,24.36,0,30,0 138,0,3.24,27.68,Absent,60,25.7,88.66,29,0 130,18,4.13,27.43,Absent,54,27.44,0,51,1 126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0 176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0 122,0,5.49,19.56,Absent,57,23.12,14.02,27,0 124,0,3.23,9.64,Absent,59,22.7,0,16,0 140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1 128,6,4.37,22.98,Present,50,26.01,0,47,0 190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0 144,0.76,10.53,35.66,Absent,63,34.35,0,55,1 126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1 128,0,2.63,23.88,Absent,45,21.59,6.54,57,0 136,0.4,3.91,21.1,Present,63,22.3,0,56,1 158,4,4.18,28.61,Present,42,25.11,0,60,0 160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0 124,6,5.21,33.02,Present,64,29.37,7.61,58,1 158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0 128,0,6.34,11.87,Absent,57,23.14,0,17,0 166,3,3.82,26.75,Absent,45,20.86,0,63,1 146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0 161,9,4.65,15.16,Present,58,23.76,43.2,46,0 164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1 146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1 142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0 138,12,5.13,28.34,Absent,59,24.49,32.81,58,1 154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0 118,0,2.39,12.13,Absent,49,18.46,0.26,17,1 124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0 124,1.04,2.84,16.42,Present,46,20.17,0,61,0 136,5,4.19,23.99,Present,68,27.8,25.86,35,0 132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1 
118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0 118,0.12,4.16,9.37,Absent,57,19.61,0,17,0 134,12,4.96,29.79,Absent,53,24.86,8.23,57,0 114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0 136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1 130,0,4.16,39.43,Present,46,30.01,0,55,1 136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1 136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0 154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0 108,0.8,2.47,17.53,Absent,47,22.18,0,55,1 136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1 174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1 124,4.25,8.22,30.77,Absent,56,25.8,0,43,0 114,0,2.63,9.69,Absent,45,17.89,0,16,0 118,0.12,3.26,12.26,Absent,55,22.65,0,16,0 106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1 146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0 206,0,4.17,33.23,Absent,69,27.36,6.17,50,1 134,3,3.17,17.91,Absent,35,26.37,15.12,27,0 148,15,4.98,36.94,Present,72,31.83,66.27,41,1 126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0 134,0,3.69,13.92,Absent,43,27.66,0,19,0 134,0.02,2.8,18.84,Absent,45,24.82,0,17,0 123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0 112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1 112,0,1.71,15.96,Absent,42,22.03,3.5,16,0 101,0.48,7.26,13,Absent,50,19.82,5.19,16,0 150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0 170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1 134,0,5.63,29.12,Absent,68,32.33,2.02,34,0 142,0,4.19,18.04,Absent,56,23.65,20.78,42,1 132,0.1,3.28,10.73,Absent,73,20.42,0,17,0 136,0,2.28,18.14,Absent,55,22.59,0,17,0 132,12,4.51,21.93,Absent,61,26.07,64.8,46,1 166,4.1,4,34.3,Present,32,29.51,8.23,53,0 138,0,3.96,24.7,Present,53,23.8,0,45,0 138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1 170,0,3.12,37.15,Absent,47,35.42,0,53,0 128,0,8.41,28.82,Present,60,26.86,0,59,1 136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0 128,0,3.22,26.55,Present,39,26.59,16.71,49,0 150,14.4,5.04,26.52,Present,60,28.84,0,45,0 132,8.4,3.57,13.68,Absent,42,18.75,15.43,59,1 142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0 130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0 
174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1 114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0 162,1.5,2.46,19.39,Present,49,24.32,0,59,1 174,0,3.27,35.4,Absent,58,37.71,24.95,44,0 190,5.15,6.03,36.59,Absent,42,30.31,72,50,0 154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0 124,0,2.28,24.86,Present,50,22.24,8.26,38,0 114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0 168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1 142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0 154,0,4.81,28.11,Present,56,25.67,75.77,59,0 146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0 166,6,3.02,29.3,Absent,35,24.38,38.06,61,0 140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1 136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0 156,0,3.47,21.1,Absent,73,28.4,0,36,1 132,0,6.63,29.58,Present,37,29.41,2.57,62,0 128,0,2.98,12.59,Absent,65,20.74,2.06,19,0 106,5.6,3.2,12.3,Absent,49,20.29,0,39,0 144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0 154,0.31,2.33,16.48,Absent,33,24,11.83,17,0 126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0 134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1 152,19.45,4.22,29.81,Absent,28,23.95,0,59,1 146,1.35,6.39,34.21,Absent,51,26.43,0,59,1 162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0 130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1 138,6,7.24,37.05,Absent,38,28.69,0,59,0 148,0,5.32,26.71,Present,52,32.21,32.78,27,0 124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0 118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0 116,4.28,7.02,19.99,Present,68,23.31,0,52,1 162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1 138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0 137,1.2,3.14,23.87,Absent,66,24.13,45,37,0 198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1 154,4.5,4.75,23.52,Present,43,25.76,0,53,1 128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0 130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1 162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0 120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0 136,3.99,2.58,16.38,Present,53,22.41,27.67,36,0 176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1 134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1 
122,1.7,5.28,32.23,Present,51,24.08,0,54,0 134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1 134,0,2.43,22.24,Absent,52,26.49,41.66,24,0 136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0 132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0 152,1.68,3.58,25.43,Absent,50,27.03,0,32,0 132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1 124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0 140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0 166,0.6,2.42,34.03,Present,53,26.96,54,60,0 156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1 132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0 150,0,4.99,27.73,Absent,57,30.92,8.33,24,0 134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0 126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0 148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0 148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1 132,6,5.97,25.73,Present,66,24.18,145.29,41,0 128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0 128,5.16,4.9,31.35,Present,57,26.42,0,64,0 140,0,2.4,27.89,Present,70,30.74,144,29,0 126,0,5.29,27.64,Absent,25,27.62,2.06,45,0 114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0 118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0 126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0 154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1 112,1.44,2.71,22.92,Absent,59,24.81,0,52,0 140,8,4.42,33.15,Present,47,32.77,66.86,44,0 140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1 128,2.6,4.94,21.36,Absent,61,21.3,0,31,0 126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0 160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1 144,0,4.17,29.63,Present,52,21.83,0,59,0 148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1 146,0,4.92,18.53,Absent,57,24.2,34.97,26,0 164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1 130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1 154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1 178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0 180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0 134,12.5,2.73,39.35,Absent,48,35.58,0,48,0 142,0,3.54,16.64,Absent,58,25.97,8.36,27,0 162,7,7.67,34.34,Present,33,30.77,0,62,0 218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1 
126,8.75,6.06,32.72,Present,33,27,62.43,55,1 126,0,3.57,26.01,Absent,61,26.3,7.97,47,0 134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0 132,0,4.17,36.57,Absent,57,30.61,18,49,0 178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1 208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1 160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0 116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0 180,25.01,3.7,38.11,Present,57,30.54,0,61,1 200,19.2,4.43,40.6,Present,55,32.04,36,60,1 112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0 120,0,3.1,26.97,Absent,41,24.8,0,16,0 178,20,9.78,33.55,Absent,37,27.29,2.88,62,1 166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0 164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1 216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1 146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0 134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1 158,16,5.56,29.35,Absent,36,25.92,58.32,60,0 176,0,3.14,31.04,Present,45,30.18,4.63,45,0 132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0 126,0,4.55,29.18,Absent,48,24.94,36,41,0 120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0 174,0,3.86,21.73,Absent,42,23.37,0,63,0 150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1 176,6,3.98,17.2,Present,52,21.07,4.11,61,1 142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1 132,0,3.3,21.61,Absent,42,24.92,32.61,33,0 142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0 146,1.16,2.28,34.53,Absent,50,28.71,45,49,0 132,7.2,3.65,17.16,Present,56,23.25,0,34,0 120,0,3.57,23.22,Absent,58,27.2,0,32,0 118,0,3.89,15.96,Absent,65,20.18,0,16,0 108,0,1.43,26.26,Absent,42,19.38,0,16,0 136,0,4,19.06,Absent,40,21.94,2.06,16,0 120,0,2.46,13.39,Absent,47,22.01,0.51,18,0 132,0,3.55,8.66,Present,61,18.5,3.87,16,0 136,0,1.77,20.37,Absent,45,21.51,2.06,16,0 138,0,1.86,18.35,Present,59,25.38,6.51,17,0 138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0 130,1.22,3.3,13.65,Absent,50,21.4,3.81,31,0 130,4,2.4,17.42,Absent,60,22.05,0,40,0 110,0,7.14,28.28,Absent,57,29,0,32,0 120,0,3.98,13.19,Present,47,21.89,0,16,0 166,6,8.8,37.89,Absent,39,28.7,43.2,52,0 
134,0.57,4.75,23.07,Absent,67,26.33,0,37,0 142,3,3.69,25.1,Absent,60,30.08,38.88,27,0 136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0 142,0,4.32,25.22,Absent,47,28.92,6.53,34,1 130,0,1.88,12.51,Present,52,20.28,0,17,0 124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0 144,4,5.03,25.78,Present,57,27.55,90,48,1 136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0 120,0,2.77,13.35,Absent,67,23.37,1.03,18,0 154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0 124,1.6,7.22,39.68,Present,36,31.5,0,51,1 146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1 128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1 170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0 214,0.4,5.98,31.72,Absent,64,28.45,0,58,0 182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1 108,3,1.59,15.23,Absent,40,20.09,26.64,55,0 118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0 132,0,4.82,33.41,Present,62,14.7,0,46,1 ================================================ FILE: 2017/examples/data/heart.txt ================================================ "sbp" "tobacco" "ldl" "adiposity" "famhist" "typea" "obesity" "alcohol" "age" "chd" 160 12 5.73 23.11 "Present" 49 25.3 97.2 52 1 144 0.01 4.41 28.61 "Absent" 55 28.87 2.06 63 1 118 0.08 3.48 32.28 "Present" 52 29.14 3.81 46 0 170 7.5 6.41 38.03 "Present" 51 31.99 24.26 58 1 134 13.6 3.5 27.78 "Present" 60 25.99 57.34 49 1 132 6.2 6.47 36.21 "Present" 62 30.77 14.14 45 0 142 4.05 3.38 16.2 "Absent" 59 20.81 2.62 38 0 114 4.08 4.59 14.6 "Present" 62 23.11 6.72 58 1 114 0 3.83 19.4 "Present" 49 24.86 2.49 29 0 132 0 5.8 30.96 "Present" 69 30.11 0 53 1 206 6 2.95 32.27 "Absent" 72 26.81 56.06 60 1 134 14.1 4.44 22.39 "Present" 65 23.09 0 40 1 118 0 1.88 10.05 "Absent" 59 21.57 0 17 0 132 0 1.87 17.21 "Absent" 49 23.63 0.97 15 0 112 9.65 2.29 17.2 "Present" 54 23.53 0.68 53 0 117 1.53 2.44 28.95 "Present" 35 25.89 30.03 46 0 120 7.5 15.33 22 "Absent" 60 25.31 34.49 49 0 146 10.5 8.29 35.36 "Present" 78 32.73 13.89 53 1 158 2.6 7.46 34.07 "Present" 61 29.3 53.28 62 1 124 14 6.23 35.96 "Present" 45 30.09 0 59 1 106 1.61 
1.74 12.32 "Absent" 74 20.92 13.37 20 1 132 7.9 2.85 26.5 "Present" 51 26.16 25.71 44 0 150 0.3 6.38 33.99 "Present" 62 24.64 0 50 0 138 0.6 3.81 28.66 "Absent" 54 28.7 1.46 58 0 142 18.2 4.34 24.38 "Absent" 61 26.19 0 50 0 124 4 12.42 31.29 "Present" 54 23.23 2.06 42 1 118 6 9.65 33.91 "Absent" 60 38.8 0 48 0 145 9.1 5.24 27.55 "Absent" 59 20.96 21.6 61 1 144 4.09 5.55 31.4 "Present" 60 29.43 5.55 56 0 146 0 6.62 25.69 "Absent" 60 28.07 8.23 63 1 136 2.52 3.95 25.63 "Absent" 51 21.86 0 45 1 158 1.02 6.33 23.88 "Absent" 66 22.13 24.99 46 1 122 6.6 5.58 35.95 "Present" 53 28.07 12.55 59 1 126 8.75 6.53 34.02 "Absent" 49 30.25 0 41 1 148 5.5 7.1 25.31 "Absent" 56 29.84 3.6 48 0 122 4.26 4.44 13.04 "Absent" 57 19.49 48.99 28 1 140 3.9 7.32 25.05 "Absent" 47 27.36 36.77 32 0 110 4.64 4.55 30.46 "Absent" 48 30.9 15.22 46 0 130 0 2.82 19.63 "Present" 70 24.86 0 29 0 136 11.2 5.81 31.85 "Present" 75 27.68 22.94 58 1 118 0.28 5.8 33.7 "Present" 60 30.98 0 41 1 144 0.04 3.38 23.61 "Absent" 30 23.75 4.66 30 0 120 0 1.07 16.02 "Absent" 47 22.15 0 15 0 130 2.61 2.72 22.99 "Present" 51 26.29 13.37 51 1 114 0 2.99 9.74 "Absent" 54 46.58 0 17 0 128 4.65 3.31 22.74 "Absent" 62 22.95 0.51 48 0 162 7.4 8.55 24.65 "Present" 64 25.71 5.86 58 1 116 1.91 7.56 26.45 "Present" 52 30.01 3.6 33 1 114 0 1.94 11.02 "Absent" 54 20.17 38.98 16 0 126 3.8 3.88 31.79 "Absent" 57 30.53 0 30 0 122 0 5.75 30.9 "Present" 46 29.01 4.11 42 0 134 2.5 3.66 30.9 "Absent" 52 27.19 23.66 49 0 152 0.9 9.12 30.23 "Absent" 56 28.64 0.37 42 1 134 8.08 1.55 17.5 "Present" 56 22.65 66.65 31 1 156 3 1.82 27.55 "Absent" 60 23.91 54 53 0 152 5.99 7.99 32.48 "Absent" 45 26.57 100.32 48 0 118 0 2.99 16.17 "Absent" 49 23.83 3.22 28 0 126 5.1 2.96 26.5 "Absent" 55 25.52 12.34 38 1 103 0.03 4.21 18.96 "Absent" 48 22.94 2.62 18 0 121 0.8 5.29 18.95 "Present" 47 22.51 0 61 0 142 0.28 1.8 21.03 "Absent" 57 23.65 2.93 33 0 138 1.15 5.09 27.87 "Present" 61 25.65 2.34 44 0 152 10.1 4.71 24.65 "Present" 65 26.21 24.53 57 0 140 
0.45 4.3 24.33 "Absent" 41 27.23 10.08 38 0 130 0 1.82 10.45 "Absent" 57 22.07 2.06 17 0 136 7.36 2.19 28.11 "Present" 61 25 61.71 54 0 124 4.82 3.24 21.1 "Present" 48 28.49 8.42 30 0 112 0.41 1.88 10.29 "Absent" 39 22.08 20.98 27 0 118 4.46 7.27 29.13 "Present" 48 29.01 11.11 33 0 122 0 3.37 16.1 "Absent" 67 21.06 0 32 1 118 0 3.67 12.13 "Absent" 51 19.15 0.6 15 0 130 1.72 2.66 10.38 "Absent" 68 17.81 11.1 26 0 130 5.6 3.37 24.8 "Absent" 58 25.76 43.2 36 0 126 0.09 5.03 13.27 "Present" 50 17.75 4.63 20 0 128 0.4 6.17 26.35 "Absent" 64 27.86 11.11 34 0 136 0 4.12 17.42 "Absent" 52 21.66 12.86 40 0 134 0 5.9 30.84 "Absent" 49 29.16 0 55 0 140 0.6 5.56 33.39 "Present" 58 27.19 0 55 1 168 4.5 6.68 28.47 "Absent" 43 24.25 24.38 56 1 108 0.4 5.91 22.92 "Present" 57 25.72 72 39 0 114 3 7.04 22.64 "Present" 55 22.59 0 45 1 140 8.14 4.93 42.49 "Absent" 53 45.72 6.43 53 1 148 4.8 6.09 36.55 "Present" 63 25.44 0.88 55 1 148 12.2 3.79 34.15 "Absent" 57 26.38 14.4 57 1 128 0 2.43 13.15 "Present" 63 20.75 0 17 0 130 0.56 3.3 30.86 "Absent" 49 27.52 33.33 45 0 126 10.5 4.49 17.33 "Absent" 67 19.37 0 49 1 140 0 5.08 27.33 "Present" 41 27.83 1.25 38 0 126 0.9 5.64 17.78 "Present" 55 21.94 0 41 0 122 0.72 4.04 32.38 "Absent" 34 28.34 0 55 0 116 1.03 2.83 10.85 "Absent" 45 21.59 1.75 21 0 120 3.7 4.02 39.66 "Absent" 61 30.57 0 64 1 143 0.46 2.4 22.87 "Absent" 62 29.17 15.43 29 0 118 4 3.95 18.96 "Absent" 54 25.15 8.33 49 1 194 1.7 6.32 33.67 "Absent" 47 30.16 0.19 56 0 134 3 4.37 23.07 "Absent" 56 20.54 9.65 62 0 138 2.16 4.9 24.83 "Present" 39 26.06 28.29 29 0 136 0 5 27.58 "Present" 49 27.59 1.47 39 0 122 3.2 11.32 35.36 "Present" 55 27.07 0 51 1 164 12 3.91 19.59 "Absent" 51 23.44 19.75 39 0 136 8 7.85 23.81 "Present" 51 22.69 2.78 50 0 166 0.07 4.03 29.29 "Absent" 53 28.37 0 27 0 118 0 4.34 30.12 "Present" 52 32.18 3.91 46 0 128 0.42 4.6 26.68 "Absent" 41 30.97 10.33 31 0 118 1.5 5.38 25.84 "Absent" 64 28.63 3.89 29 0 158 3.6 2.97 30.11 "Absent" 63 26.64 108 64 0 108 1.5 4.33 
24.99 "Absent" 66 22.29 21.6 61 1 170 7.6 5.5 37.83 "Present" 42 37.41 6.17 54 1 118 1 5.76 22.1 "Absent" 62 23.48 7.71 42 0 124 0 3.04 17.33 "Absent" 49 22.04 0 18 0 114 0 8.01 21.64 "Absent" 66 25.51 2.49 16 0 168 9 8.53 24.48 "Present" 69 26.18 4.63 54 1 134 2 3.66 14.69 "Absent" 52 21.03 2.06 37 0 174 0 8.46 35.1 "Present" 35 25.27 0 61 1 116 31.2 3.17 14.99 "Absent" 47 19.4 49.06 59 1 128 0 10.58 31.81 "Present" 46 28.41 14.66 48 0 140 4.5 4.59 18.01 "Absent" 63 21.91 22.09 32 1 154 0.7 5.91 25 "Absent" 13 20.6 0 42 0 150 3.5 6.99 25.39 "Present" 50 23.35 23.48 61 1 130 0 3.92 25.55 "Absent" 68 28.02 0.68 27 0 128 2 6.13 21.31 "Absent" 66 22.86 11.83 60 0 120 1.4 6.25 20.47 "Absent" 60 25.85 8.51 28 0 120 0 5.01 26.13 "Absent" 64 26.21 12.24 33 0 138 4.5 2.85 30.11 "Absent" 55 24.78 24.89 56 1 153 7.8 3.96 25.73 "Absent" 54 25.91 27.03 45 0 123 8.6 11.17 35.28 "Present" 70 33.14 0 59 1 148 4.04 3.99 20.69 "Absent" 60 27.78 1.75 28 0 136 3.96 2.76 30.28 "Present" 50 34.42 18.51 38 0 134 8.8 7.41 26.84 "Absent" 35 29.44 29.52 60 1 152 12.18 4.04 37.83 "Present" 63 34.57 4.17 64 0 158 13.5 5.04 30.79 "Absent" 54 24.79 21.5 62 0 132 2 3.08 35.39 "Absent" 45 31.44 79.82 58 1 134 1.5 3.73 21.53 "Absent" 41 24.7 11.11 30 1 142 7.44 5.52 33.97 "Absent" 47 29.29 24.27 54 0 134 6 3.3 28.45 "Absent" 65 26.09 58.11 40 0 122 4.18 9.05 29.27 "Present" 44 24.05 19.34 52 1 116 2.7 3.69 13.52 "Absent" 55 21.13 18.51 32 0 128 0.5 3.7 12.81 "Present" 66 21.25 22.73 28 0 120 0 3.68 12.24 "Absent" 51 20.52 0.51 20 0 124 0 3.95 36.35 "Present" 59 32.83 9.59 54 0 160 14 5.9 37.12 "Absent" 58 33.87 3.52 54 1 130 2.78 4.89 9.39 "Present" 63 19.3 17.47 25 1 128 2.8 5.53 14.29 "Absent" 64 24.97 0.51 38 0 130 4.5 5.86 37.43 "Absent" 61 31.21 32.3 58 0 109 1.2 6.14 29.26 "Absent" 47 24.72 10.46 40 0 144 0 3.84 18.72 "Absent" 56 22.1 4.8 40 0 118 1.05 3.16 12.98 "Present" 46 22.09 16.35 31 0 136 3.46 6.38 32.25 "Present" 43 28.73 3.13 43 1 136 1.5 6.06 26.54 "Absent" 54 29.38 14.5 33 1 124 
15.5 5.05 24.06 "Absent" 46 23.22 0 61 1 148 6 6.49 26.47 "Absent" 48 24.7 0 55 0 128 6.6 3.58 20.71 "Absent" 55 24.15 0 52 0 122 0.28 4.19 19.97 "Absent" 61 25.63 0 24 0 108 0 2.74 11.17 "Absent" 53 22.61 0.95 20 0 124 3.04 4.8 19.52 "Present" 60 21.78 147.19 41 1 138 8.8 3.12 22.41 "Present" 63 23.33 120.03 55 1 127 0 2.81 15.7 "Absent" 42 22.03 1.03 17 0 174 9.45 5.13 35.54 "Absent" 55 30.71 59.79 53 0 122 0 3.05 23.51 "Absent" 46 25.81 0 38 0 144 6.75 5.45 29.81 "Absent" 53 25.62 26.23 43 1 126 1.8 6.22 19.71 "Absent" 65 24.81 0.69 31 0 208 27.4 3.12 26.63 "Absent" 66 27.45 33.07 62 1 138 0 2.68 17.04 "Absent" 42 22.16 0 16 0 148 0 3.84 17.26 "Absent" 70 20 0 21 0 122 0 3.08 16.3 "Absent" 43 22.13 0 16 0 132 7 3.2 23.26 "Absent" 77 23.64 23.14 49 0 110 12.16 4.99 28.56 "Absent" 44 27.14 21.6 55 1 160 1.52 8.12 29.3 "Present" 54 25.87 12.86 43 1 126 0.54 4.39 21.13 "Present" 45 25.99 0 25 0 162 5.3 7.95 33.58 "Present" 58 36.06 8.23 48 0 194 2.55 6.89 33.88 "Present" 69 29.33 0 41 0 118 0.75 2.58 20.25 "Absent" 59 24.46 0 32 0 124 0 4.79 34.71 "Absent" 49 26.09 9.26 47 0 160 0 2.42 34.46 "Absent" 48 29.83 1.03 61 0 128 0 2.51 29.35 "Present" 53 22.05 1.37 62 0 122 4 5.24 27.89 "Present" 45 26.52 0 61 1 132 2 2.7 21.57 "Present" 50 27.95 9.26 37 0 120 0 2.42 16.66 "Absent" 46 20.16 0 17 0 128 0.04 8.22 28.17 "Absent" 65 26.24 11.73 24 0 108 15 4.91 34.65 "Absent" 41 27.96 14.4 56 0 166 0 4.31 34.27 "Absent" 45 30.14 13.27 56 0 152 0 6.06 41.05 "Present" 51 40.34 0 51 0 170 4.2 4.67 35.45 "Present" 50 27.14 7.92 60 1 156 4 2.05 19.48 "Present" 50 21.48 27.77 39 1 116 8 6.73 28.81 "Present" 41 26.74 40.94 48 1 122 4.4 3.18 11.59 "Present" 59 21.94 0 33 1 150 20 6.4 35.04 "Absent" 53 28.88 8.33 63 0 129 2.15 5.17 27.57 "Absent" 52 25.42 2.06 39 0 134 4.8 6.58 29.89 "Present" 55 24.73 23.66 63 0 126 0 5.98 29.06 "Present" 56 25.39 11.52 64 1 142 0 3.72 25.68 "Absent" 48 24.37 5.25 40 1 128 0.7 4.9 37.42 "Present" 72 35.94 3.09 49 1 102 0.4 3.41 17.22 "Present" 56 
23.59 2.06 39 1 130 0 4.89 25.98 "Absent" 72 30.42 14.71 23 0 138 0.05 2.79 10.35 "Absent" 46 21.62 0 18 0 138 0 1.96 11.82 "Present" 54 22.01 8.13 21 0 128 0 3.09 20.57 "Absent" 54 25.63 0.51 17 0 162 2.92 3.63 31.33 "Absent" 62 31.59 18.51 42 0 160 3 9.19 26.47 "Present" 39 28.25 14.4 54 1 148 0 4.66 24.39 "Absent" 50 25.26 4.03 27 0 124 0.16 2.44 16.67 "Absent" 65 24.58 74.91 23 0 136 3.15 4.37 20.22 "Present" 59 25.12 47.16 31 1 134 2.75 5.51 26.17 "Absent" 57 29.87 8.33 33 0 128 0.73 3.97 23.52 "Absent" 54 23.81 19.2 64 0 122 3.2 3.59 22.49 "Present" 45 24.96 36.17 58 0 152 3 4.64 31.29 "Absent" 41 29.34 4.53 40 0 162 0 5.09 24.6 "Present" 64 26.71 3.81 18 0 124 4 6.65 30.84 "Present" 54 28.4 33.51 60 0 136 5.8 5.9 27.55 "Absent" 65 25.71 14.4 59 0 136 8.8 4.26 32.03 "Present" 52 31.44 34.35 60 0 134 0.05 8.03 27.95 "Absent" 48 26.88 0 60 0 122 1 5.88 34.81 "Present" 69 31.27 15.94 40 1 116 3 3.05 30.31 "Absent" 41 23.63 0.86 44 0 132 0 0.98 21.39 "Absent" 62 26.75 0 53 0 134 0 2.4 21.11 "Absent" 57 22.45 1.37 18 0 160 7.77 8.07 34.8 "Absent" 64 31.15 0 62 1 180 0.52 4.23 16.38 "Absent" 55 22.56 14.77 45 1 124 0.81 6.16 11.61 "Absent" 35 21.47 10.49 26 0 114 0 4.97 9.69 "Absent" 26 22.6 0 25 0 208 7.4 7.41 32.03 "Absent" 50 27.62 7.85 57 0 138 0 3.14 12 "Absent" 54 20.28 0 16 0 164 0.5 6.95 39.64 "Present" 47 41.76 3.81 46 1 144 2.4 8.13 35.61 "Absent" 46 27.38 13.37 60 0 136 7.5 7.39 28.04 "Present" 50 25.01 0 45 1 132 7.28 3.52 12.33 "Absent" 60 19.48 2.06 56 0 143 5.04 4.86 23.59 "Absent" 58 24.69 18.72 42 0 112 4.46 7.18 26.25 "Present" 69 27.29 0 32 1 134 10 3.79 34.72 "Absent" 42 28.33 28.8 52 1 138 2 5.11 31.4 "Present" 49 27.25 2.06 64 1 188 0 5.47 32.44 "Present" 71 28.99 7.41 50 1 110 2.35 3.36 26.72 "Present" 54 26.08 109.8 58 1 136 13.2 7.18 35.95 "Absent" 48 29.19 0 62 0 130 1.75 5.46 34.34 "Absent" 53 29.42 0 58 1 122 0 3.76 24.59 "Absent" 56 24.36 0 30 0 138 0 3.24 27.68 "Absent" 60 25.7 88.66 29 0 130 18 4.13 27.43 "Absent" 54 27.44 0 51 1 126 
5.5 3.78 34.15 "Absent" 55 28.85 3.18 61 0 176 5.76 4.89 26.1 "Present" 46 27.3 19.44 57 0 122 0 5.49 19.56 "Absent" 57 23.12 14.02 27 0 124 0 3.23 9.64 "Absent" 59 22.7 0 16 0 140 5.2 3.58 29.26 "Absent" 70 27.29 20.17 45 1 128 6 4.37 22.98 "Present" 50 26.01 0 47 0 190 4.18 5.05 24.83 "Absent" 45 26.09 82.85 41 0 144 0.76 10.53 35.66 "Absent" 63 34.35 0 55 1 126 4.6 7.4 31.99 "Present" 57 28.67 0.37 60 1 128 0 2.63 23.88 "Absent" 45 21.59 6.54 57 0 136 0.4 3.91 21.1 "Present" 63 22.3 0 56 1 158 4 4.18 28.61 "Present" 42 25.11 0 60 0 160 0.6 6.94 30.53 "Absent" 36 25.68 1.42 64 0 124 6 5.21 33.02 "Present" 64 29.37 7.61 58 1 158 6.17 8.12 30.75 "Absent" 46 27.84 92.62 48 0 128 0 6.34 11.87 "Absent" 57 23.14 0 17 0 166 3 3.82 26.75 "Absent" 45 20.86 0 63 1 146 7.5 7.21 25.93 "Present" 55 22.51 0.51 42 0 161 9 4.65 15.16 "Present" 58 23.76 43.2 46 0 164 13.02 6.26 29.38 "Present" 47 22.75 37.03 54 1 146 5.08 7.03 27.41 "Present" 63 36.46 24.48 37 1 142 4.48 3.57 19.75 "Present" 51 23.54 3.29 49 0 138 12 5.13 28.34 "Absent" 59 24.49 32.81 58 1 154 1.8 7.13 34.04 "Present" 52 35.51 39.36 44 0 118 0 2.39 12.13 "Absent" 49 18.46 0.26 17 1 124 0.61 2.69 17.15 "Present" 61 22.76 11.55 20 0 124 1.04 2.84 16.42 "Present" 46 20.17 0 61 0 136 5 4.19 23.99 "Present" 68 27.8 25.86 35 0 132 9.9 4.63 27.86 "Present" 46 23.39 0.51 52 1 118 0.12 1.96 20.31 "Absent" 37 20.01 2.42 18 0 118 0.12 4.16 9.37 "Absent" 57 19.61 0 17 0 134 12 4.96 29.79 "Absent" 53 24.86 8.23 57 0 114 0.1 3.95 15.89 "Present" 57 20.31 17.14 16 0 136 6.8 7.84 30.74 "Present" 58 26.2 23.66 45 1 130 0 4.16 39.43 "Present" 46 30.01 0 55 1 136 2.2 4.16 38.02 "Absent" 65 37.24 4.11 41 1 136 1.36 3.16 14.97 "Present" 56 24.98 7.3 24 0 154 4.2 5.59 25.02 "Absent" 58 25.02 1.54 43 0 108 0.8 2.47 17.53 "Absent" 47 22.18 0 55 1 136 8.8 4.69 36.07 "Present" 38 26.56 2.78 63 1 174 2.02 6.57 31.9 "Present" 50 28.75 11.83 64 1 124 4.25 8.22 30.77 "Absent" 56 25.8 0 43 0 114 0 2.63 9.69 "Absent" 45 17.89 0 16 0 118 0.12 
3.26 12.26 "Absent" 55 22.65 0 16 0 106 1.08 4.37 26.08 "Absent" 67 24.07 17.74 28 1 146 3.6 3.51 22.67 "Absent" 51 22.29 43.71 42 0 206 0 4.17 33.23 "Absent" 69 27.36 6.17 50 1 134 3 3.17 17.91 "Absent" 35 26.37 15.12 27 0 148 15 4.98 36.94 "Present" 72 31.83 66.27 41 1 126 0.21 3.95 15.11 "Absent" 61 22.17 2.42 17 0 134 0 3.69 13.92 "Absent" 43 27.66 0 19 0 134 0.02 2.8 18.84 "Absent" 45 24.82 0 17 0 123 0.05 4.61 13.69 "Absent" 51 23.23 2.78 16 0 112 0.6 5.28 25.71 "Absent" 55 27.02 27.77 38 1 112 0 1.71 15.96 "Absent" 42 22.03 3.5 16 0 101 0.48 7.26 13 "Absent" 50 19.82 5.19 16 0 150 0.18 4.14 14.4 "Absent" 53 23.43 7.71 44 0 170 2.6 7.22 28.69 "Present" 71 27.87 37.65 56 1 134 0 5.63 29.12 "Absent" 68 32.33 2.02 34 0 142 0 4.19 18.04 "Absent" 56 23.65 20.78 42 1 132 0.1 3.28 10.73 "Absent" 73 20.42 0 17 0 136 0 2.28 18.14 "Absent" 55 22.59 0 17 0 132 12 4.51 21.93 "Absent" 61 26.07 64.8 46 1 166 4.1 4 34.3 "Present" 32 29.51 8.23 53 0 138 0 3.96 24.7 "Present" 53 23.8 0 45 0 138 2.27 6.41 29.07 "Absent" 58 30.22 2.93 32 1 170 0 3.12 37.15 "Absent" 47 35.42 0 53 0 128 0 8.41 28.82 "Present" 60 26.86 0 59 1 136 1.2 2.78 7.12 "Absent" 52 22.51 3.41 27 0 128 0 3.22 26.55 "Present" 39 26.59 16.71 49 0 150 14.4 5.04 26.52 "Present" 60 28.84 0 45 0 132 8.4 3.57 13.68 "Absent" 42 18.75 15.43 59 1 142 2.4 2.55 23.89 "Absent" 54 26.09 59.14 37 0 130 0.05 2.44 28.25 "Present" 67 30.86 40.32 34 0 174 3.5 5.26 21.97 "Present" 36 22.04 8.33 59 1 114 9.6 2.51 29.18 "Absent" 49 25.67 40.63 46 0 162 1.5 2.46 19.39 "Present" 49 24.32 0 59 1 174 0 3.27 35.4 "Absent" 58 37.71 24.95 44 0 190 5.15 6.03 36.59 "Absent" 42 30.31 72 50 0 154 1.4 1.72 18.86 "Absent" 58 22.67 43.2 59 0 124 0 2.28 24.86 "Present" 50 22.24 8.26 38 0 114 1.2 3.98 14.9 "Absent" 49 23.79 25.82 26 0 168 11.4 5.08 26.66 "Present" 56 27.04 2.61 59 1 142 3.72 4.24 32.57 "Absent" 52 24.98 7.61 51 0 154 0 4.81 28.11 "Present" 56 25.67 75.77 59 0 146 4.36 4.31 18.44 "Present" 47 24.72 10.8 38 0 166 6 3.02 29.3 
"Absent" 35 24.38 38.06 61 0 140 8.6 3.9 32.16 "Present" 52 28.51 11.11 64 1 136 1.7 3.53 20.13 "Absent" 56 19.44 14.4 55 0 156 0 3.47 21.1 "Absent" 73 28.4 0 36 1 132 0 6.63 29.58 "Present" 37 29.41 2.57 62 0 128 0 2.98 12.59 "Absent" 65 20.74 2.06 19 0 106 5.6 3.2 12.3 "Absent" 49 20.29 0 39 0 144 0.4 4.64 30.09 "Absent" 30 27.39 0.74 55 0 154 0.31 2.33 16.48 "Absent" 33 24 11.83 17 0 126 3.1 2.01 32.97 "Present" 56 28.63 26.74 45 0 134 6.4 8.49 37.25 "Present" 56 28.94 10.49 51 1 152 19.45 4.22 29.81 "Absent" 28 23.95 0 59 1 146 1.35 6.39 34.21 "Absent" 51 26.43 0 59 1 162 6.94 4.55 33.36 "Present" 52 27.09 32.06 43 0 130 7.28 3.56 23.29 "Present" 20 26.8 51.87 58 1 138 6 7.24 37.05 "Absent" 38 28.69 0 59 0 148 0 5.32 26.71 "Present" 52 32.21 32.78 27 0 124 4.2 2.94 27.59 "Absent" 50 30.31 85.06 30 0 118 1.62 9.01 21.7 "Absent" 59 25.89 21.19 40 0 116 4.28 7.02 19.99 "Present" 68 23.31 0 52 1 162 6.3 5.73 22.61 "Present" 46 20.43 62.54 53 1 138 0.87 1.87 15.89 "Absent" 44 26.76 42.99 31 0 137 1.2 3.14 23.87 "Absent" 66 24.13 45 37 0 198 0.52 11.89 27.68 "Present" 48 28.4 78.99 26 1 154 4.5 4.75 23.52 "Present" 43 25.76 0 53 1 128 5.4 2.36 12.98 "Absent" 51 18.36 6.69 61 0 130 0.08 5.59 25.42 "Present" 50 24.98 6.27 43 1 162 5.6 4.24 22.53 "Absent" 29 22.91 5.66 60 0 120 10.5 2.7 29.87 "Present" 54 24.5 16.46 49 0 136 3.99 2.58 16.38 "Present" 53 22.41 27.67 36 0 176 1.2 8.28 36.16 "Present" 42 27.81 11.6 58 1 134 11.79 4.01 26.57 "Present" 38 21.79 38.88 61 1 122 1.7 5.28 32.23 "Present" 51 24.08 0 54 0 134 0.9 3.18 23.66 "Present" 52 23.26 27.36 58 1 134 0 2.43 22.24 "Absent" 52 26.49 41.66 24 0 136 6.6 6.08 32.74 "Absent" 64 33.28 2.72 49 0 132 4.05 5.15 26.51 "Present" 31 26.67 16.3 50 0 152 1.68 3.58 25.43 "Absent" 50 27.03 0 32 0 132 12.3 5.96 32.79 "Present" 57 30.12 21.5 62 1 124 0.4 3.67 25.76 "Absent" 43 28.08 20.57 34 0 140 4.2 2.91 28.83 "Present" 43 24.7 47.52 48 0 166 0.6 2.42 34.03 "Present" 53 26.96 54 60 0 156 3.02 5.35 25.72 "Present" 53 25.22 
28.11 52 1 132 0.72 4.37 19.54 "Absent" 48 26.11 49.37 28 0 150 0 4.99 27.73 "Absent" 57 30.92 8.33 24 0 134 0.12 3.4 21.18 "Present" 33 26.27 14.21 30 0 126 3.4 4.87 15.16 "Present" 65 22.01 11.11 38 0 148 0.5 5.97 32.88 "Absent" 54 29.27 6.43 42 0 148 8.2 7.75 34.46 "Present" 46 26.53 6.04 64 1 132 6 5.97 25.73 "Present" 66 24.18 145.29 41 0 128 1.6 5.41 29.3 "Absent" 68 29.38 23.97 32 0 128 5.16 4.9 31.35 "Present" 57 26.42 0 64 0 140 0 2.4 27.89 "Present" 70 30.74 144 29 0 126 0 5.29 27.64 "Absent" 25 27.62 2.06 45 0 114 3.6 4.16 22.58 "Absent" 60 24.49 65.31 31 0 118 1.25 4.69 31.58 "Present" 52 27.16 4.11 53 0 126 0.96 4.99 29.74 "Absent" 66 33.35 58.32 38 0 154 4.5 4.68 39.97 "Absent" 61 33.17 1.54 64 1 112 1.44 2.71 22.92 "Absent" 59 24.81 0 52 0 140 8 4.42 33.15 "Present" 47 32.77 66.86 44 0 140 1.68 11.41 29.54 "Present" 74 30.75 2.06 38 1 128 2.6 4.94 21.36 "Absent" 61 21.3 0 31 0 126 19.6 6.03 34.99 "Absent" 49 26.99 55.89 44 0 160 4.2 6.76 37.99 "Present" 61 32.91 3.09 54 1 144 0 4.17 29.63 "Present" 52 21.83 0 59 0 148 4.5 10.49 33.27 "Absent" 50 25.92 2.06 53 1 146 0 4.92 18.53 "Absent" 57 24.2 34.97 26 0 164 5.6 3.17 30.98 "Present" 44 25.99 43.2 53 1 130 0.54 3.63 22.03 "Present" 69 24.34 12.86 39 1 154 2.4 5.63 42.17 "Present" 59 35.07 12.86 50 1 178 0.95 4.75 21.06 "Absent" 49 23.74 24.69 61 0 180 3.57 3.57 36.1 "Absent" 36 26.7 19.95 64 0 134 12.5 2.73 39.35 "Absent" 48 35.58 0 48 0 142 0 3.54 16.64 "Absent" 58 25.97 8.36 27 0 162 7 7.67 34.34 "Present" 33 30.77 0 62 0 218 11.2 2.77 30.79 "Absent" 38 24.86 90.93 48 1 126 8.75 6.06 32.72 "Present" 33 27 62.43 55 1 126 0 3.57 26.01 "Absent" 61 26.3 7.97 47 0 134 6.1 4.77 26.08 "Absent" 47 23.82 1.03 49 0 132 0 4.17 36.57 "Absent" 57 30.61 18 49 0 178 5.5 3.79 23.92 "Present" 45 21.26 6.17 62 1 208 5.04 5.19 20.71 "Present" 52 25.12 24.27 58 1 160 1.15 10.19 39.71 "Absent" 31 31.65 20.52 57 0 116 2.38 5.67 29.01 "Present" 54 27.26 15.77 51 0 180 25.01 3.7 38.11 "Present" 57 30.54 0 61 1 200 19.2 
4.43 40.6 "Present" 55 32.04 36 60 1 112 4.2 3.58 27.14 "Absent" 52 26.83 2.06 40 0 120 0 3.1 26.97 "Absent" 41 24.8 0 16 0 178 20 9.78 33.55 "Absent" 37 27.29 2.88 62 1 166 0.8 5.63 36.21 "Absent" 50 34.72 28.8 60 0 164 8.2 14.16 36.85 "Absent" 52 28.5 17.02 55 1 216 0.92 2.66 19.85 "Present" 49 20.58 0.51 63 1 146 6.4 5.62 33.05 "Present" 57 31.03 0.74 46 0 134 1.1 3.54 20.41 "Present" 58 24.54 39.91 39 1 158 16 5.56 29.35 "Absent" 36 25.92 58.32 60 0 176 0 3.14 31.04 "Present" 45 30.18 4.63 45 0 132 2.8 4.79 20.47 "Present" 50 22.15 11.73 48 0 126 0 4.55 29.18 "Absent" 48 24.94 36 41 0 120 5.5 3.51 23.23 "Absent" 46 22.4 90.31 43 0 174 0 3.86 21.73 "Absent" 42 23.37 0 63 0 150 13.8 5.1 29.45 "Present" 52 27.92 77.76 55 1 176 6 3.98 17.2 "Present" 52 21.07 4.11 61 1 142 2.2 3.29 22.7 "Absent" 44 23.66 5.66 42 1 132 0 3.3 21.61 "Absent" 42 24.92 32.61 33 0 142 1.32 7.63 29.98 "Present" 57 31.16 72.93 33 0 146 1.16 2.28 34.53 "Absent" 50 28.71 45 49 0 132 7.2 3.65 17.16 "Present" 56 23.25 0 34 0 120 0 3.57 23.22 "Absent" 58 27.2 0 32 0 118 0 3.89 15.96 "Absent" 65 20.18 0 16 0 108 0 1.43 26.26 "Absent" 42 19.38 0 16 0 136 0 4 19.06 "Absent" 40 21.94 2.06 16 0 120 0 2.46 13.39 "Absent" 47 22.01 0.51 18 0 132 0 3.55 8.66 "Present" 61 18.5 3.87 16 0 136 0 1.77 20.37 "Absent" 45 21.51 2.06 16 0 138 0 1.86 18.35 "Present" 59 25.38 6.51 17 0 138 0.06 4.15 20.66 "Absent" 49 22.59 2.49 16 0 130 1.22 3.3 13.65 "Absent" 50 21.4 3.81 31 0 130 4 2.4 17.42 "Absent" 60 22.05 0 40 0 110 0 7.14 28.28 "Absent" 57 29 0 32 0 120 0 3.98 13.19 "Present" 47 21.89 0 16 0 166 6 8.8 37.89 "Absent" 39 28.7 43.2 52 0 134 0.57 4.75 23.07 "Absent" 67 26.33 0 37 0 142 3 3.69 25.1 "Absent" 60 30.08 38.88 27 0 136 2.8 2.53 9.28 "Present" 61 20.7 4.55 25 0 142 0 4.32 25.22 "Absent" 47 28.92 6.53 34 1 130 0 1.88 12.51 "Present" 52 20.28 0 17 0 124 1.8 3.74 16.64 "Present" 42 22.26 10.49 20 0 144 4 5.03 25.78 "Present" 57 27.55 90 48 1 136 1.81 3.31 6.74 "Absent" 63 19.57 24.94 24 0 120 0 2.77 13.35 
"Absent" 67 23.37 1.03 18 0 154 5.53 3.2 28.81 "Present" 61 26.15 42.79 42 0 124 1.6 7.22 39.68 "Present" 36 31.5 0 51 1 146 0.64 4.82 28.02 "Absent" 60 28.11 8.23 39 1 128 2.24 2.83 26.48 "Absent" 48 23.96 47.42 27 1 170 0.4 4.11 42.06 "Present" 56 33.1 2.06 57 0 214 0.4 5.98 31.72 "Absent" 64 28.45 0 58 0 182 4.2 4.41 32.1 "Absent" 52 28.61 18.72 52 1 108 3 1.59 15.23 "Absent" 40 20.09 26.64 55 0 118 5.4 11.61 30.79 "Absent" 64 27.35 23.97 40 0 132 0 4.82 33.41 "Present" 62 14.7 0 46 1 ================================================ FILE: 2017/examples/deepdream/deepdream_exercise.py ================================================ """DeepDream. """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import os.path import zipfile import numpy as np import PIL.Image import tensorflow as tf FLAGS = tf.app.flags.FLAGS tf.app.flags.DEFINE_string('data_dir', '/tmp/inception/', 'Directory for storing Inception network.') tf.app.flags.DEFINE_string('jpeg_file', 'output.jpg', 'Where to save the resulting JPEG.') def get_layer(layer): """Helper for getting layer output Tensor in model Graph. Args: layer: string, layer name Returns: Tensor for that layer. """ graph = tf.get_default_graph() return graph.get_tensor_by_name('import/%s:0' % layer) def maybe_download(data_dir): """Maybe download pretrained Inception network. Args: data_dir: string, path to data """ url = ('https://storage.googleapis.com/download.tensorflow.org/models/' 'inception5h.zip') basename = 'inception5h.zip' local_file = tf.contrib.learn.python.learn.datasets.base.maybe_download( basename, data_dir, url) # Uncompress the pretrained Inception network. print('Extracting', local_file) zip_ref = zipfile.ZipFile(local_file, 'r') zip_ref.extractall(FLAGS.data_dir) zip_ref.close() def normalize_image(image): """Stretch the range and prepare the image for saving as a JPEG. 
Args: image: numpy array Returns: numpy array of image in uint8 """ # Clip to [0, 1] and then convert to uint8. image = np.clip(image, 0, 1) image = np.uint8(image * 255) return image def save_jpeg(jpeg_file, image): pil_image = PIL.Image.fromarray(image) pil_image.save(jpeg_file) print('Saved to file: ', jpeg_file) def main(unused_argv): # Maybe download and uncompress pretrained Inception network. maybe_download(FLAGS.data_dir) model_fn = os.path.join(FLAGS.data_dir, 'tensorflow_inception_graph.pb') # Load the pretrained Inception model as a GraphDef. with tf.gfile.FastGFile(model_fn, 'rb') as f: graph_def = tf.GraphDef() graph_def.ParseFromString(f.read()) with tf.Graph().as_default(): # Input for the network. input_image = tf.placeholder(np.float32, name='input') pixel_mean = 117.0 input_preprocessed = tf.expand_dims(input_image - pixel_mean, 0) tf.import_graph_def(graph_def, {'input': input_preprocessed}) # Grab a list of the names of Tensor's that are the output of convolutions. graph = tf.get_default_graph() layers = [op.name for op in graph.get_operations() if op.type == 'Conv2D' and 'import/' in op.name] feature_nums = [int(graph.get_tensor_by_name(name+':0').get_shape()[-1]) for name in layers] # print('Layers available: %s' % ','.join(layers)) print('Number of layers', len(layers)) print('Number of features:', sum(feature_nums)) # Pick an internal layer and node to visualize. # Note that we use outputs before applying the ReLU nonlinearity to # have non-zero gradients for features with negative initial activations. layer = 'mixed4d_3x3_bottleneck_pre_relu' channel = 139 layer_channel = get_layer(layer)[:, :, :, channel] print('layer %s, channel %d: %s' % (layer, channel, layer_channel)) # Define the optimization as the average across all spatial locations. score = tf.reduce_mean(layer_channel) # Automatic differentiation with TensorFlow. Magic! input_gradient = tf.gradients(score, input_image)[0] # Employ random noise as a image. 
noise_image = np.random.uniform(size=(224, 224, 3)) + 100.0 image = noise_image.copy() ################################################################ # EXERCISE: Implement the Deep Dream algorithm here! ################################################################ # Save the image. stddev = 0.1 image = (image - image.mean()) / max(image.std(), 1e-4) * stddev + 0.5 image = normalize_image(image) save_jpeg(FLAGS.jpeg_file, image) if __name__ == '__main__': tf.app.run() ================================================ FILE: 2017/examples/deepdream/deepdream_solution.py ================================================ """DeepDream. """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import os.path import zipfile import sys sys.path.extend(['', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Users/shlens/Desktop/Neural-Art/homebrew/lib/python2.7/site-packages',
'/Users/shlens/Desktop/Neural-Art/homebrew/lib/python2.7/site-packages/gtk-2.0', '/Users/shlens/Desktop/Neural-Art/homebrew/lib/python2.7/site-packages/gtk-2.0']) import numpy as np import PIL.Image import tensorflow as tf FLAGS = tf.app.flags.FLAGS tf.app.flags.DEFINE_string('data_dir', '/tmp/inception/', 'Directory for storing Inception network.') tf.app.flags.DEFINE_string('jpeg_file', 'output.jpg', 'Where to save the resulting JPEG.') def get_layer(layer): """Helper for getting layer output Tensor in model Graph. Args: layer: string, layer name Returns: Tensor for that layer. """ graph = tf.get_default_graph() return graph.get_tensor_by_name('import/%s:0' % layer) def maybe_download(data_dir): """Maybe download pretrained Inception network. Args: data_dir: string, path to data """ url = ('https://storage.googleapis.com/download.tensorflow.org/models/' 'inception5h.zip') basename = 'inception5h.zip' local_file = tf.contrib.learn.python.learn.datasets.base.maybe_download( basename, data_dir, url) # Uncompress the pretrained Inception network. print('Extracting', local_file) zip_ref = zipfile.ZipFile(local_file, 'r') zip_ref.extractall(FLAGS.data_dir) zip_ref.close() def normalize_image(image): """Stretch the range and prepare the image for saving as a JPEG. Args: image: numpy array Returns: numpy array of image in uint8 """ # Clip to [0, 1] and then convert to uint8. image = np.clip(image, 0, 1) image = np.uint8(image * 255) return image def save_jpeg(jpeg_file, image): pil_image = PIL.Image.fromarray(image) pil_image.save(jpeg_file) print('Saved to file: ', jpeg_file) def main(unused_argv): # Maybe download and uncompress pretrained Inception network. maybe_download(FLAGS.data_dir) model_fn = os.path.join(FLAGS.data_dir, 'tensorflow_inception_graph.pb') # Load the pretrained Inception model as a GraphDef. 
with tf.gfile.FastGFile(model_fn, 'rb') as f: graph_def = tf.GraphDef() graph_def.ParseFromString(f.read()) with tf.Graph().as_default(): # Input for the network. input_image = tf.placeholder(np.float32, name='input') pixel_mean = 117.0 input_preprocessed = tf.expand_dims(input_image - pixel_mean, 0) tf.import_graph_def(graph_def, {'input': input_preprocessed}) # Grab a list of the names of Tensor's that are the output of convolutions. graph = tf.get_default_graph() layers = [op.name for op in graph.get_operations() if op.type == 'Conv2D' and 'import/' in op.name] feature_nums = [int(graph.get_tensor_by_name(name+':0').get_shape()[-1]) for name in layers] # print('Layers available: %s' % ','.join(layers)) print('Number of layers', len(layers)) print('Number of features:', sum(feature_nums)) # Pick an internal layer and node to visualize. # Note that we use outputs before applying the ReLU nonlinearity to # have non-zero gradients for features with negative initial activations. layer = 'mixed4d_3x3_bottleneck_pre_relu' channel = 139 layer_channel = get_layer(layer)[:, :, :, channel] print('layer %s, channel %d: %s' % (layer, channel, layer_channel)) # Define the optimization as the average across all spatial locations. score = tf.reduce_mean(layer_channel) # Automatic differentiation with TensorFlow. Magic! input_gradient = tf.gradients(score, input_image)[0] # Employ random noise as a image. 
noise_image = np.random.uniform(size=(224, 224, 3)) + 100.0 image = noise_image.copy() ################################################################ ### BEGIN SOLUTION ##### ################################################################ step_scale = 1.0 num_iter = 20 with tf.Session() as sess: for i in xrange(num_iter): image_gradient, score_value = sess.run([input_gradient, score], {input_image:image}) # Normalize the gradient, so the same step size should work image_gradient /= image_gradient.std() + 1e-8 image += image_gradient * step_scale print('At step = %d, score = %.3f' % (i, score_value)) # Save the image. stddev = 0.1 image = (image - image.mean()) / max(image.std(), 1e-4) * stddev + 0.5 image = normalize_image(image) save_jpeg(FLAGS.jpeg_file, image) ################################################################## ### END SOLUTION ##### ################################################################## if __name__ == '__main__': tf.app.run() ================================================ FILE: 2017/examples/kernels.py ================================================ import numpy as np import tensorflow as tf a = np.zeros([3, 3, 3, 3]) a[1, 1, :, :] = 0.25 a[0, 1, :, :] = 0.125 a[1, 0, :, :] = 0.125 a[2, 1, :, :] = 0.125 a[1, 2, :, :] = 0.125 a[0, 0, :, :] = 0.0625 a[0, 2, :, :] = 0.0625 a[2, 0, :, :] = 0.0625 a[2, 2, :, :] = 0.0625 BLUR_FILTER_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) # a[1, 1, :, :] = 0.25 # a[0, 1, :, :] = 0.125 # a[1, 0, :, :] = 0.125 # a[2, 1, :, :] = 0.125 # a[1, 2, :, :] = 0.125 # a[0, 0, :, :] = 0.0625 # a[0, 2, :, :] = 0.0625 # a[2, 0, :, :] = 0.0625 # a[2, 2, :, :] = 0.0625 a[1, 1, :, :] = 1.0 a[0, 1, :, :] = 1.0 a[1, 0, :, :] = 1.0 a[2, 1, :, :] = 1.0 a[1, 2, :, :] = 1.0 a[0, 0, :, :] = 1.0 a[0, 2, :, :] = 1.0 a[2, 0, :, :] = 1.0 a[2, 2, :, :] = 1.0 BLUR_FILTER = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 3, 3]) a[1, 1, :, :] = 5 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[2, 1, :, :] = -1 
a[1, 2, :, :] = -1 SHARPEN_FILTER_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) a[1, 1, :, :] = 5 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[2, 1, :, :] = -1 a[1, 2, :, :] = -1 SHARPEN_FILTER = tf.constant(a, dtype=tf.float32) # a = np.zeros([3, 3, 3, 3]) # a[:, :, :, :] = -1 # a[1, 1, :, :] = 8 # EDGE_FILTER_RGB = tf.constant(a, dtype=tf.float32) EDGE_FILTER_RGB = tf.constant([ [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]], [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ 8., 0., 0.], [ 0., 8., 0.], [ 0., 0., 8.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]], [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]] ]) a = np.zeros([3, 3, 1, 1]) # a[:, :, :, :] = -1 # a[1, 1, :, :] = 8 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[1, 2, :, :] = -1 a[2, 1, :, :] = -1 a[1, 1, :, :] = 4 EDGE_FILTER = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 3, 3]) a[0, :, :, :] = 1 a[0, 1, :, :] = 2 # originally 2 a[2, :, :, :] = -1 a[2, 1, :, :] = -2 TOP_SOBEL_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) a[0, :, :, :] = 1 a[0, 1, :, :] = 2 # originally 2 a[2, :, :, :] = -1 a[2, 1, :, :] = -2 TOP_SOBEL = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 3, 3]) a[0, 0, :, :] = -2 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[1, 1, :, :] = 1 a[1, 2, :, :] = 1 a[2, 1, :, :] = 1 a[2, 2, :, :] = 2 EMBOSS_FILTER_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) a[0, 0, :, :] = -2 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[1, 1, :, :] = 1 a[1, 2, :, :] = 1 a[2, 1, :, :] = 1 a[2, 2, :, :] = 2 EMBOSS_FILTER = tf.constant(a, dtype=tf.float32) ================================================ FILE: 2017/examples/process_data.py ================================================ from __future__ import absolute_import from __future__ import 
division from __future__ import print_function from collections import Counter import random import os import sys sys.path.append('..') import zipfile import numpy as np from six.moves import urllib import tensorflow as tf import utils # Parameters for downloading data DOWNLOAD_URL = 'http://mattmahoney.net/dc/' EXPECTED_BYTES = 31344016 DATA_FOLDER = 'data/' FILE_NAME = 'text8.zip' def download(file_name, expected_bytes): """ Download the dataset text8 if it's not already downloaded """ file_path = DATA_FOLDER + file_name if os.path.exists(file_path): print("Dataset ready") return file_path file_name, _ = urllib.request.urlretrieve(DOWNLOAD_URL + file_name, file_path) file_stat = os.stat(file_path) if file_stat.st_size == expected_bytes: print('Successfully downloaded the file', file_name) else: raise Exception('File ' + file_name + ' might be corrupted. You should try downloading it with a browser.') return file_path def read_data(file_path): """ Read data into a list of tokens There should be 17,005,207 tokens """ with zipfile.ZipFile(file_path) as f: words = tf.compat.as_str(f.read(f.namelist()[0])).split() # tf.compat.as_str() converts the input into the string return words def build_vocab(words, vocab_size): """ Build vocabulary of VOCAB_SIZE most frequent words """ dictionary = dict() count = [('UNK', -1)] count.extend(Counter(words).most_common(vocab_size - 1)) index = 0 utils.make_dir('processed') with open('processed/vocab_1000.tsv', "w") as f: for word, _ in count: dictionary[word] = index if index < 1000: f.write(word + "\n") index += 1 index_dictionary = dict(zip(dictionary.values(), dictionary.keys())) return dictionary, index_dictionary def convert_words_to_index(words, dictionary): """ Replace each word in the dataset with its index in the dictionary """ return [dictionary[word] if word in dictionary else 0 for word in words] def generate_sample(index_words, context_window_size): """ Form training pairs according to the skip-gram model. 
""" for index, center in enumerate(index_words): context = random.randint(1, context_window_size) # get a random target before the center word for target in index_words[max(0, index - context): index]: yield center, target # get a random target after the center wrod for target in index_words[index + 1: index + context + 1]: yield center, target def get_batch(iterator, batch_size): """ Group a numerical stream into batches and yield them as Numpy arrays. """ while True: center_batch = np.zeros(batch_size, dtype=np.int32) target_batch = np.zeros([batch_size, 1]) for index in range(batch_size): center_batch[index], target_batch[index] = next(iterator) yield center_batch, target_batch def process_data(vocab_size, batch_size, skip_window): file_path = download(FILE_NAME, EXPECTED_BYTES) words = read_data(file_path) dictionary, _ = build_vocab(words, vocab_size) index_words = convert_words_to_index(words, dictionary) del words # to save memory single_gen = generate_sample(index_words, skip_window) return get_batch(single_gen, batch_size) def get_index_vocab(vocab_size): file_path = download(FILE_NAME, EXPECTED_BYTES) words = read_data(file_path) return build_vocab(words, vocab_size) ================================================ FILE: 2017/examples/utils.py ================================================ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import tensorflow as tf def huber_loss(labels, predictions, delta=1.0): residual = tf.abs(predictions - labels) def f1(): return 0.5 * tf.square(residual) def f2(): return delta * residual - 0.5 * tf.square(delta) return tf.cond(residual < delta, f1, f2) def make_dir(path): """ Create a directory if there isn't one already. 
""" try: os.mkdir(path) except OSError: pass ================================================ FILE: 2017/setup/requirements.txt ================================================ tensorflow==1.2.1 scipy==0.19.1 scikit-learn==0.18.2 matplotlib==2.0.2 xlrd==1.0.0 ipdb==0.10.1 Pillow==4.2.1 lxml==3.8.0 ================================================ FILE: 2017/setup/setup_instruction.md ================================================ Tensorflow supports both Python 2.7 and Python 3.3+. Note that for Windows, TensorFlow supports only 64-bit Python 3.5. For this course, I will use Python 2.7. But you’re welcome to use either Python 2 or Python 3 for the assignments. The starter code, though, will be in Python 2.7 Google has a pretty detailed instruction on how to download and setup Tensorflow. You can follow it here: https://www.tensorflow.org/get_started/os_setup Unless your computer has GPU, you should install Tensorflow without GPU support. My recommendation is always set up Tensorflow using virtualenv. For the list of dependencies, please consult the file requirements.txt. This list will be updated as the course progresses. Below is a simpler instruction on how to install tensorflow for people using Mac OS. If you have any problem installing Tensorflow, feel free to post it on Piazza: piazza.com/stanford/winter2017/cs20si ## Install TensorFlow
### For Mac OS

If you get a “permission denied” error from any command, put “sudo” in front of that command.

You will need pip (or pip3 if you use Python 3) and virtualenv.

Step 1: set up pip and virtualenv
```bash
$ sudo easy_install pip
$ sudo easy_install --upgrade six
$ pip install virtualenv
```

Step 2: set up a project directory. You will do all work for this class in this directory.
```bash
$ mkdir [my project]
```

Step 3: set up a virtual environment for the project directory.
```bash
$ cd [my project]
$ virtualenv venv --distribute
```
These commands create a venv subdirectory in your project where everything is installed.

Step 4: activate the virtual environment
```bash
$ source venv/bin/activate
```

If you type:
```bash
$ pip freeze
```
nothing is shown, which means no package is installed in your virtual environment yet. You have to install every package you need. For the list of packages you need for this class, refer to requirements.txt.

Step 5: install TensorFlow and other dependencies
```bash
$ pip install tensorflow
$ pip freeze > requirements.txt
```

Step 6: to exit the virtual environment, use:
```bash
$ deactivate
```

If you want your virtual environment to inherit globally installed packages (not recommended), use:
```bash
$ virtualenv venv --distribute --system-site-packages
```

### For Ubuntu

### For Windows

### On the cloud
If you don't want to install TensorFlow, you can use TensorFlow over the web.

#### SageMath
You can use TensorFlow over the web at https://cloud.sagemath.com/. Simply click on the link, create an account (or log in with your GitHub), and create a TensorFlow project.

#### Jupyter
You can also use a Jupyter notebook to write TensorFlow programs.

# Possible setup problems
## Matplotlib
If you have a problem using Matplotlib in a virtual environment, here is a simple fix.
If you installed matplotlib using pip, there is a directory in your home directory called ~/.matplotlib.

Create a file ~/.matplotlib/matplotlibrc there and add the following line:

```backend: TkAgg```

Or you can simply add this after importing matplotlib:

```matplotlib.use("TkAgg")```

================================================
FILE: LICENSE
================================================
The MIT License (MIT)

Copyright (c) 2017 Huyen Nguyen (Chip Huyen)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Join the https://gitter.im/stanford-tensorflow-tutorials](https://badges.gitter.im/tflearn/tflearn.svg)](https://gitter.im/stanford-tensorflow-tutorials)

# stanford-tensorflow-tutorials
This repository contains code examples for the course CS 20: TensorFlow for Deep Learning Research.
It will be updated as the class progresses.
Detailed syllabus and lecture notes can be found [here](http://cs20.stanford.edu).
For this course, I use Python 3.6 and TensorFlow 1.4.1.

For the code and notes of the previous year's course, please see the folder 2017 and the website https://web.stanford.edu/class/cs20si/2017

For setup instructions and the list of dependencies, please see the setup folder of this repository.

================================================
FILE: assignments/01/q1.py
================================================
""" Simple exercises to get used to TensorFlow API
You should thoroughly test your code.
TensorFlow's official documentation should be your best friend here
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

sess = tf.InteractiveSession()
###############################################################################
# 1a: Create two random 0-d tensors x and y of any distribution.
# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.
# Hint: look up tf.cond()
# I do the first problem for you
###############################################################################

x = tf.random_uniform([])  # Empty array as shape creates a scalar.
y = tf.random_uniform([])
out = tf.cond(tf.greater(x, y), lambda: x + y, lambda: x - y)
print(sess.run(out))

###############################################################################
# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).
# Return x + y if x < y, x - y if x > y, 0 otherwise.
# Hint: Look up tf.case().
###############################################################################

# YOUR CODE

###############################################################################
# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]]
# and y as a tensor of zeros with the same shape as x.
# Return a boolean tensor that is True wherever x equals y element-wise.
# Hint: Look up tf.equal().
###############################################################################

# YOUR CODE

###############################################################################
# 1d: Create the tensor x of value
# [29.05088806, 27.61298943, 31.19073486, 29.35532951,
# 30.97266006, 26.67541885, 38.08450317, 20.74983215,
# 34.94445419, 34.45999146, 29.06485367, 36.01657104,
# 27.88236427, 20.56035233, 30.20379066, 29.51215172,
# 33.71149445, 28.59134293, 36.05556488, 28.66994858].
# Get the indices of elements in x whose values are greater than 30.
# Hint: Use tf.where().
# Then extract elements whose values are greater than 30.
# Hint: Use tf.gather().
###############################################################################

# YOUR CODE

###############################################################################
# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,
# 2, ..., 6
# Hint: Use tf.range() and tf.diag().
###############################################################################

# YOUR CODE

###############################################################################
# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.
# Calculate its determinant.
# Hint: Look at tf.matrix_determinant().
###############################################################################

# YOUR CODE

###############################################################################
# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].
# Return the unique elements in x
# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.
###############################################################################

# YOUR CODE

###############################################################################
# 1h: Create two tensors x and y of shape 300 from any normal distribution,
# as long as they are from the same distribution.
# Use tf.cond() to return: # - The mean squared error of (x - y) if the average of all elements in (x - y) # is negative, or # - The sum of absolute value of all elements in the tensor (x - y) otherwise. # Hint: see the Huber loss function in the lecture slides 3. ############################################################################### # YOUR CODE ================================================ FILE: assignments/01/q1_sol.py ================================================ """ Solution to simple exercises to get used to TensorFlow API You should thoroughly test your code. TensorFlow's official documentation should be your best friend here CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Created by Chip Huyen (chiphuyen@cs.stanford.edu) """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import tensorflow as tf sess = tf.InteractiveSession() ############################################################################### # 1a: Create two random 0-d tensors x and y of any distribution. # Create a TensorFlow object that returns x + y if x > y, and x - y otherwise. # Hint: look up tf.cond() # I do the first problem for you ############################################################################### x = tf.random_uniform([]) # Empty array as shape creates a scalar. y = tf.random_uniform([]) out = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y)) ############################################################################### # 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1). # Return x + y if x < y, x - y if x > y, 0 otherwise. # Hint: Look up tf.case(). 
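# Editorial sketch (not part of the original solution): a plain-Python
# cross-check for the three-way branching that 1b asks for. The helper name
# `case_reference` is hypothetical; it mirrors the two predicates and the
# default branch on ordinary floats, so its result can be compared by hand
# with the values of x, y and out obtained from a single sess.run call.
def case_reference(x, y):
    if x < y:
        return x + y  # first predicate: x < y
    if x > y:
        return x - y  # second predicate: x > y
    return 0.0        # default branch: x == y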
###############################################################################

x = tf.random_uniform([], -1, 1, dtype=tf.float32)
y = tf.random_uniform([], -1, 1, dtype=tf.float32)
out = tf.case({tf.less(x, y): lambda: tf.add(x, y),
               tf.greater(x, y): lambda: tf.subtract(x, y)},
              default=lambda: tf.constant(0.0), exclusive=True)

###############################################################################
# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]]
# and y as a tensor of zeros with the same shape as x.
# Return a boolean tensor that is True wherever x equals y element-wise.
# Hint: Look up tf.equal().
###############################################################################

x = tf.constant([[0, -2, -1], [0, 1, 2]])
y = tf.zeros_like(x)
out = tf.equal(x, y)

###############################################################################
# 1d: Create the tensor x of value
# [29.05088806, 27.61298943, 31.19073486, 29.35532951,
# 30.97266006, 26.67541885, 38.08450317, 20.74983215,
# 34.94445419, 34.45999146, 29.06485367, 36.01657104,
# 27.88236427, 20.56035233, 30.20379066, 29.51215172,
# 33.71149445, 28.59134293, 36.05556488, 28.66994858].
# Get the indices of elements in x whose values are greater than 30.
# Hint: Use tf.where().
# Then extract elements whose values are greater than 30.
# Hint: Use tf.gather().
###############################################################################

x = tf.constant([29.05088806, 27.61298943, 31.19073486, 29.35532951,
                 30.97266006, 26.67541885, 38.08450317, 20.74983215,
                 34.94445419, 34.45999146, 29.06485367, 36.01657104,
                 27.88236427, 20.56035233, 30.20379066, 29.51215172,
                 33.71149445, 28.59134293, 36.05556488, 28.66994858])
indices = tf.where(x > 30)
out = tf.gather(x, indices)

###############################################################################
# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,
# 2, ..., 6
# Hint: Use tf.range() and tf.diag().
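# Editorial sketch (not part of the original solution): a plain-Python picture
# of the matrix that 1e describes, useful for checking the output by eye.
# `diag_reference` is a hypothetical helper name; it places values[i] at
# position (i, i) and zeros everywhere else, using nested lists instead of
# tensors.
def diag_reference(values):
    n = len(values)
    return [[values[i] if i == j else 0 for j in range(n)] for i in range(n)]

# For 1e the diagonal would be 1, 2, ..., 6:
expected_1e = diag_reference(list(range(1, 7)))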
############################################################################### values = tf.range(1, 7) out = tf.diag(values) ############################################################################### # 1f: Create a random 2-d tensor of size 10 x 10 from any distribution. # Calculate its determinant. # Hint: Look at tf.matrix_determinant(). ############################################################################### m = tf.random_normal([10, 10], mean=10, stddev=1) out = tf.matrix_determinant(m) ############################################################################### # 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9]. # Return the unique elements in x # Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple. ############################################################################### x = tf.constant([5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9]) unique_values, indices = tf.unique(x) ############################################################################### # 1h: Create two tensors x and y of shape 300 from any normal distribution, # as long as they are from the same distribution. # Use tf.cond() to return: # - The mean squared error of (x - y) if the average of all elements in (x - y) # is negative, or # - The sum of absolute value of all elements in the tensor (x - y) otherwise. # Hint: see the Huber loss function in the lecture slides 3. 
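# Editorial sketch (not part of the original solution): the branch logic of 1h
# on plain Python lists, assuming x and y have the same length.
# `cond_reference` is a hypothetical helper name. It returns the mean squared
# error of (x - y) when the average difference is negative, and the sum of
# absolute differences otherwise, which is the rule stated above.
def cond_reference(x, y):
    diffs = [a - b for a, b in zip(x, y)]
    average = sum(diffs) / len(diffs)
    if average < 0:
        return sum(d * d for d in diffs) / len(diffs)  # mean squared error
    return sum(abs(d) for d in diffs)                  # sum of absolute values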
############################################################################### x = tf.random_normal([300], mean=5, stddev=1) y = tf.random_normal([300], mean=5, stddev=1) average = tf.reduce_mean(x - y) def f1(): return tf.reduce_mean(tf.square(x - y)) def f2(): return tf.reduce_sum(tf.abs(x - y)) out = tf.cond(average < 0, f1, f2) ================================================ FILE: assignments/02_style_transfer/load_vgg.py ================================================ """ Load VGGNet weights needed for the implementation in TensorFlow of the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) Created by Chip Huyen (chiphuyen@cs.stanford.edu) CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu For more details, please read the assignment handout: https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing """ import numpy as np import scipy.io import tensorflow as tf import utils # VGG-19 parameters file VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat' VGG_FILENAME = 'imagenet-vgg-verydeep-19.mat' EXPECTED_BYTES = 534904783 class VGG(object): def __init__(self, input_img): utils.download(VGG_DOWNLOAD_LINK, VGG_FILENAME, EXPECTED_BYTES) self.vgg_layers = scipy.io.loadmat(VGG_FILENAME)['layers'] self.input_img = input_img self.mean_pixels = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3)) def _weights(self, layer_idx, expected_layer_name): """ Return the weights and biases at layer_idx already trained by VGG """ W = self.vgg_layers[0][layer_idx][0][0][2][0][0] b = self.vgg_layers[0][layer_idx][0][0][2][0][1] layer_name = self.vgg_layers[0][layer_idx][0][0][0][0] assert layer_name == expected_layer_name return W, b.reshape(b.size) def conv2d_relu(self, prev_layer, layer_idx, layer_name): """ Create a convolution layer with RELU using the weights and biases extracted from the VGG model at 'layer_idx'. 
        You should use the function _weights() defined above to extract weights
        and biases. _weights() returns numpy arrays, so you have to convert them
        to TF tensors.

        Don't forget to apply relu to the output from the convolution.
        Inputs:
            prev_layer: the output tensor from the previous layer
            layer_idx: the index to current layer in vgg_layers
            layer_name: the string that is the name of the current layer.
                        It's used to specify variable_scope.
        Hint for choosing strides size:
            for small images, you probably don't want to skip any pixel
        """
        ###############################
        ## TO DO
        out = None
        ###############################
        setattr(self, layer_name, out)

    def avgpool(self, prev_layer, layer_name):
        """ Create the average pooling layer. The paper suggests that
        average pooling works better than max pooling.
        Input:
            prev_layer: the output tensor from the previous layer
            layer_name: the string that you want to name the layer.
                        It's used to specify variable_scope.
        Hint for choosing strides and ksize: choose what you feel appropriate
        """
        ###############################
        ## TO DO
        out = None
        ###############################
        setattr(self, layer_name, out)

    def load(self):
        self.conv2d_relu(self.input_img, 0, 'conv1_1')
        self.conv2d_relu(self.conv1_1, 2, 'conv1_2')
        self.avgpool(self.conv1_2, 'avgpool1')
        self.conv2d_relu(self.avgpool1, 5, 'conv2_1')
        self.conv2d_relu(self.conv2_1, 7, 'conv2_2')
        self.avgpool(self.conv2_2, 'avgpool2')
        self.conv2d_relu(self.avgpool2, 10, 'conv3_1')
        self.conv2d_relu(self.conv3_1, 12, 'conv3_2')
        self.conv2d_relu(self.conv3_2, 14, 'conv3_3')
        self.conv2d_relu(self.conv3_3, 16, 'conv3_4')
        self.avgpool(self.conv3_4, 'avgpool3')
        self.conv2d_relu(self.avgpool3, 19, 'conv4_1')
        self.conv2d_relu(self.conv4_1, 21, 'conv4_2')
        self.conv2d_relu(self.conv4_2, 23, 'conv4_3')
        self.conv2d_relu(self.conv4_3, 25, 'conv4_4')
        self.avgpool(self.conv4_4, 'avgpool4')
        self.conv2d_relu(self.avgpool4, 28, 'conv5_1')
        self.conv2d_relu(self.conv5_1, 30, 'conv5_2')
        self.conv2d_relu(self.conv5_2, 32,
'conv5_3') self.conv2d_relu(self.conv5_3, 34, 'conv5_4') self.avgpool(self.conv5_4, 'avgpool5') ================================================ FILE: assignments/02_style_transfer/load_vgg_sol.py ================================================ """ Load VGGNet weights needed for the implementation in TensorFlow of the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) Created by Chip Huyen (chiphuyen@cs.stanford.edu) CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu For more details, please read the assignment handout: """ import numpy as np import scipy.io import tensorflow as tf import utils # VGG-19 parameters file VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat' VGG_FILENAME = 'imagenet-vgg-verydeep-19.mat' EXPECTED_BYTES = 534904783 class VGG(object): def __init__(self, input_img): utils.download(VGG_DOWNLOAD_LINK, VGG_FILENAME, EXPECTED_BYTES) self.vgg_layers = scipy.io.loadmat(VGG_FILENAME)['layers'] self.input_img = input_img self.mean_pixels = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3)) def _weights(self, layer_idx, expected_layer_name): """ Return the weights and biases at layer_idx already trained by VGG """ W = self.vgg_layers[0][layer_idx][0][0][2][0][0] b = self.vgg_layers[0][layer_idx][0][0][2][0][1] layer_name = self.vgg_layers[0][layer_idx][0][0][0][0] assert layer_name == expected_layer_name return W, b.reshape(b.size) def conv2d_relu(self, prev_layer, layer_idx, layer_name): """ Return the Conv2D layer with RELU using the weights, biases from the VGG model at 'layer_idx'. Don't forget to apply relu to the output from the convolution. Inputs: prev_layer: the output tensor from the previous layer layer_idx: the index to current layer in vgg_layers layer_name: the string that is the name of the current layer. It's used to specify variable_scope. 
        Note that you first need to obtain W and b from the corresponding VGG
        layer using the function _weights() defined above.
        W and b returned from _weights() are numpy arrays, so you have
        to convert them to TF tensors. One way to do it is with tf.constant.

        Hint for choosing strides size:
            for small images, you probably don't want to skip any pixel
        """
        ###############################
        ## TO DO
        with tf.variable_scope(layer_name) as scope:
            W, b = self._weights(layer_idx, layer_name)
            W = tf.constant(W, name='weights')
            b = tf.constant(b, name='bias')
            conv2d = tf.nn.conv2d(prev_layer,
                                  filter=W,
                                  strides=[1, 1, 1, 1],
                                  padding='SAME')
            out = tf.nn.relu(conv2d + b)
        ###############################
        setattr(self, layer_name, out)

    def avgpool(self, prev_layer, layer_name):
        """ Return the average pooling layer. The paper suggests that
        average pooling works better than max pooling.
        Input:
            prev_layer: the output tensor from the previous layer
            layer_name: the string that you want to name the layer.
                        It's used to specify variable_scope.
        Hint for choosing strides and ksize: choose what you feel appropriate
        """
        ###############################
        ## TO DO
        with tf.variable_scope(layer_name):
            out = tf.nn.avg_pool(prev_layer,
                                 ksize=[1, 2, 2, 1],
                                 strides=[1, 2, 2, 1],
                                 padding='SAME')
        ###############################
        setattr(self, layer_name, out)

    def load(self):
        self.conv2d_relu(self.input_img, 0, 'conv1_1')
        self.conv2d_relu(self.conv1_1, 2, 'conv1_2')
        self.avgpool(self.conv1_2, 'avgpool1')
        self.conv2d_relu(self.avgpool1, 5, 'conv2_1')
        self.conv2d_relu(self.conv2_1, 7, 'conv2_2')
        self.avgpool(self.conv2_2, 'avgpool2')
        self.conv2d_relu(self.avgpool2, 10, 'conv3_1')
        self.conv2d_relu(self.conv3_1, 12, 'conv3_2')
        self.conv2d_relu(self.conv3_2, 14, 'conv3_3')
        self.conv2d_relu(self.conv3_3, 16, 'conv3_4')
        self.avgpool(self.conv3_4, 'avgpool3')
        self.conv2d_relu(self.avgpool3, 19, 'conv4_1')
        self.conv2d_relu(self.conv4_1, 21, 'conv4_2')
        self.conv2d_relu(self.conv4_2, 23, 'conv4_3')
        self.conv2d_relu(self.conv4_3, 25, 'conv4_4')
        self.avgpool(self.conv4_4, 'avgpool4')
        self.conv2d_relu(self.avgpool4, 28, 'conv5_1')
        self.conv2d_relu(self.conv5_1, 30, 'conv5_2')
        self.conv2d_relu(self.conv5_2, 32, 'conv5_3')
        self.conv2d_relu(self.conv5_3, 34, 'conv5_4')
        self.avgpool(self.conv5_4, 'avgpool5')

================================================
FILE: assignments/02_style_transfer/style_transfer.py
================================================
""" Implementation in TensorFlow of the paper
A Neural Algorithm of Artistic Style (Gatys et al., 2016)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

For more details, please read the assignment handout:
https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import tensorflow as tf

import load_vgg
import utils

def setup():
    utils.safe_mkdir('checkpoints')
    utils.safe_mkdir('outputs')

class
StyleTransfer(object): def __init__(self, content_img, style_img, img_width, img_height): ''' img_width and img_height are the dimensions we expect from the generated image. We will resize input content image and input style image to match this dimension. Feel free to alter any hyperparameter here and see how it affects your training. ''' self.img_width = img_width self.img_height = img_height self.content_img = utils.get_resized_image(content_img, img_width, img_height) self.style_img = utils.get_resized_image(style_img, img_width, img_height) self.initial_img = utils.generate_noise_image(self.content_img, img_width, img_height) ############################### ## TO DO ## create global step (gstep) and hyperparameters for the model self.content_layer = 'conv4_2' self.style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1'] # content_w, style_w: corresponding weights for content loss and style loss self.content_w = None self.style_w = None # style_layer_w: weights for different style layers. deep layers have more weights self.style_layer_w = [0.5, 1.0, 1.5, 3.0, 4.0] self.gstep = None # global step self.lr = None ############################### def create_input(self): ''' We will use one input_img as a placeholder for the content image, style image, and generated image, because: 1. they have the same dimension 2. we have to extract the same set of features from them We use a variable instead of a placeholder because we're, at the same time, training the generated image to get the desirable result. Note: image height corresponds to number of rows, not columns. ''' with tf.variable_scope('input') as scope: self.input_img = tf.get_variable('in_img', shape=([1, self.img_height, self.img_width, 3]), dtype=tf.float32, initializer=tf.zeros_initializer()) def load_vgg(self): ''' Load the saved model parameters of VGG-19, using the input_img as the input to compute the output at each layer of vgg. 
During training, VGG-19 mean-centered all images and found the mean pixels to be [123.68, 116.779, 103.939] along RGB dimensions. We have to subtract this mean from our images. ''' self.vgg = load_vgg.VGG(self.input_img) self.vgg.load() self.content_img -= self.vgg.mean_pixels self.style_img -= self.vgg.mean_pixels def _content_loss(self, P, F): ''' Calculate the loss between the feature representation of the content image and the generated image. Inputs: P: content representation of the content image F: content representation of the generated image Read the assignment handout for more details Note: Don't use the coefficient 0.5 as defined in the paper. Use the coefficient defined in the assignment handout. ''' ############################### ## TO DO self.content_loss = None ############################### def _gram_matrix(self, F, N, M): """ Create and return the gram matrix for tensor F Hint: you'll first have to reshape F """ ############################### ## TO DO return None ############################### def _single_style_loss(self, a, g): """ Calculate the style loss at a certain layer Inputs: a is the feature representation of the style image at that layer g is the feature representation of the generated image at that layer Output: the style loss at a certain layer (which is E_l in the paper) Hint: 1. you'll have to use the function _gram_matrix() 2. we'll use the same coefficient for style loss as in the paper 3. 
            a and g are feature representations, not gram matrices
        """
        ###############################
        ## TO DO
        return None
        ###############################

    def _style_loss(self, A):
        """ Calculate the total style loss as a weighted sum
        of style losses at all style layers
        Hint: you'll have to use _single_style_loss()
        """
        ###############################
        ## TO DO
        self.style_loss = None
        ###############################

    def losses(self):
        with tf.variable_scope('losses') as scope:
            with tf.Session() as sess:
                # assign content image to the input variable
                sess.run(self.input_img.assign(self.content_img))
                gen_img_content = getattr(self.vgg, self.content_layer)
                content_img_content = sess.run(gen_img_content)
            self._content_loss(content_img_content, gen_img_content)

            with tf.Session() as sess:
                sess.run(self.input_img.assign(self.style_img))
                style_layers = sess.run([getattr(self.vgg, layer) for layer in self.style_layers])
            self._style_loss(style_layers)

            ##########################################
            ## TO DO: create total loss.
            ## Hint: don't forget the weights for the content loss and style loss
            self.total_loss = None
            ##########################################

    def optimize(self):
        ###############################
        ## TO DO: create optimizer
        self.opt = None
        ###############################

    def create_summary(self):
        ###############################
        ## TO DO: create summaries for all the losses
        ## Hint: don't forget to merge them
        self.summary_op = None
        ###############################

    def build(self):
        self.create_input()
        self.load_vgg()
        self.losses()
        self.optimize()
        self.create_summary()

    def train(self, n_iters):
        skip_step = 1
        with tf.Session() as sess:

            ###############################
            ## TO DO:
            ## 1. initialize your variables
            ## 2. create writer to write your graph
            ###############################

            sess.run(self.input_img.assign(self.initial_img))

            ###############################
            ## TO DO:
            ## 1. create a saver object
            ## 2.
check if a checkpoint exists, restore the variables ############################## initial_step = self.gstep.eval() start_time = time.time() for index in range(initial_step, n_iters): if index >= 5 and index < 20: skip_step = 10 elif index >= 20: skip_step = 20 sess.run(self.opt) if (index + 1) % skip_step == 0: ############################### ## TO DO: obtain generated image, loss, and summary gen_image, total_loss, summary = None, None, None ############################### # add back the mean pixels we subtracted before gen_image = gen_image + self.vgg.mean_pixels writer.add_summary(summary, global_step=index) print('Step {}\n Sum: {:5.1f}'.format(index + 1, np.sum(gen_image))) print(' Loss: {:5.1f}'.format(total_loss)) print(' Took: {} seconds'.format(time.time() - start_time)) start_time = time.time() filename = 'outputs/%d.png' % (index) utils.save_image(filename, gen_image) if (index + 1) % 20 == 0: ############################### ## TO DO: save the variables into a checkpoint ############################### pass if __name__ == '__main__': setup() machine = StyleTransfer('content/deadpool.jpg', 'styles/guernica.jpg', 333, 250) machine.build() machine.train(300) ================================================ FILE: assignments/02_style_transfer/style_transfer_sol.py ================================================ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import numpy as np import tensorflow as tf import load_vgg_sol import utils def setup(): utils.safe_mkdir('checkpoints') utils.safe_mkdir('outputs') class StyleTransfer(object): def __init__(self, content_img, style_img, img_width, img_height): ''' img_width and img_height are the dimensions we expect from the generated image. We will resize input content image and input style image to match this dimension. Feel free to alter any hyperparameter here and see how it affects your training. 
''' self.img_width = img_width self.img_height = img_height self.content_img = utils.get_resized_image(content_img, img_width, img_height) self.style_img = utils.get_resized_image(style_img, img_width, img_height) self.initial_img = utils.generate_noise_image(self.content_img, img_width, img_height) ############################### ## TO DO ## create global step (gstep) and hyperparameters for the model self.content_layer = 'conv4_2' self.style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1'] self.content_w = 0.01 self.style_w = 1 self.style_layer_w = [0.5, 1.0, 1.5, 3.0, 4.0] self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') self.lr = 2.0 ############################### def create_input(self): ''' We will use one input_img as a placeholder for the content image, style image, and generated image, because: 1. they have the same dimension 2. we have to extract the same set of features from them We use a variable instead of a placeholder because we're, at the same time, training the generated image to get the desirable result. Note: image height corresponds to number of rows, not columns. ''' with tf.variable_scope('input') as scope: self.input_img = tf.get_variable('in_img', shape=([1, self.img_height, self.img_width, 3]), dtype=tf.float32, initializer=tf.zeros_initializer()) def load_vgg(self): ''' Load the saved model parameters of VGG-19, using the input_img as the input to compute the output at each layer of vgg. During training, VGG-19 mean-centered all images and found the mean pixels to be [123.68, 116.779, 103.939] along RGB dimensions. We have to subtract this mean from our images. ''' self.vgg = load_vgg_sol.VGG(self.input_img) self.vgg.load() self.content_img -= self.vgg.mean_pixels self.style_img -= self.vgg.mean_pixels def _content_loss(self, P, F): ''' Calculate the loss between the feature representation of the content image and the generated image. 
Inputs: P: content representation of the content image F: content representation of the generated image Read the assignment handout for more details Note: Don't use the coefficient 0.5 as defined in the paper. Use the coefficient defined in the assignment handout. ''' # self.content_loss = None ############################### ## TO DO self.content_loss = tf.reduce_sum((F - P) ** 2) / (4.0 * P.size) ############################### def _gram_matrix(self, F, N, M): """ Create and return the gram matrix for tensor F Hint: you'll first have to reshape F """ ############################### ## TO DO F = tf.reshape(F, (M, N)) return tf.matmul(tf.transpose(F), F) ############################### def _single_style_loss(self, a, g): """ Calculate the style loss at a certain layer Inputs: a is the feature representation of the style image at that layer g is the feature representation of the generated image at that layer Output: the style loss at a certain layer (which is E_l in the paper) Hint: 1. you'll have to use the function _gram_matrix() 2. we'll use the same coefficient for style loss as in the paper 3. 
a and g are feature representation, not gram matrices """ ############################### ## TO DO N = a.shape[3] # number of filters M = a.shape[1] * a.shape[2] # height times width of the feature map A = self._gram_matrix(a, N, M) G = self._gram_matrix(g, N, M) return tf.reduce_sum((G - A) ** 2 / ((2 * N * M) ** 2)) ############################### def _style_loss(self, A): """ Calculate the total style loss as a weighted sum of style losses at all style layers Hint: you'll have to use _single_style_loss() """ n_layers = len(A) E = [self._single_style_loss(A[i], getattr(self.vgg, self.style_layers[i])) for i in range(n_layers)] ############################### ## TO DO self.style_loss = sum([self.style_layer_w[i] * E[i] for i in range(n_layers)]) ############################### def losses(self): with tf.variable_scope('losses') as scope: with tf.Session() as sess: # assign content image to the input variable sess.run(self.input_img.assign(self.content_img)) gen_img_content = getattr(self.vgg, self.content_layer) content_img_content = sess.run(gen_img_content) self._content_loss(content_img_content, gen_img_content) with tf.Session() as sess: sess.run(self.input_img.assign(self.style_img)) style_layers = sess.run([getattr(self.vgg, layer) for layer in self.style_layers]) self._style_loss(style_layers) ########################################## ## TO DO: create total loss. 
## Hint: don't forget the weights for the content loss and style loss self.total_loss = self.content_w * self.content_loss + self.style_w * self.style_loss ########################################## def optimize(self): ############################### ## TO DO: create optimizer self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.total_loss, global_step=self.gstep) ############################### def create_summary(self): ############################### ## TO DO: create summaries for all the losses ## Hint: don't forget to merge them with tf.name_scope('summaries'): tf.summary.scalar('content loss', self.content_loss) tf.summary.scalar('style loss', self.style_loss) tf.summary.scalar('total loss', self.total_loss) self.summary_op = tf.summary.merge_all() ############################### def build(self): self.create_input() self.load_vgg() self.losses() self.optimize() self.create_summary() def train(self, n_iters): skip_step = 1 with tf.Session() as sess: ############################### ## TO DO: ## 1. initialize your variables ## 2. create writer to write your graph sess.run(tf.global_variables_initializer()) writer = tf.summary.FileWriter('graphs/style_stranfer', sess.graph) ############################### sess.run(self.input_img.assign(self.initial_img)) ############################### ## TO DO: ## 1. create a saver object ## 2. 
check if a checkpoint exists, restore the variables
            saver = tf.train.Saver()
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/style_transfer/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            ##############################

            initial_step = self.gstep.eval()

            start_time = time.time()
            for index in range(initial_step, n_iters):
                if index >= 5 and index < 20:
                    skip_step = 10
                elif index >= 20:
                    skip_step = 20

                sess.run(self.opt)
                if (index + 1) % skip_step == 0:
                    ###############################
                    ## TO DO: obtain generated image, loss, and summary
                    gen_image, total_loss, summary = sess.run([self.input_img,
                                                               self.total_loss,
                                                               self.summary_op])
                    ###############################

                    # add back the mean pixels we subtracted before
                    gen_image = gen_image + self.vgg.mean_pixels
                    writer.add_summary(summary, global_step=index)
                    print('Step {}\n Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))
                    print(' Loss: {:5.1f}'.format(total_loss))
                    print(' Took: {} seconds'.format(time.time() - start_time))
                    start_time = time.time()

                    filename = 'outputs/%d.png' % (index)
                    utils.save_image(filename, gen_image)

                    if (index + 1) % 20 == 0:
                        ###############################
                        ## TO DO: save the variables into a checkpoint
                        # the directory must match the one passed to get_checkpoint_state,
                        # otherwise saved checkpoints are never found on restart
                        saver.save(sess, 'checkpoints/style_transfer/style_transfer', index)
                        ###############################

if __name__ == '__main__':
    setup()
    machine = StyleTransfer('content/deadpool.jpg', 'styles/guernica.jpg', 333, 250)
    machine.build()
    machine.train(300)

================================================
FILE: assignments/02_style_transfer/utils.py
================================================
""" Utils needed for the implementation in TensorFlow
of the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

For more details, please read the assignment handout:
https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing """ import os from PIL import Image, ImageOps import numpy as np import scipy.misc from six.moves import urllib def download(download_link, file_name, expected_bytes): """ Download the pretrained VGG-19 model if it's not already downloaded """ if os.path.exists(file_name): print("VGG-19 pre-trained model is ready") return print("Downloading the VGG pre-trained model. This might take a while ...") file_name, _ = urllib.request.urlretrieve(download_link, file_name) file_stat = os.stat(file_name) if file_stat.st_size == expected_bytes: print('Successfully downloaded VGG-19 pre-trained model', file_name) else: raise Exception('File ' + file_name + ' might be corrupted. You should try downloading it with a browser.') def get_resized_image(img_path, width, height, save=True): image = Image.open(img_path) # PIL is column major so you have to swap the places of width and height image = ImageOps.fit(image, (width, height), Image.ANTIALIAS) if save: image_dirs = img_path.split('/') image_dirs[-1] = 'resized_' + image_dirs[-1] out_path = '/'.join(image_dirs) if not os.path.exists(out_path): image.save(out_path) image = np.asarray(image, np.float32) return np.expand_dims(image, 0) def generate_noise_image(content_image, width, height, noise_ratio=0.6): noise_image = np.random.uniform(-20, 20, (1, height, width, 3)).astype(np.float32) return noise_image * noise_ratio + content_image * (1 - noise_ratio) def save_image(path, image): image = image[0] image = np.clip(image, 0, 255).astype('uint8') scipy.misc.imsave(path, image) def safe_mkdir(path): """ Create a directory if there isn't one already. """ try: os.mkdir(path) except OSError: pass ================================================ FILE: assignments/chatbot/README.md ================================================ A neural chatbot using sequence to sequence model with attentional decoder. This is a fully functional chatbot. 
This is based on Google Translate Tensorflow model https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

The detailed assignment handout can be found at:
https://docs.google.com/document/d/1GJfn2B6EI8JueDiBwzTAdD34d6pC99BSt6vldOmUCPQ/edit?usp=sharing

Sample conversation

HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm chip
BOT ++++ please .
HUMAN ++++ please what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ do you like california?
BOT ++++ i ' m okay .
HUMAN ++++ i like california a lot
BOT ++++ of what ?
HUMAN ++++ i live here
BOT ++++ and what is it ?
HUMAN ++++ california
BOT ++++ i don ' t know what to say .
HUMAN ++++ i'm a student here
BOT ++++ and get it out of here .
HUMAN ++++ what do you like to do in your free time?
BOT ++++ and what about the gun ?
HUMAN ++++ do you have a gun?
BOT ++++ yes .
HUMAN ++++ you're dangerous!
BOT ++++ i ' m not .
HUMAN ++++ do you go hunting?
BOT ++++ i can do .
HUMAN ++++ what do you hunt?
BOT ++++ virgin .
HUMAN ++++ that's very bad
BOT ++++ and you in love ?
HUMAN ++++ no i'm not
BOT ++++ please - -
HUMAN ++++ okay, i am
BOT ++++ and you are good .
HUMAN ++++ yeah i am good too.
BOT ++++ good good to you .
HUMAN ++++ are you in love?
BOT ++++ i ' m ready .
HUMAN ++++ ready for what?
BOT ++++ for what ?
HUMAN ++++ for what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ tell me
BOT ++++ i can do it .
HUMAN ++++ okay, tell me what you're ready for
BOT ++++ i ' m doing you .
HUMAN ++++ no you're not
BOT ++++ please !
HUMAN ++++ this is a strange conversation
BOT ++++ i don ' t know what to say .
HUMAN ++++ are you a guy?
BOT ++++ of what ?
HUMAN ++++ are you a boy?
BOT ++++ i ' m fine .
HUMAN ++++ are you a man?
BOT ++++ i ' m a little .
HUMAN ++++ what's your name?
BOT ++++ chris -
HUMAN ++++ chris who?
BOT ++++ cornelius .
HUMAN ++++ so i guess you're a guy.
BOT ++++ i am .
HUMAN ++++ okay
BOT ++++ okay ?
HUMAN ++++ do you have a girlfriend?
BOT ++++ i ' m sure .
HUMAN ++++ what's her name?
BOT ++++ let ' s talk about something else .
See output_convo.txt for more sample conversations.
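
If you want to inspect the logged conversations programmatically, the turn format shown above can be read back into (human, bot) pairs. The following is only a minimal sketch: `parse_convo` is a hypothetical helper that is not part of this repository, and it assumes every turn sits on one line prefixed with `HUMAN ++++` or `BOT ++++`, with any other line (such as a separator row of `=` characters) ignored.

```python
def parse_convo(lines):
    """Group 'HUMAN ++++ ...' / 'BOT ++++ ...' lines into (human, bot) pairs.

    Hypothetical helper for illustration; lines with neither prefix are skipped.
    """
    human_prefix, bot_prefix = 'HUMAN ++++', 'BOT ++++'
    pairs, pending = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith(human_prefix):
            # remember the human turn until the bot's reply shows up
            pending = line[len(human_prefix):].strip()
        elif line.startswith(bot_prefix) and pending is not None:
            pairs.append((pending, line[len(bot_prefix):].strip()))
            pending = None
    return pairs


sample = [
    "HUMAN ++++ hi",
    "BOT ++++ hi . what ' s your name ?",
    "=============================================",
    "HUMAN ++++ where are you from?",
    "BOT ++++ california .",
]
pairs = parse_convo(sample)
for human, bot in pairs:
    print(human, '->', bot)
```

To run it on a real log, pass an open file instead of the sample list, for example `parse_convo(open('output_convo.txt'))`; the path depends on where your copy of the file lives.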

Usage

Step 1: create a data folder in your project directory, download the Cornell Movie-Dialogs Corpus from https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Unzip it.

Step 2: update config.py file
Change DATA_PATH to where you store your data.

Step 3: python3 data.py
This will do all the pre-processing for the Cornell dataset.

Step 4: python3 chatbot.py --mode [train/chat]
If mode is train, then you train the chatbot. By default, the model will restore the previously trained weights (if there is any) and continue training up on that. If you want to start training from scratch, please delete all the checkpoints in the checkpoints folder. If the mode is chat, you'll go into the interaction mode with the bot. By default, all the conversations you have with the chatbot will be written into the file output_convo.txt in the processed folder. If you run this chatbot, I kindly ask you to send me the output_convo.txt so that I can improve the chatbot. Thank you very much! ================================================ FILE: assignments/chatbot/chatbot.py ================================================ """ A neural chatbot using sequence to sequence model with attentional decoder. This is based on Google Translate Tensorflow model https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/ Sequence to sequence model by Cho et al.(2014) Created by Chip Huyen (chiphuyen@cs.stanford.edu) CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu This file contains the code to run the model. See README.md for instruction on how to run the starter code. """ import argparse import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import random import sys import time import numpy as np import tensorflow as tf from model import ChatBotModel import config import data def _get_random_bucket(train_buckets_scale): """ Get a random bucket from which to choose a training sample """ rand = random.random() return min([i for i in range(len(train_buckets_scale)) if train_buckets_scale[i] > rand]) def _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks): """ Assert that the encoder inputs, decoder inputs, and decoder masks are of the expected lengths """ if len(encoder_inputs) != encoder_size: raise ValueError("Encoder length must be equal to the one in bucket," " %d != %d." 
% (len(encoder_inputs), encoder_size)) if len(decoder_inputs) != decoder_size: raise ValueError("Decoder length must be equal to the one in bucket," " %d != %d." % (len(decoder_inputs), decoder_size)) if len(decoder_masks) != decoder_size: raise ValueError("Weights length must be equal to the one in bucket," " %d != %d." % (len(decoder_masks), decoder_size)) def run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, forward_only): """ Run one step in training. @forward_only: boolean value to decide whether a backward path should be created forward_only is set to True when you just want to evaluate on the test set, or when you want to the bot to be in chat mode. """ encoder_size, decoder_size = config.BUCKETS[bucket_id] _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks) # input feed: encoder inputs, decoder inputs, target_weights, as provided. input_feed = {} for step in range(encoder_size): input_feed[model.encoder_inputs[step].name] = encoder_inputs[step] for step in range(decoder_size): input_feed[model.decoder_inputs[step].name] = decoder_inputs[step] input_feed[model.decoder_masks[step].name] = decoder_masks[step] last_target = model.decoder_inputs[decoder_size].name input_feed[last_target] = np.zeros([model.batch_size], dtype=np.int32) # output feed: depends on whether we do a backward step or not. if not forward_only: output_feed = [model.train_ops[bucket_id], # update op that does SGD. model.gradient_norms[bucket_id], # gradient norm. model.losses[bucket_id]] # loss for this batch. else: output_feed = [model.losses[bucket_id]] # loss for this batch. for step in range(decoder_size): # output logits. output_feed.append(model.outputs[bucket_id][step]) outputs = sess.run(output_feed, input_feed) if not forward_only: return outputs[1], outputs[2], None # Gradient norm, loss, no outputs. else: return None, outputs[0], outputs[1:] # No gradient norm, loss, outputs. 
def _get_buckets(): """ Load the dataset into buckets based on their lengths. train_buckets_scale is the inverval that'll help us choose a random bucket later on. """ test_buckets = data.load_data('test_ids.enc', 'test_ids.dec') data_buckets = data.load_data('train_ids.enc', 'train_ids.dec') train_bucket_sizes = [len(data_buckets[b]) for b in range(len(config.BUCKETS))] print("Number of samples in each bucket:\n", train_bucket_sizes) train_total_size = sum(train_bucket_sizes) # list of increasing numbers from 0 to 1 that we'll use to select a bucket. train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size for i in range(len(train_bucket_sizes))] print("Bucket scale:\n", train_buckets_scale) return test_buckets, data_buckets, train_buckets_scale def _get_skip_step(iteration): """ How many steps should the model train before it saves all the weights. """ if iteration < 100: return 30 return 100 def _check_restore_parameters(sess, saver): """ Restore the previously trained parameters if there are any. """ ckpt = tf.train.get_checkpoint_state(os.path.dirname(config.CPT_PATH + '/checkpoint')) if ckpt and ckpt.model_checkpoint_path: print("Loading parameters for the Chatbot") saver.restore(sess, ckpt.model_checkpoint_path) else: print("Initializing fresh parameters for the Chatbot") def _eval_test_set(sess, model, test_buckets): """ Evaluate on the test set. 
""" for bucket_id in range(len(config.BUCKETS)): if len(test_buckets[bucket_id]) == 0: print(" Test: empty bucket %d" % (bucket_id)) continue start = time.time() encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(test_buckets[bucket_id], bucket_id, batch_size=config.BATCH_SIZE) _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, True) print('Test bucket {}: loss {}, time {}'.format(bucket_id, step_loss, time.time() - start)) def train(): """ Train the bot """ test_buckets, data_buckets, train_buckets_scale = _get_buckets() # in train mode, we need to create the backward path, so forwrad_only is False model = ChatBotModel(False, config.BATCH_SIZE) model.build_graph() saver = tf.train.Saver() with tf.Session() as sess: print('Running session') sess.run(tf.global_variables_initializer()) _check_restore_parameters(sess, saver) iteration = model.global_step.eval() total_loss = 0 while True: skip_step = _get_skip_step(iteration) bucket_id = _get_random_bucket(train_buckets_scale) encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(data_buckets[bucket_id], bucket_id, batch_size=config.BATCH_SIZE) start = time.time() _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, False) total_loss += step_loss iteration += 1 if iteration % skip_step == 0: print('Iter {}: loss {}, time {}'.format(iteration, total_loss/skip_step, time.time() - start)) start = time.time() total_loss = 0 saver.save(sess, os.path.join(config.CPT_PATH, 'chatbot'), global_step=model.global_step) if iteration % (10 * skip_step) == 0: # Run evals on development set and print their loss _eval_test_set(sess, model, test_buckets) start = time.time() sys.stdout.flush() def _get_user_input(): """ Get user's input, which will be transformed into encoder input later """ print("> ", end="") sys.stdout.flush() return sys.stdin.readline() def _find_right_bucket(length): """ Find the proper bucket for an 
encoder input based on its length """
    return min([b for b in range(len(config.BUCKETS))
                if config.BUCKETS[b][0] >= length])

def _construct_response(output_logits, inv_dec_vocab):
    """ Construct a response to the user's encoder input.
    @output_logits: the outputs from sequence to sequence wrapper.
    output_logits is decoder_size np array, each of dim 1 x DEC_VOCAB

    This is a greedy decoder - outputs are just argmaxes of output_logits.
    """
    print(output_logits[0])
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
    # If there is an EOS symbol in outputs, cut them at that point.
    if config.EOS_ID in outputs:
        outputs = outputs[:outputs.index(config.EOS_ID)]
    # Return the sentence corresponding to outputs.
    return " ".join([tf.compat.as_str(inv_dec_vocab[output]) for output in outputs])

def chat():
    """ in test mode, we don't need to create the backward path
    """
    _, enc_vocab = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.enc'))
    inv_dec_vocab, _ = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.dec'))

    model = ChatBotModel(True, batch_size=1)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)
        output_file = open(os.path.join(config.PROCESSED_PATH, config.OUTPUT_FILE), 'a+')
        # Decode from standard input.
        max_length = config.BUCKETS[-1][0]
        print('Welcome to TensorBro. Say something. Enter to exit. Max length is', max_length)
        while True:
            line = _get_user_input()
            if len(line) > 0 and line[-1] == '\n':
                line = line[:-1]
            if line == '':
                break
            output_file.write('HUMAN ++++ ' + line + '\n')
            # Get token-ids for the input sentence.
            token_ids = data.sentence2id(enc_vocab, str(line))
            if (len(token_ids) > max_length):
                print('Max length I can handle is:', max_length)
                # the top of the loop prompts for the next line; reading one here
                # would silently discard the user's next input
                continue
            # Which bucket does it belong to?
            bucket_id = _find_right_bucket(len(token_ids))
            # Get a 1-element batch to feed the sentence to the model.
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch([(token_ids, [])],
                                                                            bucket_id,
                                                                            batch_size=1)
            # Get output logits for the sentence.
            _, _, output_logits = run_step(sess, model, encoder_inputs, decoder_inputs,
                                           decoder_masks, bucket_id, True)
            response = _construct_response(output_logits, inv_dec_vocab)
            print(response)
            output_file.write('BOT ++++ ' + response + '\n')
        output_file.write('=============================================\n')
        output_file.close()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', choices={'train', 'chat'},
                        default='train', help="mode. if not specified, it's in the train mode")
    args = parser.parse_args()

    if not os.path.isdir(config.PROCESSED_PATH):
        data.prepare_raw_data()
        data.process_data()
    print('Data ready!')

    # create checkpoints folder if there isn't one already
    data.make_dir(config.CPT_PATH)

    if args.mode == 'train':
        train()
    elif args.mode == 'chat':
        chat()

if __name__ == '__main__':
    main()

================================================
FILE: assignments/chatbot/config.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

This file contains the hyperparameters for the model.

See README.md for instruction on how to run the starter code.
"""

# parameters for processing the dataset
DATA_PATH = 'data/cornell movie-dialogs corpus'
CONVO_FILE = 'movie_conversations.txt'
LINE_FILE = 'movie_lines.txt'
OUTPUT_FILE = 'output_convo.txt'
PROCESSED_PATH = 'processed'
CPT_PATH = 'checkpoints'

THRESHOLD = 2

PAD_ID = 0
UNK_ID = 1
START_ID = 2
EOS_ID = 3

TESTSET_SIZE = 25000

BUCKETS = [(19, 19), (28, 28), (33, 33), (40, 43), (50, 53), (60, 63)]

CONTRACTIONS = [("i ' m ", "i 'm "), ("' d ", "'d "), ("' s ", "'s "),
                ("don ' t ", "do n't "), ("didn ' t ", "did n't "), ("doesn ' t ", "does n't "),
                ("can ' t ", "ca n't "), ("shouldn ' t ", "should n't "), ("wouldn ' t ", "would n't "),
                ("' ve ", "'ve "), ("' re ", "'re "), ("in ' ", "in' ")]

NUM_LAYERS = 3
HIDDEN_SIZE = 256
BATCH_SIZE = 64

LR = 0.5
MAX_GRAD_NORM = 5.0

NUM_SAMPLES = 512

================================================
FILE: assignments/chatbot/data.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder.

This is based on Google Translate Tensorflow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

This file contains the code to do the pre-processing for the
Cornell Movie-Dialogs Corpus.

See readme.md for instruction on how to run the starter code.
"""
import os
import random
import re

import numpy as np

import config

def get_lines():
    id2line = {}
    file_path = os.path.join(config.DATA_PATH, config.LINE_FILE)
    print(config.LINE_FILE)
    with open(file_path, 'r', errors='ignore') as f:
        # lines = f.readlines()
        # for line in lines:
        i = 0
        try:
            for line in f:
                parts = line.split(' +++$+++ ')
                if len(parts) == 5:
                    if parts[4][-1] == '\n':
                        parts[4] = parts[4][:-1]
                    id2line[parts[0]] = parts[4]
                i += 1
        except UnicodeDecodeError:
            print(i, line)
    return id2line

def get_convos():
    """ Get conversations from the raw data """
    file_path = os.path.join(config.DATA_PATH, config.CONVO_FILE)
    convos = []
    with open(file_path, 'r') as f:
        for line in f.readlines():
            parts = line.split(' +++$+++ ')
            if len(parts) == 4:
                convo = []
                for line in parts[3][1:-2].split(', '):
                    convo.append(line[1:-1])
                convos.append(convo)

    return convos

def question_answers(id2line, convos):
    """ Divide the dataset into two sets: questions and answers. """
    questions, answers = [], []
    for convo in convos:
        for index, line in enumerate(convo[:-1]):
            questions.append(id2line[convo[index]])
            answers.append(id2line[convo[index + 1]])
    assert len(questions) == len(answers)
    return questions, answers

def prepare_dataset(questions, answers):
    # create path to store all the train & test encoder & decoder
    make_dir(config.PROCESSED_PATH)

    # random convos to create the test set
    # (kept as a set so the membership test in the loop below is O(1))
    test_ids = set(random.sample(range(len(questions)), config.TESTSET_SIZE))

    filenames = ['train.enc', 'train.dec', 'test.enc', 'test.dec']
    files = []
    for filename in filenames:
        files.append(open(os.path.join(config.PROCESSED_PATH, filename),'w'))

    for i in range(len(questions)):
        if i in test_ids:
            files[2].write(questions[i] + '\n')
            files[3].write(answers[i] + '\n')
        else:
            files[0].write(questions[i] + '\n')
            files[1].write(answers[i] + '\n')

    for file in files:
        file.close()

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def basic_tokenizer(line, normalize_digits=True):
    """ A basic tokenizer to tokenize text into tokens.
    Feel free to change this to suit your need. """
    line = re.sub('<u>', '', line)
    line = re.sub('</u>', '', line)
    line = re.sub('\[', '', line)
    line = re.sub('\]', '', line)
    words = []
    _WORD_SPLIT = re.compile("([.,!?\"'-<>:;)(])")
    _DIGIT_RE = re.compile(r"\d")
    for fragment in line.strip().lower().split():
        for token in re.split(_WORD_SPLIT, fragment):
            if not token:
                continue
            if normalize_digits:
                token = re.sub(_DIGIT_RE, '#', token)
            words.append(token)
    return words

def build_vocab(filename, normalize_digits=True):
    in_path = os.path.join(config.PROCESSED_PATH, filename)
    out_path = os.path.join(config.PROCESSED_PATH, 'vocab.{}'.format(filename[-3:]))

    vocab = {}
    with open(in_path, 'r') as f:
        for line in f.readlines():
            for token in basic_tokenizer(line):
                if not token in vocab:
                    vocab[token] = 0
                vocab[token] += 1

    sorted_vocab = sorted(vocab, key=vocab.get, reverse=True)
    with open(out_path, 'w') as f:
        # the first four entries line up with PAD_ID, UNK_ID, START_ID, EOS_ID in config.py
        f.write('<pad>' + '\n')
        f.write('<unk>' + '\n')
        f.write('<s>' + '\n')
        f.write('<\s>' + '\n')
        index = 4
        for word in sorted_vocab:
            if vocab[word] < config.THRESHOLD:
                break
            f.write(word + '\n')
            index += 1
    with open('config.py', 'a') as cf:
        if filename[-3:] == 'enc':
            cf.write('ENC_VOCAB = ' + str(index) + '\n')
        else:
            cf.write('DEC_VOCAB = ' + str(index) + '\n')

def load_vocab(vocab_path):
    with open(vocab_path, 'r') as f:
        words = f.read().splitlines()
    return words, {words[i]: i for i in range(len(words))}

def sentence2id(vocab, line):
    return [vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)]

def token2id(data, mode):
    """ Convert all the tokens in the data into their corresponding
    index in the vocabulary. """
    vocab_path = 'vocab.' + mode
    in_path = data + '.' + mode
    out_path = data + '_ids.' + mode

    _, vocab = load_vocab(os.path.join(config.PROCESSED_PATH, vocab_path))
    in_file = open(os.path.join(config.PROCESSED_PATH, in_path), 'r')
    out_file = open(os.path.join(config.PROCESSED_PATH, out_path), 'w')

    lines = in_file.read().splitlines()
    for line in lines:
        if mode == 'dec': # we only care about '<s>' and '<\s>' in the decoder
            ids = [vocab['<s>']]
        else:
            ids = []
        ids.extend(sentence2id(vocab, line))
        # ids.extend([vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)])
        if mode == 'dec':
            ids.append(vocab['<\s>'])
        out_file.write(' '.join(str(id_) for id_ in ids) + '\n')

def prepare_raw_data():
    print('Preparing raw data into train set and test set ...')
    id2line = get_lines()
    convos = get_convos()
    questions, answers = question_answers(id2line, convos)
    prepare_dataset(questions, answers)

def process_data():
    print('Preparing data to be model-ready ...')
    build_vocab('train.enc')
    build_vocab('train.dec')
    token2id('train', 'enc')
    token2id('train', 'dec')
    token2id('test', 'enc')
    token2id('test', 'dec')

def load_data(enc_filename, dec_filename, max_training_size=None):
    encode_file = open(os.path.join(config.PROCESSED_PATH, enc_filename), 'r')
    decode_file = open(os.path.join(config.PROCESSED_PATH, dec_filename), 'r')
    encode, decode = encode_file.readline(), decode_file.readline()
    data_buckets = [[] for _ in config.BUCKETS]
    i = 0
    while encode and decode:
        if (i + 1) % 10000 == 0:
            print("Bucketing conversation number", i)
        encode_ids = [int(id_) for id_ in encode.split()]
        decode_ids = [int(id_) for id_ in decode.split()]
        for bucket_id, (encode_max_size, decode_max_size) in enumerate(config.BUCKETS):
            if len(encode_ids) <= encode_max_size and len(decode_ids) <= decode_max_size:
                data_buckets[bucket_id].append([encode_ids, decode_ids])
                break
        encode, decode = encode_file.readline(), decode_file.readline()
        i += 1
    return data_buckets

def _pad_input(input_, size):
    return input_ + [config.PAD_ID] * (size - len(input_))

def _reshape_batch(inputs, size, batch_size):
    """ Create batch-major
inputs. Batch inputs are just re-indexed inputs """ batch_inputs = [] for length_id in range(size): batch_inputs.append(np.array([inputs[batch_id][length_id] for batch_id in range(batch_size)], dtype=np.int32)) return batch_inputs def get_batch(data_bucket, bucket_id, batch_size=1): """ Return one batch to feed into the model """ # only pad to the max length of the bucket encoder_size, decoder_size = config.BUCKETS[bucket_id] encoder_inputs, decoder_inputs = [], [] for _ in range(batch_size): encoder_input, decoder_input = random.choice(data_bucket) # pad both encoder and decoder, reverse the encoder encoder_inputs.append(list(reversed(_pad_input(encoder_input, encoder_size)))) decoder_inputs.append(_pad_input(decoder_input, decoder_size)) # now we create batch-major vectors from the data selected above. batch_encoder_inputs = _reshape_batch(encoder_inputs, encoder_size, batch_size) batch_decoder_inputs = _reshape_batch(decoder_inputs, decoder_size, batch_size) # create decoder_masks to be 0 for decoders that are padding. batch_masks = [] for length_id in range(decoder_size): batch_mask = np.ones(batch_size, dtype=np.float32) for batch_id in range(batch_size): # we set mask to 0 if the corresponding target is a PAD symbol. # the corresponding decoder is decoder_input shifted by 1 forward. if length_id < decoder_size - 1: target = decoder_inputs[batch_id][length_id + 1] if length_id == decoder_size - 1 or target == config.PAD_ID: batch_mask[batch_id] = 0.0 batch_masks.append(batch_mask) return batch_encoder_inputs, batch_decoder_inputs, batch_masks if __name__ == '__main__': prepare_raw_data() process_data() ================================================ FILE: assignments/chatbot/model.py ================================================ import time import numpy as np import tensorflow as tf import config class ChatBotModel: def __init__(self, forward_only, batch_size): """forward_only: if set, we do not construct the backward pass in the model. 
        """
        print('Initialize new model')
        self.fw_only = forward_only
        self.batch_size = batch_size

    def _create_placeholders(self):
        # Feeds for inputs. It's a list of placeholders
        print('Create placeholders')
        self.encoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='encoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][0])]
        self.decoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='decoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][1] + 1)]
        self.decoder_masks = [tf.placeholder(tf.float32, shape=[None], name='mask{}'.format(i))
                              for i in range(config.BUCKETS[-1][1] + 1)]

        # Our targets are decoder inputs shifted by one (to ignore the start-of-sequence symbol)
        self.targets = self.decoder_inputs[1:]

    def _inference(self):
        print('Create inference')
        # Default to the full softmax with no output projection, so that
        # _create_loss can always read both attributes.
        self.output_projection = None
        self.softmax_loss_function = None
        # If we use sampled softmax, we need an output projection.
        # Sampled softmax only makes sense if we sample less than vocabulary size.
        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:
            w = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])
            b = tf.get_variable('proj_b', [config.DEC_VOCAB])
            self.output_projection = (w, b)

            def sampled_loss(logits, labels):
                labels = tf.reshape(labels, [-1, 1])
                return tf.nn.sampled_softmax_loss(weights=tf.transpose(w),
                                                  biases=b,
                                                  inputs=logits,
                                                  labels=labels,
                                                  num_sampled=config.NUM_SAMPLES,
                                                  num_classes=config.DEC_VOCAB)
            self.softmax_loss_function = sampled_loss

        single_cell = tf.contrib.rnn.GRUCell(config.HIDDEN_SIZE)
        self.cell = tf.contrib.rnn.MultiRNNCell([single_cell for _ in range(config.NUM_LAYERS)])

    def _create_loss(self):
        print('Creating loss... \nIt might take a couple of minutes depending on how many buckets you have.')
        start = time.time()
        def _seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
            # legacy_seq2seq deep-copies the cell; make the copy a no-op so the
            # same cell object is reused.
            setattr(tf.contrib.rnn.GRUCell, '__deepcopy__', lambda self, _: self)
            setattr(tf.contrib.rnn.MultiRNNCell, '__deepcopy__', lambda self, _: self)
            return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
                    encoder_inputs, decoder_inputs, self.cell,
                    num_encoder_symbols=config.ENC_VOCAB,
                    num_decoder_symbols=config.DEC_VOCAB,
                    embedding_size=config.HIDDEN_SIZE,
                    output_projection=self.output_projection,
                    feed_previous=do_decode)

        if self.fw_only:
            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(
                                        self.encoder_inputs,
                                        self.decoder_inputs,
                                        self.targets,
                                        self.decoder_masks,
                                        config.BUCKETS,
                                        lambda x, y: _seq2seq_f(x, y, True),
                                        softmax_loss_function=self.softmax_loss_function)
            # If we use output projection, we need to project outputs for decoding.
            if self.output_projection:
                for bucket in range(len(config.BUCKETS)):
                    self.outputs[bucket] = [tf.matmul(output,
                                            self.output_projection[0]) + self.output_projection[1]
                                            for output in self.outputs[bucket]]
        else:
            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(
                                        self.encoder_inputs,
                                        self.decoder_inputs,
                                        self.targets,
                                        self.decoder_masks,
                                        config.BUCKETS,
                                        lambda x, y: _seq2seq_f(x, y, False),
                                        softmax_loss_function=self.softmax_loss_function)
        print('Time:', time.time() - start)

    def _create_optimizer(self):
        print('Create optimizer... \nIt might take a couple of minutes depending on how many buckets you have.')
        with tf.variable_scope('training') as scope:
            self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

            if not self.fw_only:
                self.optimizer = tf.train.GradientDescentOptimizer(config.LR)
                trainables = tf.trainable_variables()
                self.gradient_norms = []
                self.train_ops = []
                start = time.time()
                # one clipped-gradient train op per bucket
                for bucket in range(len(config.BUCKETS)):
                    clipped_grads, norm = tf.clip_by_global_norm(tf.gradients(self.losses[bucket],
                                                                              trainables),
                                                                 config.MAX_GRAD_NORM)
                    self.gradient_norms.append(norm)
                    self.train_ops.append(self.optimizer.apply_gradients(zip(clipped_grads, trainables),
                                                                         global_step=self.global_step))
                    print('Creating opt for bucket {} took {} seconds'.format(bucket, time.time() - start))
                    start = time.time()

    def _create_summary(self):
        pass

    def build_graph(self):
        self._create_placeholders()
        self._inference()
        self._create_loss()
        self._create_optimizer()
        self._create_summary()

================================================
FILE: assignments/chatbot/output_convo.txt
================================================
HUMAN ++++ hi BOT ++++ hi HUMAN ++++ how are you? BOT ++++ i ' m okay . HUMAN ++++ really? BOT ++++ i don ' t know what to say . HUMAN ++++ how're you? BOT ++++ i ' m fine . HUMAN ++++ where're you from? BOT ++++ california . HUMAN ++++ which city? BOT ++++ let ' s talk about something else . HUMAN ++++ you don't like california? BOT ++++ oh . . . HUMAN ++++ why not? BOT ++++ i can ' t . HUMAN ++++ how old are you? BOT ++++ eighteen . ============================================= HUMAN ++++ hi BOT ++++ hi HUMAN ++++ how are you? BOT ++++ i ' m okay . HUMAN ++++ what's your name? BOT ++++ let ' s talk about something else . ============================================= HUMAN ++++ hi BOT ++++ hi HUMAN ++++ how are you? BOT ++++ i ' m okay . HUMAN ++++ what's your name? BOT ++++ what ? HUMAN ++++ what's your name? BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name? BOT ++++ let ' s talk about something else . HUMAN ++++ what's your name? BOT ++++ let ' s talk about something else . HUMAN ++++ what's your name? BOT ++++ prisoner . HUMAN ++++ ha ha you really don't want to tell your name BOT ++++ i ' m not going to do it . HUMAN ++++ how old are you? BOT ++++ can you guess ? HUMAN ++++ eighteen? BOT ++++ in the bathroom . HUMAN ++++ in the bathroom? BOT ++++ in the next time . HUMAN ++++ why are you talking so funny? BOT ++++ i ' m looking . HUMAN ++++ where are you from? BOT ++++ general . HUMAN ++++ where are you from? BOT ++++ how ? HUMAN ++++ do you like the us? BOT ++++ i ' m sure . HUMAN ++++ do you vote? BOT ++++ i ' m afraid . HUMAN ++++ what are you afraid of? BOT ++++ i ' m looking . HUMAN ++++ what are you looking for? BOT ++++ can you guess ? HUMAN ++++ no BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ hi BOT ++++ hi . what ' s your name ? HUMAN ++++ i'm chip BOT ++++ please . HUMAN ++++ please what? BOT ++++ i don ' t know what to say . HUMAN ++++ where are you from? BOT ++++ california . HUMAN ++++ do you like california? BOT ++++ i ' m okay . HUMAN ++++ i like california a lot BOT ++++ of what ? HUMAN ++++ i live here BOT ++++ and what is it ? HUMAN ++++ california BOT ++++ i don ' t know what to say . HUMAN ++++ i'm a student here BOT ++++ and get it out of here . HUMAN ++++ what do you like to do in your free time? BOT ++++ and what about the gun ? HUMAN ++++ do you have a gun? BOT ++++ yes . HUMAN ++++ you're dangerous! BOT ++++ i ' m not . HUMAN ++++ do you go hunting? BOT ++++ i can do . HUMAN ++++ what do you hunt? BOT ++++ virgin . HUMAN ++++ that's very bad BOT ++++ and you in love ? HUMAN ++++ no i'm not BOT ++++ please - - HUMAN ++++ okay, i am BOT ++++ and you are good . HUMAN ++++ yeah i am good too. BOT ++++ good good to you . HUMAN ++++ are you in love? BOT ++++ i ' m ready . HUMAN ++++ ready for what? BOT ++++ for what ? 
HUMAN ++++ for what? BOT ++++ i don ' t know what to say . HUMAN ++++ tell me BOT ++++ i can do it . HUMAN ++++ okay, tell me what you're ready for BOT ++++ i ' m doing you . HUMAN ++++ no you're not BOT ++++ please ! HUMAN ++++ this is a strange conversation BOT ++++ i don ' t know what to say . HUMAN ++++ are you a guy? BOT ++++ of what ? HUMAN ++++ are you a boy? BOT ++++ i ' m fine . HUMAN ++++ are you a man? BOT ++++ i ' m a little . HUMAN ++++ what's your name? BOT ++++ chris - HUMAN ++++ chris who? BOT ++++ cornelius . HUMAN ++++ so i guess you're a guy. BOT ++++ i am . HUMAN ++++ okay BOT ++++ okay ? HUMAN ++++ do you have a girlfriend? BOT ++++ i ' m sure . HUMAN ++++ what's her name? BOT ++++ let ' s talk about something else . ============================================= HUMAN ++++ hi chris BOT ++++ hi HUMAN ++++ how're you today? BOT ++++ okay . HUMAN ++++ the weather is nice today BOT ++++ i don ' t know what to say . HUMAN ++++ just say the weather is nice BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ hi BOT ++++ hi HUMAN ++++ how're you today? BOT ++++ okay . HUMAN ++++ just okay? BOT ++++ i ' m okay . HUMAN ++++ what's on your mind? BOT ++++ i ' m in my room . HUMAN ++++ doing what? BOT ++++ i don ' t know what to say . HUMAN ++++ you're boring. BOT ++++ i ' m not . HUMAN ++++ prove it BOT ++++ and what ? HUMAN ++++ what what? BOT ++++ sorry , i ' m still very stupid . HUMAN ++++ it's okay. we're all stupid sometimes. BOT ++++ ' s good to see . HUMAN ++++ what's good to see? BOT ++++ i ' m in . HUMAN ++++ you're in your room? BOT ++++ i ' m here . HUMAN ++++ where's your room? BOT ++++ i ' m in my room . HUMAN ++++ are you alone? BOT ++++ i ' m fine . HUMAN ++++ are you by yourself? BOT ++++ in the morning . HUMAN ++++ what time is it? BOT ++++ # # # # # # # . ============================================= HUMAN ++++ hi BOT ++++ hi . what ' s your name ? HUMAN ++++ i'm marie BOT ++++ you are . 
HUMAN ++++ yes i am BOT ++++ and you are my friend . HUMAN ++++ yes you are BOT ++++ and you are my best . HUMAN ++++ that's so sweet BOT ++++ please . HUMAN ++++ do you know what time it is? BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ what time is it? BOT ++++ 5 9 6 9 1 4 5 . HUMAN ++++ it doesn't make any sense BOT ++++ are you ? HUMAN ++++ i do BOT ++++ and you are ? HUMAN ++++ i'm a human. BOT ++++ you in . HUMAN ++++ are you in trouble? BOT ++++ i ' m in . HUMAN ++++ what trouble? BOT ++++ i ' m in love . HUMAN ++++ wow that's nice. BOT ++++ like that ? HUMAN ++++ yeah BOT ++++ i don ' t know what to say . HUMAN ++++ who are you? BOT ++++ i don ' t understand . HUMAN ++++ who are you? BOT ++++ no idea . HUMAN ++++ what's your name? BOT ++++ chris - HUMAN ++++ chris who? BOT ++++ cornelius . HUMAN ++++ where are you from? BOT ++++ california . HUMAN ++++ are you american? BOT ++++ i ' m fine . HUMAN ++++ of course you're fine. BOT ++++ and you ? HUMAN ++++ i'm okay. BOT ++++ you are . HUMAN ++++ yes, i am. BOT ++++ and you are my friend . HUMAN ++++ yes, i am. BOT ++++ and you are my friend . HUMAN ++++ that's correct. BOT ++++ and you ' re in trouble ? HUMAN ++++ no, i'm not. why did you think so? BOT ++++ because it ' s a good idea . HUMAN ++++ what is your good idea? BOT ++++ i ' ll be fine . ============================================= HUMAN ++++ hello BOT ++++ and you are in trouble ? HUMAN ++++ what did you say that? BOT ++++ ' s life . HUMAN ++++ i'm not in trouble. BOT ++++ what the hell is that ? HUMAN ++++ okay, what trouble? BOT ++++ i ' m looking . HUMAN ++++ why did you say i'm in trouble? BOT ++++ let ' s talk about something else . HUMAN ++++ what do you want to talk about? BOT ++++ i can ' t . HUMAN ++++ what can't? BOT ++++ i ' ll have a drink . HUMAN ++++ okay, me too. BOT ++++ you were here ? HUMAN ++++ i'm still here. BOT ++++ what is it ? HUMAN ++++ i thought you wanted a drink. 
BOT ++++ i ' m not . HUMAN ++++ you just said so! BOT ++++ i was wrong . HUMAN ++++ it's fine. BOT ++++ okay ? HUMAN ++++ yeah BOT ++++ i don ' t know what to say . HUMAN ++++ do you drink a lot? BOT ++++ i can do it . HUMAN ++++ do you drink often? BOT ++++ in the time . HUMAN ++++ that doesn't make any sense. BOT ++++ are you ? HUMAN ++++ what do you drink? BOT ++++ ' s in a drink . HUMAN ++++ do you drink alcohol? BOT ++++ in time . HUMAN ++++ i like beer. BOT ++++ oh . . . HUMAN ++++ don't you like beer? BOT ++++ i don ' t know what to say . HUMAN ++++ where do you live? BOT ++++ up . HUMAN ++++ up where? BOT ++++ in the hospital . HUMAN ++++ i'm sorry. are you okay? BOT ++++ i ' m fine . HUMAN ++++ why are you in the hospital? BOT ++++ and what ? HUMAN ++++ how many words do you know? BOT ++++ i ' d do it . HUMAN ++++ how many? BOT ++++ i ' m not . HUMAN ++++ how many? BOT ++++ i ' m not . HUMAN ++++ how many? BOT ++++ i ' m not . HUMAN ++++ whatever BOT ++++ i don ' t know what to say . HUMAN ++++ you're boring. BOT ++++ i ' m not . HUMAN ++++ yes you are. BOT ++++ and you are my friend . HUMAN ++++ no i'm not. BOT ++++ please . HUMAN ++++ okay fine. i'm your friend. BOT ++++ you ' re my mother ? HUMAN ++++ i have no son like you. BOT ++++ i don ' t know what to say . HUMAN ++++ i have no son. BOT ++++ i see you . HUMAN ++++ i have no children. BOT ++++ you are . HUMAN ++++ omg you're so dumb. BOT ++++ i ' m not . HUMAN ++++ you're stupid! BOT ++++ i ' m not . HUMAN ++++ do you always talk like this? BOT ++++ of course . HUMAN ++++ how old are you? BOT ++++ eighteen . HUMAN ++++ you're big enough. BOT ++++ i ' m not . HUMAN ++++ are you an adult? BOT ++++ i ' m going . HUMAN ++++ you're going to be an adult? BOT ++++ i ' m looking for you . HUMAN ++++ i'm in my room. BOT ++++ and your mother ? HUMAN ++++ she's in vietnam. BOT ++++ you ' re here . HUMAN ++++ yes i'm here. BOT ++++ and what are you talking about ? HUMAN ++++ you asked me where i am. 
BOT ++++ and you are your friend . HUMAN ++++ i need to go now. BOT ++++ i will not . HUMAN ++++ that's fine. you can stay. i'm leaving. BOT ++++ oh . . . HUMAN ++++ bye BOT ++++ i don ' t know what to say . HUMAN ++++ goodbye BOT ++++ i don ' t know what to say . HUMAN ++++ bye bye BOT ++++ okay ? HUMAN ++++ see you later BOT ++++ i don ' t know what to say . ============================================= HUMAN ++++ hi BOT ++++ hi . what ' s your name ? HUMAN ++++ i'm maria BOT ++++ please . HUMAN ++++ i'm jackson BOT ++++ you are . HUMAN ++++ i'm jack BOT ++++ let ' s talk about something else . HUMAN ++++ i have no go BOT ++++ you are ? HUMAN ++++ i have to go BOT ++++ and what do you mean ? HUMAN ++++ i'm leaving. BOT ++++ what ? HUMAN ++++ yeah, bye BOT ++++ bye . have a nice day ! ============================================= HUMAN ++++ that's very bad. BOT ++++ i ' ve got to work . HUMAN ++++ what do you do? BOT ++++ i ' ll be there . HUMAN ++++ where do you work? BOT ++++ in the trunk . HUMAN ++++ are you a machenics? BOT ++++ i ' m not . HUMAN ++++ what are you? BOT ++++ no idea . ============================================= ================================================ FILE: assignments/trump_bot/trump_tweets.txt ================================================ 'State works hard and illegally for Clinton' #DrainTheSwamp __HTTP__ _E_ RT @IvankaTrump: Touched by the warm hospitality of Prime Minister Abe and the Japanese people. ありがとうございます [Thank you]! Until next time ... _E_ Since Congress can't get its act together on HealthCare I will be using the power of the pen to give great HealthCare to many people FAST _E_ I always said @BarackObama will attack Iran in some form prior to the election. _E_ Today I am working on my 'big surprise' for the @RNC convention. Everyone will love it. _E_ What a shock! The U.S. Capitol Christmas tree pays homage @BarackObama but failed to mention Jesus. _E_ Making America Safe is my number one priority. 
We will not admit those into our country we cannot safely vet. __HTTP__ _E_ Repubs must not allow Pres Obama to subvert the Constitution of the US for his own benefit & because he is unable to negotiate w/ Congress. _E_ Tell Iran to let our Christian Pastor go and I mean right now. If they don't there will be hell to pay. _E_ Man shot inside Paris police station. Just announced that terror threat is at highest level. Germany is a total mess big crime. GET SMART! _E_ Thank you! __HTTP__ _E_ I am now inspecting the Old Post Office on Pennsylvania Avenue will be a great hotel. Soon off to the Oklahoma State Fair! _E_ Just a few more days until the 13th season of All Star @CelebApprentice premieres. Be sure to tune in this Sunday at 9PM on @nbc. Big! _E_ Look forward to being in Phoenix tomorrow at 2:00 P.M. Hottest ticket in entire country. Was supposed to be 500 people now many thousands! _E_ The Al Frankenstien picture is really bad speaks a thousand words. Where do his hands go in pictures 2 3 4 5 & 6 while she sleeps? ..... _E_ Karl Rove's strategy and commercials were the worst I have ever seen. _E_ .@lindseygraham who had zero in his presidential run before dropping out in disgrace saying the most horrible things about me on @FoxNews. _E_ Great Concert at 4:00 P.M. today at Lincoln Memorial. Enjoy! _E_ "Donald Trump: Mitt Romney 'Blew It' Shouldn't Run Again" __HTTP__ via @Newsmax_Media by @OwenTew _E_ .@HillaryClinton talking about jobs? Remember what she promised upstate New York. #BigLeagueTruth#Debates __HTTP__ _E_ .@IsraeliPM @netanyahu is a resolute leader. When he sets a red line it stands! _E_ Thank you Idaho! I love your potatoes nobody grows them better. As President I will protect your market. __HTTP__ _E_ A bad thing finally happened to Derek Jeter he is a great champion. _E_ Obama told his donors this past week "public opinion" is on his side. Don't believe that one either. _E_ Join me this Friday in Pensacola Florida at the Pensacola Bay Center! 
Tickets: __HTTP__ __HTTP__ _E_ Wow record setting cold temperatures throughout large parts of the country. Must be global warming I mean climate change! _E_ ...Colin Powell thought Iraq has weapons of mass destruction. _E_ Review your work habits & make sure they are taking you in the right direction. Don't tread water get out there and go for it. _E_ President said we would never leave a soldier behind. How about the 4 who died in Benghazi? _E_ The point is: the Chinese are smart they respond to economic pressure and they know they're not going to get (cont) __HTTP__ _E_ Look forward to introducing Governor Mike Pence (who has done a spectacular job in the great State of Indiana). My first choice from start! _E_ If Stuart Stevens' book is as bad as his horrible political advice to Mitt Romney don't waste your money. Arrogant guy but a zero! _E_ Snowboarder/Skateboarder @Shaun_White stopped by to visit this week.... __HTTP__ _E_ #MakeAmericaGreatAgain #NYPrimary __HTTP__ _E_ #TBT With @britneyspears __HTTP__ _E_ Why is @RandPaul allowed to take advantage of the people of Kentucky by running for Senator and Pres. Why should Kentucky be back up plan? _E_ We are about to have a record $500B trade deficit with the Chinese this year. That money should be back here financing jobs in America. _E_ Together we are MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ Little @MacMiller I'm now going to teach you a big boy lesson about lawsuits and finance. You ungrateful dog! _E_ "The way to get started is to quit talking and begin doing." Walt Disney _E_ In the coming months and years ahead I look forward to building an even STRONGER relationship between the United States and China. __HTTP__ _E_ Who was it that secretly said to Russian President Tell Vladimir that after the election I'll have more flexibility? @foxandfriends _E_ I will be interviewed on @meetthepress this morning. Enjoy! 
_E_ Wacko pervert @AnthonyWeiner's idea of Hispanic outreach is using Carlos Danger as his sexting. He's an insensitive racist. _E_ Republicans should not negotiate against themselves again with @BarackObama in today's debt talks First and foremost CUTCAP and BALANCE. _E_ 'Small business optimism soars after Trump election' __HTTP__ _E_ We need a great leader now! __HTTP__ _E_ I am going to Trump National Doral in Miami this week to check out the $250 million renovation. In construction always watch the money! _E_ A vote for Hillary Clinton is a vote for another generation of poverty high crime & lost opportunities. #ImWithYou __HTTP__ _E_ .@MattGinellaGC Matt the statement about Pinehurst looking like a local community golf course awful was not made by me but tweeted to me _E_ Our hearts are with all affected by the wildfires in California. God bless our brave First Responders and @FEMA team. We support you! __HTTP__ _E_ Hopefully others will follow suit. Our country needs & should demand security. It is time to get tough & be smart! _E_ Heading to North Carolina for two big rallies. Will be there soon. We will bring jobs back where they belong! _E_ .@ThrillistChi named @SixteenChicago @TrumpChicago one of the "best value Michelin starred restaurants in Chicago" __HTTP__ _E_ President Obama wants to change the name of Mt. McKinley to Denali after more than 100 years. Great insult to Ohio. I will change back! _E_ Obama is looking rhetorical and weak. @MittRomney is looking strong and sharp. _E_ Great to see the construction of the Old Post Office on Penn Ave. Going fast under budget ahead of schedule! _E_ Thank you New York and Pennsylvania! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Joe McQuaid (@deucecrew) is desperately trying to sell the @UnionLeader. It's a loser and my comments haven't helped him much. _E_ We know who did the hoax of James Gandolfini and ObamaCare. Be careful Mister. 
_E_ A message of condolences and support regarding the terrorist attacks in Tel Aviv: __HTTP__ _E_ ...One point I made sure to stress at @LibertyU is to be sure to get even with anyone who crosses you... _E_ ObamaCare Story of the Day: "Florida Cancer Patient Loses Insurance During Treatment B/C of ObamaCare" __HTTP__ _E_ Now that Ken Frazier of Merck Pharma has resigned from President's Manufacturing Councilhe will have more time to LOWER RIPOFF DRUG PRICES! _E_ Wishing everyone a safe and Happy Halloween!#Halloween2017 __HTTP__ _E_ So many tweets & stories on Stewart/Pattinson Look it doesn't matter the relationship will never be the same. It is permanently broken. _E_ It is finally sinking through. 46% OF PEOPLE BELIEVE MAJOR NATIONAL NEWS ORGS FABRICATE STORIES ABOUT ME. FAKE NEWS even worse! Lost cred. _E_ Drew Peterson just got 36 years for killing his wife bring back the death penalty! _E_ Fake News CNN and NBC are going out of their way to disparage our great First Responders as a way to get Trump. Not fair to FR or effort! _E_ Don't forget to tune in tonight for the two hour premiere of The Apprentice. 9 pm EST on NBC. We're all in for a fantastic new season! _E_ #Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_ Join me this weekend! #NYPrimary4/16: SYRACUSE NOON __HTTP__ WATERTOWN 3pm __HTTP__ #Trump2016 _E_ I met some really great Air Force GENERALS and Navy ADMIRALS today talking about airplane capability and pricing. Very impressive people! _E_ Take your work seriously take yourself less seriously. It's a great recipe for some good times & great memories. _E_ Lightweight A.G. Eric Schneiderman who has been a total failure in office failed to report the 98% approval rating of students for courses _E_ Effective today my administration officially declared the #OpioidCrisis a NATIONAL PUBLIC HEALTH EMERGENCY under federal law. __HTTP__ _E_ Will be doing @foxandfriends live tomorrow at 7AM ET from Europe. _E_ RT @DonaldJTrumpJr: Happy new year everyone. 
#newyear #family #vacation #familytime __HTTP__ _E_ If our healthcare plan is approved you will see real healthcare and premiums will start tumbling down. ObamaCare is in a death spiral! _E_ Wishing a Happy Father's Day to all the Dad's out there YOU are a champion today and everyday! __HTTP__ _E_ Can you imagine with all of the talk about ObamaCare technical breakdowns made it a disastrous day. Our government is badly broken! _E_ Trump Nat'l Golf Club Philadelphia 360 beautiful acres as designed by Tom Fazio with views of the Philly skyline. __HTTP__ _E_ Via @IBDeditorials: "Most Americans Label Obama Presidency A Failure" __HTTP__ _E_ #WeeklyAddress __HTTP__ __HTTP__ _E_ After climbing a great hill one only finds that there are many more hills to climb. Nelson Mandela _E_ I'll be on @americanowradio with Andy Dean at 6:30 ET today talking about last night's @FoxNews debate. __HTTP__ _E_ Via @EveningExpress: Images of Donald Trump's 2nd North east golf course released: Public have say on images __HTTP__ _E_ Many people are saying it was wonderful that Mrs. Obama refused to wear a scarf in Saudi Arabia but they were insulted.We have enuf enemies _E_ Entrepreneurs: Identify your goals. Know precisely what you want to achieve. Have your own vision and stick with it! _E_ Experience is a hard teacher because she gives the test first the lesson afterwards. Vernon Sanders Law _E_ ...big unnecessary regulation cuts made it all possible" (among many other things). "President Trump reversed the policies of President Obama and reversed our economic decline." Thank you Stuart Varney. @foxandfriends _E_ Will be interviewed on @oreillyfactor tonight at 8:00 P.M. _E_ #ICYMI OHIO RALLY!Watch here: __HTTP__ __HTTP__ _E_ Dopey Sugar @Lord_Sugar Bad ratings come on keep making me money remember I own your show. _E_ China hacked the U.S. Chamber of Commerce and now has the information of all 3 million members. 
China keeps (cont) __HTTP__ _E_ Check out the recent Editorial in the Wall Street Journal @WSJ about what a complete disaster the @CFPB has been under its leader from previous Administration who just quit! _E_ "You have to set higher and higher goals. You have to want more or you will start slipping backwards fast." – Think BIG _E_ The only @Forbes Five Star & @fivediamond hotel in NYC @TrumpNewYork is the definition of luxury __HTTP__ The Best! _E_ This is really unfair and a conflict for all the other candidates. I said it should not be allowed and ABC agreed. _E_ Eliot better have a great pre nup—I want to help Silda in her negotiation. _E_ Congratulations to Gretchen Carlson on her big move to hosting an afternoon solo show this fall on @FoxNews. _E_ The Clinton News Network sometimes referred to as @CNN is getting more and more biased.They act so indignant hear them behind closed doors _E_ .@DonaldJTrumpJr and @EricTrump with @HulkHogan Great shot! __HTTP__ _E_ I predicted Apple's stock fall based on their dumb refusal to give the option of a larger iPhone screen like Samsung. I sold my Apple stock _E_ .@PapaJohns CEO John Schnatte has told shareholders that ObamaCare will force him to raise pizza prices __HTTP__ REPEAL! _E_ Important day spent at Camp David with our very talented Generals and military leaders. Many decisions made including on Afghanistan. _E_ The unforgivable crime is soft hitting. Do not hit at all if it can be avoided but never hit softly. Theodore Roosevelt _E_ Thank you Las Vegas Nevada!#Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_ The entire country is FREEZING we desperately need a heavy dose of global warming and fast! Ice caps size reaches all time high. _E_ My appearances on @todayshow __HTTP__ and @gma __HTTP__ _E_ Record cold temperatures in July 20 to 30 degrees colder than normal. What the hell happened to GLOBAL WARMING? _E_ The politicians of the U.K. should watch Katie Hopkins of Daily __HTTP__ on @FoxNews. 
Many people in the U.K. agree with me! _E_ Congrats to Miss Universe 2011 @RealLeilaLopes & @Giant great @OsiUmenyiora on their engagement! I am very happy for you both. _E_ Obama deserves much less credit for the killing of Bin Laden. The praise goes to our brave military and intelligence officers. _E_ Flashback – Jeb Bush says illegal immigrants breaking our laws is an "act of love" __HTTP__ He will never secure the border. _E_ Join @AmerIcan32 founded by Hall of Fame legend @JimBrownNFL32 on 1/19/2017 in Washington D.C.... __HTTP__ _E_ RE: Michael Jackson: He was a great friend and a spectacular entertainer. It's a devastating loss! Donald J. Trump _E_ Via @AP's: ObamaCare is a tax __HTTP__ @BarackObama gave the largest tax increase in history on the middle class. Shameful! _E_ I'm honored to be presented the award of Doctor of Business Administration Honoris Causa from Robert Gordon University in Aberdeen Scotland _E_ Great job by the FBI Boston Police and all others involved start the trial tonight! _E_ Rising premium costs from Obamacare will cost businesses billions __HTTP__ Guess where these new costs get passed to – you. _E_ Our spa @TrumpSoHo gets a nice write up in @DETAILS: #gotmilk _E_ More hysterical DSRL videos featuring Donald Trump and Double Trump plus enter Golden Lick Race Sweepstakes: __HTTP__ _E_ So proud of @FEMA Military and First Responders! Thank you! __HTTP__ _E_ Wow! Thank you Louisville Kentucky! #VoteTrump on 3/5/2016! Lets #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_ By @kwrcrow: "NY Post caught 'LYING' Again!" __HTTP__ ... The Donald" should go far. Actually if I run I'll win. _E_ Happy and proud to help @MittRomney win Ohio with robo calls in pivotal Cuyahoga County _E_ ...fired. This story is totally made up by the dishonest media.The Chief is doing a FANTASTIC job for me and more importantly for the USA! _E_ Now China is helping Iran smuggle nuclear parts __HTTP__ . China is not an ally but our country's greatest threat & rival. 
_E_ Via @TheTodaysGolfer "@TrumpScotland gets new clubhouse" __HTTP__ _E_ Just departing La Crosse Wisconsin. Thank you! #Trump2016 #WIPrimary __HTTP__ __HTTP__ _E_ Thousands of US warplanes ships and missiles contain fake electronic components from China leaving them open (cont) __HTTP__ _E_ I will be interviewed on @CNN at 7:00 A.M. _E_ The biggest business people have used the bankruptcy laws to their advantage Warren B Icahn Kravis and this week John Paulson for haters! _E_ My exclusive @WSOC_TV interview with @BlairMiller9 discussing Trump National North Carolina & future deals __HTTP__ _E_ Happy New Year to all my Jewish friends. _E_ 51 Million American to travel this weekend highest number in twelve years (AAA). Traffic and airports are running very smoothly! @FoxNews _E_ Letterman @Late_Show was great last night. I had a lot of fun. You could see his audience really wanted Obama to take the $ for charity. _E_ Just interviewed by @LouDobbs. Will be aired tonight at 7pmE on @FoxBusiness. #Dobbs _E_ Looking forward to being hosted & interviewed next Monday by David Rubenstein at the @TheEconomicClub __HTTP__ _E_ Success comes with hard work focus and luck. The luck comes to those who seek it out. If you are not in the game you cannot get lucky. _E_ Will be interviewed on @Morning_Joe at 7:3O. Enjoy! _E_ We should look to China where big time pollution takes place as they manufacture inefficient and costly wind turbines for Scotland! _E_ Stock Market hits another all time high on Friday. 5.3 trillion dollars up since Election. Fake News doesn't spent much time on this! _E_ The Failing @nytimes in a story by Peter Baker should have mentioned the rapid terminations by me of TPP & The Paris Accord & the fast.... _E_ Via Business Insider: Donald Trump's Poll Dominance in 2 Key States is Mind Blowing __HTTP__ _E_ The American economy would grow if Washington didn't keep threatening higher taxes and more regulations. Government is not the solution. 
_E_ This year's Trump Miss Universe Pageant is comprised of truly beautiful women.Will be simulcast live December 19th on @nbc and @Univision. _E_ The Democrats don't want money from budget going to border wall despite the fact that it will stop drugs and very bad MS 13 gang members. _E_ God bless all the brave souls who perished 12 years ago today. You will never be forgotten! _E_ The people of Buffalo should be happy Terry Pegula got the team but I hope he does better w/the Bills than he has w/the Sabres. Good luck! _E_ Just left Liberty University. Chancellor Jerry Falwell Jr.& his father have done an amazing job...great school & the students were fantastic _E_ Can you believe that the corrupt and pathetic South Africa police force has yet to arrest the sign language guy. Such danger give 10 years! _E_ Today we honor the fallen at #PearlHarbor 74 years ago today. If you see a vet today thank them! #RememberOurVets __HTTP__ _E_ RT @foxandfriends: STILL AHEAD: @realDonaldTrump joins us at 7am/et! #RNCinCLE __HTTP__ _E_ The Emmys are all politics that's why despite nominations The Apprentice never won even though it should have many times over. _E_ RT @PChowka: Fox News With Hannity's Help Regains Its Ratings Dominance By Peter Barry Chowka at The Hagmann report __HTTP__ _E_ We must never bend too much. Yitzhak Shamir (1915 2012) __HTTP__ _E_ I will be doing @colbertlateshow at 11:30 on CBS. Enjoy! __HTTP__ _E_ Obama's new campaign ad defends Solyndra __HTTP__ I guess losing $500M is a cause for celebration for @BarackObama. _E_ How much longer will the failing nytimes with its big losses and massive unfunded liability (and non existent sources) remain in business? _E_ RT @VP: Went to the Senate today to say @POTUS & I fully support Graham Cassidy plan to repeal/replace Obamacare. Let's get this done. __HTTP__ _E_ ...@Lord_Sugar You need the income from the show to keep going hope it doesn't hurt. 
_E_ God never takes away something from your life without replacing it with something better. Rev. @BillyGraham _E_ Under Obama Iran has taken over Iraq Al Qaeda has taken over Libya the Muslim Brotherhood now controls Egypt. Worst foreign policy ever. _E_ Glad everyone could see Mar a Lago last night on @datelinenbc. It is the crown jewel of Palm Beach. _E_ .@MissUniverse visited my office tall and beautiful! __HTTP__ _E_ Remember when @BarackObama promised you could keep your coverage? Study shows 1 in 10 employers will drop health care __HTTP__ _E_ Hillary said that guns don't keep you safe. If she really believes that she should demand that her heavily armed bodyguards quickly disarm! _E_ The 9/11 trials at Gitmo over the weekend were a disaster. Can you imagine how much worse it would be if @BarackObama tried them in NYC? _E_ JOBS JOBS JOBS! __HTTP__ _E_ NPR's @NealConan said schlonged to WaPo re: 1984 Mondale/Ferraro campaign: That ticket went on to get schlonged at the polls. #Hypocrisy _E_ Which campaign is possibly on the trajectory towards insolvency? __HTTP__ At least @BarackObama is consistent. _E_ The invisible hand of the market always moves faster and better than the heavy hand of government. @MittRomney _E_ The new reality. China's economy 'underpins' global demand __HTTP__ Our leaders just watched as China took full control. _E_ Did you know that one of seven Americans is now on food stamps? Think of it. In the United States the most pr... (cont) __HTTP__ _E_ I still can't get over how the Republicans—my friends—spent hundreds of millions of dollars on such terrible & ineffective ads. _E_ Thank you! #MakeAmericaGreatAgain __HTTP__ _E_ Wow 30000 e mails were deleted by Crooked Hillary Clinton. She said they had to do with a wedding reception. Liar! How can she run? _E_ Do you notice that nobody is talking about the many scandals of the Obama administration anymore The Teflon President! 
_E_ Crooked Hillary wants to get rid of all guns and yet she is surrounded by bodyguards who are fully armed. No more guns to protect Hillary! _E_ May the Festival of Lights bring our Jewish friends from around the world health & happiness! Happy Hanukkah! __HTTP__ _E_ Fox & Friends at 7.00 _E_ It is time to take care of OUR people to rebuild OUR NATION and to fight for OUR GREAT AMERICAN WORKERS! #TaxReform #USA __HTTP__ _E_ In his entire political career @BarackObama has never had a tough @GOP opponent before @MittRomney. He is a paper tiger. #GOMITT _E_ Bill Clinton wants to #MakeAmericaGreatAgain __HTTP__ _E_ Failure defeats losers failure inspires winners. Robert T. Kiyosaki@theRealKiyosaki _E_ I have never met a successful person that was a quitter. Successful people never ever give up! _E_ Our debt is about to top $17T. ObamaCare and China (& others) are killing American business. _E_ Will be interviewed on @NewDay on @CNN at 7:15 A.M. _E_ My @foxandfriends interview re: Muslim Brotherhood taking over Egypt our vast natural gas resources & US tax system __HTTP__ _E_ For all of those who have been asking a big cast announcement coming soon for @ApprenticeNBC! _E_ Sometimes there is justice. A Chinese military newspaper was hacked. __HTTP__ _E_ Via @UnionLeader BY Bill Smith: "GOP rally in Manchester fires up party faithful" __HTTP__ _E_ A tough week was had by@MittRomney but he's come back from adversity before. _E_ CUT CAP AND BALANCE. TAXED ENOUGH ALREADY! _E_ .@GOP has leverage. Must stay united & on message. _E_ Via @PPDNews: Donald Trump: 'I Am Not Doing This For Fun' We Can't Fix U.S. 'Unless We Put Right Person' In WH __HTTP__ _E_ Trump Golf Links at Ferry Point in the Bronx NY will open soon. A Jack Nicklaus Signature Design. Beautiful. __HTTP__ _E_ Under @BarackObama the Iranian nuclear program has rapidly grown. __HTTP__ _E_ Via Int'l Business Times: Jeb Bush Got $1.3M Job at Lehman After Florida Shifted Pension Cash To Bank. 
__HTTP__ _E_ China is complaining about 2500 marines being placed in Australia. Meanwhile they are building bases across Latin America. #TimeToGetTough _E_ Donald Trump trademarked Reagan slogan & would like to stop other Republicans from using it __HTTP__ via @businessinsider _E_ Is Roger Simon @politicoroger ever right about anything? Now he's attacking @BillClinton in defense of (cont) __HTTP__ _E_ GO VOTE FROM NOW TO 8:30 P.M. NEVADA. I WILL BE AT VARIOUS CAUCUS SITES. MAKE AMERICA GREAT AGAIN! _E_ RT @TheFive: @POTUS being unpredictable is a big asset North Korea knew exactly what President Obama was going to do. @jessebwatters _E_ A pessimist is one who makes difficulties of his opportunities... _E_ Glad to hear @seanhannity supports my offer to Obama. As Sean says "it is an easy $5 million to charity. What does Obama have to lose?" _E_ My interview on @WOR710 with Jon Gambling discussing #TimeToGetTough meeting @NewtGingrich and the 2012 election __HTTP__ _E_ Next year will be an interesting one. I look forward to running against Hillary Clinton a totally flawed candidate and beating her soundly _E_ .@AustinKaiser52 The 2 people I am most excited to hear speak on Thrursday at @CPACnews is @GovChristie & @realDonaldTrump #DCBound Thanks. _E_ Now is the time to buy housing before values have fully recovered. In 5 years remember I told you so. _E_ Thank you to the men and women of Fort Myer and every member of the U.S. Military at home and abroad. #USA __HTTP__ _E_ Our country should be worried about nuclear control far more than gun control & that one's not even close! _E_ From Donald Trump: "I'm so proud of my wife Melania and the launch of her new jewelry line to debut on QVC on April 30th at 9 p.m." _E_ Internal polling shows that I would swamp @RobAstorino in a NY Republican primary 77% to 23%. But won't run if party is not unified. 
_E_ Via @swan_investor by @Forbes: "The Trump Card: Make America Great Again" __HTTP__ _E_ Join me in Roanoke Virginia tomorrow at the Berglund Center Coliseum ~ 6pm! Tickets available at:... __HTTP__ _E_ Great to see Sec. Clinton leaving the hospital yesterday with @ChelseaClinton and Pres. Clinton. Glad she is recuperating. _E_ Bernie Sanders is continuing his quest because he believes that Crooked Hillary Clinton will be forced out of the race e mail scandal! _E_ I will be interviewed on @oreillyfactor tonight at 8:00. Will be talking about the poor treatment of our veterans illegal immigration etc. _E_ I very much appreciate all of the great reviews & comments on my speech in Michigan the people were great. _E_ See when I said NATO was obsolete because of no terrorism protection they made the change without giving me credit. __HTTP__ _E_ Many Red State Democrats sticking with Obama on deficit spending on the ObamaCare monstrosity will be defeated in 2014. _E_ Happy 4th of July to everyone including the haters and losers! _E_ CNN is the worst fortunately they have bad ratings because everyone knows they are biased. __HTTP__ _E_ Thank you @Todayshow for the wonderful and honest poll results on Chicago sign. People love it! __HTTP__ @TrumpChicago _E_ Remember I predicted that New York Magazine would fold and people scoffed? Just announced (N.Y.Post) it lost big $'s & is cutting way back! _E_ Happy birthday to @garyplayer a truly great Champion and Person! _E_ Any increase in ObamaCare premiums is the fault of the Democrats for giving us a product that never had a chance of working. _E_ "Mold yourself into the person who can do big things." – Think Big _E_ Ron DeSantis Iraq vet Navy hero bronze star Yale Harvard Law running for Congress in Fla. Very impressive. __HTTP__ _E_ 70% of the Chinese say they are better off than they were 4 years ago __HTTP__ At least someone has done well under Obama. _E_ Do you know how many years @TheRealMarilu starred on Taxi? 
#CelebApprentice _E_ "Interested is interesting. If you remember that simple rule you will have no trouble making conversation." Think Like a Billionaire _E_ I stand ready to lead us down a new path where we are lifted up by our desire to succeed not by a resentment of success. @MittRomney _E_ ObamaCare is causing such grief and tragedy for so many. It is being dismantled but in the meantime premiums & deductibles are way up! _E_ I will be in Indiana on Sunday and Monday at four MAKE AMERICA GREAT AGAIN rallies. See you there! _E_ Congrats @TrumpWaikiki for winning @AmericanExpress Fine Hotels & Resorts 'Hotel Partner of The Year for 2014' award! _E_ Is business success a natural talent? I think it's a combination of aptitude work and luck. Think Like a Champion _E_ Great ruling on wind farm in Scotland—very smart judge! Front page article. __HTTP__ _E_ Thank you @BrentBozell As you know I have been saying this for a long time __HTTP__ _E_ Flashback @FoxNewsInsider July '14:"Trump: Bergdahl Swap Another Mistake By 'Gang That Couldn't Shoot Straight'" __HTTP__ _E_ Everyone should boycott Italy if Amanda Knox is not freed she is totally innocent. _E_ The ALS #IceBucketChallenge that Trumps them all __HTTP__ _E_ Great job @MariaTCardona on @ThisWeekABC. You made kooky Cokie Roberts and @BillKristol look even dumber than they are. You will be right! _E_ Great to be on @andersoncooper tonight with my wonderful family. Will be rebroadcast at 12:00 A.M. (EASTERN). _E_ Virtually all Presidents and candidates including John McCain Bill Clinton George H.W. Bush and George W. Bush... __HTTP__ _E_ Thank you @JakeTapper for giving me credit for my vision on bombing the oil fields. Should have been done long ago. #Trump2016 _E_ ...intentional. This whole narrative is a way of saving face for Democrats losing an election that everyone thought they were supposed..... 
_E_ "Some people spend an entire lifetime wondering if they made a difference in the world.The Marines don't have that problem." Ronald Reagan _E_ Rev.@BillyGraham is doing tremendous work this election cycle educating the Christian community on @MittRomney. _E_ "Read the Bible. Work hard and honestly. And don't complain." – Rev. @BillyGraham _E_ Robert Bryce @NYPost Congrats on your great opinion piece on terrible wind turbines & how destructive they are. Windmills are a disaster. _E_ I will be making a speech at 12:00 in Fort Worth Texas. Really big crowd expected. Will be talking about the debate last night plus plus! _E_ After many years of failurecountries are coming together to finally address the dangers posed by North Korea. We must be tough & decisive! _E_ Show me someone without an ego and I'll show you a loser. How To Get Rich _E_ I like John McCain but we have to start rebuilding the United States instead of countries who hate us and want us to fail be smart! _E_ Happy Birthday @TheLeeGreenwood!#FlashbackFriday __HTTP__ _E_ The media is spending more time doing a forensic analysis of Melania's speech than the FBI spent on Hillary's emails. _E_ Any senator who votes against starting debate is telling America that you are fine w/ the #OCareNightmare! Remarks: __HTTP__ __HTTP__ _E_ Thank you America! Get out & VOTE tomorrow! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ I will be speaking at the #StopIranDeal rally shortly watch live here __HTTP__ _E_ Negotiation tip #2: I always go into the deal anticipating the worst... _E_ Can you imagine if I had the small crowds that Hillary is drawing today in Pennsylvania. It would be a major media event! @CNN @FoxNews _E_ My ties shirts and cufflinks have never been more beautiful THE BEST available at Macy's! _E_ JFK Files are released long ahead of schedule! _E_ Who is the moron who decided to release the Ferguson grand jury findings after 9:00 o'clock in the evening. What were they thinking? 
_E_ THANK YOU! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ My @jrg710 interview discussing building a cemetery next to Trump National the FL primary @ApprenticeNBC and OPEC __HTTP__ _E_ ISIS is making big threats today no respect for U.S.A. or our leader If I win it will be a very different storywith very fast results _E_ Commissioner Adam Silver made a strong and very wise decision concerning Donald Sterling. _E_ Sissy Graydon Carter of failing Vanity Fair Magazine and owner of bad food restaurants has a problem his V.F. Oscar party is no longer hot _E_ Love that Patriots won Brady is best ever! Seahawks pass was DUMBEST play in the history of football! Great going COACH B! _E_ Oil has been over $33/gallon for 34 months. A new record. And now with Obama's war on coal American families will be hit even harder. _E_ I'm really glad that @MittRomney no longer says what a nice guy @BarackObama is. _E_ My daughter Ivanka did great tonight in New Hampshire. The sold out crowd loved her and she loved them. Thanks Ivanka! _E_ .@ChuckGrassley got your message loud and clear. We have fantastic people on the ground got there long before #Harvey. So far so good! _E_ .@GeraldoRivera Thank you Geraldo for your nice words on @oreillyfactor tonight. You are a true champion! Thank @ericbolling great guy! _E_ Airing live from Baton Rouge at 8PM ET on @nbc 2014 @MissUSA Competition will be a tremendous event __HTTP__ _E_ The secret of getting ahead is getting started. Mark Twain _E_ Why is President Obama allowed to use Air Force One on the campaign trail with Crooked Hillary? She is flying with him tomorrow. Who pays? _E_ Will be interviewed on @foxandfriends at 8:00 A.M. _E_ "Having an ego and acknowledging it is a healthy choice. Our ego gives us a sense of purpose." – Think Like a Champion _E_ OmikronDreamer @realDonaldTrump do you wear your own ties? Yes. _E_ Do your homework. 
Wasting other people's time due to poor planning and thoughtlessness will only leave a bad impression. _E_ I will be on @oreillyfactor tonight interview with Bill O'Reilly on @FoxNews at 8 p.m. repeated at 11 p.m. _E_ My @FoxNews interview on @TeamCavuto discussing why debt commission should be discussed in debates & @RNC convention __HTTP__ _E_ Via @HPCaTravel by @alau2: "Trump Hotel Reflects Youthful Luxurious Vancouver: Ivanka Trump" __HTTP__ _E_ RT @Scavino45: .@POTUS @realDonaldTrump and @FLOTUS Melania visit with @UMCSN patient Tiffany Huizarin Las Vegas earlier today. #VegasStron... _E_ The toughest thing about success is that you've got to keep on being a success. Irving Berlin _E_ C SPAN/Conversation with Donald Trump/Economic Club of Washington DC __HTTP__ _E_ President Obama played golf yesterday??? _E_ Now the UN is attacking @Redskins franchise __HTTP__ With all the world's problems is this really a top priority? _E_ Via @GravisMarketing: "New Hampshire Poll: Trump into top tier status" __HTTP__ _E_ I am growing the Republican Party tremendously just look at the numbers way up! Democrats numbers are significantly down from years past. _E_ At 96 stories above Michigan Avenue if you're not staying at the 5 star @TrumpChicago then you're in its shadow __HTTP__ _E_ Next time Marco Rubio should drink his water from a glass as opposed to a bottle—would have much less negative impact. _E_ Obama lied 100% about Libya and the killings emails are absolute. He must release his records on Wednesday and stop the lies. _E_ Can you believe that President Obama still hasn't stopped the flights and people pouring into the U.S. from West Africa. TERRIBLE PRESIDENT! _E_ RT @Scavino45: 20295 miles later #POTUSinAsia has successfully concluded as @POTUS @realDonaldTrump lands on the South Lawn of @WhiteHouse... 
_E_ 'Immigration Ban Is One Of Trump's Most Popular Orders So Far' __HTTP__ _E_ RT @VP: All Americans in harms way need to be prepared and should continue visiting __HTTP__ for critical updates on #Hurric... _E_ Why doesn't phony @bobvanderplaats tell his followers all the times he asked for him and his family to stay at my hotels didn't like paying _E_ Robert Slater who just passed away was a terrific writer who wrote a very fair book about me. He will be missed. __HTTP__ _E_ I believe in spending what you have to. But I also believe in not spending more than you should. The Art of The Deal _E_ My @FoxNews interview with @megynkelly discussing the 2012 election and the Newsmax @iontv debate __HTTP__ _E_ .@HillaryClinton is NOT above the law!#Debates2016 __HTTP__ _E_ This chart from AEI's @JimPethokoukis shows how terrible @BarackObama's 'recovery' really is: __HTTP__ Disaster. _E_ In '08 @BarackObama called Jerusalem Israel's capital __HTTP__ Now he attacks @MittRomney on Jerusalem __HTTP__ _E_ I won't be doing Fox & Friends tomorrow morning in that I have a big breakfast meeting on a deal. I will be back next week at 7. Thank you! _E_ On June 22 I will be going to Scotland to celebrate the opening of the newly renovated @TrumpTurnberry Resort the worlds best. _E_ I know Rand Paul and I think he may find a way to get there for the good of the Party! _E_ DC has shrunk our military and exploded our country with debt. We can't send another politician to the White House __HTTP__ _E_ Just heard Foreign Minister of North Korea speak at U.N. If he echoes thoughts of Little Rocket Man they won't be around much longer! _E_ Will be traveling to the Great State of Ohio tonight. Big crowd expected. See you there! _E_ I will be interviewed by @oreillyfactor at 4:00 P.M. (prior to the #SuperBowl Pre game Show) on Fox Network. Enjoy! 
_E_ .@Jetsetterdotcom in Hong Kong featured 8 pages on my great hometown of New York City including @TrumpSoHo __HTTP__ _E_ .@claudiajordan's judgment wasn't the best in who she chose to come back to the boardroom—that was her demise. #CelebApprentice _E_ Next Tuesday remember how our president has not lifted a finger for USMC Tahmooressi. He only wants illegals to cross our border. _E_ It was really strange when Hillary was missing from the podium last night. Not very presidential! _E_ My @CNN interview with @PiersTonight discussing the Newsmax @iontv debate #TimeToGetTough the GOP and the economy __HTTP__ _E_ Nice article on Trump Links at Ferry Point in today's New York Post the construction is going really well! _E_ WE LOVE YOU LAS VEGAS! __HTTP__ _E_ RT @JaniceTaylor912: @DonaldJTrumpJr @Reryan08 @IvankaTrump @EricTrump obvious to all that he raised some GREAT responsible patriotic kid... _E_ Poll: Trump Leads GOP Field Among Hispanics Records 34% Favorability __HTTP__ _E_ With the World hating us and wanting to destroy the U.S. we have just cut the hell out of the military budget making it smallest since '39 _E_ I am pleased to announce that I have chosen Governor Mike Pence as my Vice Presidential running mate. News conference tomorrow at 11:00 A.M. _E_ #ObamacareFail __HTTP__ _E_ So Obama used to tell classmates that he was Kenyan royalty and an Indonesian prince __HTTP__ Sounds like his book bio! _E_ We are going to make this a government of the people once again!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ Iraq told us to get out Iraq is now falling and Iraq now wants us to come back! Don't do it unless we get the OIL and I mean ALL OF IT! _E_ Interesting studies show that wind farms have a warming effect on the climate _E_ I am so proud of our great Country. God bless America! _E_ The Celebrity Apprentice Sunday night at 9 PM on NBC. Another great episode! __HTTP__ _E_ Sadly firing can be an essential and responsible business decision. 
It isn't pleasant but lopping off a branch can save a tree. _E_ All of this Russia talk right when the Republicans are making their big push for historic Tax Cuts & Reform. Is this coincidental? NOT! _E_ ... It is all about incorporating a sense of optimism into everything you do while also acknowledging the negative." – Think Big _E_ .@MittRomney scored last night on both substance and style. _E_ Join me in congratulating @NASA's @AstroPeggy by using the hashtag #CongratsPeggy! Earlier today:... __HTTP__ _E_ An individual whose whole career is trying to take down successful celebrities with nonsense campaigns has turned his attention to me..... _E_ Thank you America! #Trump2016 __HTTP__ _E_ Entrepreneurs must have vision plus the power of focus... to see the future and turn their vision into a profitable reality. #MidasTouch _E_ I'll bet if I didn't harass Apple for the last 2 years about the large screen iPhone they wouldn't have done it—but it bends & breaks! _E_ President Obama's inaugural had record low ratings. What does that portend? _E_ Mike Leach's lessons his takeaways from Geronimo's life are fascinating & useful whether in boardroom or locker room __HTTP__ _E_ Rasmussen just announced that my approval rating jumped to 49% a far better number than I had in winning the Election and higher than certain "sacred cows." Other Trump polls are way up also. So why does the media refuse to write this? Oh well someday! _E_ "President Donald J. Trump Proclaims October 24 2017 as United Nations Day" Read more: __HTTP__ __HTTP__ _E_ Entrepreneurs: Be passionate you have to love what you're doing to be successful at it. _E_ I am committed to keeping our air and water clean but always remember that economic growth enhances environmental protection. Jobs matter! _E_ Washington will continue to run record deficits into the election. We are borrowing at a rate of $1.40 from China. Truly unsustainable. 
_E_ Who is rooting for Obama more tonight his campaign advisors or the press? _E_ If I win the Presidency we will swamp Justice Ginsburg with real judges and real legal opinions! _E_ .@KarlRove who spent $430 million in the last cycle and didn't win one race said I'm not a candidate until I file papers. Next week Karl! _E_ Via @CarrGaz: "Trump to jet in to unveil Trump @TurnberryBuzz clubhouse" __HTTP__ _E_ Tom Brady would have won if he was throwing a soccer ball. He is my friend and a total winner! @Patriots _E_ Iran will convince our incompetent President that they are trying to help us with Iraq take over the country & oil and O will say thanks _E_ Foreigners slashed the purchase of US debt late last year the first time in over 2years. We must control spending. __HTTP__ _E_ ....the 2016 election with interviews speeches and social media. I had to beat #FakeNews and did. We will continue to WIN! _E_ .@genesimmons Keep up the great work and congrats we are proud of you! _E_ Despite the phony Witch Hunt going on in America the economic & jobs numbers are great. Regulations way down jobs and enthusiasm way up! _E_ "Talent wins games but teamwork and intelligence wins championships." Michael Jordan _E_ If the morons who killed all of those people at Charlie Hebdo would have just waited the magazine would have folded no money no success! _E_ Entrepreneurs: What is the standard for which you want to be known? Identify that standard and then establish it. _E_ China watched Obama's press conference yesterday salivating. We will be borrowing trillions more from them. _E_ Thank you Jacksonville Florida!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Thank you to @piersmorgan for your nice statement about me in the @HollywoodReporter __HTTP__ _E_ The failing New York Daily News knowingly incorrectly reported that I wanted to speak at the Republican National Convention wrong! _E_ "Problems setbacks mistakes & losses are all part of life. 
We shouldn't be shocked if and when they happen." – Think Like a Champion _E_ Thank you for you support Virginia! In ONE DAY get out and #VoteTrumpPence16! #ICYMI: __HTTP__ __HTTP__ _E_ We're coming up on the NEW YEAR It is really important that despite so many stupid decisions being made in Washington we make it BEST EVER _E_ RT @foxandfriends: POTUS the predictor? President Trump foretold housing upswing in 2012 __HTTP__ _E_ The Chinese are mistreating Hillary Clinton on her trip __HTTP__ They have zero respect for us. Outrageous! _E_ New CBS National Poll just out massive lead for Trump. The Wall Street Journal/NBC Poll is a total joke. No wonder WSJ is doing so badly! _E_ The Roger Stone report on @CNN is false Fake News. Have not spoken to Roger in a long time had nothing to do with my decision. _E_ Haters and losers say I wear a wig (I don't) say I went bankrupt (I didn't) say I'm worth $3.9 billion (much more). They know the truth! _E_ Wonderful to be in North Dakota with the incredible hardworking men & women @ the Andeavor Refinery. Full remarks: __HTTP__ __HTTP__ _E_ Just left Trump National Doral in Miami under massive construction The Blue Monster will be one of the greatest courses ever built! _E_ Departing Pittsburgh now where it was my great honor to stand with our incredible workers and to show the world that AMERICA is back and we are coming back bigger and better and stronger than ever before! __HTTP__ _E_ If you are lucky enough to catch a knockout assaulter before getting slugged and you carry a gun shoot the bastard (teach them a lesson)! _E_ RT @Scavino45: "Utilities cutting rates cite benefits of Trump tax reform" __HTTP__ _E_ When you do your Christmas shopping remember how disloyal @Macys was to the subject of illegal immigration. #BoycottMacys #DumpMacys _E_ We are making tremendous progress with the V. A. There has never been so much done so quickly and we have just started. We love our VETS! _E_ JOBS JOBS JOBS! 
__HTTP__ _E_ Mexico's court system is a dishonest joke. I am owed a lot of money & nothing happens. _E_ In Massachusetts the place is packed! #MakeAmericaGreatAgain _E_ Entrepreneurs: Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_ RT @DRUDGE_REPORT: 10 SCANDALS ON DIRECTOR'S WATCH... __HTTP__ _E_ Senior United States District Judge Robert E. Payne today ruled in favor of Trump campaign delegates who had argued.. __HTTP__ _E_ The Zimmerman trial is over. It is time to move on. While Zimmerman is no angel he was acquitted and should be able to move on. _E_ Entrepreneurs: Realize that persistence can go a long way. Being stubborn is often an attribute. _E_ Audience chanting RUN TRUMP RUN! during my my @SRQRepublicans speech! They are going to be very happy... _E_ Firing @lisalampanelli may have come as a surprise. She's a strong player. But there are no losers at this late point. #sweepstweet _E_ Via @necn by @KatherineNECN: "Trump Waiting to See Who Runs in 2016" __HTTP__ _E_ Justice Ginsburg of the U.S. Supreme Court has embarrassed all by making very dumb political statements about me. Her mind is shot resign! _E_ Why does a failed magazine like @Forbes constantly seek out trivial nonsense? Their circulation way down. @Clare_OC _E_ Premiering on January 4th the 14th season of @ApprenticeNBC will have major fireworks every episode. The Board Room is electric! _E_ These crimes won't be happening if I'm elected POTUS. Killer should have never been here. #AmericaFirst __HTTP__ _E_ Another freezing day in the Spring what is going on with global warming ? Good move changing the name to climate change sad! _E_ Cyprus is seizing private bank accounts as collateral for €10bn bail out. We owe $17T. Think it can't happen here? _E_ The terrorist who killed so many people in Germany said just before crime by God's will we will slaughter you pigs I swear we will...... 
_E_ Scary America would have had to pay all its GDP to the government to cover @BarackObama's real 2011 budget deficit __HTTP__ _E_ Tomorrow will be a really big day for America. MAKE AMERICA GREAT AGAIN! _E_ .@BreitbartNews: DONALD TRUMP: CANTOR'S DEFEAT SHOWS 'EVERYBODY' IN CONGRESS VULNERABLE IF THEY SUPPORT AMNESTY __HTTP__ _E_ Thanks. __HTTP__ _E_ Congratulations to my son Eric for making the Forbes 30 under 30 list. He's done a great job! __HTTP__ _E_ Via @UnionLeader by @tuohy: Trump hires Lewandowski as presidential run eyed __HTTP__ #FITN #MakeAmericaGreatAgain _E_ Via @BreitbartNews by mboyle1: EXCLUSIVE DONALD TRUMP CONFIRMED TO SPEAK AT #CPAC2014 __HTTP__ @ACUConservative @CPACnews _E_ Amazingly @AnthonyWeiner is going to run. The cure rate for his problem is 0. Lots of other things will come out. _E_ I look forward to attending & speaking at the Iowa Land Investment Expo—total sellout crowd __HTTP__ @PeoplesCompany _E_ America should not be pressuring @Israel to show restraint against Iran. We should be working to stop Iran's nuclear drive. _E_ Amazing! AG Schneiderman sues a school w/ a 98% approval rating but doesn't go after billion $ fraudsters all over Wall St. _E_ The hatchet job in @NYMag about Roger Ailes is total bullshit. He is the ultimate winner who is surrounded by a great team. @FoxNews _E_ Thank you for your continued support!#MakeAmericaGreatAgain __HTTP__ _E_ Obama's planned tax hike will hit over 1 million small businesses __HTTP__ Expect more massive unemployment and stagnant growth _E_ Join me live now in Las Vegas Nevada! We will MAKE AMERICA SAFE & GREAT AGAIN! #VoteTrumpNV #NevadaCaucus __HTTP__ _E_ "Donald Trump to name golf course after mother" __HTTP__ via @scotsmandotcom _E_ If you can't see it you can't make it happen. Entrepreneurs chase your dreams with resolute focus & determination. Be positive! _E_ A Lion's List of Democrats are not attending @BarackObama's DNC Convention. The Democratic Party is in turmoil. 
__HTTP__ _E_ The Holiday Season in New York City is a very special time. I love seeing and meeting the many tourists who visit the #TRUMP Tower atrium. _E_ Just watched Jon Stewart(?) jumping up and down and screaming like a madman nothing funny or smart just loud and obnoxious a pushy dope! _E_ Please tweet me your questions to answer in my #trumpvlog. _E_ TPP does not stop Japan's currency manipulation & China has a backdoor to join. It must be stopped. We need to protect the American worker! _E_ I have nothing to do with the Plaza Casino in Atlantic City. I have not been involved with Atlantic City for many years. Used to love A.C.! _E_ It's boardroom time! Does anyone miss @OMAROSA? #CelebApprentice _E_ .@HillaryClinton Sneers At Millions Of Average Americans. __HTTP__ #VPDebate #BigLeagueTruth _E_ Blatant and rampant property destruction in Baltimore as the police stand by and watch. Should be a lesson on how NOT to handle riots. SAD! _E_ Thank you Piers they don't know what they're getting into. __HTTP__ _E_ I applaud @netanyahu for announcing that he will show up at the UN to defend @Israel. A true US friend and great leader. _E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ Amazing crowd last night in Dallas more spirit and passion than ever before. Today all over the great State of Texas! _E_ Was not mentioned that we built one of the great golf courses in the world bringing tremendous business to Scotland. __HTTP__ _E_ "Simply take a big goal and mold yourself to become the person who can accomplish that goal." – Think Big _E_ Tonight's #CelebrityApprentice will continue to impress. Be sure to tune in tonight at 9PM ET @NBC. It will be amazing. _E_ .@CelebApprentice Flashback: "What @bretmichaels Learned from the 'Rock Star of Real Estate'" __HTTP__ _E_ Just left Virginia where I unveiled my healthcare and other plans for our great Veterans! They will be very happy! 
__HTTP__ _E_ Donald Trump To Mitt Romney: 'You're Fired' __HTTP__ via @fitsnews _E_ Trump International Hotel & Tower New York winner of the Forbes Five Star Hotel Award in 2009 through 2012. __HTTP__ _E_ If I would have done the last debate a record would have been set (instead of the poor ratings recieved). Also VETS got $6000000. _E_ Scary. President Obama told Boehner that the government doesn't have a spending problem __HTTP__ _E_ Donald Trump: 'Monkey business' on jobs __HTTP__ via @politico _E_ I have an idea for @JebBush whose campaign is a disaster. Try using your last name and don't be ashamed of it! _E_ Credible Source on 9 11 Muslim Celebrations: FBI __HTTP__ _E_ Host of the 2017 U.S. Women's Open Trump Bedminster has been rated one of America's best golf courses. _E_ SOMETIMES YOUR BEST INVESTMENTS ARE THE ONES YOU DON'T MAKE! _E_ Snowden should come back to America and face justice. Instead he is begging for clemency from Moscow. Treat him as a spy. _E_ Iran has warned the US not to send an aircraft carrier back into the Strait of Hormuz. We should send three as a (cont) __HTTP__ _E_ It will be interesting to see what happens to Eliot Spitzer if he loses the election for Comptroller to very capable @scottmstringer. _E_ .@foxandfriends int. on gov. collecting data whistle blower hiding in China & no bikinis in Miss World pageant __HTTP__ _E_ Huge Townhall tomorrow at 5PM in the NH Barrington Middle School! Thanks to @straffordnhgop for hosting! Let's Make America Great Again! _E_ Watch me on @Hannityshow tonight at 9PM ET on @FoxNews. _E_ RT @DanScavino: Jesse Jackson on @realDonaldTrump when he donated space for the Rainbow/Push Coalition. #DebateNight __HTTP__ _E_ China will extract much from Secretary Kerry and the U:S. in order for them to help us with the North Korea problem don't let this happen! _E_ I don't like seeing the Pope standing at the checkout counter (front desk) of a hotel in order to pay his bill. It's not Pope like!
_E_ Disappointed the @NewYorkObserver article on @AGSchneiderman did not bring up his dealings w/ Shirley Huntley. __HTTP__ _E_ My @foxandfriends interview discussing Pres. Obama's inauguration @GOP debt plan & @CelebApprentice #1 branding __HTTP__ _E_ I will be interviewed on @foxandfriends at 9:00 A.M. I will be talking about the rigged and boss controlled Republican primaries! _E_ It's time to let Pete Rose the all time hits leader into the Baseball Hall of Fame. Enough already!!!!! _E_ In trade military and EVERYTHING else it will be AMERICA FIRST! This will quickly lead to our ultimate goal: MAKE AMERICA GREAT AGAIN! _E_ .@SabrinaSiddiqui Re: Taylor and Conor great news for Taylor! _E_ Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_ Make sure to follow me on @periscopeco. I will be streaming my announcement at 11AM. _E_ With Dr. Dror Paley & Dr. Ben Carson with two wonderful children at Mar a Lago. __HTTP__ _E_ "Here's the truth the gov't doesn't shutdown" __HTTP__ via @AP. All essential services continue. Don't believe lies. _E_ Look how small the pages have become @WSJ. Looks like a tabloid—saving money I assume! _E_ "To keep momentum keep challenging yourself." – Think Big _E_ Don't forget to watch Celebrity Apprentice this Sunday night at 9 pm on NBC. You're in for a great show. __HTTP__ _E_ The legendary Barbara Walters interviews Melania Trump and me on a special this Friday night at 10:00 on ABC. Don't miss it! _E_ .@JebBush like it or not our country needs more energy and spirit than you can provide! #MakeAmericaGreatAgain _E_ I told you the Oscars were terrible—bad look bad talent—and among the lowest ratings in show's history. __HTTP__ ... _E_ I will be on @foxandfriends at 7.30 A.M. _E_ The Emmys were horrendous...the absolute worst show! 
_E_ Watch me tonight on Late Night with Jimmy Fallon.Photo: Lloyd Bishop/NBC __HTTP__ _E_ Gov.Kasich of Ohio just stated on a morning show that he doesn't watch politics or anything on television he only watches the @GolfChannel _E_ Done by a real fan! #TRUMP __HTTP__ _E_ I will be tweeting live tonight during Celebrity Apprentice 9 o'clock on NBC! _E_ Very dangerous pattern developing across country by Obama supporters. Detroit poll watcher was threatened with gun __HTTP__ _E_ In my book @Joan_Rivers had a lousy doctor shoving a camera down her throat at her age. Something went really wrong that should not have! _E_ Can you imagine if Bush's administration drafted a memo legalizing the killing of Americans?! Democrats are such hypocrites. _E_ I use both iPhone & Samsung. If Apple doesn't give info to authorities on the terrorists I'll only be using Samsung until they give info. _E_ The Apprentice on the other hand has been a MAJOR television hit often times finishing #1. Even now after 13 seasons it wins its slot! _E_ .@TheBrodyFile Exclusive: @realDonaldTrump Says He Will Protect Evangelicals Better Than @tedcruz __HTTP__ #CBNNews #2016 _E_ "Diligence is the mother of good luck." Benjamin Franklin _E_ Riley Rone was a great young man. We will miss him dearly. __HTTP__ _E_ No surprise. @Rosie is failing on @TheView.Terrible ratings."Malcontent & another season is out of the question __HTTP__ _E_ The only people who don't like the Tax Cut Bill are the people that don't understand it or the Obstructionist Democrats that know how really good it is and do not want the credit and success to go to the Republicans! _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Entrepreneurs: Resolve to be bigger than your problems. Who's the boss? _E_ Our country is facing a major threat from radical Islamic terrorism. We better get very smart and very tough FAST before it is too late! 
_E_ A president either is constantly on top of events or if he hesitates events will soon be on top of him... _E_ RT @realDonaldTrump: So much Fake News is being reported. They don't even try to get it right or correct it when they are wrong. They prom... _E_ Once again Obama is going to lose on another prospective nomination. Chuck Hagel will not be named Sec. of Defense & probably shouldn't be. _E_ Bloomberg: Trump leads GOP field __HTTP__ _E_ Look forward to Governor Mike Pence V.P. introduction tomorrow in New York City. _E_ Sorry I will miss the CPAC gathering in Orlando there in spirit Obama must go. _E_ My #GOPDebate @facebook question for the other candidates __HTTP__ _E_ With 50 days until the election it is #TimeToGetTough for @MittRomney & @GOP _E_ Congratulations to @BretBaier on the immediate & tremendous success of his book 'Special Heart.' Already in its third printing! _E_ .@TrumpChicago's exceptional dining w/equally exceptional views of the city are exclusive world class experiences __HTTP__ _E_ Great memory @TheRealMarilu! #CelebApprentice _E_ On my way to Iowa. Will be landing in Des Moines in two hours. See ya! _E_ My contract with the American voter will restore honesty accountability & CHANGE to Washington! #DrainTheSwamp __HTTP__ _E_ We have many problems in our house (country!) and we need to fix them before we let visitors come over and stay. MAKE AMERICA GREAT AGAIN! _E_ Success tip: Don't tread water. Get out there and go for it. There's nothing wrong with bringing your talents to the surface. _E_ Seems hard to believe that @Facebook could be worth that much be careful if you invest. And Mark Zuckerberg get a pre nup. _E_ To be completed this yearTrump Int'l Golf Club Dubai will feature a 7205 yard par 71 & double sided driving range __HTTP__ _E_ The US government's foreign debt is at a record $5.29T __HTTP__ China is laughing all the way to the bank. 
_E_ RT @foxandfriends: Jared Kushner didn't suggest Russian communications channel in meeting source says __HTTP__ _E_ The Blue Monster Golf Course officially opens tomorrow at Trump National Doral with a ribbon cutting ceremony. GREAT COURSE GREAT REVIEWS! _E_ America's hearts & prayers are with the people of #PuertoRico & the #USVI. We will get through this and we will get through this TOGETHER! __HTTP__ _E_ Thank you Readers' Choice: Trump Int'l Hotel Las Vegas has been nominated by 10 Best for Best Pet Friendly Hotel __HTTP__ _E_ Will be heading over shortly to make remarks at The National Prayer Breakfast in Washington. Great religious and political leaders and many friends including T.V. producer Mark Burnett of our wonderful 14 season Apprentice triumph will be there. Looking forward to seeing all! _E_ Government dependency has surged over 23% since @BarackObama has taken office. __HTTP__ He is creating an entitlement culture. _E_ Democrats refusal to give even one vote for massive Tax Cuts is why we need Republican Roy Moore to win in Alabama. We need his vote on stopping crime illegal immigration Border Wall Military Pro Life V.A. Judges 2nd Amendment and more. No to Jones a Pelosi/Schumer Puppet! _E_ The Green Party scam to fill up their coffers by asking for impossible recounts is now being joined by the badly defeated & demoralized Dems _E_ Pictures of my beautiful mother amazing father and family hanging @MontesKitchen in upstate New York. __HTTP__ _E_ Why did @DanaPerino beg me for a tweet (endorsement) when her book was launched? _E_ I was disappointed that Ted Cruz would speak behind my back get caught and then deny it. Well welcome to the wonderful world of politics! _E_ Congrats to @mboyle1 of @BreitbartNews for exposing Jason Linkins of @HuffingtonPost as a lightweight dope who gives false information. _E_ Starting next week and by popular demand (plus good ratings) NBC will broadcast only two hour episodes of Celebrity Apprentice at 9 P.M. 
_E_ Barack Obama is hard at work today on his highest priority his reelection. @BarackObama has 5 fundraisers in 2 cities. __HTTP__ _E_ The lunatics in Congress banned the word 'lunatic' from Congress last week __HTTP__ Busy doing the peoples' work! _E_ When someone can discourage you you probably aren't determined enough. Be resolute. That's what it takes to get things done. _E_ Why does the liberal media think Bill O'Reilly (@oreillyfactor) is a complete and total vulgarian? I don't think so! _E_ .@alexsalmond @pressjournal RT @djkevritch im proud to be scottish but bonnie scotland will soon be a thing of the past w/ these windmills _E_ Join me live for the #SOTU __HTTP__ _E_ Big win today in the House for GOP Tax Cuts and Reform 227 205. Zero Dems they want to raise taxes much higher but not for our military! _E_ Fake News CNN made a vicious and purposeful mistake yesterday. They were caught red handed just like lonely Brian Ross at ABC News (who should be immediately fired for his "mistake"). Watch to see if @CNN fires those responsible or was it just gross incompetence? _E_ SSE slashes offshore wind investment—wants British government to pay for its losses on these monstrosities __HTTP__ _E_ More waste fraud and abuse over $460M in food stamps went to ineligible households __HTTP__ Where's the accountability? _E_ Under Trump gains against #ISIS have dramatically accelerated __HTTP__ _E_ Now Obama's campaign is guaranteeing 12 million new jobs during a 2nd term __HTTP__ More like $12T in new debt if he wins. _E_ In response to @Lawrence my net worth is substantially more than 7 billion dollars very low debt great as... (cont) __HTTP__ _E_ Young entrepreneurs – be patient and continue to work with determination. With hard work success will follow. Keep your focus! _E_ The reason I put up approximately $50 million for my successful primary campaign is very simple I want to MAKE AMERICA GREAT AGAIN! 
_E_ I cancelled today's meeting with the failing @nytimes when the terms and conditions of the meeting were changed at the last moment. Not nice _E_ If the ban were announced with a one week notice the bad would rush into our country during that week. A lot of bad dudes out there! _E_ I will be watching the great Governor @Mike_Pence and live tweeting the VP debate tonight starting at 8:30pm est! Enjoy! _E_ Will be on @OreillyFactor tonight at 8:30pm @FoxNews prior to Melania's speech at the #GOPConvention. Tune in she will do great! #RNCinCLE _E_ You can have the best product in the world but if people don't know about it it's not going to be worth much. The Art of The Deal _E_ I will be campaigning in Indiana all day. Things are looking great and the support of Bobby Knight has been so amazing. Today will be fun! _E_ I am on David Letterman tonight. _E_ Will be interviewed tonight on @seanhannity at 10:00. There is so much to talk about! _E_ Congratulations to Georgina Bloomberg on winning the inaugural Central Park Grand Prix CSI 3* @MikeBloomberg _E_ #HappyNewYearAmerica! __HTTP__ _E_ Crooked Hillary Clinton looks presidential? I don't think so! Four more years of Obama and our country will never come back. ISIS LAUGHS! _E_ ...Whether you are a Republican or Democrat we should hope that Pres. @BarackObama does a great job for the country. _E_ When the American People speak ALL OF US should listen. Just over one year ago you spoke loud and clear. On November 8 2016 you voted to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ As the world watches we are days away from passing HISTORIC TAX CUTS for American families and businesses. It will be the BIGGEST TAX CUT and TAX REFORM in the HISTORY of our country! __HTTP__ _E_ Great video of tonights crowd reacting to my latest proposal in SC. #Trump2016 __HTTP__ __HTTP__ _E_ Joe McQuaid (@deucecrew) of the dying Union Leader wanted ads lunches donations speeches from me and tweets very unethical. 
_E_ Nice to see Obama released a situation room photo from Sandy. How about releasing the photo taken during Benghazi? _E_ .@BarackObama economic gloom: jobless claims have surged __HTTP__ while factory activity is (cont) __HTTP__ _E_ Don't believe the biased and phony media quoting people who work for my campaign. The only quote that matters is a quote from me! _E_ Why did failing A.G. Eric Schneiderman after years of looking file his pathetic lawsuit on a SATURDAY afternoon (unheard of)? No case! _E_ We don't need another stimulus. The first one was a complete failure. Why repeat the same mistake? _E_ 09 19 2011 17:54:28 _E_ Thx to all the people who called to say they are cutting their @Macys credit card as a protest against illegal immigrants pouring into US _E_ #TrumpVine Weiner is a joke.... __HTTP__ _E_ The ObamaCare disaster is in full swing. Websites are down people can't sign up and elderly can't understand the lingo. _E_ Texas & Louisiana: We are w/ you today we are w/ you tomorrow & we will be w/ you EVERY SINGLE DAY AFTER to restore recover & REBUILD! __HTTP__ _E_ 100% fabricated and made up charges pushed strongly by the media and the Clinton Campaign may poison the minds of the American Voter. FIX! _E_ Get ready for the Apprentice tonight TWO AMAZING EPISODES. I will be live tweeting! _E_ My @CNBCClosingBell interview discussing QE3 the housing market my stock picks and the 2012 election __HTTP__ _E_ Via @financialpost: "Climate changing for global warming journalists" by Lawrence Solomon (cont) __HTTP__ _E_ Watch the clip from @Late_Show where the crowd cheers after I explain that my offer is about transparency __HTTP__ _E_ I am greatly honored by the results of the CNN poll in Iowa. In the end I believe the final results will be even better than that! _E_ Watching @CNN and consider @secupp to be one of the least talented people on television. Boring and biased! _E_ Make our borders strong and stop illegal immigration. 
Even President Obama agrees __HTTP__ _E_ How does failed writer and pundit like @stephenfhayes with no success and little talent get away with criticizing candidates. _E_ While the next season of @CelebApprentice is packed w/ All Stars ours fans will be happy to see @Joan_Rivers in the board room.She is back! _E_ Can you believe the Republicans are studying the Democrats on how to win an election? _E_ Debate showed these guys really hate each other. At one point it looked like they would come to blows. _E_ If Obama was smart he would cancel the Muslim Brotherhood's WH visit later this month. He won't. _E_ Isn't it amazing that @Macys paid a massive fine for profiling African Americans & then criticized me for discussing illegal immigration! _E_ Jeb Bush is desperate strongly in favor of #CommonCore and very weak on illegal immigration. _E_ Negotiation is persuasion more than power. Negotiation includes a lot of fine lines and that's what makes it an art. _E_ Was @foxandfriends just named the most influential show in news? You deserve it three great people! The many Fake News Hate Shows should study your formula for success! _E_ With one Yes vote in hospital & very positive signs from Alaska and two others (McCain is out) we have the HCare Vote but not for Friday! _E_ Trump approval rebounds to 45% surges among Hispanics union homes men __HTTP__ _E_ Obama is about to destroy the mililtary through the sequester. The Middle East is a mess. Yet Colin Powell still endorses him. Wonder why? _E_ AMERICA FIRST! _E_ Must read piece by @DanielPipes: "Obama's Diplomatic Acrobatics" __HTTP__ _E_ Wow the Republican Convention went so smoothly compared to the Dems total mess. But fear not the dishonest media will find a good spinnnn! _E_ Despite the ever increasing Ebola disaster Obama refuses to stop flights from West Africa.It's almost like he's saying F you to U.S. public _E_ Ivanka caught up with Bret and Holly backstage. Both Bret and Holly were champions all the way. 
__HTTP__ _E_ WH claims it lied about Pres. Obama living with his uncle b/c "wasn't mentioned in his book." I guess Bill Ayers never knew about it! _E_ Welcome to the @WhiteHouse Amir Sabah al Ahmed al Jaber al Sabah of Kuwait! Joint press conference coming up soon: __HTTP__ __HTTP__ _E_ "WHAT HAPPENED""How Team Hillary played the press for fools on Russia" __HTTP__ WE KNOW! __HTTP__ _E_ Congratulations to @TrumpNewYork and @TrumpToronto for the @WSJ coverage on perks in luxury hotels: __HTTP__ _E_ Wow! This might be my highest # yet! Thank you to my opposition you are totally ineffective & have been for years! __HTTP__ _E_ .@Macys was one of the worst performing stocks on the S&P last year plunging 46%. Very disloyal company. Another win for Trump! Boycott. _E_ find the leakers within the FBI itself. Classified information is being given to media that could have a devastating effect on U.S. FIND NOW _E_ Army training slide lists Hillary Clinton as insider threat: __HTTP__ _E_ New Zogby poll— highly respected— but the media won't report it because it gives me an even bigger lead! __HTTP__ _E_ The American US Airways merger will create even worse service and much higher fares. _E_ Just like I have been able to spend far less money than others on the campaign and finish #1 so too should our country. We can be great! _E_ If the decision by the grand jury in Ferguson was the exact opposite you would still be having the riots right now! _E_ Whatever happened to Obama's 'independent investigation' into national security leaks from his administration? Where's the media? _E_ In '09 Obama released the ISIS chief. The terrorist gloated "I'll see you in New York" __HTTP__ Historic nat'l sec. error _E_ .@bwilliams wouldn't you love to have my ratings? _E_ A poll of the Miami Dade was conclusively in favor of gambling in Miami. @willweatherford @FLGovScott __HTTP__ _E_ An honor to be endorsed by the New England Police Benevolent Association. Thank you! 
__HTTP__ __HTTP__ _E_ My Scotland course is receiving accolades from all over the world a great honor for me. _E_ I was so happy when I heard that @Politico one of the most dishonest political outlets is losing a fortune. Pure scum! _E_ My wonderful son Eric will no longer be allowed to raise money for children with cancer because of a possible conflict of interest with... _E_ Do you think the 14 African nations that are banning West Africans from coming into their nations are racist? _E_ What a great evening we had. So interesting that Sanders beat Crooked Hillary. The dysfunctional system is totally rigged against him! _E_ New York Republican leader @EdwardFCox is pushing my friend @RobAstorino into political suicide. Results won't be pleasant! _E_ It's Thursday and only 26 days until the election. How many illegal donations from China and Saudi Arabia did Obama collect today? _E_ People like doing deals with me because they know it will be profitable that I work quickly and that they will be treated fairly. _E_ From my family to yours...I want to wish you all a very merry Christmas! _E_ The habitual vacationer @BarackObama has sacrificed so much. He is delaying his 17 day Hawaii vacation a couple of hours. _E_ Thank you for the warm welcome to Brussels Belgium this afternoon! __HTTP__ _E_ .#CelebrityApprentice Two hour live show on Monday night will determine who will become the winner of Celebrity Apprentice.Full cast returns _E_ Does President Obama ever discuss the sneak attack on Pearl Harbor while he's in Japan? Thousands of American lives lost. #MDW _E_ With 46 stories and 391 beautiful rooms @TrumpSoHo offers a wide array of AAA Five Diamond luxury options __HTTP__ _E_ Tax experts throughout the media agree that no sane person would give their tax returns during an audit. After the audit no problem! _E_ .@BetteMidler talks about my hair but I'm not allowed to talk about her ugly face or body so I won't. Is this a double standard? 
_E_ New @OANN national poll released. Thank you America! #Trump2016 __HTTP__ _E_ I'll be in London on Sunday at the ExCel Centre to talk about success. It will be a great time for everyone! __HTTP__ _E_ "One reason many people do not do well in business is because they do not do well with people." – Midas Touch _E_ Wow @CNN has nothing but my opponents on their shows. Really one sided and unfair reporting. Maybe I shouldn't do their town hall tonight! _E_ Marco Rubio couldn't even respond properly to President Obama's State of the Union Speech without pouring sweat & chugging water. He choked! _E_ Trump National Golf Club Bedminster New Jersey has courses designed by Tom Fazio & 16 acres of practice facilities. __HTTP__ _E_ The evening news broadcasts must stop talking about weather—boring and too many other topics. _E_ I will be on Piers Morgan Live tonight at 9 p.m. on CNN. Tune in! _E_ Amazing @VanityFair survived one more day without folding. The clock is ticking... _E_ Today's third stop Londonderry New Hampshire! Thank you!#FITN #VoteTrumpNH __HTTP__ _E_ Romance or Adventure what do you prefer? #CelebApprentice _E_ Looking forward to tonight's Ayrshire Chamber of Commerce Annual Dinner 2015 @AyrshireChamber _E_ Heading to Scotland to check out Turnberry & Trump Int'l Golf Links Scotland. Then heading to Dubai @DamacOfficial a great company. _E_ If you don't do your part don't blame God. Billy Sunday _E_ Glad to hear @BarackObama's attack ad featuring my plane is playing in North Carolina. Free ad time for Trump National in Charlotte! _E_ The opening of Trump Turnberry in Scotland was a big success. Good timing I was here for BREXIT. Very exciting news conference today! _E_ Speaking to great patriots @MCC_CT. My first visit to Granite State since declaring my candidacy! 
#FITN __HTTP__ _E_ The Ultimate Merger: __HTTP__ 06 17_omarosa_is_back_and_this_time_its_personal.html _E_ "The entrepreneur's ability to dream to win lose and win again and again is often called the entrepreneurial spirit." – Midas Touch _E_ No matter what Bill Clinton says and no matter how well he says it the phony media will exclaim it to be incredible. Highly overrated! _E_ Via @Newsmax_Media by @wandacarruthers: Trump: 'Inconceivable' Obama didn't know about ISIS threat __HTTP__ _E_ Get out and VOTE tomorrow! We will MAKE AMERICA GREAT AGAIN! #CTPrimary #DEPrimary #MDPrimary #PAPrimary #RIPrimary __HTTP__ _E_ Why isn't anyone using the @CNN Iowa Poll with me having a big lead. They only want to use the one negative poll (2nd place).Dishonest press _E_ .@bobvanderplaats is a total phony and dishonest guy. Asked me for expensive hotel rooms free (and more). I said pay and he endorsed Cruz! _E_ Thank you Council Bluffs Iowa! Will be back soon. Remember everything you need to know about Hillary just... __HTTP__ _E_ Via @WSJ: "The ObamaCare Awakening: Americans are losing their coverage by political design." __HTTP__ _E_ I was interviewed by Greta Van Susteren today here at Trump Tower. Tune in tonight on Fox News at 10 p.m.... (cont) __HTTP__ _E_ Interview w/Melanie Batley via Newsmax __HTTP__ _E_ We should have taken the oil in Iraq and now our mortal enemies have got it and with no opposition. Really dumb U.S. pols! I'm so angry! _E_ Can you believe they are blaming @MittRomney for Egypt. _E_ With all of the illegal acts that took place in the Clinton campaign & Obama Administration there was never a special counsel appointed! _E_ 'BuzzFeed Runs Unverifiable Trump Russia Claims' #FakeNews __HTTP__ _E_ Looking forward to meeting with @SenBobCorker in a little while. We will be traveling to North Carolina together today. _E_ .@JudgeJeanine Tonight at 9 P.M. on @FoxNews ENJOY! _E_ For America to be strong again the ways of politicians must be put in the past. 
Let's Make America Great Again! __HTTP__ _E_ Filming of the record 13th season of @CelebApprentice has started. Be sure to be on the lookout for future updates. _E_ 2012 is the most important election of my lifetime. @BarackObama must be defeated. _E_ We allow Japan to sell us millions of cars with zero import tax and we can't make a trade deal with them our country is in big trouble! _E_ My new club on the Atlantic Ocean in Ireland will soon be one of the best in the World and no one will be looking into ugly wind turbines! _E_ Movie producer Harvey Weinstein who lost his company to Colony Capital is against guns but makes movies w/ major gun violence really! _E_ BREAKING Border security rally in Phoenix AZ at 2PM MST has been moved to @PhoenixConvCtr! Build a wall! Let's Make America Great Again! _E_ The Democrat Governor.of Minnesota said The Affordable Care Act (ObamaCare) is no longer affordable! And it is lousy healthcare. _E_ Is the Boston killer eligible for Obama Care to bring him back to health? _E_ President Xi of China has stated that he is upping the sanctions against #NoKo. Said he wants them to denuclearize. Progress is being made. _E_ FBI Director Comey was the best thing that ever happened to Hillary Clinton in that he gave her a free pass for many bad deeds! The phony... _E_ .@ScottWalker despite your coming to my office to give me an award your very dumb fundraiser hit me very hard not smart! _E_ Ft. Hood Jihadi Nidal Hassan has been paid over $300g in Army salary while on trial. His victims are deprived of any benefits... _E_ Sexting Pervert @anthonyweiner has returned to twitter. Parents of all underage girls should BLOCK him immediately! _E_ Obama is not a leader he's just a campaigner! _E_ Great job by MichaelCaputo on @foxandfriends. _E_ Call it any way you like but Snowden is a traitor. When our country was great do you know what we did to traitors? 
_E_ Memorial Day is a time to honor our nation's finest who made the ultimate sacrifice for our freedom. God bless them all. _E_ President Obama missed the deadline! _E_ The 250 million dollar construction of Trump Nationsl Doral is coming along great. Just left Miami where I toured entire project.AMAZING! _E_ Just read that Trump has the largest (and I add most enthusiastic) crowds. Tonight I will be in New Hampshire the place will be packed! _E_ Refloating the Costa Concordia for many hundreds of millions of $'s is ridiculous. Should have taken it apart in small pieces save fortune _E_ Our $16T national debt is now bigger than our $15T GDP. If Obama is re elected watch for an economic meltdown in 2013. _E_ We have done a great job with the almost impossible situation in Puerto Rico. Outside of the Fake News or politically motivated ingrates... _E_ Hillary Clinton answered email questions differently last night than she has in the past. She is totally confused. Unfit to serve as #POTUS. _E_ Here are Hillary Clinton's accomplishments at the State Department.#Debates2016 #RattledHillary __HTTP__ _E_ I will be in Palm Beach Jupiter and Miami today checking on big construction projects. I love Florida and love on time and on budget const _E_ I'll be on @seanhannity tonight at 10 PM and look forward to it. Lots to discuss! Enjoy. _E_ Now America knows the Emperor has no clothes. Why would Obama do better in a 2nd debate? #Debate #Obama _E_ ... I never felt that I could let up for a moment. Harry S. Truman _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ I just want to know how much is Saudi Arabia and others who we are helping willing to pay for our saving from total extinction. Pay up now! _E_ Lightweight Attorney General Eric Schneiderman will be next to lose. He goes after a school with a 98% approval ratingleaves biggies alone _E_ Great new campaign ad just released by @MittRomney __HTTP__ _E_ Great evening in San Jose other than the thugs. 
My supporters are far tougher if they want to be but fortunately they are not hostile. _E_ Dummies left Iraq without the oil not believable! _E_ Capitalism doesn't guarantee success only a chance to succeed. The community organizer @BarackObama doesn't (cont) __HTTP__ _E_ ISIS is operating a training camp 8 miles outside our Southern border __HTTP__ We need a wall. Deduct costs from Mexico! _E_ #NeverTrump is never more. They were crushed last night in Cleveland at Rules Committee by a vote of 87 12. MAKE AMERICA GREAT AGAIN! _E_ If Snowden was such a hero then he would be in America. He is escaping justice! _E_ Looks like plane may have been found in the Indian Ocean off the coast of Australia. _E_ I just had a great victory against lightweight A.G. Eric Schneiderman. Most of his case re Trump U. was thrown out or gutted. Little remains _E_ Remember when the failing @nytimes apologized to its subscribers right after the election because their coverage was so wrong. Now worse! _E_ The Republicans will get zero credit for passing immigration reform—and I said zero! _E_ I am not just running against Crooked Hillary Clinton I am running against the very dishonest and totally biased media but I will win! _E_ Watch commodity prices soar because of the freezing cold. Will be bad for the economy. We could use some global warming. _E_ Wow my campaign is hearing from more and more Bernie supporters that they will NEVER support Crooked Hillary. She sold them out V.P. pick! _E_ In Austin Texas with some of our amazing Border Patrol Agents. I will not let them down! __HTTP__ __HTTP__ _E_ LIVE FACT CHECK: Trump's RIGHT. The Clinton Foundation has taken MILLIONS from the Middle East. #DrainTheSwamp __HTTP__ _E_ I will be on @CNNSitRoom with @wolfblitzer from 5 7pm est. on @CNN. _E_ Thank you Piers. __HTTP__ _E_ Via @australian: Trump empire planning to build a presence in Sydney __HTTP__ _E_ Entrepreneurs: Practice positive thinking with a lot of reality checks. 
Know that goals come with obstacles. _E_ Many people are saying that my challenge to Obama is having a huge negative effect on his poll numbers I agree. _E_ National Pearl Harbor Remembrance Day "A day that will live in infamy!" December 7 1941 _E_ My heart & prayers go out to all of the victims of the terrible #Brussels tragedy. This madness must be stopped and I will stop it. _E_ Spoke at the Congressional @GOP Retreat in Philadelphia PA. this afternoon w/ @VP @SenMajLeader @SpeakerRyan. Th... __HTTP__ _E_ I think having Jeb's endorsement hurts Lyin' Ted. Jeb spent more than $150000000 and got nothing. I spent a fraction of that and am first! _E_ .@JebBush just took millions of $'s in special interest money to look like a tough guy. Will never work! _E_ Via @DMRegister by @JoelAschbrenner: Trump to speak at @LandExpo in West Des Moines __HTTP__ _E_ Wind farms are killing many thousands of birds. They make hunters look like nice people! _E_ #CelebrityApprentice contestant @LouFerrigno stopped by to visit today __HTTP__ _E_ Another company that the DOE has given money to just filed for bankruptcy. This is how the money we borrow at 40% from China is wasted. _E_ Who do you want negotiating for us? __HTTP__ _E_ The American gymnastic team was great our country should take their lead. _E_ Obama lied to the public about the Al Qaeda attack on our consulate in Libya. He should be held accountable. _E_ Why does @mcuban continue to embarrass the 31 35 & 11TH place @dallasmavs with childish behavior? Really unprofessional! _E_ .@RepTomMarino Great job on television this morning. Glad to have you on my side! _E_ John Kerry is openly celebrating the tenuous nuclear deal with Iran. Great dealmakers do not celebrate dealsthey just go on to the next one _E_ Will be interviewed on @FaceTheNation with @JDickerson tomorrow at 10:30am EST. Enjoy! _E_ ICYMI @IvankaTrump's int. 
on @TODAYshow discussing @Joan_Rivers & contestant rivalries on @ApprenticeNBC __HTTP__ _E_ Join my team over on my Facebook page live now! #Debates __HTTP__ __HTTP__ _E_ Happy Veterans Day to ALL in particular to the haters and losers who have no idea how lucky they are!!! _E_ Trump National Golf Club Washington D.C. is situated on 600 acres overlooking the Potomac River. Beautiful! __HTTP__ ... _E_ How much longer are we expected to put up with the world's most incompetent leader ObamaCare Iran Syria bads deals. JUST NEVER ENDS _E_ RT @DanScavino: .@realDonaldTrump stops by overflow room in Mechanicsburg Pennsylvania prior to main rally. #TrumpMovement #MAGA __HTTP__ _E_ Israel is being barraged by rockets from Gaza recently. They must respond accordingly in defense of their citizens. _E_ We are going to defend our industry & create a level playing field for the American worker. It is time to put... __HTTP__ _E_ Oil is rising back over $100 barrel. OPEC loves to rip us off. Why shouldn't they they always get away with it. _E_ .@AROD is back on the DL. The coming suspension will be announced soon by @MLB. _E_ I've realized that success requires 100% effort and 100% focus. Nothing less. _E_ There is nothing nice about searching for terrorists before they can enter our country. This was a big part of my campaign. Study the world! _E_ One of the best produced including the incredible stage & set in the history of conventions. Great unity! Big T.V. ratings! @KarlRove _E_ Noisy windfarm driving community crazy! __HTTP__ @AlexSalmond @AberdeenCC @AberdeenshireCC _E_ Discipline is a key ingredient for success. It will build character motivation and bring opportunity. _E_ The new NBC POLL has me in first place but said I was third in the debate I demand a recount (just kidding!). EVERY other poll had me #1. _E_ Dow dives more than 500 points down 9% from high. Be careful! _E_ The CBO has predicted that unemployment will rise to 8.8% this next year. 
__HTTP__ This is @BarackObama's economic recovery. _E_ What is our country coming to when a judge can halt a Homeland Security travel ban and anyone even with bad intentions can come into U.S.? _E_ 30 million Americans are unemployed yet Obama has set up workshops across the country for illegals to get Amnesty __HTTP__ _E_ The @ForbesInspector & @AAAnews 5 star restaurant @TrumpNewYork's @Jean_GeorgesNYC is NYC's top destination __HTTP__ _E_ RT @EricTrump: Tune into @GMA right now to catch a great interview with my father & the entire family! #VoteTrumpPence16 __HTTP__ _E_ Hillary Clinton spokesperson admitted that their was no ISIS video of me. Therefore Hillary LIED at the debate last night. SAD! _E_ Our economy is struggling and OPEC continues to rip us off. Output is low and the price is too high. They ar... (cont) __HTTP__ _E_ Marco Rubio is being crucified by the media for drinking water during speech! _E_ "If it's worth doing it's worth fighting for. You'll have lots of people and obstacles in your way. Work & fight to get beyond them. _E_ Residential Capital a company in which Warren Buffett is involved went bankrupt but that doesn't mean that Warren Buffett went bankrupt! _E_ Champions aren't made in the gyms. Champions are made from something they have deep inside them a desire a dream a vision. Muhammad Ali _E_ Thank you Sparks Nevada!#VoteTrumpNV #NevadaCaucus Finder: __HTTP__ __HTTP__ _E_ Heading over to the @UN to meet with Ambassador @NikkiHaley and all of her great representatives! #USA _E_ Do not underestimate yourself and know you are able to handle what comes your way by increasing your leverage. _E_ #CelebrityApprentice @arsenioofficial "trying to be invisible"? No way that's going to happen. #sweepstweet _E_ Thank you Nashua New Hampshire! 
#MakeAmericaGreatAgain #Trump2016 #NHPolitics #FITN __HTTP__ __HTTP__ _E_ Via @Suntimes: Trump wins at trial calls woman suing him 'horrible human being' __HTTP__ _E_ When I said in an interview that Putin is not going into Ukraine you can mark it down I am saying if I am President. Already in Crimea! _E_ I gave a woman named Barbara Res a top N.Y. construction job when that was unheard of and now she is nasty. So much for a nice thank you! _E_ Obama's 2014 budget "eyes $1 trillion hike in tax revenue" __HTTP__ He loves taxes. T E A. Taxed Enough Already. _E_ We have to bring back and cherish the middle class once the backbone and true strength of the U.S.A. It can happen! _E_ A regular part of your day should be devoted to expanding your horizons. Learning is a new beginning. _E_ I had a fantastic time with @jacknicklaus at the grand opening of the great @TrumpFerryPoint. Watch the video __HTTP__ _E_ Will Smith did a great job by smacking the guy reporter who kissed him on the lips at a red carpet event. (cont) __HTTP__ _E_ Sorry losers and haters but my I.Q. is one of the highest and you all know it! Please don't feel so stupid or insecureit's not your fault _E_ BIG @MittRomney is preferred to handle the economy over @BarackObama by 63% 29% in a @gallupnews poll __HTTP__ _E_ RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_ Great meeting with @HouseGOP and @SenateGOP leaders including @SpeakerRyan @SenateMajLdr @GOPLeader @JohnCornyn... __HTTP__ _E_ "Tomorrow hopes we have learned something from yesterday." John Wayne _E_ China is getting minerals from Afghanistan __HTTP__ We are getting our troops killed by the Afghani govt't. Time to get out. _E_ I will be doing #GDNY Good Day N.Y. with Rosanna &Greg live at 8.30 A.M. I will be giving money to a great guy who lost his son in the WTC. _E_ Captain Khan killed 12 years ago was a hero but this is about RADICAL ISLAMIC TERROR and the weakness of our leaders to eradicate it! 
_E_ Crooked Hillary Clinton got Brexit wrong. I said LEAVE will win. She has no sense of markets and such bad judgement. Only a question of time _E_ Ralph Northam will allow crime to be rampant in Virginia. He's weak on crime weak on our GREAT VETS Anti Second Amendment.... _E_ Peaceful protests are a hallmark of our democracy. Even if I don't always agree I recognize the rights of people to express their views. _E_ Definitely watch @Carl_C_Icahn 's 'Danger Ahead'. Very insightful particularly on how corp inversions hurt America: __HTTP__ _E_ See story in Fusion and Huff. Post about rape at the border. Beyond terrible! Isn't Fusion owned by Univision? _E_ Trump Tower Punta Del Este features the Trump Organization's signature superior quality detail & perfection __HTTP__ _E_ I will not be commenting on boardroom specifics would be unfair to the different time zones. #CelebApprentice _E_ Attitude is a little thing that makes a big difference. Winston Churchill _E_ Nasty for the middle class electricity prices surged to an all time high this past March __HTTP__ FRACK NOW _E_ Once John Kasich announced he was running for president and opened his mouth people realized he was a complete & total dud! _E_ It is now commonly agreed after many months of COSTLY looking that there was NO collusion between Russia and Trump. Was collusion with HC! _E_ .@NYMag is a piece of garbage but I think it is very nice & charitable that they employ the no talent illiterate hack @jonathanchait. _E_ Governor Cuomo is right about one thing Attorney General Eric Schneiderman does wear eyeliner! What the hell is up with him? _E_ Heading to Joint Base Andrews on #MarineOne with Prime Minister Shinzō earlier today. __HTTP__ _E_ RT @RealJamesWoods: Only now with a #RealPresident do we see the scope of destruction engineered by #Obama and the #Democrat cabal. @realD... _E_ Will be on Meet the Press with @ChuckTodd tomorrow morning. Enjoy! 
_E_ Why aren't people looking at this reporters earliest statement as to what happened that is before she found out the episode was on tape? _E_ Thank you to all of my Twitter followers for helping to defeat Weiner and Spitzer. Remember in the beginning they said it couldn't be done! _E_ Congratulations to Linda McMahon on her victory in the Connecticut Senate primary. She is an amazing woman smart as you get! @Linda_McMahon _E_ President Obama is the best thing that ever happened to Jimmy Carter! _E_ My Administration Governor @RicardoRossello and many others are working together to help the people of Puerto Rico in every way... _E_ RT @seanspicer: .@timkaine wants to tough on crime fails to talk about defending rapists and murders #VPDebate _E_ Via @chicagotribune by @bob_writes: "@TrumpChicago tower unit sets resale record at $3.99M" __HTTP__ _E_ Congratulations to Boston on the @RedSox World Series victory. Earned and deserved. _E_ Mar a Lago my club in Palm Beach and one of the greatest mansions ever built has been nominated as one of (cont) __HTTP__ _E_ One of the most expensive projects ever in Miami @TrumpDoral's $200M of renovations are right on schedule. When completed will be elite! _E_ I will be interviewed by Chris Wallace on Fox tomorrow morning. Tune in! _E_ .@MiamiHerald discusses our @TrumpCollection #TrumpPets program @TrumpDoral: __HTTP__ _E_ Columbia University stated there was a computer error in their system concerning @BarackObama's attendance. (cont) __HTTP__ _E_ Perhaps a new meeting will be set up with the @nytimes. In the meantime they continue to cover me inaccurately and with a nasty tone! _E_ Under President Obama do you think America will become a THIRD WORLD COUNTRY? _E_ In 2010 alone our trade deficit with China cost over 566000 jobs __HTTP__ This is unsustainable for the American worker. _E_ The government will spend over $3.8T this year. The sequester is a pittance of the outlays less than 2%. Where's the problem? 
_E_ Thank you @NFIB together we will #MakeAmericaGreatAgain! __HTTP__ _E_ Our debt is about to reach $17T. Iraq has $20T in oil reserves. Interesting. _E_ My @FoxNews interview @seanhannity discussing Obama's failed presidency Ebola DC Post Office midterms & 2016 __HTTP__ _E_ Everyone is laughing at the @nytimes for the lame hit piece they did on me and women.I gave them many names of women I helped refused to use _E_ Both candidates are looking sharp now it's up to the mouth and the mind. #VPDebate _E_ We had all the leverage in our nuclear negotiations with Iran and our leaders foolishly decided to let them out of the trap. WHY? _E_ Loser terrorists must be dealt with in a much tougher manner.The internet is their main recruitment tool which we must cut off & use better! _E_ Keep testing your limits. Never become complacent. Always think big! _E_ The $1B failed website is the tip of the iceberg on the ObamaCare. Over 90 million estimated will lose their plans next year. _E_ ICYMI @IvankaTrump's @waytooearly int. w/@ThomasARoberts on @ApprenticeNBC's firingsTrump Int'l DC & @MissUniverse __HTTP__ _E_ ...confidence that President Al Sisi will handle situation properly. _E_ An honor to join the @FaithandFreedom Coalition yesterday. In America we don't worship government. We worship God.... __HTTP__ _E_ Good move by @MSNBC in downgrading @WeGotEd to a dead weekend spot. This is truly a guy who shouldn't be on tv. _E_ A lot changed when David Letterman said he was probably born in this country the word probably is a total disaster for Obama. _E_ .@mike_pence is doing a great job so far no contest! _E_ Seal the deal! Hold your business meeting at the luxurious @TrumpNewYork Executive Board Room __HTTP__ _E_ Canada's PM was in China last week brokering a deal to sell the oil @BarackObama rejected in Keystone. __HTTP__ Unbelievable! _E_ President Andrew Jackson who died 16 years before the Civil War started saw it coming and was angry. Would never have let it happen! 
_E_ Donald Trump tops Franklin Pierce/Herald poll at 28 percent in N.H. __HTTP__ _E_ What do you think so far? #CelebApprentice _E_ Today is National Prescription Drug Take Back Day. Everyone can help fight the #OpioidEpidemic by participating! __HTTP__ __HTTP__ _E_ Senator Landrieu If you are a Senator representing Louisiana then you SHOULD own a home in the state. Send @BillCassidy to the Senate! _E_ What a foolish move by @davidaxelrod to speak in Boston yesterday! Completely outmaneuvered by the @MittRomney campaign. _E_ Mexican leadership has been laughing at us for many years but now it's no longer laughter—it's disbelief... _E_ "Runaway Obamacare Spending Will Cost Democrats" __HTTP__ via @BloombergView by @lanheechen _E_ How much money is the extremely unattractive (both inside and out) Arianna Huffington paying her poor ex hubby for the use of his name? _E_ I LIVE IN NEW JERSEY & @realDonaldTrump IS RIGHT: MUSLIMS DID CELEBRATE ON 9/11 HERE! WE SAW IT! __HTTP__ _E_ It's Thursday. How much time did Washington waste today trying to find a solution on the so called fiscal cliff? _E_ WE WILL ONLY BE THE LAND OF THE FREE AS LONG AS WE ARE HOME OF THE BRAVE! _E_ I will be on Fox & Friends tomorrow morning at 7. Will be discussing basic stupidity and incompetence of which our leaders have plenty! _E_ Get on Trump's List email from the RNC was not authorized. I am self funding my campaign! Do not pay. Email: __HTTP__ _E_ Via @PeoplesCompany: Real Estate Magnate Donald J. Trump to Headline 2015 @LandExpo in West Des Moines Iowa __HTTP__ _E_ Nobody could have done what I've done for #PuertoRico with so little appreciation. So much work! __HTTP__ _E_ I wonder if @megynkelly and her flunkies have written their scripts yet about my debate performance tonight. No matter how well I do bad! _E_ My honor thank you. __HTTP__ _E_ Jeb Bush just got contact lenses and got rid of the glasses. He wants to look cool but it's far too late. 1% in Nevada! 
_E_ RT @ColumbiaBugle: @realDonaldTrump @FLOTUS President Trump greeting families affected by Hurricane Harvey. #TexasStrong __HTTP__ _E_ .@DarrellIssa is a very good man. Help him win his congressional seat in California. _E_ Don't believe the manipulated job numbers. Walmart has just cut orders with suppliers because of rising inventory. _E_ Product placement is a definite prerequisite. #sweepstweet _E_ I love the Mexican people but Mexico is not our friend. They're killing us at the border and they're killing us on jobs and trade. FIGHT! _E_ The animal who beheaded the woman in Oklahoma should be given a very fast trial and then the death penalty. The same fate beheading? _E_ ...Americans do what we do best: we pull together. We join hands. We lock arms and through the tears and the sadness we stand strong... __HTTP__ _E_ .@MonicaCrowley you were GREAT on @seanhannity tonight. Thank you for the nice words! _E_ Join me in Florida on Wednesday! Daytona & Jacksonville:Daytona \ 3pm __HTTP__ | 7pm __HTTP__ _E_ Thank you Cadillac Michigan! #VoteTrumpMI on 3/8/2016. We will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ "The best luck of all is the luck you make for yourself." – General Douglas MacArthur _E_ Hillary Clinton's short speech is pandering to the worst instincts in our society. She should be ashamed of herself! _E_ Just completed purchase of magnificent Ritz Carlton in Jupiter Florida. Will be renamed Trump National Golf Club & be tremendous success. _E_ Gabriel Aubry should learn how to fight—he became a punching bag. Always drama with Halle B! _E_ Inside 'Bill Clinton Inc.': Hacked memo reveals intersection of charity and personal income. #DrainTheSwamp! __HTTP__ _E_ On immigration I'm consulting with our immigration officers& our wage earners. Hillary Clinton is consulting with Wall Street. _E_ While Jon Stewart is a joke not very bright and totally overrated some losers and haters will miss him & his dumb clown humor. Too bad! 
_E_ .@tomhanks was fabulous in @LuckyGuyPlay last night—as was the entire cast. _E_ The U.S. has enough problems without publicity seekers going out and openly mocking religion in order to provoke attacks and death. BE SMART _E_ The failing @nytimes has gone nuts that Crooked Hillary is doing so badly. They are willing to say anything has become a laughingstock rag! _E_ "Everyone's dream can come true if you just stick to it and work hard." @serenawilliams _E_ 150 Clinton E mails still contain classified information. More sensitive when she was Sec.of State. This is a very big deal. _E_ THANK YOU Atlanta Georgia! Leaving for Nevada now. Lets MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ __HTTP__ _E_ Join me in Pensacola Florida this Friday at 7pm! #VoteTrump __HTTP__ __HTTP__ _E_ Why is oil at a record high? OPEC & the oil speculators continue to rip us off. _E_ The train accident that just occurred in DuPont WA shows more than ever why our soon to be submitted infrastructure plan must be approved quickly. Seven trillion dollars spent in the Middle East while our roads bridges tunnels railways (and more) crumble! Not for long! _E_ Monday morning 7:30 AM I'll be on @foxandfriends. Tune in! _E_ Mike & Mike in one minute! _E_ Did the poor but smart to leave ex husband of @ariannahuff get any of the dollars she got for the use of his name in really stupid AOL deal? _E_ My @foxandfriends interview re: @IvankaTrump's pregnancy my grandchildren Obama's 18% tax rate & Obamacare __HTTP__ _E_ Good @marcorubio is trying to eliminate the tax on Olympic medals __HTTP__ Our athletes should not be taxed on their wins. _E_ RT @seanhannity: Graph: @RealDonaldTrump's Historic 13 Million Primary Votes Compared To Every GOP Nominee Since 1908 __HTTP__ _E_ Real estate taxes are far too high @BriarcliffManor Westchester. A total joke how they waste money! Replace Mayor Vescio. 
_E_ 70 stories above Panama Bay @TrumpPanama the majestic sail design is Central America's architectural icon __HTTP__ _E_ The Rust Belt was created by politicians like the Clintons who allowed our jobs to be stolen from us by other countries like Mexico. END! _E_ Thank you @TIME readers a great honor! __HTTP__ _E_ Democrat Congresswoman totally fabricated what I said to the wife of a soldier who died in action (and I have proof). Sad! _E_ Saudi Arabia was vehemently against the Iran nuclear deal. Then today they embraced it. What happened? What did we give them to endorse? _E_ Congrats @SixteenChicago's @ChefLents on your Chef of the Year nom in @EaterChicago Annual Eater Awards Vote now! __HTTP__ _E_ Looking forward to speaking at Saturday's @Citizens_United @AFPhq Freedom Summit in Manchester. Second visit to New Hampshire this year. _E_ As I always said the Birthers were after the truth. Thanks to @RealSheriffJoe @BarackObama can't hide anymore. _E_ WOW! __HTTP__ _E_ I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Congress must repeal ObamaCare. Obama will veto while Americans continue to lose their doctors & pay rising premiums. _E_ JOIN ME TOMORROW!MINNESOTA 2pm __HTTP__ 6pm __HTTP__ 9:30p... __HTTP__ _E_ FAKE NEWS A TOTAL POLITICAL WITCH HUNT! _E_ My @foxandfriends interview from yesterday __HTTP__ _E_ Remember this @BarackObama told @GStephanopoulos in 09 that it is not true that the individual mandate is a tax __HTTP__ _E_ Thank you America! #Trump2016 __HTTP__ __HTTP__ _E_ The losing team is now back in boardroom. I can't discuss the team members or what's going on or what happens from here on out. _E_ Our deficit spending is China's gain. @BarackObama is bankrupting our country. _E_ Thank you @JCLayfield will get even better as my Administration continues to put #AmericaFirst __HTTP__ _E_ Benghazi. Obama lied. Our people died. _E_ A lovely letter from the daughter of the late great John Wayne. 
Our country could use a John Wayne right now. __HTTP__ _E_ Bernie Sanders was right when he said that Crooked Hillary Clinton was not qualified to be president because she suffers from BAD judgement! _E_ Congratulations to Jim Herman my ass't golf pro at Trump Nat'l Golf Club/Bedminster NJ for qualifying for the U.S. Open! @usopengolf _E_ Great shots of @TrumpTowerNY #CelebApprentice _E_ Had a record crowd in Boone Iowa. A fantastic day we will #MakeAmericaGreatAgain __HTTP__ _E_ I really like Nelson Mandela but South Africa is a crime ridden mess that is just waiting to explode not a good situation for the people! _E_ Always enjoy appearing on @extratv. @MarioLopezExtra & @mariamenounos were terrific yesterday. _E_ Hillary Clinton has been working on solving the terrorism problem for years. TIME FOR A CHANGE I WILL SOLVE AND FAST! _E_ If you want to be a success you have to get used to frequently hearing the word no and ignoring it. Think Big _E_ Just bought Doral Hotel & Country Club in Miami within two years it will be the best resort in the country. _E_ THe Art of the Deal The best thing you can do is deal from strength and leverage is the biggest strength you have. CUT CAP and BALANCE. _E_ I just got off the phone with the great people of Guam! Thank you for your support! #VoteTrump today! #Trump2016 _E_ Exclusive interview w/ my wife @MELANIATRUMP tomorrow morning @ 8amE on @Morning_Joe w/ @morningmika @MSNBC. Enjoy! __HTTP__ _E_ RT @USCGSoutheast: .@USCG crews worked together with the @RedCross @fema and members of local #police #fire and #government to distribut... _E_ Glad to hear Mariano Rivera is going to make a comeback in 2013. He is a true sportsman and a great competitor. __HTTP__ _E_ .@canoetravel Also the very obsolete ugly and expensive wind turbines will never be build in Aberdeen. No longer works. @GolfChannel _E_ ICYMI @MELANIATRUMP Reading newspapers and see... 
#BillyGraham95 #happybirthday @BillyGraham __HTTP__ _E_ My team of deplorables will be managing my Twitter account for this evenings debate. Tune in!#DebateNight #TrumpPence16 _E_ Summer's almost here update your business wardrobe with Trump Signature Collection exclusively available @Macys __HTTP__ _E_ Heading to Baton Rouge Louisiana for a speech. Expecting a very large crowd! See you soon. #Trump2016 #MakeAmericaGreatAgain _E_ Hillary Clinton is weak on illegal immigration among many other things. She is strong on corruption corruption is what she's best at! _E_ Thanks to all for the wonderful congratulation sent to me on the birth of Ivanka's little boy so nice! _E_ Lightweight @AGSchneiderman the worst attorney general in the US is in a tough election with John Cahill @CahillForAG _E_ I like Russell Brand but Katy Perry made a big mistake when she married him. Let's see if I'm right I hope not. _E_ You can take the smartest kid at Wharton the one who gets straight A's and has a 170 IQ and if he doesn't (cont) __HTTP__ _E_ Realize that being an entrepreneur is not a group effort. You're in charge everything starts with you. _E_ "Know from the inside out that you have the power to succeed and you will." – Think Like a Champion _E_ Happy Veterans Day. To those who have served thank you for your special work. _E_ Dopey @chicagotribune critic fails to mention the ugly Sun Times sign. _E_ Another attack in London by a loser terrorist.These are sick and demented people who were in the sights of Scotland Yard. Must be proactive! _E_ I guess they don't have freedom of the press in Scotland. We created this ad and the ASA would not allow us to (cont) __HTTP__ _E_ Great interview in @postedtoronto of @DonaldJTrumpJr: He makes me proud. __HTTP__ _E_ "90% of Trump 2017 news coverage was negative" and much of it contrived!@foxandfriends _E_ Doing interview today with Maria Bartiromo at 10:00 A.M. on @FoxNews ENJOY! _E_ Hillary & Obama's Broken Promises. 
#RepealObamacare __HTTP__ _E_ and yet another ...all of them are spectacular. __HTTP__ _E_ Welcome to the new reality! Moody's just downgraded the entire US health insurance industry because of ObamaCare. _E_ Will be speaking to President Recep Tayyip Erdogan of Turkey this morning about bringing peace to the mess that I inherited in the Middle East. I will get it all done but what a mistake in lives and dollars (6 trillion) to be there in the first place! _E_ Thanks @renee2i for hosting me tomorrow at the Two International Group! Looking forward to making new friends & discussing #FITN topics. _E_ My interview with @gretawire on Fox News for those who missed it 'Obama's Constantly on Vacation' __HTTP__ _E_ My @FoxNews interview w/@seanhannity __HTTP__ _E_ .@SenTedCruz Ted free legal advice on how to pre empt the Dems on citizen issue. Go to court now & seek Declaratory Judgment you will win! _E_ Sugar: @Lord_Sugar Keep working hard so I make plenty of $ with your show... _E_ Obama won't send troops to fight jihadists yet sends them to Liberia to contract Ebola. He is a delusional failure. _E_ .@BlairKamin Blair you may be the worst architectural critic in the business but thanks for your nice reviews about Trump Chicago & sign PR _E_ Trump @DoralResort is hosting the WGC @cadillacchamp from March 6th – 10th. Join me I will be there all four days. _E_ Great purchase in Ireland will be a top spot! __HTTP__ _E_ The three political disasters could lead to a major and complete political meltdown! _E_ Via @postandcourier: The Donald at @TheCitadelOEA __HTTP__ _E_ RT @paultdove: @FoxBusiness Republican Senators who are opposing the President look at the great economic news: Americans Are Noticing! _E_ Enjoyed my visit to Trump Doral in Miami yesterday. Looking forward to returning for the WGC @Cadillac Championship on March 6th 10th... _E_ By not doing the failed poorly rated debate I was able to make the point of not allowing unfairness while raising $6000000 for VETS. 
_E_ Loved being in Manassas VA last night. Such incredible spirit! Now in DC for a speech will then visit Old Post Office under construction. _E_ You can find your polling locations at: __HTTP__ #FITN #NHPrimary #VoteTrumpNH __HTTP__ _E_ Thanks Go Angelo people are now really aware of my ties shirts and cuff links at Macy's _E_ When your secretary of defense tells you that your proposed cuts will erode America's military capability you (cont) __HTTP__ _E_ My honor. __HTTP__ _E_ Why is @AlexSalmond pursuing environment destroying windmills when @VattenfallGroup quit because of no (cont) __HTTP__ _E_ America's debt crisis is our country's greatest challenge. Spending must be curbed for our long term fiscal future. _E_ Detroit is going through very hard times right now.. If they are smart brighter days are ahead. _E_ OPEC will use yesterday's attacks on our embassies to raise the price of gas. They are always ripping us off. _E_ China is so brazen that they now give us economic advice they tell us what to do much like a strong stockh... (cont) __HTTP__ _E_ Will be on @foxandfriends now. Enjoy! _E_ Jack Welch thinks Sam Palmisano retired CEO of IBM should be the next CEO of MICROSOFT. Interesting! _E_ What did you think of the boardroom? #CelebApprentice _E_ The world is noticing thanks! __HTTP__ _E_ .@NeneLeakes seeks my advice on prenups tonight at 9 PM on Bravo _E_ Tomorrow on the @MissUniverse Facebook page submit your final question for the contestants __HTTP__ _E_ Everybody tells me not to hit back at the lowlifes that go after me for PR sorry but I must. It's my nature. _E_ We stand in absolute solidarity with the people of the United Kingdom. __HTTP__ _E_ "Donald Trump offers political advice to Palm Beach Republicans" __HTTP__ via @SunSentinel _E_ Will be doing a live Thanksgiving Video Teleconference with Members of the Military at 9:00 A.M. Afghanistan Iraq USS Monterey Turkey & Bahrain. Then going to Coast Guard Quarters Florida. 
_E_ Watch Obama push major global warming legislation early in his second term... _E_ I am interviewed on the @oreillyfactor tonight at 8:00. Then at 10:00 I am interviewed by @donlemon on @CNN. Enjoy! _E_ A 40mph gust of wind wrecked a wind turbine in Scotland __HTTP__ Any turbine in close proximity to a school must go! _E_ What is a bit appealing about this idea of Trump hosting a debate is consider the diverse audience that perh... (cont) __HTTP__ _E_ What did you think of my decision? #CelebApprentice _E_ Convenient David Plouffe collected $100G fee from Iranian affiliate only a month before joining @whitehouse __HTTP__ _E_ Make sure you're registered to vote! Let's #MakeAmericaGreatAgain! We can't afford more years of FAILURE! All info:... __HTTP__ _E_ Get ready for tonight! _E_ Remember when comedian Bill Maher openly praised the disgusting terrorists who destroyed the World Trade Center then got canned by ABC? _E_ I cannot believe that Apple didn't come out with a larger screen IPhone. Samsung is stealing their business. STEVE JOBS IS SPINNING IN GRAVE _E_ Thank you Hilton Head South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Is it legal for a sitting President to be wire tapping a race for president prior to an election? Turned down by court earlier. A NEW LOW! _E_ Jury was unanimous after hearing the made up case against my co. Filed many years ago she.and her pathetic lawyer should pay me big damages _E_ The @EricTrumpFdn event featured a performance by #CelebApprentice @JohnRich a great event for a great cause! Watch __HTTP__ _E_ Via @BreitbartNews by @mboyle1: Exclusive: Trump To Address South Carolina Tea Party Convention __HTTP__ _E_ Both Washington D.C. and DALLAS are turning out to be really big events. D.C. is protest of incompetent Iran deal and Dallas is big speech! _E_ A single Ebola carrier infects 2 others at a minimum. STOP THE FLIGHTS! NO VISAS FROM EBOLA STRICKEN COUNTRIES! _E_ I will be on Fox & Friends at 7.00 (20 minutes). 
Plenty of terrible and tragic news to talk about! Too bad. _E_ Thank you @foxandfriends great show! _E_ I agree @MMFlint To all Americans I see you & I hear you. I am your voice. Vote to #DrainTheSwamp w/ me on 11/8. __HTTP__ _E_ RT @TXMilitary: #PhotosFromTheField: Aerial photos from our rescue crews earlier today. #Harvey #TMDHarvey @USNationalGuard __HTTP__ _E_ Leaving for North Carolina. Big crowd will be fun! _E_ Great evening last night in New Hampshire. Got the endorsement from the New England Police Union big territory great people! Thank you. _E_ "I don't see the point of being politically correct if that means actually being incorrect." – Donald J. Trump 'Midas Touch' _E_ Snowden is a traitor and a disgrace. Make no mistake he is no hero. In fact he is a coward who should come back & face justice. _E_ "Recognize that the world needs more entrepreneurs. Everyone is counting on you." – Midas Touch _E_ .@bobschieffer did an excellent job as debate moderator last night. I only wish Mitt was more aggressive! _E_ Obama and Republicans are hollowing out our military. Now want to cut troop levels. Lowest level in over 20 years. _E_ RT @foxandfriends: .@carriesheffield: The mainstream media is neglecting their duty to represent the public. They've failed to represent ha... _E_ Congratulations to @drewbrees on setting the @NFL record with 48 consecutive games with a TD pass. He is a great guy and player. _E_ #CelebApprentice Selfies yes or no? _E_ Getting rdy to leave for tonight's Celebrate Freedom Concert honoring our GREAT VETERANS w/ so many of my evangelic... __HTTP__ _E_ Thank you. __HTTP__ _E_ Can u believe that Jeb Bush's campaign manager is in Berlin Germany looking for money? What's he giving to Germany? __HTTP__ _E_ How can NYS allow lightweight @AGSchneiderman to remain in office? What are JCOPE & Moreland Commissions waiting on? __HTTP__ _E_ Via @DrudgeReport: __HTTP__ _E_ Our new Miss USA Alyssa Campanella came up to my office today for a visit. 
We're proud to have her as our new title holder. _E_ Buy at the point of maximum pessimism sell at the point of maximum optimism. Sir John Templeton _E_ Happy to have just passed 1.5M followers on twitter. We picked up over 14000 yesterday alone. It's great to speak to everyone daily. _E_ Want jobs? Slash corporate tax rate. Tax incentives for companies that create jobs in US. America will boom. _E_ No matter the mission the brave men & women of our @USCG proudly answer the call to serve 24/7/365. THANK YOU and HAPPY BIRTHDAY! #CG227 __HTTP__ _E_ Failure for all of @BarackObama's talk of engaging the world U.S. favorability has dropped around the world __HTTP__ _E_ I will be doing a major sit down interview on State of the Union With Jake Tapper at 9:00 A.M. on @CNN. Enjoy! _E_ I will be in PR on Tues. to further ensure we continue doing everything possible to assist & support the people in their time of great need. _E_ RT @Reince: Happy New Year + God's blessings to you all. Looking forward to incredible things in 2017! @realDonaldTrump will Make America... _E_ George Will is a political moron. Last month he said Romney couldn't win. _E_ Great to talk jobs with #NABTU2017. Tremendous spirit & optimism we will deliver! __HTTP__ _E_ We are no longer silent. We are energized & ready to take our country back. Let's Make America Great Again! __HTTP__ _E_ .@NBCNews purposely left out this part of my nuclear qoute: until such time as the world comes to its senses regarding nukes. Dishonest! _E_ ...2nd Amendment Strong Military ISIS historic VA improvement Supreme Court Justice Record Stock Market lowest unemployment in 17 yrs! 
_E_ Today's ceremony is a day for both remembrance and resolve.#NATOMeeting #NATO __HTTP__ _E_ Yet another terrorist attack today in Israel a father shot at by a Palestinian terrorist was killed while: __HTTP__ _E_ .@BarackObama reported over $269710 of foreign income out of his gross $894520 and paid $5841 in foreign taxes __HTTP__ _E_ What a waste of time being interviewed by @andersoncooper when he puts on really stupid talking heads likeTim O'Brien dumb guy with no clue! _E_ I appeared on David Letterman last night. And don't forget Sunday night the first episode of Celebrity Apprentice will be on NBC at 9 pm. _E_ .@BrandenRoderick I was pleased to see the wonderful statements you made about me to the media.I'm not surprised you're a special person _E_ My shirts ties and fragrance are doing great at @Macys try them! Make fantastic gifts. _E_ 64 stories of golden glass over the strip @TrumpLasVegas' elite hotel rooms feature floor to ceiling windows __HTTP__ _E_ Libya is selling its oil to China I notice the Chinese Ambassador is very safe. _E_ How low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Bad (or sick) guy! _E_ The only way to spread economic growth is to lower taxes and end unfriendly regulatory practices. _E_ President Obama said that he thinks he would have won against me. He should say that but I say NO WAY! jobs leaving ISIS OCare etc. _E_ The Wilson family should thank me. Pegula overpaid for the @buffalobills because of me! _E_ A woman who got fired after two days of working with Scott Walker a wacko now trying to raise funds to fight me. _E_ RT @williebosshog: Make America Great Again! #Trump2016 __HTTP__ _E_ #ConfirmGorsuch #SCOTUS __HTTP__ _E_ American incomes have fallen $3040 per household in the last 38 months __HTTP__ _E_ He is working hard and for that he must be given credit! _E_ Petraeus is already negotiating a book deal. __HTTP__ Smart. 
Always negotiate when you are a hot commodity! _E_ Top 50 Facts About Crooked Hillary Clinton From Trump 'Stakes Of The Election' Address: __HTTP__ _E_ Entrepreneurs: Be cautiously optimistic. Call it positive thinking with a lot of reality checks. _E_ The Tonight Show begins in 5 minutes. Enjoy! _E_ Getting ready to pay final respect to GREAT LADY Joan Rivers. She could light up a room like no other! She will be greatly missed. _E_ I will be interviewed on @TODAY Show at 7:00 A.M. and on Morning Joe at 7:20. _E_ Lance Armstrong is now being sued by Fed Govt what was he thinkking? _E_ Bernie sanders has abandoned his supporters by endorsing pro war pro TPP pro Wall Street Crooked Hillary Clinton. _E_ I am watching the New York mayoral race very closely... _E_ RT @foxandfriends: .@JudgeJeanine: There will be an uproar in this country if they end up with an indictment against a Trump family member... _E_ This country cannot take four more years of Barack Obama! #Debate _E_ Sorry couldn't do @foxandfriends this morning big meeting. Will double up next week at 7. _E_ I am the BEST builder just look at what I've built. Hillary can't build. Republican candidates can't build. They don't have a clue! _E_ I knew last year that @TIME Magazine lost all credibility when they didn't include me in their Top 100... _E_ I actually enjoyed the piece re sign @TheDailyShow. Could it be that I'm starting to like Jon Stewart? _E_ Wow Hillary and Bill are in deep trouble but don't worry my fellow Republicans will let them off the hook. All talk no action. _E_ Hear me on @kiss925toronto now!#rozandmocha _E_ Via HuffPost Pollster #1 __HTTP__ _E_ RT @seanhannity: #Hannity Starts in 30 minutes with @newtgingrich and my monologue on the Deep State's allies in the media _E_ Thr coverage about me in the @nytimes and the @washingtonpost gas been so false and angry that the times actually apologized to its..... 
_E_ The pinnacle of the luxury public golf experience @TrumpGolfLA overlooking the Pacific Ocean in Palos Verdes __HTTP__ _E_ Sleepy eyes @chucktodd when looking at my financial filings should've said "Great job Mr. Trump Sir." _E_ Good advice from my father Fred C. Trump: Know everything you can about what you're doing. _E_ Well as predicted the 9th Circuit did it again Ruled against the TRAVEL BAN at such a dangerous time in the history of our country. S.C. _E_ RT @DanScavino: Great interview on @foxandfriends by @SteveDoocy w/ Carrier employee who has a message for #PEOTUS @realDonaldTrump & #VPE... _E_ Wow every poll said I won the debate last night. Great honor! _E_ It is also amazing how comments can be edited to provide statements that are used in a knowingly incorrect manner. _E_ Diet Coke tweet had a monster response dammit I wish the stuff worked. _E_ Via @BreitbartNews by Katie McHugh: POLL: DONALD TRUMP LEADS THE PACK AS GOP FRONTRUNNER __HTTP__ _E_ Is Gov. @BobbyJindal the stupid one for using the phrase "the stupid party" when referring to the Republicans? _E_ Former Navy SEAL Questions @BarackObama's birthplace __HTTP__ _E_ Crooked Hillary Clinton discussing the #SecondAmendment at a private event. #2A cc: @NRA __HTTP__ _E_ Goofy Senator Elizabeth Warren @elizabethforma has done less in the U.S. Senate than practically any other senator. All talk no action! _E_ THANK YOU AMERICA! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ .@SethMacFarlane will be a great Oscar host. He did an amazing job at my @ComedyCentral roast. _E_ America needs to rebuild our infrastructure. Why are we sending trillions overseas when our own roads bridge... (cont) __HTTP__ _E_ Why won't President Obama use the term Islamic Terrorism? Isn't it now after all of this time and so much death about time! _E_ James Clapper called me yesterday to denounce the false and fictitious report that was illegally circulated. Made up phony facts.Too bad! 
_E_ The failing @nytimes is truly one of the worst newspapers. They knowingly write lies and never even call to fact check. Really bad people! _E_ ..not associated with Russia. Trump team spied on before he was nominated. If this is true does not get much bigger. Would be sad for U.S. _E_ Many of Bernie's supporters have left the arena. Did Bernie go home and go to sleep? _E_ Get the big picture but be prepared for the picture to change. Where there's a will there's a win. Think positively! _E_ Sorry folks but if I would have relied on the Fake News of CNN NBC ABC CBS washpost or nytimes I would have had ZERO chance winning WH _E_ Work hard play hard and live to the hilt. Think Like a Billionaire _E_ I will be on @foxandfriends in ten minutes enjoy! _E_ I will be live tweeting @megynkelly Show in 10 minutes. Should be interesting. Will be on Fox Network! ENJOY! _E_ Thank you for your support!#AmericaFirst #LeadRight2016 __HTTP__ _E_ Secy. Sebelius who was responsible for the horrendous ObamaCare rollout should resign or be fired.Refuses to go before Congress to explain _E_ RT @austinroneil: @realDonaldTrump Thanks for all the inspirational quotes. Helping encourage this young entrepreneur. :) _E_ I am going to give @Rosie a pass. @Rosie is desperate to get back on TV so she can be on yet another show that can be quickly canceled. _E_ I will make our Military so big powerful & strong that no one will mess with us. #Trump2016 __HTTP__ __HTTP__ _E_ #EndCommonCore #Trump2016Video: __HTTP__ __HTTP__ _E_ Great new poll Florida thank you! #MakeAmericaGreatAgain __HTTP__ _E_ Wow the economy is really bad! GROSS DOMESTIC PRODUCT down 0.7% in 1st. quarter and getting worse. I TOLD YOU SO! Only I can fix. _E_ Tell Congress to straighten out the many problems of our country before trying to be the policemen to the world. Make America great again! _E_ Via @njdotcom by Eugene R. Dunn Medford: Donald Trump towers over GOP field __HTTP__ They hate us because they ain't us. 
_E_ Congratulations to @BarackObama yesterday marked the 1 YR anniversary of our country's credit being downgraded __HTTP__ _E_ What do you think of water boarding the Boston killer sometime prior to allowing our doctors to make him well? I suspect he may talk! _E_ Thoughts and prayers to the great people of Indiana. You will prevail! _E_ The new course at Trump International Scotland will be a par 72 layout with five sets of tees ranging from 7540 yards to 5630. _E_ See @IvankaTrump on the cover of @HudsonMOD? View the digital edition: __HTTP__ _E_ Hillary Clinton is the only candidate on stage who voted for the Iraq War. #Debates2016 #MAGA __HTTP__ _E_ Dishonest @nytimes reporter Jonathan Martin refused to acknowledge massive crowd surge forward... __HTTP__ _E_ .@alexsalmond RT @islandbluenose You'll be doing us in scotland a great service if you win. Good Luck. _E_ .@HillaryClinton is weak on illegal immigration & totally incompetent as a manager and leader no strength or stamina to be #POTUS! _E_ Great job once again by law enforcement! We are proud of them and should embrace them without them we don't have a country! _E_ I will be on @OutFrontCNN with @ErinBurnett at 7PM. Tune in!#Trump2016 _E_ ginrnnr2 @realDonaldTrump ...is China economy in a bubble ? Only if we want it to be! _E_ I have decided to add a caveat to my offer. Obama can't decide to send my $5M to Rev. Wright if he releases his records. _E_ This is about the money I gave to charity and in response to your comments about Gadhafi... __HTTP__ _E_ A beautiful view from my office today __HTTP__ _E_ #TrumpTower is one of the country's top tourist destinations. _E_ Thank you @IngrahamAngle! #AmericaFirst __HTTP__ _E_ Being an entrepreneur is not a group effort. You have to trust yourself and your instincts. _E_ I can't resist hitting lightweight @DannyZuker verbally when he starts up because he is just.so pathetic and easy (stupid)! _E_ Thank you. 
__HTTP__ _E_ China is very much the economic lifeline to North Korea so while nothing is easy if they want to solve the North Korean problem they will _E_ We need another Bush in office about as much as we need Obama to have a 3rd term. No more Bushes! _E_ Thank you West Virginia. Let's keep it going. Go out and vote on Tuesday we will win big. #Trump2016 _E_ Sebelius didn't test $635M (probably $1B) ObamaCare website until "a couple of days leading up to the launch." __HTTP__ _E_ Via @Mediaite: Trump to @gretawire: Sequester Cuts Don't Go Far Enough' __HTTP__ _E_ The upcoming record 13th season of @CelebApprentice is going to be very special. Our production team's ingenuity is amazing. _E_ Thank you! __HTTP__ _E_ I don't like bullies. I am not going to stand around and watch @KarlRove target the Tea Party. Karl Rove gave us Barack Obama. Loser. _E_ Great Army Navy Game. Army wins 14 to 13 and brings home the COMMANDER IN CHIEF'S TROPHY! Congratulations! _E_ Bill O'Reilly doing a major special on @OreillyFactor tonight @FoxNews at 8pmE. Watch it should be good! #Trump2016 _E_ A great night in West Allis Wisconsin! Thank you! #VoteTrumpWI #WIPrimary __HTTP__ __HTTP__ _E_ Will soon be heading to Davos Switzerland to tell the world how great America is and is doing. Our economy is now booming and with all I am doing will only get better...Our country is finally WINNING again! _E_ Drew Brees is having a great game a fantastic quarterback and really good guy! _E_ To aspiring entrepreneurs: Be tenacious. Once you've decided on your goals remain fixed on them. Set the bar high! _E_ Entrepreneurs: Problems are a mind exercise. Enjoy the challenge. _E_ Via Newsmax. Nice article thank you so much. __HTTP__ _E_ The real scandal here is that classified information is illegally given out by intelligence like candy. Very un American! 
_E_ #MakeAmericaWorkAgain#TrumpPence16 #RNCinCLE __HTTP__ __HTTP__ _E_ Our great African American President hasn't exactly had a positive impact on the thugs who are so happily and openly destroying Baltimore! _E_ Via @MiamiHerald: "@IvankaTrump talks family and business" __HTTP__ _E_ Did you ever not do something that had you done it would have turned out to be a disaster. Never look back just learn from your experience! _E_ Katherine Webb gets a Donald Trump job offer says she's 'shocked' about the attention __HTTP__ via @Zap2it _E_ Don't miss me on @foxandfriends Monday at 7:30 AM _E_ Writing my inaugural address at the Winter White House Mar a Lago three weeks ago. Looking forward to Friday.... __HTTP__ _E_ Thank you to @foxandfriends for exposing the truth. Perhaps that's why your ratings are soooo much better than your untruthful competition! _E_ Made in America? @BarackObama called his 'birthplace' Hawaii here in Asia. __HTTP__ _E_ .@MittRomney is trying to hit back at me because I'm saying that he let the Repub Party down w/ his loss to Obama. Should've won—he choked! _E_ One of the simplest joys of life is golf. A great game to both play and watch. _E_ With the impending crisis in Korea is it a big confidence builder that Chuck Hagel is Sec. of Defense? Elections have consequences. _E_ .@melaniatrump on @QVC tonight at 7PM EST. Tune in! _E_ ..But the people were Pro Trump! Virtually no President has accomplished what we have accomplished in the first 9 months and economy roaring _E_ The Chinese are now hacking White House computers. Why not? They already own the place. _E_ Must read @nypost editorial on $40M NYC taxpayer settlement to Central Park Thugs Wilding for Profit __HTTP__ _E_ With our weakened dollar gas will continue to rise. Fracking is an answer to lowering energy costs. _E_ It was a very wise move that Ted Cruz renounced his Canadian citizenship 18 months ago. Senator John McCain is certainly no friend of Ted! 
_E_ Actually Putin doesn't want Alaska because the Environmental Protection Agency will make it impossible for him to drill for oil! _E_ "Being an entrepreneur is a big task. So what can you do to prepare? First and foremost expand your focus." Midas Touch _E_ Great new poll Iowa thank you!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ We had a great News Conference at Trump Tower today. A couple of FAKE NEWS organizations were there but the people truly get what's going on _E_ Join me tonight in Cedar Rapids Iowa at 7pm: __HTTP__ Arizona tomorrow night at 3pm: __HTTP__ _E_ Insurance companies are fleeing ObamaCare it is dead. Our healthcare plan will lower premiums & deductibles and be great healthcare! _E_ Can you believe that Ted Cruz who has been killing our country on trade for so long just put out a Wisconsin ad talking about trade? _E_ Entrepreneurs: Be totally focused. Know everything you can about what you're doing. Give your work 100% of your concentrated effort. _E_ Congratulations to @newtgingrich on a stunning win in South Carolina. All eyes are on Florida now. _E_ Paul Ryan should spend more time on balancing the budget jobs and illegal immigration and not waste his time on fighting Republican nominee _E_ My @FoxNews @gretawire int. on the border crisis #BringBackOurMarine & Obama's ineptitude & the economy __HTTP__ _E_ Will be at Fort Worth (Texas) Convention Center at 11:30 A.M. Big crowd get there early! Big announcement to be made! _E_ "'THE DONALD' GOT A MUSKET" __HTTP__ via @fitsnews _E_ Thank you Toledo Ohio! It is so important for you to get out and VOTE on November 8 2016! Lets MAKE AMERICA SAFE... __HTTP__ _E_ Via @InverurieHerald: Trump's new course plans on display __HTTP__ _E_ Watch as I humiliate a dais full of talent. #TrumpRoast airs tonight at 10:30/9:30c on Comedy Central __HTTP__ _E_ Accounting firm Ernst & Young and the celebrity judges are insulted by Miss Pennsylvania's made up PR. 
_E_ ...approvals of The Keystone XL & Dakota Access pipelines. Also look at the recent EPA cancelations & our great new Supreme Court Justice! _E_ I will bring our jobs back to America fix our military and take care of our vets end Common Core and ObamaCare protect 2nd A build WALL _E_ My @SquawkCNBC interview from earlier in the week discussing the GOP primary and @newtgingrich's electability __HTTP__ _E_ The Mayor of San Jose did a terrible job of ordering the protection of innocent people. The thugs were lucky supporters remained peaceful! _E_ In the center of Ireland's rugged west coast @Trump_Ireland offers a beautiful golf course top dining and a Spa __HTTP__ _E_ By Obama mentioning Manhattan yesterday in his response he has singlehandedly made it target #1. How totally stupid is this guy? _E_ Disloyal R's are far more difficult than Crooked Hillary. They come at you from all sides. They don't know how to win I will teach them! _E_ Today I delivered remarks at the 36th Annual National Peace Officers' Memorial Service. #NationalPoliceWeekWatch... __HTTP__ _E_ .@pennjillette is an extraordinary entertainer & magician whose star on the Hollywood Walk of Fame is long overdue. Very proud of him. _E_ Via @ArabianBusiness: "@IvankaTrump eyes new projects in Abu Dhabi" for Trump Organization __HTTP__ _E_ As long as we open our eyes to God's grace and open our hearts to God's love then America will forever be the land of the free the home of the brave and a light unto all nations. #NationalPrayerBreakfast __HTTP__ _E_ My daughter Ivanka thinks I should run for President. Maybe I should listen. __HTTP__ _E_ Discover your true self and surround yourself with people who complement your gifts and modes of operation. Midas Touch _E_ .@AS_ScienceGuy @realDonaldTrump Thank you for all your support of @autismspeaks Great new breakthroughs. Fantastic! _E_ It is terrible that neither Obama Biden nor Kerry attended Lady Thatcher's funeral. 
They would all run to Muslim Brotherhood Morsi's. _E_ The only reason President Obama wants to attack Syria is to save face over his very dumb RED LINE statement. Do NOT attack Syriafix U.S.A. _E_ RT @foxandfriends: FOX NEWS ALERT: North Korea responds to U.S. with Guam attack plan as Secretary Mattis warns Kim Jung Un "he is grossly... _E_ .@tedcruz you were terrific on @seanhannity tonight. I am going to the border tomorrow. _E_ #CNNDebate __HTTP__ _E_ It's Thursday how much $ has @BarackObama wasted today? _E_ Dr. Ben Carson I concur. I believe in God who can change people he can make any of us better. @RealBenCarson _E_ From Bloomberg: "Chrysler's Jeep expects China production agreement soon." I told you so. _E_ No wonder Afghanistan is a mess! @BarackObama is releasing high level insurgents in exchange for pledges of peace. __HTTP__ _E_ .@TrumpSoHo New York has interiors by celebrated design house Fendi Casa and 360 degree views of the city skyline. __HTTP__ _E_ Wow three top MICROSOFT investors want Bill Gates out as Chairman. Do not like job he is doing! _E_ What an amazing comeback and win by the Patriots. Tom Brady Bob Kraft and Coach B are total winners. Wow! _E_ "Learning is a new beginning we can give ourselves every day." – Trump: How to Get Rich _E_ Johnny Miller—Great job this weekend. Most insightful and tough. See you at Doral. _E_ I will be making a major speech on ILLEGAL IMMIGRATION on Wednesday in the GREAT State of Arizona. Big crowds looking for a larger venue. _E_ .@myfoxny discussing NYPD Chief Kelly's great record & the launch of the crowdfunding site __HTTP__ __HTTP__ _E_ Let's take a closer look at that birth certificate. @BarackObama was described in 2003 as being born in Kenya. __HTTP__ _E_ Wish Obama would say ISIS like almost everyone else rather than ISIL. _E_ Afghan Leader Karzai has received tens of millions of dollars IN CASH from the U.S. Government how stupidly is our Country being run? 
_E_ Will be on @jimmykimmel tonight at 11:35pmE on @ABC. #Kimmel #Trump2016 #MakeAmericaGreatAgain _E_ I'm just so tired of listening to the same old rhetoric and words day after day from our President. It's time to stop talking WORK! _E_ Over 100M are now receiving some form of welfare __HTTP__ We must do better. @MittRomney has the vision to get America working. _E_ Just to show you how unfair Republican primary politics can be I won the State of Louisiana and get less delegates than Cruz Lawsuit coming _E_ Join us in Toledo Ohio tomorrow night at 8pm! #TrumpPence16 #MAGATickets: __HTTP__ __HTTP__ _E_ America became a powerhouse because of our deep belief in the virtue of self reliance. #TimeToGetTough (cont) __HTTP__ _E_ Help fund @Dratzenberger's new show 'American Made' on @fundanything __HTTP__ John is on @teamcavuto today re project. _E_ Will be on @seanhannity tonight at 10pm hosted by @GovMikeHuckabee. Enjoy! _E_ Our online store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_ A true honor to receive the endorsement of John Wayne's daughter....read: __HTTP__ __HTTP__ _E_ Bayer AG has pledged to add U.S. jobs and investments after meeting with President elect Donald Trump the latest in a string... @WSJ _E_ #trumpvlog The song Donald Trump hits 54 million views. @MacMiller Where's my money? __HTTP__ _E_ "@TrumpFerryPoint A Brand New Championship Golf Course In NYC Developed By Donald Trump And Anyone Can Play It" __HTTP__ _E_ Via @TheOaklandPress Donald Trump speaks in Novi(Michigan) draws record breaking crowd __HTTP__ _E_ Our country under President Obama is on life support! Great leaders must bring people together. _E_ Via @theobserver: Donald Trump: Lake Norman golf course 'one of the hottest places around' __HTTP__ _E_ After being ripped off for years Obama finally figured out that China is taking advantage of us. He's finally listening to me. 
_E_ Great #Thanksgiving travel and parade watching tips by @NYTimesTravel including an option from @TrumpNewYork: __HTTP__ _E_ Crooked's camp incited violence at my rallies. These incidents weren't spontaneous like she claimed in Benghazi! __HTTP__ _E_ Go to Trump Doral in Miami and watch the World Golf Championship! On NOW! _E_ We commend SG @AntonioGuterres & his call for the UN to focus more on people & less on bureaucracy. #USAatUNGA #UNGA __HTTP__ __HTTP__ _E_ .@FoxNewsSunday _E_ Entrepreneurs: Look at the solution not the problem. Learn to focus on what will give results. _E_ The failing @HuffingtonPost and dopey @ariannahuff are writing so much false junk about me they just can't get enough! BE CAREFUL. _E_ Just a reminder that Ted Cruz supported liberal Justice John Roberts who gave us #Obamacare. __HTTP__ _E_ Lies and incompetence the two words that are most closely associated with ObamaCare! _E_ Spitzer and Weiner lost lightweight Eric Schneiderman will be next he will be challenged in the PRIMARY. He has done a really poor job! _E_ Each of the 176 magnificent luxury suites and guestrooms at @TrumpNewYork provide a sophisticated urban appeal __HTTP__ _E_ Chris McDaniel looks like he will win in Mississippi GREAT NEWS and big victory for Tea Party! _E_ "Push yourself again and again. Don't give an inch until the final buzzer sounds." Larry Bird _E_ Highly respected PUBLIC POLICY POLLING (PPP) just announced that I am number one in IOWA. Thank you! _E_ Bush was called unpatriotic by @BarackObama in '07 for adding $4T to debt __HTTP__ @BarackObama increased it $6T in 3 years. _E_ Donna Summer performed for me many times she was great and will be missed. @TheDonnaSummer _E_ .@BMP_Music_Event Read 'Midas Touch' great book for entrepreneurs. Good luck! _E_ #TheView Lots of fun on @TheViewTV with @JennyMcCarthy and @SherriEShepherd __HTTP__ _E_ Just completed call with President Moon of South Korea. 
Very happy and impressed with 15 0 United Nations vote on North Korea sanctions. _E_ .@mystikangel Bring @johnrich back? He is back! _E_ Yes the BP oil spill was bad but it was no reason to put tighter clamps on domestic drilling. That showed no (cont) __HTTP__ _E_ The habitual vacationer @BarackObama is now in Hawaii. This vacation is costing taxpayers $4 milion +++ while there is 20% unemployment. _E_ #VoteTrump video: __HTTP__ #ArizonaPrimary #UtahCaucus #UTCaucus #AmericanSamoa __HTTP__ _E_ So we can spy on our ally's leaders but can't water board terrorists? _E_ Just got back to New York from California. Will be on Fox & Friends tomorrow morning at 7.00. ObamaCare and other disasters to be discussed _E_ The Fake News Is going all out in order to demean and denigrate! Such hatred! _E_ Today's EO established a commission on combating drug addiction and the opioid crisis. Watch listening session... __HTTP__ _E_ Carly Fiorina I agree! Ted Cruz is just another politician. All talk no action! __HTTP__ _E_ It was @BarackObama who promised if you like your plan you can keep your plan. Now ObamaCare is causing (cont) __HTTP__ _E_ #UtahCaucus message from @IvankaTrump! #UTCaucus#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Vote for your favorite @MissUSA contestant the 2013 #MissUSA Fan Vote at __HTTP__ ! _E_ The @PGATOUR comes to Miami on March 6th when the @CadillacChamp returns to @TrumpDoral __HTTP__ See you there! _E_ Why aren't the lawyers looking at and using the Federal Court decision in Boston which is at conflict with ridiculous lift ban decision? _E_ Yesterday in front of Rockefeller Center __HTTP__ _E_ The judge in the Oscar Pistorious case is a total moron. She said he didn't act like a killer. This is another O.J. disaster! _E_ Wacko @glennbeck is a sad answer to the @SarahPalinUSA endorsement that Cruz so desperately wanted. Glenn is a failing crying lost soul! 
_E_ Thanks to our loyal viewers & fans last night's @ApprenticeNBC topped all the demos & grew 24% in our regular slot premiere. _E_ Congratulations @ElonMusk and @SpaceX on the successful #FalconHeavy launch. This achievement along with @NASA's commercial and international partners continues to show American ingenuity at its best! __HTTP__ _E_ Yesterday the White House claimed its ISIS strategy is a 'success.' Tell that to the Christians being beheaded. We need to hit ISIS hard! _E_ Imposing dunes on the Aberdeenshire coastline @TrumpScotland's Championship course is a classical Scottish links __HTTP__ _E_ With @BarackObama listing himself as Born in Kenya in 1999 __HTTP__ HI laws allowed him to produce a fake certificate. #SCAM _E_ "30000 MACY'S CUSTOMERS RETALIATE IN SUPPORT OF DONALD TRUMP" __HTTP__ via @BreitbartNews by @ASwoyer _E_ Just arrived in Youngstown Ohio with @FLOTUS Melania!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ I will be going to New Hampshire today home of my first primary victory to discuss terror and the horrible events of yesterday. 2:30 P.M. _E_ Thank you to all of our law enforcement officers across America! #LESM #MAGA __HTTP__ __HTTP__ _E_ Wow just saw the really bad @CNN ratings. People don't want to watch bad product that only builds up Crooked Hillary. _E_ Why should ObamaCare be delayed for businesses and not working families? With premiums rising at record levels it is not equitable. _E_ Next Saturday night I will be holding a BIG rally in Pennsylvania. Look forward to it! _E_ Big election tomorrow in the Great State of Alabama. Vote for Senator Luther Strange tough on crime & border will never let you down! _E_ Join me in Clive Iowa tomorrow at noon! #AmericaFirst #MAGATickets: __HTTP__ __HTTP__ _E_ ...At the same time go through a worst case scenario but keep it short. Focus on your goal—look at the solution not the problem. 
_E_ It was my great honor to pay tribute to a VET who went above & beyond the call of duty to PROTECT our COMRADES our COUNTRY & OUR FREEDOM! __HTTP__ _E_ Lithium ion batteries should not be allowed to be used in aircraft. I won't fly on the Boeing 787 Dreamliner it uses those batteries. _E_ Not only does Obama spy on German leaders he criticizes their trade surplus __HTTP__ We should have a trade surplus! _E_ Hillary took money and did favors for regimes that enslave women and murder gays. _E_ .@PatrickBuchanan was great on @TeamCavuto @FoxNews. Thank you Pat! #Tump2016 _E_ .@GOP HOUSE LEADERSHIP – ESTABLISH SELECT COMMITTEE ON BENGHAZI. THERE IS A MASSIVE COVERUP. _E_ RT @namusca: #VoteTrump2016 a real leader that truly cares about America & our values. He wants to bring prosperity back 2 USA __HTTP__ _E_ I am very disappointed in China. Our foolish past leaders have allowed them to make hundreds of billions of dollars a year in trade yet... _E_ Remember anything you read about Atlantic City has nothing to do with me. I sold years ago and left. Good timing but very sad! _E_ Looking forward to honoring the great Dogan family & the success of the Trump Towers project in Istanbul @FollowTurkey Annual Gala Dinner _E_ Time to end the visa lottery. Congress must secure the immigration system and protect Americans. __HTTP__ _E_ .@Peggynoonannyc An election between Hillary and myself will be the biggest voter turnout in U.S. history. Just like the debates 24 M vs 2M. _E_ Just said at #NCGOPCon that I'm not beholden to lobbyists and donors! No special interest would control me if I were in office. _E_ Excited to be travelling to New Hampshire on Monday. The Granite State is a model for the country. Live Free or Die! _E_ Spanish version of ObamaCare website delayed __HTTP__ Hitting google translate apparently too complicated. #MakeDCListen _E_ USMC Sgt. Tahmooressi sacrificed for our country. While Obama is welcoming illegals our Marine is locked in a Mexican jail. 
#FreeOurMarine _E_ Taxpayers are paying a fortune for the use of Air Force One on the campaign trail by President Obama and Crooked Hillary. A total disgrace! _E_ Wow Record ratings for WGC Cadillac Championship at Trump National Doral's Blue Monster Most watched in seven years. CONGRATS to@Tiger Woods _E_ I have a proven track record supporting our Veterans. Veterans deserve universal access to care. VA scandal proves politicians are inept. _E_ Launching the Trump Home by Dorya Furniture Collection today. It looks amazing! @HPMARKETNEWS @DoryaInteriors __HTTP__ _E_ #2017Jambo Remember your duty. Honor your history. Take care of the people God puts into your life – and LOVE & CHERISH your country! __HTTP__ _E_ It was the childishly written & taunting PR statement by Fox that made me not do the debate more so than lightweight reporter @megynkelly. _E_ .@PiersMorgan is right he won the show because "I know how to play the game." #CelebApprentice _E_ CNN'S slogan is CNN THE MOST TRUSTED NAME IN NEWS. Everyone knows this is not true that this could in fact be a fraud on the American Public. There are many outlets that are far more trusted than Fake News CNN. Their slogan should be CNN THE LEAST TRUSTED NAME IN NEWS! _E_ #ISIS is making $400M/year on oil. I have been saying it for years. We need to bomb the oil! __HTTP__ __HTTP__ _E_ Great Live Signing last nite! Over 25k views. I am signing books for next two weeks. Order yours for holiday gifts: __HTTP__ _E_ Endorsements for Lyin' Ted Cruz __HTTP__ _E_ .@IvankaTrump's Favorite Miami Hot Spots @TrumpGolf @TrumpDoral __HTTP__ _E_ The United States must greatly strengthen and expand its nuclear capability until such time as the world comes to its senses regarding nukes _E_ An updated POLL tracker (with all polls thru the weekend) reveals I maintained a double digit lead at... __HTTP__ _E_ Look if we can make chopsticks in America and sell them to the Chinese we can compete on hundreds of other fronts as well. 
TimeToGetTough _E_ Remember Obama limped across the finish line he should have lost to Hillary. Be careful! _E_ If Stop & Frisk is struck down by the pandering NYC politicians increases in crime & eventual terrorist attacks will be on them. _E_ Bullshit Pop gave me knowledge and a relatively small amount of money (split between brothers and sisters) and I built it into over 9 bill. _E_ Thank you working hard! __HTTP__ _E_ Will be doing interview on @GolfChannel at 8.00 this morning. Will be talking about getting the great PGA Championship & Senior PGA etc. _E_ Via @BreitbartNews by @rwildewrites: "Donald Trump: I Can Make America Great Again" __HTTP__ (Hyperlinked on @DRUDGE_REPORT) _E_ Must see video Obama's criticism of @MittRomney is identical to Carter's on Reagan __HTTP__ _E_ Why did we spend billions of our money on Libya if we are not going to get any of the country's oil? What do we get out of this? _E_ China is buying gas fields in Texas __HTTP__ & stealing our corporate secrets... _E_ Will be interviewed by @SeanHannity on @foxnews at 10PM tonight. Enjoy! _E_ Looking forward to watching the legendary @BarbaraJWalters interview my family (and me) tonight on @ABC at 10:00. Many things to talk about! _E_ With two champion style courses @TrumpGolfDC graces 600 rolling acres along the peaceful and scenic Potomac River __HTTP__ _E_ Senator Chuck Schumer helping to import Europes problems said Col.Tony Shaffer. We will stop this craziness! @foxandfriends _E_ If President Obama was going to attack Syria he should've done it a long time ago as a surprise & not after (cont) __HTTP__ _E_ TO ALL AMERICANS __HTTP__ _E_ I'm at Trump Doral right now Tiger will tee off shortly. _E_ As Iran began the process of taking over Iraq many people wanted me to say that "I told you so!" – so I told you so. _E_ MUST READ! My @chicagotribune editorial: I love Chicago ... and my sign! __HTTP__ _E_ #TBT At the US Open Tennis Tournament with @EricTrump see same hairstyle! 
__HTTP__ _E_ Via @DMRegister :"@brentroske on Politics: Trump Talks Iowa" __HTTP__ _E_ End the Democrats Obstruction! __HTTP__ _E_ TRUMP APPROVAL HITS 50% __HTTP__ _E_ Many of Hillary's donors are the same donors as Jeb Bush's—all rich will have total control—know them well. _E_ My @HollywoodLife interview w/ @MELANIATRUMP discussing her debut on @ApprenticeNBC & her skin care line __HTTP__ _E_ #MakeAmericaSafeAgain __HTTP__ _E_ Crude is at $100/Barrel. With the current state of the world economy how is that possible? OPEC is ripping of... (cont) __HTTP__ _E_ No wonder boxing is close to dead! _E_ Thank you @billoreilly & @KarlRove. Ted Cruz should be immediately disqualified in Iowa with each candidate moving up one notch. _E_ Our biggest problems are solved by growth. We need a President who is a job creator. Let's Make America Great Again! __HTTP__ _E_ Jodi thought she outsmarted the system it didn't work! Congratulations to the jury on a job well done! Now will it be life or death? _E_ Well now they're saying that I not only won the NBC Presidential Forum but last night the big debate. Nice! _E_ Young Entrepreneurs – the Holiday season is here but that is no excuse not to stay on top of your business prospects. Focus! _E_ Consumer confidence soars to highest level since 2004 📈 __HTTP__ __HTTP__ _E_ By @kwrcrow: "NY Post caught 'LYING' Again!" __HTTP__ The Donald" should go far. Actually if I run I'll win. _E_ The resolution being considered at the United Nations Security Council regarding Israel should be vetoed....cont: __HTTP__ _E_ Obama attacks the CIA for waterboarding while routinely droning civilians caught in the Islamist crosshairs. _E_ If dopey Mark Cuban of failed Benefactor fame wants to sit in the front row perhaps I will put Gennifer Flowers right alongside of him! _E_ Obama can release 5 senior Taliban for a deserter but can't make Mexico release decorated Marine Sgt. Andrew Tahmooressi. 
Pathetic _E_ Romney's campaign is being put on the defensive. He cannot let this happen. Stop pandering. Must get tougher (cont) __HTTP__ _E_ Do people notice Hillary is copying my airplane rallies she puts the plane behind her like I have been doing from the beginning. _E_ We need much tougher much smarter leadership and we need it NOW! _E_ John Cahill is highly respected in all circles—really nice to see that he's running for New York State Attorney General. @CahillForAG _E_ I won every poll from last nights Presidential Debate except for the little watched @CNN poll. _E_ BORDER WALL prototypes underway! __HTTP__ _E_ #CelebrityApprentice It's good to have Jack back too with @marleematlin. He's become a star. #sweepstweet _E_ "Always remember that the future comes one day at a time." Dean Acheson _E_ Denzel Washington gave a wonderful commencement speech over the weekend. From the heart! _E_ Show me someone without an ego and I'll show you a loser having a healthy ego or high opinion of yourself is a real positive in life! _E_ I love America. And when you love something you protect it passionately fiercely even. #TimeToGetTough (cont) __HTTP__ _E_ My use of social media is not Presidential it's MODERN DAY PRESIDENTIAL. Make America Great Again! _E_ #MakeAmericaGreatAgain #Trump2016Video: __HTTP__ __HTTP__ _E_ .@JebBush is totally lost he spends too much time managing the bloated staff of his campaign & not enough talking about America's future. _E_ The Oscars were a great night for Mexico & why not—they are ripping off the US more than almost any other nation. _E_ "Every strike brings me closer to the next home run." Babe Ruth _E_ Learn from yesterday live for today hope for tomorrow. The important thing is not to stop questioning. Albert Einstein _E_ Wow Obama really put it to Israel by canceling flights there. This puts them at a tremendous disadvantage. Tourism and more will just stop. 
_E_ I want to help our miners while the Democrats are blocking their healthcare. _E_ .@oreillyfactor why don't you have some knowledgeable talking heads on your show for a change instead of the same old Trump haters. Boring! _E_ Tom Brady is a good friend of mine a great player a great guy and a total winner! Fantastic comeback win this is what our country needs! _E_ We have given Syria so much time and information there has never been such an instance in wartime history. Syria is now fully prepared! _E_ Very sad that Republican donors were targeted by Obama's IRS. _E_ Via @Reuters by @sumeet_chat: "Donald Trump plans investment in India betting on Modi government" __HTTP__ _E_ .@CelebApprentice having "top brand impact 2012" ahead of Idol Survivor X Factor & all others has caused quite a stir no surprise! _E_ In other words Russia was against Trump in the 2016 Election and why not I want strong military & low oil prices. Witch Hunt! __HTTP__ _E_ I have always had a good relationship with Chuck Schumer. He is far smarter than Harry R and has the ability to get things done. Good news! _E_ Thank you Arlene! We will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #DrainTheSwamp __HTTP__ _E_ With Luis Mexico and the United States would have made wonderful deals together where both Mexico and the US would have benefitted. _E_ Join me tomorrow in Dubuque Iowa! #IACaucus #Trump2016 __HTTP__ _E_ Everybody is laughing at Jeb Bush spent $100 million and is at bottom of pack. A pathetic figure! _E_ .@TrumpNationalNY is NY's best golf club. A 5 Star Diamond Award winner w/ an elite golf course & top facilities __HTTP__ _E_ Heading to Alabama now big crowd! _E_ Just out according to @CNN: Utah officials report voting machine problems across entire country _E_ Why did Clinton supporter @AlisonForKY declare Crooked Hillary winner in KY when AP hasn't even called the race? _E_ Thank you Michigan! #VoteTrumpMITrump 35%Kasich 17%Cruz 12%Rubio 12%Carson 9% Via: ARG _E_ Thank you. 
__HTTP__ _E_ Thank you Fort Wayne Indiana!#Trump2016 #INPrimary __HTTP__ _E_ Newsstand sales for @VanityFair run by sleepy Graydon Carter are down almost 20%. All he cares about are his bad food restaurants! _E_ Just returned from Asia after 12 very successful days. Great to be home! _E_ Wow even I didn't realize we did so much. Wish the Fake News would report! Thank you. __HTTP__ _E_ RT @WhiteHouse: FACT: when #Obamacare was signed CBO estimated that 23M would be covered in 2017. They were off by 100%. Only 10.3M people... _E_ People who have the ability to work should. But with the government happy to send checks too many of them don't. #TimeToGetTough _E_ Vattenfall CEO stated that the company needed to prepare itself for falling electricity demands in coming years a changing market. _E_ Congratulations to @MichaelPhelps on concluding the greatest Olympic career ever. You have made us all very proud. _E_ The new winner of the @MissTeenUSA pageant K. Lee Graham __HTTP__ _E_ US job cuts jumped 53% in May from April __HTTP__ This is the Obama recovery? _E_ With all of the words President Obama just dispensed at his press conference he didn't say what we all want to hear I'LL STOP THE FLIGHTS _E_ Among the lowest temperatures EVER in much of the United States. Ice caps at record size. Changed name from GLOBAL WARMING to CLIMATE CHANGE _E_ Amnesty is suicide for Republicans.Not one of those 12 million who broke our laws will vote Republican.Obama is laughing at @GOP. _E_ UNBELIEVABLE!Clinton campaign contractor caught in voter fraud video is a felon who visited White House 342 times: __HTTP__ _E_ Crime is out of control and rapidly getting worse. Look what is going on in Chicago and our inner cities. Not good! _E_ Procter and Gamble is relocating its beauty headquarters from Cincinnati to Asia what are we doing?! _E_ 95% of Americans will pay less or at worst the same amount of taxes (mostly far less). The Dems only want to raise your taxes! 
_E_ Today is referendum on ObamaCare Amnesty slow growth having your healthcare dropped & all the other lies. _E_ Join me live in Waukesha Wisconsin for an 8pmE rally! #AmericaFirst #MAGA __HTTP__ _E_ President Obama should stay out of the Hong Kong protests we have enough problems in our own country!Can't even properly police White House _E_ All the haters and losers must admit that unlike others I never attacked dopey Jon Stewart for his phony last name. Would never do that! _E_ FLORIDA Just like TX WE are w/you today we are w/you tomorrow & we will be w/you EVERY SINGLE DAY AFTER to RESTORE RECOVER & REBUILD! __HTTP__ _E_ Whether @RepPaulRyan's plan is sound fiscal policy is not the relevant issue the issue is strategic timing. Why release it now? _E_ I applaud Columbia South Carolina for cleaning up biz center __HTTP__ Will cut crime & advance commerce. _E_ Whether you like it or not Bush also gave us Obama! _E_ It was 25 years ago today that Pan Am flight 103 was downed by a terrorist killing 270 innocent people. @AlexSalmond released the terrorist! _E_ THE HILL'S TWITTER ROOM: Trump: Spitzer Weiner turning New York into 'pervert central' __HTTP__ _E_ RT @DanScavino: Join @realDonaldTrump on his official social media platforms during tonight's debate ~ as @TeamTrump manages rapid response... _E_ Bring in 2014 @TrumpSoHo's NYE soireé NYC's most exclusive New Year's Eve Party w/SoHi & @VeuveClicquot __HTTP__ _E_ I have a surprise for a really special kid on Thursday's episode of @KatieShow with @KatieCouric: __HTTP__ _E_ The Wall is the Wall it has never changed or evolved from the first day I conceived of it. Parts will be of necessity see through and it was never intended to be built in areas where there is natural protection such as mountains wastelands or tough rivers or water..... _E_ Crude is about to pass $90/barrel. The OPEC monopoly must be broken. They are robbing our country blind. 
_E_ Ted Cruz along with Jeb Bush pushed Justice John Roberts onto the Supreme Court. Roberts could have killed ObamaCare twice but didn't! _E_ My thoughts and prayers go out to the @PhillyPolice & @Penn police officers in Philadelphia. __HTTP__ _E_ ....because he doesn't even live there! He wants to raise taxes and kill healthcare. On Tuesday #VoteKarenHandel. _E_ Thank you! #AmericaFirst __HTTP__ _E_ Very little discussion of all the purposely false and defamatory stories put out this week by the Fake News Media. They are out of control correct reporting means nothing to them. Major lies written then forced to be withdrawn after they are exposed...a stain on America! _E_ Rosie O'Donnell went after me again on The View in order to stir up her failing ratings. Nothing will help her @Rosie always fails. _E_ Obama's VA Secretary just said we shouldn't measurewait times. Hillary says VA problems are not 'widespread.' I will take care ofour vets! _E_ Remember I'll see you in D.C. at the Capitol Building on Wednesday at 1:00 o'clock. Then Dallas on Sept.14 at 6:00 P.M. American Air Center _E_ Entrepreneurs: Another question to ask yourself—"What am I pretending not to see?" There may be great opportunities right around you. _E_ Thank you Clive Iowa! __HTTP__ _E_ Weekly jobless claims soared to 21.5% a 6 month high __HTTP__ ObamaCare the greatest job killer in US history. _E_ Jeb is spending millions of dollars on "hit" ads funded by lobbyists & special interests. Bad system. _E_ I find that @Reuters is a far more professional operation than @AP. _E_ Ford is MOVING jobs from Michigan to Mexico AGAIN! __HTTP__ As President this will stop on Day One! Jobs will stay here. _E_ Despite some very corrupt and dishonest media coverage there are many great reporters I respect and lots of GOOD NEWS for the American people to be proud of! _E_ Congratulations to @NewYorkObserver on celebrating its 25 year anniversary. Great paper under amazing management! 
_E_ Hey @realjeffreyross @whitney cummings @lisalampanelli: you call yourselves comedians? #TrumpRoast tonight 10:30/9:30c on @ComedyCentral. _E_ Alert US jobless claims up 46000 to 388000. Really bad news. 7.8% is now a fraud not possible! _E_ The new Congress must restore military spending & stop Obama budget cuts. Also hold Obama accountable on the VA. _E_ So why aren't the Committees and investigators and of course our beleaguered A.G. looking into Crooked Hillarys crimes & Russia relations? _E_ Congratulations to @CNN for having the wisdom to pick TRUMP! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ My review of #TheDarkKnightRises and more in today's #trumpvlog __HTTP__ _E_ Why is it that the Fake News rarely reports Ocare is on its last legs and that insurance companies are fleeing for their lives? It's dead! _E_ See you at 7:00 P.M. tonight Phoenix Arizona! #MAGATickets: __HTTP__ __HTTP__ _E_ I hope Newtown CT can now start to heal—but it won't be easy! _E_ I don't hate Obama at all I just think he is an absolutely terrible president maybe the worst in our history! _E_ The situations in Tulsa and Charlotte are tragic. We must come together to make America safe again. _E_ Without passion you don't have energy and without energy you have nothing! Just one more of my totally brilliant quotes use it well. _E_ .@BillKristol Bill your small and slightly failing magazine will be a giant success when you finally back Trump. Country will soar! _E_ Dopey @billmaher still owes me $5M for charity. I hope he pays up before @hbo fires him which will happen! _E_ But why shouldn't I speak out? Don't you speak out in this country? George Steinbrenner _E_ THANK YOU to all of the great volunteers helping out with #HurricaneHarvey relief in Texas! __HTTP__ _E_ Be careful of an Obama bomb to win election! Would be a horrible thing to do. _E_ We cannot continue to let Israel be treated with such total disdain and disrespect. They used to have a great friend in the U.S. 
but....... _E_ Highly respected economist @Larry_Kudlow is a big fan of my tax plan—thank you Larry. __HTTP__ _E_ THE LAST THING THIS COUNTRY NEEDS IS ANOTHER BUSH! _E_ Doesn't help Kasich to do negative ads on me because he still has to go through everyone else he's almost last. _E_ .@FLGovScott can create tens of thousands of jobs by approving casinos in Miami it's time. @willweatherford _E_ A story in the @washingtonpost that I was close to "rescinding" the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources don't exist! _E_ If Alison Grimes can't admit she voted for Obama even if she is embarrassed then you can't trust her! Vote @Team_Mitch! _E_ The Mullahs laughed when @BarackObama asked Iran to return our drone they will show it to China first. _E_ .@danawhite You have done an amazing job I am proud to have been there at the very beginning! _E_ To the people of Puerto Rico:Do not believe the #FakeNews!#PRStrong _E_ Israel Saudi Arabia and the Middle East were great. Trying hard for PEACE. Doing well. Heading to Vatican & Pope then #G7 and #NATO. _E_ .@MittRomney looks much calmer and Obama should stop nodding his head backwards and forward. _E_ Congressional Black Caucus Chairman Emanuel Cleaver is right. @BarackObama's budget is a nervous breakdown on paper. __HTTP__ _E_ As an honorary Buckeye I want to thank the OH GOP primary voters for putting @MittRomney over the top. It was a crucial win. _E_ We will bring back our jobs. We will bring back our borders. We will bring back our wealth and we will bring back our dreams! _E_ .@JordanSpieth Great playing at the Masters and don't get down Jordan you will win many tournaments and many MAJORS! Keep working hard. _E_ Bill Clinton is right: Obamacare is 'crazy' 'doesn't work' and 'doesn't make sense'. Thanks Bill for telling the truth. 
_E_ Bernie Sanders on HRC: Bad Judgement. John Podesta on HRC: Bad Instincts. #BigLeagueTruth #Debate _E_ Watched Gennady Golovkin @gggboxing at MSG on Saturday night. He was fantastic should fight @FloydMayweather! _E_ He @BarackObama invited his top campaign bundlers and donors to the British State Dinner __HTTP__ So corrupt! _E_ Thehas great strength & patience but if it is forced to defend itself or its allies we will have no choice but to totally destroy #NoKo. __HTTP__ _E_ The Democrats have no message not on economics not on taxes not on jobs not on failing #Obamacare. They are only OBSTRUCTIONISTS! _E_ Via @gulf_news by @JoeHeim: "@IvankaTrump: Giving back is a priority for me" __HTTP__ _E_ The Blue Monster is celebrated in June issue of Robb Report as the Best of the Best winner in Golf Course Category. __HTTP__ _E_ New Day on CNN treats me very badly. @AlisynCamerota is a disaster. Not going to watch anymore. _E_ I bet the terrorists in Libya used weapons we supplied them during their so called 'revolution' to attack our embassy in Benghazi. _E_ Glad the Trans Pacific Partnership failed in the Senate. Bad deal for American worker & economy! We need SMART TRADE! __HTTP__ _E_ In America we don't worship government we worship God. #ValuesVotersSummit __HTTP__ _E_ Hillary's wars in the Middle East have unleashed destruction terrorism and ISIS across the world. _E_ This was the reporters statement when she found out there was tape from my facility she changed her tune. __HTTP__ _E_ #ChrisWallace who interviewed me on Sunday had his highest ratings since Feb of '09. Congratulations! __HTTP__ _E_ New Poll Shows Donald Trump Blowing Everyone Else Out of the Water. __HTTP__ _E_ Thank you Tucson Arizona! A great afternoon with 6000 supporters! #VoteTrump on Tuesday!#MakeAmericaGreatAgain __HTTP__ _E_ The FBI is totally unable to stop the national security leakers that have permeated our government for a long time. They can't even...... 
_E_ To be in charge you have to take responsibility you have to instill confidence. It's like being a conductor set the tempo. _E_ "The harder you work the luckier you get." Gary Player _E_ Listen and learn from others but make your own decisions. Use your instincts you alone know where you want to go. _E_ My successful acquisition of the Kluge estate was a fantastic deal which is already being studied in business schools. _E_ Why doesn't Obama let our marines who are guarding the embassies in Egypt have live ammunition? They need it fast. _E_ RT @NRCC: Good to hear @realDonaldTrump is on board.GOP is the party of free enterprise.Join us as we innovate: __HTTP__ _E_ The Yuan hit another record high against the Dollar. China is laughing at our expense. _E_ I want all Americans to succeed together. President Obama's illegal executive amnesty undermines job prospects for... __HTTP__ _E_ Iowa Congressman @SteveKingIA has endorsed the Newsmax @iontv debate. He has been doing great work in the House. _E_ I see Marco Rubio just landed another billionaire to give big money to his Superpac which are total scams. Marco must address him as SIR ! _E_ All the governors are already backing off of the Ebola quarantines. Bad decision that will lead to more mayhem. _E_ 92 year old registers to vote for first time says will vote for Trump __HTTP__ _E_ So wrong! @BarackObama is hosting China's VP Xi Jinping today at the Pentagon with a full honor ceremony with music and cannons... _E_ "Trump: 'Never Give Up' on Farmland Value Rally" __HTTP__ @TerryBranstad @KimReynoldsIA @ChuckGrassley @SenJoniErnst @BNorthey _E_ Watched protests yesterday but was under the impression that we just had an election! Why didn't these people vote? Celebs hurt cause badly. _E_ Let's not get too excited about Monday's U.S. Supreme Court oral argument on #ObamaCare before the decision. 
No (cont) __HTTP__ _E_ Via @RoyalOakPatch: Oakland County High Schoolers Have Chance to Win $1000 Scholarship & Meet Donald Trump __HTTP__ _E_ Wow 15 policemen hurt in Baltimore some badly! Where is the National Guard. Police must get tough and fast! Thugs must be stopped. _E_ Thank you Concord North Carolina! When WE win on November 8th we are going to Washington D.C. and we are going t... __HTTP__ _E_ ...Mexico cannot believe what they are getting away with and have absolutely no respect for our leader. _E_ .@bigstack19 @realDonaldTrump Does anyone actually read Rolling Stone anymore? Guess they had to create (cont) __HTTP__ _E_ Lance Armstrong was given veryvery bad advice! _E_ President Obama should have gone to Louisiana days ago instead of golfing. Too little too late! _E_ While @BarackObama spends recklessly on domestic projects he is hollowing out our military with over $487B in cuts __HTTP__ _E_ NBC News just called it the great freeze coldest weather in years. Is our country still spending money on the GLOBAL WARMING HOAX? _E_ Unlike crooked Hillary Clinton who wants to destroy all miners I want wages to go up in America. We will do so by bringing back jobs! _E_ "Peace is not absence of conflict it is the ability to handle conflict by peaceful means." – Pres. Ronald Reagan _E_ Just arrived in Cleveland Ohio join Governor @Mike_Pence and I now LIVE via: __HTTP__ _E_ Trump Puerto Rico is 1st development in Puerto Rico to combine lavish residences world class golf & a beach __HTTP__ _E_ Resolve to be bigger than your problems. Who's the boss? Realize that fear is the exact opposite of faith. _E_ I am so proud of my daughter Ivanka. To be abused and treated so badly by the media and to still hold her head so high is truly wonderful! _E_ Governor @Mike_Pence and I will be in Cleveland Ohio tomorrow night at 7pm join us! #MAGATickets:... __HTTP__ _E_ I will be on @meetthepress this morning at various times across the U.S. @NBCNews Enjoy! 
_E_ Trump Doral's renovations are right on schedule __HTTP__ Once completed it will be the top resort in the U.S. _E_ Why does @BarackObama support the radical Islamists in Egypt protests yet has such a high disregard for the Tea Party? _E_ RT @DonaldJTrumpJr: Ironic since Hillary has gotten a lot more of that dark unaccountable money into her campaign. #debates _E_ Watch CNN tomorrow at 2 pm & 5 pm and on Friday at 7 pm & 11 pm for a Thanksgiving Special hosted by John King. I'll be a featured guest. _E_ Why are some more concerned with granting terrorist rights than protecting innocent Americans? _E_ It has been great to meet so many wonderful people in my #TimeToGetTough book signings. Anyone who wants to be Prez should read! _E_ How bad is the New York Times—the most inaccurate coverage constantly. Always trying to belittle. Paper has lost its way! _E_ .@billmaher: Bill you are really beginning to understand what is going on with Trump actually you always knew! _E_ Wow it's now official. ObamaCare website has topped $1B __HTTP__ Will soon be up to $1.5B _E_ ICYMI my speech from this past Saturday at the @NHGOP @FITNsummit via @cspan __HTTP__ _E_ I predicted Rosie O'Donnell would fail at the View and was right. Now I predict Rosie will take over for Brian Williams! _E_ I will be the greatest job producing president in American history. #Trump2016 #VoteTrump __HTTP__ __HTTP__ _E_ Heading to Manassas Virginia for a rally. Will have a moment of silence for the victims of the California shootings. So sad! _E_ Live Free or Die: A motto for the whole country to follow. #NewHampshire #FITN #VoteTrumpNH __HTTP__ _E_ Iraq has granted Iran full air rights to fly over and arm Syria. What did America accomplish with the Iraq war? And now Syria?! _E_ A testament to American ingenuity @TrumpTowerNY shines over Fifth Avenue as one of NYC's most iconic buildings __HTTP__ _E_ Youth unemployment is at a record high. ObamaCare is a job destroyer which is ruining aspiring careers. 
It must be repealed. _E_ New PPP poll just released in Iowa up 6 points from last poll. Leading w/ 28%! Don't worry media won't report it! __HTTP__ _E_ I wish good luck to all of the Republican candidates that traveled to California to beg for money etc. from the Koch Brothers. Puppets? _E_ Spitzer never made 10 cents on his own he worked for his very rich father (a friend of mine who never thought much of Eliot as a businessman _E_ Thanks many are saying I'm the best 140 character writer in the world. It's easy when it's fun. _E_ #trumpvlog My thoughts on @RickSantorum in today's video blog... __HTTP__ _E_ There's nothing "compassionate" about allowing welfare dependency to be passed from generation to generation. Time To Get Tough _E_ Honored to meet w/ Pres Abbas from the Palestinian Authority & his delegation who have been working hard w/everybody involved toward peace. __HTTP__ _E_ Looking forward to visiting Mason City Iowa tomorrow. Will be my 8th day in the Hawkeye State this year! __HTTP__ _E_ What Barbara Res does not say is that she would call my company endlessly and for years trying to come back. I said no. _E_ Nobody cares about the Iowa straw poll is what @JonHuntsman said yesterday. His problem is that nobody cares about his campaign (or him). _E_ Hope you like my nomination of Judge Neil Gorsuch for the United States Supreme Court. He is a good and brilliant man respected by all. _E_ .@WineEnthusiast just awarded Trump Vineyard's Sparkling Reserve 91 points the highest rated wine in Virginia... __HTTP__ _E_ Merry Christmas have an amazing day! _E_ .@DineshDSouza had to give $1000 to @BarackObama's brother for his child's hospital bill __HTTP__ Isn't that disgraceful? _E_ Russia talk is FAKE NEWS put out by the Dems and played up by the media in order to mask the big election defeat and the illegal leaks! _E_ Remember while @BarackObama is lauding himself tonight with self indulgent compliments we have our brave soldiers fighting in Afghanistan. 
_E_ .@CNN should listen. Ana Navarro has no talent no TV persona and works for Bush—a total conflict of interest. __HTTP__ _E_ Just as I have been saying for MANY years and while they phony negotiate with the U.S. over nuclear Iran is taking over Iraq. Really sad! _E_ At some point the Fake News will be forced to discuss our great jobs numbers strong economy success with ISIS the border & so much else! _E_ He @MittRomney is a successful entrepreneur. @BarackObama successfuly ruined America's credit. Easy choice in November. _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ .@Chrysler disputes my statement but watch Chrysler move @Jeep jobs to China after the election. _E_ Nothing is so permanent as a temporary government program. Milton Friedman _E_ Our Native American Senator goofy Elizabeth Warren couldn't care less about the American worker...does nothing to help! _E_ The American worker is being victimized by our trade policies. We need smart trade which can only be accomplished by smart dealmakers. _E_ Incredible handheld video of the Las Vegas Strip in 1969. The skyline looks better with @TrumpLasVegas! __HTTP__ _E_ Senator Ted Cruz has been MATHEMATICALLY ELIMINATED from race. He said Kasich should get out for same reason. I think both should get out! _E_ I have no greater privilege than to serve as your Commander in Chief. HAPPY BIRTHDAY to the incredible men and women @USNavy!#242NavyBday __HTTP__ _E_ Barney Frank admited that ObamaCare does have 'death panels' yesterday. Obamacare must be fully repealed or healthcare will be destroyed. _E_ RT @DRUDGE_REPORT: TRUMP APPROVAL HITS 50% __HTTP__ _E_ Remember if you don't pat yourself on the back nobody else will. Take credit for your successes and don't let others forget!!!!!! _E_ Why isn't the @GOP congress doing everything possible to defund and cut ObamaCare? _E_ These are something I just can't buy. 
Excited for the @usopen __HTTP__ _E_ We must stand firm against the UN's ploy to sabotage Israel if the UN grants the PA statehood then we must immediately defund it. _E_ The state of Virginia economy under Democrat rule has been terrible. If you vote Ed Gillespie tomorrow it will come roaring back! _E_ I've learned that mistakes can often be as good a teacher as success. Jack Welch _E_ Why does @oreillyfactor and @FoxNews always have Karl Rove on. He spent $430 million and lost ALL races. A dope who said Romney won election _E_ Key Obamacare premiums to jump 25% next year: __HTTP__ _E_ The Republican Party needs strong and committed leaders not weak people such as @JeffFlake if it is going to stop illegal immigration. _E_ The @washingtonpost which is the lobbyist (power) for not imposing taxes on #Amazon today did a nasty cartoon attacking @tedcruz kids. Bad _E_ Unfortunately with some men when the poison kicks in (not me of course) there are no rules or guidelines in the military that will stop them _E_ Thank you to our GREAT Military/Veterans and @PacificCommand.Remember #PearlHarbor. Remember the @USSArizona!A day I'll never forget. __HTTP__ _E_ I am now in Palm Beach Florida and will be going to church tonight. MAKE AMERICA GREAT AGAIN! _E_ Good morning. I will be on Fox and Friends at 7.00 (30 minutes). Enjoy! @foxandfriends _E_ This is a great time for @RickSantorum to bow out with dignity. _E_ . @OMAROSA is smart and strategic. People should cut her some slack and respect the way she works on @ApprenticeNBC. _E_ My @SquawkCNBC interview discussing last night's presidential debate my stock picks and tomorrow's big announcement __HTTP__ _E_ Great day for America's future Security and Safety courtesy of the U.S. Supreme Court. I will keep fighting for the American people & WIN! _E_ I am happy to have started #ObamasFavoriteCharity. Really enjoying reading everyone's tweets. _E_ Entrepreneurs: Realize that fear is the exact opposite of faith. 
Resolve to be bigger than your problems. Who's the boss? _E_ My interview with @RealMichaelKay discussing why A Rod should be fired from @yankees & how to terminate his contract __HTTP__ _E_ All seven on line polls including Drudge and Time with thousands of respondents said I won the debate. @krauthammer said I was so so. _E_ The ultra liberal and seriously failing Des Moines Register is BEGGING my team for press credentials to my event in Iowa today but they lie! _E_ How did you like Michelle Obama's bangs last night? _E_ IMPORTANT @RepMattSalmon & @RepEdRoyce will hold a hearing on Oct. 1w/USMC Sgt. Tahmooressi's mother & wife __HTTP__ _E_ We should be able to negotiate a deal with Iran because they know we could blow them away to the Stone Age.They just don't believe we would. _E_ A big POLL will be announced this morning on @CBSNews Face The Nation. I wonder if I do well if the press will report the results? Doubt it _E_ I really enjoyed the debate tonight even though the @FoxNews trio especially @megynkelly was not very good or professional! _E_ Ralph Northamwho is running for Governor of Virginiais fighting for the violent MS 13 killer gangs & sanctuary cities. Vote Ed Gillespie! _E_ The Trump base is far bigger & stronger than ever before (despite some phony Fake News polling). Look at rallies in Penn Iowa Ohio....... _E_ Via @scj by @rodboshart: "Trump: Next president has to be 'great one'" __HTTP__ _E_ Surprise @oreillyfactor used my name big league in pre ads to promote the show—then talked about everyone else but me! _E_ Meet the 'Trumpocrats': Lifelong Democrats Breaking w/ Party Over Hillary to Support Donald Trump for President: __HTTP__ _E_ Yesterday 15 @GOP senators sided with people who got into this country by breaking our laws. _E_ Very proud of our incredible First Lady (@FLOTUS.) She is a truly great representative for our country! 
__HTTP__ _E_ People in our country want borders and without them the old line pols like Crooked Hillary will not win. It is time for CHANGE and JOBS! _E_ #TrumpAdvice __HTTP__ _E_ We're all proud of @erictrump for being on @Forbes 30 Under 30 list. __HTTP__ _E_ .@SenJohnMcCain Thank you for coming to D.C. for such a vital vote. Congrats to all Rep. We can now deliver grt healthcare to all Americans! _E_ General Keith Kellogg who I have known for a long time is very much in play for NSA as are three others. _E_ during a general election. I for one am appalled that somebody that is the nominee of one of our two major parties would take that kind _E_ P.S. There is also something really good to say about humility. Being confident and humble is a great combination maybe the best of all! _E_ I just arrived in Barcelona. I make a big speech tomorrow and then off to Ireland and Scotland. _E_ Via The Hill: Trump Tops National Poll for Second Straight Week __HTTP__ _E_ Will be on Howard Stern at 6.45 A.M. and the Today Show at 8.00 A.M. _E_ How is it possible that the people of the great State of Colorado never got to vote in the Republican Primary? Great anger totally unfair! _E_ Wow Matt Lauer was just fired from NBC for "inappropriate sexual behavior in the workplace." But when will the top executives at NBC & Comcast be fired for putting out so much Fake News. Check out Andy Lack's past! _E_ Via @GolfDigest by @LukeKerrDineen: "@MichaelBreed to open golf academy at Donald Trump's @TrumpFerryPoint" __HTTP__ _E_ Crooked Hillary Clinton is totally unfit to be our president really bad judgement and a temperament according to new book which is a mess! _E_ How can the economy ever recover when @BarackObama keeps threatening the private sector with more taxes. This is no way to spur growth. _E_ Watch Face The Nation will be on now! _E_ For the truth about job creation in America go to __HTTP__ A great site for employers to get the tools & information they need! 
_E_ "Successful people keep moving. They make mistakes but they don't quit." – Conrad Hilton _E_ #TimeToGetTough The crowd at the book signing at Trump Tower in NYC right now... __HTTP__ _E_ Isn't it amazing that Obama "never knew" about the IRS scandals until he saw it in the news?! _E_ Has AG Schneiderman been extorting his targets and their lawyers for contributions? We will find out. _E_ Reminder: The Miss Universe competition will be LIVE from the Bahamas Tonight @ 9pm (EST) on NBC: __HTTP__ _E_ Congratulations to @GatewayPundit on being named the #ROL15 @BreitbartNews award. Well earned & well deserved! _E_ The bigger problem with Ebola is all of the people coming into the U.S. from West Africa who may be infected with the disease. STOP FLIGHTS! _E_ I want to negotiate my own and much better trade deals for our country. MUST INCLUDE CURRENCY MANIPULATION (and more). DO NOT LET PASS! _E_ I find it offensive that Goofy Elizabeth Warren sometimes referred to as Pocahontas pretended to be Native American to get in Harvard. _E_ With championship links @TrumpScotland's world class amenities also include dining & luxury accommodations __HTTP__ _E_ "If you know the enemy and know yourself you need not fear the results of a hundred battles." Sun Tzu _E_ Fear defeats more people than any other one thing in the world. Ralph Waldo Emerson _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ The best way to build a successful business is by results. In the end that is what counts. _E_ Product integration is very important. #CelebApprentice _E_ I only wish my wonderful daughter Tiffany could have been with us at Mar a Lago for our great election victory. She is a winner! _E_ On schedule for 2016 completion @trumpvancouver's 57 story twisting tower will be the icon of Vancouver's skyline __HTTP__ _E_ My interview with @seanhannity discussing this season's @ApprenticeNBC #TimeToGetTough the economy and GOP primary. 
__HTTP__ _E_ Icahn Kravis Zell Buffett have all used the bankrutcy law to their benefit. Many of the top business people do. _E_ The EU just dropped their self imposed carbon tax. I bet they wish they had all that money back! _E_ Wow @FoxNews just reporting big news. Source: Official behind unmasking is high up. Known Intel official is responsible. Some unmasked.... _E_ RT @IvankaTrump: I have long respected India's accomplished and charismatic Foreign Minister @SushmaSwaraj and it was an honor to meet her... _E_ Trump Golf Links at Ferry Point: Grand Opening next Tuesday May 26th at 11 AM. Jack Nicklaus will be joining me. __HTTP__ _E_ Good news for those that want to Make America Great Again I am winning every poll in every STATE and NATIONAL and by big numbers! Thanks _E_ Over the past 11 months I have travelled tens of thousands of miles to visit 13 countries. I have met with more than 100 world leaders and everywhere I traveled it was my highest privilege and greatest honor to represent the AMERICAN PEOPLE! __HTTP__ _E_ The perfect getaway @Trump_Ireland is Europe's most elite 5 star destination perfected with old world luxury __HTTP__ _E_ Control your own destiny or someone else will. @jack_welch _E_ Only three weeks until the new season of @CelebApprentice begins filming great all star cast. _E_ The Trump Organization is honored to have been awarded the redevelopment of The Old Post Office. Will be DC's finest hotel. _E_ Obama is a disaster at foreign policy. Never had the experience or knowledge. He is not capable of doing the job. _E_ RT @AdrianaCohen16: Carly Fiorina no lifeboat for a fast sinking @tedcruz campaign __HTTP__ via @bostonherald @realdonaldtru... _E_ Via @WSJ: "A New Direction For America" by @MittRomney _E_ "The risk of a wrong decision is preferable to the terror of indecision." – Maimonides _E_ Do you think John Kerry is aware of the fact that they are building nuclear weapons in Iran and North Korea and Pakistan already has them!! 
_E_ Fellow inductee @SammartinoBruno and me. #WWEHOF __HTTP__ _E_ While @BarackObama is slashing the military he is also negotiating with our sworn enemy the Taliban who facilitated 9/11. _E_ Congratulations to two great and hardworking guys Corey Lewandowski and David Bossie on the success of their just out book "Let Trump Be Trump." Finally people with real knowledge are writing about our wonderful and exciting campaign! _E_ Why would the people of Florida vote for Marco Rubio when he defrauded them by agreeing to represent them as their Senator and then quit! _E_ Looking forward to addressing the record setting crowd tonight at the New York County Lincoln Day Dinner. Lots to talk about! _E_ Thank you @davidaxelrod for your nice words this morning on @CNN. It was a good night! _E_ Beautiful @MissUSA in @NewYorkPost tomorrow as Audrey Hepburn in front of Tiffany's. _E_ ...Trump/Russia story was an excuse used by the Democrats as justification for losing the election. Perhaps Trump just ran a great campaign? _E_ WOW @foxandfrlends "Dossier is bogus. Clinton Campaign DNC funded Dossier. FBI CANNOT (after all of this time) VERIFY CLAIMS IN DOSSIER OF RUSSIA/TRUMP COLLUSION. FBI TAINTED." And they used this Crooked Hillary pile of garbage as the basis for going after the Trump Campaign! _E_ Dopey Sugar @Lord_Sugar—you are the worst kind of loser—a total fool. _E_ So @BarackObama's campaign is calling @MittRomney a potential criminal __HTTP__ How about Obama's Tony Rezko land deal! _E_ Isn't it funny when a failed Senator like goofy Elizabeth Warren can spend a whole day tweeting about Trump & gets nothing done in Senate? _E_ Excited to be returning to the @NCGOP State Convention as the Keynote of Saturday's dinner! @NCGOP is a strong Conservative state party! _E_ Re Miss Universe Pageant we've spoken w/the LGBT community in Russia who asked "please don't leave it would send the wrong signal." 
_E_ .ccolvinj @AP is one of the truly bad reporters working for an organization that has totally lost its way. Stories are fictional garbage. _E_ Via @theFAMiLYLEADER: "Donald Trump to Speak at The Family Leadership Summit" __HTTP__ Get tix __HTTP__ _E_ Our very stupidly run Country better stop being so politically correct or we won't have a Country to run anymore! _E_ I told you so a long time ago: Iraq just lost second largest city as their soldiers drop their guns and run. Only the beginning! OIL. _E_ Did you ever see a situation so ridiculous as our President explaining what when and where to Congress about a Syrian attack. Far too late! _E_ "Donald Trump on VA woes: 'I'd fire everybody' 'you fix it by getting Trump elected'" __HTTP__ via @washtimes by @dsherfinski _E_ Via @HeraldWeekly by Lauren Odomirok: Trump Norman play renovated golf course __HTTP__ _E_ When nobody wanted the UFC I opened the way by letting them fight at the Trump Taj Mahal in Atlantic City. Dana White has done a great job! _E_ Thank you California! #Trump2016 __HTTP__ __HTTP__ _E_ Have you seen the new #TRUMP line of clothing apparel and fragrances @Macy's? Selling like hotcakes. Great for Christmas gifts etc. _E_ China is happy to learn that @BarackObama plans to borrow another $300 Billion. @BarackObama is their favorite client. _E_ An amazing article by Kevin Gabriel __HTTP__ A must read by friends and foes of President Obama. End date is tomorrow at noon. _E_ Young entrepreneurs: Your success is measured by results. Be productive in the face of challenges. Setbacks are not fatal. _E_ The @SenTedCruz endorsement was a wonderful surprise. I greatly appreciate his support! We will have a tremendous victory on November 8th. _E_ I was at @FoxNews and met Juan Williams in passing. He asked if he could have pictures taken with me. I said fine. He then trashes on air! _E_ I watched @BarackObama at the National Prayer Breakfast and he looked totally uncomfortable with his words. 
(cont) __HTTP__ _E_ Totally false reporting on my call with @Reince Priebus. He called me ten minutes said I hit a "nerve doing well end! _E_ I agree Mike thank you to all of our law enforcement officers! #VPDebate Police officers are the best of us... @Mike_Pence _E_ Happy #NationalFarmersDay!📸 __HTTP__ __HTTP__ _E_ I wonder who @ArsenioHall's first guest will be his show will be great! _E_ Thank you New Hampshire! #FITN #NHPrimary #VoteTrumpNH Voting questions? __HTTP__ __HTTP__ _E_ .@joycefinance #asktrump __HTTP__ _E_ Big interview tonight by Henry Kravis at The Business Council of Washington. Looking forward to it! _E_ I'll be signing copies of my new book Time To Get Tough tomorrow in Trump Tower 11 am to 2 pm. Hope to see you there. _E_ My wife @MELANIATRUMP will be #OnTheRecord w/ @greta tonight at 7pmE on @FoxNews. Enjoy! __HTTP__ __HTTP__ _E_ Governor @RicardoRossello We are with you and the people of Puerto Rico. Stay safe! #PRStrong _E_ #noratings @Lawrence will soon be off tv bad ratings he has a face made for radio. _E_ I will be interviewed on @foxandfriends at 7:30 A.M. Enjoy! _E_ When terrorists are beheading and executing American citizens in such a brutal waythe report on torture should be the least of our concerns _E_ Sometimes by losing a battle you find a new way to win the war. _E_ With the number of tweets sad sack @Rosie has done she has totally lost control of herself hopefully not a breakdown. _E_ Thank you. __HTTP__ _E_ Thank you to Shawn Steel for the nice words on @FoxNews. _E_ Thanks. __HTTP__ _E_ More dead people voted in the last election than enrolled in ObamaCare. Congratulations America! _E_ HYPOCRITE! Long before @BarackObama called the Tea Party 'teabaggers' he dressed as a revolutionary in a Hyde Park rally __HTTP__ _E_ China is about to acquire a unit of AIG which we bailed out for $5.5B __HTTP__ China is making great deals on our backs. 
_E_ Rising 70 stories over Panama Bay @TrumpPanama offers our elite amenities in Latin Americas tallest building __HTTP__ _E_ Hope everyone enjoyed their Thanksgiving. But get ready our country is in big trouble! _E_ Spent time with Indiana Governor Mike Pence and family yesterday. Very impressed great people! _E_ Democrats try so hard to mock & belittle Republicans—& the Republicans just don't fight back—no energy! _E_ We must keep the pressure on @BarackObama's administration to make sure Chen comes to the US. It would be a tragedy to abandon him in China. _E_ For more information on tonight's two hour telethon 8 to 10 p.m.: __HTTP__ _E_ ...The fact is that Puerto Rico has been destroyed by two hurricanes. Big decisions will have to be made as to the cost of its rebuilding! _E_ I wonder if @BarackObama ever applied to Occidental Columbia or Harvard as a foreign student. When can we see (cont) __HTTP__ _E_ ... in order to occupy space in a truly ugly office building in a much worse location! _E_ For all of those that think life is easy & don't want to work remember: HOPE IS THE POOR MAN'S BREAD. _E_ .@KellyannePolls Kellyanne you were fantastic on @meetthepress today. Keep going I will win for the people. MAKE AMERICA GREAT AGAIN! _E_ A great day in both Spencer & Davenport Iowa! THANK YOU for the support! #Trump2016 #FITN #IAPolitics __HTTP__ _E_ Thank you Mississippi! #Trump2016 _E_ Head on over to my Facebook page to have your questions answered in the next #AskTheDonald __HTTP__ _E_ Thank you Anthony @Scaramucci @WSJ The Entrepreneur's Case for Trump __HTTP__ _E_ NO MERCY TO TERRORISTS you dumb bastards! _E_ Thank you to respected columnist Katie Hopkins of Daily __HTTP__ for her powerful writing on the U.K.'s Muslim problems. _E_ Two great people! __HTTP__ _E_ With a record deficit and $15 trillion in debt @BarackObama is spending $4 million of our money on his Hawaii vacation. Just plain wrong. 
_E_ I'm very proud of the work my son @EricTrump has been doing with the @EricTrumpFDN take a look... __HTTP__ _E_ He @RickSantorum has as much chance of being the GOP nominee as @Rosie does of ever having a successful (cont) __HTTP__ _E_ Via @DailyCaller by @alweaver22:"Trump: Obama One Of 'The Worst Things That's Ever Happened To Israel'" __HTTP__ _E_ Pleasure in the job puts perfection in the work. Aristotle _E_ Technology has shown we have tremendous energy resources right under our feet that we didn't know about 5 years ago. _E_ So they caught Fake News CNN cold but what about NBC CBS & ABC? What about the failing @nytimes & @washingtonpost? They are all Fake News! _E_ Success is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_ Entrepreneurs: Success is good. Success with significance is even better. Make your work count. _E_ ...well into our 4th week of shooting the record 13th season of @CelebApprentice. The 'All Stars' are hard at work... _E_ Every poll Time Drudge Slate and others said I won both debates but heard Megyn Kelly had her two puppets say bad stuff. I don't watch _E_ RT @AnnCoulter: Anyone who plans to talk about Trump ever again has to see this speech. Your opinion is irrelevant unless you listened to... _E_ RT @NFIB: .@NFIB encouraged by @realDonaldTrump's #taxplan says #smallbiz would benefit from lower tax rate: __HTTP__ _E_ Departing Farmers Round Table in Boynton Beach Florida. Get out & VOTE lets #MAGA! EARLY VOTING BY FL. COUNTY:... __HTTP__ _E_ Via @BreitbartNews by @mboyle1: DONALD TRUMP: MSM INVESTIGATION INTO SCOTT WALKER'S COLLEGE A 'DOUBLE STANDARD' __HTTP__ _E_ Woody Johnson owner of the NYJets is @JebBush's finance chairman. If Woody would've been w/me he would've been in the playoffs at least! _E_ RT @realDonaldTrump: Happy Birthday @DonaldJTrumpJr! __HTTP__ _E_ The Fed is considering issuing even more US bond debt into the market. Not good! 
_E_ The United Nations has such great potential but right now it is just a club for people to get together talk and have a good time. So sad! _E_ My @marklevinshow interview discussing Obama's SOTU Rove's attack on the Tea Party & All Star @ApprenticeNBC __HTTP__ _E_ .@mcuban says he is a member of Dallas National but doesn't play golf. Who is a member of a golf club that doesn't play?? No talent! @TMZ _E_ Why doesn't President Obama simply apologize for telling a big fat lie announce that ObamaCare was a mistake and deal a really great plan! _E_ WHAT THEY ARE SAYING ABOUT THE CLINTON CAMPAIGN'S ANTI CATHOLIC BIGOTRY: __HTTP__ _E_ .@brithume I am in first place by a lot in all polls tied for first place with Ben Carson in one Iowa poll. I thought you knew this thanks _E_ O.K. Christmas is over now we can all go back to the wars of life. Focus focus focus never accept defeat push hard for total victory! _E_ I will be on Face the Nation with John Dickerson on CBS this morning. Enjoy! _E_ Crooked Hillary Clinton Tops Middle East Forum's 'Islamist Money List' __HTTP__ _E_ Trump Int'l Palm Beach offers a spectacular course with hill vistas bunkers and incredible water features. __HTTP__ _E_ I'm sick of always reading about outsourcing. Why aren't we talking about 'onshoring' or 'insourcing?' We need (cont) __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Wonder if Obama will ever say RADICAL ISLAMIC TERRORIST? _E_ Lots of response that Obama should give the $5M to the families of our great heroes who were murdered in Benghazi. _E_ Before Kids Can Go Places They Need a Place To Go the motto of The Police Athletic League an organization I'm very proud to support _E_ THE DONALD J. TRUMP PRESIDENTIAL EXPLORATORY COMMITTEE __HTTP__ _E_ What can you learn today that you didn't know before? Set the bar high do the best you possibly can. _E_ Via @golf_com by @joepassov: "@TrumpFerryPoint Will Be One of Nation's Best Public Courses" __HTTP__ _E_ 'U.S. 
Industrial Production Surged in April' __HTTP__ _E_ My complaint against @AGSchneiderman is a "case study" for JCOPE & Moreland Commissions on everything that is wrong with NYS politics. _E_ Thank you Washington! Honored to say on behalf of our great movement we have broken the all time record for votes in GOP primary history. _E_ North Korea can't survive or even eat without the help of China. China could solve this problem with one phone call they love taunting us! _E_ Get Snowden back from Russia—he has done tremendous damage to the US & should pay a very heavy price. _E_ Thank you Newt! __HTTP__ _E_ As soon as John Kasich is hit with negative ads he will drop like a rock in the polls against Crooked Hillary Clinton. I will win! _E_ Great Town Hall tonight at 10:00 P.M. (Eastern) conducted by @seanhannity on @FoxNews _E_ Had a great time on @gretawire's inaugural 7PM show. Congrats to Greta on the new spot! _E_ Pervert alert. @RepWeiner is back on twitter. All girls under the age of 18 block him immediately. _E_ ...lottery continues deadly catch and release and bars enforcement even for FUTURE illegal immigrants. Voting for this amendment would be a vote AGAINST law enforcement and a vote FOR open borders. If Dems are actually serious about DACA they should support the Grassley bill! _E_ Obama must now FOCUS get his mind off March.Madness and LEAD! Watch Russia closely work hard on the economy and get rid of ObamaCare! _E_ Once ObamaCare is fully enacted in NY conveniently after 2014 expect higher premiums bigger deductibles & worse care. Job killer! _E_ Well Obama refused to say (he just can't say it) that we are at WAR with RADICAL ISLAMIC TERRORISTS. _E_ I don't know Putin have no deals in Russia and the haters are going crazy yet Obama can make a deal with Iran #1 in terror no problem! _E_ For what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_ Doing a commercial for @NFLONFOX lots of fun! 
__HTTP__ _E_ Congress get ready to do your job DACA! _E_ .@latoyajackson is once again at the top of her game in the upcoming All Star season of @CelebApprentice. Amazing in the boardroom... _E_ MAKE AMERICA GREAT AGAIN! MAKE AMERICA SAFE AGAIN! _E_ I am honored that Texas supporters have filed papers in Texas to create Make America Great Party on my behalf. __HTTP__ _E_ "The Constitution is the guide which I never will abandon" George Washington _E_ .@MichelleMalkin would be nothing without being on the @seanhannity show. I don't see what Sean sees in her—loser! _E_ Entrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_ Obama & Clinton should stop meeting with special interests & start meeting with the victims of illegal immigration. _E_ Golf Odyssey one of golf's most respected publications just named Trump International Golf Links Scotland golf course of the year _E_ Just landed in Iowa speaking soon! _E_ It all comes down to one simple question: How much money can you stand to lose? That's how much risk you should assume. _E_ What do you think of Gary's definition of f u n? _E_ Trump International Hotel & Tower Toronto continues to receive accolades. Great city great hotel. __HTTP__ #TrumpToronto _E_ RT @detroitnews: .@IvankaTrump in Michigan: 'This is your movement' __HTTP__ @realDonaldTrump __HTTP__ _E_ .@TPNNtweets Donald Trump Tells A Fascinating Inside Story About His Dealings w/ The Obama WH __HTTP__ @johnhawkinsrwn _E_ If Chelsea Clinton were asked to hold the seat for her motheras her mother gave our country away the Fake News would say CHELSEA FOR PRES! _E_ Sadly this kind of stuff even happened to Ronald Reagan. There is nothing nice about it! 
#MakeAmericaGreatAgain __HTTP__ _E_ On Sunday Jerome Bettis 'the bus' from the Pittsburgh Steelers will play at Trump Int'l Golf Club/Palm Beach against Julius Erving 'Dr J' _E_ I will be interviewed on The O'Reilly Factor this evening at 8 pm on the Fox News Channel. @oreillyfactor _E_ Our legal system is broken! 77% of refugees allowed into U.S. since travel reprieve hail from seven suspect countries. (WT) SO DANGEROUS! _E_ NY Jets center Nick Mangold interns for Trump. Watch Trump's Fabulous World of Golf tonight 9PM ET on Golf Channel __HTTP__ _E_ With 49 days until the election @MittRomney needs to stay on offense. He should not be apologizing. Deflect onto Obama's record. _E_ Obama's motto: If I don't go on tax payer funded vacations & constantly fundraise then the terrorists win. _E_ Does everyone remember @MittRomney and his famous remarks about self deportation and 47% . He was done. I don't need his angry advice! _E_ Obama told Medvedev after the '12 reelect he would "have more flexibility." It was music to Putin's ears. _E_ Amazing. @CelebApprentice has started filming our record 13th season this week thanks to our big and very loyal fan base. _E_ The results are in on the final debate and it is almost unanimous I WON! Thank you these are very exciting times. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Via @Newsmax_Media: "Donald Trump 2016: 8 Facts About Personal Life of GOP Presidential Hopeful" __HTTP__ _E_ #CrookedHillary __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ The Costa Concordia shipwreck is a MONUMENT TO STUPIDITY but the uprighting of the ship is a MONUMENT TO GENIUS! _E_ Wealth comes from big goals and sustained action toward those goals every day. Think Big _E_ Wow you are all correct about @FoxNews totally biased and disgusting reporting. _E_ My @971FMTalk int. 
with @DLoesch on #HandsOffMyGun 2014 election results stopping Obamacare new Senate & 2016 __HTTP__ _E_ Via @CarolinaLive by @JoelAllenWPDE:"Big names wrap up largest ever SC Tea Party Coalition Convention" __HTTP__ _E_ See story in The Scotsman re: wind turbines __HTTP__ _E_ I answered your questions in today's video... watch at __HTTP__ _E_ Will be interviewed by @andersoncooper on @CNN tonight. Let's see if he treats me fairly—enjoy! _E_ A wonderful evening in South Carolina big crowd amazing energy! _E_ The truly great Phyllis Schlafly who honored me with her strong endorsement for president has passed away at 92. She was very special! _E_ President Obama spoke last night about a world that doesn't exist. 70% of the people think our country is going in the wrong direction. #DNC _E_ Via cnsnews by @SJonesCNS: "Trump Explains His Appeal: 'People Are Tired...Of These Incompetent Politicians'" __HTTP__ _E_ Glad to hear @SethMacFarlane will be hosting this year's Oscars. Something new that should be fun. _E_ My thoughts on @barackobama's campaign.... __HTTP__ _E_ See Sanders backed Hillary on E mails at the debate hurting himself and then she threw him under the bus (but failed). Disloyal person! _E_ "You measure your people and you take action on those that don't measure up." @jack_welch _E_ Why gas prices will rise Miss Canada/Miss Universe and #CelebApprentice in today's #trumpvlog... __HTTP__ _E_ We must leave stop and frisk for A Rod and Anthony Weiner! _E_ RT @Scavino45: .@POTUS & @FLOTUS w/ @LVMPD Officer Cook 2nd day on job received gunshot wound to the right chest & right arm saving live... _E_ I had a great time in D.C. yesterday at the Trump International Hotel OPO groundbreaking ceremony. Watch __HTTP__ _E_ Mike Pence won big. We should all be proud of Mike! _E_ Going to the White House is considered a great honor for a championship team.Stephen Curry is hesitatingtherefore invitation is withdrawn! 
_E_ Why would smart voters want to put Democrats in Congress in 2018 Election when their policies will totally kill the great wealth created during the months since the Election. People are much better off now not to mention ISIS VA Judges Strong Border 2nd A Tax Cuts & more? _E_ Don't let Obama buy the election by handing out unlimited free money to states. _E_ Thank you New Hampshire! Departing with my amazing family now! #FITN #NHPrimary __HTTP__ __HTTP__ _E_ "Trump signs lease for a NH office returns Monday" __HTTP__ via @UnionLeader by @tuohy _E_ HillaryClinton can illegally get the questions to the Debate & delete 33000 emails but my son Don is being scorned by the Fake News Media? _E_ #FlashbackFriday @kimkardashian on the set of @ApprenticeNBC __HTTP__ _E_ "Fortunately for a quarterback you can play for a long time because you don't get hit very often." – Tom Brady @SuperBowl @Patriots _E_ It should be mandatory that all haters and losers use their real name or identification when tweeting they will no longer be so brave! _E_ .@RudyGiuliani one of the finest people I know and a former GREAT Mayor of N.Y.C. just took himself out of consideration for State . _E_ Wow @megynkelly really bombed tonight. People are going wild on twitter! Funny to watch. _E_ Mitt Romney gave a masterful speech this weekend at Liberty University with a wonderful introduction by Mark DeMoss. Well done. @MittRomney _E_ If you don't believe in yourself no one else will. _E_ When will Washington stand up to China. China is manipulating its currency and stealing our jobs. Washington should move on legislation. _E_ I am in Colorado big day planned but nothing can be as big as yesterday! _E_ Ted Cruz has now apologized to Marco Rubio and Ben Carson for fraud and dirty tricks. No wonder he has lost Evangelical support! _E_ .@CNN Why is somebody (Beck) I beat so soundly all of a sudden an expert on Donald Trump (all over television). She knows nothing about me. 
_E_ We may get out of ObamaCare because the train wreck is impossible to implement __HTTP__ It is a disaster. _E_ I will be interviewed on @foxandfriends at 7:00 this morning. Plenty to talk about! _E_ Thank you for a great evening Laconia New Hampshire will be back soon! #AmericaFirst __HTTP__ __HTTP__ _E_ Hillary Clinton's open borders immigration policies will drive down wages for all Americans and make everyone less safe. _E_ I fought hard against Spitzer and Weiner and both lost. For a while when Spitzer was way up it seemed that I was a lone voice! Good power _E_ Thank you America!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ A great great honor to welcome & recognize the National Teacher of the Year as well as the Teacher of the Year fro... __HTTP__ _E_ Believe and act as if it were impossible to fail. Charles F. Kettering _E_ I will and I agree! RT @ZacharyQuinto@realdonaldtrump you can't possibly make any more money. so why don't you make a difference instead?! _E_ Thank you to President Moon of South Korea for the beautiful welcoming ceremony. It will always be remembered. __HTTP__ _E_ I can't believe that Prime Minister @David_Cameron is giving massive subsidy to Scotland to destroy itself with windfarms. _E_ TRAIN WRECK just the beginning. Our roads airports tunnels bridges electric grid all falling apart.I can fix for 20% of pols & better _E_ Wow such sacrfices for his re election. @BarackObama will not vacation in Martha's Vineyard this summer. __HTTP__ _E_ Lightweight @AGSchneiderman is driving business & jobs out of NY. Only wants self publicity—a total loser! _E_ I (we) broke the all time record for most votes gotten in a Republican Primary by a lot and with many states left to go! Thank you. _E_ A strong military will stop wars. Peace through Strength! Let's Make America Great Again! __HTTP__ _E_ If a conservative Republican made the mistake that Mrs. Obama just made by calling Braley by the wrong name it would be the biggest story! 
_E_ .@usgsa A momentous day. Great job on Old Post Office we will make you proud! _E_ Can you believe that the Afghan war is our "longest war" ever—bring our troops home rebuild the U.S. make America great again. _E_ Everyone here is talking about why John Podesta refused to give the DNC server to the FBI and the CIA. Disgraceful! _E_ #TBT With James Lipton on the set of @ApprenticeNBC __HTTP__ _E_ Thousands of e mails from folks urging me to seek the Americans Elect Presidential nomination. _E_ .@dubephnx If we didn't remove incredibly powerful fire retardant asbestos & replace it with junk that doesn't (cont) __HTTP__ _E_ Scotland does not have free press even when you are just stating the facts it's crazy! _E_ The original Apprentice is coming back do you have what it takes to be the next Apprentice? For casting details: __HTTP__ _E_ We did it! Thank you to all of my great supporters we just officially won the election (despite all of the distorted and inaccurate media). _E_ Nothing conservative about the Club for Growth coming into my office and demanding a $1M contribution which naturally they did not get. _E_ What other country tells the enemy when we are going to attack like Obama is doing with ISIS. Whatever happened to the element of surprise? _E_ ...NFL attendance and ratings are WAY DOWN. Boring games yes but many stay away because they love our country. League should back U.S. _E_ I'm going to the @Yankees game tonight to root them on they always win when I am there. _E_ As China is building an air and naval force @BarackObama is cutting ours. __HTTP__ He is weakening our national security. _E_ Get out tomorrow and vote so that we can all finally say those magic words __HTTP__ _E_ It was a true honor to be at Yokota Air Base with our GREAT @USForcesJapan! __HTTP__ _E_ Iraq is more dangerous today than any time under Saddam. War was a mistake as I said from the very beginning. Bush & Obama should apologize _E_ I guess they have Lance Armstrong cold. 
Brutal report. A waste of taxpayer money to take down an American hero. _E_ Heading to Iowa join me today at noon! #MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_ Putin & I discussed forming an impenetrable Cyber Security unit so that election hacking & many other negative things will be guarded.. _E_ RT @foxandfriends: President Trump to sign an executive order on religious liberty today the National Day of Prayer | @kevincorke __HTTP__ _E_ .@FoxNews is changing their theme from fair and balanced to unfair and unbalanced. But dying @WSJ is worse.Their phony poll is a joke! _E_ .@JohnLegere @TMobile John focus on running your company I think the service is terrible! Try hiring some good managers. _E_ Be sure to get a copy of @williebosshog's new book American Hunter. _E_ The Arab Spring is not working out so well nice name bad results! _E_ Join me in Redding California tomorrow at 1:00pm. #Trump2016Tickets: __HTTP__ _E_ The next generation of luxury @TrumpVancouver will be the icon of the Vancouver skyline __HTTP__ _E_ Just landed in Iowa. See everyone soon! #MAGA _E_ We are taking action to #RepealANDReplace #Obamacare! Contact your Rep & tell them you support #AHCA. #PassTheBill... __HTTP__ _E_ Bay Bridge in San Fransisco built in China keeps getting worse. Cost overruns are out of control China is having a field day with us! _E_ Many reports of peaceful protests by Iranian citizens fed up with regime's corruption & its squandering of the nation's wealth to fund terrorism abroad. Iranian govt should respect their people's rights including right to express themselves. The world is watching! #IranProtests _E_ .@KatyTurNBC 3rd rate reporter & @SopanDeb @ CBS lied. Finished in normal manner&signed autos for 20min. Dishonest! __HTTP__ _E_ We cannot take four more years of Barack Obama and that's what you'll get if you vote for Hillary. #BigLeagueTruth _E_ Remember when Obama promised "you can keep your health care plan?" Not in these 10 states. 
__HTTP__ Another lie. _E_ My plan will lower taxes for our country not raise them. Phony @club4growth says I will raise taxes—just another lie. _E_ Congratulations to new Congressman @leezeldin being named to House Foreign Affairs Comm. and co chair the House Republican Israel Caucus. _E_ Welcome to the new reality. Goldman Sachs just based their new Asia Pacific chairman not in Tokyo but Beijing. __HTTP__ _E_ Doing Fox and Friends at 7.00 A.M. Hope you loved Apprentice last night. _E_ We must stop being politically correct and get down to the business of security for our people. If we don't get smart it will only get worse _E_ Today I announced our strategy to confront the Iranian regime's hostile actions and to ensure that they never acquire a nuclear weapon. __HTTP__ _E_ The Tax Cut Bill is coming along very well great support. With just a few changes some mathematical the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well! _E_ How crazy 7.5% of all births in U.S. are to illegal immigrants over 300000 babies per year. This must stop. Unaffordable and not right! _E_ Entrepreneurs: Whatever happens you're responsible. If it doesn't happen you're responsible. _E_ Can anyone imagine Chafee as president? No way. _E_ RT @foxandfriends: President Trump officially nominates former Assistant Attorney General Christopher Wray to head the FBI __HTTP__ _E_ ....it is very possible that those sources don't exsist but are made up by fake news writers. #FakeNews is the enemy! _E_ THE U.S.G.A. Boy's Junior Champion at Trump National Golf Club Bedminster just won The Australian Open. We are proud of you @JordanSpieth _E_ Iraq is being ravaged by Al Qaeda. Country in utter chaos & all oil is going to Iran & China __HTTP__ Terrible mistake! 
_E_ Letterman @Late_Show had Brian Williams @NBCNightlyNews as guest last night I was on last Thursday _E_ .@lightjzup Industrial turbines are destroying our land. _E_ Thanks. __HTTP__ _E_ Biggest story in politics is now happening in the great State of Colorado where over one million people have been precluded from voting! _E_ A pessimist is one who makes difficulties of his opportunities... _E_ To all of those who asked I predicted two weeks ago and again last night that Dwight Howard would go to Houston.Do I get congrats insight? _E_ Incredibly proud of my son @EricTrump & his efforts on behalf of @StJude in Memphis TN. __HTTP__ __HTTP__ _E_ If Christian Bale turned down $50M to return as Batman he should have his head examined. What was he thinking?! _E_ They should have rebuilt the two buildings of the World Trade Center exactly as they were except taller and stronger. A better statement! _E_ Can you imagine trading five really bad enemies of the U.S. for the freedom of traitor Bergdahl. Just another bad deal! _E_ RT @PChowka: Sean Hannity's Big Week Top Ratings Probing Reporting and Let There Be Light at American Thinker __HTTP__ h... _E_ Whether you think you can or think you can't you're right. Henry Ford _E_ Excited and honored to be addressing @theFAMiLYLEADER summit in Iowa this August. __HTTP__ _E_ Going to New Hampshire in a little while. Big crowds! #MakeAmericaGreatAgain! _E_ In real estate all locations can be enhanced through good marketing. Be smart! _E_ Why does @CNN bore their audience with people like @secupp a totally biased loser who doesn't have a clue. I hear she will soon be gone! _E_ My @foxandfriends int. @FoxNewsInsider "'Once a Choker Always a Choker': DJT Takes Credit for Romney Dropping Out" __HTTP__ _E_ The media has been speculating that I fired Rex Tillerson or that he would be leaving soon FAKE NEWS! 
He's not leaving and while we disagree on certain subjects (I call the final shots) we work well together and America is highly respected again! __HTTP__ _E_ Let's not start celebrating over Libya until we see who takes over. _E_ "@NMoralesNBC @ThomasARoberts to Host 63rd Annual @MissUniverse" __HTTP__ via @TheWrap by @AnthonyMaglio _E_ My @FoxBusiness int. w/Don Imus on not drinking alcohol politicians being all talk and no action & the border __HTTP__ _E_ .@GretchenCarlson's memoir is a powerful example of perseverance & hope. "Getting Real" is as real as it gets. Get it & enjoy! #GettingReal _E_ Today we gathered in the Roosevelt Room for one single reason: to CUT THE RED TAPE! For many decades an ever growing maze of regs rules and restrictions has cost our country trillions of dollars millions of jobs countless American factories & devastated entire industries. __HTTP__ _E_ I will start reviewing various political reporters etc & websites as to their professionalism & fairness—many people asking for this. _E_ "If you are passionate about your endeavors it will be reflected back to you in your end result." – Trump Never Give Up _E_ There's a lot going on at the Eric Trump Foundation ... __HTTP__ _E_ #GOPDebate #GoogleTrends __HTTP__ _E_ Great day for Tax Cuts and the Republican Party. But the biggest Winner will be our great Country! _E_ Ask: Is there anyone else who can do this better than I can?That's just another way of saying know yourself & know your competition. _E_ Assad will never give up his chemical weapons. He has spent years and billions accumulating them. This is all a ruse. _E_ "When you expect things to happen strangely enough they do happen." J. P. Morgan _E_ 'U.S. Murders Increased 10.8% in 2015' via @WSJ: __HTTP__ _E_ .@DennisRodman re @Omarosa is right she's becoming predictable. 
_E_ I know you will enjoy reading my tax plan __HTTP__ #MakeAmericaGreatAgain _E_ .@TrumpGolfLA is @theknot's pick for the Best of Weddings with our Vista Terrace looking over the Pacific Ocean __HTTP__ _E_ Hillary said such nasty things about me read directly off her teleprompter...but there was no emotion no truth. Just can't read speeches! _E_ RT @jessebwatters: Thanks for watching!! __HTTP__ _E_ Cruz said Kasich should leave because he couldn't get to 1237. Now he can't get to 1237. Drop out LYIN' Ted. _E_ Wise words from my mother: "Trust in God and be true to yourself." Mary MacLeod Trump _E_ I pick the best locations @Trump_Charlotte has incredible views of beautiful Lake Norman. __HTTP__ _E_ Jon Stewart is the most overrated joke on television. A wiseguy with no talent. Not smart but convinces dopes he is! Fading out fast. _E_ I am self funding my campaign and am therefore not controlled by the lobbyists and special interests like lightweight Rubio or Ted Cruz! _E_ The Justice Department's investigation into the national security leaks is not independent. This is a very grave situation. _E_ Thank you America! #Trump2016 __HTTP__ _E_ Very excited to be returning to Iowa tomorrow to campaign for my friend & strong Conservative leader @SteveKingIA! _E_ Clear winner of the #GOPDebate. Thank you for your support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Thank you! #GOPDebate __HTTP__ _E_ "Donald Trump hosts first ever 'Trump Invitational' at Mar a Lago" __HTTP__ via @WPTV _E_ RT @charliekirk11: ISIS getting slaughtered: Square miles liberated from ISISTrump: 26000 Obama: 13200Total Square miles held by... _E_ When it comes to money finance and even life PROTECT THE DOWNSIDE AND THE UPSIDE WILL TAKE CARE OF ITSELF! _E_ Young entrepreneurs – always remember in negotiations that sometimes the best deal you make is the one you walk away from. 
_E_ #MakeAmericaSafeAgain __HTTP__ _E_ .@AlexSalmond is making a truly stupid mistake by forcing ugly industrial wind turbines down Scotland's throat –he's hated for it. _E_ Via @theblaze by @BillyHallowell:"DONALD TRUMP BLASTS OBAMA FOR FAILING TO SECURE CHRISTIAN PASTOR'S FREEDOM IN IRAN" __HTTP__ _E_ .@alexsalmond RT @RichWaugaman This time I agree 100% I never knew how useless a wind turbine was until I (cont) __HTTP__ _E_ The stage is set for the real debate it will be very interesting! _E_ Blackdog Scotland started a petition against @VattenfallGroup. __HTTP__ _E_ Thank you Arizona! See you soon!#MakeAmericaGreatAgain __HTTP__ _E_ Venezuela should allow Leopoldo Lopez a political prisoner & husband of @liliantintori (just met w/ @marcorubio) o... __HTTP__ _E_ Hillary said she was under sniper fire (while surrounded by USSS.) Turned out to be a total lie. She is not fit to... __HTTP__ _E_ The Republicans who want to cut SS & Medicaid are wrong. A robust economy will Make America Great Again! __HTTP__ _E_ In August 2012 Obama said the so called Arab Spring sprung from 'joyful longing for human freedom' __HTTP__ Good call! _E_ RT @DonaldJTrumpJr: Not surprising at all! Father Of Otto Warmbier: Obama Admin Told Us To Keep Quiet Trump Admin Brought Him Home __HTTP__ _E_ Over 2 million people have lost their jobs since @BarackObama became POTUS. How many of them still have healthcare? _E_ Little Mac Miller's next album may bomb. He can't use my name again for sales. _E_ I took some heat a long time ago when I said that George Zimmerman was a sicko and bad news. I know people and this guy is no good trouble! _E_ #ThankYouTour2016 12/6 North Carolina __HTTP__ Iowa __HTTP__ Michiga... 
__HTTP__ _E_ Be sure to download my new The Celebrity Apprentice app to begin interacting with this Sunday's episode __HTTP__ _E_ While Hillary and I both won South Carolina by big margins Repubs got far more votes with a massive increase from past cycles.GROWING PARTY _E_ I would love to be at the Cadillac World Golf Championship @TrumpDoral in Miami but even more so in Orlando with the #TrumpTrain! _E_ The Fed's actions these past 3 years could bring record high inflation in the near future. That would be (cont) __HTTP__ _E_ No surprise that China was caught cheating in the Olympics. That's the Chinese M.O. Lie Cheat & Steal in all international dealings. _E_ Proud of @IvankaTrump for her leadership on these important issues. Looking forward to hearing her speak at the W20! __HTTP__ _E_ Thank you Nashua New Hampshire! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Going over to @ABC to do LIVE at 9:00. _E_ Thank you to everyone for all of the nice comments by Twitter pundits and otherwise for my speech last night. _E_ Obamacare puts poor people on a form of government run single payer health insurance that many doctors don't take @Avik _E_ .@FLGovScott Gaming states are laughing at stupidity of not approving gaming in FL—they're afraid of Miami—can't believe their luck! _E_ Trump University has a 98% approval rating. I could have settled but won't out of principle! _E_ .@Toure I felt very sorry for you during your meltdown on @PiersMorgan. He drove you insane but of course Piers is a lot smarter than you _E_ Rumor has it that the grubby head of failing @VanityFair Magazine Sloppy Graydon Carter is going to be fired or replaced very soon? _E_ Via @wmbfnews: Donald Trump puts Tea Party on map for 2016 __HTTP__ _E_ Bad news for @BarackObama. 
@gallupnews reports that the economy (71%) and gas prices (65%) are Americans' top (cont) __HTTP__ _E_ #CaucusForTrump #Trump2016 __HTTP__ _E_ The five Taliban leaders released for a deserter must really be laughing and having a good time right now. They are saying how dumb U.S. is! _E_ My @FoxNews interview with @gretawire discussing the #CNNDebate and how to deal with Iran without using force __HTTP__ _E_ #trumpvlog China is laughing.... __HTTP__ _E_ I answered my @Facebook fans questions via video watch __HTTP__ _E_ Ted Nugent was obviously using a figure of speech unfortunate as it was. It just shows the anger people have towards @BarackObama. _E_ My thoughts on the Geico ad and more in today's video blog.... __HTTP__ _E_ Most of the world's great riders are at Mar a Lago today for the Trump Invitational one of the most important equestrian events of the year _E_ Do as I say not as I do. Obama just granted a special ObamaCare exemption for all Congress __HTTP__ All are hypocrites! _E_ "Developing your talent requires work and work creates luck." – Trump Never Give Up _E_ The US should not give a penny of foreign aid to Egypt if the Muslim Brotherhood takes over the country. We (cont) __HTTP__ _E_ Drudge Poll on who won the 3rd #GOPDebate. Thank you! __HTTP__ _E_ Thank you to @jdickerson and @FaceTheNation for a very fair and professional interview this morning. No wonder you are #1 in the ratings! _E_ Frack now and frack fast unless we want to continue to be dependent on countries that hate us. _E_ When will anyone be held accountable for the VA scandal? The politicians are experts in never facing any consequence. _E_ Via @AmSpec by Jeffrey Lord: "New Obama Scandal Erupts: Trump Targeted" __HTTP__ _E_ So I speak badly of China but I speak the truth and what do the consumers in China want? They want Trump. (cont) __HTTP__ _E_ Joe Scarborough initially endorsed Jeb Bush and Jeb crashed then John Kasich and that didn't work. Not much power or insight! 
_E_ I invite you to join my campaign to Make America Great Again! Sign up to Volunteer! __HTTP__ _E_ Barack Obama said absolutely not 3 times before he agreed to go after Bin Laden now he wants all of the credit! _E_ Thanks @JamersonHayes they are all total losers with nothing going for them! _E_ Check out this great story from the @WSJ... __HTTP__ _E_ In calling my tweets 'obnoxious' @AOL says "I sure know how to keep them wanting more." They are welcome. I just tell it like it is. _E_ .@PhilMickels0n_ is right—California taxes are far too high. It's ridiculous. _E_ The Audacity of @BarackObama the Federal Reserve purchased 61% of all debt issued by Treasury in 2011. Killing our children's future. _E_ RT @realDonaldTrump: At the request of the Governor of Texas I have signed the Disaster Proclamation which unleashes the full force of go... _E_ Obama should play golf with Republicans & opponents rather than his small group of friends. That way maybe the terrible gridlock would end. _E_ Between a terrible press conference mishandled prisoner swap & Taliban attacks Hagel's 1st trip as SOD was a disaster. No surprise. _E_ Our vets are treated like 3rd class citizens. Enough! Join me & @V4SA on @USSIOWA at LA Waterfront to hear my plan for vets & the military! _E_ I look forward to playing golf with President @BarackObama someday. _E_ Big response to my Tea Party statement remember they were never fully energized by Romney campaign and will have far more power with time. _E_ Entrepreneurs: Gain and use information to your advantage see every day as an opportunity to learn. _E_ I guess Rupert Murdoch and the @nypost don't like Donald Trump. Such false reporting about my big hit in Iowa. Even my enemies said bull. _E_ Don't believe Kay Hagan on Ebola travel ban. She also promised that you would keep your healthcare plan under ObamaCare. Vote @ThomTillis! _E_ Putin re Snowden issue "it is like shearing a pig: there's lots of squealing and little fleece." 
_E_ Check out today's video blog __HTTP__ I want to answer more of your questions tweet me..... _E_ The only reason irrelevant @GlennBeck doesn't like me is I refused to do his failing show asked many times. Very few listeners sad! _E_ Give your goals substance make them count on as many levels as you can. Remember that passion can be the catalyst for great achievement. _E_ .@ScotGolfPodcast Work has not yet begun. We're in the approval phase. It will be amazing. You will love the final result. _E_ Watch the WH spokesman try to spin @BarackObama's rationale for using exec. priv. on Fast & Furious __HTTP__ _E_ A very interesting read. Unfortunately so much is true. __HTTP__ _E_ President Obama and our negotiators are failed checker players playing against Grand Master Chess champions. Very sad to watch! _E_ Chinese oil trader just bought "record number" of Mideast crude __HTTP__ China gains while we fight ISIS. What are we doing? _E_ I have built so many great & complicated projects– creating tens of thousands of jobs video: __HTTP__ __HTTP__ _E_ Via @GolfweekMag by @GolfweekBRomine: "@TigerWoods to design Trump course in Dubai" __HTTP__ _E_ If JP Morgan took their case through the courts for 15 years nobody would be suing them—easy target. _E_ 'Trump is right about violent crime: It's on the rise in major cities' __HTTP__ _E_ As #HurricaneHarvey intensifies remember to #PlanAhead. __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_ things they did and said (like giving the questions to the debate to H). A total double standard! Media as usual gave them a pass. _E_ Congratulations to @newtgingrich‎ on being signed to co host @CNN Crossfire. Great move by Jeff Zucker. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ New York should Frack. Thousands of jobs and millions in revenue. NY would be a truly rich state. _E_ The worst thing Hillary could do is have her husband campaign for her. Just watch. 
_E_ On our YouTube channel the opening of the incredible Trump Ocean Club in Panama.... __HTTP__ _E_ I just saw the movie Unbroken very good except I thought the ending was weak no retribution! And we complain about waterboarding. _E_ After 1 year of investigation with Zero evidence being found Chuck Schumer just stated that Democrats should blame ourselvesnot Russia. _E_ Watching the Ryder Cup on @GolfChannel. Very interesting and tough matches. Amazing sport my favorite! _E_ Kasich only looks O.K. in polls against Hillary because nobody views him as a threat and therefore have placed ZERO negative ads against him _E_ With Ben Carson wanting to hit his mother on head with a hammer stab a friend and Pyramids built for grain storage don't people get it? _E_ RT @mitchellvii: EXACTLY AS I SAID House Intel Chair: We Cannot Rule Out Sr. Obama Officials Were Involved in Trump Surveillance __HTTP__ _E_ The USC made a terrible decision today. How can a requirement to buy private health insurance logically be a Government tax?! _E_ Carly Fiorina did such a horrible job at Lucent and HP virtually destroying both companies that she never got another CEO job offer! Pres. _E_ The Football program at Penn State should be suspended. _E_ Will be speaking with Germany and France this morning. _E_ I just left @trumpwinery in CharlottesvilleVirginia it is the finest in the country really incredible! _E_ At least 3.5M fellow Americans are going to lose their healthcare plans because of ObamaCare. Defund then repeal! _E_ I liked The Kelly File much better without @megynkelly. Perhaps she could take another eleven day unscheduled vacation! _E_ As a candidate I promised we would pass a massive TAX CUT for the everyday working American families who are the backbone and the heartbeat of our country. Now we are just days away... __HTTP__ _E_ Thank you @TrumpWomensTour!#MakeAmericaGreatAgain __HTTP__ _E_ The historic $250M renovations at @TrumpDoral are moving on pace. 
Once complete @TrumpDoral will be South Florida's premiere resort. _E_ They laughed at me when I said to bomb the ISIS controlled oil fields. Now they are not laughing and doing what I said. #Trump2016 _E_ The NYPost reports @VanityFair Magazine dropped 18% to only 283938 newsstand copies sold. Very sad & their bloggers are doing even worse! _E_ Donald Trump will keynote Oakland County Republicans' Lincoln Day dinner __HTTP__ via @MLive Record crowd expected. _E_ My interview with @ASavageNation discussing #TimeToGetTough my 2012 plans and Iraq __HTTP__ __HTTP__ _E_ Lightweight Senator Marco Rubio features Trump Univ. students in FL. attack ads who submitted excellent reviews. __HTTP__ _E_ Thank you Ohio see you tonight! __HTTP__ _E_ .@BarackObama bowed to the Saudi King in public yet the Dems are questioning @MittRomney's diplomatic skills. _E_ .@DonaldJTrumpJr and I on the 18th hole at Trump International Golf Links Scotland __HTTP__ _E_ Did you agree with my decision? #CelebApprentice _E_ I never met former Defense Secretary Robert Gates. He knows nothing about me. But look at the results under his guidance a total disaster! _E_ Via @FootwearNews by @kristenmhenning: "@IvankaTrump Works to Beat Breast Cancer" __HTTP__ _E_ Steven Spielberg is a great filmmaker. Go see Lincoln. _E_ I am astonished that the media continues to lie. @BarackObama gutted welfare reform. It is a fact! _E_ As a big job creator I was greatly honored to have been mentioned twice tonight during the debate. _E_ Just announced that Iraq (U.S.) is preparing for battle to reclaim Mosul. Why do they have to announce this? Makes mission much harder! _E_ Failed candidate Mitt Romneywho ran one of the worst races in presidential historyis working with the establishment to bury a big R win! _E_ Our gov't should immediately stop sending $'s to Mexico no friend until they release Marine & stop allowing immigrant inflow into U.S. 
_E_ Donald Trump visits Doral resort says he's allaying neighbors' concerns __HTTP__ via @MiamiHerald _E_ I hope the boycott of @Macys continues forever. So many people are cutting up their cards. Macy's stores suck and they are bad for U.S.A. _E_ The blatant waste of taxpayers' dollars doesn't bother Obama because it's all part of his broader nanny state (cont) __HTTP__ _E_ A tough negotiator can make the Chinese back off. We've done it before. #TimeToGetTough __HTTP__ __HTTP__ _E_ .@Omarosa is not winning points being called "the wicked witch of the Mid West" and most certainly other things. #CelebApprentice _E_ RT @FoxNews: .@POTUS: Our infrastructure will again be the best in the world. We used to have the greatest infrastructure anywhere in the... _E_ I have a dream that our country will be great again! #DreamDay _E_ Benghazi is bigger than Watergate. Don't let Obama get away with allowing Americans to die. Kick him out of office tomorrow. _E_ Via @dcexaminer by @eScarry: "Donald Trump: @HuffingtonPost 'a very dishonest organization'" __HTTP__ _E_ Great coordination between agencies at all levels of government. Continuing rains and flash floods are being dealt with. Thousands rescued. _E_ Dow S&P 500 and Nasdaq all finished the day at new RECORD HIGHS! __HTTP__ _E_ At least @TheTinaBeast is consistent. She takes over a magazine and it ends up in the gutter. _E_ 'Trump signs bill undoing Obama coal mining rule' __HTTP__ _E_ Thank you New Hampshire!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Just arrived in Mississippi for the rally. Word is that the crowd is overflowing and massive. Will be an amazing evening! _E_ "A man always has two reasons for doing anything: a good reason and the real reason." J. P. Morgan _E_ Happy Passover to everyone celebrating in the United States of America Israel and around the world. #ChagSameach _E_ Thank you Eric! 
__HTTP__ _E_ RT @foxnation: .@realDonaldTrump's First Full Month in Office Sees Biggest Jobs Gain 'In Years': Report: __HTTP__ _E_ Because of President Obama's failed leadership we have put Vladimir Putin & Russia back on the world stage! No reason for this. _E_ Jobless claims rose yet again last week __HTTP__ @BarackObama's economic record is abysmal we can do much better. _E_ Obama doesn't know what he's doing. His foreign policy is a disaster. Libya Egypt Iraq Afghanistan all (cont) __HTTP__ _E_ Thank you! __HTTP__ _E_ Curtis Sliwa doing tv commentary on 9/13/2001. Good job Curtis. Please send your apologies to @realDonaldTrump. __HTTP__ _E_ Republicans are always worried about their general approval. With proposing to 'ignore the debt ceiling' they are ignoring their base. _E_ MAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_ Thank you! Facebook: __HTTP__ __HTTP__ __HTTP__ _E_ Today we remember the men and women who made the ultimate sacrifice in serving. Thank you God bless your families & God bless the USA! _E_ Congratulations to @SteveKingIA and his team on running a great campaign. Steve is a strong leader in the House. _E_ Thank you for your endorsement @paulteutulsr! #BikersForTrump #VoteTrumpNV Video: __HTTP__ __HTTP__ _E_ If you have a hard time communicating one way to overcome it is to turn your focus onto your audience. Midas Touch _E_ Thanks for all the nice comments about the @Late_Show last night. I enjoyed it and David enjoyed the ratings. __HTTP__ _E_ A great @The Masters. The course looks so beautiful. Fantastic for golf and television ratings! _E_ Stop saying I went bankrupt. I never went bankrupt but like many great business people have used the laws to corporate advantage—smart! _E_ Amazing! Watch @NHLBruins fans take over National Anthem during pregame ceremonies __HTTP__ _E_ Should not raise taxes in Wisconsin but massive budget deficit. Education roads etc suffering. @DanHenninger lies. 
@WSJ _E_ Via @BreitbartNews: "EXCLUSIVE: TRUMP SMACKS BACK AGAINST MEDIA ATTACKS ON CPAC SPEECH" __HTTP__ by @mboyle1 _E_ Remember our six brave heroes who died searching for Bergdahl after he deserted __HTTP__ (h/t @Military_News) _E_ "Leverage: don't make deals without it." – The Art of the Deal _E_ My @FoxNews interview on @gretawire discussing the @RNC convention @BarackObama's sealed records & real estate advice __HTTP__ _E_ Congratulations @TrumpSoHo on being named a "Great East Coast Hotels for Teens and Families" by @ParadeMagazine __HTTP__ _E_ Derek Jeter's rehab assignment is progressing on schedule. He's a true @Yankees captain. Look forward to seeing him back on the field _E_ When I say I would end Obamacare I would also come up with a plan that would be far better much easier to understand and cost less! _E_ Happy #MothersDay to all the great mothers out there! __HTTP__ _E_ "The worst thing you can possibly do in a deal is seem desperate to make it." The Art of the Deal _E_ Entrepreneurs: See each day as an opportunity to show what you can do at the highest level. _E_ If the Prez wants to create jobs talk to some business people not liberal intellectuals. _E_ Watching Pyongyang terrorize Asia today is just amazing! _E_ Pay to play. Collusion. Cover ups. And now bribery? So CROOKED. I will #DrainTheSwamp. __HTTP__ _E_ RT @AnnCoulter: GREATEST FOREIGN POLICY SPEECH SINCE WASHINGTON'S FAREWELL ADDRESS. _E_ Today @BarackObama will borrow 40 cents on every dollar he spends from China. Just another day at the office. _E_ Always know you could be on the precipice of something great. Donald J. Trump __HTTP__ _E_ Phony Rubio commercial. I could have settled but won't out of principle! See student surveys. __HTTP__ _E_ The @TheView @ABC once great when headed by @BarbaraJWalters is now in total freefall. Whoopi Goldberg is terrible. Very sad! 
_E_ I will be meeting with the NRA who has endorsed me about not allowing people on the terrorist watch list or the no fly list to buy guns. _E_ I'll be on @greta ON THE RECORD tonight at 7 PM _E_ By the way New York State MUST LOWER TAXES (and fast) and must start going after all of the energy that lies just below our feet (now)! _E_ HALF of Americans don't pay income tax despite crippling govt debt... __HTTP__ _E_ 'Majority in Leading EU Nations Support Trump Style Travel Ban' Poll of more than 10000 people in 10 countries... __HTTP__ _E_ RT @DiamondandSilk: When the President says You're Fired That means: Pack Yo Stuff and Go Not Say You Refuse to Go! #DrainTheSwam... _E_ Rush Limbaugh is great tells it as he sees it really honorable guy! Thanks Rush! #Trump2016 _E_ RT @FoxNews: .@KellyannePolls: Since @POTUS took office 863000 new jobs were filled by women. Over half a million American women have en... _E_ An important part of my (or anybody's) success is the ability to judge people. I believe that @MileyCyrus is a really good person. _E_ If FM @AlexSalmond needs to litter Scotland w/ ugly industrial wind turbines to gain independence he will lose! __HTTP__ _E_ I will be interviewed on @CNN @NewDay at 7:30 A.M. Enjoy! _E_ Let's Trump the Establishment! We are no longer silent. We will Make America Great Again! __HTTP__ _E_ Outrageous @BarackObama is trying to unilaterally gut welfare reform __HTTP__ He doesn't believe in a strong work ethic. _E_ Stamps are going up once again. Now the US postal service will lose even more money. _E_ My @extratv interview discussing @Rosie's new baby my acceptance of @billmaher's $5M offer & hiring @_KatherineWebb __HTTP__ _E_ .@bobvanderplaats asked me to do an event. The people holding the event called me to say he wanted $100000 for himself.Phony @foxandfriends _E_ Trust your instincts. They are there for a reason. Without instincts you'll have a hard time getting to and staying at the top. 
_E_ When I jokingly said bring back Steve Jobs to run Apple because Apple has not been doing well the haters & losers had a field day! Sad. _E_ Lyin' Ted Cruz just used a picture of Melania from a G.Q. shoot in his ad. Be careful Lyin' Ted or I will spill the beans on your wife! _E_ Stock market hits new high with longest winning streak in decades. Great level of confidence and optimism even before tax plan rollout! _E_ "Donald Trump to address SC Tea Party Coalition at Myrtle Beach event" __HTTP__ via @CarolinaLive by @timmcginniswpde _E_ With Terry McAuliffe Gov of Virginia at the Trump Winery in Charlottesville VA largest on East Coast. @GovernorVA __HTTP__ _E_ You have no idea what my strategy on ISIS is and neither does ISIS (a good thing). Please get your facts straight thanks. @megynkelly _E_ Join my team tonight at 8:30pmE! __HTTP__ __HTTP__ _E_ .@TimTebow has tremendous talent and a proven ability to lead. He deserves to be in the @nfl. _E_ America's Labor Market Continues to Boom JOBS JOBS JOBS! __HTTP__ _E_ How Trump Won And How The Media Missed It __HTTP__ _E_ China is driving the price of gold up in order to ease pressure against Iranian sanctions. __HTTP__ _E_ .@FoxNews is the only network that does not even mention my very successful event last night. $6000000 raised in one hour for our VETS. _E_ Practice positive thinking this will keep you focused while weeding out anything that is unnecessary negative or detrimental. _E_ The cheap 12 inch sq. marble tiles behind speaker at UN always bothered me. I will replace with beautiful large marble slabs if they ask me. _E_ .@Linda_McMahon is an elite businesswoman who will bring a great outlook to DC. Support her campaign here __HTTP__ _E_ The debate tonight will be a total disaster low ratings with advertisers and advertising rates dropping like a rock. I hate to see this. _E_ Thank you @rushlimbaugh for your wonderful words. 
We will #MakeAmericaGreatAgain _E_ .@Lord_Sugar If you didn't say the iPod would be gone in a year you might have been really rich instead of the peanut money you have. _E_ "Protect the downside and the upside will take care of itself. – The Art of the Deal _E_ Congrats to @TimTebow on making @Patriots' first cut. Stay strong and positive! We are all rooting for you. _E_ New on our YouTube channel today is a brand new #trumpdocumentary giving you a look inside the world of Trump Golf... __HTTP__ _E_ If any candidate believes that with what we know today we still should have invaded Iraq then they are unqualified to be Commander in Chief. _E_ The dying @VanityFair's circulation has "dropped" & its newsstand sales have "plummeted by 20.1 percent" __HTTP__ _E_ I am truly honored to have been chosen Statesman of the Year by the Republican Party of Sarasota County. The (cont) __HTTP__ _E_ Looked at plans for Trump Doral Country Club today. It will be amazing! Glad to be in Miami. _E_ If elected I will undo all of Obama's executive orders. I will deliver. Let's Make America Great Again! __HTTP__ _E_ Entrepreneurs: Resolve to be bigger than your problems. Who's the boss? Don't negate your own power. _E_ I've done the largest house sale in U.S. history by selling a Palm Beach mansion for $100M $60M more than I paid. I love real estate. _E_ Mar a Lago in Palm Beach is one of the great palazzos of the world with a fantastic history. __HTTP__ _E_ Watch the latest From The Desk Of Donald Trump at __HTTP__ and read this article __HTTP__ _E_ Trump urges GOP to be 'mean as hell' __HTTP__ Via @CNNPolitics _E_ Looks like two time failed candidate Mitt Romney is going to be telling Republicans how to get elected. Not a good messenger! _E_ I truly understood the appeal of Ron Paul but his son @RandPaul didn't get the right gene. _E_ Most people think small because most people are afraid of success afraid of making decisions afraid of winning. 
The Art of the Deal _E_ .@oreillyfactor The people of Iowa love the fact that I stuck up for my rights as I will do for the U.S. Also got $6000000 for our VETS! _E_ Resolve never to quit never to give up no matter what the situation. @jacknicklaus _E_ Donald Trump backs 'Apprentice' Randal Pinkett for N.J. Lieutenant Governor: __HTTP__ _E_ Apparently @MartinBashir said something about me on his show yesterday. I was surprised to find out he is on TV. Who knew?! _E_ They don't like Rubio in Florida he left them high & dry. Doesn't even show up for votes! _E_ Republicans sorry but I've been hearing about Repeal & Replace for 7 years didn't happen! Even worse the Senate Filibuster Rule will.... _E_ The Euro is going to collapse soon. Cross border lending is already down and banks are stopping their Euro investments. _E_ Horrible killing of a 13 year old American girl at her home in Israel by a Palestinian terrorist. We must get tough. __HTTP__ _E_ I am working hard even on Thanksgiving trying to get Carrier A.C. Company to stay in the U.S. (Indiana). MAKING PROGRESS Will know soon! _E_ #TrumpVine A message for @AnthonyWeiner __HTTP__ _E_ On the 13th tee box @TrumpScotland with my grand daughter Kai! @DonaldJTrumpJr __HTTP__ _E_ The Democrats will only vote for Tax Increases. Hopefully all Senate Republicans will vote for the largest Tax Cuts in U.S. history. _E_ RT @JohnStossel: I can skate here ONLY b/c @realdonaldtrump fixed this rink after NYC gov't spent $13M but FAILED! Good for Trump! __HTTP__ _E_ Via @BPolitics by @Griffin Aboard Donald Trump's 757 at the South Carolina Tea Party Convention __HTTP__ _E_ On behalf of an entire nation Happy 242nd Birthday to the men and women of the United States Marines!#USMC242 #SemperFi __HTTP__ _E_ Ann Romney is a fantastic lady. She was great in thanking people last night! __HTTP__ _E_ Our nation is a once great nation divided! _E_ Be sure to watch my wonderful wife Melania Trump tonight on @QVC at 1AM EST _E_ ..... 
I wonder if Angelo has a job or is on assistance. In any event I'm sure he is a nice guy! _E_ Congratulations to @FoxNews for winning November in the cable news rating race with 9 of 10 top shows __HTTP__ _E_ My @greta int. on @FoxNews on how to defeat ISIS Obama losing ground to ISIS & Making America Great Again! __HTTP__ _E_ Miss USA pageant had a 4 to 1 vote in favor but it won't be in Miami Doral in 2014 Mayor Boria voted against it. I want total support! _E_ New study shows 80% of Congress have no business experience it shows! _E_ AmyMek Amen! @realDonaldTrump has drawn more attention to Veterans issues in 1 week than these politicians have in decades! _E_ Great poll numbers for @MittRomney just out he is leading substantially in swing states. _E_ NEW POLL: Trump Blue Collar Support highest since FDR in 1930s WOW! __HTTP__ _E_ We have to repeal & replace #Obamacare! Look at what is doing to people! #DrainTheSwamp __HTTP__ _E_ Nice story from @businessinsider __HTTP__ _E_ We have tremendous economic power over China if our leaders knew how to use it which they don't! China's economy would collapse without us. _E_ Everyone's favorite frontman Twisted Sister lead singer @deesnider returns to this year's All Star @ApprenticeNBC. Dee does great! _E_ Another good poll result in the great state of SC. Trump at 30%. Carson at 15% and Bush at 9%. __HTTP__ _E_ "We have a president who has a vendetta against businesspeople and considers them the enemy. He's also (cont) __HTTP__ _E_ "Donald Trump launches new men's fragrance Empire @Macys Because every man has his own empire to build'" __HTTP__ _E_ As bad as they were I don't remember our embassies being attacked when Mubarak and Gaddafi were in power. _E_ The Failing @nytimes the pipe organ for the Democrat Party has become a virtual lobbyist for them with regard to our massive Tax Cut Bill. They are wrong so often that now I know we have a winner! _E_ Thug Politics. 
Lightweight hack Schneiderman meets with Obama on Thursday then brings frivolous suit on Saturday. _E_ Washington should have brought in Strasburg to relieve they would have won. _E_ A wonderful story on Iowa voters by @arappeport of the @NYTimes. __HTTP__ _E_ The speakers slots at the Republican Convention are totally filled with a long waiting list of those that want to speak Wednesday release _E_ With gas prices rising and the economy failing @BarackObama seeks to have his EPA raise energy prices by $109B __HTTP__ _E_ RT @DRUDGE_REPORT: REUTERS ROLLING: TRUMP 39% CRUZ 14.5% BUSH 10.6% CARSON 9.6% RUBIO 6.7%... MORE... __HTTP__ _E_ .@EricTrump did an amazing job raising money for @StJude with his @EricTrumpFDN event featuring @LisaLampanelli. Watch __HTTP__ _E_ A big salute to Jerry Jones owner of the Dallas Cowboys who will BENCH players who disrespect our Flag. Stand for Anthem or sit for game! _E_ and stay at the fantastic Trump International Hotel Las Vegas ... __HTTP__ _E_ Thank you Fort Lauderdale Florida. #MakeAmericaGreatAgain __HTTP__ _E_ Stop and frisk works. Instead of criticizing @NY_POLICE Chief Ray Kelly New Yorkers should be thanking him for keeping NY safe. _E_ RT @foxandfriends: Insurers seeking huge premium hikes on ObamaCare plans __HTTP__ _E_ Paul Ryan a man who doesn't know how to win (including failed run four years ago) must start focusing on the budget military vets etc. _E_ Don't believe the media stories. OPEC and the Saudis have not been doing us any favors recently with oil outputs. Oil should be $30/barrel. _E_ I will defeat Crooked Hillary Clinton on 11/8/2016. #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_ "Palin's brand among evangelicals is as gold as the faucets in Trump tower" said Ralph Reed the chairman of the Faith & Freedom Coalition. _E_ "Fans like winners. They come to watch stars – great exciting players who do great exciting things." The Art of The Deal _E_ Great day in Colorado & Arizona. 
Will be in Nevada Colorado and New Mexico tomorrow join me!Tickets:... __HTTP__ _E_ Have passion drive and enthusiasm? You can check out the @TrumpCollection careers here: __HTTP__ _E_ "The most important thing in communication is hearing what isn't said." Peter Drucker _E_ "Sometimes life hits you in the head with a brick. Don't lose faith." Steve Jobs _E_ The 13th season of All Star @CelebApprentice is unique. We really pushed the envelope here. Our great and loyal fans will love it. _E_ In any business there will be ups and downs. If you can weather the rough times your success will be even greater during high times. _E_ CLINTON CORRUPTION AND HER SABOTAGE OF THE INNER CITIES. Full speech transcript: __HTTP__ _E_ I will be going to Atlanta Georgia tomorrow—here's the info: __HTTP__ Hope to see you there! #MakeAmericaGreatAgain! _E_ .@BillMaher didn't come through with his promised $5 million for charity so today I will sue him. _E_ OPEC is better off than they were 4 years ago. Gas has more than doubled during @BarackObama's term. Outrageous! _E_ The Republican Party has to be smart & strong if it wants to win in November. Can't allow lightweights to set up a spoiler Indie candidate! _E_ ....This now allows for the passage of large scale Tax Cuts (and Reform) which will be the biggest in the history of our country! _E_ With @ivankatrump and the Chairman of DAMAC in Dubai. __HTTP__ _E_ .@hardball_chris says he's "glad" we had a hurricane! With many people dying and thousands hurting MSNBC (cont) __HTTP__ _E_ Great Gravis Poll on the great state of NH. Also watch @FaceTheNation on CBS & @HowardKurtz #mediabuzz both on Sunday. _E_ The so called bipartisan DACA deal presented yesterday to myself and a group of Republican Senators and Congressmen was a big step backwards. Wall was not properly funded Chain & Lottery were made worse and USA would be forced to take large numbers of people from high crime..... 
_E_ Our country needs strong borders and extreme vetting NOW. Look what is happening all over Europe and indeed the world a horrible mess! _E_ Congrats @NBCInvestigates on revealing that Obama knew millions of Americans would lose their healthcare plans __HTTP__ _E_ Congrats to @TrumpWaikiki celebrating 51 consecutive months as the #1 Honolulu Hotel on @TripAdvisor! _E_ The U.S. cannot negotiate with terrorists. It is a sad and terrible situation for the family involved but this can only lead to disaster. _E_ .... to help McConnell who spoke right after him."@BreitbartNews _E_ If Republican Senators are unable to pass what they are working on now they should immediately REPEAL and then REPLACE at a later date! _E_ "You may have to try a lot of things to get just one thing to work. That's tenacity and it's critical to success." – Trump Never Give Up _E_ If the people so violently shot down in Paris had guns at least they would have had a fighting chance. _E_ Wow! Senator Mark Warner got caught having extensive contact with a lobbyist for a Russian oligarch. Warner did not want a "paper trail" on a "private" meeting (in London) he requested with Steele of fraudulent Dossier fame. All tied into Crooked Hillary. _E_ RT @foxandfriends: .@GeraldoRivera: Chances of impeachment went from 3% to 0% with Comey's testimony __HTTP__ _E_ Just landed in Da Nang Vietnam to deliver a speech at #APEC2017 _E_ Thank you Richmond Virginia! #Trump2016 __HTTP__ _E_ Thank you Iowa! #Trump2016 __HTTP__ _E_ Whether I choose him or not for State Rex Tillerson the Chairman & CEO of ExxonMobil is a world class player and dealmaker. Stay tuned! _E_ I will be live tweeting during the Celebrity Apprentice at 9 P.M. Also will be hosting Dateline just prior to Apprentice at 8 P.M. _E_ My interview from last night with @piersmorgan discussing OWS __HTTP__ _E_ I just returned from Iowa what a beautiful state. The people are amazing and the event for Congressman Steve King was a great success! 
_E_ Will be on @foxandfriends at 7.00. (30 minutes). A great deal to talk about including Ebola quarantine. _E_ The Trans Pacific Partnership will increase our trade deficits & send even more jobs overseas. This is a bad deal. Time for smart trade! _E_ Face The Nation's interview of me was the highest rated show that they have had in 15 years. Congratulations and WOW! @CBSNews @jdickerson _E_ Join us live in the Oval Office for the swearing in of our new Attorney General @SenatorSessions!LIVE:... __HTTP__ _E_ Business is an art in itself & powerful negotiation skills are one of the techniques necessary to facilitate success. Think Like a Champion _E_ Elections have consequences. Obama just published "final regulations for ObamaCare's individual mandate" __HTTP__ Enjoy! _E_ Read a great interview with Donald Trump that appeared in The New York Times Magazine: __HTTP__ _E_ .@EWErickson got fired like a dog from RedStateand now he is the one leading opposition against me. _E_ Don't let the GLOBAL WARMING wiseguys get away with changing the name to CLIMATE CHANGE because the FACTS do not let GW tag to work anymore! _E_ .@Yankees should get rid of A Rod ASAP I can't watch this guy anymore! _E_ It's Tuesday how much will the media continue to cover up the embassy attacks for Obama? _E_ It is great to meet fellow patriots at the #TimeToGetTough book signings. Can't wait to meet more today at Trump Tower from 12PM to 2PM _E_ According to new employment numbers 296000 Americans have dropped out of the work force & gave up looking for work. _E_ Wow it's snowing in Isreal and on the pyramids in Egypt. Are we still wasting billions on the global warming con? MAKE U.S. COMPETITIVE! _E_ Tonight I will be on @FoxNews with @SeanHannity at 10pm and @CNN w/ @AndersonCooper at 10:10pm. Enjoy! 
#VoteTrumpSC #Trump2016 _E_ Donald Trump shocked by 'stupid decision' about @OMAROSA on '@ApprenticeNBC' __HTTP__ @TODAY_Clicker _E_ Obama's war on coal is killing American jobs making us more energy dependent on our enemies & creating a great business disadvantage. _E_ With @StephenBaldwin7 earlier today at @ApprenticeNBC press conference in @TrumpTowerNY. __HTTP__ _E_ By Scotland officials canceling my local ad about how damaging wind turbines are it became a much bigger story around the world. Great! _E_ Why does @BarackObama always have to rely on teleprompters? _E_ I will be on @foxandfriends at 8:30 A.M. Will be talking about lightweight Marco Rubio and lying Ted Cruz! _E_ Obama will be trying very hard at next debate he doesn't want to lose the Boeing. _E_ We need a real President! __HTTP__ _E_ When your life flashes before your eyes make sure you've got plenty to watch. Anonymous _E_ How stressed are @lisarinna and @pennjillette already? #CelebApprentice _E_ Looking forward to giving keynote speech tonight @ChesterfieldGOP Lincoln Reagan dinner in Virginia. _E_ Via @bostonherald by Eugene R. Dunn: "Iran a clear danger" __HTTP__ _E_ I hear this moron @billmaher said nasty things about me (hair etc—boring) on the terminated @jayleno show. Stupid guy/bad ratings! _E_ Unbelievable evening in Melbourne Florida w/ 15000 supporters and an additional 12000 who could not get in. Tha... __HTTP__ _E_ On behalf of @FLOTUS Melania and I THANK YOU for an unforgettable afternoon and evening at the Forbidden City in Beijing President Xi and Madame Peng Liyuan. We are looking forward to rejoining you tomorrow morning! __HTTP__ _E_ ...allegations of unmasking Trump transition officials. Not good! _E_ Wow NY Observer story about @AGSchneiderman really exposes him as a sleazebag & crook. He's bad for New York. __HTTP__ _E_ Does Madonna know something we all don't about Barack? At a concert she said we have a black Muslim in the White House. 
_E_ How do you fight millions of dollars of fraudulent commercials pushing for crooked politicians? I will be using Facebook & Twitter. Watch! _E_ Thank you for a great night at the Verizon Wireless Arena New Hampshire! #VoteTrumpNH#MakeAmericaGreatAgain #FITN __HTTP__ _E_ There is no better place in the world to spend Christmas than Mar a Lago __HTTP__ in Palm Beach Florida. _E_ Just finished another week of filming @ApprenticeNBC. This season a record 14th is shaping up to be the best yet. _E_ Leaving soon after a great time in New Hampshire a truly special place! _E_ From ABC News: In Demand: Washington's Highest (and lowest) Speaking Fees by Scott Wilson __HTTP__ _E_ Don't forget to watch Celebrity Apprentice tonight at 9pm...you will love it! _E_ Landing in Pennsylvania now. Great new poll this morning thank you. Lets #DrainTheSwamp and #MakeAmericaGreatAgain... __HTTP__ _E_ Will be playing golf today with Rand Paul at Trump International in Palm Beach. Will be both interesting and fun! _E_ Can you believe that our very stupid politicians released the leader of ISIS and now we are spending billions trying to get him back! _E_ Via the Washington Post: Inside the World of Donald Trump's Super Fans: __HTTP__ _E_ RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_ Hurricane looks like largest ever recorded in the Atlantic! _E_ Great news that @ehasselbeck will be joining @foxandfriends. Elisabeth is a tremendous person and will be missed on @theviewtv. _E_ Thank you to the @nydailynews for a very nice story __HTTP__ _E_ Thank you to @exxonmobil for your $20 billion investment that is creating more than 45000 manufacturing & construction jobs in the USA! _E_ Like her or not Hillary did what she had to do in the debate last night—get through it. Her opponents were very gentle and soft! _E_ Obama's nuclear deal with the Iranians will lead to a nuclear arms race in the Middle East. It has to be stopped. _E_ Miss Universe contestants are amazing—the most beautiful ever! 
_E_ Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_ Via @politicalwire: Tweet of the Day __HTTP__ _E_ Now that Obama's poll numbers are in tailspin – watch for him to launch a strike in Libya or Iran. He is desperate. _E_ Of course there is large scale voter fraud happening on and before election day. Why do Republican leaders deny what is going on? So naive! _E_ to make up their own minds as to the truth. The media lies to make it look like I am against Intelligence when in fact I am a big fan! _E_ Congratulations to @seanhannity on his tremendous increase in television ratings. Speaking of ratings I will be on his show tonight @ 10pE. _E_ My interview with @NYDNGatecrasher discussing @BarackObama's #WHCD and my endorsement of @MittRomney __HTTP__ _E_ Happy #SmallBusinessSaturday!A great day to support your community and America's JOB creators by shopping locally at a #SmallBiz. #ShopSmall __HTTP__ _E_ Via @TODAY_Clicker: Donald Trump promises 'tough and mean and nasty' 'Celebrity Apprentice' __HTTP__ _E_ .@GovernorPataki did a terrible job as Governor of New York. If he ran again he would have lost in a landslide. He and Graham ZERO in polls _E_ This week the Senate can join the House & take a strong stand for the Middle Class families who are the backbone of America. Together we will give the American people a big beautiful Christmas present a massive tax cut that lets Americans keep more of their HARD EARNED MONEY! __HTTP__ _E_ Obama spoke to the Mexican president last week & did not mention UMC Sgt. Tahmooressi. Sad! _E_ I find the photos of these children killed in Newtown in the New York Post heartbreaking.#Angels _E_ It's Monday. How much will premiums rise today because of ObamaCare? REPEAL! _E_ Thank you Michael Harrison @Talkersmagazine for your kind words greatly appreciated! _E_ A fact golfers don't get aches & pains like others who don't golf. It is amazingly remedial. 
_E_ ...popular vote. ABC News/Washington Post Poll (wrong big on election) said almost all stand by their vote on me & 53% said strong leader. _E_ RT @DonnaWR8: @realDonaldTrump Thank you @POTUS for believing in Us like we believed in you! #MAGA __HTTP__ _E_ If the presidential election were held today according to this @surveyusa poll Donald Trump would defeat any Dem: __HTTP__ _E_ Just read @PiersMorgan's book "Shooting Straight" and whether you love him or hate him (I'm in the first category) it is terrific. _E_ "Miss Universe Ratings 6.1 Million Viewers Best Since 2008" __HTTP__ _E_ We will have the votes for Healthcare but not for the reconciliation deadline of Friday after which we need 60. Get rid of Filibuster Rule! _E_ #CrookedHillary gives Obama an "A" for an economic recovery that's the slowest since WWII... #BigLeagueTruth... __HTTP__ _E_ How foolish did @davidaxelrod look yesterday trying to rationalize why @BarackObama accepts donations from Bain? __HTTP__ _E_ The problem with agreeing to a policy on immigration is that the Democrats don't want secure bordersthey don't care about safety for U.S.A. _E_ ICYMI via @foxnewsinsider my @foxandfriends from yesterday on Obama's dangerous disconnect __HTTP__ _E_ Why is @BarackObama letting the Taliban know when our troops are leaving? __HTTP__ This is dangerous for our soldiers. _E_ I will be holding a major news conference in New York City with my children on December 15 to discuss the fact that I will be leaving my ... _E_ I will be interviewed on @jaketapper @CNN at 9:00 A.M. and Fox News Sunday with Chris Wallace at 10:O0 A.M. CNN Iowa Poll 13 point lead! _E_ My interview on @gretawire discussing the economy and @TheHermanCain Witch Hunt __HTTP__ _E_ Spend your last day of 2013 contemplating the moves you will make in 2014 to make it your best year ever! _E_ Considering Obama hasn't proposed anything concrete if he wins he won't have a mandate. Another 4 years of legislative stalemate. 
_E_ The @nfl ratings continue to fall every week and will keep dropping. Boring games too many flags too soft! _E_ Today's #trumpvlog answers your tweets about my thoughts on the Republican candidates... __HTTP__ _E_ Limited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ __HTTP__ _E_ I guess @edshow is a lot smarter than dopes like @JonahNRO & @stephenfhayes. Oh well both mags are dying anyway. __HTTP__ _E_ To the brave men and women past and present in our armed services best wishes on Veterans Day. _E_ President Obama you are a complete and total disaster but you have a chance to do something great and important: STOP THE FLIGHTS! _E_ I've been watching very little @CNBC lately—the good news is I'm switching over to @BloombergNews and @FoxNews. _E_ Mobile Alabama today at 3:00 P.M. Last rally of the year THANK YOU ALABAMA AND THE SOUTH Biggest of all crowds expected see you there! _E_ RT @TeamTrump: "Police officers are the BEST of us. Law enforcement in this country is a force for GOOD. @mike_pence #VPDebate #BigLeagu... _E_ Congratulations to Eric & Lara on the birth of their son Eric Luke Trump this morning! __HTTP__ _E_ Entrepreneurs: Don't be confined by expectations. There are no exact rules for negotiation try to remain flexible and open to new ideas. _E_ I ask again how much is very wealthy South Korea paying the United States for protecting it against North Korea? _E_ Via @CBNNews by @TheBrodyFile: "Donald Trump: 'We Must Make America Great Again'" __HTTP__ _E_ .@FoxNews FBI's Andrew McCabe "in addition to his wife getting all of this money from M (Clinton Puppet) he was using allegedly his FBI Official Email Account to promote her campaign. You obviously cannot do this. These were the people who were investigating Hillary Clinton." _E_ On the way to the great state of Rhode Island big rally. Then to Pennsylvania for rest of day and night! 
_E_ Received a standing applause at #NCGOPcon when I said to have free trade be fair for the US we need really intelligent negotiators. _E_ The entire world understands that the good people of Iran want change and other than the vast military power of the United States that Iran's people are what their leaders fear the most.... __HTTP__ _E_ My @CNN interview with @TVAshleigh discussing @MittRomney's electability and @RickSantorum's Senate loss. __HTTP__ _E_ Donald Trump Ed Koch and the Ice Skating Rink: A Tale of Bureaucracy __HTTP__ @ActonInstitute _E_ People ask why do you tweet and re tweet to millions about @JebBush when he is so low in the polls? Because of his big $ hit ads on me! _E_ Via @BreitbartNews by @TheTonyLee: @Citizens_United sues @AGSchneiderman for violating 1st Amendment __HTTP__ _E_ Danger Weiner is a free man at 12:01AM. He will be back sexting with a vengeance. All women remain on alert. _E_ Good luck to @joniernst. You will make a wonderful Senator. _E_ We have a president who has a vendetta against businesspeople and considers them the enemy. #TimeToGetTough (cont) __HTTP__ _E_ If we do not protect the rule of law then we can expect even more illegals to cross the border. Obama's executive amnesty is dangerous. _E_ ....it is very possible that those sources don't exist but are made up by fake news writers. #FakeNews is the enemy! _E_ Almost every T.V. show is asking me to go on especially the @Late_Show. It's simple I get the ratings! _E_ Why does the federal government send foreign aid to China? Unbelievable! Washington is financing America's de... (cont) __HTTP__ _E_ Failed Presidential Candidate Mitt Romney was campaigning with John Kasich & Marco Rubio and now he is endorsing Ted Cruz. 1/2 _E_ Here's a great video of the official launch of my new fragrance #Success @Macys Herald Square __HTTP__ _E_ Denver Minnesota and others are bracing for some of the coldest weather on record. What are the global warming geniuses saying about this? 
_E_ Maniac Sergeant who went on a killing spree in Afghanistan must be punished big time and quickly. _E_ Limited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ __HTTP__ _E_ The pathetic new hit ad against me misrepresents the final line. You can tell them to go BLANK themselves was about China NOT WOMEN! _E_ I asked @VP Pence to leave stadium if any players kneeled disrespecting our country. I am proud of him and @SecondLady Karen. _E_ Bill Clinton's meeting was a total secret. Nobody was to know about it but he was caught by a local reporter. _E_ Great @nytimes story about our conversion of the Old Post Office building in D.C. to luxury hotel __HTTP__ _E_ Little Marco Rubio the lightweight no show Senator from Florida is just another Washington politician. __HTTP__ _E_ See Newsmax story re Republican National Convention __HTTP__ _E_ To be successful your focus has to be broad enough to think big at the same time. 'Midas Touch' with @theRealKiyosaki _E_ I remained strong for @TigerWoods during his difficult period. He rewarded me (and himself) by winning at Trump National Doral. _E_ Looking forward to @David_Bossie & @RepJeffDuncan's @Citizens_United Freed Summit in Greenville SC this Saturday! _E_ Soon to be the greatest hotel in U.S. don_trump_jr @ivankatrump @erictrump #OldPostOffice __HTTP__ _E_ The so called 'moderate' Syrian rebels pledged their allegiance to ISIS after Obama's address. We should not be arming them! _E_ Replay of Fox News Sunday With Chris Wallace at 2:00 P.M. on @FoxNews. Big statement made by Chris! _E_ I pay millions of $'s a year to Florida Power & Light & they can't give us what we want. Maybe a major class action suit against them? _E_ The police in London say I'm right. Major article in Daily Mail. "We can't wear uniform in our own cars." __HTTP__ _E_ I am now in Iowa getting ready to speak. People are always amazed to find out that I am Protestant (Presbyterian). GREAT. _E_ Roger Ailes just called. 
He is a great guy & assures me that "Trump" will be treated fairly on @FoxNews. His word is always good! _E_ .@AndrewKreig Thank you Andrew so correct! _E_ Mark Begich votes with Obama 97%. He opposes drilling & supports Amnesty for illegals. Next Tuesday vote @DanSullivan2014! _E_ Get ready for two amazing episodes of Celebrity Apprentice tomorrow night (Monday) at 8:00. Some incredible things happen! _E_ Obama better than last time but again @MittRomney wins. Good night. #debate _E_ I don't believe you have to be better than everybody else. I believe you have to better than you ever thought you could be. Ken Venturi _E_ Stuart Stevens the failed campaign manager of Mitt Romney's historic loss is now telling the Republican Party what to do with Trump. Sad! _E_ Remember all these 'freedom fighters' in Syria want to fly planes into our buildings. _E_ Great column by David Bossie at @BreitbartNews: "A Battle Won but the War Continues to Defund ObamaCare" __HTTP__ _E_ After reading the false reporting and even ferocious anger in some dying magazines it makes me wonder WHY? All I want to do is #MAGA! _E_ Met with President Putin of Russia who was at #APEC meetings. Good discussions on Syria. Hope for his help to solve along with China the dangerous North Korea crisis. Progress being made. _E_ RECORD HIGH FOR S & P 500! _E_ "Trumps Are Giving @TrumpDoral A Makeover" __HTTP__ via @CBSMiami _E_ The Trans Pacific Partnership is an attack on America's business. It does not stop Japan's currency manipulation. This is a bad deal. _E_ My @gretawire interview re: the dismal job report getting ripped off by South Korea 2016 election & #WWEHOF __HTTP__ _E_ .@FoxNews treats me so badly. Using old Quinnipiac Poll where I have a much smaller lead than the just out @CNN Poll. All negative! _E_ Address to the NationFull Video & Transcript: __HTTP__ __HTTP__ _E_ Why are we letting the three girls who left the U.S. to join ISIS back into the country? 
How stupid has our once respected country become! _E_ Via @ TheScotsman: "Donald Trump to lay out new golf course plan" __HTTP__ _E_ Via @townhallcom by @MattTowery: "Why Trump Should Run" __HTTP__ _E_ New poll thank you! #Trump2016 __HTTP__ __HTTP__ _E_ "If you don't have problems you're pretending or you don't run your own business." –Donald J. Trump __HTTP__ _E_ I will be interviewed by @TuckerCarlson tonight at 9:00 P.M. on @FoxNews. Enjoy! _E_ "Trump to campaign for @SteveKingIA" __HTTP__ via @kscj1360 _E_ With ZERO Democrats to help and a failed expensive and dangerous ObamaCare as the Dems legacy the Republican Senators are working hard! _E_ Entrepreneurs: Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_ Just took off for ceremony @ Pearl Harbor. Will then be heading to Japan SKorea China Vietnam & the Philippines. Will never let you down! _E_ The NFL has all sorts of rules and regulations. The only way out for them is to set a rule that you can't kneel during our National Anthem! _E_ .@EWErickson is a total low life read his past tweets. A dummy with no "it" factor. Will fade fast. _E_ The New York Giants are looking really bad so far tonight. Does not get much worse than this! _E_ Crooked Hillary said loudly and for the world to see that she SHORT CIRCUITED when answering a question on her e mails. Very dangerous! _E_ #SuccessByTrump Here's a photo from my appearance at @Macy's Herald Square with @ximenanr __HTTP__ _E_ Celebrity Apprentice starts in 15 minutes on NBC. ENJOY! _E_ Looking forward to being at the @RyderCupUSA announcement tonight. _E_ The Miss U.S.A. pageant will be amazing tonight. To be politically incorrect the girls (women) are REALLY BEAUTIFUL. NBC at 8 PM. _E_ Good move by Aubrey to be the red headed model they didn't have. #sweepstweet _E_ .@EdRendell's book A Nation of Wusses is an excellent read especially page 10. Go get it! _E_ Does any Republican have the ability to negotiate? 
_E_ We need jobs & we need them fast. I am a job creator. None of the pols can or will. Let's Make America Great Again! __HTTP__ _E_ If Mitch McConnell wants to win his election he'd better get rid of jinxed Karl Rove and fast... _E_ So nice of @Cher greatly appreciated! __HTTP__ _E_ #CelebApprentice Time for the first firing of the night. _E_ .@marcthiessen is a failed Bush speechwriter whose work was so bad that he has never been able to make a comeback. A third rate talent! _E_ The Fannie and Freddie execs should not get million dollar bonuses with our tax dollars. They were bailed out with $169B of our money. _E_ "The only source of knowledge is experience." – Albert Einstein _E_ Watch #MissUSA 2012 live tonight on @NBC at 9PM EST! _E_ Why is the @GOP being asked to do a debate that is so much longer than the just aired and very boring #DemDebate? _E_ It was a great honor to welcome Atlanta's heroic first responders to the White House this afternoon! __HTTP__ _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ When Strasburg leaves @Nationals for another team for more money will Washington still like the decision to shut him down for his good? _E_ Statement on International Holocaust Remembrance Day: __HTTP__ _E_ Good luck to the US Men's National Team in tomorrow's CONCACAF Cup vs. Mexico! It should be a great game! __HTTP__ _E_ No such meeting or conversation ever happened a made up story by low ratings @CNN. _E_ Congratulations to Brandy as our new Apprentice and to Clint for being a great player. It's been a terrific season! _E_ I am hearing that @NRCC Digital Director @lansing is doing great work expanding and modernizing @GOP social media. Good – we need it. _E_ RT @jayMAGA45: NFLplayer PatTillman joined U.S. Army in 2002. He was killed in action 2004. He fought 4our country/freedom. #StandForOurAnt... 
_E_ We blow up the famous Blue Monster at Trump National Doral on.Monday in order to build a spectacular new bigger and better Blue Monster! _E_ With the world's top amenities @TrumpTO's luxury residential condominiums provide the ultimate Toronto lifestyle __HTTP__ _E_ Find out who and what is the best in your field. Identify the trendsetters leaders and authorities. Learn the standards they follow. _E_ Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_ RT @EricTrump: Join my family in this incredible movement to #MakeAmericaGreatAgain!! Now it is up to you! Please #VOTE for America! __HTTP__ _E_ Yea NBC has increased all remaining Celebrity Apprentice episodes to two hours starting at 9 P.M. on Sunday! Amazing show. _E_ To all young college graduates – stick in there keep your head up and make sure you don't miss any opportunities. They are out there. _E_ To be called Trump Links at Ferry Point course will be GREAT and over the years hold many tournaments and major championships $'s to NYC. _E_ #TrumpAdvice __HTTP__ _E_ I have instructed Homeland Security to check people coming into our country VERY CAREFULLY. The courts are making the job very difficult! _E_ RT @Morning_Joe: VIDEO: @realDonaldTrump announces 'a very powerful endorsement' will be coming today. __HTTP__ _E_ Great Twitter poll and I wasn't even there. Thank you! #GOPDebate __HTTP__ _E_ Mitt did the right thing—not because he had to but because he never would have been given a second chance after his first fiasco _E_ Just did final purchase on fabulous @LodgeatDoonbeg in Ireland. Will become Trump International Hotel & Golf Links Ireland. Very exciting! _E_ Very exciting. I will be at Macy's Herald Square this Wednesday at 5:30pm to celebrate the launch of Trump Home crystal! _E_ Thank you @ShopFloorNAM. An honor to be with you today. Great news! Manufacturers report record high economic optimism in 2017. 
#TaxReform __HTTP__ _E_ Our clubhouse facility & suites in Ireland @LodgeatDoonbeg #TrumpIreland __HTTP__ __HTTP__ _E_ 46% of Americans think the Media is inventing stories about Trump & his Administration. @FoxNews It is actually much worse than this! _E_ If Russia or any other country or person has Hillary Clinton's 33000 illegally deleted emails perhaps they should share them with the FBI! _E_ RT @Scavino45: Under POTUS' @realDonaldTrump S&P 500 38th📈Record High NASDAQ 44th📈Record High#MakeAmericaGreatAgain __HTTP__ _E_ Has anyone seen the financials of @Univision. They are doing really badly. Too much debt and not enough viewers. Need money fast. Funny! _E_ Mike Flynn should ask for immunity in that this is a witch hunt (excuse for big election loss) by media & Dems of historic proportion! _E_ The @Lakers should have an amazing team next year with Kobe Nash and Howard. Will be fun to watch. _E_ Via @GolfweekMag: Major makeover: Trump has big vision for Doral __HTTP__ by @BKleinGolfweek _E_ Thank you! #AmericaFirst __HTTP__ _E_ Join me on #FacebookLive as I conclude my final #debate preparations. __HTTP__ _E_ #sweepstweet Teresa seems to underestimate the power of observance—that of the client as well as her team but she's a wonderful person _E_ Understand that difficulties mistakes & setbacks are an inevitable part of business and life. Don't allow them to knock you off your feet. _E_ "Relax & clear your mind if someone is speaking so that you're receptive to what they're saying." – Roger Ailes You are the Message _E_ With 18 beautiful holes each boasting unique characteristics Trump Nat'l Philadelphia is a Golf treasure __HTTP__ _E_ Jeb Bush just announced he raised over $100M. Everyone of those people who contributed are getting something to the detriment of America! _E_ Can you believe thatwith all of the problems and difficulties facing the U.S. 
President Obama spent the day playing golf.Worse than Carter _E_ The CIA deserves our praise for taking the fight to the enemy in the dark corners of the world. The CIA perseveres the politicians whine! _E_ Wake up Jeb supporters! __HTTP__ _E_ THE APPRENTICE. 10 years 182 shows many at number one for week or night Amazing! @NBC _E_ I am honored to be receiving the American Spectator Foundation Award for excellence in entrepreneurialism in Washington DC this fall. _E_ Just landed in North Carolina heading to the J.S. Dorton Arena. See you all soon! Lets #MakeAmericaGreatAgain! __HTTP__ _E_ Heading into the 12 days with great negotiating strength because of our tremendous economy. __HTTP__ _E_ "I also protect myself by being flexible. I never get too attached to one deal or one approach." – THE ART OF THE DEAL _E_ ...Maybe the best thing to do would be to cancel all future press briefings and hand out written responses for the sake of accuracy??? _E_ The @WSJ Wall Street Journal loves to write badly about me. They better be careful or I will unleash big time on them. Look forward to it! _E_ My @nbcdfw int. by @EricKingNBC5 w/@IvankaTrump discussing the Sunday @nbc premiere of @ApprenticeNBC's 14th season __HTTP__ _E_ Trump Int'l Hotel & Tower Chicago has received accolades for design service & our signature restaurant Sixteen __HTTP__ _E_ I was referring to a backstop for pre existing conditions. I will eliminate the law in its entirety & replace it w/ something much better. _E_ This Russian connection non sense is merely an attempt to cover up the many mistakes made in Hillary Clinton's losing campaign. _E_ On 59th & Park Avenue Trump Park Avenue transformed the legendary Hotel Delmonico into 120 luxury residences __HTTP__ _E_ Lets #MakeAmericaGreatAgain Maryland! #VoteTrump __HTTP__ _E_ . @Newsmax__Media is one of the top media outlets in the country. @ChrisRuddyNMX has revolutionized political commentary and reporting. 
_E_ The deplorables came back to haunt Hillary.They expressed their feelings loud and clear. She spent big money but in the end had no game! _E_ Axl Rose should take his #rockhall2012 honors and be happy. Stop the no induction nonsense. Do it for your fans @axlrose. _E_ Pocahontas is at it again! Goofy Elizabeth Warren one of the least productive U.S. Senators has a nasty mouth. Hope she is V.P. choice. _E_ 'The Clinton Foundation's Most Questionable Foreign Donations'#PayToPlay #DrainTheSwamp __HTTP__ _E_ Big announcement coming soon regarding South Carolina... _E_ Hope & Change. Millions are losing their healthcare plans & ObamaCare is taking cancer patients' doctors away __HTTP__ _E_ The White House should stop publicly pressuring Israel on Iran. Iran's nuclear program is the threat not Israel's right to self defense. _E_ ... the ratings of Shark Tank. Everyone was hitting on me until the numbers came in—and now—dead silence! _E_ Work has begun ahead of schedule to build the greatest golf course in history: Trump International – Scotland. _E_ My statement as to what's happening in Sweden was in reference to a story that was broadcast on @FoxNews concerning immigrants & Sweden. _E_ I am not trying to get top level security clearance for my children. This was a typically false news story. _E_ I will be on The Situation Room with @wolfblitzer from 5 7pm est on CNN _E_ .@CoachDanMullen Great to have you and your GREAT team at Trump National Doral. Go out and finish your fantastic season in style! _E_ Great meeting with Governor Mapp of the #USVI. He is very thankful for the great job done by @FEMA and First Responders. __HTTP__ _E_ #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_ Remember @JebBush wants COMMON CORE (education from D.C.) and is very weak on ILLEGAL IMMIGRATION ( come as act of love ). Not a leader! _E_ An honor to welcome the Taoiseach of Ireland @EndaKennyTD to the @WhiteHouse today with @VP Pence. 
__HTTP__ _E_ Looking forward to being interviewed on the @marklevinshow tonight at 6:30 PM EST. Be sure to listen! _E_ Crooked Hillary Clinton was not at all loyal to the person in her rigged system that pushed her over the top DWS. Too bad Bernie flamed out _E_ Iran is going to buy 114 jetliners with a small part of the $150 billion we are giving them...but they won't buy from U.S. rather Airbus! _E_ Sad just 16% of American parents think their children will be better off than them __HTTP__ We can do much better! _E_ Playing politics with the Keystone decision? @BarackObama vetos 20000 jobs and cheaper oil. _E_ Via @gatewaypundit: "Mother of Murdered Teen Thanks Donald Trump During Senate Hearing" __HTTP__ _E_ I will be interviewed on @seanhannity tonight at 10pmE on @FoxNews. Enjoy! _E_ America wasted billions and precious lives in Iraq and Iran will soon take control very very sad. _E_ The Dems want to stop tax cuts good healthcare and Border Security.Their ObamaCare is dead with 100% increases in P's. Vote now for Karen H _E_ If you really want to succeed you'll have to go for it every day. The big time isn't for slackers. Keep up your stamina and remain curious. _E_ RT @AnnCoulter: I hear Churchill had a nice turn of phrase but Trump's immigration speech is the most magnificent speech ever given. _E_ Windmills are the greatest threat in the US to both bald and golden eagles. Media claims fictional 'global warming' is worse. _E_ Thank you America! #Trump2016 __HTTP__ _E_ ...never allow the Republicans to pass even great legislation. 8 Dems control will rarely get 60 (vs. 51) votes. It is a Repub Death Wish! _E_ Must read article by @boonepickens & @AmbJohnBolton: "America's Untapped Energy Weapon" __HTTP__ We don't need foreign oil! _E_ Crooked @HillaryClinton's foundation is a CRIMINAL ENTERPRISE. Time to #DrainTheSwamp! __HTTP__ #BigLeagueTruth #Debate _E_ "Government's first duty is to protect the people not run their lives." 
– Ronald Reagan _E_ The @MissUSA 2012 contestants standing outside of Trump Tower in New York City __HTTP__ @MissUSA 2012 tomorrow at 9PM ET NBC. _E_ .@TrumpChicago's Spa has an array of 5 star services12 treatment rooms & 53 spa guestrooms w/great views __HTTP__ _E_ The independent watchdog who exonerated @BarackObama for the failed green energy loans just donated $52500 to Obama's campaign. _E_ Imagination is more important than knowledge. Albert Einstein _E_ The Pope should not have resigned—he should have lived it out. It hurts him it hurts the church... _E_ You are right more like the opening of the Tonys. _E_ I can't believe we are not asking South Korea for anything. They make a fortune on us while we spend a fortune defending them how stupid! _E_ The U.S. Senate should switch to 51 votes immediately and get Healthcare and TAX CUTS approved fast and easy. Dems would do it no doubt! _E_ RT @JackPosobiec: Dick Durbin called Trump racist for wanting to end chain migration. Here's a video of Dick Durbin calling for an end to... _E_ It has been 1000 days since @BarackObama has passed a budget. He continues to spend this country into the ground without any control. _E_ Presidents and their administrations have been talking to North Korea for 25 years agreements made and massive amounts of money paid...... _E_ Via @EllonTimesKenny: Trump course sparks international interest __HTTP__ _E_ The Great State of Arizona where I just had a massive rally (amazing people) has a very weak and ineffective Senator Jeff Flake. Sad! _E_ .@JebBushAt the debate you said your brother kept us safe I wanted to be nice & did not mention the WTC came down during his watch 9/11. _E_ Thank you to our fantastic veterans. The reviews and polls from almost everyone of my Commander in Chief presentation were great. Nice! _E_ Obama has zero credibility on oil and coal. If we do not win energy as a country we just do not win period! 
_E_ From the Wall Street Journal: Google Steps Into Autism Research re @autismspeaks __HTTP__ _E_ Via @CBNNews' @TheBrodyFile: "Poll: Donald Trump in GOP Top Tier for President" __HTTP__ _E_ Great and we should boycott Fake News CNN. Dealing with them is a total waste of time! __HTTP__ _E_ If the gov't shuts down it is because Obama wants to make working Americans buy ObamaCare while businesses and gov't are exempt. _E_ .@MissUniverse ratings were great! A big win and a wonderful night! __HTTP__ _E_ Negotiation is a true talent. It is an art. And our politicians are killing our country b/c they don't have it. @SRQRepublicans speech _E_ RT @DanScavino: .@POTUS @realDonaldTrump signs executive orders on trade that will set the stage for revival in American manufacturing. #Am... _E_ Surprise – China has spies throughout NASA stealing our R&D __HTTP__ When will we ever make them pay for espionage? _E_ I am confident when American public gets to know @MittRomney the race will go his way. He's honorable & successful man polls looking good. _E_ CBO estimates over 2.3M jobs will be lost due to ObamaCare __HTTP__ Elections have consequences. _E_ Just landed from Paris France. It was an incredible visit with President @EmmanuelMacron. A lot discussed and accomplished in two days! _E_ Penn Jillette shows his dark side in new crowdfunded film Director's Cut __HTTP__ @pennjillette @bradwyman _E_ Excellent story on @MittRomney very good moment for Ryan. #VPDebate _E_ I'll be signing copies of my new book @TimeToGetTough tomorrow at Trump Tower (5th Avenue between 56 and 57) from noon to 2pm. _E_ Stuart Stevens is a dumb guy who fails @ virtually everything he touches. Romney campaignhis booketc. Why does @andersoncooper put him on? _E_ I agree! The headline says it all. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Our economy is in trouble. The unemployed are more likely to drop out of the workforce than find a job. We need growth and now! _E_ Tom Brady just did it again. 
He is not only a great guy he is without question the BEST quarterback! _E_ It's Thursday. How much money did Barack Obama waste today on crony green energy projects? _E_ Jeb Bush just said about Marco Rubio he's my friend! Pure political speak. Why can't he be truthful and say disloyal guy no friend! _E_ Great job @IvankaTrump! #RNCinCLE __HTTP__ _E_ Don't attack Syria an attack that will bring nothing but trouble for the U.S. Focus on making our country strong and great again! _E_ 500 of the most vicious prisoners escaped from an Iraq prison today. That country is a time bomb waiting to happen a total corrupt mess! _E_ I would absolutely consider investing in Atlantic City again great and hard working people but much would have to change taxes regs. etc _E_ Wow now leading in @ABC /@washingtonpost Poll 46 to 45. Gone up 12 points in two weeks mostly before the Crooked Hillary blow up! _E_ Thank you for your support. Together we will MAKE AMERICA SAFE AND GREAT AGAIN!#POTUSAbroad #USA __HTTP__ _E_ A Rod should do the Yankees a favor and never play again. _E_ #HappyIndependenceDay #July4 #USA __HTTP__ _E_ It was a great privilege to meet with President Moon of South Korea.Stay tuned! 🇰 #UNGA __HTTP__ _E_ #1. Keep the big picture in mind. There are always opportunities and possibilities and thinking too small can negate a lot of them. _E_ We continue to lose our nation's finest in Afghanistan almost daily. The Rules of Engagement are costing lives. _E_ Concerns over the national debt are stopping businesses from hiring and expanding __HTTP__ Obama's policies are unsustainable _E_ How can @BarackObama invoke Richard Nixon against @MittRomney when Obama just used Executive Privilege on Fast & Furious?! _E_ THANK YOU! #Trump2016 __HTTP__ _E_ RT @JaydaBF: VIDEO: Muslim migrant beats up Dutch boy on crutches! __HTTP__ _E_ Thank you @TheTodaysGolfer for the wonderful statement that the new par 3 9th hole @Trump Turnberry could be the most dramatic in Britain. 
_E_ RT @kevcirilli: CEDAR RAPIDS TRUMP'S DAUGHT IVANKA: I can just say without equivocation my father will make America great again. _E_ Boy is this guy @ShepNewsTeam tough on me. So totally biased. As a reporter he should be ashamed of himself! #Trump2016 _E_ We are going to have a great time in Cleveland. Will lead to special results for our country. We will Make America Great Again! _E_ The Fed's reckless monetary policies will cause problems in the years to come. The Fed has to be reined in or we will soon be Greece. _E_ Cutting taxes and simplifying regulations makes America the place to invest! Great news as Toyota and Mazda announce they are bringing 4000 JOBS and investing $1.6 BILLION in Alabama helping to further grow our economy! __HTTP__ _E_ Anybody that believes in strong borders and stopping illegal immigration cannot vote for Marco Rubio READ THIS: __HTTP__ _E_ Americans already believe that @PaulRyanVP is better qualified to serve as President over @JoeBiden __HTTP__ No surprise. _E_ Oil is under $50/barrel. Now is the time to increase sanctions against Iran not lift them. No deal is better than a bad deal. #ArtOfTheDeal _E_ Doctors have already died treating Ebola __HTTP__ We should not be importing the disease to our homeland. _E_ If you can't run your own house you certainly can't run the White House A statement made by Mrs. Obama about Crooked Hillary Clinton _E_ I have been saying for weeks for President Obama to stop the flights from West Africa. So simple but he refused. A TOTAL incompetent! _E_ Wow little Mac Miller has almost 100 million views on his song Donald Trump. Keep pushing Mac and come up with another hit just do it! _E_ Why is the NFL getting massive tax breaks while at the same time disrespecting our Anthem Flag and Country? Change tax law! 
_E_ My @FoxNews with @gretawire discussing the Keystone pipeline Re election is more important than 20000 jobs and (cont) __HTTP__ _E_ #TrumpVlog @Rosie wasn't even a short term fix at The View. __HTTP__ _E_ The Wall Street Journal stated falsely that I said to them "I have a good relationship with Kim Jong Un" (of N. Korea). Obviously I didn't say that. I said "I'd have a good relationship with Kim Jong Un" a big difference. Fortunately we now record conversations with reporters... _E_ The first 90 days of my presidency has exposed the total failure of the last eight years of foreign policy! So true. @foxandfriends _E_ Health insurance premiums are rising by double digits __HTTP__ Another tax to the consumer by Obama Care. Enjoy! _E_ Thank you on my way! __HTTP__ _E_ Today in Florida I pledged to stand with the people of Cuba and Venezuela in their fight against oppression cont: __HTTP__ _E_ I have gotten to know many Spanish speaking people as the owner of Trump National Doral in Miami. They are smart hard working and great _E_ My interview with @Newsmax_Media where I explain that gas is headed to $5 $6 and why @RickSantorum can't win __HTTP__ _E_ My experience in Iowa was a great one. I started out with all of the experts saying I couldn't do well there and ended up in 2nd place. Nice _E_ With Sen. Elizabeth Dole & @DoleFoundation Caregiver Fellows. Tremendous people caring for our military & veterans! __HTTP__ _E_ Must read via @IowaGOP by @shanevanderhart: "Congress Should Vote No on Syria" __HTTP__ _E_ Wind Power Company Fined $1 Million for Killing Birds. Golden eagles among victims... __HTTP__ @alexsalmond @Aberdeenshire _E_ With almost 1.3 million followers and rising really fast everyone is asking me to critique things(and people). Finally I will be a critic. _E_ I want to thank evangelical Christians for the warm embrace I've received on the campaign trail. Video: __HTTP__ _E_ Liberty University speech by DJT was biggest by far in school's history. 
Standing ovations...great young people! _E_ How many more billions of dollars will @BarackObama continue to waste in these solar companies? _E_ Beginning today the United States of America gets back control of its borders. Full speech from today @DHSgov:... __HTTP__ _E_ "Keep a good attitude and do the right thing even when it's hard. When you do that you are passing the test." @JoelOsteen _E_ You know what is the worst part of @BarackObama's Tuesday speech playing class warfare we paid for it with our tax dollars. _E_ People have got to stop working to be so politically correct and focus all of their energy on finding solutions to very complex problems! _E_ It is so great to be back home! Looking forward to a great rally tonight in Bethpage Long Island! _E_ With my friends at the great @Adidas Boost event at the @cadillacchamp at @trumpdoral __HTTP__ _E_ Not only did the $1B ObamaCare website not work it can't even protect your personal information __HTTP__ A disaster. _E_ Smart move by the Democrats to have Pres. @billclinton play a key role in their convention. _E_ I think the @NewYorkObserver was far too nice to sleazebag @AGSchneiderman. He's got plenty more to worry about!. _E_ Find something for everyone on your list with this Holiday Gift Guide from @TrumpSoHo on @TrumpCollection's Tumblr: __HTTP__ _E_ The important thing is not to stop questioning. Curiosity has its own reason for existing. Albert Einstein _E_ Tickets are now available for the 2015 @CadillacChamp at @TrumpDoral March 4 8: __HTTP__ _E_ In his own words @BarackObama was born in Kenya and raised in Indonesia and Hawaii. This statement was made (cont) __HTTP__ _E_ New ad concerning lightweight Senator Marco Rubio: __HTTP__ _E_ Catch me on Fox News right now my interview with Neil Cavuto __HTTP__ _E_ France was just stripped of its AAA bond rating. With the PMs radical tax rates... 
_E_ So now that Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the "unsolved mystery" that took place in Florida years ago? Investigate! _E_ Must read article by @EmilyMiller: "Anthony Weiner is a twit who treats women like dirt" __HTTP__ _E_ Via @CNNPolitics by @JDiamond1: "Trump: RNC call was 'congratulatory'" __HTTP__ _E_ Our soldiers can't even have any more joint exercises with Afghan soldiers because they are getting shot in the (cont) __HTTP__ _E_ With @VanityFair circulation and advertising revenue doing so badly rumor has it that dopey Graydon Carter is going to resign? He should. _E_ Very much enjoyed my tour of the Smithsonian's National Museum of African American History and Culture...A great job done by amazing people! _E_ The media is unrelenting. They will only go with and report a story in a negative light. I called Brexit (Hillary was wrong) watch November _E_ Young entrepreneurs should always remember that if you do not promote yourself no one else will! _E_ Obama vacationing in West Palm Beach starting tomorrow. He should play a round at Trump Int'l Golf Club #1 rated course in Florida. _E_ Via @PJMedia_com by @NicholasBallasy: "Trump Calls Election a 'Big Blow to Obama... I Think He's in Denial'" __HTTP__ _E_ Well maintained real estate is always going to be worth a lot more than poorly maintained real estate. The Art of the Deal _E_ It is a great honor for me to be inducted into the @WWE Hall of Fame. This will take place on April 6... _E_ TONIGHT! NORTH CAROLINA: __HTTP__ GEORGIA: __HTTP__ NEVADA: __HTTP__ _E_ .@BernardGoldberg was not good tonight on @oreillyfactor. He just doesn't know about winning! But he is a nice guy. _E_ Our President must be very careful with the 28 year old wack job in North Korea. 
At some point we may have to get very tough blatant threats _E_ Last October on @meetthepress @chucktodd attacked @jack_welch and I for saying Obama cooked the job number. Will he apologize? _E_ If you are a young entrepreneur just entering the business world I highly recommend that you read The Art of (cont) __HTTP__ _E_ It's Wednesday how many more of our embassies will be stormed by Islamists? _E_ Obama is tougher on WWII vets wanting to visit a DC memorial than Iran. He needs to show respect to our vets and not play games. _E_ While I am a critic of President Obama I hate it when someone (Robert Gates) writes a self serving negative book about his boss. _E_ You never know when the tide is going to turn in your favor. It's important to never give up on yourself. Think Like a Champion _E_ Thank you @scottienhughes for your powerful words on @FoxNews. I am with the Evangelicals and Tea Party big time. We will all WIN together! _E_ The Trump Tower restaurant Trump Grill just received the highest sanitary inspection grade possible "A" – the food is also great! _E_ Via @MiamiHerald: Donald Trump to be inducted into WWE Hall of Fame __HTTP__ _E_ Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person! _E_ Hank Greenberg formerly of AIG gave $10 million to the @JebBush campaign 3 months ago. He is not happy a total waste of money! _E_ Kate Middleton is great but she shouldn't be sunbathing in the nude only herself to blame. _E_ 'President Donald J. Trump Approves Emergency Declarations' __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_ A bad manager such as @BarackObama will continually be plagued by scandals. __HTTP__ Leadership starts at the top. _E_ China attempted to sell embargoed computers to Iran __HTTP__ China loves these deals! _E_ Due to the holiday I will NOT be doing Fox & Friends this morning. Next Monday at 7. 
_E_ It is a shame Keystone wasn't powered by solar panels and wind because then @BarackObama would have wasted billions on it. _E_ Our trade deficit just jumped in May to "the second highest level on record" __HTTP__ FAIR trade not free trade. I TOLD YOU. _E_ There is ZERO margin for error on Ebola. Are we confident in Obama when he can't even make a website for $5 Billion? _E_ THANK YOU INDIANA! #Trump2016 __HTTP__ _E_ Record snowfall & freezing temps throughout the country. Where is Global Warming when you need it?! _E_ Re: Success Don't put blinders on and do not limit yourself reach out seek and explore. Think big at all times. _E_ Scary. Our military is a using a Chinese made satellite for North Africa command communications __HTTP__ _E_ Read Donald Trump's Top Ten Tips for Success: __HTTP__ _E_ I'd like to call JEB a liar but the truth is he has no clue & never revealed that he used Eminent Domain when criticizing me! (1/2) _E_ Thank you to all of the supporters who far out numbered the protesters yesterday at the Women's U.S. Open. Very cool! _E_ Congrats to Pres.Obama and Dems. CBO has TRIPLED its estimate of working hours lost due to ObamaCare __HTTP__ Job Killer _E_ It was just announced by sources that no charges will be brought against Crooked Hillary Clinton. Like I said the system is totally rigged! _E_ RT @gatewaypundit: Democrat Fire Marshal Turns THOUSANDS of Trump Supporters Away at Columbus Rally __HTTP__ via @gatewaypun... _E_ I really enjoyed being at the Iowa State Fair. The crowds love and enthusiasm is something I will never forget. _E_ Despite winning the second debate in a landslide (every poll) it is hard to do well when Paul Ryan and others give zero support! _E_ "Remember people's names and small details about them. Use both in conversation... _E_ Thank you for inviting me to the Western Conservative Summit in Colorado! #ImWithYou #WCS16 __HTTP__ __HTTP__ _E_ Adrian also gives autographs if you stop by the lobby of @TrumpTowerNY. 
#CelebApprentice _E_ I talk about Obamacare in today's #TrumpVlog __HTTP__ _E_ The new job figures don't include 315000 people who have given up looking for jobs. _E_ We are building China's wealth by buying all their products even though we make better products in America. _E_ 'Presidential Executive Order on Strengthening the Cybersecurity of Federal Networks and Critical Infrastructure'... __HTTP__ _E_ Each day that Iran delays the deal if that is what you call it we must add another sanction and make them progressively tough. _E_ Via @HeraldBusiness by @hannahbsampson: "@TrumpDoral looking to hire hundreds" __HTTP__ _E_ The @nytimes is so poorly run and managed that other family members are looking to take over control. With unfunded liabilities big trouble! _E_ My thoughts on last night's meeting with @SarahPalinUSA in today's #trumpvlog... __HTTP__ _E_ Heading for Atlanta tomorrow morning for noon speech at North Atlanta Trade Center. Big crowds great people! _E_ Georgetown should not host @KathleenSebelius for the graduation ceremony. Her policies abuse Catholics. _E_ .@PolitiTrends @realdonaldtrump is dominating the discussion on Twitter with 79352 mentions today (via __HTTP__ ) _E_ RT @GovAbbott: To ensure your safety ahead of #Harvey heed warnings from local officials & review important safety information. __HTTP__ _E_ 83% of the government is still running during the shutdown while 41% of nondefense federal workers are furloughed. Room for cuts. _E_ I am in Baton Rouge where the Miss USA Pageant will be shown live on NBC on Sunday night for 3 hours starting at 8 P.M. INCREDIBLE SHOW! _E_ My @Newsmax_Media interview discussing OPEC US gas resources @MittRomney and running a campaign against @BarackObama __HTTP__ _E_ Irrelevant clown @KarlRove sweats and shakes nervously on @FoxNews as he talks bull about me. Has zero cred. Made fool of himself in '12. _E_ Remember after this new episode starts 5 MINUTES! 
_E_ The Republicans have been played into a trap by the President they forgot the 14th amendment..... _E_ The convention in Cleveland will be amazing! __HTTP__ _E_ I just arrived in Miami where I will be checking out construction of the brand new Trump National Doral always closely watch construction! _E_ Last night in his SOTU @BarackObama claimed that he is a friend of Israel. Does anyone really believe that. _E_ We need a #POTUS with great strength & stamina. Hillary does not have that.#Trump2016 __HTTP__ __HTTP__ _E_ My Monday @foxandfriends interview discussing the fiscal cliff negotiations making the big deal and who has the cards __HTTP__ _E_ I will be interviewed from Cleveland Ohio on @seanhannity Tonight at 10:00 P.M. Enjoy! _E_ Clinton made a false ad about me where I was imitating a reporter GROVELING after he changed his story. I would NEVER mock disabled. Shame! _E_ When I made the Apprentice the #1 show in the US that was a good day for you... _E_ If Obama wins it is the end of the Republican party. @limbaugh _E_ Mainstream (FAKE) media refuses to state our long list of achievements including 28 legislative signings strong borders & great optimism! _E_ .@TrumpCollection's @DoralResort renovations are revitalizing Miami. The new course will be a great challenge __HTTP__ _E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ Celebrity Apprentice is rebroadcasting last weeks episode at 9 P.M. WITH A GREAT NEW EPISODE FEATURING @MELANIA TRUMP AT 10 P.M. AMAZING! _E_ My interview last week with Greta van Susteren is available here in slightly abridged form. __HTTP__ Good info to know about. _E_ Obama is looking like an incompetent fool in the handling of the war against.ISIS! Why isn't China and Russia helping they gain so much! _E_ The negative television commercials about me paid for by the politicians bosses are a total #Mediafraud. When you watch remember! 
_E_ If @MittRomney has a good debate tomorrow night Obama is finished! _E_ I am on @foxandfriends at 7:00 A.M. ENJOY! _E_ Heading to the great state of Mississippi at the invitation of their popular and respected Governor @PhilBryantMS. Look forward to seeing the new Civil Rights Museum! _E_ Will be landing in Knoxville Tennessee shortly tremendous crowd expected. It's all very simple we want to #MakeAmericaGreatAgain! _E_ Honored to receive an endorsement from @SJSOPIO thank you! Together we are going to MAKE AMERICA SAFE & GREAT AG... __HTTP__ _E_ Thank you Indiana! #Trump2016 __HTTP__ _E_ Today we are going to win the great state of MICHIGAN and we are going to WIN back the White House! Thank you MI!... __HTTP__ _E_ Mariano Rivera is greatest closer of all time. A leader in the club house & an exceptional man. One of the best @Yankees in history. _E_ Why would anyone think Obama would attack Syria the day of his speech in Washington. He doesn't want to detract from his press & glory. _E_ If the GOP will have any chance to beat @BarackObama in November the great people of Michigan need to support @MittRomney's candidacy. _E_ I always said the people we fought for in Libya were bad news. Once again I was right. _E_ Making my speech. #WWEHOF __HTTP__ _E_ Don't forget the three hour episode of Celebrity Apprentice this Sunday night 8pm 11pm on NBC. You're in for a (cont) __HTTP__ _E_ The windfarm approval in Scotland is subject to many conditions that can never be met will be tied up in courts for years! #EOWDC _E_ I never did give anybody hell. I just told the truth and they thought it was hell. Harry S. Truman _E_ #HappyMothersDay! __HTTP__ __HTTP__ _E_ HAPPY THANKSGIVING to everyone I love you all even my many enemies (sometimes!). _E_ We must protect our veterans. #MakeAmericaGreatAgain __HTTP__ _E_ How do you like the boardroom so far? _E_ Wow l just found out that A.G. Schneiderman met with President Obama in Syracuse on Thursday and sued me on Saturday! 
Same as IRS etc. _E_ Bureaucratic red tape and overregulation are discouraging the American dream. It's time for a bold new direction! __HTTP__ _E_ When somebody challenges you unfairly fight back be brutal be tough don't take it. It is always important to WIN! _E_ Get out to VOTE on 11/8/2016 and we will #DrainTheSwamp!RASMUSSEN NATIONAL Trump 43%Clinton 41% __HTTP__ _E_ Dopey Sugar @Lord_Sugar I hear your ratings last week were at an all time low you better get them up or you'll be fired. _E_ This show was taped just before the terrible Bill Cosby revelations came to light.She still should have asked him for money goes to charity. _E_ Look what happened to the autism rate from 1983 2008 since one time massive shots were given to children __HTTP__ _E_ Just got home watching the news and every story is bad about the U.S. Someday we will return to being great again but we need leadership! _E_ CHILD CARE REFORMS THAT WILL MAKE AMERICA GREAT AGAIN!Transcript: __HTTP__ __HTTP__ _E_ Our incredible U.S. Coast Guard saved more than 15000 lives last week with Harvey. Irma could be even tougher. We love our Coast Guard! _E_ Via @ABCPolitics by @ajdukakis & @rickklein: Mr. Trump Goes to Washington And Talks 2016 __HTTP__ _E_ Franklin such a great photo. HAPPY 99th BIRTHDAY to your father @BillyGraham! __HTTP__ _E_ Will Team Power be able to withstand Omarosa as PM? Smooth sailing is not expected. _E_ #MakeAmericaGreatAgain __HTTP__ _E_ Don't talk to me about Bush I was never a defender or a fan! _E_ Mexico's court system corrupt.I want nothing to do with Mexico other than to build an impenetrable WALL and stop them from ripping off U.S. _E_ What's incredible is that Obamacare hasn't even kicked in yet and already it's doing tremendous damage. (cont) __HTTP__ _E_ .@KirstenPowers New book is excellent and so true! Congrats! _E_ The ever dwindling @WSJ which is worth about 1/10 of what it was purchased for is always hitting me politically. Who cares! 
_E_ I hope @MittRomney now starts asking for any & all of @BarackObama's sealed records it's time. _E_ I am very proud of @StephenBaldwin7's performance in the record 13th season of All Star @CelebApprentice. Watch. _E_ Very nice @HuffingtonPost @pollsterpolls has me in first place at 18% and Bush second at 14% __HTTP__ _E_ My @greta int. on @FoxNews with @MELANIATRUMP at OPO discussing my potential candidacy & making America great again __HTTP__ _E_ China demanded that we raise our debt ceiling and then their rating agency downgraded us. Our leaders are hope... (cont) __HTTP__ _E_ The Amateur! First @BarackObama was caught bowing to the Saudi King but now the President of Mexico! __HTTP__ _E_ Always protect against the downside the upside will take care of itself. Donald J. Trump _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Whether you have someone managing your finances or you're doing it yourself money like anything takes maintenance & planning to grow. _E_ Elite Traveler & the 12 Best Hotel Room Views in the World __HTTP__ #TrumpChicago _E_ Team Power+@LilJon= Spielberg? Let's find out. #CelebApprentice _E_ Congratulations to David Wright on signing a long term extension with the @Mets. David is an exceptional player and person. _E_ The @Washingtonpost reported about the closing hotels in Atlantic City but knowingly failed to report that I am not involved left years ago _E_ Watch ET tonight to find out what my beautiful wife will be wearing at the Met Gala! __HTTP__ _E_ ...Re: China I told you that a long time ago. __HTTP__ _E_ Thank you to Tom Brady Coach Ditka Coach Bobby Knight and all of the many champions that have been so supportive! _E_ Yesterday Obama campaigned with JayZ & Springsteen while Hurricane Sandy victims across NY & NJ are still decimated by Sandy. Wrong! 
_E_ Via @TIME by @ZekeJMiller: "Trump Talks Politics at His Virginia Winery" __HTTP__ _E_ Be sure to check @fundanything to see my picks __HTTP__ _E_ Just out: The same Russian Ambassador that met Jeff Sessions visited the Obama White House 22 times and 4 times last year alone. _E_ RT @ErinBurnett: Sat down w/ @EricTrump @DonaldJTrumpJr here in Iowa. Talked God @realDonaldTrump late night tweets __HTTP__ _E_ Seth Myers is so unnatural and uncomfortable doing his show that you have to feel sorry for him. Bad interviewer marbles in his mouth! _E_ .@davidaxelrod David Thank you my great honor for a very worthy cause! _E_ This will prove to be a great time in the lives of ALL Americans. We will unite and we will win win win! _E_ Will be on Fox & Friends tomorrow morning at 7.00. Will be discussing the disgusting and wasteful $635 million website rollout and more! _E_ Rise high in affordable luxury. Trump Parc Stamford offers gracious living with entertainment spaces __HTTP__ _E_ I have a gift for my loyal viewers of All Star @ApprenticeNBC Mrs. @MELANIATRUMP debut on this week's episode __HTTP__ _E_ We boarded the helicopter for Sarasota earlier & will be landing soon! See you there. #Trump2016 __HTTP__ _E_ Thank you Macomb County Michigan! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Received a #HurricaneHarvey briefing this morning from Acting @DHSgov Secretary Elaine Duke @FEMA_Brock @TomBossert45 and COS John Kelly. __HTTP__ _E_ CNN will soon be the least trusted name in news if they continue to be the press shop for Hillary Clinton. _E_ FLASHBACK: Donald Trump Answers Boy's Prayer for New Bike __HTTP__ via @FoxNewsInsider _E_ Mark it on your calendar: Comedy Central Roast March 15th at 10:30 pm for the Roast of Trump __HTTP__ _E_ Enough about my ties etc. @Macys but they are doing really big numbers people love them (and @Macys loves Trump)! _E_ When everything seems to be going against you remember that the airplane takes off against the wind not with it. 
Henry Ford _E_ #TrumpVlog China is laughing at U.S. __HTTP__ _E_ Entrepreneurs: Success is good. Success with significance is even better. _E_ Every poll shows high approval of the new sign on @TrumpChicago. I am honored by the great support. _E_ My @foxandfriends interview discussing Pres. Obama playing golf w/@TigerWoods US Airways American merger & oil __HTTP__ _E_ Negotiation tip: Be reasonable & flexible. Being open to change could lead you into a fortunate situation and open the door to innovation. _E_ Since @BarackObama is on such a transparency kick how about releasing Fast & Furious info to Brian Terry's family? __HTTP__ _E_ RT @foxandfriends: President Trump vows America will respond to North Korean threats with fire & fury in a warning to the rogue nation ht... _E_ Many Syrian 'rebels' are radical Jihadis. Not our friends & supporting them doesn't serve our national interest. Stay out of Syria! _E_ When true golfers see what I do at Doral it will be the hottest club in the country. #sayfie #newsmax _E_ I salute all Tea Party Patriots for marching on DC today. Stand strong! _E_ ICYMI The ALS #IceBucketChallenge that Trumps them all __HTTP__ @MissUSA @MissUniverse @DonaldJTrumpJr @EricTrump _E_ Polling shows nearly 7 in 10 Americans support an immigration reform package that includes DACA fully secures the border ends chain migration & cancels the visa lottery. If D's oppose this deal they aren't serious about DACA they just want open borders. __HTTP__ _E_ "Some people dream of great accomplishments while others stay awake and do them." Anonymous _E_ Good news @RasmussenPoll has @MittRomney beating @BarackObama 49% 44% __HTTP__ Obama was up by 5% at same point in '08. _E_ Looks like Plan B is stuck with the mechanical dog. @THEGaryBusey has latched on and won't let go. #CelebApprentice _E_ A resort in Arizona is using sewage to make snow. Environmentalists are going crazy I won't be skiing in that snow. 
_E_ Tomorrow the House votes on #KatesLaw & No Sanctuary For Criminals Act. Lawmakers must vote to put American safety... __HTTP__ _E_ "The best entrepreneurs believe the true measure of success has to do with the number of jobs their business creates." – Midas Touch _E_ We will continue to follow developments in Charlottesville and will provide whatever assistance is needed. We are ready willing and able. __HTTP__ _E_ Our country will soon be relegated to THIRD WORLD status if proper decisions are not made by our president. He was never qualified for job! _E_ It's sad to see once decent newspapers like @USAToday failing so badly. I just don't know if they can be saved. _E_ Aetna CEO: Obamacare in 'Death Spiral' #RepealAndReplace __HTTP__ _E_ "Donald Trump: 'I Will Take Full Credit' for Romney Dropping Out" __HTTP__ via @Newsmax_Media by @ssfitzgerald _E_ ....agencies not just the FBI & DOJ now the State Department to dig up dirt on him in the days leading up to the Election. Comey had conversations with Donald Trump which I don't believe were accurate...he leaked information (corrupt)." Tom Fitton of Judicial Watch on @FoxNews _E_ Departing for #GOPDebate. Let's #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_ Today I signed the Global War on Terrorism War Memorial Act (#HR873.) The bill authorizes....cont __HTTP__ __HTTP__ _E_ Durst is a disaster at operating the new World Trade Center. It takes forever for workers or visitors to get in with impossible security. _E_ Others claim they can make America great again but only one knows The Art of The Deal. It's time for an outsider __HTTP__ _E_ A great new poll 33%! __HTTP__ _E_ When will @BarackObama release an actual budget? _E_ .@MittRomney should have been more aggressive last night. Yet some polls have him winning the debate. _E_ great business in total in order to fully focus on running the country in order to MAKE AMERICA GREAT AGAIN! While I am not mandated to .... 
_E_ Glad to see that @RondaRousey lost her championship fight last night. Was soundly beaten not a nice person! _E_ Happy birthday to my friend the great @jacknicklaus a totally special guy! _E_ Business is easy. Dealing with people is hard. If you are an entrepreneur your most important job is to choose who works with you. _E_ The media is so dishonest. If I make a statement they twist it and turn it to make it sound bad or foolish.They think the public is stupid! _E_ Today I signed the Veterans (OUR HEROES) Choice Program Extension & Improvement Act @ the @WhiteHouse. #S544 Watch... __HTTP__ _E_ If my many supporters acted and threatened people like those who lost the election are doing they would be scorned & called terrible names! _E_ Latin America's tallest building @TrumpPanama is the perfect getaway location to celebrate the New Year in luxury __HTTP__ _E_ General Petraeus should stop apologising and get on with his life. He is a good man and should have a great future. _E_ When Obama tried to tweak his previous statement on ObamaCare he made it an even greater lie even the Senate Democrats are angry with him! _E_ .@SenRandPaul's Tea Party rebuttal to Obama's SOTU explained why limited government promotes freedom. Well done! _E_ If you want to kill any idea in the world get a committee working on it. Charles Kettering _E_ If you're sitting in an office working in a job you hate then it's time to THINK BIG and plan your next step... _E_ I've been saying it for a long long time. #NoKo __HTTP__ _E_ Both are looking good! Now we begin! _E_ .@ronsirak Thank you for being so fair this morning on @GolfChannel—greatly appreciated. _E_ .@hardball_chris is a really dumb guy(and I know him well)—that's why he works swimmingly with our leaders in Washington. _E_ We are building our future with American hands American labor American iron aluminum and steel. Happy #LaborDay! __HTTP__ _E_ ..under a magnifying glass they have zero tapes of T people colluding. 
There is no collusion & no obstruction. I should be given apology! _E_ Crimea was TAKEN by Russia during the Obama Administration. Was Obama too soft on Russia? _E_ Congratulation to Jane Timken on her major upset victory in becoming the Ohio Republican Party Chair. Jane is a loyal Trump supporter & star _E_ Oh the wonders of the Arab Spring. Our new allies in Egypt the Muslim Brotherhood just called the Holocaust a myth __HTTP__ _E_ Jeb Bush spent more than $40000000 in New Hampshire to come in 4 or 5 I spent $3000000 to come in 1st. Big difference in capability! _E_ New York we will make America great again! __HTTP__ _E_ Via @clarechampion by @DanDanaherNews: "Wind Farm Proposal Near @Trump_Ireland Rejected" __HTTP__ _E_ South Carolina and the audience were GREAT THANKS! _E_ The World as we know it is falling apart. Much of the blame can be attributed to the fact that the United States is no longer respected! _E_ The @Broncos had a truly bad day my advice is to go home forget about it and come back tough next year. _E_ "You want to compete and you want to compete at the highest level." @boonepickens _E_ Republicans must be careful with immigration—don't give our country away. _E_ Unemployment rate only dropped because more people are out of labor force & have stopped looking for work.Not a real recovery phony numbers _E_ I'm going to do what @MittRomney was totally unable to do WIN! _E_ Can you believe that Sony chief Amy Pascal wants to meet with Al Sharpton to seek forgiveness for her racial slurs. Al is laughing at her! _E_ .@serenawilliams had a flawless @usopen quarterfinal win last night. She's a great player and a wonderful person. _E_ "But if someone has a gun and is trying to kill you... it would be reasonable to shoot back with your own gun." @DalaiLama _E_ .@BrandenRoderick did a great job on All Star Celebrity @ApprenticeNBC. Raised a lot of money for charity while looking great. 
_E_ Will be covering President Obama's speech at 9.00 on Twitter you are all so lucky! _E_ Great shot by @KingJames yesterday. Lebron is a tough competitor who delivers under pressure. _E_ We will confront ANY challenge no matter how strong the winds or high the water. I'm proud to stand with Presidents for #OneAmericaAppeal. _E_ I had 15000 people in Phoenix but @politico said the rooms capacity is just over 2000. But said Bernie Sanders had 11000 in same room. _E_ My prayers and condolences to the victims and families of the terrible tragedy in Nice France. We are with you in every way! _E_ Entrepreneurs: Having an ego and acknowledging it is a healthy choice. There's nothing wrong with bringing your talents to the surface. _E_ REPEAL AND REPLACE!!! #ObamaCareInThreeWords _E_ HAPPY BIRTHDAY to the United States Air Force!! __HTTP__ _E_ I sure hope the sexting pervert Anthony Weiner runs for mayor. Will be great fun watching him both lose and be humiliated. _E_ "If it's worth doing it's worth fighting for. You'll have lots of people & obstacles in your way. Fight to get beyond them."–Midas Touch _E_ Today I was honored to be joined by Republicans and Democrats from both the House and Senate as well as members of my Cabinet to discuss the urgent need to rebuild and restore America's depleted infrastructure. __HTTP__ __HTTP__ _E_ I am in Miami at Trump National Doral. Just gave out contract to build a new ballroom and luxury suites. Blue Monster complete opens Dec 14. _E_ A drug free A Rod is just an average baseball player.@Yankees will soon move him down in the batting order & should renegotiate his contract _E_ Massive record setting snowstorm and freezing temperatures in U.S. Smart that GLOBAL WARMING hoaxsters changed name to CLIMATE CHANGE! $$$$ _E_ House Democrats want a SHUTDOWN for the holidays in order to distract from the very popular just passed Tax Cuts. House Republicans don't let this happen. Pass the C.R. TODAY and keep our Government OPEN! 
_E_ Obama just appointed an Ebola Czar with zero experience in the medical area and zero experience in infectious disease control. A TOTAL JOKE! _E_ RT @JoeNBC: Remarkable how cost effective Post says Trump campaign was per vote and stunning how much Jeb spent per vote. __HTTP__ _E_ No matter how good the replacement refs do they will be soundly criticized they can't win! _E_ "I pride myself on being obstinate stubborn & tough. I think those are important qualities found in successful people." – Think Big _E_ RT @WhiteHouse: Dr. King's dream is our dream. It is the American Dream. It's the promise stitched into the fabric of our Nation etched i... _E_ Just spoke to @JohnKasich to express condolences and prayers to all for the horrible shooting of two great police officers from @WestervillePD. This is a true tragedy! _E_ With proper thinking and leadership we can have a much better plan than Obamacare something that works for the people and costs much less _E_ Publicity seeking Lindsey Graham falsely stated that I said there is moral equivalency between the KKK neo Nazis & white supremacists...... _E_ .@JoseCanseco who I got to know very well during #CelebApprentice can't carry @SHAQ's jock. _E_ As a former host of Saturday Night Live I look forward to attending tonight! _E_ "Go for the jugular so that people watching will not want to mess with you." – Think Big _E_ RT @TuckerCarlson: .@RichardGrenell : @realDonaldTrump told Tillerson he had the full support of the U.S. Gov't to bring #OttoWarmbier home... _E_ A strong America creates opportunity and growth. We just need to change Washington. Let's Make America Great Again! __HTTP__ _E_ Do you ever notice that @CNN gives me very little proper representation on my policies. Just watched nobody knew anything about my foreign P _E_ The dishonest media does not report that any money spent on building the Great Wall (for sake of speed) will be paid back by Mexico later! 
_E_ RT @TeamTrump: RT if you agree @realDonaldTrump WON the #Debate BIG LEAGUE! #MAGA __HTTP__ _E_ Now we have a once in a lifetime opportunity to RESTORE AMERICAN PROSPERITY – and RECLAIM AMERICA'S DESTINY.But in order to achieve this bright and glowing future the SENATE MUST PASS TAX CUTS – and bring Main Street roaring back to life! __HTTP__ __HTTP__ _E_ A feature on the progress of the course @ #Trump Int'l #Golf Club will feature on @CNNLivingGolf Thurs 8 May 2014 @ 0930 & 1630 GMT #DAMAC _E_ .@hardball_chris Did you forget about Bill Ayers & so many others? You should apologize to all the people you offended yesterday. _E_ So much interest in my visit to Scotland! I greatly look forward to attending the opening event @TrumpTurnberry taking place on June 24th. _E_ Great job on @Greta @DonaldJTrumpJr. Nobody could have done it better! _E_ RT @GOP: .@IvankaTrump: This administration is committed to keeping working families at the forefront of our agenda. __HTTP__ _E_ Via @CNNMoney by @AaronSmithCNN: The Donald wins. Trump name coming off casino __HTTP__ _E_ If the GOP Establishment really wants to defeat @BarackObama then they should read #TimeToGetTough. _E_ Sen.Richard Blumenthal who never fought in Vietnam when he said for years he had (major lie)now misrepresents what Judge Gorsuch told him? _E_ They must be kidding can this be happening #Oscars _E_ Will be interviewed on @Morning_Joe at 7:00 A.M. So much to talk about! _E_ Give the public a break The FAKE NEWS media is trying to say that large scale immigration in Sweden is working out just beautifully. NOT! _E_ The only global warming that people should be concerned with is the global warming caused by nuclear weapons because of our weak U.S. leader _E_ I will be making a big surprise announcement to the massive crowd assembled in Huntsville/Madison Alabama! Landing now! #Trump2016 _E_ Take a look at __HTTP__ and __HTTP__ to see these beautiful hotels. _E_ You're just not getting there @DanaPerino. 
Sometimes things just don't work out but don't worry no problem! _E_ I will be interviewed on @foxandfriends at 7:30. Things are looking good had a great Easter look forward to spending the week in Wisconsin! _E_ Then on June 25th back to the USA to MAKE AMERICA GREAT AGAIN! _E_ In the very least Congress must defund Obama's unconstitutional amnesty order. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Looking forward to promoting a pro growth & positive message at this Saturday's @Citizens_United @AFPhq Freedom Summit in Manchester. _E_ Small crowds at @RedState today in Atlanta. People were very angry at EWErickson a major sleaze and buffoon who has saved me time and money _E_ I will be interviewed on Face The Nation @CBSNews at 10:30 A.M. Should be interesting ENJOY! _E_ Focus on your goals not your problems. And remember problems are a mind exercise so enjoy the challenge. _E_ Rexnord of Indiana is moving to Mexico and rather viciously firing all of its 300 workers. This is happening all over our country. No more! _E_ .@TrumpChicago is Chicago's sole destination showcasing a Five Star @Forbes rating for both hotel & restaurant __HTTP__ _E_ Fitch has downgraded our credit outlook to negative. Why? @BarackObama's failure to lead with the Super Committee. __HTTP__ _E_ Thank you Colorado! An honor to win @NBC @9News #GOPDebate Poll. __HTTP__ _E_ .@DineshDSouza's '2016: Obama's America' is expanding to over 1000 theaters this weekend. Will be highest grossing documentary in 2012. !! _E_ Busy day—working on buying a major property—and creating lots of jobs. _E_ I agree with Marco Rubio that Ted Cruz is a liar! _E_ Great meetings will take place today at Trump Tower concerning the formation of the people who will run our government for the next 8 years. _E_ Glad to see that Jamie Dimon passed yesterday's shareholder vote. The JP Morgan stock holders understand that a good CEO is worth keeping. 
_E_ Hillary Clinton is taking the day off again she needs the rest. Sleep well Hillary see you at the debate! _E_ .@CNBC is pushing the @GOP around by asking for extra time (and no criteria) in order to sell more commercials. _E_ Thank you Pennsylvania! #Trump2016 __HTTP__ _E_ .@Macy's is a big contributor to @PPFA . Anybody against Planned Parenthood should boycott racial profiling Macy's. _E_ Sleepy Chuck Todd of NBC falls far short of the late great Tim Russert. _E_ Former President Jimmy Carter is so happy that he is no longer considered the worst President in the history of the United States! _E_ THe Westminster Dog Show asked if I'd be interested in meeting Hickory a Scottish Deerhound who won Best in Show. She came to visit today! _E_ President Obama should ask the DNC about how they rigged the election against Bernie. _E_ Richard Mourdock a very good man running for the Senate in Indiana. Hopefully he will win! @richardmourdock _E_ Ted Cruz is a nervous wreck. He is making reckless charges not caring for the truth! His poll #'s are way down! _E_ Take the time to be thorough in whatever you undertake. Remain open to new ideas. Remain fluid not fixed in your expectations. _E_ I'll be on @foxandfriends on Monday at 7:30 AM tune in! _E_ We pay for Obama's travel so he can fundraise millions so Democrats can run on lies. Then we pay for his golf. _E_ When will @BarackObama release his transcripts? What is he hiding? _E_ I will win the election against Crooked Hillary despite the people in the Republican Party that are currently and selfishly opposed to me! _E_ Congratulations to #TeamUSA🏆on your great @PresidentsCup victory! __HTTP__ _E_ For what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_ #trumpvlog My thoughts on Afghanistan @RickSantorum and why I fired two people on this week's #CelebApprentice... 
__HTTP__ _E_ Pocahontas just stated that the Democrats lead by the legendary Crooked Hillary Clinton rigged the Primaries! Lets go FBI & Justice Dept. _E_ '16 Fake News Stories Reporters Have Run Since Trump Won' __HTTP__ _E_ Republicans have a last chance to do the right thing on Repeal & Replace after years of talking & campaigning on it. _E_ A photo delivered yesterday that will be displayed in the upper/lower press hall. Thank you Abbas! __HTTP__ _E_ Reverend Wright must have great hatred for Obama and the manner in which he was shunted aside. _E_ Right now 4000 U.S. troops are stupidly heading to West Africa to help fight Ebola.No help from China Russia or wealthy African oil nations _E_ First segment of my @seanhannity @FoxNews interview discussing @GOP are terrible negotiators & lost all their cards __HTTP__ _E_ No deal was made last night on DACA. Massive border security would have to be agreed to in exchange for consent. Would be subject to vote. _E_ ....likewise billions of dollars gets brought into Mexico through the border. We get the killers drugs & crime they get the money! _E_ I am enjoying my travels across Europe but home is where the heart is. Looking forward to coming back to the family in New York very soon. _E_ Obama is a terrible negotiator. He bails out Chrysler and now Chrysler wants to send all Jeep manufacturing to China and will! _E_ GDP was revised upward to 3.1 for last quarter. Many people thought it would be years before that happened. We have just begun! _E_ My condolences and prayers to the victims of the terrorist attack in Paris. _E_ Great article a must read by Peter Ferrara at @Forbes about The Biggest Government Spender in World History __HTTP__ _E_ Everyone should go see @HatingBreitbart. Great documentary showcasing @AndrewBreitbart's legacy. _E_ Everybody is asking about my announcement this Wednesday concerning Barack Obama just wait and see! 
_E_ If @TedCruz doesn't clean up his act stop cheating & doing negative ads I have standing to sue him for not being a natural born citizen. _E_ .@JustinRose99 Great playing today in the Scottish Open. I see our practice facility is helping—use it a lot! _E_ "Our side needs Donald Trump." @AnnCoulter on @seanhannity's show last night. Thanks Ann. _E_ Just leaving Akron Ohio after a packed rally. Amazing people! Going now to Texas. _E_ Wow just came out on secret tape that Crooked Hillary wants to take in as many Syrians as possible. We cannot let this happen ISIS! _E_ Lyin' Ted Cruz who can never beat Hillary Clinton and has NO path to victory has chosen a V.P.candidate who failed badly in her own effort _E_ .@PennJillette and @StephenBaldwin7's arguments are making the edit room look like the boardroom. #CelebApprentice _E_ "Is business success a natural talent? I think it's a combination of aptitude work and luck." – Think Like a Champion _E_ If I were President I would push for proper vaccinations but would not allow one time massive shots that a small child cannot take AUTISM. _E_ So true Ivanka! __HTTP__ _E_ Pretty audacious for Obama to call @MittRomney a BSer when he has lied about so much we don't have room to write. _E_ A great day in Wisconsin many stops many great people! Melania is joining me on Monday. Big crowds. MAKE AMERICA GREAT AGAIN! _E_ Thank you for your continued support!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ #trumpvlog Same last name same bad ratings @lawrence and @rosie..... __HTTP__ _E_ As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_ My official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_ Looks like we will have a pervert running for mayor after all just what New York City needs and he will revert back to form always do! 
_E_ In '08 @BarackObama hit Bush for secrecy __HTTP__ When will Obama release all his sealed college records?! _E_ .@AlexSalmond suffered a huge defeat by the people of Blackdog. Communities all over Scotland are fighting this loser. _E_ Goofy Elizabeth Warren sometimes referred to as Pocahontas pretended to be a Native American in order to advance her career. Very racist! _E_ I am deeply disturbed by what I have read in the case of @TrayvonMartin. I support a full investigation and justice. _E_ More great news as a result of historical Tax Cuts and Reform: Fiat Chrysler announces plan to invest more than $1 BILLION in Michigan plant relocating their heavy truck production from Mexico to Michigan adding 2500 new jobs and paying $2000 bonus to U.S. employees! __HTTP__ _E_ Will be interviewed by @SeanHannity on @FoxNews tonight at 10pm from Pennsylvania. Enjoy! #Trump2016 __HTTP__ _E_ Our new allies in Egypt the Muslim Brotherhood have close relations with Iran __HTTP__ We never should have abandoned Mubarak. _E_ The U.S. has a 60 billion dollar trade deficit with Mexico. It has been a one sided deal from the beginning of NAFTA with massive numbers... _E_ The acclaimed @TrumpChicago soars 92 stories high. You're either staying in @TrumpChicago or in its shadow. __HTTP__ _E_ Increasing America's debt weakens us domestically and internationally. US Senator @BarackObama 2007 _E_ Prediction: Rand Paul has been driven out of the race by my statements about him he will announce soon. 1%! _E_ The Democrats are delaying my cabinet picks for purely political reasons. They have nothing going but to obstruct. Now have an Obama A.G. _E_ .@HuffingtonPost actually gave me a positive story yesterday! _E_ Remember our brave men & women who have fallen protecting our country this Memorial Day! _E_ Law enforcement & military did a spectacular job in Hamburg. Everybody felt totally safe despite the anarchists. @PolizeiHamburg #G20Summit _E_ 'What I Like About Trump ... 
and Why You Need to Vote for Him' __HTTP__ _E_ RT @EricTrump: __HTTP__ _E_ If ObamaCare is such a wonderful law then why does Obama summarily change the law before an election? _E_ Social media has changed the news & communication landscape for good. Everything must be up to date by the second instead of the hour or day _E_ Mechanical dog is going to be trending tonight. #MechanicalDog #CelebApprentice _E_ Crooked Hillary's V.P. pick said this morning that I was not aware that Russia took over Crimea. A total lie and taken over during O term! _E_ Entrepreneurs: Seek opportunity and see opportunity as a perk. You never know what will evolve. Keep an open mind! _E_ Thank you Roanoke Virginia this a MOVEMENT join us today!Sign up: __HTTP__ __HTTP__ _E_ Wow I have had so many calls from high ranking people laughing at the stupidity of the failing @nytimes piece. Massive front page for that! _E_ Great listening session with CEO's of the Retail Industry Leaders Association this morning! __HTTP__ _E_ Trump National Doral will have big crowds this weekend for the WGC. THE BLUE MONSTER IS READY FOR THE WORLD'S TOP FIFTY PLAYERS! _E_ Dems have been complaining for months & months about Dir. Comey. Now that he has been fired they PRETEND to be aggrieved. Phony hypocrites! _E_ When a complex website is broken the best thing to do is blow it up and start all over again then sue the culprits and use the proper team! _E_ The Letterman show really turned things around people finally understand my $5 million dollar offer to charity.... __HTTP__ _E_ Hillary says this election is about judgment. She's right. Her judgement has killed thousands unleashed ISIS and wrecked the economy. _E_ Bernie caved! __HTTP__ _E_ I have proven to be far more correct about terrorism than anybody and it's not even close. Hopefully AZ and UT will be voting for me today! 
_E_ Melania and I look forward to being with President Xi & Madame Peng Liyuan in China in two weeks for what will hopefully be a historic trip! __HTTP__ _E_ Tracking 149 polls from 29 pollsters nationwide/HuffPost Pollster #GOP __HTTP__ _E_ .@MattGinellaGC It's true Matt the NEW Blue Monster is better than Pinehurst so is Bedminster. Turnberry & Trump Aberdeen blow it away! _E_ There can never be a sharp economic recovery until @BarackObama is out of the White House. _E_ Today proves what I have always known that @Reince Priebus is the tough one and the smart one not Debbie Wasserman Shultz (@DWStweets.) _E_ I LOVE NEW YORK! #NewYorkValues __HTTP__ _E_ People do not assume this but more than anything else I like helping people. Be at Trump Tower at 11 AM today. _E_ It's sad that the WH is punishing children from across the country by closing all tours. Doesn't have to be. WH should take my offer. _E_ My @6abc int. with @Jim_Gardner on Atlantic City Philadelphia's real estate market & 2014 2016 elections __HTTP__ _E_ Cruz lies are almost as bad as Jeb's. These politicians will do anything to stay at the trough! _E_ The ties shirts and suits at Macy's are doing fantastically well check out the new designs and low prices nothing better! _E_ Tonight on @ExtraTV I'm talking #CelebApprentice. Tune in! _E_ Receiving the Algemeiner Liberty Award a great honor. __HTTP__ _E_ How could Michael Forbes get Scot of the Year when he lost—badly—to me & Andy Murray a true Scot who won the U.S. Open & Olympic gold? _E_ Just like Jonathan Gruber viciously lied & called Americans "stupid" on ObamaCare many consultants are doing the same on Global Warming. _E_ Wow even lowly Rand Paul has just past @JebBush in the new @CNN Poll. Jeb is at 3% I'm at 39%. Stop throwing your money down the drain! _E_ I will be on On The Record @gretawire tonight at 7 PM _E_ Unemployment is now 7.9%. Four years and $6.5T later that is really bad! _E_ Make sure you take some time to enjoy the weekend. 
Important for your mind and will help you be productive next week. _E_ New Reuters poll! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Record of Health: __HTTP__ #Trump2016 _E_ Will be at the Women's U.S. Open today! _E_ Will be doing Fox & Friends in a few minutes hope you enjoy! _E_ Come on MLB do the right thing! Let @PeteRose_14 into the Hall. No drugs—just hard work and talent! _E_ "Good communicators control space." – Roger Ailes 'You Are The Message' @FoxNews _E_ An honor to welcome PM of Australia @TurnbullMalcolm to America & join him in marking the 75th Anniversary of the... __HTTP__ _E_ Innovation distinguishes between a leader and a follower. Steve Jobs _E_ Where is the progress in the state of New York over the last three years? There is none only backwards. _E_ The government is borrowing 46 cents on every dollar it spends __HTTP__ Dangerous for us but great news for China. _E_ What a time we all had in Iowa yesterday massive overflow crowd. Love them! _E_ Heading now to Pella Iowa. Big crowd! Remember Trump is a big buyer of Pella windows. See you soon! _E_ Anderson Silva just got knocked out by new champion Chris Weidman! Congrats to Chris. _E_ .@MarkSteynOnline Thank you and great job on @seanhannity tonight! _E_ The big loss yesterday for Israel in the United Nations will make it much harder to negotiate peace.Too bad but we will get it done anyway! _E_ #GOPDebate #Trump2016 __HTTP__ _E_ So much for creating American jobs @BarackObama gave $529 Million to a Green car company so they can be manufactured in Finland. _E_ President Obama if it is important to you I will substantially increase the $5M offer! _E_ She's baaack! @Rosie needs me to salvage her dying career. But it won't help she's got no talent & no persona. Too many tv cancellations! _E_ Wow what a nice honor! __HTTP__ _E_ U.S. Stock Market up almost 20% since Election! _E_ Looking forward to being hosted by @NickLangworthy's Erie County Lincoln Leadership Reception tonight. 
Record crowd! Can't wait. _E_ An aerial shot of Jacksonville crowd yesterday! I may as well show you because the media won't. #Trump2016 __HTTP__ _E_ The very outdated filibuster rule must go. Budget reconciliation is killing R's in Senate. Mitch M go to 51 Votes NOW and WIN. IT'S TIME! _E_ Amazing that Derek Jeter played with an injury throughout most of last night's @yankees game and did so well. _E_ How do you spend over $635M on websites and they don't work? _E_ Why did @MittRomney give his tax returns without demanding that Obama release his college records & applications in return? _E_ Does anyone believe that @BarackObama did not fully write or review the 1991 publisher booklet? _E_ Attention all hackers: You are hacking everything else so please hack Obama's college records (destroyed?) and check place of birth _E_ The spotlight has finally been put on the low life leakers! They will be caught! _E_ .@TrumpToronto was just voted the #1 hotel in Canada in Conde Nast Traveler's prestigious Reader's Choice Awards __HTTP__ _E_ Money was never a big motivation for me except as a way to keep score. The real excitement is playing the game. The Art of the Deal _E_ RT @DanScavino: Join @realDonaldTrump LIVE in Denver Colorado via his #Facebook page we are here!!#MakeAmericaGreatAgain __HTTP__ _E_ Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! _E_ 25 days to go until fiscal cliff (bad name)—it is only a fiscal curb! Debt ceiling is real fiscal cliff...and that will be interesting! _E_ The United States is experiencing the coldest weather in decades with vast amounts of snow blanketing many states.Pendulum has swung to cool _E_ ..North Korea is a rogue nation which has become a great threat and embarrassment to China which is trying to help but with little success. _E_ Thank you Michigan! #Trump2016 _E_ Crooked Hillary has never created a job in her life. 
We will create 25 million jobs. Think she can do that? Not a c... __HTTP__ _E_ We're missing a lot of information on autism. Support @AutismSpeaks' project by visiting mss.ng #MSSNG _E_ It's @BarackObama who wants to raise all our taxes who applauds China for cutting their taxes! (cont) __HTTP__ _E_ When will all the haters and fools out there realize that having a good relationship with Russia is a good thing not a bad thing. There always playing politics bad for our country. I want to solve North Korea Syria Ukraine terrorism and Russia can greatly help! _E_ Debate was somewhat hard to watch last night. Viewership will be way down. _E_ Obamacare is a disaster! Time to repeal & replace! #ObamacareFail __HTTP__ _E_ US tourists threaten to boycott Scotland over windfarms' __HTTP__ Write to Alex Salmond: firstminister@scotland.gsi.gov.uk _E_ RT @DonnaWR8: @realDonaldTrump You can boycott our anthem WE CAN BOYCOTT YOU! #NFL #MAGA __HTTP__ _E_ Who do you want negotiating for us? #MakeAmericaGreatAgain __HTTP__ _E_ I will be interviewed by @BretBaier @SpecialReport at 6pm ET tonight @FoxNews _E_ Congratulations to @JamesOKeefeIII on exposing more Democrat voter fraud. @DNC was caught red handed telling people to vote twice. _E_ Little Adam Schiff who is desperate to run for higher office is one of the biggest liars and leakers in Washington right up there with Comey Warner Brennan and Clapper! Adam leaves closed committee hearings to illegally leak confidential information. Must be stopped! _E_ My speech at @AmSpec Bartlet Gala Dinner where I received @boonepickens Entrepreneur Award __HTTP__ _E_ Watch this video for a look at our great course in Los Angeles Rancho Palos Verdes __HTTP__ @TrumpGolfLA _E_ .@MarkHalperin's and John Heilemann's book Double Down is an excellent read on the just passed election. Great book congrats! @jheil _E_ Congratulations Secretary Mattis! __HTTP__ _E_ Re my hair Should I change it? What do you think? 
_E_ .@VattenfallGroup doesn't have the finances or financial statement to build the hated windfarm in Aberdeen. _E_ It was a great honor to have President Xi Jinping and Madame Peng Liyuan of China as our guests in the United States. Tremendous... _E_ "Leverage: don't make deals without it." – The Art of The Deal _E_ The North Korean regime has pursued its nuclear & ballistic missile programs in defiance of every assurance agreement & commmitment it has made to the U.S. and its allies. It's broken all of those commitments... __HTTP__ _E_ It is time to send someone from the outside to fix DC from the inside. Let's Make America Great Again! __HTTP__ _E_ 'Everything in Dubai': Learn from Emirate's rebound says @DonaldTrumpJr __HTTP__ via @Emirates247 by @Parag1301 _E_ Money was never a big motivation for me except as a way to keep score. The real excitement is playing the game. #TheArtofTheDeal _E_ LIVE on #Periscope: Watch major press conference live from @TrumpTowerNY now! #MakeAmericaGreatAgain __HTTP__ _E_ New PPP Poll just out Trump up big Cruz Rubio and Bush down. The debate results even with a stacked RNC audience were wonderful! _E_ Alabama people are saying their team has real football & real girlfriends—not good for Notre Dame—but they'll be back! _E_ My new book Midas Touch in stores now.... __HTTP__ #trumpvlog _E_ The president of the pathetic Club For Growth came to my office in N.Y.C. and asked for a ridiculous $1000000 contribution. I said no way! _E_ Via @thehill by @martinmatishak: "Trump: 'We look like we're beggars' in Iran nuclear talks" __HTTP__ _E_ Back in NY from Scotland and fighting for our country to get better. Trump International Golf Links Scotland opened to rave reviews. _E_ Leaving for Jacksonville now. See you there! Miami was great. _E_ There are so many blatant lies coming out of the ADMINISTRATION healthcare spying NSA IRS brutally killed Americans WILL IT EVER END? 
_E_ Jailed USMC Sgt Andrew Tahmooressi should be released immediately. Since when does Mexico care about border security?#BringBackOurMarine _E_ RT @FoxBusiness: #BreakingNews: U.S. employers added 209000 jobs in July unemployment rate down to 4.3% #JobsReport __HTTP__ _E_ Crooked Hillary Clinton mentioned me 22 times in her very long and very boring speech. Many of her statements were lies and fabrications! _E_ I am very surprised that @lancearmstrong gave up. I never thought he was a quitter... _E_ I really enjoyed being in New Hampshire & speaking for Joe McQuaid @deucecrew & the Nackey Loeb School @LoebSchool honoring James Foley. _E_ She is so sad and pathetic that I almost feel sorry for Sec.Sebelius. She has done great harm to many people and must be fired. Incompetent! _E_ Join me at 7:00 P.M. on Tuesday August 22nd in Phoenix Arizona at the Phoenix Convention Center! Tickets at: __HTTP__ __HTTP__ _E_ "Donald Trump Congratulated on @foxandfriends for Receiving the @Algemeiner's 'Liberty Award'" __HTTP__ via @Algemeiner _E_ Watching these politicians trying to get a deal done is truly painful Republicans are in a much stronger position than they think. _E_ In 1960 there were approximately 20000 pages in the Code of Federal Regulations. Today there are over 185000 pages as seen in the Roosevelt Room.Today we CUT THE RED TAPE! It is time to SET FREE OUR DREAMS and MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Univision wants to back out of signed @MissUniverse contract because I exposed the terrible trade deals that the U.S. makes with Mexico. _E_ Thanks @AndreaTantaros for all of your kind words and thoughts. Big progress is being made. Keep up the great work! _E_ It was an honor to have the amazing Root family join me in Iowa. I have been so inspired by their courage & bravery. __HTTP__ _E_ Here is a letter I received yesterday from someone who has had personal experience with our health care situation. 
__HTTP__ _E_ My interview with @gretawire last night on @FoxNews: @BarackObama 'Missed His Opportunity' __HTTP__ _E_ Hillary Clinton only knows how to make a speech when it is a hit on me. No policy and always very short (stamina). Media gives her a pass! _E_ #badratings @Lawrence's show failed at 8pm and is failing(even worse) at 10pm not long for tv..... _E_ Oh really check out innocent @megynkelly discussion on @HowardStern show 5 years ago I am the innocent (pure) one! __HTTP__ _E_ "Winners never quit and quitters never win." Vince Lombardi _E_ RT @seanhannity: Tonight the truth about how despicable the media and the left are in America today. We will name names. 9 est Hannity Fox... _E_ Is this really America? Terrible! __HTTP__ _E_ RT @accesshollywood: @realDonald Trump: 'Celebrity Apprentice' Season 5 is 'Tough Nasty & Smart.' Watch: __HTTP__ _E_ I would absolutely kill Jon Stewart(?) in a debate it would be no contest he's not fast enough or smart enough (only obnoxious enough!). _E_ Hillary Clinton surged the trade deficit with China 40% asSecretary of State costing Americans millions of jobs. _E_ With a world renowned open air lobby w/ ocean views & top restaurants @TrumpWaikiki is Honolulu's premier hotel __HTTP__ _E_ Thanks to everyone for your support on @CNBC's "Top Leaders Icons and Rebels" vv __HTTP__ Thanks for voting Trump! _E_ . @WWE's @WrestleMania XXIX less than 3 weeks away. Looking forward to being inducted into the Hall of Fame! _E_ John Kasich was never asked by me to be V.P. Just arrived in Cleveland will be a great two days! _E_ Via @NYDailyNews by Rich Schapiro: Donald Trump slams Mitt Romney Jeb Bush __HTTP__ _E_ The real story turns out to be SURVEILLANCE and LEAKING! Find the leakers. _E_ #FlashbackFriday Many big movies have filmed in my buildings. Here is @TrumpChicago in #Transformers 3. __HTTP__ _E_ groveling when he totally changed a 16 year old story that he had written in order to make me look bad. 
Just more very dishonest media! _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ USA has the greatest business people in the world but we let political hacks negotiate our deals. We need change! #BigLeagueTruth #Debate _E_ It must have been President Obama that called in what will go down as the DUMBEST PLAY IN THE HISTORY OF FOOTBALL. Same thought process! _E_ New business start ups at the lowest level in 30 years and the EPA is now the Employment Prevention Agency. @bobmcdonnell _E_ I am on @foxandfriends Enjoy! _E_ Thank you Indiana! Was great seeing everyone on Wednesday! I will be back soon! #Trump2016 __HTTP__ __HTTP__ _E_ ALL SAFE IN ORANGE COUNTY NORTH CAROLINA. With you all the way will never forget. Now we have to win. Proud of you all! @NCGOP _E_ Via @DailyCaller by @AlexPappas: "Donald Trump Headed To Iowa Says Ebola Is Further Proof Of Obama's Incompetence" __HTTP__ _E_ Via @washtimes: Donald Trump warns of 'dangerous precedent' in Cyprus bank skimming __HTTP__ _E_ What a great couple... Katherine Webb and AJ McCarron. They are both winners! _E_ Polls close but can you believe I lost large numbers of women voters based on made up events THAT NEVER HAPPENED. Media rigging election! _E_ Thank you for having me this morning @AmericanLegion. I enjoyed my time with everyone! #ALConvention2016 __HTTP__ _E_ I have never been a fan of John Edwards but it is time for the gov't to focus on more important things. @johnedwards _E_ While in politics it is often smart to send out false messages one thing is clear: That Hillary does not want to run against TRUMP. _E_ Joan Rivers @Joan_Rivers was an amazing woman and a great friend. Her energy and talent were boundless. She will be greatly missed. _E_ Not believable that Manti Te'o was in love for one year with a girl he never met she then died. He is either very stupid.... _E_ Thank you Indiana! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_ ObamaCare is a disaster. 
Americans will see record increases in their premiums and inferior care services. _E_ A fantastic day in D.C. Met with President Obama for first time. Really good meeting great chemistry. Melania liked Mrs. O a lot! _E_ Success is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_ RT @foxandfriends: Sec. Mattis: If North Korea fires missile at US it's 'game on' __HTTP__ _E_ "Tone it down? No way! Donald Trump needs to crank up the volume" __HTTP__ via @FoxNews by @toddstarnes _E_ A robust growing economy is how to fix Social Security and Medicare not cuts on Seniors. _E_ So true! __HTTP__ _E_ SC has kept us safe from exec amnesty for now. But Hillary has pledged to expand it taking jobs from Hispanic & African American workers. _E_ Eric Cantor's concession speech was ridiculous acted like nothing had just happened. WE NEED REAL LEADERS! _E_ We should cut off all aid to every country that does not respect our border. Why are we giving them money in the first place? _E_ The ill conceived windfarm that @AlexSalmond is pushing for Aberdeen will lose $50 million a year. Only a fool would build it or want it! _E_ "Keep focusing on doing what you love even if times are tough." – Think Big _E_ Rubio is totally owned by the lobbyists and special interests. A lightweight senator with the worst voting record in Senate. Lazy! _E_ Happy New Year from #MarALago! Thank you to my great family for all of their support. __HTTP__ _E_ Packed with holiday celebrations members & staff are enjoying the first Christmas season at @Trump_Charlotte __HTTP__ _E_ RT @TeamTrump: She put the office of Sec of State up for sale. If she ever got the chance she'd put the Oval Office up for sale too. #Fo... _E_ I have asked the reigning Miss Universe and Miss USA to do the honors. At least I will not have to wash my hair this morning! Enjoy. _E_ The NYPD has been doing a fantastic job protecting NYC. 
I hope Chief Ray Kelly is strongly considering running for mayor. _E_ In today's all new #TrumpVlog I discuss what a great honor it was to be inducted into the WWE Hall of Fame. __HTTP__ _E_ Via @Suntimes' @CSTearlyoften by @FSPIELMAN: "Council sign rules mean Trump name will loom large on river" __HTTP__ _E_ Everybody should boycott the @megynkelly show. Never worth watching. Always a hit on Trump! She is sick & the most overrated person on tv. _E_ .@CNN Will be interviewed by Jake Tapper at 9:00 A.M. Enjoy. _E_ Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_ Thanks for all the great comments on all my recent interviews. Much appreciated. _E_ I will be on@gretawire tonight at 10 P.M. Now I know she will get great ratings! _E_ On my way to Des Moines Iowa will see you soon with @mike_pence. Join us! Tickets: __HTTP__ #ThankYouTour2016 _E_ Sen. @MaxBaucus has announced his retirement. A major proponent of ObamaCare Baucus now says it's a 'huge train wreck.' _E_ Fact – Obama still has not fixed the backend of the ObamaCare website. This could be the greatest internet boondoggle in history. _E_ Will be in Bangor Maine today at 3pm join me! #MAGATickets: __HTTP__ __HTTP__ _E_ George Ross and I have done some great real estate deals together. He's a tough negotiator. #CelebApprentice _E_ Vote for Mar a Lago __HTTP__ _E_ Congratulations to @serenawilliams on her superb @usopen win. She is terrific! _E_ I've never seen anything like it everything he touches turns to gold! So nice a quote by Fred C.Trump about his son Donald (me!). _E_ Michelle Nunn supports Amnesty a weak border & ObamaCare. She is an Obama liberal. Send DC an independent voice. Vote @Perduesenate! _E_ ...to stop drugs they want to take money away from our military which we cannot do." My standard is very simple AMERICA FIRST & MAKE AMERICA GREAT AGAIN! _E_ Thank you to Linda Bean of L.L.Bean for your great support and courage. 
People will support you even more now. Buy L.L.Bean. @LBPerfectMaine _E_ Thank you! An honor to be the first candidate ever endorsed by the @NRA prior to @GOPconvention! #Trump2016 #2A __HTTP__ _E_ George Will was critical of @MittRomney throughout the primary. Maybe it is because his wife was turned down for (cont) __HTTP__ _E_ Brooklyn Nets have the worst uniform ever Boring won't matter if they win( Winning solves all problems (cont) __HTTP__ _E_ Manufacturing is now less than 9% of US GDP. The Rust Belt heart of our country's factory sector has been destroyed by our leaders. _E_ #DrainTheSwamp #PhoenixRally __HTTP__ _E_ Why would Republican candidates want the support of Mitt Romney. He lost an election against Obama that should NEVER have been lost! _E_ Who should star in a reboot of Liar Liar Hillary Clinton or Ted Cruz? Let me know. __HTTP__ _E_ The true question for the @UN... __HTTP__ _E_ .@WSJ is bad at math. The good news is nobody cares what they say in their editorials anymore especially me! _E_ Great to see @Yankees Captain Derek Jeter back on the field. He will have another great season and make NYC proud again. _E_ "The Conference Board said that consumer sentiment was at its highest level in nearly 17 years in November. The Consumer Confidence Index rose from 126.2 in October to 129.5 notching its best reading since December 2000..." __HTTP__ _E_ Via @Reuters_Biz Trump flies into ex Soviet Georgia for tower project project __HTTP__ _E_ I will be interviewed today on Fox News Sunday with Chris Wallace at 10:00 (Eastern) Network. ENJOY! _E_ Our country is not going to have a comeback with any politician. my @SRQRepublicans speech _E_ Thank you to the GREAT NYPD First Responders and all govt officials for having handled the terrible West Side attack so professionally! __HTTP__ _E_ On my way! #Inauguration2017 __HTTP__ _E_ "Playing golf with business associates creates a relaxing atmosphere where everyone has fun... 
_E_ Fact Obama does not read his intelligence briefings nor does he get briefed in person by the CIA or DOD. Too busy I guess! _E_ President Obama said over and over again if you like your plan you can keep your plan PERIOD! This turned out to be a total lie 90 mill. _E_ If history teaches us anything it's that strong nations require strong leaders with clearly defined national (cont) __HTTP__ _E_ With our record debt & trillion $ deficits our $ is now at an all time low against the Chinese Yuan. Time for our gov't to work together. _E_ Can you believe the head of Iran refused to meet with our great President?—Zero respect! _E_ via American Spectator @AmSpec Trump Card by Jeffrey Lord __HTTP__ _E_ Thank you Missouri! #Trump2016 __HTTP__ _E_ Opening for the 2014 season soon the National Register landmark Mar a Lago Club is the crown jewel of Palm Beach __HTTP__ _E_ Use adverse events and monumental challenges to make you stronger. Think Big _E_ Thoughts and prayers to the families of the four great Marines killed today. _E_ Some of you were asking about the All Star line up for Celebrity Apprentice __HTTP__ _E_ Crooked Hillary Clinton says that she got more primary votes than Donald Trump. But I had 17 people to beat—she had one! _E_ Keep stimulating your mind with big ideas. Fill your mind with new information & use this information to spawn new ideas. Think Big _E_ Major Mexican cartel boss El Diego was just arrested with weapons provided to him through Fast and Furious __HTTP__ Media?? _E_ This is a tragedy. The real unemployment rate is 14.8% with over 23.2 million unemployed Americans. We can do much better. _E_ Democrats would do much better as a party if they got together with Republicans on HealthcareTax CutsSecurity. Obstruction doesn't work! _E_ Thank you OHIO! #TrumpPence16 __HTTP__ __HTTP__ _E_ Looking forward to meeting everyone at the North Carolina GOP this Friday where I will be the keynote speaker at the dinner. 
#GOP _E_ This Sunday's LIVE FINALE of @ApprenticeNBC will be tough & nasty. Be sure to watch @pennjillette & @TraceAdkins fight to the finish! _E_ Now @BarackObama is planning to have we the taxpayers pay off mortgages he will spend this country into the ground. __HTTP__ _E_ Trump Offers To Donate $5 Million To Charity If Obama Releases College Transcripts __HTTP__ via @rcpvideo _E_ #badratings @Rosie you will never make it. You are not funny or talented. _E_ Due to popular demand @lisarinna returns to the 13th season of All Star @CelebApprentice. Lisa's fans won't be disappointed! _E_ Obama is planning on attacking Romney on Bain in tomorrow's debate __HTTP__ Mitt should bring up college applications & records _E_ Today at 1:30PM CT I will be addressing @RepLeadConf in New Orleans __HTTP__ Will focus on how to fix our great country. _E_ China just put a tariff on US cars and trucks 22% China is laughing at our inept leaders. @BarackObama _E_ Thank you Mr. & Mrs. @TomBarrackJr for the wonderful and magical evening last night. It will not be forgotten. #Trump2016 _E_ The stock market and US dollar are both plunging today. Welcome to @BarackObama's second term. _E_ 46 stories in the center of downtown New York @TrumpSoHo's 391 spacious rooms each have floor to ceiling windows __HTTP__ _E_ The #FakeNews MSM doesn't report the great economic news since Election Day. #DOW up 16%. #NASDAQ up 19.5%. Drilling & energy sector... _E_ Via @Newsmax_Media by @wandacarruthers: "Trump: GOP on Edge of Winning 'Big' and Forcing Obama to Act" __HTTP__ _E_ We should be focused on clean and beautiful air not expensive and business closing GLOBAL WARMING a total hoax! _E_ Via @WashTimes by @harperbulletin: "Donald Trump Goes to Washington" __HTTP__ _E_ "Christmas waves a magic wand over this world and behold everything is softer and more beautiful." Norman Vincent Peale _E_ "The difference between stupidity and genius is that genius has its limits." 
Albert Einstein _E_ My @foxandfriends interview discussing my friend Whitney Houston @SarahPalinUSA's CPAC speechthe economy and primary __HTTP__ _E_ Happy #FirstRespondersDay to all of our HEROES out there. We are forever grateful to you for your service sacrifice and courage 24/7/365! __HTTP__ _E_ Nancy Reagan the wife of a truly great President was an amazing woman. She will be missed! _E_ Doing my best to disregard the many inflammatory President O statements and roadblocks.Thought it was going to be a smooth transition NOT! _E_ Legendary basketball coach Bobby Knight who has 900+ wins many championships and a gold medal will be introducing... __HTTP__ _E_ While our wonderful president was out playing golf all day the TSA is falling apart just like our government! Airports a total disaster! _E_ A country that does not control or respect its own borders is a country destined for failure. Secure our borders! _E_ If I win the presidency my judicial appointments will do the right thing unlike Bush's appointee John Roberts on ObamaCare. _E_ "After every setback start thinking big as soon as possible." Think Big _E_ My @FoxNews interview with @TeamCavuto discussing my endorsement of @MittRomney and how I came to my decision __HTTP__ _E_ 'Moderate' Repubs plotting against @GOP strategy have short term memories. Tea Party gave them majority in House & primaries aren't fun. _E_ "Each excellent thing once learned serves for a measure of all other knowledge." Philip Sidney _E_ Flashback: Donald Trump would fire A Rod __HTTP__ via @espn 10.17.12 _E_ He @BarackObama claims he does not want higher gas prices. That's not what he said in 2008: __HTTP__ _E_ Alison Grimes will protect the 'sanctity' of her Obama ballot yet admits she voted for Hillary in primary. Hypocrite. Vote @Team_Mitch! _E_ Honestly whether you're for or against ObamaCare the 635 million dollar website fiasco is bad for the U.S. It makes us look totally inept! 
_E_ I said this was happening long ago I will stop this immediately! __HTTP__ _E_ Building a personal brand? Then focus on being great. Focus on being the best at what you do. Excellent article: __HTTP__ _E_ A message to the great people of New Hampshire on this important day! #VoteTrumpNH Video: __HTTP__ __HTTP__ _E_ Most people do not know what Presient Obama is going to do to save his legacy. I do! He's got to get back to basics.Forget Syria FIX THE USA _E_ Global warming has been proven to be a canard repeatedly over and over again. __HTTP__ The left needs a dose of reality. _E_ Entrepreneurs: What is the standard for which you want to be known? Identify that standard & then establish it. Simple but not easy. Focus! _E_ My performance from last week on David Letterman @Late_Show will be re aired tonight at 11:35 PM on CBS. _E_ Many generals and military leaders are now saying I told you so! They say this will have big impact on military strength & national sec. _E_ Focus on your goals not on fixed patterns. Do what's necessary and what's unnecessary will be made clear. _E_ Two of the best ever episodes of Celebrity Apprentice tonight at 8. Totally vicious and crazy! I will live tweet. _E_ Thank you! __HTTP__ _E_ It wasn't Matt Lauer that hurt Hillary last night. It was her very dumb answer about emails & the veteran who said she should be in jail. _E_ Many people would like to see @Nigel_Farage represent Great Britain as their Ambassador to the United States. He would do a great job! _E_ RT @DiamondandSilk: The Media Says: The President Should Stop Tweeting about Russia. Well Why Don't the Media Take Their Own Advice & S... _E_ RT @statedeptspox: #GES2017 highlights the important role of women #entrepreneurs & demonstrates the importance of #innovation & partnershi... _E_ Take a sneak peek into one of Trump Park Avenue's most exclusive residences on the market __HTTP__ _E_ Join me live in Wilmington Ohio! 
__HTTP__ _E_ Why the hell did we help the Libyan rebels in the first place. That is the real scandal. _E_ Ralph Norman who is running for Congress in SC's 5th District will be a fantastic help to me in cutting taxes and.... _E_ A gallon of gas has more than doubled while @BarackObama has been POTUS and he still won't approve Keystone. _E_ Great! Last night @CelebApprentice winner @johnrich & alumni @RealMeatLoaf packed OH stadium rallying w/ @MittRomney __HTTP__ _E_ Interesting article by @MattTowery @townhallcom:"It Is Time to Use 'The Trump Card'" __HTTP__ Thanks Matt for the nice mention _E_ Oh no they are worried that they didn't read the Boston killer his rights and he may have a good legal argument. 12 year case to finish? t _E_ I was just given a great tour of Moscow fantastic hard working people. CITY IS REALLY ENERGIZED! The World will be watching tonight! _E_ Hopefully the Republican Party can come together and have a big WIN in November paving the way for many great Supreme Court Justices! _E_ .@Matt_Berry87 Piers did a great job the interview was very important. _E_ Together we are going to MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ _E_ Opening in 2016 Trump Tower Punta del Este will bring our signature luxury living to the sands of Playa Brava __HTTP__ _E_ Some really dumb blogger for failing @VanityFair a magazine whose ads are down almost 18% this year said I wear a hairpiece I DON'T! _E_ I never did give anybody hell. I just told the truth and they thought it was hell. Harry S. Truman _E_ The Democrats dropped all references to God from their platform. Not good! _E_ James Gandolfini was a remarkable talent. He was also a decent man. We will all miss him. _E_ Re Negotiation: Know exactly what you want & focus on that. View conflict as an opportunity this will expand your mind and your horizons. _E_ .@AGSchneiderman Why is Douglas Durst allowed to use the Freedom Tower to get out of a lease with Conde Nast? 
_E_ Bad performance by Crooked Hillary Clinton! Reading poorly from the telepromter! She doesn't even look presidential! _E_ .@thehill discussing my @foxandfriends interview: Trump: 'Clamor for @MittRomney's tax returns has died down' __HTTP__ _E_ Great night in WI. I'm going to fight for every person in this country who believes government should serve the PEO... __HTTP__ _E_ Chris @hardball_chris Matthews ratings are at new historic lows. He is single handedly destroying the entire @msnbc channel. _E_ RT @JoeNBC: Pope Francis tear down that wall! #vaticanwalls __HTTP__ _E_ .@HeyTammyBruce Thank you for your nice words on Fox today. They never use my full statements on nuclear which you would agree with! _E_ Of course I don't think Jimmy Carter is dead saw him today on T.V. Just being sarcastic but never thought he was alive as President stiff! _E_ For those that don't think a wall (fence) works why don't they suggest taking down the fence around the White House? Foolish people! _E_ .@gretawire Greta—you're wrong Kirsten Powers is a dummy—wasn't she Anthony Weiner's girlfriend? _E_ Vladimir Putin said today about Hillary and Dems: In my opinion it is humiliating. One must be able to lose with dignity. So true! _E_ My visit to Japan and friendship with PM Abe will yield many benefits for our great Country. Massive military & energy orders happening+++! _E_ Thank you Michigan! #Trump2016 __HTTP__ _E_ However beautiful the strategy you should occasionally look at the results. Winston Churchill _E_ THANK YOU St. Augustine Florida! Get out and VOTE! Join the MOVEMENT and lets #DrainTheSwamp! Off to Tampa now!... __HTTP__ _E_ Congratulations to @TrumpWaikiki for being selected as Best of +VIP Access 2014 by @Expedia! _E_ Watch my @oreillyfactor appearance from this week discussing nuclear negotiations with Iran __HTTP__ _E_ While Obama is denying it he did receive intelligence about the attacks 3 days before __HTTP__ Too busy campaigning? 
_E_ Weekly AddressJoin me here: __HTTP__ __HTTP__ _E_ What a coincidence?! @BarackObama's campaign logo uses the same font as Cuban communist propaganda posters. __HTTP__ _E_ The world was gloomy before I won there was no hope. Now the market is up nearly 10% and Christmas spending is over a trillion dollars! _E_ Under @BarackObama 1 out of every 7 Americans is on food stamps. _E_ 36 hrs Central Park as seen in @nytimes including a stop @TrumpNewYork for a bite in @Nougatine_NYC. Full article __HTTP__ _E_ With our amazing All Star cast @Joan_Rivers @johnrich @ArsenioOFFICIAL & @piersmorgan are also returning as boardroom advisors. _E_ It is simply immoral for the government to encourage able bodied Americans to think that a life on welfare of (cont) __HTTP__ _E_ "We would accomplish many more things if we did not think of them as impossible." Vince Lombardi _E_ Republicans have once again capitulated to Obama. This time on the Iran nuclear treaty. When will it end? _E_ Thank you Novi Michigan! Get out and VOTE #TrumpPence16 on 11/8. Together WE WILL MAKE AMERICA GREAT AGAIN!... __HTTP__ _E_ I was on the TODAY Show this morning and then visited Regis & Kelly. The Celebrity Apprentice starts this Sunday night—don't miss it! _E_ Can you imagine if the election results were the opposite and WE tried to play the Russia/CIA card. It would be called conspiracy theory! _E_ Goofy Elizabeth Warren is now using the woman's card like her friend crooked Hillary. See her dumb tweet "when a woman stands up to you..." _E_ A Rod is a less than average baseball player now that he is unable to use drugs. A Rod misrepresented to th... (cont) __HTTP__ _E_ Departing @JBA_NAFW for St. Charles Missouri to help push our plan for HISTORIC TAX CUTS across the finish line.A successful vote in the Senate this week will bring us one giant step closer to delivering an incredible victory for the American people! __HTTP__ __HTTP__ _E_ Today is #TrumpTuesday on @SquawkCNBC 7:30 AM. Tune in! 
_E_ .@HBO should fire @BillMaher and bring back @DennisDMZ someone that is actually funny. _E_ As election looms some bad news for Clinton Democrats: __HTTP__ _E_ These last 4 years have not had a single quarter over 4% GDP. Obama has overseen the weakest economic recovery in American history. _E_ Who is the dumbest man on TV? @Lawrence of MSNBC... __HTTP__ _E_ The Dollar is at an all time WWII low against the Yen. The Fed's recklessness is going to lead to record inflation. _E_ Via @HamptonsMag: @IvankaTrump Talks Hamptons Lifestyle with Emmy Rossum __HTTP__ _E_ It's true... Dennis is really into this very animated. I have never seen him this way before. _E_ VOTE TODAY! Go to __HTTP__ to find your polling location. We are going to Make America Great Again!... __HTTP__ _E_ Many people booed the players who kneeled yesterday (which was a small percentage of total). These are fans who demand respect for our Flag! _E_ The liberal media is focusing on @MittRomney's bank records. How about reviewing @BarackObama's illegal land deal contracts with Tony Rezko? _E_ A great honor to receive polling numbers like these. Record setting African American (25%) & Hispanic numbers (31%). __HTTP__ _E_ The problem with the U.S. is that our leadership has no knowledge or ability to negotiate or see into the future. Every nation beats us! _E_ I discuss yesterday's tragedy at the Boston Marathon in today's video blog. __HTTP__ _E_ The new reality – China's demand for oil now controls the market __HTTP__ And OPEC gets away with ripping us off at $105! _E_ If Goofy Elizabeth Warren a very weak Senator didn't lie about her heritage (being Native American) she would be nothing today. Pick her H _E_ The fact is you're not going to see real growth or create real jobs until we get these exorbitant energy costs (cont) __HTTP__ _E_ RT @NRA: But there IS something we will do on #ElectionDay: Show up and vote for the #2A! 
#DefendtheSecond #NeverHillary _E_ Good luck and best wishes to my dear friend the wonderful and very talented Joan Rivers! Winner of Celebrity Apprentice amazing woman. _E_ Kellyanne Conway went to @MeetThePress this morning for an interview with @chucktodd. Dishonest media cut out 9 of her 10 minutes. Terrible! _E_ Preliminary talks have begun for next season's #CelebrityApprentice. As usual we will have another great season. _E_ Why does @BarackObama continue to defend radical Islam? He is calling the Ft. Hood massacre workplace violence. _E_ Also appearing on the Miss USA Pageant will be Country Superstar Trace Adkins and Pop Rock Sensation Boys Like Girls... _E_ I will be in Washington D.C. tomorrow to receive the 2014 Joseph Wharton Award at the Wharton Club of D.C.—a great honor! @Wharton _E_ ....John McCain has failed miserably to fix the situation and to make it possible for Veterans to successfully manage their lives. _E_ The gorgeous contestants of Trump Miss Universe are so excited to be simulcast on both @nbc and @Telemundo. Will be a beautiful show! _E_ Call me old school but I believe in the old warrior's credo that to the victor go the spoils. In other word... (cont) __HTTP__ _E_ Via @CBS19: Trump Winery President Nominated for Award by Wine Enthusiast Magazine __HTTP__ Congrats @EricTrump! _E_ Watch this amazing ad from @autismspeaks and learn the signs... __HTTP__ _E_ N.Y. City is paying FORTY MILLION DOLLARS to five men that many think are guilty as hell. So many facts should have been trial. Politics! _E_ My team of deplorables will be taking over my Twitter account for tonight's #debate#MakeAmericaGreatAgain _E_ It's finally happening Fiat Chrysler just announced plans to invest $1BILLION in Michigan and Ohio plants adding 2000 jobs. This after... _E_ ... People love to hear their names and their stories said out loud." 
– Think Like a Billionaire _E_ My interview with @PaulWTalk on @wjrradio on behalf of @MittRomney discussing why Michigan needs to go for Romney. __HTTP__ _E_ Firing Bret was a tough one for me but Omarosa doesn't seem to mind. _E_ Spoke to Roy Moore of Alabama last night for the first time. Sounds like a really great guy who ran a fantastic race. He will help to #MAGA! _E_ Russia has never tried to use leverage over me. I HAVE NOTHING TO DO WITH RUSSIA NO DEALS NO LOANS NO NOTHING! _E_ The final part of restoring fiscal sanity to America is the most obvious and that's to control Obama style (cont) __HTTP__ _E_ Again don't forget to watch @hannityshow tonight on Fox at 9 o'clock EST. _E_ We crushed the original goal! I will write a $2 MILLION check to our campaign if we hit our end of month goal! __HTTP__ _E_ .@IvankaTrump @EricTrump & @DonaldJTrumpJr take no prisoners in boardroom of 'All Star' @CelebApprentice. Where do they get it from? _E_ You don't necessarily need the best location. What you need is the best deal. The Art of the Deal _E_ Trump Nears 100 days on Top via The Hill __HTTP__ _E_ The NYPD Surveillance Program kept NYC safe since 9/11. There will be tragic consequences for ending it. _E_ A GREAT HONOR to spend time with our BRAVE HEROES at the @USMC Air Station Yuma. THANK YOU for your service to the United States of America! __HTTP__ _E_ .@tedcruz must be doing something right if @cher sadly rated "the 4th ugliest celebrity" according to @listverse is attacking him. _E_ .@WashTimes states Democrats have willfully used Moscow disinformation to influence the presidential election against Donald Trump. _E_ .#Celebrityapprentice will be live tomorrow night. Entire cast will be there. Who do you like to win? _E_ .@MittRomney much better on Libya and Middle East problems. Obama has no answer. _E_ The tax scam Washington Post does among the most inaccurate stories of all. Really dishonest reporting. 
_E_ Various media outlets and pundits say that I thought I was going to lose the election. Wrong it all came together in the last week and..... _E_ My @gretawire interview discussing @billlmaher's comments attacks on @MittRomney and @CNN & @msnbctv's low ratings __HTTP__ _E_ I will be interviewed by @DavidMuir tonight at 10 o'clock on @ABC. Will be my first interview from the White House.... __HTTP__ _E_ If we keep on this path if we reelect @BarackObama the America we leave our kids and grandkids won't look (cont) __HTTP__ _E_ The Democrats just aren't calling about DACA. Nancy Pelosi and Chuck Schumer have to get moving fast or they'll disappoint you again. We have a great chance to make a deal or blame the Dems! March 5th is coming up fast. _E_ On 800 beautiful acres in Miami @TrumpDoral boasts 100000 sq. ft. in meeting space with event planning services __HTTP__ _E_ ....8 Dems totally control the U.S. Senate. Many great Republican bills will never pass like Kate's Law and complete Healthcare. Get smart! _E_ .@BoonePickens Thank you for the T. Boone Pickens Entrepreneur Award—a great honor for me from a fantastic man. _E_ Today's assignment: read chapter three of Think Big "Basic Instincts." Focus on my acquisition of 40 Wall Street. _E_ Iraq is no longer our problem. We never should have been there in the first place! _E_ I couldn't make the Faith and Freedom confab in Orlando so I sent a video... __HTTP__ _E_ Thank you! Four new #DebateNight polls with the MOVEMENT winning. Together we will MAKE AMERICA SAFE & GREAT AGAIN... __HTTP__ _E_ Had dinner last night at Megu 845 United Nations Plaza fabulous food beautiful restaurant. _E_ RT @EricTrump: Friends: Remember to VOTE tomorrow if you live in Louisiana Maine Kentucky or Kansas! #MakeAmericaGreatAgain __HTTP__ _E_ Based on new oil prices the ugly windfarms being built in Scotland will quickly die! What a mess! 
_E_ ObamaCare not only has brought higher premiums decreased care & loss of jobs but now .1% Q1 growth. REPEAL BEFORE IT IS TOO LATE! _E_ The 5 star @Trump_Ireland graces over 500 acres fronting 2.5 miles on the Atlantic Ocean in County Clare Ireland __HTTP__ _E_ "Ice Skaters Invade Mar a Lago as Snow Falls on Palm Beach Salvation Army Ball!" __HTTP__ via @GossipExtra _E_ .@WendyWilliams Thanks for the nice statement especially about my wife and kids very much appreciated. _E_ Read my full statement here on the Supreme Court's executive amnesty decision #imwithyou __HTTP__ _E_ Our NOBEL PRIZE FOR PEACE president said I'm really good at killing people according to just out book Double Down. Can Oslo retract prize? _E_ I look forward to watching @megynkelly tonight 8 PM ET. It will be interesting to see how she treats me—I think she will be very fair. _E_ Join me in Delaware Ohio tomorrow at 12:30pm! #DrainTheSwamp Tickets: __HTTP__ __HTTP__ _E_ RT @DanScavino: Join #PEOTUS Trump & #VPEOTUS Pence live in West Allis Wisconsin! #ThankYouTour2016 #MAGA __HTTP__ __HTTP__ _E_ Today @MittRomney addressed the NAACP. @BarackObama takes their vote for granted which is why there is such high Black unemployment. _E_ Thank you @GOPLeader Kevin McCarthy! Couldn't agree w/you more. TOGETHER we are #MAGA __HTTP__ _E_ No matter how much I accomplish during the ridiculous standard of the first 100 days & it has been a lot (including S.C.) media will kill! _E_ Trump Int'l Hotel & Tower Vancouver will be a new landmark in a fantastic city __HTTP__ _E_ Lyin' Ted Cruz consistently said that he will and must win Indiana. If he doesn't he should drop out of the race stop wasting time & money _E_ RT @theRealKiyosaki: Donald Trump coined the phrase 'multilevel focusing' I love it. It is when two ideas intersect & form a new innovation _E_ Crooked Hillary said that I want guns brought into the school classroom. Wrong! 
_E_ One by one we are keeping our promises on the border on energy on jobs on regulations. Big changes are happening! _E_ Thank you! #GOPDebate Polls #MakeAmericaGreatAgain __HTTP__ _E_ "Leadership is perhaps the key to getting any job done." – The Art of The Deal _E_ Bay Bridge in California made in China for $1.8 billion. $300 million in cost overruns. Are we stupid? _E_ Oscar Pistorious the blade runner is as guilty as O.J. I wonder if the result will be the same? _E_ Thank you! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_ This guy @sethmeyers can't do a simple interview—saw him the other night stumbling & mumbling while trying to interview a guest. _E_ With autism being way up what do we have to lose by having doctors give small dose vaccines vs. big pump doses into those tiny bodies? _E_ RT @Scavino45: Time lapse video of the border wall prototypes when they were being built in San Diego. Next phase underway: testing and ev... _E_ Vattenfall the company behind a proposed asinine windfarm off the coast of Aberdeen Scotland is having serious financial difficulty. _E_ Keystone pipeline would create 20000 direct jobs another 50000 jobs servicing the pipeline. 700000 barrels a (cont) __HTTP__ _E_ The last person that Hillary or Bernie want to run against is Donald Trump and that is fact! _E_ Putin has no respect for our President really bad body language. _E_ Why is Obama's auto bailout now creating jobs in China? He is ruining American industry. _E_ General Flynn was given the highest security clearance by the Obama Administration but the Fake News seldom likes talking about that. _E_ Our ally Canada is 'frustrated' by @BarackObama's radical anti gas policies __HTTP__ BHO is forcing Canada to send gas to China. _E_ Entrepreneurs: Stay focused and be tenacious. Remain fixed on your goals. _E_ I am asking all citizens to believe in yourselves believe in your future and believe once more in America. #AmericaFirst __HTTP__ _E_ Want access to Crooked Hillary? 
Don't forget it's going to cost you!#DrainTheSwamp #PayToPlay __HTTP__ _E_ Crooked H is nasty to Sanders supporters behind closed doors. Owned by Wall St and Politicians HRC is not with you. __HTTP__ _E_ Cruz came to Mississippi there was nobody there he left the state. I had a rally in Madison MS with 10000! Thank you! _E_ Just left $259 million rebuilding of Doral in Miami. Amazing Trump National Doral will be a masterpiece (if I do say so myself)! _E_ The big problem for little @MacMiller is that he's going to have to have another hit song not just his Donald Trump bonanza. _E_ On Bill O'Reilly in 5 minutes! _E_ Just left the best golf course in the State of California @trumpgolfla. When in the LA area check it out even (cont) __HTTP__ _E_ FBI director said Crooked Hillary compromised our national security. No charges. Wow! #RiggedSystem _E_ America needs strong leadership. Politicians can talk but they don't get things done. Video: __HTTP__ __HTTP__ _E_ Great American heroes who averted an attack in France. THANK YOU! Spencer Stone Anthony Sadler & Alex Skarlatos. __HTTP__ _E_ RT @EricTrump: Aloha Hawaii: We would be honored to have your vote! Find your caucus __HTTP__ #TrumpWaikiki #Mahalo __HTTP__ _E_ Any American who fights w/ ISIS in Iraq or Syria should have their passport revoked. If they try to come back in send them to Gitmo. _E_ Have a GREAT EASTER I love you all! _E_ Just got back from Asheville North Carolina where we had a massive rally. The spirit of the crowd was unbelievable. Thank you! #MAGA _E_ True America is rapidly losing it's SPIRIT and when that's gone we will only be going in one direction and that direction is down! _E_ TERRORISM IMMIGRATION AND NATIONAL SECURITY SPEECH TRANSCRIPT: __HTTP__ __HTTP__ _E_ Leaders at Trump National Doral are only one under par. The great Ben Hogan said I've never seen a great course that was easy! _E_ "I don't measure a man's success by how high he climbs but how high he bounces when he hits bottom." 
George S. Patton _E_ A friend is one who has the same enemies as you have. Abraham Lincoln _E_ Lightweight @AGSchneiderman's phony lawsuit against Trump U was decimated by the court—he's a loser! _E_ People rarely say that many conservatives didn't vote for Mitt Romney. If I can get them to vote for me we win in a landslide. _E_ Very important that NFL players STAND tomorrow and always for the playing of our National Anthem. Respect our Flag and our Country! _E_ The country of Georgia is a small wonder. Performing well economically under the leadership of @SaakashviliM. A great American ally. _E_ Can't believe we are less than three weeks away from the election. Time certainly flies! _E_ Obama said in his SOTU that "global warming is a fact." Sure about as factual as "if you like your healthcare you can keep it." _E_ Super PACs should be disavowed by anyone running for President. They are a total scam on our system and country! I am self funding. _E_ For every CEO that drops out of the Manufacturing Council I have many to take their place. Grandstanders should not have gone on. JOBS! _E_ A GREAT DAY IN WISCONSIN!Thank you #Racine & #Wausau! Just arrived in #EauClaire! #Trump2016#WIPrimary #TrumpTrain __HTTP__ _E_ Leaving the great people of North Carolina. Amazing event. Heading to Tampa now! #VoteTrump _E_ United States looks more and more like a paper tiger. Won't be that way if I win! _E_ THANK YOU ASIA! #USA __HTTP__ _E_ .@PennyPritzker Really important to cover currency manipulation in trade agreements that's where China and others are beating us. Best! _E_ Thank you. __HTTP__ _E_ Good luck to the people of Scotland whatever their decision may be on Thursday. The whole world is watching—really exciting! _E_ The @GOP should not agree to the ridiculous debate terms that @CNBC is asking unless there is a major benefit to the party. _E_ #ICYMI: @KarlRove & @oreillyfactor discuss what Ted Cruz did to the great people of Iowa as they went to vote. 
__HTTP__ _E_ As promised my @SuperBowl pick is the San Francisco @49ers. _E_ My Twitter has been seriously hacked and we are looking for the perpetrators. _E_ I look forward to my press conference on Weds of next week @TrumpTurnberry to discuss changes & big investment I'll make. Very exciting! _E_ Because Obama was so pathetic in the first debate tonight's audience will be humongous people want to see if he is for real. _E_ Tomorrow is #TrumpTuesday on @SquawkCNBC 7:30 AM _E_ It was just announced that @MacMiller's song "DonaldTrump" went platinum—tell Mac Miller to kiss my ass! _E_ Today we are thrilled to welcome @Broadcom CEO Hock Tan to the WH to announce he is moving their HQ's from Singapore back to the U.S.A..... __HTTP__ _E_ Looking at the figures and plans behind @Disney's acquisition of Lucas Film makes you realize how stupid @AOL (cont) __HTTP__ _E_ Obama's $1T+ deficit budget expanded welfare & green cronyism & it cut domestic bomb prevention in half __HTTP__ _E_ Restoring American wealth will require that we get tough. The next president must understand that America's (cont) __HTTP__ _E_ Thank you America! #MAGARasmussen National PollDonald Trump 43%Hillary Clinton 40% __HTTP__ _E_ The so called A list celebrities are all wanting tixs to the inauguration but look what they did for Hillary NOTHING. I want the PEOPLE! _E_ W/a newly expanded 27 holes of golfing Trump Intl.Palm Beach is ranked by Florida Golf Magazine as FL's #1 course __HTTP__ _E_ Check @billmaher's background & you will find he is not a smart guy—he just wants people to think he is just call him dummy. _E_ Government's first duty is to protect the people not run their lives." – President Ronald Reagan _E_ An honor having the National Sheriffs' Assoc. join me at the @WhiteHouse. Incredible men & women who protect & serv... 
__HTTP__ _E_ My @SquawkCNBC interview discussing the @GOP convention @BarackObama's sealed records & @SenatorReid's tax claim __HTTP__ _E_ Interview w/ @AndreaTantaros discussing my WH tour offer @KarlRove's terrible ads & Ashley Judd's candidacy __HTTP__ _E_ Americans understand that the US has a spending problem not a revenue problem. #TimeToGetTough __HTTP__ __HTTP__ _E_ The trade deal is a disaster she was always for it! #DemDebate _E_ The seriously failing @nytimes despite so much winning and poll numbers that will soon put me in first place only writes dishonest hits! _E_ A bite from last night's @piersmorgan interview discussing Rev. Wright's Ed Klein interview and the 2012 campaign __HTTP__ _E_ Our not very bright Vice President Joe Biden just stated that I wanted to carpet bomb the enemy. Sorry Joe that was Ted Cruz! _E_ Just shows that you can have all the cards and lose if you don't know what you're doing. _E_ #TrumpVine from D.C. __HTTP__ _E_ With a @SharkGregNorman designed course directly along the water @Trump_Charlotte is North Carolina's elite club __HTTP__ _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Had a great time hosting the Palm Beach County Republican at Mar a Lago. @IngrahamAngle gave a strong speech. She's great! _E_ "You have to be positive every single day. Positive stamina is a necessary ingredient for success." – Think Like a Champion _E_ Please @21Club go back to your original menu and preparation. Believe me it was much better. Let me know when the change is made! _E_ .@pennjillette has received a star on the Hollywood Walk of Fame— about time! #CelebApprentice _E_ Be aware of things that seem inexplicable because they can be a big step towards innovation. Donald J. Trump __HTTP__ _E_ I am not only fighting Crooked Hillary I am fighting the dishonest and corrupt media and her government protection process. People get it! _E_ Did the Boston terrorists register their guns? No. 
Another example of why gun control legislation is not the answer! _E_ A Clinton economy = more taxes and more spending! #DebateNight __HTTP__ _E_ I can't wait to donate @billmaher's $5 million to charity. Just waiting on @billmaher to send me the money. _E_ Peoples lives are being shattered and destroyed by a mere allegation. Some are true and some are false. Some are old and some are new. There is no recovery for someone falsely accused life and career are gone. Is there no such thing any longer as Due Process? _E_ So sad that @CNN and many others refused to show the massive crowd at the arena yesterday in Oklahoma. Dishonest reporting! _E_ This story is not about Mr. Khan who is all over the place doing interviews but rather RADICAL ISLAMIC TERRORISM and the U.S. Get smart! _E_ Getting back to the nicer and more normal parts of life Celebrity Apprentice is great tonight on NBC at 9. It will be a full two hour show! _E_ My twitter account is now reaching more people than the New York Times not bad. And we're only going to get better! _E_ Senator (Doctor) Bill Cassidy is a class act who really cares about people and their Health(care) he doesn't lie just wants to help people! _E_ The story with Hillary will never change. __HTTP__ _E_ The big story is the unmasking and surveillance of people that took place during the Obama Administration. _E_ .@TMobile gives terrible service and has many complaints just check. _E_ It's 10 AM: Two hours to go for Obama to easily pick up millions for charity! _E_ Gotta hand it to @IvankaTrump she loved Doral from the time we looked at it. The Trump Doral will be an Icon. #sayfie #newsmax _E_ Rubio is weak on illegal immigration with the worst voting record in the U.S. Senate in many years. He will never MAKE AMERICA GREAT AGAIN! _E_ Congratulations to @SpeakerBoehner on standing strong and tying government shutdown to defunding ObamaCare. 
_E_ Via @USATODAYsports: "Last year it was Tiger Woods with the walk off" __HTTP__ @CadillacChamp @DoralResort #TrumpDoral _E_ .@Lord_Sugar....but you wouldn't notice because you have no vision and you are a total loser. _E_ Thank you to all Americans who participated in Nat'l Rx Drug Take Back Day. A record amount of drugs collected & disposed. We can do this! _E_ ...and an optimist is one who makes opportunities of his difficulties. Harry S. Truman _E_ The #AmazonWashingtonPost sometimes referred to as the guardian of Amazon not paying internet taxes (which they should) is FAKE NEWS! _E_ Some of the Fake News Media likes to say that I am not totally engaged in healthcare. Wrong I know the subject well & want victory for U.S. _E_ Thank you! "Trump's Defining Speech" WSJ Editorial: __HTTP__ __HTTP__ _E_ Very strange why do database records contradict @BarackObama and show he was only at Columbia 1 year? __HTTP__ _E_ Crooked Hillary can't even close the deal with Bernie and the Dems have it rigged in favor of Hillary. Four more years of this? No way! _E_ Big announcement tomorrow morning concerning the great Turnberry Resort in Scotland! _E_ RT @TeamTrump: .@HillaryClinton had her chance and she BLEW IT. #BigLeagueTruth #Debates __HTTP__ _E_ Obama's war on women has lead to the biggest decline in female employment in 40 years. 4 more years?? _E_ After Turkey call I will be heading over to Trump National Golf Club Jupiter to play golf (quickly) with Tiger Woods and Dustin Johnson. Then back to Mar a Lago for talks on bringing even more jobs and companies back to the USA! _E_ Looking forward to my Iowa visit at @bobvanderplaats' @theFAMiLYLEADER Summit __HTTP__ Big crowd! _E_ President Obama's Arab Spring is not looking so good right now! _E_ Guests are raving about our exclusive hotel mattress and so we've made it available for purchase! __HTTP__ _E_ Keystone must be approved. Oil is at a record high. We need to use our resources and support allies like Canada. 
_E_ .@ximenaNR Great job we are all proud of you one of our all time BEST! _E_ RT @foxandfriends: Getting the job done! Sen. Mitch McConnell delays August recess to work on health care bill __HTTP__ _E_ Marshawn Lynch of the NFL's Oakland Raiders stands for the Mexican Anthem and sits down to boos for our National Anthem. Great disrespect! Next time NFL should suspend him for remainder of season. Attendance and ratings way down. _E_ I just left the Trump Tower atrium it is packed with great people. #1 tourist attraction in NYC Fun! #TrumpTower _E_ Via @LuxuryDaily by Joe McCarthy: "Trump Collection leverages 2016 election frenzy for Washington debut" __HTTP__ _E_ "TRUMP: IMMIGRATION BILL A REPUBLICAN 'DEATH WISH'" __HTTP__ via @BreitbartNews by @mboyle1 _E_ Congressman Ron DeSantis is a brilliant young leader Yale and then Harvard Law who would make a GREAT Governor of Florida. He loves our Country and is a true FIGHTER! _E_ The new reality. 'China Daily' is sold in street newspaper vending machines across DC. Why not? They own the place. _E_ The @TrumpChicago Spa offers 5 star services12 treatment rooms & 53 spa guestrooms overlooking Chicago skyline __HTTP__ _E_ Republicans and Democrats have both created our economic problems. _E_ I am self funding my campaign so I do not owe anything to lobbyists & special interests. __HTTP__ __HTTP__ _E_ Jerry Finkelstein passed away last night a great New York mover & shaker & a really great guy! _E_ Speaking at the Red White and Blue Dinner in Maryland __HTTP__ _E_ Obama is laughing at Karl Rove & all the losers who spent hundreds of millions of dollars and didn't win one race including the big one! _E_ "Donald Trump to Build Trump Towers Complex in Rio de Janeiro" __HTTP__ via Hispanically Speaking News _E_ Secretary of Defense Chuck Hagel seems so lost and frankly dumb. He can't even speak properly. Poor leader in these very dangerous times! _E_ This is my last election. After my election I have more flexibility. 
Obama to @MedvedevRussiaE discussing our nuclear arsenal. _E_ Why I would not have approved the deal... __HTTP__ #trumpvlog _E_ RT @foxandfriends: VIDEO: Rep. Scalise — GOP agrees on over 85 percent of health care bill __HTTP__ _E_ You have to believe in what you want. Keep your focus keep your momentum and remain patient and persistent. _E_ Obama keeps namedropping Bill Clinton he is no Bill Clinton. _E_ Be tenacious. Being tenacious means you're tough and patient at once a formidable combination. _E_ China is the biggest environmental polluter in the World by far. They do nothing to clean up their factories and laugh at our stupidity! _E_ The NFL image is really tarnished! Now if the sponsors start leaving and the ratings go down the NFL will be in big trouble. Boring games! _E_ Emmy Awards show was terrible last night. Same shows winning over and over again (politics). Amazing race a joke. Host Seth Meyers bombed! _E_ My @foxandfriends interview on risk for @GOP on immigration wasting money in Middle East & firing @OMAROSA __HTTP__ _E_ I'll be on @foxandfriends on Monday at 7:30 AM...be sure to tune in. _E_ My prayers and best wishes are with the family of Edwin Jackson a wonderful young man whose life was so senselessly taken. @Colts _E_ Looking forward to being awarded the '2015 Statesman of the Year' by @SRQRepublicans this Thursday. A record 2000+ attendees Can't wait! _E_ Hillary has called for 550% more Syrian immigrants but won't even mention "radical Islamic terrorists." #Debate... __HTTP__ _E_ ....victory and cannot be burdened with the tremendous medical costs and disruption that transgender in the military would entail. Thank you _E_ #MakeAmericaGreatAgain#TrumpPence16 __HTTP__ _E_ There is no excuse for riots in Ferguson regardless of the grand jury outcome. _E_ I am the king of debt. That has been great for me as a businessman but is bad for the country. I made a fortune off of debt will fix U.S. 
_E_ "There can be no liberty unless there is economic liberty." Margaret Thatcher _E_ Thank you Brian Krzanich CEO of @Intel. A great investment ($7 BILLION) in American INNOVATION and JOBS!... __HTTP__ _E_ You haven't seen fireworks until you see @OMAROSA & @piersmorgan go at it again! Let's just say it's no happy reunion... _E_ Steve Jobs is spinning in his grave Apple has lost both vision and momentum must move fast to get magic back! _E_ Great job @EricTrump! Proud of you!#AmericaFirst #RNCinCLE __HTTP__ _E_ RT @FLOTUS: Thank you to all who participated in today's discussion on opioid abuse. By talking about it we can start to make a real diffe... _E_ Via @BreitbartSports by @warnerthuston: "Donald Trump Buys Four Time British @The_Open Golf Course" __HTTP__ _E_ Thank you so much to __HTTP__ for naming me the 2015 Man of the Year. This is indeed a great honor for me! _E_ What apology didn't they go around beating the crap out of people and robbing them? Why did they all confess? Aren't police convinced? _E_ Today it was my great honor to welcome Prime Minister Erna Solberg of Norway to the @WhiteHouse a great friend and ally of the United States! Joint press conference: __HTTP__ __HTTP__ _E_ It is time to take care of OUR COUNTRY to rebuild OUR COMMUNITIES and to protect our GREAT AMERICAN WORKERS! #TaxReform __HTTP__ _E_ Crooked Hillary Clinton put out an ad where I am misquoted on women. Can't believe she would misrepresent the facts! My hit was on China _E_ My @foxandfriends interview discussing the @nyjets acquisition of @TimTebow and the timing of @RepPaulRyan's plan __HTTP__ _E_ We have to make the U.S.A. RICH again so that we can afford to pay Social Security Medicareand Medicaid and STRONG to keep our enemies out _E_ Our Marines are sent to kill the Taliban not coddle them. USMC should be praised not investigated. Semper Fi ! _E_ Why has nobody asked Kaine about the horrible views emanated on WikiLeaks about Catholics? 
Media in the tank for Clinton but Trump will win! _E_ .@TraceAdkins isn't excited about their ideas. Are you? #CelebApprentice _E_ "Give me a smart idiot over a stupid genius any day." Samuel Goldwyn _E_ Baseball player Ryan Braun turned out to be a total con man after so vociferously proclaiming his innocence only to be guilty as.hell! _E_ Can't believe Major League Baseball just rejected @PeteRose_14 for the Hall of Fame. He's paid the price. So ridiculous let him in! _E_ #CrookedHillary __HTTP__ _E_ RT @FoxNews: .@jessebwatters on @DonaldJTrumpJr meeting with Russian attorney: I believe Don Jr. is the victim here. #TheFive __HTTP__ _E_ DESPERATE @BarackObama is already asking supporters to 'find dirt' on @MittRomney's VP picks __HTTP__ Dirty tactics. _E_ Randy Moss said he was the greatest receiver of all time—no way—it was @JerryRice! _E_ A review of @MikeTyson's show great press on Trump International Golf Links Scotland and more in today's #trumpvlog __HTTP__ _E_ Elizabeth Warren often referred to as Pocahontas just misrepresented me and spoke glowingly about Crooked Hillary who she always hated! _E_ Will be on Fox & Friends at 7 (10 minutes). ENJOY! _E_ Wow the final ratings for the Miss Universe Pageant show that it won in all key demos number one on Sunday. I have a winner! _E_ RT @EricTrump: #ThrowbackThursdays @realDonaldTrump __HTTP__ _E_ We are inspired by the stories of everyday heroes who pull their communities from the depths of despair through leadership and love. __HTTP__ _E_ Megyn Kelly has two really dumb puppets Chris Stirewalt & Marc Threaten (a Bushy) who do exactly what she says. All polls say I won debates _E_ Congratulations to my son Eric on the fantastic job he has done in rebuilding Turnberry and its great Ailsa Course. Always support kids! _E_ .@Modern_Do_Good #asktrump __HTTP__ _E_ .@rushlimbaugh Rush I am in LA inspecting property (big job creator) & listening to you. You are truly fantastic thanks! 
_E_ Iowa was fantastic last night amazing crowd and people. I'm now in Florida getting ready to go to South Carolina. Big crowd very exciting _E_ .@BarackObama was caught telling Russian PM @MedvedevRussiaE that he can be more 'flexible' in his second term. Russia thinks he's weak. _E_ Via @Newsmax_Media by @OwenTew: "Donald Trump: 'Last Thing We Need Is Another Bush'" __HTTP__ _E_ RT @dmartosko: 'Duck Dynasty' star Phil Robertson says he'll back Trump for president __HTTP__ via @MailOnline _E_ I love taking lawsuits all the way when I'm right. @AGSchneiderman is finding that out the hard way! _E_ ISIS has infiltrated countries all over Europe by posing as refugees and @HillaryClinton will allow it to happen h... __HTTP__ _E_ Remember that I am self funding my campaign. Hillary Jeb and the rest are spending special interest and lobbyist money.100% CONTROLLED _E_ Join Governor Mike Pence in Reno Nevada tonight at 7pm! Tickets available at: __HTTP__ _E_ Fact: without Texas and states reaping the fracking boom Obama's job record would go from bad to worse! _E_ A great gift idea is my new book #TimeToGetTough easy to order on Amazon __HTTP__ _E_ Join me live from the @WhiteHouse. __HTTP__ _E_ I will be on Morning's with Maria on the Fox Business Network tomorrow during the 7am and 8am ET hours. _E_ Major League Baseball was really smart when they wouldn't let Mark Cuban buy a team. Was it his financials or the fact that he's an asshole? _E_ Brian if I'm well past the last exit to relevance how come you spent so much time reading my tweets last night? @NBCNightlyNews _E_ My two wonderful sons Don and Eric will be on @foxandfriends at 7:02 now! Enjoy. 
_E_ Via @Newsmax_Media:  Trump: I'd Be Better 'Meet the Press' Host Than 'Moron' Chuck Todd __HTTP__ _E_ I'm on @ETonlineAlert tonight to talk about what the Yankees should have done about A Rod long ago __HTTP__ _E_ Attorney General Jeff Sessions has taken a VERY weak position on Hillary Clinton crimes (where are E mails & DNC server) & Intel leakers! _E_ "You have to be patient as well as enthusiastic when it comes to your goals. Think big but be realistic." – Think Big _E_ I was putting together my early deals in New York & I was advised by many that I was too young. Believe in yourself & you can do anything. _E_ Great honor to have @GOP General Counsel #JohnRyder as a Trump delegate in TN. RNC meeting well worth it! Unifying the party! _E_ Arena was packed totally electric! _E_ Melania will be interviewed by @morningmika on @Morning_Joe now (8:30 A.M.). ENJOY! _E_ Yesterday @BarackObama actually spent a full day in Washington. He didn't campaign fund raise or play golf. Shocking. _E_ Jon Huntsman called to see me. I said no he gave away our country to China! @JonHuntsman _E_ RT @foxandfriends: FOX NEWS ALERT: 2 US drone strikes in Somalia target Al Qaeda and Al Shabaab __HTTP__ _E_ Florida Ethics Commission Advocate comes down hard on Rubio. So do two people who worked with him. Said he used the wrong credit card! Sure. _E_ It will now start to cool down concerning Sterling and the Clippers. This mess will start to fade after litigation into the murky past! _E_ . @deesnider is a great guy & a total winner! He understood he did not leave me any other choice. Look forward to keeping in touch. _E_ Nobody understands politicians like I do all talk and no action. They will never get our country where it needs to be truly great again! _E_ Mrs. Goldberg who filed the Chicago case many years ago is a vicious and conniving woman loved beating her. 
_E_ MUST READ It's time people listened to Trump' says mother of gunned down teenage football star __HTTP__ SECURE THE BORDER! _E_ Wow China's growth accelerated 7.8% in third quarter. If the U.S. had half that number we would be the talk of the World need leadership _E_ .@JTimberlake It was great having you play The Blue Monster. Thanks for your nice statements many agree that it is best they've seen! _E_ My rallies are not covered properly by the media. They never discuss the real message and never show crowd size or enthusiasm. _E_ I will be doing Greta Van Susteren @gretawire tonight at 10 PM on Fox News talking about China & Mitt's failed campaign team. _E_ Glad to hear Clint Eastwood endorsed @MittRomney. He understands that America needs a big boost to be strong again. _E_ Just got back to the White House from the Great States of Texas and Louisiana where things are going well. Such cooperation & coordination! _E_ North Korea is looking for trouble. If China decides to help that would be great. If not we will solve the problem without them! U.S.A. _E_ "Definiteness of purpose is the starting point of all achievement." W. Clement Stone _E_ Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_ If you can't see it it will never happen. Bring your vision to fruition through perseverance and hard work. That will build momentum. _E_ I know Shia LaBeouf @thecampaignbook and when sober a really nice guy. Must get act together fast before too late. _E_ The economy is in terrible shape. @BarackObama is manipulating the job numbers to hide the truth. __HTTP__ _E_ When it comes to China @BarackObama practices pretty please diplomacy. He begs and pleads and bows and it'... (cont) __HTTP__ _E_ Best speech in #GoldenGlobes history __HTTP__ _E_ Watching the madness in Cyprus? If our government keeps spending trillion dollar deficits that could happen here. _E_ #CelebApprentice #TeamVortex or #TeamInfinity? 
_E_ Big day on Thursday for Indiana and the great workers of that wonderful state.We will keep our companies and jobs in the U.S. Thanks Carrier _E_ It's still exciting after all these years and this cast is special! _E_ One of the greatest tributes to a father I have ever witnessed given to the great @jacknicklaus by his wonderful son __HTTP__ _E_ How will raising taxes create jobs? Washington is all out of answers. New leadership is needed. _E_ According to a @gallupnews poll over 60% think ObamaCare will make things worse for taxpayers __HTTP__ ObamaCare is a T A X. _E_ I will be holding a major briefing on the Opioid crisis a major problem for our country today at 3:00 P.M. in Bedminster N.J. _E_ .@robertjeffress I greatly appreciate your kind words last night on @FoxNews. Have great love for the evangelicals great respect for you. _E_ Make sure to grab your copy of this month's @Newsmax_Media detailing The Trump Effect __HTTP__ _E_ Music cues audience participation sounds like a very active Team Power. #CelebApprentice _E_ NATIONAL DEBT January 2009 = $10.6 TRILLIONAugust 2016 = $19.4 TRILLION __HTTP__ _E_ Dem Gov. of MN. just announced that the Affordable Care Act (Obamacare) is no longer affordable. I've been saying this for years disaster! _E_ "When you can't make them see the light make them feel the heat." – President Ronald Reagan _E_ Over 35 CIA operatives were on the ground in Benghazi the night of the 9.11 attack __HTTP__ Still a phony scandal ? _E_ The people of Ireland are very smart—they just killed an ugly windfarm which would've hurt tourism @AlexSalmond __HTTP__ _E_ ... to build a wind farm and destroy this view! _E_ I will be on @seanhannity tonight from Las Vegas Nevada at 10pmE. Enjoy! #Hannity #Trump2016 __HTTP__ _E_ The great State of Nebraska can do much better than @BenSasse as your Senator. Saw him on @greta totally ineffective. Wants paid for pols. _E_ Iran's continued public threats of annihilating @Israel are unacceptable. 
Iran's nuclear drive must be stopped. #TimeToGetTough _E_ Very little pick up by the dishonest media of incredible information provided by WikiLeaks. So dishonest! Rigged system! _E_ Entrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_ It was an honor to host our American heroes from the @WWP #SoldierRideDC at the @WhiteHouse today with @FLOTUS @VP... __HTTP__ _E_ This was sent out from Ted Cruz as Iowans arrived at their caucus sites to vote. #CruzFraud __HTTP__ _E_ RT @GOP: Reminder: last year Clinton pledged she had turned over all work related email under penalty of perjury __HTTP__ _E_ With the labor participation rate at a 36 yr. low over 92M Americans are out of the work force. _E_ Christians in the Middle East have been executed in large numbers. We cannot allow this horror to continue! _E_ Just arrived at Camp David where I am monitoring the path and doings of Hurricane Harvey (as it strengthens to a Class 3). 125 MPH winds! _E_ 'Democratic operative caught on camera: Hillary PERSONALLY ordered 'Donald Duck' troll campaign that broke the law' __HTTP__ _E_ When Warren Buffett & others play w/ bankruptcy nobody cares—when Trump plays the game it becomes a big deal! __HTTP__ _E_ Heading back from a very exciting two days in Davos Switzerland. Speech on America's economic revival was well received. Many of the people I met will be investing in the U.S.A.! #MAGA _E_ Ask yourself is this a blip or is it a catastrophe? and your equilibrium will be kept in check if hard times hit. _E_ I always said that @lancearmstrong had to keep fighting the charges. By stopping he gave his enemies an opening. _E_ Landing in New Hampshire soon to talk about the massive drug problem there and all over the country. _E_ Donald J. Trump Ethics Reform Plan For Washington D.C. __HTTP__ _E_ "Destiny has a part to play in your life and in your business – so give it a chance to work." 
– Think Like a Champion _E_ My wife Melania will be on @QVC today @ 5 PM selling really beautiful jewelry at a very low price. Perfect for Mother's Day—call in! _E_ Entrepreneurs: Keep your momentum. See yourself as victorious and leading a winning team. Keep everyone moving forward. _E_ RT @foxandfriends: Mark Levin: The collusion is among the Democrats __HTTP__ _E_ .@marklevinshow has written a great book Plunder and Deceit. He powerfully analyzes issues that are crucial to us today. Read it! _E_ Why is the Pentagon wasting precious dollars on going 'green.' Complete waste. We need the best & easiest fuel for our military. _E_ It was great being with Luther Strange last night in Alabama. What great people what a crowd! Vote Luther on Tuesday. _E_ TV's darling @TheRealMarilu is back in this year's "All Star" @CelebApprentice. Marilu is a fierce competitor. _E_ .@CNN & @CNNPolitics did not say that lawyer Beck lost the case and I got legal fees. Also she wanted to breast pump in front of me at dep. _E_ "Keep your brand standard in mind and your expansion will seem possible as well as gratifying." – Midas Touch _E_ I will be handing over my Twitter account to my team of deplorables for tonight's #debate#MakeAmericaGreatAgain _E_ Anti Morsi protests are 10 times larger than 2011 anti Mubarek protests. Interesting. _E_ "Trump: Illegal Immigrants Are Getting Treated Better than Vets" __HTTP__ via @nro by @AndrewE_Johnson _E_ I will be live tweeting during the @ApprenticeNBC tonight at 9PM ET. _E_ If you fail once twice three times it doesn't matter. Learn from your mistakes and push forward to VICTORY the sweetest feeling there is! _E_ Thank you Laura! __HTTP__ _E_ .@oreillyfactor was very negative to me in refusing to to post the great polls that came out today including NBC. @FoxNews not good for me! _E_ HAPPY NEW YEAR & THANK YOU! 
__HTTP__ __HTTP__ _E_ Millions of dollars being spent on false TV ads by special interest groups who own Rubio & Cruz.When you see them think of your puppet POLS _E_ It's Friday. How many bald eagles did wind turbines kill today? They are an environmental & aesthetic disaster. _E_ Just won the highest rated sanitary award in NY—an A & the food is great also. Trump Grill/ 57th & 5th. _E_ Every dollar @BarackObama spends costs $1.40 with interest borrowed from China on our children and grandchildren's backs. CUT CAP BALANCE! _E_ Looking forward to being honored with the prestigious 'Friend of Israel' award at the @Algemeiner Gala Dinner __HTTP__ _E_ Obama's plan to have Russia stand up to Iran was a horrible failure that turned America into a laughingstock. #TimeToGetTough _E_ "Success is dependent on effort." Sophocles _E_ Use adverse events and monumental challenges to make you strong Think Big _E_ This Sunday's LIVE FINALE of @ApprenticeNBC puts @pennjillette against @TraceAdkins. Watch two great competitors battle to win! _E_ What you dream about is what you do. If you cannot even dream of doing big things you will never do anything big. Think Big _E_ We have just begun! __HTTP__ _E_ Great poll Florida! Thank you! __HTTP__ _E_ .@WhiteHouse Briefing with Director Marc Short and Director Mick Mulvaney... __HTTP__ _E_ Celebrate 2013 @TrumpSoHo with downtown's nicest #NYE party. Get your tickets now: __HTTP__ _E_ Inspiration exists but it has to find us working. Pablo Picasso _E_ 20 Most Anticipated Hotel Openings of 2016: Trump International Hotel Washington D.C. __HTTP__ _E_ I said gas prices would sky rocket after election Opec payback! _E_ Oil prices just went over $100 per barrel for first time in nine months! _E_ Obama is on yet another two day West Coast fundraising swing. Has to fit it in before his 15 day tax payer funded vacation. _E_ Snowden is a spy who has caused great damage to the U.S. 
A spy in the old days when our country was respected and strong would be executed _E_ I just answered my Facebook fan's questions in the latest #AskTheDonald watch the video __HTTP__ _E_ Who says Obama will do better in the next debate has he gotten smarter in 2 weeks! _E_ "The belief that security can be obtained by throwing a small state to the wolves is a fatal delusion." Winston Churchill _E_ Thank you @Morning_Joe & @morningmika a great show! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ I wonder how officials @TexasTech feel now after treating Coach Mike Leach with so little respect after their loss to @TCUFootball 82 27? _E_ Interesting article by @newtgingrich @HumanEvents: "WHY ROVE AND STEVENS ARE PLAIN WRONG" __HTTP__ _E_ Via @AP on @washingtonpost: Trumps look at building 18 hole golf course on former Kluge estate in rural Virginia __HTTP__ _E_ Will be in South Bend Indiana in a short while big rally! See you soon! _E_ I have just ordered Homeland Security to step up our already Extreme Vetting Program. Being politically correct is fine but not for this! _E_ .@KAThomas212 Congratulations on joining the finest and fastest growing group of very talented people in the City. You will be GREAT! _E_ Obama must now start focusing on OUR COUNTRY jobs healthcare and all of our many problems. Forget Syria and make America great again! _E_ Heading down to D.C. __HTTP__ _E_ DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER MAKE AMERICA GREAT AGAIN! _E_ 10000 people in South Carolina unbelievable evening! Will be in New Hampshire tomorrow love it. __HTTP__ _E_ Glenfiddich is a joke—should have chosen Andy Murray—U.S. Open & Olympic gold winner—as Top Scot instead of a total loser! _E_ The Ebola patient who came into our country knew exactly what he was doing. Came into contact with over 100 people.Here we go I told you so! _E_ Thank you! __HTTP__ _E_ Democrats refused to vote down their ObamaCare subsidy. 
While Americans will be hit w/ rising premiums Washington won't feel any pain _E_ 47M on food stamps. Over 23M Americans unemployed. 50% of college grads unemployed. And Obama wants us talking about Big Bird. _E_ I will be speaking Monday September 24 (10 A.M.) at Liberty University to a record setting student body. I look forward to it! _E_ Breaking news negotiations with Iranians broke down because Obama insisted that they use ObamaCare. _E_ As stated here is the press release. __HTTP__ _E_ ObamaCare is a complete disaster. Many of my friends have to scale down their businesses because they can't afford it. Terrible. _E_ Good luck to @Joy_Villa on her decision to enter the wonderful world of politics. She has many fans! _E_ Departing now thank you Cedar Rapids Iowa. This is a MOVEMENT! __HTTP__ _E_ If I would have challenged the man the media would have accused me of interfering with that man's right of free speech. A no win situation! _E_ What a shame that Kobe Bryant was so badly injured last night a truly great champion who brought the Lakers back from oblivion this year! _E_ Will be interviewed tonight by @seanhannity on @FoxNews at 10 PM. Enjoy! _E_ New episode starting now! _E_ Just arrived in New Hampshire. Another packed venue! Will be fun. _E_ In debate @MittRomney should ask Obama why autobiography states born in Kenya raised in Indonesia. _E_ After Poland had a great meeting with Chancellor Merkel and then with PM Shinzō Abe of Japan & President Moon of South Korea. _E_ Congratulations Chuck. Must be wonderful to have Donald Trump as your guest #BeCool! #Trump2016 __HTTP__ _E_ RT @Scavino45: Manufacturer Optimism Hits Record High After #TaxReform Plan Revealed __HTTP__ _E_ I promise to do a new #trumpvlog when I get back next week lots of requests. Thanks! _E_ .@MikeTyson and @SpikeLee I gave a great review of your show in my #trumpvlog __HTTP__ _E_ I will be making some very big campaign stops next week big crowds and tremendous energy! 
MAKE AMERICA GREAT AGAIN _E_ I dictate my tweets to my executive assistant and she posts them. Time is money The Art of the Deal. _E_ As usual the ObamaCare premiums will be up (the Dems own it) but we will Repeal & Replace and have great Healthcare soon after Tax Cuts! _E_ Black Lives Matter protesters totally disrupt Hillary Clinton event. She looked lost. This is not what we need with ISIS CHINA RUSSIA etc. _E_ The failing @UnionLeader newspaper in N.H. just sent The Trump Organization a letter asking that we take ads. How stupid how desperate! _E_ .@AC360 Has the absolutely worst anti Trump talking heads on his show. Dopey writer O'brian knows nothing about me or my wealth. A waste! _E_ .@BradSteinle Thank you for yr wonderful tweet of July 4. I wanted a little time to go by before calling. Your sister & family are amazing. _E_ Egypt is turning into a hot bed of radical Islam. The current protest is another coup attempt. We should never have abandoned Mubarak. _E_ The jury in the Jodi Arias trial is believe it or not still out. You never know but such a long deliberation could be good for the defense _E_ #TextTrump88022 for exclusive @realDonaldTrump updates! We will Make America Great Again! _E_ Great read: "Hollywood can kiss Adam Corolla's ass he's going Trump funding" __HTTP__ via @upstartbusiness _E_ I hope people will start to focus on our Massive Tax Cuts for Business (jobs) and the Middle Class (in addition to Democrat corruption)! _E_ China is neither an ally or a friend they want to beat us and own our country. _E_ 72% of refugees admitted into U.S. (2/3 2/11) during COURT BREAKDOWN are from 7 countries: SYRIA IRAQ SOMALIA IRAN SUDAN LIBYA & YEMEN _E_ "Obama doesn't respect the fact that the money he wastes belongs to us. He thinks that the wealth you create (cont) __HTTP__ _E_ Open for the 2014 season Mar a Lago Club is an architectural masterpiece offering the finest amenities in the world __HTTP__ _E_ LIVE on #Periscope: Good morning Iowa! 
Let's #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Dennis Rodman is a project manager tonight on Celebrity Apprentice watch Dennis in full action! _E_ #TrumpVlog South African justice __HTTP__ _E_ Do you believe it? The Obama Administration agreed to take thousands of illegal immigrants from Australia. Why? I will study this dumb deal! _E_ Just finished a press conference in Trump Tower wherein I gave information on which VETERANS groups got the $5600000 that I raised/gave! _E_ Starting to develop a much better relationship with Pakistan and its leaders. I want to thank them for their cooperation on many fronts. _E_ Another Dishonest Politician #LightweightSenatorMarcoRubio __HTTP__ _E_ FL KS ME MD MN NJ OR & WV! It's the LAST DAY to mail in voter reg forms. Get the forms at... __HTTP__ _E_ Who would you rather have negotiating for the U.S. against Putin Iran China etc. Donald Trump or Hillary? Is there even a little doubt? _E_ Will be doing Fox & Friends at 7 A.M. It never ends (hopefully)! _E_ Great victory for people of Blackdog Scotland. They defeated substation stopping inefficient & ugly wind turbines.@AlexSalmond _E_ I will be interviewed on @GMA Good Morning America tomorrow at 7:00 A.M. Big new ABC poll coming out I hope I do well! _E_ Without more Republicans in Congress we were forced to increase spending on things we do not like or want in order to finally after many years of depletion take care of our Military. Sadly we needed some Dem votes for passage. Must elect more Republicans in 2018 Election! _E_ The polls have been really amazing we are all tired of incompetent politicians and bad deals! __HTTP__ _E_ .@Omarosa admitting she's a threat in the boardroom that's not revelation knowledge. #CelebApprentice _E_ Give your goals substance. Imbue them with a value that exceeds the monetary. Make them count on as many levels as you can. 
_E_ Congrats @TrumpChicago for being named #3 Best Business Hotel in Chicago in @TravlandLeisure's 2014 World's Best __HTTP__ _E_ Be prepared there is a small chance that our horrendous leadership could unknowingly lead us into World War III. _E_ RT @TheFive: Trump just won on law & order and now he's delivering the goods. @jessebwatters #thefive _E_ A former classmate Roy Eaton has published a great book "Makers Shakers & Takers" – check it out __HTTP__ _E_ .@TraceAdkins great job on FOX this morning. Keep up the good work! _E_ The two dumbest interviews in history may go down as Lance Armstrong who is being sued by everyone in the world & Michael Douglas. _E_ I will be going to Trump Links at Ferry Point for the official opening of this long delayed (but future NYC treasure) course. Great job D _E_ My @foxandfriends int. destroying Schneiderman's frivolous suit which he brought after meeting Obama on Thurs. __HTTP__ _E_ Having a vision for something can be a very powerful force for accomplishment. Midas Touch _E_ The Republican Party of New York has been conditioned to lose and there is no excuse for this. Leadership must move fast and decisively! _E_ If their highly unethical behavior including begging me for ads isn't questionable enough they have endorsed a candidate who can't win. _E_ ... than his destruction of Scotland's magnificent lands.@AlexSalmond _E_ MY POSITION ON VISAS#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Congrats to @FLGovScott on today's inauguration and having done a great job! _E_ Washington is in total gridlock—no trust no leadership—very interesting! _E_ North Korea is behaving very badly. They have been playing the United States for years. China has done little to help! _E_ Shock! Obamacare's high risk pool spending DOUBLED government estimates __HTTP__ @BarackObama is bankrupting this country! _E_ "Once you know you love your job never stop and never give up." – Think Like a Billionaire _E_ Think big. Stay focused. 
Be passionate. Don't ever give up _E_ If you can count the amount of time you put into a project on your fingers then you haven't spent enough time on it. _E_ Did A Rod really try to buy the papers that would implicate him re. drugs wow that would be the end a disaster! _E_ P.S. 42 in Queens is getting a truckload of food and much needed supplies for Rockaway residents #HurricaneSandyRelief _E_ Ralph Norman ran a fantastic race to win in the Great State of South Carolina's 5th District. We are all honored by your success tonight! _E_ Join me this Thursday in Wilmington Ohio at noon! #ImWithYouTickets: __HTTP__ __HTTP__ _E_ It was an honor to be with @MittRomney the night he clinched the nomination. He will defeat @BarackObama and be a tremendous POTUS. _E_ Just out: Neera Tanden Hillary Clinton adviser said "Israel is depressing." I think Israel is inspiring! _E_ If NFL fans refuse to go to games until players stop disrespecting our Flag & Country you will see change take place fast. Fire or suspend! _E_ UPCOMING RALLIES JOIN ME!TOMORROWFletcher NC @ 12pm. __HTTP__ OH @ 7pm. __HTTP__ _E_ His spending is reckless: @BarackObama will set a record fourth year of a $1 trillion budget deficit. __HTTP__ _E_ Join me LIVE from the Rose Garden at 1:30pmE with Prime Minister Alexis Tsipras of Greece. __HTTP__ __HTTP__ _E_ The exclusive home of @PGATOUR's @CadillacChamp @TrumpDoral sits on 800 beautiful acres in the center of Miami __HTTP__ _E_ Order my book CRIPPLED AMERICA for your holiday gifts. I will be signing books for the next two weeks! __HTTP__ _E_ "Donald Trump to crown @FIU as Miss Universe venue" __HTTP__ via @MiamiHerald _E_ A really nice article about the Blue Monster from "The Street." __HTTP__ _E_ Donald Trump promises 'world class' Crandon Park golf course __HTTP__ via @WPLGLocal10 by @GlennaOn10 _E_ RT @DRUDGE_REPORT: 'Win lose deal that benefits Iran and hurts United States'... 
__HTTP__ _E_ Failed Presidential Candidate Mitt Romney is having a news conference tomorrow to criticize me. (1/2) _E_ We've gone from $10 trillion that the president inherited from all prior presidents to $16 trillion @MittRomney _E_ Will be on Fox & Friends in five minutes enjoy and good morning! _E_ We have made more progress in the last nine months against ISIS than the Obama Administration has made in 8 years.Must be proactive & nasty! _E_ Sorry I never went bankrupt and don't wear a wig (it's all mine)! _E_ It's Thursday. How many people have lost their healthcare today? _E_ "One man with courage is a majority." Thomas Jefferson _E_ Remember when I recently said that Brussels is a hell hole and a mess and the failing @nytimes wrote a critical article. I was so right! _E_ Oil is double the price now compared to last year OPEC is laughing at @BarackObama. _E_ Congratulations to the Miss USA Pageant it was the #1 telecast of the night among ABC CBS NBC and Fox. A great show and a huge success. _E_ Thank you. __HTTP__ _E_ Gas prices are way too high. With an economy contracting and lower demand how do OPEC & the speculators get away with this?! _E_ On the shores of the Lake Norman @Trump_Charlotte features a world class course designed by @SharkGregNorman __HTTP__ _E_ It was recently reported that 3rd rate $ losing @Politico is a foil for the Clintons. Questions given to Clinton in advance. No credibility. _E_ Hillary was involved in the e mail scandal because she is the only one with judgement so bad that such a thing could have happened! _E_ Wow! New National Zogby Poll just out:.TRUMP 45. CRUZ 13. RUBIO 8. Big numbers. _E_ Before I or anyone saw the classified and/or highly confidential hacking intelligence report it was leaked out to @NBCNews. So serious! _E_ Retail sales are at record numbers. We've got the economy going better than anyone ever dreamt and you haven't seen anything yet! _E_ I met a Trump Twitter hater last night (well known). 
As he came near me he nervously said Mr. Trump it is an honor to meet you sir! Nice _E_ The Iraqi Army is useless. President Obama stay the hell out of Iraq (we should never have been there in the first place). _E_ .@tedcruz should not make statements behind closed doors to his bosses he should bring them out into the open more fun that way! _E_ Obama's goal of 1 million electric car sales is a little off by over 910000 __HTTP__ $100B of our money wasted! _E_ When I was 18 people called me Donald Trump. When he was 18 @BarackObama was Barry Soweto. Weird. _E_ Thank you Miami! In 6 days we are going to WIN the GREAT STATE of FLORIDA and we are going to win back the White... __HTTP__ _E_ Tweet me your questions for the next #trumpvlog.... _E_ I just bought stock in Tiffany & Company and McDonald's. Two ends of the spectrum but I like both companies. _E_ Miss Alabama @_KatherineWebb stopped by to say hello today. __HTTP__ _E_ Am in Bedminster for meetings & press conference on V.A. & all that we have done and are doing to make it better but Charlottesville sad! _E_ The new hot term that they have recently invented is POLAR VORTEX give me a break! _E_ Thank you for your support on my way now! See you soon. #TrumpTrain __HTTP__ _E_ If we can help little #CharlieGard as per our friends in the U.K. and the Pope we would be delighted to do so. _E_ Via @fitsnews:"Donald Trump Surges In New Hampshire Poll: MOGUL REALITY STAR EMERGES AS GRANITE STATE'S 'ANTI BUSH' __HTTP__ _E_ Castro Chavez and Ahmadinejad are all anxiously awaiting our election results. They are praying Obama wins. _E_ Border agent: We might as well abolish our immigration laws altogether __HTTP__ _E_ In the East it could be the COLDEST New Year's Eve on record. Perhaps we could use a little bit of that good old Global Warming that our Country but not other countries was going to pay TRILLIONS OF DOLLARS to protect against. Bundle up! 
_E_ Go as far as you can see when you get there you'll be able to see farther. J.P. Morgan _E_ Congratulations to @WWERaw on passing 1000 episodes. @WWE is still going strong after all these years @VinceMcMahon is great! _E_ Fake News CNN is looking at big management changes now that they got caught falsely pushing their phony Russian stories. Ratings way down! _E_ The media is going crazy. They totally distort so many things on purpose. Crimea nuclear the baby and so much more. Very dishonest! _E_ Entrepreneurs: Keep the big picture in mind. There are always opportunities & possibilities and thinking too small can negate a lot of them _E_ Governor Rick Scott of Florida did really poorly on television this morning. I hope he is O.K. _E_ Trump promises special session to repeal Obamacare: __HTTP__ _E_ Tweet me your New Year's resolution to make America great again! #TrumpNewYearsRes __HTTP__ _E_ The U.S. has appealed ro Russia not to intervene in Ukraine Russia tells U.S. they will not become involved and then laughs loudly! _E_ I will once again write a $1 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_ Will be doing Fox & Friends at 7 A.M. (1 hour). ENJOY! _E_ On my way! __HTTP__ _E_ "A brand is not a logo. A brand is the promise you put out there and the experience you deliver." – Midas Touch _E_ One of the best moves I made early in my career was buying the air rights from Tiffany's flagship. Trump Tower gleams over Fifth Avenue. _E_ WHAT THEY ARE SAYING ABOUT MIKE PENCE "DOMINATING" THE DEBATE: __HTTP__ #VPDebate _E_ Somebody with aptitude and conviction should buy the FAKE NEWS and failing @nytimes and either run it correctly or let it fold with dignity! _E_ Fiscal mismanagement of cash costing US Taxpayer billions cut fraud and waste before cutting funding for Seniors. _E_ Stock market hit yet another all time record high yesterday. There is great confidence in the moves that my Administration.... 
_E_ I want to end the day by saying there is no check I would rather write than that to a good charity designated by our President. _E_ Destiny has a part to play in your life and in your business so give it a chance to work. Think Like a Champion _E_ Why doesn't @JebBush in his ads show my answer to his statement in the debate? _E_ Raleigh North Carolina was fantastic last night. Such incredible spirit. We all want to and will MAKE AMERICA GREAT AGAIN! _E_ RT @EricTrump: Very proud of what my father has accomplished in the past 7 months Wishing him amazing luck and success tonight! #NVcaucus ... _E_ Our country has a big heart. And it's a point of national pride that we take care of our own. #TimeToGetTough (cont) __HTTP__ _E_ .@AndreaTantaros's radio show is a great addition to talk radio. She is sharp talented & great sense of humor. Congratulations. _E_ Attorney General Bill Shuette will be a fantastic Governor for the great State of Michigan. I am bringing back your jobs and Bill will help! _E_ Once again #MSM is dishonest. Schlonged is not vulgar. When I said Hillary got schlonged that meant beaten badly. _E_ Failed presidential candidate Mitt Romney the man who choked and let us all down is now endorsing Lyin' Ted Cruz. This is good for me! _E_ Spoiler @dennisrodman has really got his act together so far on the upcoming season of @CelebApprentice... _E_ Instead of driving jobs and wealth away AMERICA will become the world's great magnet for innovation and job creati... __HTTP__ _E_ Poor @JebBush spent $50 million on his campaign I spent almost nothing. He's bottom (and gone) I'm top (by a lot). That's what U.S. needs! _E_ RT @realDonaldTrump: "President Trump is not getting the credit he deserves for the economy. Tax Cut bonuses to more than 2000000 workers... _E_ RT @MarkHalperin: Utah Speaker of the House announces endorsement of @realDonaldTrump. 
Says @DonaldJTrumpJr played a big role _E_ Now that the ObamaCare website contractor has been terminated for obvious incompetence is the person who hired them going to be fired? _E_ Great article by @WayneRoot @theblaze Obama's College Classmate: 'The Obama Scandal Is at Columbia' __HTTP__ _E_ Awarded 5 stars from @ForbesInspector @TrumpTO offers 261 rooms & 115 suites in the center of downtown Toronto __HTTP__ _E_ Obama has now become the weakest POTUS against China yuan just hit record high against dollar __HTTP__ Very sad! _E_ NYC terrorist was happy as he asked to hang ISIS flag in his hospital room. He killed 8 people badly injured 12. SHOULD GET DEATH PENALTY! _E_ Lyin' Ted Cruz is now trying to convince prople that his problems with The National Enq.were caused by me. I had NOTHING to do with story! _E_ Every day Pastor Saeed is imprisoned by Iran is an indictment on Obama's 'diplomacy.' #SaveSaeed _E_ While Derek Jeter is training every day in the off season reports come out that A Rod is partying all over the country. Go Derek. @Yankees _E_ Homeland Security and law enforcement are on alert & closely watching for any sign of trouble. Our borders are far tougher than ever before! _E_ The new amnesty bill is over 1000 pages. It is another monstrosity a la ObamaCare. _E_ Just did theToday Show to announce that Baton Rouge Louisiana will host the Miss USA Pageant on Sunday June 8th. @Miss USA. _E_ Looking forward to addressing @TheEconomicClub on December 15th at the Marriot Marquis Washington DC. _E_ Look I have always liked Lance Armstrong I just hated what he did to himself including recently. His life will now be hell. _E_ I am lowering taxes far more than any other candidate. Any negotiated increase by Congress to my proposal would still be lower than current! _E_ "Get in. Get it done. Get it done right. Get out." – My father Fred C. Trump _E_ "Yesterday's home runs don't win today's games." Babe Ruth _E_ On at 9:00A.M. or 10:00 A.M. 
(depending on your location) on Fox is a tough but really good interview with Chris Wallace. Enjoy! _E_ Ted is the ultimate hypocrite. Says one thing for money does another for votes. __HTTP__ _E_ Sports fans should never condone players that do not stand proud for their National Anthem or their Country. NFL should change policy! _E_ I'll be on @gretawire On the Record tonight to talk about the ObamaCare fiasco 7 pm on Fox News _E_ Donald Trump: GOP Has 'Nuclear Weapon' In Fiscal Cliff Negotiation But They Don't Know It __HTTP__ via @mediaite _E_ l still think @Boeing should just bite the bullet & get rid of the new batteries in the 787. Those batteries will always be a problem! _E_ Great meeting with @THEHermanCain yesterday in Trump Tower. Great guy! _E_ Instead of creating new jobs Obamacare is destroying jobs. And the worst part is yet to come since the truly (cont) __HTTP__ _E_ .@lisarinna is at the top of her game in the upcoming season of @CelebApprentice All Stars. Our fans love her. _E_ Come on Republican Senators you can do it on Healthcare. After 7 years this is your chance to shine! Don't let the American people down! _E_ Thank you Foxconn for investing $10 BILLION DOLLARS with the potential for up to 13K new jobs in Wisconsin! MadeInTheUSA __HTTP__ _E_ Miami's top destination @TrumpDoral's remodeled Royal Palm Pool offers 18 luxurious cabanas __HTTP__ _E_ Be sure to watch Oprah today (4 pm on Channel 7) I'll be on with my entire family and it will be an entertaining hour.. __HTTP__ _E_ The Navy Yard shooting is a horrible disaster. If we don't clean up OUR COUNTRY of the garbage soon we are just going to do a death spiral! _E_ I don't know what will happen with the lawsuit against dummy @billmaher but have an obligation to charity to bring it. _E_ Great Strategic & Policy CEO Forum today with my Cabinet Secretaries and top CEO's from around the United States.... __HTTP__ _E_ It doesn't matter who you vote for it matters who is counting the votes. 
Be careful of voter fraud! _E_ Join me on Monday April 4th in Milwaukee! #WIPrimary #Trump2016Tickets: __HTTP__ __HTTP__ _E_ People love gossip. It's the biggest thing that keeps the entertainment industry going. @TheEllenShow _E_ If @OMAROSA is not in the Board Room I can't fire her. @latoyajackson made a strategic mistake. _E_ Sneak peek of Trump's trio of spectacular new seaside holes on the famed Ailsa course/@TrumpTurnberry __HTTP__ _E_ Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut and it will only get better. MUCH MORE TO COME! _E_ Another cover up. Obama won't disclose how many illegal immigrants he has released into our country __HTTP__ No surprise. _E_ It was my great honor to welcome Mayor's from across America to the WH. My Administration will always support local government and listen to the leaders who know their communities best. Together we will usher in a bold new era of Peace and Prosperity! __HTTP__ __HTTP__ _E_ ...and job losses. American companies must be prepared to look at other alternatives. _E_ CNN which is totally biased in favor of Clinton should apologize. They knew they were wrong. __HTTP__ _E_ My interview with @jheil & @MarkHalperin at @WollmanRink airing at 5PM on @bpolitics. __HTTP__ _E_ Look forward to seeing final results of VoteStand. Gregg Phillips and crew say at least 3000000 votes were illegal. We must do better! _E_ General James Mad Dog Mattis who is being considered for Secretary of Defense was very impressive yesterday. A true General's General! _E_ ....The Wall will be paid for directly or indirectly or through longer term reimbursement by Mexico which has a ridiculous $71 billion dollar trade surplus with the U.S. The $20 billion dollar Wall is "peanuts" compared to what Mexico makes from the U.S. NAFTA is a bad joke! _E_ What a dumb mistake AOL made buying the @huffingtonpost. How much longer will Arianna last I predict not much. 
_E_ I was just told by one of the top @PGATOUR players that my golf courses are the most elite in the country. Very nice compliment I agree. _E_ Last Friday's gaffe by @BarackObama claiming that the private sector is doing fine is illustrative.Everything to him revolves around gov't _E_ 'President Elect Donald J. Trump Nominates Elaine Chao as Secretary of the Department of Transportation' __HTTP__ _E_ Via @TheScotsman: "Donald Trump's @TrumpTurnberry plan gets go ahead" __HTTP__ _E_ Why don't we ask the Navy SEALs who killed Bin Laden? They don't seem to be happy with Obama claiming credit. All he did is say O.K. _E_ .@ColinCowherd said such nice things about me during the debate that I thought I'd do his show @TheHerd on Monday (2:30pm EST). _E_ Heading to Boston to see another huge crowd! My friend Tom Brady is a great competitor and golf partner. __HTTP__ _E_ The World Economic Forum now ranks the US the fifth most competitive economy in the world. We have fallen from first under @BarackObama. _E_ The Country is being run just like the stadium. _E_ When will people and the media start to apologize to me for my statement Mexico is sending.... which turned out to be true? El Chapo _E_ How can a dummy dope like Harry Hurt who wrote a failed book about me but doesn't know me or anything about me be on TV discussing Trump? _E_ Entrepreneurs: Cover your bases. Know everything you can about what you're doing. _E_ The House Republicans and Democrats are finally unanimous! Yesterday they voted down @BarackObama's $3.6T budget (cont) __HTTP__ _E_ Hillary Clinton needs to address the racist undertones of her 2008 campaign. #FlashbackFriday __HTTP__ _E_ Our GREAT VETERANS can now connect w/ their VA healthcare team from anywhere using #VAVideoConnect available at: __HTTP__ __HTTP__ _E_ Tom marbles in his mouth Brokaw once thanked me for the great success of the Apprentice for NBC. 
Now he calls (cont) __HTTP__ _E_ RT @DanScavino: Last nights winner was clear & it will be proven time & time again lets #MAGA!! Lets WIN!! #TrumpTrain __HTTP__ _E_ Sugar @Lord_Sugar—you should say thank you Donald like a good little boy... ... _E_ RT @AlanDersh: We should stop talking about obstruction of justice. No plausible case. We must distinguish crimes from pol sins __HTTP__ _E_ Obama's speech indicates he wants to change this country as we know it wow he really feels emboldened. _E_ General says that the Armed Forces will be severely weakened if the large scale rape and sexual abuse problem is not brought under control. _E_ Polls show that the hurricane had a huge positive effect for Obama on his win isn't that ridiculous? _E_ .@IvankaTrump will lead the U.S. delegation to India this fall supporting women's entrepreneurship globally.#GES2017 @narendramodi _E_ The Russia Trump collusion story is a total hoax when will this taxpayer funded charade end? _E_ With all the talk of fiscal responsibility at the @DNC convention yesterday it was ironic that the debt passed $16T. _E_ Great honor to be inducted into the NJ Boxing Hall of Fame last night. Thank you! Timing could not have been better! __HTTP__ _E_ Congratulations to my daughter Ivanka and her husband Jared on the birth of their daughter Arabella Rose yesterday. _E_ Always bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_ Looking forward to hosting our heroes from the Wounded Warrior Project (@WWP) Soldier Ride to the @WhiteHouse on Th... __HTTP__ _E_ Stop The China Curse Pass the Chinese Currency Bill! _E_ ...... Circulation is way down and all he thinks about are his bad food restaurants. @CondeNastCorp _E_ This is the One Year Anniversary of my Presidency and the Democrats wanted to give me a nice present. 
#DemocratShutdown _E_ Congratulations to Sung Hyun Park on winning the 2017 @USGA #USWomensOpen _E_ Very exciting—tomorrow night at Madison Square Garden I get inducted into the @WWE Hall of Fame. _E_ Looking forward to hosting the @FloridaGOP "House Majority 2014 Golf Tournament" at Trump Int'l West Palm Beach on Jan. 27th. _E_ New Yorkers will get a chance to see a film for free this summer from @attnyc and @tribecafilmfest. My choice? Citizen Kane #FilmForAll _E_ #ICYMI: Announcement of Air Traffic Control Initiative Watch __HTTP__ _E_ Just watched Facebook COO Sheryl Sandberg on 60 Minutes. She should spend more time trying to get the F stock price up & less on her ego! _E_ With all my Administration has done on Legislative Approvals (broke Harry Truman's Record) Regulation Cutting Judicial Appointments Building Military VA TAX CUTS & REFORM Record Economy/Stock Market and so much more I am sure great credit will be given by mainstream news? _E_ Fans shouldn't worry. We have adjusted the filming schedule of the upcoming 13th season of @CelebApprentice appropriately due to the storm. _E_ Just watched Brian Williams on @TODAYshow very sad! Brian should get on with a new life and not start all over at @msnbc. Stop apologizing _E_ The many losers and haters never have the brains or stamina to become truly successful! _E_ ...when they have no environmental restrictions! America' s workers need us. __HTTP__ _E_ Thank you Houston Texas! #AmericaFirst #Trump2016 __HTTP__ _E_ Why Isn't the Senate Intel Committee looking into the Fake News Networks in OUR country to see why so much of our news is just made up FAKE! _E_ ... Doesn't seem like they have a coherent strategy right now. _E_ The FAKE & FRAUDULENT NEWS MEDIA is working hard to convince Republicans and others I should not use social media but remember I won.... _E_ Yes I will be live tweeting during the final debate this coming Monday. 
_E_ Good luck #TeamUSA#OpeningCeremony #Rio2016 __HTTP__ _E_ This was a great evening I would like to thank everyone for their wonderful support. _E_ Our leaders are terrible. The government spends over $50B a day. It can't find cuts for less than 2 days of spending?! Sad! _E_ Career Advice from Donald Trump __HTTP__ via @BNDarticles by @brittneyplz _E_ This may be the worst football game ever played by one team Denver! Hard to watch. _E_ Our enemy China is illegally buying oil from our enemy Iran __HTTP__ China loves it! _E_ Shock Obama WH given three pinocchios for lying about Benghazi emails __HTTP__ _E_ The pessimist sees the difficulty in every opportunity and the optimist sees the opportunity in every difficulty. Pres. Lincoln _E_ .@Lawrence is the poor man's left wing @oreillyfactor(with no ratings)! _E_ RT @SheriffClarke: Happy Father's Day to all dads. My dad. Like father like son @realDonaldTrump supporters to the end. He an Airborne Ra... _E_ Paula Deen made a big mistake in using a forbidden word but must be given some credit fot admitting her mistake. She will be back! _E_ The NRA in Nashville today was amazing. Packed house and standing ovation for Trump. THANKS! _E_ Because I was told I could not do well in Iowa I spent very little there a fraction of Cruz & Rubio. Came in a strong second. Great honor _E_ I will be on Face The Nation (CBS) today at 10:30 A.M. and Media Buzz (Fox News) at 11:00 A.M. Enjoy! _E_ Ivanka is now on Twitter You can follow her @IvankaTrump Have a terrific weekend! _E_ Today we gathered in the East Room to pay tribute to the HEROES whose courageous actions under fire saved so many lives in Alexandria VA. __HTTP__ _E_ See you tomorrow Michigan!Grand Rapids MI tomorrow at noon: __HTTP__ MI tomorrow at 3pm:... __HTTP__ _E_ .@KatrinaCampins Thank you so much for the wonderful statements you made about me on TV. Also keep up the great work! 
_E_ Congratulations to @TrumpChicago and @SixteenChicago for receiving the @AAANews Five Diamond Award again this year! _E_ China is filling the vacuum left by Obama at the UN on the world stage. _E_ Washington is wasting over $2 billion this year on Solyndra type loans. Yet they want to cut military spending. _E_ Via @DailyCaller: Donald Trump: Obama should golf w/ Republicans not his 'local friends' __HTTP__ by @NicholasBallasy _E_ Look at the way Crooked Hillary is handling the e mail case and the total mess she is in. She is unfit to be president. Bad judgement! _E_ Texas @GovAbbott & Lt. Gov. @DanPatrickThank you for todays briefing on hurricane recovery efforts here in TX. Keep up the great work! __HTTP__ _E_ RT @charliekirk11: 3 big wins in 2017 you won't hear:Trump confirmed the most circuit court judges ever in a President's 1st year (all co... _E_ China OPEC and Russia laugh at us. But now thanks to Obama so does Syria. Very sad! _E_ Who is more believable on the state of employment the great @jack_welch or some government bureaucrat who is voting for Obama? _E_ I turned down a meeting with Charles and David Koch. Much better for them to meet with the puppets of politics they will do much better! _E_ The NRA strongly endorses Luther Strange for Senator of Alabama.That means all gun owners should vote for Big Luther. He won't let you down! _E_ Our country needs leadership now. There is total dysfunction in Washington. _E_ Remember this Sunday I am also featured on @datelinenbc at 8PM right before the premiere of All Star @CelebApprentice @nbc likes me! _E_ I have created tens of thousands of jobs and will bring back great American prosperity. Hillary has only created jobs at the FBI and DOJ! _E_ Can you believe that the builder of the failed ObamaCare website was just given a new government contract how stupid is that CLUELESS!!! 
_E_ #VoteTrumpKS #Trump2016March 5 2016 | Wichita Kansas: __HTTP__ __HTTP__ _E_ I wonder what the rest of the world is thinking about the United States as they watch the disgusting and out of control Baltimore riots? _E_ If Crooked Hillary Clinton can't close the deal on Crazy Bernie how is she going to take on China Russia ISIS and all of the others? _E_ How could Obama leave those American heroes out to die in Benghazi? And he continues to lie to the public! _E_ Via @PressClubDC by @snlyngaas: "Trump Says U.S. Brand Has Lost Its Luster" __HTTP__ _E_ No more Clintons or Bushes! __HTTP__ _E_ I will be interviewed by @MariaBartiromo at 6:00 A.M. @FoxBusiness. Enjoy! _E_ .@Team_Mitch Fantastic win we are all proud of you! Your victory speech last night was very gracious to an opponent whose speech was not. _E_ Donald Trump Defends His Big Obama Bombshell: 'It's Not a Publicity Stunt' __HTTP__ via @eonline _E_ Terrible! Just found out that Obama had my wires tapped in Trump Tower just before the victory. Nothing found. This is McCarthyism! _E_ Dummy @KarlRove continues to make and write false statements. He still thinks Romney won he should get a life! _E_ RT @TeamTrump: .@HillaryClinton & @timkaine think you're #Deplorables & #BasementDwellers. @realDonaldTrump & @mike_pence think you're PATR... _E_ 08 02 2011 19:56:31 _E_ Trump Int'l Hotel & Tower Vancouver's original twisting design gives every unit a distinct view __HTTP__ A landmark! _E_ Congratulations to our great Women's Olympic Soccer team @ussoccer on their gold medal. They made us all proud! _E_ I appreciate the GOP candidates who remain strong on border security. They know I am right. A nation without borders cannot survive. _E_ RT @The_Trump_Train: @realDonaldTrump Make no mistake we are going to put the interest of AMERICAN CITIZENS FIRST! The forgotten men & w... 
_E_ Thank you to Eli Lake of The Bloomberg View The NSA & FBI...should not interfere in our politics...and is Very serious situation for USA _E_ Obama sadly has no business or private sector background and it shows. _E_ Watch You've Got Donald Trump at __HTTP__ _E_ The DJT Foundation unlike most foundations never paid fees rent salaries or any expenses. 100% of money goes to wonderful charities! _E_ Iraq is falling apart fast two trillion dollars and so many deaths Bush got us in and Obama took far too long to get us out! _E_ Such a serious problem for Ted & the GOP. Great doubt Dems will sue! Let's all work together to solve this problem. __HTTP__ _E_ Republican Senators will not let the American people down! ObamaCare premiums and deductibles are way up it was a lie and it is dead! _E_ Tune in tonight at 10 pm on NBC for another exciting episode of The Apprentice and see the Dog Whisperer make an appearance. _E_ RT @mike_pence: There's one clear choice in this election to create jobs and grow the American economy. #VPDebate __HTTP__ _E_ Now Chinese state run companies are taking over our coal market __HTTP__ China wants to deplete our resources here at home. _E_ One of the reasons I assume I was inducted into the @WWE Hall of Fame is that Vince McMahon and I have the all time highest ratings... _E_ Clinton Foundation's Fundraisers Pressed Donors to Steer Business to Former President __HTTP__ _E_ .@katyperry Katy what the hell were you thinking when you married loser Russell Brand. There is a guy who has got nothing going a waste! _E_ I wouldn't use @Richard_Meier to design a doghouse let alone a house or building! _E_ The real outsourcer @BarackObama is funding German automakers with the GM bailout money __HTTP__ How does that help us? _E_ Thank you for your support! Being #PoliticallyCorrect will NOT #MakeAmericaGreatAgain! 
__HTTP__ __HTTP__ _E_ Via @Newsmax_Media: "@RepMattSalmon: Obama 'Didn't Lift a Finger' to Help Free Marine in Mexican Prison" __HTTP__ _E_ Sad thing is Rolling Stone was (is) a dead magazine with big downward circulation and now for them at last people are talking about it! _E_ I never thought I'd be saying this but I've really enjoyed @RichLowry on television lately and he was terrific hosting @seanhannity _E_ Just tried watching Saturday Night Live unwatchable! Totally biased not funny and the Baldwin impersonation just can't get any worse. Sad _E_ We don't need a Secretary of Business to understand business we need a president who understands business and I do @MittRomney _E_ Via @CarrGaz: "Trump's grand plans for @TrumpTurnberry resort get the green light" __HTTP__ _E_ "You can always become better." @TigerWoods _E_ I hope the NY tax payer appreciates the millions Schneiderman is about to waste on a small case. I will litigate to victory. _E_ Violent crime is rising across the United States yet the DNC convention ignored it. Crime reduction will be one of my top priorities. _E_ James Comey better hope that there are no tapes of our conversations before he starts leaking to the press! _E_ The thing I like best about Rex Tillerson is that he has vast experience at dealing successfully with all types of foreign governments. _E_ Speech transcript at Arab Islamic American Summit __HTTP__ __HTTP__ #POTUSAbroad _E_ My @msnbc int w/ @krystalball at #WHCD on my 2016 timetable saving Social Security & Making America Great Again! __HTTP__ _E_ "God's word is the same yesterday and today and a million years from now." @Franklin_Graham _E_ There is no challenge too great no dream outside of our reach! Thank you Selma North Carolina!#ICYMI watch here... __HTTP__ _E_ Great news I'm now leading in most polls w/ new CNN poll also having me #1. NBC I am #1 in NH by a lot #2 in Iowa close & gaining. _E_ Read this about @lawrence...... 
__HTTP__ _E_ The United States needs great deals and fast. We have to make our country rich again in order to MAKE OUR COUNTRY GREAT AGAIN! _E_ President Obama seems so fawning and desperate to make a deal with Iran that lots of bad results can occur. Be cool and be careful! _E_ People buy deals & immediately put them into bankruptcy in order to make better deals. It's a very effective & commonly used business tool. _E_ Iran humiliated the United States with the capture of our 10 sailors. Horrible pictures & images. We are weak. I will NOT forget! _E_ The Wall Street Journal has reported that Obama's food stamp policies are ushering in a massive 'food stamp crime wave.' #TimeToGet Tough _E_ I will be traveling to Florida tomorrow to meet with our great Coast Guard FEMA and many of the brave first responders & others. _E_ On @foxandfriends in two minutes! _E_ Thank you! I miss my father. __HTTP__ _E_ .@MichaelPhelps you are the greatest Olympic champion of them all. Fantastic job! _E_ It is a miracle how fast the Las Vegas Metropolitan Police were able to find the demented shooter and stop him from even more killing! _E_ I have decided to postpone my trip to Israel and to schedule my meeting with @Netanyahu at a later date after I become President of the U.S. _E_ It's crunch time. This Sunday's All Star Celebrity @ApprenticeNBC's task will separate the winners from the losers. _E_ Getting ready to leave for Cincinnati in the GREAT STATE of OHIO to meet with ObamaCare victims and talk Healthcare & also Infrastructure! _E_ On my way to Charleston/Mount Pleasant South Carolina. Big crowd. Look forward to it! #USSYorktown __HTTP__ _E_ For the first time in the history of military operations a country has broadcast what when and where they will be doing in a future attack! _E_ One of the keys to thinking big is total focus." 
– THE ART OF THE DEAL _E_ When will our country stop wasting money on global warming and so many other truly STUPID things and begin to focus on lower taxes? _E_ Vets mistreated NO border security? I'm with @V4SA this Tuesday 9/15 to #MakeAmericasMilitaryGreatAgain! Join us! __HTTP__ _E_ After 7 months of investigations & committee hearings about my collusion with the Russians nobody has been able to show any proof. Sad! _E_ .@MittRomney's poll numbers are looking really good. One more great debate performance and it will be a total knockout. _E_ Obama just endorsed Crooked Hillary. He wants four more years of Obama—but nobody else does! _E_ To all journalists look into the financial dealings of Scottish Parliament members with Vattenfall...Follow the money. _E_ Obama just had another trillion dollar budget deficit for the fourth year in a row. At least he is consistent. _E_ Just landed in Ohio. Thank you America I am honored to win the final debate for our MOVEMENT. It is time to... __HTTP__ _E_ .@jack_welch is correct these reporters would not have been so brave while Jack was running GE. _E_ Congratulations to @CharlieCrist who has now lost a statewide election in Florida as a Republican Independent & Democrat. _E_ "To be a visionary and to be a billionaire you have to chase impossibilities. Few ever get rich easily." – Think Like a Billionaire _E_ .@AnnCoulter's new book Adios America! The Left's Plan to Turn Our Country into a Third World Hellhole is a great read. Good job! _E_ New Iowa poll. Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ "Let your passion for your work carry you through all the setbacks they can throw at you." – Trump Never Give Up _E_ Higher Taxes kill job creation cut wild government spending and waste. _E_ .@FoxNews you should be ashamed of yourself. I got you the highest debate ratings in your history & you say nothing but bad... _E_ The only candidate who can get 1145 delegates is @MittRomney. The primary is over. 
_E_ Set your sights and aim high. You never know what you can achieve until you focus on achieving it. Midas Touch _E_ Celebrity Apprentice in 15 minutes don't miss it! _E_ .@nyrangers did a great job of winning tonight played like champions! _E_ Hillary Clinton didn't go to Louisiana and now she didn't go to Mexico. She doesn't have the drive or stamina to MAKE AMERICA GREAT AGAIN! _E_ Obama claims that he needs an extra $4B to secure the border. Well then he should not have wasted $5B on the ObamaCare website. _E_ Another broken promise by @BarackObama: @ObamaCare actually increases income inequality __HTTP__ It must be fully repealed! _E_ If you're going to think think big The Art of the Deal _E_ Nobody is watching @Morning_Joe anymore. Gone off the deep end bad ratings. You won't believe what I am watching now! _E_ Congress must protect our borders first. Amnesty should be done only if the border is secure and illegal immigration has stopped. _E_ Congrats @TrumpDoral for being named one of the Most Notable Openings of 2014 from @BizBash: __HTTP__ _E_ ObamaCare must be completely repealed. A recent report from UBS shows that it is the number one reason employers are not hiring. _E_ Thank you @greta. #ImWithYou __HTTP__ _E_ Doing David Letterman @Late_Show tonight at 11:30. 1st nite of Sweeps.Going into the lion's den but I've been there many times before. Enjoy _E_ Thank you Appleton Wisconsin!#WIPrimary #Trump2016 __HTTP__ __HTTP__ _E_ A great victory in Scotland ... __HTTP__ __HTTP__ _E_ Here's @Joan_Rivers. She & @IvankaTrump make a terrific team as my advisors. #CelebApprentice _E_ There are no buyers for the worthless @NYDailyNews but little Mort Zuckerman is frantically looking. It is bleeding red ink a total loser! _E_ Looking forward to being hosted by @saintanselm for Politics & Eggs next Tuesday. See you in Manchester! #NHPolitics _E_ I'll be making a major announcement on President Obama next week stay tuned! 
_E_ The reason that President Obama did NOTHING about Russia after being notified by the CIA of meddling is that he expected Clinton would win.. _E_ Congratulations to @TrumpNewYork @TrumpChicago @TrumpWaikiki @TrumpToronto on your Forbes Five Star ratings @ForbesInspector _E_ Nick Adams new book Green Card Warrior is a must read. The merit based system is the way to go. Canada Australia! @foxandfriends _E_ Thank you @Morning_Joe for throwing the pathetic reporter from the failing and money losing Daily Beast off the air. Really cool! _E_ Everyone is now saying how right I was with illegal immigration & the wall. After Paris they're all on the bandwagon. _E_ Please only respond by tweet @lawrence because like everyone else I don't watch your show. _E_ Ready to lead. Ready to Make America Great Again. #Debate #MAGA _E_ I just retained Sir Nick Faldo to be the architect of the Red Course at Doral he will do a tremendous job! @NickFaldo006 _E_ Thank you South Bend Indiana! Everyone get out & #VoteTrump tomorrow! #INPrimary __HTTP__ __HTTP__ _E_ He that is good for making excuses is seldom good for anything else. Benjamin Franklin _E_ The Baldwin family is well represented in the 13th season of All Star @CelebApprentice with @StephenBaldwin. Stephen does great. _E_ Wishing everyone a very Happy Holiday season! _E_ RT @LindseyGrahamSC: I support President Trump's desire to re enter the Paris Accord after the agreement becomes a better deal for America... _E_ Now they say obese women may cause Autism in children nonsense they use any excuse. The FDA should immediately (cont) __HTTP__ _E_ Bus crash in Tennessee so sad & so terrible. Condolences to all family members and loved ones. These beautiful children will be remembered! _E_ Today I welcomed the Victory Christian Center School. Good luck @ the Team America Rocketry Challenge! #TARC Watch... __HTTP__ _E_ RT @WhiteHouse: Happy Father's Day! __HTTP__ _E_ I've dealt w/politicians throughout the world. 
My deals are multi faceted transactions which involve many issues. I know the process & win! _E_ Passing what was once a vibrant manufacturing area in Pennsylvania. So sad! #MakeAmericaGreatAgain __HTTP__ _E_ Sorry @Rosie is a mentally sick woman a bully a dummy and above all a loser. Other than that she is just wonderful! _E_ The USC should be ruling any day now on @ObamaCare. Hopefully we will get the right result. _E_ A good example of how our country wastes money... __HTTP__ #trumpvlog _E_ Hope & Change! China now controls a record number of our debt __HTTP__ _E_ China is an international pariah. They are now harassing Japan over its purchase of 3 uninhabited islands __HTTP__ _E_ Let's properly check goofy Elizabeth Warren's records to see if she is Native American. I say she's a fraud! _E_ Success tip: Be ready for problems and be patient there are very few cases of instant gratification. _E_ China is openly sailing warships in our waters & arming countries in our hemisphere including Mexico __HTTP__ Ally? _E_ I will be in Wisconsin until the election. Jobs trade and immigration will be big factors. I will bring jobs back home make great deals! _E_ Heading to beautiful West Virginia to be with great members of the Republican Party. Will be planning Infrastructure and discussing Immigration and DACA not easy when we have no support from the Democrats. NOT ONE DEM VOTED FOR OUR TAX CUT BILL! Need more Republicans in '18. _E_ Trump Tuesday @SquawkCNBC tomorrow at 7:38 AM. _E_ Network news has become so partisan distorted and fake that licenses must be challenged and if appropriate revoked. Not fair to public! _E_ Miss Alabama Katherine Webb has been a truly great representative of the Ms. USA Organization ..We are proud of her! _E_ When foreigners attend our great colleges & want to stay in the U.S. they should not be thrown out of our country. _E_ The @AmSpec interview by Jeffrey Lord: A TRUMP CARD The Donald talks politics and parenting. 
__HTTP__ _E_ The Freedom Caucus will hurt the entire Republican agenda if they don't get on the team & fast. We must fight them & Dems in 2018! _E_ Re: Ashley Judd: Keep @KarlRove away. He already made her a viable candidate. _E_ Our gov't is so pathetic that some of the billions being wasted in Afghanistan are ending up with terrorists __HTTP__ _E_ Great going to Bob Kraft & Bill Belichick of the @Patriots on @TimTebow. Tim is a winner just like them! _E_ The 2013 Trump @MissUniverse Pageant comes to Moscow on November 9th. Airing from Crocus City Hall on @nbc! _E_ I'll be on @gretawire tonight on @foxnews at 10 pm. _E_ The so called angry crowds in home districts of some Republicans are actually in numerous cases planned out by liberal activists. Sad! _E_ RT @EricTrump: We should all take a moment to say a prayer for those who paid the ultimate price — Their bravery and sacrifice allows us t... _E_ On Friday @VPBiden said that China has better cities and airports than the US. Well what has @BarackObama done about it the last 3 years?! _E_ Please tune in January 15th at 6:00AM EST and 6:00PM EST to the QVC network to watch my wife @MELANIATRUMP... _E_ Plan a perfect weekend for the holidays in NYC's hottest neighborhood using @TrumpSoHo's 20% offer __HTTP__ _E_ The Village @Trump_Charlotte offers a variety of 5 Star dining experiences for everyday dining & catered affairs __HTTP__ _E_ Wow! What a great night. Thank you to all of the viewers and congratulations to @StephenAtHome __HTTP__ @colbertlateshow _E_ RT @FoxNews: New Poll Shows @POTUS Approval at 50 Percent __HTTP__ _E_ We must do everything possible to keep this horrible terrorism outside the United States. _E_ Interesting article from highly respected Wayne Allyn Root __HTTP__ _E_ With an award winning course designed by Tom Fazio Trump National Philadelphia is a 360 acre exclusive jewel __HTTP__ _E_ The President Changed. 
So Has Small Businesses' Confidence __HTTP__ _E_ Wow so many Fake News stories today. No matter what I do or say they will not write or speak truth. The Fake News Media is out of control! _E_ A few of the many clips of John McCain talking about Repealing & Replacing O'Care. My oh my has he changed complete turn from years of talk! __HTTP__ _E_ There will be no amnesty!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ Do you believe that Hillary Clinton now wants Obamacare for illegal immigrants? She should spend more time taking care of our great Vets! _E_ We need a tax system that is fair and smart one that encourages growth savings and investment. It's time to (cont) __HTTP__ _E_ I believe this book will rock a lot of people. Don't just read #TImeToGetTough but share it with your friends and family! RushLimbaugh _E_ RT @IvankaTrump: We must reform our tax code so that all Americans can succeed in our modern economy & achieve the American Dream! #TaxRefo... _E_ My @SquawkCNBC #TrumpTuesday interview discussing how @MittRomney has to get tough real unemployment & bias press __HTTP__ _E_ Get ready for the fireworks between @OMAROSA & @latoyajackson in 13th season of All Star @CelebApprentice! Neither one will back down. _E_ .@MelaniaTrump looks amazing in 2000 @SInow! __HTTP__ _E_ Huma Abedin the top aide to Hillary Clinton and the wife of perv sleazebag Anthony Wiener was a major security risk as a collector of info _E_ Newly released emails prove that scientists have manipulated data on global warming. The data is unreliable. __HTTP__ _E_ Tremendous cold wave hits large part of U.S. Lucky they changed the name from global warming to climate change G.W. just doesn't work! _E_ Welcome to the new reality be careful. Retirement ages will be pushed to 80 due to the incompetence of our leaders. __HTTP__ _E_ Just released @CNN Poll gives me a big 13 point lead in Iowa. Change your false story failing @nytimes. Thank you Iowa! 
_E_ It's easy to see why Americans are sick of career politicians and both parties. _E_ Why has Obama let China and others take our jobs? _E_ Mexico's biggest drug lord escapes from jail. Unbelievable corruption and USA is paying the price. I told you so! _E_ Isn't it funny that I am now #1 in the money losing @HuffingtonPost (poll) and by a big margin. Dummy @ariannahuff must be thrilled! _E_ My son Don and his wife Vanessa just had a beautiful baby boy named Spencer Frederick very thrilling. _E_ Excited to be heading home to see the House pass a GREAT Tax Bill with the middle class getting big TAX CUTS!#MakeAmericaGreatAgain _E_ 3. You should tweet your pick for MVP using the celebrity's name followed by the hashtag #CelebApprenticeMVP. _E_ .@VanityFair Magazine is doing really poorly. It has gotten worse and worse over the years and has lost almost all of it's former allure! _E_ Ted Cruz only talks tough on immigration now because he did so badly in S.C. He is in favor of amnesty and weak on illegal immigration. _E_ I hope you all are looking at the Donald J. Trump Signature Collection of ties shirts & cufflinks @Macys—great for Christmas & holidays. _E_ Great day yesterday at @TrumpDoral unveiling the new Gary Player Villa __HTTP__ Gary is a champion and a great guy. _E_ Watch this video to see how bad wind turbines are for the environment __HTTP__ _E_ The @TuckerCarlson opening statement about our once cherished and great FBI was so sad to watch. James Comey's leadership was a disaster! _E_ I just learned that @politico has no credibility total phonies that don't report the truth. A puppet of Obama? _E_ Via @TVbytheNumbers: 'Celebrity Apprentice' is Number 1 among ABC CBS & NBC for its Second Hour from 10 11 p.m. __HTTP__ _E_ Karl Rove is now making excuses for his total wasting of $400M—not one win—(the Republicans better get smart next time)... _E_ .@HillaryClinton ITS CALLED EXTREME VETTING! 
#Debates2016 __HTTP__ _E_ This is the first time in my life that I have caused controversy by NOT saying something. _E_ The people are really smart in cancelling subscriptions to the Dallas & Arizona papers & now USA Today will lose readers! The people get it! _E_ Join me live for the commissioning ceremony of the USS Gerald R. Ford! __HTTP__ #USA __HTTP__ _E_ Dopey @billmaher is in for a lot of trouble—I hope he has $5 million (for charity). _E_ Getting ready for @nbcsnl commercial. __HTTP__ _E_ I will be On The Record with Greta Van Susteren @gretawire tonight at 10 PM on Fox News. _E_ The people of Ireland have been so great about my purchase of Doonbeg I'll be there soon. @LodgeatDoonbeg _E_ Assad hit the jackpot! _E_ Here we go with the Oscars! _E_ By the way folks @billmaher is not a smart guy (just look at his past)—he just pretends he is! _E_ Does anybody notice that Atlantic City lost its magic after I left years ago. I had the big boxing introduced UFC (ask Dana)the best shows _E_ Where's the electability? Jeb is losing to HRC by 13 points. A Bush will never beat a Clinton. Wake up @GOP! _E_ Obama should stop running down the stairs when getting off Air Force One. Doesn't look presidential and at some point he will take a fall. _E_ It's Tuesday. How many more non stories will the liberal media try to manufacture so everyone ignores Obama's record? _E_ Thank you to all law enforcement agencies for a fabulous job!#LEO #LESM #Trump2016 __HTTP__ _E_ Celebrity Apprentice will be LIVE on Sunday at 9 PM (from New York City).Casting has already begun for next season. _E_ For first time the failing @nytimes will take an ad (a bad one) to help save its failing reputation. Try reporting accurately & fairly! _E_ Will be interviewed on @Morning_Joe at 7:40. ENJOY! _E_ How is Chris Christie running the state of NJ which is deeply troubled when he is spending all of his time in NH? New Jerseyans not happy! _E_ Happy Father's Day to all even the haters and losers! 
_E_ Thank you for the kind words tonight @OMAROSA. You were great! See you soon! _E_ Last night Melania and I attended the Skating with the Stars Gala at Wollman Rink in Central Park it was fantastic. Stay tuned for Part 2.. _E_ We create success or failure on the course primarily by our thoughts. Gary Player _E_ Via @ConcordNHPatch by @politizine: "Trump: 'We'll Make America Great Again'" __HTTP__ _E_ Hillary there is nothing to laugh about __HTTP__ _E_ .@sethmeyers Seth can't help it he is really trying hard but just doesn't have what it takes. Very awkward and insecure! _E_ See you tomorrow w/ Gov. @Mike_Pence Iowa & Wisconsin! 3pm __HTTP__ __HTTP__ __HTTP__ _E_ Why does US doping agency destroy an American icon @lancearmstrong for events that took place years ago in France? _E_ Departing Golden CO. for Arizona now after an unbelievable rally. Watch here: __HTTP__ __HTTP__ _E_ A wonderful article by a writer who truly gets it. I am for the people and the people are for me. #Trump2016 __HTTP__ _E_ Associated Press knowingly and inaccurately wrote about Liberty University speech. Shameful reporting...no credibility. _E_ Looks like @tedcruz is getting ready to attack. I am leading by so much he must. I hope so he will fall like all others. Will be easy! _E_ Despite having a black president the racial divide seems greater than it has in decades.If Obama were a leader this would not be the case _E_ Via @BreitbartNews by @NolteNC: DONALD TRUMP SURGES TO COMMANDING LEAD IN POST MCCAIN BACKLASH POLL __HTTP__ _E_ He is destroying our country:@BarackObama has requested to raise our debt limit to over $16.4Trillion by the end (cont) __HTTP__ _E_ Just looked at new selection of Donald J. Trump Signature Collection ties & shirts @Macys fantastic! Would make great gifts! _E_ Jeb used Eminent Domain & took advantage of a disabled vet in the process. (2/2) __HTTP__ _E_ The fans are going to love the tasks in the upcoming 13th season of All Star @CelebApprentice. 
The biggest yet! _E_ I have accepted the invitation of President Enrique Pena Nieto of Mexico and look very much forward to meeting him tomorrow. _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Where are @RepMarkMeadows @Jim_Jordan and @Raul_Labrador?#RepealANDReplace #Obamacare _E_ In NYC looks like another attack by a very sick and deranged person. Law enforcement is following this closely. NOT IN THE U.S.A.! _E_ Michael Forbes lives in a pigsty and bad liquor company Glenfiddich gave him Scot of the Year award... _E_ Listening to @rushlimbaugh on way back to Jury Duty. Fantastic show terrific guy! _E_ .@MagicJohnson Good luck with the Dodgers this season if they were like you they would never lose a game! _E_ One of the worst and most boring political pundits on television is @krauthammer. A totally overrated clown who speaks without knowing facts _E_ Univision apologized to me but I will not accept their apology. I will be suing them for a lot of money. Miss U.S.A. contestants are hurt! _E_ Lightweight @AGSchneiderman will probably win only because he is a Dem in NY but what a loser! _E_ I have watched sloppy Graydon Carter fail and close Spy Magazine and now am watching him fail at @VanityFair Magazine. He is a total loser! _E_ I'll be on @foxandfriends Monday at 7:30 AM don't miss it. _E_ Ratings way down show irrelevant. Why haven't they learned? @Rosie always fails. _E_ Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our dangerous Southern Border. They could have easily made a deal but decided to play Shutdown politics instead. #WeNeedMoreRepublicansIn18 in order to power through mess! _E_ #HappyIndependenceDay #USA __HTTP__ _E_ 'Manufacturing openings hires rise to highest levels of the recovery' __HTTP__ _E_ I bought Tim Tebow's jersey and helmet at auction for a good cause fighting breast cancer __HTTP__ _E_ For those of you in trouble—(in these troubled times)—never ever give up! 
_E_ If my people said the things about me that Podesta & Hillary's people said about her I would fire them out of self respect. Bad instincts _E_ Congress should be worried about American workers not people who came into our country by breaking our laws. _E_ RT @IvankaTrump: 2016 has been one of the most eventful and exciting years of my life. I wish you peace joy love and laughter. Happy New... _E_ .@CNN is the worst.They go to their dumb one sided panels when a podium speaker is for Trump! VAST MAJORITY want: Make America Great Again! _E_ MAKE AMERICA GREAT AGAIN! #IACaucus #CaucusForTrump __HTTP__ __HTTP__ _E_ My Twitter account was taken down for 11 minutes by a rogue employee. I guess the word must finally be getting out and having an impact. _E_ .@FrankLuntz is a low class slob who came to my office looking for consulting work and I had zero interest. Now he picks anti Trump panels! _E_ The CDC chief just said Ebola is spreading faster than Aids. Marines are preparing for a pandemic drill. Stop all flights from West Africa! _E_ Why the Rust Belt just gave Donald Trump a hero's welcome __HTTP__ _E_ From an amazing day on the border in Laredo. __HTTP__ _E_ Condolences to the family of the young woman killed today and best regards to all of those injured in Charlottesville Virginia. So sad! _E_ I keep getting great feedback on new #TRUMP cologne 'Success.' Exclusively available at @Macy's __HTTP__ And best shirts & ties _E_ Just read @marklevinshow's bestseller book—really great! _E_ The failing @nytimes does not mention the new @CNN Poll that has me leading Iowa by a massive 13 points I am at 33%. Maggie Haberman sad! _E_ Give me clean beautiful and healthy air not the same old climate change (global warming) bullshit! I am tired of hearing this nonsense. _E_ ...Save your energy Rex we'll do what has to be done! 
_E_ Join me in Colorado at 12pm tomorrow or Arizona at 3pm!TICKETS:Golden: __HTTP__ __HTTP__ _E_ The best luck of all is the luck you make for yourself. Douglas MacArthur _E_ Getting ready to do the David Letterman @Late_Show tonight—I hope you all will watch—I think! _E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ He was quick to issue an apology on behalf of America to Karzai. Why won't he release the letter? @BarackObama __HTTP__ _E_ We should immediately close all tax loopholes that favor foreign investments and taking our jobs overseas t... (cont) __HTTP__ _E_ I told you @TIME Magazine would never pick me as person of the year despite being the big favorite They picked person who is ruining Germany _E_ There are just so many penalties and such long commercials in these NFL games that they are no longer worth watching. Soft hitting & boring! _E_ Thank you America! #Trump2016 __HTTP__ _E_ Governor Cuomo only cut the Verrazano Bridge tolls because I made it a major point in speeches. I love the people of Staten Island! _E_ Via @Newsmax_Media by Alana Marie Burke: "Donald Trump 2016: 7 Key Political Positions" __HTTP__ _E_ Politicians are all talk and no action. Washington can only be fixed by an outsider. Let's make America great again! __HTTP__ _E_ ...and West Virginia. The fact is the Fake News Russian collusion story record Stock Market border security military strength jobs..... _E_ I have spoken w/ @GovAbbott of Texas and @LouisianaGov Edwards. Closely monitoring #HurricaneHarvey developments & here to assist as needed. _E_ The Ebola doctor who just flew to N.Y. from West Africa and went on the subway bowling and dining is a very SELFISH man should have known! _E_ Together we will prevail in the GREAT state of Texas. We love you!GOD BLESS TEXAS & GOD BLESS THE USA __HTTP__ _E_ Even liberals & Democrats think Eric Schneiderman's use of the Atty General's office is unfair & unethical. 
__HTTP__ _E_ Speaker John Boehner who I like should never have agreed to raise taxes because the Republicans got absolutely nothing for it! _E_ ...extremism and all reference was pointing to Qatar. Perhaps this will be the beginning of the end to the horror of terrorism! _E_ Via @EW by @DaltonRoss: "recap: 'Nobody Out Thinks Donald Trump'" __HTTP__ _E_ The S&P downgrade is a direct result of @BarackObama's increased reckless budget spending and Obama Care. He owns this. _E_ I'm on @CNN's @AC360 tonight @8pm & @FoxNews' @seanhannity @ 10PM discussing immigration and lots of other things.#LetsMakeAmericaGreatAgain _E_ Wonderful coordination between Federal State and Local Governments in the Great State of Texas TEAMWORK! Record setting rainfall. _E_ What a sad thing that the memory of Nelson Mandela will be stained by the phoney sign language moron who is in every picture at funeral! _E_ Landing in Phoenix now. Tomorrow's events will be amazing! #Trump2016 _E_ RT @foxandfriends: FOX NEWS EXCLUSIVE: President Trump 'seriously considering' a pardon for ex Sheriff Joe Arpaio __HTTP__ _E_ Leaving now for New Hampshire. Big crowd looking forward to it! #FITN _E_ Newsmax is a great news organization and its pres debate in IA on Dec 27 will be fair balanced and informative. @ralphreed _E_ A horrible day for Newtown CT and our country yesterday. My condolences to all of the families so tragically affected. _E_ Isn't it ironic that China is going all in nuclear for energy while at the same time making wind turbines for others. @alexsalmond _E_ .@ScottWalker is a nice guy but not presidential material. Wisconsin is in turmoil borrowing to the hilt and doing poorly in jobs etc. _E_ Via @MailOnline by @dmartosko: "President Trump? Says 'there's a very substantial chance' he'll run in 2016" __HTTP__ _E_ ...Even though parts of healthcare could pass at 51 some really good things need 60. So many great future bills & budgets need 60 votes.... 
_E_ .@alexsalmond @pressjournal RT @JohnDuthie1 just sitting here looking out over Aberdeen bay. These clowns cannot be allowed... _E_ Word is that crying @GlennBeck left the GOP and doesn't have the right to vote in the Republican primary. Dumb as a rock. _E_ Health Insurance stocks which have gone through the roof during the ObamaCare years plunged yesterday after I ended their Dems windfall! _E_ Remember it was the Republican Party with the help of Conservatives that made so many promises to their base BUT DIDN'T KEEP THEM! Hi DT _E_ If the working proud and productive people of our country don't start exerting their authority and views the U.S. as we know it is doomed! _E_ One of the saddest things in journalism is what happened to the formerly great @AP. They have lost their way and are no longer credible. _E_ Change is not a destination just as hope is not a strategy. Rudy Giuliani _E_ Off to Indiana! #Trump2016 __HTTP__ _E_ Rosie O'Donnell should leave Lindsay Lohan alone @Rosie has bigger problems than Lindsay. Lindsay's mother called my office for help _E_ One of the many reasons that @VattenfallGroup dropped out of windfarm project—they couldn't solve military radar defense problems _E_ "Don't let the fear of striking out hold you back." – Babe Ruth _E_ When will the Fake Media ask about the Dems dealings with Russia & why the DNC wouldn't allow the FBI to check their server or investigate? _E_ ObamaCare gives free insurance to illegal immigrants. Yet @BarackObama is cutting our troops healthcare. (cont) __HTTP__ _E_ It is wonderful to be in beautiful Doonbeg touring @Trump_Ireland. I'm truly honored by the wonderful welcome to my family & organization _E_ If you're going to be thinking you may as well think big. _E_ Watch the game really good. _E_ "Success isn't permanent and failure isn't fatal." – Mike Ditka _E_ I will be in Iowa all day and until Tuesday morning. Finally after all these years of watching stupidity we will MAKE AMERICA GREAT AGAIN! 
_E_ True @THEGaryBusey is a scene stealer without trying. He's got a gift. #CelebApprentice _E_ If I become the next POTUS they will not be ignoring! #AmericaFirst __HTTP__ _E_ Is that all there is? We need a new President FAST! _E_ Invincibility lies in the defence the possibility of victory in the attack. Sun Tzu _E_ Innovation distinguishes between a leader and a follower. Steve Jobs _E_ Loved doing #NCGOPConvention keynote speech last night! Unbelievable reception. Had the biggest crowds by far of any of the GOP candidates. _E_ The talks between the U.S. and Iran are going on forever WORLD'S LONGEST NEGOTIATION. Obama has no idea what he is doing incompetent! _E_ You're all wrong—check the facts! UK is massively subsidizing Scotland's wind turbines & the people don't want them. _E_ ... Rove's ad campaign has made Ashley Judd a totally credible candidate. Be careful Mitch! _E_ Tonight I trade places with Larry King @kingsthings and interview him on the 25th anniversary of his show. 9PM on CNN featuring best clips. _E_ My @foxandfriends interview discussing the Benghazi cover up Hostess' closing & celebrating Thanksgiving with family __HTTP__ _E_ Crooked took MILLIONS from oppressive ME countries. Will she give the $$$ back? Probably not. Don't forget her slog... __HTTP__ _E_ I am having 600 Thanksgiving dinners sent to the Rockaways prepared by my wonderful Trump Grill/Trump Tower staff. #SandyRelief _E_ Spolier alert...the record setting 13th season of All Star @CelebApprentice also features the return of previous winners in the boardroom. _E_ Venezuelan leader Hugo Chavez said in a television interview that aired on Sunday If I were American I'd vote for Obama. _E_ JOBS JOBS JOBS! #MAGA __HTTP__ _E_ On my way to Pensacola Florida. See everyone soon! #MAGA __HTTP__ _E_ Crooked Hillary Clinton has destroyed jobs and manufacturing in Pennsylvania. Against steelworkers and miners. Husband signed NAFTA. _E_ ...@BarackObama is hiding plenty of bad things. 
_E_ With the coming forward today of the woman central to the failing @nytimes hit piece on me we have exposed the article as a fraud! _E_ This is the right TAX CUT @ the RIGHT TIME. We will ALL succeed & grow TOGETHER – as one team one people & one American family. #TaxReform __HTTP__ _E_ If you want to conquer fear don't sit home and think about it. Go out and get busy. Dale Carnegie _E_ Should have gone after the oil years ago (like I have been saying). _E_ An 'extremely credible source' has called my office & told me that @BarackObama applied to Occidental as a foreign student think about it! _E_ Crooked Hillary Clinton blames everybody (and every thing) but herself for her election loss. She lost the debates and lost her direction! _E_ Looking forward to addressing @ralphreed's @FaithandFreedom 'Road to Majority Conference' on June 13th __HTTP__ _E_ President Obama spends so much time speaking of the so called Carbon footprint and yet he flies all the way to Hawaii on a massive old 747. _E_ As a tribute to the late great Phyllis Schlafly I hope everybody can go out and get her latest book THE CONSERVATIVE CASE FOR TRUMP. _E_ The economy is bad and getting worse almost ZERO growth this quarter. Nobody can beat me on the economy (and jobs). MAKE AMERICA GREAT AGAIN _E_ We are excited to announce Trump Estates at Akoya by DAMAC luxury villas situated byTrump Int'l Golf Links Dubai __HTTP__ _E_ Breaking news The Washington Redskins have just announced that they will be removing the name Washington from their name! _E_ "@DamacOfficial Announces @TigerWoods to Create Golf Course for Trump World Golf Club Dubai" __HTTP__ via @BusinessWire _E_ RT @DanScavino: LOUISIANA GENERAL ELECTIONDonald Trump vs. Hillary Clinton#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Glad to hear @GovChristie will be delivering the Keynote for the @RNC convention. He will deliver a strong message. _E_ It's Tuesday. 
How much money will Karl Rove waste today trying to push amnesty through the House? _E_ No @JebBush you're pathetic for saying nothing happened during your brother's term when the World Trade Center was attacked and came down. _E_ Kasich voted for NAFTA a disaster for Ohio and now wants the even worse TPP approved. Vote Trump and end this madness! _E_ We must all be united in offering assistance to everyone suffering in Puerto Rico and elsewhere in the wake of this terrible disaster. _E_ A clip of my upcoming interview with @DavidBrody discussing #TimeToGetTough @Israel and the Islamist winter __HTTP__ _E_ RT @DanScavino: Join @realDonaldTrump LIVE in Wisconsin with Gov. @ScottWalker @MayorRGiuliani @Reince & Coach Bobby Knight! LIVE: __HTTP__ _E_ It is amazing how often I am right only to be criticized by the media. Illegal immigration take the oil build the wall Muslims NATO! _E_ Congratulations to Aberdeen and Scotland for just having our great golf course named Best New Course In World by The Robb Report. _E_ I love that thousands of people are boycotting @Macys and cutting up credit cards. No guts no glory. This really backfired love it! _E_ Facebook was always anti Trump.The Networks were always anti Trump henceFake News @nytimes(apologized) & @WaPo were anti Trump. Collusion? _E_ I hope Bill Clinton and NEWSMAX's Chris Ruddy are enjoying their mission to Africa. Two great people. _E_ Celebrity Apprentice continues to be a top ten trend on twitter this morning __HTTP__ _E_ No I'm saying that the World is paying the price for China's pollution while they make a fortune with their dirty factories! Very sad. _E_ The Club For Growthwhich asked me for $1000000 in an extortion attempt just put up a Wisconsin ad with incorrect math.What a dumb group! 
_E_ Via "TRUMP: HILLARY PRESIDENCY WILL CAUSE 'CRIME WAVE LIKE YOU'VE NEVER SEEN'" __HTTP__ via @BreitbartNews _E_ .@MattGinellaGC Don't forget to watch Matt tomorrow on Morning Drive talking about The Blue Monster and Trump Doral. @GolfChannel _E_ A quote was read from a parody account last night on MSNBC re: Jeb. __HTTP__ _E_ RT @Corrynmb: @realDonaldTrump Liberals have an agenda and it's not in America's best interest. Keep fighting the good fight! We stand with... _E_ Today it was my great honor to sign a new Executive Order to ensure Veterans have the resources they need as they transition back to civilian life. We must ensure that our HEROES are given the care and support they so richly deserve! __HTTP__ __HTTP__ _E_ Leaving now I'm spending the entire day in Iowa great people great state! _E_ #MakeAmericaSafeAgain #ImWithYou __HTTP__ _E_ Rambling and stumbling @hardball_chris is as dumb as a rock! _E_ My interview yesterday with @TeamCavuto discussing Europe's debt deal and the GOP primary __HTTP__ _E_ Great New Poll __HTTP__ _E_ The Justice Dept. should have stayed with the original Travel Ban not the watered down politically correct version they submitted to S.C. _E_ The best vision is insight. Malcolm Forbes _E_ Business is looking better than ever with business enthusiasm at record levels. Stock Market at an all time high. That doesn't just happen! _E_ The Senate must NOT pass TPA! Any Senator who votes for it is disqualified for being POTUS. Protect the American worker and manufacturer! _E_ Yet another terrorist attack this time in Turkey. Willthe world ever realize what is going on? So sad. _E_ A MUST WATCH TRULY BEAUTIFUL! @PrivateCaddie: Amazing Turnberry Ailsa course changes from @realDonaldTrump #Golf __HTTP__ _E_ I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_ Coming together is a beginning keeping together is progress working together is success. 
Henry Ford _E_ It's important to remain open to new ideas and new information. Keep your door open every day to something innovative and energizing. _E_ Located in Palm Beach FL historic Mar a Lago features 20 exquisite acres filled w/ world class amenities __HTTP__ _E_ Via @businessinsider by @hunterw: "TRUMP: 'I'm going to surprise a lot of people' in 2016" __HTTP__ _E_ Results are what matter. The bottom line is clearly the bottom line. Think Like a Champion _E_ I will be live tweeting my interview with @megynkelly on the Fox Network tonight at 8! Enjoy! __HTTP__ _E_ I will be interviewed by @MariaBartiromo on @MorningsMaria @FoxBusiness at 7:30 A.M. Enjoy. _E_ Bush and Rubio are finally attacking each other as I knew they would in order to be the last establishment man standing against me.Great _E_ This is what REAL PRIDE in our COUNTRY is all about! #USA __HTTP__ _E_ Bernie Sanders is being treated very badly by the Democrats the system is rigged against him. Many of his disenfranchised fans are for me! _E_ RT @axios: The DOJ is opening a civil rights investigation on the car attack in Charlottesville __HTTP__ _E_ Thank you New Hampshire! #FITN __HTTP__ _E_ Meeting with Iowa State Senate Leaders __HTTP__ _E_ Must read f/@ weeklystandard by @JayCostTWS: "Obamacare Myth Making Five phony success stories." __HTTP__ _E_ "Each life is made up of mistakes and learning waiting and growing practicing patience and being persistent." – Rev. @BillyGraham _E_ Being good in business is the most fascinating kind of art.Making money is art & working is art & good business is the best art. A. Warhol _E_ Republicans must be careful in that the Dems own the failed ObamaCare disaster with its poor coverage and massive premium increases...... _E_ Hillary's vision is a borderless world where working people have no power no jobs no safety. _E_ Always bear in mind that your own resolution to succeed is more important than any other. 
Abraham Lincoln _E_ Ivanka Trump will be interviewed on @foxandfriends. _E_ Crude has skyrocketed since @BarackObama delayed the Keystone Pipeline. Not only are 20000 jobs gone but family budgets are tightening. _E_ Our founders invoked our Creator four times in the Declaration of Independence. Our currency declares "IN GOD WE TRUST." And we place our hands on our hearts as we recite the Pledge of Allegiance and proclaim that we are "One Nation Under God." #NationalPrayerBreakfast __HTTP__ _E_ Sugar @Lord_Sugar Why don't you tell the public what you're really worth they would be very disappointed. _E_ Closely monitoring #HurricaneHarvey from Camp David. We are leaving nothing to chance. City State and Federal Govs. working great together! _E_ The #Hyperlapse app in @TrumpTowerNY __HTTP__ _E_ I will be interviewed by @GStephanopoulos on @GMA at 7:00 A.M. There is much to talk about! _E_ Thank you for your support in Biloxi MS! Let's ALL get out & VOTE in 2016 so we can #MakeAmericaGreatAgain! __HTTP__ _E_ Crooked Hillary promised 200k jobs in NY and FAILED. We'll create 25M jobs when I'm president and I will DELIVER! __HTTP__ _E_ They finally let our Marine out of a Mexican prison no thanks to Obama. Way too long. Such an event should never be allowed to happen again _E_ It was just announced that @ErinBurnett won't be going to mornings on CNN. @OutFrontCNN just made a wise decision. _E_ No wonder the Today Show on biased @NBC is doing so badly compared to its glorious past. Little credibility! _E_ Join me in Mobile Alabama on Sat. at 3pm! #ThankYouTour2016 Tickets: __HTTP__ __HTTP__ _E_ I will be interviewed by @MarthaMaccallum on @FoxNews tonight at 7pm. Enjoy! _E_ It is about time that Roger Goodell of the NFL is finally demanding that all players STAND for our great National Anthem RESPECT OUR COUNTRY _E_ I'm sick of always reading about outsourcing. Why aren't we talking about 'onshoring'? 
(cont) __HTTP__ _E_ Thank you to @foxandfriends for the nice reviews of last night. _E_ Nasty Ted Cruz is at it again same dirty tricks he used w/ @RealBenCarson saying I may not be on ballot & I hold liberal positions. LIES! _E_ "He who knows when he can fight and when he cannot will be victorious." Sun Tzu _E_ My Trump Home Mattress Collection by Serta is setting records they are really phenomenal. You can order them at __HTTP__ _E_ Pigs get slaughtered ... again. Ft Lauderdale plaintiffs must pay me close to $400k in legal fees after Trump trial victory. _E_ Do you believe this singing? #Oscars _E_ .@antbaxter Anthony—did you illegally take clips from the Letterman @Late_Show show and @GolfChannel without their approval? _E_ Remember this the worst doctors (by far) are celebrity doctors. If you see their names or read about them in the newspapers stay away! _E_ Via @espn: @dallasmavs "most likely scenario remains finishing a frustrating ninth in the West" __HTTP__ _E_ Busy week planned with a heavy focus on jobs and national security. Top executives coming in at 9:00 A.M. to talk manufacturing in America. _E_ .@KellyandMichael are both wonderful people. Their show is terrific. #CelebApprentice _E_ Don't take vacations. What's the point? If you're not enjoying your work you're in the wrong job. Think Like A Billionaire _E_ Big ratings getter @seanhannity and Apprentice Champion John Rich are right now going on stage in Las Vegas for #VegasStrong. Great Show! _E_ American must now get very tough very smart and very vigilant. We cannot admit people into our country without extraordinary screening. _E_ #RiyadhSummit #POTUSAbroad __HTTP__ _E_ Really bad article about me in the dying (or dead) Esquire Magazine. Totally false lots of hatred. When will this boring magazine close? _E_ People that have read it tell me that @KarlRove book is terrible (and boring). Save your money! @FoxNews should can him no credibility! 
_E_ See the attack very possibly could have been stopped. We need real leadership and vision. __HTTP__ _E_ What's more important for the American public to have? @MittRomney's tax returns or @BarackObama's sealed records? _E_ Iran with all of the money and all else given to them by Obama has wanted a way to take over Saudi Arabia & their oil. THEY JUST FOUND IT! _E_ You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_ To aspiring entrepreneurs: Trust your instincts. They are there for a reason. _E_ COMING UP @GenFlynn @newtgingrich on @foxandfriends _E_ Who would you like to see on next season of #CelebrityApprentice? Let us know everyone wants to be on it. _E_ Trump defends campaign manager charged for bruising a reporter: __HTTP__ _E_ Stock Market up 5 months in a row! _E_ At 9:00 P.M. @CNN of all places is doing a Special Report on my daughter Ivanka. Considering it is CNN can't imagine it will be great! _E_ HILLARY'S BAD TAX HABIT! __HTTP__ _E_ Such long rhetorical and boring answers from Obama. No wonder nothing gets done. _E_ .@DannyZuker I hear your filmography is stacked with failures. _E_ I hope you buy my shirts and ties at @Macys _E_ A huge honor for @TrumpToronto for being named #1 Luxury Hotel in Canada by @TripAdvisor's #TravelersChoice Awards __HTTP__ _E_ RT @DonaldJTrumpJr: Great group at our Victory Office in Columbus Ohio. I'm incredibly grateful to have so many... __HTTP__ _E_ Wow new @ABCnews/@WashingtonPost @GOP preference poll has DonaldTrump 11 points up! Thank you. _E_ The delegates at the @DNC convention keep shouting Four More Years. Four more years of 18% real unemployment and another $6T in debt? _E_ The #CelebApprentice post @OMAROSA. Will it ever be the same? _E_ Referees are destroying the enjoyment of NFL games. Slowing down the fun. Big shots. Jets game is ridiculous! _E_ I hope Derek Jeter's recovery is going well. He is a very special player and a great guy. 
New York loves him. @yankees _E_ Thank you Pennsylvania I am forever grateful for your amazing support. Lets MAKE AMERICA GREAT AGAIN! #MAGA... __HTTP__ _E_ My @SquawkCNBC interview. __HTTP__ _E_ .@TheRealMarilu is impressing the All Star Celebrity @ApprenticeNBC viewers with her continued success on Team Power. _E_ An unbelievable night in Iowa with our great Veterans! We raised $6000000.00 while the politicians talked! #GOPDebate _E_ I missed the PGA Championship because it was not broadcast by TimeWarner @TWC. Why aren't they giving subscribers major discounts? _E_ If I only had 1 person running against me in the primaries like Hillary Clinton I would have gotten 10 million more votes than she did! _E_ Via @Newsmax_Media: "Trump to Speak at CPAC" __HTTP__ @CPACnews #CPAC13 _E_ The hatred that clown @krauthammer has for me is unbelievable – causes him to lie when many others say Trump easily won debate. _E_ Thank you Mahoning County Ohio! See you soon! #MakeAmericaSafeAgain __HTTP__ __HTTP__ _E_ .@GovernorPataki was a terrible governor of NY one of the worst would've been swamped if he ran again! _E_ The new Red Tiger course at @TrumpDoral __HTTP__ Follow @TrumpGolf for more great photos. _E_ More and more people are suggesting that Republicans (and me) should be given Equal Time on T.V. when you look at the one sided coverage? _E_ Canadian PM Harper immediately called the Ottawa attack terrorism. At least North America has a strong leader who lives in reality. _E_ My first order as President was to renovate and modernize our nuclear arsenal. It is now far stronger and more powerful than ever before.... _E_ The failing @nytimes does major FAKE NEWS China story saying Mr.Xi has not spoken to Mr. Trump since Nov.14. We spoke at length yesterday! _E_ Medicare payments have become so unpredictable that record amount of doctors are now leaving __HTTP__ Bad for long term. 
_E_ Arnold Schwarzenegger isn't voluntarily leaving the Apprentice he was fired by his bad (pathetic) ratings not by me. Sad end to great show _E_ Thank you Virginia! #Trump2016#SuperTuesday _E_ Great Live Signing last nite! Over 25k views. I am signing books for next two weeks. Order yours for holiday gifts. __HTTP__ _E_ Excited to see @SixteenChicago's "elevated fine dining" explored by @USAToday @10Best! __HTTP__ _E_ It's Friday. How many people have been forced off their plans and lost their doctors today because of ObamaCare? _E_ Miss Israel and Miss Lebanon no more fighting! #TrumpVlog #MissUniverse __HTTP__ _E_ Robert I'm getting a lot of heat for saying you should dump Kristen but I'm right. If you saw the Miss Universe girls you would reconsider. _E_ Two people fired very early on Celebrity Apprentice tonight at 9 leading up to next weeks live Finale. Don't get angry at me tonight! _E_ It's important to promote an image of yourself each and every day. It's part of having a sense of self and a sense of purpose. _E_ Now Chinese agents are smuggling our military weapons through rogue US soldiers __HTTP__ China loves to cheat! _E_ Re: Negotiation: View any conflict as an opportunity. Be a diplomat as much as possible. _E_ I will be landing in Las Vegas shortly to pay my respects with @FLOTUS Melania. Everyone remains in our thoughts and prayers. _E_ Our billion dollar website __HTTP__ _E_ Via @BleacherReport: "Donald Trump to Be Inducted into WWE Hall of Fame" __HTTP__ _E_ RT @DonaldJTrumpJr: An Honor to be in #Indiana w @realDonaldTrump @greta & the legend Bobby Knight! I like our secret weapon better!!! __HTTP__ _E_ Going over to @TodayShow now to introduce @ApprenticeNBC cast etc. watch. _E_ Alaska Arizona Maine and Kentucky are big winners in the Healthcare proposal. 7 years of Repeal & Replace and some Senators not there. _E_ If I run I will be in all the primary debates and you will see why I am the only one who can Make America Great Again! 
_E_ Happy Thanksgiving to all even the haters and losers! _E_ Apprentice ratings doing great easily won the 10 o'clock hour over other networks! _E_ Can you imagine the anger and disgust when the heads of other countries found out that their cell phones were being tapped by NSA.Obama mess _E_ It's been great making so many new friends at Trump @DoralResort for the @CadillacChamp. Good luck to everyone! _E_ .@EricShawnonFox Highest rated Saturday Night Live in four years. 47% higher than their opening night with Hillary & Miley Cyrus. Nice words _E_ .@GStephanopoulos just announced that I am leading BIG in the new @ABC Poll which will be shown on This Week at 9:00 A.M. I will be on show _E_ THANK YOU California Maryland New York and Pennsylvania! See you soon!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Out of our very big country with many choices does everyone notice that both the ban case and now the sanctuary case is brought in ... _E_ Remember if you don't sell yourself no one else will. Make sure the public friends & the business community hears about your success. _E_ My interview with @gretawire last night Everything Obama Does is a 'Campaign Speech' __HTTP__ _E_ Bernie Sanders totally sold out to Crooked Hillary Clinton. All of that work energy and money and nothing to show for it! Waste of time. _E_ Just received applause at #NCGOPcon when I said People ask me why I may run for President I might so we can Make America Great Again! _E_ Carson now admits his friend named Bob who he tried to stab (Bob was saved by his belt buckle!) no longer exists as Bob. Wrong name! _E_ The United States mourns for the victims of Nice France. We pledge our solidarity with France against terror. __HTTP__ _E_ Republican Tax Cuts are looking very good. All are working hard. In the meantime the Stock Market hit another record high! _E_ I saw from my window just before accident that the crane was not properly anchored for the storm. 
_E_ If we did all the things we are capable of we would literally astound ourselves. Thomas A. Edison _E_ .@MajorCBS Major Garrett of @CBSNews covers me very inaccurately. Total agenda bad reporter! _E_ Piers truly hates Omarosa! _E_ Certainly has been an interesting 24 hours! _E_ When will @BarackObama present an actual budget? Enough with the games. _E_ Failing @nytimes which has been calling me wrong for two years just got caught in a big lie concerning New England Patriots visit to W.H. _E_ What recovery? JP Morgan has readjusted Q2 growth down from 1.7% to 1.4% and Q3 to 1.5% with 2012 on a whole at 1.7% __HTTP__ _E_ China wouldn't provide a red carpet stairway from Air Force One and then Philippines President calls Obama the son of a whore. Terrible! _E_ "Presidential Proclamation Commemorating the 50th Anniversary of the Vietnam War" __HTTP__ __HTTP__ __HTTP__ _E_ RT @PressSec: .@POTUS and @FLOTUS meet w/ some of America's finest on the USS Kearsarge off the coast of PR. __HTTP__ _E_ Almost all reporters falsely report that I had a bad time at last year's White House Correspondents' Dinner. (cont) __HTTP__ _E_ Bernie Sanders is lying when he says his disruptors aren't told to go to my events. Be careful Bernie or my supporters will go to yours! _E_ We are going to ask Katherine Webb to be a judge at the Miss USA Pageant coming up in Las Vegas. _E_ On this wonderful Veterans Day I want to express the incredible gratitude of the entire American Nation to our GREAT VETERANS. Thank you! __HTTP__ _E_ A great night in Iowa! __HTTP__ _E_ I still don't know who I'm going to choose. @GeraldoRivera or @LeezaGibbons? Who do you like? @ApprenticeNBC _E_ He @BarackObama is incapable of admitting that he is a complete and utter failure. He is 100% responsible for Solyndra. __HTTP__ _E_ Why do people listen to clown @KarlRove on @FoxNews? Spent $430M & lost all races—a Bushy! _E_ Thanks you for all of the Trump Rallies today. Amazing support. 
We will all MAKE AMERICA GREAT AGAIN! _E_ Thoughts & prayers with everyone in Lafayette Louisiana this evening. _E_ .@TheEconomist Poll one of the most highly respected was just released. Wow wait until the media digests these numbers won't be happy! _E_ Thank you America! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_ My great honor to host the 2017 back to back #StanleyCup Champion Pittsburgh Penguins at the WH with @FLOTUS today! __HTTP__ __HTTP__ _E_ I have a tip that can take 5 strokes off anyone's golf game. It's called an eraser. Arnold Palmer _E_ With unempoyment over 10% in 2009 @BarackObama held an extravagant Alice in Wonderland party. He is a man of the people! _E_ If you are steadfast in your efforts and self respect critics will be harmless. Keep your focus! _E_ It is being reported by virtually everyone and is a fact that the media pile on against me is the worst in American political history! _E_ Via @MoscowTimes: Donald Trump in New @eminofficial Video __HTTP__ Emin & family are wonderful people. _E_ #Imwithyou __HTTP__ _E_ Price of corn has jumped over 50%. This will cause a jump in food prices perhaps beyond what we've ever seen. Nasty for the economy. _E_ When confronted @RickSantorum can't defend his ridiculous attacks on @MittRomney __HTTP__ _E_ ... So if you want to aim high you have to have the guts to handle the inevitable bumps in the road. Think BIG _E_ Millions protesting in Egypt for Morsi's ouster __HTTP__ When will Obama demand Morsi's resignation as he did to Mubarak _E_ #TBT As a young man when I proposed the Convention Center in New York City. __HTTP__ _E_ RT @DRUDGE_REPORT: WSJ: Grifters in Chief... __HTTP__ _E_ Putin has become a big hero in Russia with an all time high popularity. Obama on the other hand has fallen to his lowest ever numbers. SAD _E_ This story is no longer about John McCain it's about our horribly treated vets. Illegals are treated better than our wonderful veterans. 
_E_ ...they are costly inefficient bird killing community destroying machines. They are obsolete! @maddow _E_ The last person corrupt Hillary Clinton wants to run against is Donald J. Trump. I'll end up beating her in every state. New Fox Poll Trump! _E_ #trumpvlog NY Area Two book signings Tonight and Thursday.... __HTTP__ _E_ Courageous Patriots have fought and died for our great American Flag we MUST honor and respect it! MAKE AMERICA GREAT AGAIN! _E_ Knowledge requires patience action requires courage. Put patience and courage together and you'll be a winner. _E_ My childcare plan makes a difference for working families more money more freedom. #AmericaFirst means... __HTTP__ _E_ Would seem that plane landed short of runway in San Francisco! _E_ My son Donald openly gave his e mails to the media & authorities whereas Crooked Hillary Clinton deleted (& acid washed) her 33000 e mails! _E_ Hillary Clinton is using race baiting to try to get African American voters but they know she is all talk and NO ACTION! _E_ Today's open call drew thousands of eager applicants. It was an impressive group I enjoyed meeting them. We've got some great candidates! _E_ Because I will be busy doing anything other than being in the movie #RoadHard. __HTTP__ _E_ I hear @pennjillette show on Broadway is terrible. Not surprised boring guy (Penn). Without The Apprentice show would have died long ago. _E_ Isn't it sad that on a day of national tragedy Hillary Clinton is answering softball questions about her email lies on @CNN? _E_ Just for your info tax returns have 0 to do w/ someone's net worth. I have already filed my financial statements w/ FEC. They are great! _E_ ISIS is taking credit for the terrible stabbing attack at Ohio State University by a Somali refugee who should not have been in our country. _E_ Now even @BarackObama's old professors are coming out in opposition to his re election. __HTTP__ He has embarrassed them. _E_ Only a fool would buy the @NYDailyNews. 
Loses fortune & has zero gravitas. Let it die! _E_ Entrepreneurs: Review your work habits regularly and make sure they are taking you in the right direction. Keep your focus intact. _E_ Advice from my mother Mary MacLeod Trump: Trust in God and be true to yourself. _E_ Ted Cruz was born in Canada and was a Canadian citizen until 15 months ago. Lawsuits have just been filed with more to follow. I told you so _E_ ..my endorsement). He also wanted to be Secretary of State I said NO THANKS. He is also largely responsible for the horrendous Iran Deal! _E_ .@kilmeade It was great being with you on @foxandfriends this morning. So many people saw and loved the piece. Great work! _E_ Join us Saturday night for the South Carolina Primary Watch Party!#SCPrimary #Trump2016 __HTTP__ _E_ Pictures of @melaniatrump and me from the Men In Black III premiere in New York City __HTTP__ We loved the movie! _E_ The United States under President Obama has truly become the gang that couldn't shoot straight. Everything he touches turns to garbage! _E_ Thank you @DennisRodman. It's time to #MakeAmericaGreatAgain! I hope you are doing well! __HTTP__ _E_ RT @JackPosobiec: Meanwhile: 39 shootings in Chicago this weekend 9 deaths. No national media outrage. Why is that? __HTTP__ _E_ The Ebola nurse should NEVER have been allowed to fly to Cleveland and (amazing) back again. Nothing works in our once great country anymore _E_ Just signed 702 Bill to reauthorize foreign intelligence collection. This is NOT the same FISA law that was so wrongly abused during the election. I will always do the right thing for our country and put the safety of the American people first! _E_ I look forward to Saturday night and being inducted into the @WWE Hall of Fame. _E_ Clips from tax speech and @seanhannity on @foxandfriends now. Have a great day! _E_ Via @TPInsidr __HTTP__ _E_ The biggest doers often suffer the biggest setbacks in life... 
_E_ The legendary @BarbaraJWalters interviews my family and me tonight at 10:00 on @ABC2020 . Don't miss it! __HTTP__ _E_ CNN Poll just out on South Carolina – great #'s __HTTP__ _E_ 'Obama Warned Of Rigged Elections In 2008.' Time to #DrainTheSwamp __HTTP__ __HTTP__ _E_ South Carolina was so great last night. Will be back soon! _E_ I will be interviewed on @FacetheNation Sunday 10AM on CBS. @johndickerson is a true pro! _E_ The #CNBCGOPDebate poll closed with #Trump2016 declared the official winner. Thank you! __HTTP__ __HTTP__ _E_ For those of you defending Bret and saying Omarosa should go remember Bret chose O which could also be considered a big mistake! _E_ Via @realitytvworld: La Toya Jackson fired from 'All Star Celebrity Apprentice' by Donald Trump __HTTP__ _E_ I am so glad @Rosie got fired by @Oprah. Rosie is a bully and it's always nice to see bullies go down! _E_ Tina Brown could finally be over. @thedailybeast is a total failure. She just got fired great! _E_ It is a great honor to have helped the community so much. __HTTP__ _E_ RT @foxandfriends: OPIOID CRISIS: Worse than we thought with a new study showing overdose deaths were under reported __HTTP__ _E_ Via @AP by @kronayne & @colvinj: Disavowed by GOP leaders Trump has supporters cheering __HTTP__ _E_ Trump @DoralResort's renovations are on schedule. With such a massive project underway I am watching closely. _E_ Crooked Hillary Makes History! #ImWithYou #AmericaFirst __HTTP__ _E_ Thank you the very dishonest Fake News Media is out of control! __HTTP__ _E_ Thank you for your support! TOGETHER we will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ Wrong used to be called global warming and when that name didn't work they deftly changed it to climate change because it's freezing! _E_ .@ewanshearer Happy Birthday _E_ Beijing had a bigger celebration than Chicago last night. The Chinese are happier with the election than we are. _E_ Negotiation is an art. Treat it like one. 
_E_ People don't understand that I left The Apprentice to run for Pres—the Apprentice DID NOT leave me. Bob Greenblatt & folks @NBC were GREAT! _E_ At the request of many I will be doing live tweets during the next presidential debate. _E_ I loved being at Liberty University today! Record setting crowd unbelievable people! Thank you Jerry and Becki! __HTTP__ _E_ .@HillaryClinton #ICYMI WE ARE NOT IN A NARRATIVE FIGHT. @Mike_Pence #MAGA __HTTP__ _E_ Going to New Hampshire all sold out crowds. People want real change POLS WILL NEVER MAKE OUR COUNTRY GREAT AGAIN! _E_ Rep. Stephen Lynch (D Ma) said There's all of these taxes and fees that are the tough medicine..it's going to hit the fan 're ObamaCare. _E_ We are experiencing the coldest weather in more than two decades most people never remember anything like this. GLOBAL WARMING anyone? _E_ Is Cruz honest? He is in bed w/ Wall St. & is funded by Goldman Sachs/Citi low interest loans. No legal disclosure & never sold off assets. _E_ A lot of the @Yankees should be ashamed of their play in the post season. They are lucky they don't have to deal with George Steinbrenner. _E_ I have millions more votes/hundreds more dels than Cruz or Kasich and yet am not being treated properly by the Republican Party or the RNC. _E_ Live on the edge no complacency is allowed and keep an open mind. Business is a creative endeavor. _E_ Via @BreitbartNews' @biggovt: "WAR! TRUMP LEVIN PUMMEL ROVE AS CONSERVATIVE BATTLE ESCALATES" __HTTP__ _E_ .@genesimmons is terrific congratulations on Hall of Fame. _E_ ...They have been in our country for many years through no fault of their own brought in by parents at young age. Plus BIG border security _E_ Trump Int'l Hotel Washington D.C.: The iconic Old Post Office Building will be one of the world's great hotels. __HTTP__ _E_ For those who missed my chat with @hannityshow on radio here it is on TV. Sean is terrific. __HTTP__ _E_ The media is really on a witch hunt against me. 
False reporting and plenty of it but we will prevail! _E_ Readout of my meeting with Israeli Prime Minister Benjamin Netanyahu: __HTTP__ __HTTP__ _E_ On the whole the teams seem to be working well together. No wars...yet. _E_ In all of television the only one who said anything bad about last nights landslide victory was dopey @KarlRove. He should be fired! _E_ It is time Republicans stop attacking each other and focus on @BarackObama. America cannot survive a second term. _E_ The Trump Signature Collection exclusively available at @Macys tops all menswear styles. Dress to impress! __HTTP__ _E_ Via @DMRegister by @JenniferJJacobs: Trump to hand out Trump memorabilia at Iowa summit __HTTP__ _E_ .@Yankees are in trouble without Derek. Try A Rod at short get him some confidence. _E_ .@TrumpWaikiki is Hawaii's top luxury hotel & destination. Each room features stunning views & superb amenities __HTTP__ _E_ Why did lightweight A.G. Eric Schneiderman come to my office on numerous occasions begging for campaign contributions? Also recent asks? _E_ I hate @USAToday's redesign the logo is terrible. Lightweight Al Neuharth must've had something to do with this No wonder paper is failing. _E_ THANK YOU NEVADA!#Trump2016 #MakeAmericaGreatAgain@Snapchat! Username: realdonaldtrump __HTTP__ __HTTP__ _E_ Watching the #GOPConvention#AmericaFirst #RNCinCLE _E_ Join me in California or Montana!5/25/16: Anaheim California __HTTP__ Billings Montana __HTTP__ _E_ If speeches and memoirs created jobs then @BarackObama would be Ronald Reagan. _E_ Good news is that my campaign has perhaps more cash than any campaign in the history of politics b/c I stand 100% behind everything we do. _E_ Thank you to @Franklin_Graham. I have always appreciated your courage but now more so than ever! _E_ Same CDC which is bringing Ebola to US misplaced samples of anthrax earlier this year __HTTP__ Be careful. _E_ .@mcuban is so short off the tee he can't have much of a punch. 
He's just a weak man with a big mouth! _E_ #trumpvlog @BarackObama is very inconsiderate... __HTTP__ _E_ RT @foxandfriends: Millions of gallons of Mexican waste threaten Border Patrol agents __HTTP__ _E_ It was great spending time with @joniernst yesterday. She has done a fantastic job for the people of Iowa and U.S. Will see her again! _E_ The arrogant young woman who questioned me in such a nasty fashion at No Labels yesterday was a Jeb staffer! HOW CAN HE BEAT RUSSIA & CHINA? _E_ Don't forget to watch Larry King tonight CNN at 9 pm. He's a television legend and a great friend. It's going to be a fantastic farewell. _E_ Via @BreitbartNews by @THESHARKTANK1: DONALD TRUMP FIRES ENTIRE 2016 GOP FIELD __HTTP__ _E_ . @foxandfriends interview discussing a budget deal my #CPAC2013 speech @RealBenCarson & firing @latoyajackson __HTTP__ _E_ #TBT On the stage during the Emmys performing Green Acres with Megan Mullally __HTTP__ _E_ Thank you to the Governor of Florida Rick Scott for your endorsement. I greatly appreciate your support! _E_ #CrookedHillary has FAILED all over the world! #BigLeagueTruth #Debates2016 __HTTP__ _E_ The movie may be garbage but we can't let a foreign country dictate to us what to watch. @SonyPictures _E_ It's Wednesday. I wonder how much money @BarackObama borrowed from China today? _E_ Just returned home from the great state of New Hampshire. Have made so many friends there special place! _E_ "@TrumpFerryPoint was something we've been working on for years and Donald Trump got it to the finish line." @rubendiazjr _E_ .@peachespulliam at @TrumpTowerNY this afternoon a wonderful woman. It was an honor to donate $25K to her charity. __HTTP__ _E_ America deserves a commander in chief who respects the challenges and realities our Armed Forces face in our (cont) __HTTP__ _E_ This is no surprise. Constant phony reporting from failing @CNN turns everyone off. The American people get it! __HTTP__ _E_ New job numbers once again show no growth or recovery.
Unemployment has been over 8% for 41 straight months now up to 8.3% _E_ Don't wait for dire circumstances to test your quick thinking ability. Be on alert at all times. _E_ Real unemployment is at over 21%. Businesses won't hire until @BarackObama is defeated in 2012. #TimeToGetTough _E_ Can you imagine what the outcry would be if @SnoopDogg failing career and all had aimed and fired the gun at President Obama? Jail time! _E_ The truth continues to come out after 14 years. A truth that many in the media did not want to tell. #Trump2016 __HTTP__ _E_ Tomorrow's election will have historic repercussions for our country. Make America strong again. Vote for @MittRomney. _E_ We are going to WIN and MAKE AMERICA GREAT AGAIN maybe better than ever before! _E_ Our prayers are with Rev. @BillyGraham for a speedy recovery. His faith continues to inspire us all. _E_ Just leaving Mechanicsburg PA. Incredible crowd so enthusiastic! Will be back soon. #MAGA __HTTP__ _E_ RT @paulsperry_: BREAKING: top FBI investigator for Mueller PETER STRZOK busted sending political text messages bashing Trump & praising... _E_ Response to the Des Moines Register __HTTP__ _E_ .@Zagat named Christmas Day Brunch @TrumpChicago @SixteenChicago one of the best in the city! #TrumpHolidays __HTTP__ _E_ Thank you Michigan! This is a MOVEMENT that will never be seen again it's our last chance to #DrainTheSwamp! Watch... __HTTP__ _E_ I am on @FoxNewsSunday with Chris Wallace his 20th year anniversary with #FNS throughout the day. Enjoy! __HTTP__ _E_ Press Conference Following National Security Briefing in Bedminster New Jersey. __HTTP__ __HTTP__ _E_ $6 gas is coming sooner than later. America must become energy independent with our own resources and fast.Also (cont) __HTTP__ _E_ It's Thursday and again I ask how much money is China stealing from us? _E_ Believe you can and you're halfway there. Pres. Theodore Roosevelt _E_ I believe in #AmericaFirst and that means FAMILY FIRST! 
My childcare plan reflects the needs of modern working clas... __HTTP__ _E_ In case you missed it my @gretawire interview on Obama's IRA rate cut hurting savings & economic growth __HTTP__ _E_ Trump Making GOP Speech — Is 2016 in the Cards? __HTTP__ via @Newsmax_Media _E_ All signs are that business is looking really good for next year only to be helped further by our Tax Cut Bill. Will be a great year for Companies and JOBS! Stock Market is poised for another year of SUCCESS! _E_ I wonder if the Rutgers coach who had the audacity to yell at the player is a proponent of global warming? _E_ Thank you @BillyJoel many friends just told me you gave a very kind shoutout at MSG. Appreciate it love your music! _E_ #TeamTrump. Police and law enforcement seem to have killed one of the California shooters and are in a shootout with the others. Go police _E_ I loved firing goofball atheist Penn @pennjillette on The Apprentice. He never had a chance. Wrote letter to me begging for forgiveness. _E_ Thank you Wilmington North Carolina!#MakeAmericaGreatAgain __HTTP__ _E_ We spent TWO TRILLION DOLLARS in Iraq and got NOTHING. Now we are going back and will again get NOTHING because our leaders are clueless! _E_ Will be on Hannity tonight. Rebroadcast of town hall from Pittsburgh PA. 8:00pm on FOX. Enjoy! #Trump2016 __HTTP__ _E_ I have self funded my winning primary campaign with an approx. $50 million loan. I have totally terminated the loan! _E_ Interesting read from Peggy Noonan. __HTTP__ _E_ Since stop & frisk was struck down gun shootings & victims have spiked while gun seizures have decreased. __HTTP__ _E_ RT @DoralResort: Thanks! RT @gem3wood: @DonaldJTrumpJr You guys @DoralResort have one hell of a leaderboard. Love this Tournament. _E_ Ted Cruz is incensed that I want to refocus NATO on terrorism as well as current mission but also want others to PAY FAIR SHARE a must! _E_ Moderator: Hillary plan calls for more regulation and more government spending. 
#Debate #BigLeagueTruth _E_ Joseph Kennedy is really being used by Venezuela and Hugo C. in oil commercial! _E_ So many lives and two trillion dollars wasted and our worst enemies will get the 2nd largest oil reserves in the World. Such stupid leaders _E_ Leaving for New Hampshire now. Making a speech—packed house. Love it! _E_ The Fake News is now complaining about my different types of back to back speeches. Well there was Afghanistan (somber) the big Rally..... _E_ A budget that puts #AmericaFirst must make safety its no. 1 priority—without safety there can be no prosperity: __HTTP__ _E_ Now that George Bush is campaigning for Jeb(!) is he fair game for questions about World Trade Center Iraq War and eco collapse? Careful! _E_ RT @EricTrump: Nevada remember you can Vote and Go walk in vote and walk out! Caucus locator: __HTTP__ #TrumpLV __HTTP__ _E_ Big news to share in New Hampshire tonight! Polls looking great! See you soon. _E_ Dopey Arianna @huffingtonpost is really after me boring story after boring story...but I hear she is in big trouble! _E_ China is a threat to America. They are not our friend. _E_ The Budget passed late last night 51 to 49. We got ZERO Democrat votes with only Rand Paul (he will vote for Tax Cuts) voting against..... _E_ The rally in Lowell Massachusetts was amazing. 10000 people going wild. MAKE AMERICA GREAT AGAIN! _E_ China keeps manipulating its currency at our financial expense. Why do our leaders continually let China run all over us? _E_ Puerto Rico survived the Hurricanes now a financial crisis looms largely of their own making. says Sharyl Attkisson. A total lack of..... _E_ Dying @GQMagazine just named me to a list. Too bad GQ is no longer relevant—won't be around long! _E_ Great meeting with CEOs of leading U.S. health insurance companies who provide great healthcare to the American peo... __HTTP__ _E_ This boardroom gets CRAZY! These people are wild _E_ Unemployment is plaguing both Black and Hispanic youths. 
Very troubling. _E_ I am happy to announce that the @PGAGrandSlam will be held at @TrumpGolfLA this year! __HTTP__ Follow @TrumpGolf for more! _E_ Awarded the renowned 5 Star @ForbesInspector rating the 65 story @TrumpTO brings style luxury & impeccable service __HTTP__ _E_ I'm protesting the @UnionLeader from having anything to do w/ ABC debate. Their unethical record doesn't give them the right to be involved! _E_ So nice thank you Laura. __HTTP__ _E_ .@antbaxter—Your documentary works better than any sleeping pill—in fact that may be your only way to make money with this recycled garbage! _E_ Derek get well soon the @Yankees need youl. _E_ Now @BarackObama is issuing regulatory demands to states ordering no firings in November __HTTP__ _E_ Thank you Delaware County Ohio! Remember either we WIN this election or we are going to LOSE this country!... __HTTP__ _E_ Thank you South Carolina! Everyone has to get out and VOTE on 11/8/16. #MakeAmericaGreatAgain... __HTTP__ _E_ Great day in D.C. with @SpeakerRyan and Republican leadership. Things working out really well! #Trump2016 __HTTP__ _E_ Obama has now had two record & historic midterm losses. There is Hope & Change for America. _E_ Poll numbers have nosedived for pervert NYC mayoral candidate Anthony Weiner good news for New York! _E_ Lightweight Schneiderman's suit was filed on a Saturday (unheard of) against a school with a 98% approval rating right after Obama meeting. _E_ Check out today's #trumpvlog about the upcoming episode of @ApprenticeNBC.... __HTTP__ #celebrityapprenticefinale _E_ Ted Cruz said he didn't know that he was a Canadian Citizen. He also FORGOT to file his Goldman Sachs Million $ loan papers.Not believable _E_ Democrats have shut down our government in the interests of their far left base. They don't want to do it but are powerless! _E_ .@Jimmyv3 @WWE Greatly appreciate your nice words re WrestleMania. That's why you are such a respected writer. 
_E_ In standing by @dennisrodman I was also representing many people who have addiction problems & are working hard to come back. _E_ Iran was on its last legs and ready to collapse until the U.S. came along and gave it a life line in the form of the Iran Deal: $150 billion _E_ The successful man will profit from his mistakes and try again in a different way. Dale Carnegie _E_ Thank you to the LGBT community! I will fight for you while Hillary brings in more people that will threaten your freedoms and beliefs. _E_ Follow me on Instagram __HTTP__ _E_ With that being said I have personally directed the fix to the unmasking process since taking office and today's vote is about foreign surveillance of foreign bad guys on foreign land. We need it! Get smart! _E_ #CelebApprentice Another exciting episode tune in next Monday at 8pm for 2 more new episodes! _E_ DONALD TRUMP BLASTS THE OSCARS __HTTP__ via @theblaze _E_ Early on Ted Cruz said that if he didn't win South Carolina it's over. He didn't win and lost to me in a landslide! _E_ It is a joke the amount of time that network news spends talking about the weather. No wonder their ratings are way down! Enough already. _E_ The Golden Rule of Negotiating: He who has the gold makes the rules. _E_ The dishonest media is fawning over the Democratic Convention. I wonder why then my speech had millions of more viewers than Crooked H? _E_ Like your current health care plan? Too bad you're going to lose it under ObamaCare. Hope Change & a 300% Increase in Your Premium. _E_ #AskTrump Getting ready to answer your questions. __HTTP__ _E_ Honored to host a luncheon for African leaders this afternoon. Great discussions on the challenges & opportunities facing our nations today. __HTTP__ _E_ .@THEGaryBusey and one of his Busey isms: "Art is only the search it is not the final form." #CelebApprentice _E_ Get it straight: Pakistan is not our friend. When our tremendous Navy SEALS took out Osama bin Laden they did... 
(cont) __HTTP__ _E_ Who did the House Task Force onUrgent Fiscal Issues call when America needed HELP? __HTTP__ _E_ I hope @billmaher pays quickly so that this money can immediately be given to the charities. _E_ Congratulations Kevin Gabriel on your amazing article. If I were a journalist this would be the next Watergate and I would be a star. _E_ He ruins the brand: @Robertgbeckel doesn't belong on @FoxNews . As CM for Mondale in '84 you lost 49 states. Sad! _E_ President @EmmanuelMacronThank you for the beautiful welcome ceremony at Les Invalides today! __HTTP__ _E_ Many many people are disappointed I didn't run third party but I won't risk @BarackObama benefiting from a split in the anti Obama vote. _E_ "America is the experiment that works." – President Ronald Reagan _E_ Stay on message is the chant. I always do trade jobs military vets 2nd A repeal Ocare borders etc but media misrepresents! _E_ It pays to have friends in high places like the Justice Department. Clearly the Clintons do. #DrainTheSwamp! __HTTP__ _E_ Entrepreneurs: There are no guarantees. But being ready sure beats being taken by surprise. Do your due diligence! _E_ Wow the two highest apartment rentals in all of 2013 were at Trump Park Avenue—each one = $100000 per month __HTTP__ _E_ I just sent @THEGaryBusey a check of $20000 for his charity Children's Kawasaki Disease . He worked hard and deserves it. _E_ I will be interviewed by @JudgeJeanine tonight on @FoxNews Enjoy! _E_ "Trump: Rove Gave Us Obama" __HTTP__ via @cnsnews _E_ "Integrity is the essence of everything successful." – Richard Buckminster Fuller _E_ Thank you @foxandfriends. Really great job and show! _E_ My @foxandfriends interview discussing how @BarackObama is running a hateful campaign & the @RNC convention 'Surprise' __HTTP__ _E_ Congratulations to @MariaBartiromo on her big move to @FoxBusiness. She is a total winner! _E_ Put this on your calendar: The Celebrity Apprentice live finale is this Sunday at 9 p.m. on NBC. 
Who will be the next Celebrity Apprentice? _E_ Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! __HTTP__ _E_ RT @hughhewitt: @realDonaldTrump I spoke to a group of influential CA GOPers tonight long time activists bundlers influencers. Support f... _E_ Great success in Iowa today. Fantastic sold out crowd. Will be back soon! _E_ Happy Easter to everyone! _E_ Dummy writer @tonyschwartz who wanted to do a second book with me for years (I said no) is now a hostile basket case who feels jilted! _E_ My @gretawire interview from last Friday discussing the unemployment numbers gas prices and acquiring the Doral __HTTP__ _E_ ObamaCare is a total disaster. Hillary Clinton wants to save it by making it even more expensive. Doesn't work I will REPEAL AND REPLACE! _E_ Surprise? 1970's global cooling alarmists were pushing same no growth liberal agenda as today's global warming __HTTP__ _E_ Coming up in March: The Comedy Central Roast of Donald Trump. March 15 mark your calendars. __HTTP__ _E_ Both Ted Cruz and John Kasich have no path to victory. They should both drop out of the race so that the Republican Party can unify! _E_ We will always take care of our GREAT VETERANS. You have shed your blood poured your love and bared your soul in... __HTTP__ _E_ Looking forward to speaking at #sparknb next week in Atlantic Canada my first time ever. _E_ .....but that's what I've been saying. Very unfair treatment by the media! _E_ .@JebBush has embarrassed himself & his family with his incompetent campaign for President. He should remain true to himself. _E_ It's disgraceful that the Obama Administration's first response was not to condemn attacks on our diplomatic (cont) __HTTP__ _E_ I will be signing copies of my new book TIME TO GET TOUGH tomorrow Dec 9th in Trump Tower from 11 a.m. to ... (cont) __HTTP__ _E_ ... It is time to get out and rebuild our own nation. 
_E_ We must repeal Obamacare and replace it with a much more competitive comprehensive affordable system. #debate #MAGA _E_ The Huffington Post is such a loser it will die just as AOL is dying What a stupid deal AOL made to buy it! _E_ A Rod is now being investigated for continued doping __HTTP__ @yankees have a great opportunity to dump him now. Go for it! _E_ .@DonaldJTrumpJr & his wife @MrsVanessaTrump attended the #SnowflakeGardenBrunch here w/ Governor @TerryBranstad. __HTTP__ _E_ Let @PeteRose in the HOF it's time! _E_ The opening of the @TigerWoods Villa at trumpdoral __HTTP__ _E_ Wow China exports rise 15% in September. They are laughing at USA! _E_ The mark of a great player is in his ability to come back. The great champions have all come back from defeat. Sam Snead _E_ #MakeAmericaGreatAgain! __HTTP__ _E_ Rand Paul or whoever votes against Hcare Bill will forever (future political campaigns) be known as the Republican who saved ObamaCare. _E_ Consumer prices rose in June due to OPEC __HTTP__ OPEC continues to rip off hard working American families daily. _E_ Great jobs report today It is all beginning to work! _E_ .@AndreaTantaros You are a true journalistic professional. I so agree with what you say. Keep up the great work! #MakeAmericaGreatAgain _E_ How quickly people forget that Crooked Hillary called African American youth SUPER PREDATORS Has she apologized? _E_ Don't believe the @FoxNews Polls they are just another phony hit job on me. I will beat Hillary Clinton easily in the General Election. _E_ Models! Remember to register for the Trump Model Search. Check out the info here: __HTTP__ @CadillacChamp _E_ It won't stay a buyer's market forever. If you can take advantage and buy property asap. You'll thank me! _E_ Small bright spot in lackluster economy travel industry added 81000 jobs in 2012 __HTTP__ Trump Org had a record year. 
_E_ Sadly when it comes to using the energy industry to create American jobs @BarackObama has been a total (cont) __HTTP__ _E_ .@MittRomney should continue to stay on offense on the embassy issue. Obama who put these radicals in power deserves blame. _E_ Thank you to a #Trump2016 supporter for this video of my campaign over the past 6 months. Video: __HTTP__ _E_ Can you believe Crooked Hillary said We are going to put a whole lot of coal miners&coal companies out of business. She then apologized. _E_ Great new poll thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ .@whitehouse continues to defend the billions it pissed away on 'green energy' failures __HTTP__ Your money was wasted. _E_ The debate was pretty even but I thought Mitt should have been much more aggressive on Obama's failed foreign policy and I mean much more. _E_ Standing on what will be the greatest golf course in the world. Opens July 10th. __HTTP__ _E_ So much for 'global warming.' Earth is cooling at a record pace __HTTP__ _E_ Only you can #SavetheQueen during the LIVE telecast of #MissUSA on June 8 at 8/7c on NBC. Click for more info: __HTTP__ _E_ An 'extremely credible source' has called my office and told me that @BarackObama bought his house with the help of Tony Rezko. _E_ #TBT My confirmation picture at First Presbyterian Church in Jamaica NY. __HTTP__ _E_ Join me today in Wilmington Ohio at 4pm: __HTTP__ Tampa Florida at 10am: __HTTP__ _E_ The #CNNDebate was amazing so much fun! __HTTP__ _E_ Buy American & hire American are the principals at the core of my agenda which is: JOBS JOBS JOBS! Thank you @exxonmobil. _E_ Obama has called @GOP terrorists during this showdown. It's a shame he really doesn't think it because then he would meet all @GOP demands. _E_ I loved being a surrogate on behalf of @MittRomney. I am glad I was able to help him win. _E_ .@JuliInkster Congratulations on your great win what a captain what a champion! 
_E_ Entrepreneurs: Don't put blinders on or limit yourself. Reach out seek and explore. The opportunities are always there. _E_ Everybody knows why Obama would not show his college applications they are just not willing to say! _E_ RT @EWErickson: Personally I think it is awesome that @realDonaldTrump listens to @DLoesch on the radio. She's awesome. _E_ Watch my appearance on @foxandfriends... __HTTP__ _E_ Make sure you realize that this 'deal' is only a stop gap measure.Obama will be looking to raise even more taxes in the coming negotiation.. _E_ LIVE on #Periscope: Major announcement! #MakeAmericaGreatAgain __HTTP__ _E_ Why would the people of Kentucky want a rookie Senator– they have Sen. Mitch @McConnellPress who may be next Leader & bring $'s to KY _E_ I've been visiting Trump Int'l Golf Links Scotland and the course will be unmatched anywhere in the world. Spectacular! __HTTP__ _E_ Watch my endorsement of @MittRomney. __HTTP__ _E_ Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN! _E_ Chuck Hagel showed gross incompetence before yesterdays Senate panel...our new Secretary of Defense. _E_ Hillary Clinton is unfit to be president. She has bad judgement poor leadership skills and a very bad and destructive track record. Change! _E_ See the new sizzle reel for The Apprentice __HTTP__ _E_ .....Has worst attendance record in Senate rarely there to vote on a bill! @marcorubio _E_ The great GENERALS MacArthur and Patton real leaders and fighters are spinning in their graves as we give Syria info & time to prepare. _E_ A great honor to host the @SuperBowl Champion New England @Patriots at the White House today. Congratulations!... __HTTP__ _E_ Getting China to stop playing its currency charades can begin whenever we elect a president ready to take decisive action. #TimeToGetTough _E_ Sad sack @JebBush has just done another ad on me with special interest money saying I won't beat Hillary I WILL. But he can't beat me. 
_E_ I will be interviewed on @oreillyfactor at 8:00 P.M. Enjoy! _E_ Do you think the 14 African nations that are banning West Africans from coming into their nations are being called racists? Perhaps not! _E_ Media desperate to distract from Clinton's anti 2A stance. I said pro 2A citizens must organize and get out vote to save our Constitution! _E_ .@McIlroyRory Way to go Rory fantastic victory! _E_ The amazing Trump National Golf Club Los Angeles. __HTTP__ _E_ That was really exciting. Made all of my points. MAKE AMERICA GREAT AGAIN! _E_ Derek Jeter @yankees wants to rent an apartment. Derek only in a Trump building Trump is lucky for you. _E_ Tweet me your questions to answer. #trumpvlog _E_ I will be on the Mike & Mike Show on radio and ESPN at about 6 to 7 A.M. We will be talking Super Bowl and sports no Obama Care! _E_ Joined the @HouseGOP Conference this morning at the U.S. Capitol. __HTTP__ #PassTheBill #MAGA... __HTTP__ _E_ Volunteer to be a Trump Election Observer. Sign up today!#MakeAmericaGreatAgain __HTTP__ _E_ .@AnnCoulter's new book 'In Trump We Trust comes out tomorrow. People are saying it's terrific knowing Ann I am sure it is! _E_ I like Rob Astorino. He's a friend and really good guy. Sadly he has ZERO chance of beating Cuomo and the 2 to 1 Dems for governor! _E_ I will miss @Letterman & doing his show. He was always intriguing & smart. You never knew what would happen but he was fair! _E_ Was going to do a phoner this morning with @jaketapper on @CNN but they could not get their phone equipment to hook in. Will do next week. _E_ Help those affected by #Sandy. @TrumpSoHo is giving $10 per booking made by 11/23 to @RedCross for #sandyrelief. __HTTP__ _E_ Use your intelligence and your education to execute what your imagination presents to you. This is one step to becoming an entrepreneur. _E_ .@washingtonpost thinks @IvankaTrump is What Washington's Social Scene Needs __HTTP__ Truth is she's amazing. 
_E_ Every penny of the $7 billion going to Africa as per Obama will be stolen corruption is rampant! _E_ Obama's Def. Sec. just said US Asia focus 'not aimed to contain China' __HTTP__ China is hoping that Obama is re elected. _E_ Via @politico: Donald Trump to get more CPAC time than Marco Rubio __HTTP__ @CPACnews knows how to prioritize! _E_ How can an Attorney General ask for campaign contributions during his evaluation of a case a total sleazebag! _E_ Via @freep: Trump to speak to GOP __HTTP__ _E_ Rep. Steve Scalise of Louisiana a true friend and patriot was badly injured but will fully recover. Our thoughts and prayers are with him. _E_ Great write up on @thedailymeal about our new Executive Sous Chef Sydney Jones @TrumpLasVegas: __HTTP__ _E_ Look how bad it is getting! How much more crime how many more shootings will it take for African Americans and Latinos to vote Trump=SAFE! _E_ The same people who did the phony election polls and were so wrong are now doing approval rating polls. They are rigged just like before. _E_ .@willweatherford @FLGovScott Gaming in Miami will be incredible—best in world and create lots of jobs and revenue. _E_ "Be flexibly focused. Focus does not mean being narrow minded or rigid." – Think Big _E_ It's going to take an outsider to clean up after Clinton Bush and Obama. Let's Make America Great Again! __HTTP__ _E_ Can you imagine if Obama had to give today's press conference before the election? He would have lost. @GOP really blew it. _E_ Everyone is saying the bad news is that Donald Trump is going to take credit & they are right—Mitt wouldn't have won anyway. _E_ #ICYMI: #Trump2016 closing speech inBuffalo New York!#VoteTrumpNY  __HTTP__ _E_ I want to #MakeAmericaGreatAgain __HTTP__ _E_ IPSOS/REUTERS POLLThank you! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ It is so imperative that we have the right justices. 
#DrainTheSwamp #Debates #BigLeagueTruth __HTTP__ _E_ When are we going to wake up and realize that we are funding our enemies? #TimeToGetTough _E_ .@kevinolearytv Great job on @foxandfriends this morning. You tell it like it is! Also thx for the nice mention. Your book sounds great! _E_ Just watched the totally biased and fake news reports of the so called Russia story on NBC and ABC. Such dishonesty! _E_ Entrepreneurs: It's often to your advantage to be underestimated. _E_ Great news on the 2018 budget @SenateMajLdr McConnell first step toward delivering MASSIVE tax cuts for the American people! #TaxReform __HTTP__ _E_ .@FrankLuntz knows nothing about me or my religion. Came to my office looking for work. I had NO interest. I will save the vets! _E_ Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet __HTTP__ _E_ CPAC 2013: Donald Trump: Immigration reform is a 'suicide mission' for GOP __HTTP__ by @SethMcLaughlin1 _E_ Just as I predicted immigration reform will increase the cost of ObamaCare over $300B __HTTP__ More money borrowed from China. _E_ If anybody else but Coore and Crenshaw designed Pinehurst they would be run out of town—(and the turtleback greens are totally unfair)! _E_ John Kasich was managing director of Lehman Brothers when it crashed bringing down the world and ruining people's lives. A total failure! _E_ The military and Navy Seals should be given more credit for Bin Laden's death not Obama who works hard to take (cont) __HTTP__ _E_ #MakeAmericaGreatAgain! __HTTP__ _E_ The 18th hole at the Blue Monster @Doral in Miami is considered the toughest finishing hole in golf... __HTTP__ _E_ Via @washingtonpost: The Donald's video should have trumped Eastwood by @CapehartJ __HTTP__ _E_ and fair elections. 
We've accepted the outcomes when we may not have liked them and that is what must be expected of anyone standing on a _E_ So since the people at the @nytimes have made all bad decisions over the last decade why do people care what they write. Incompetent! _E_ .@youngmman @realDonaldTrump Conrad Hilton was a great man but Barron Hilton is a dope. Wrong on Barron! _E_ Ted Cruz talks about the Constitution but doesn't say that if the Dems win the Presidency the new JUSTICES appointed will destroy us all! _E_ I believe America can be great only with proper leadership. _E_ "Chalk failure up to experience don't take it personally and go find your next challenge." – Trump: Never Give Up _E_ Aubrey has a lot of self confidence—but will it be warranted? #sweepstweet _E_ For entrepreneurs ignorance is not bliss. It's fatal. It's costly. And it's for losers. You either get organized or get crushed. _E_ The Arab Spring has turned into the Islamist Winter. Our ally @Israel is in a perilous position. We must stand behind @Israel. _E_ The Mullahs are laughing at what they think is a very stupid president@BarackObama has asked for Iran to return the drone #TimeToGetTough _E_ Negotiation is an art. Treat it like one. Be open to change it's another word for innovation. _E_ The new selection of ties shirts and suits at Macy's is amazing also available in Trump Tower lobby. _E_ Melania and I extend our warmest greetings to those observing Rosh Hashanah here in the United States in Israel and around the world. _E_ In order to save Medicare and stop record premium increases we must repeal ObamaCare. _E_ Getting ready to visit Walter Reed Medical Center with Melania. Looking forward to seeing our bravest and greatest Americans! _E_ The Budget Agreement today is so important for our great Military. It ends the dangerous sequester and gives Secretary Mattis what he needs to keep America Great. Republicans and Democrats must support our troops and support this Bill! 
_E_ Any deal on DACA that does not include STRONG border security and the desperately needed WALL is a total waste of time. March 5th is rapidly approaching and the Dems seem not to care about DACA. Make a deal! _E_ While under no obligation to do so I have raised between 5 & 6 million dollars including 1million dollars from me for our VETERANS. Nice! _E_ Senator Luther Strange who is doing a great job for the people of Alabama will be on @foxandfriends at 7:15. Tough on crime borders etc. _E_ Great new polls! Thank you Nevada North Carolina & Ohio. Join the MOVEMENT today & lets #MAGA!... __HTTP__ _E_ Watching @trishstratuscom get inducted from the sold out crowd. #WWEHOF. __HTTP__ _E_ I'll be on @foxandfriends on Monday at 7:30 AM. Always interesting. Tune in! _E_ RT @WhiteHouse: Today @POTUS will welcome the Prime Minister of India @narendramodi to the White House. __HTTP__ _E_ Very honored: Trump Is Tops As Clinton Drops In Connecticut Primaries Quinnipiac University Poll Finds __HTTP__ _E_ "We are fully supportive of @Israel's right to defend itself." @BarackObama Very good I like it. _E_ Congratulations to @TheSlyStallone and Arnold @Schwarzenegger on 'Expendables 2' #1 box office opening. Still going strong! _E_ Thank you Kentucky! #Trump2016#SuperSaturday _E_ What a statesman! @BarackObama made sure to quickly call the Muslim Brotherhood victor to congratulate him on (cont) __HTTP__ _E_ I wonder if traitor Edward Snowden will be attending the Miss Universe Pageant in Moscow on November 9th. _E_ So @ReutersPolitics claims that @MittRomney's birth certificate evokes 'controversy' __HTTP__ Where (cont) __HTTP__ _E_ The failing @nytimes writes false story after false story about me. They don't even call to verify the facts of a story. A Fake News Joke! _E_ I hope everyone or rather almost everyone had a GREAT EASTER! 
We need our leaders to make great and wise decisions in these troubled times _E_ While @BarackObama seeks to further destroy our credit our economy continues to hemorrhage jobs. Such a total failure as a President. _E_ Big day planned on NATIONAL SECURITY tomorrow. Among many other things we will build the wall! _E_ Success seems to be connected with action. Successful people keep moving. They make mistakes but they don't quit. Conrad Hilton _E_ Crooked Hillary will NEVER be able to solve the problems of poverty education and safety within the African American & Hispanic communities _E_ "@HoganSeaisle129: @realDonaldTrump who who who ... Say it just say it #CelebApprentice" Watch and see what happens! _E_ Thank you to Carmen Yulin Cruz the Mayor of San Juan for your kind words on FEMA etc.We are working hard. Much food and water there/on way _E_ #FoxNews Poll THANK YOU!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Terrible tragedy at the Empire State Building today. Must have fast trials and death penalty for the animals. _E_ Happy 30th Birthday #Ghostbusters! It was great to have @TrumpTowerNY be a part of the series. __HTTP__ _E_ If you're passionate about your work you will never give up. _E_ Our online campaign store is open! Visit __HTTP__ for #MakeAmericaGreatAgain merchandise including my signature hat! _E_ With today's struggling job numbers it is clear that there is one choice this November. @MittRomney can turn the economy around. _E_ Seth Meyers was terrible co hosting with Kelly. Marbles in his mouth & he must stop picking at his hands insult to the great Regis Philbin! _E_ Drop A Rod in the order and cut his salary based on unreported drug use. Also not a pressure player. _E_ .@limbaugh is right. Watergate is much different than Benghazi. No one died in Watergate. _E_ "If the freedom of speech is taken away then dumb and silent we may be led like sheep to the slaughter." 
George Washington _E_ The Ted Cruz wiseguy apology to the people of New York is a disgrace. Remember his wife's employer and his lender is located there! _E_ We are using the absolute wrong negotiating technique with respect to the Iran nuclear talks. Strengthen sanctions until GREAT deal is made! _E_ "Talent wins games but teamwork and intelligence wins championships." – Michael Jordan @Jumpman23 _E_ At least 7 dead and 48 wounded in terror attack and Mayor of London says there is no reason to be alarmed! _E_ Congratulations to @joniernst on delivering a strong conservative message in her #SOTU response. Joni will be a great senator. _E_ Amazing... @VattenfallGroup tried to destroy Aberdeen. _E_ Entrepreneurs: Winners see problems as just another way to prove themselves. Remember to focus on the solution not the problem. _E_ Great meeting with the @RepublicanStudy Committee this morning at the @WhiteHouse! __HTTP__ _E_ Lyin' Ted Cruz can't get votes (I am millions ahead of him) so he has to get his delegates from the Republican bosses. It won't work! _E_ The United States needs the security of the Wall on the Southern Border which must be part of any DACA approval. The safety and security of our country is #1! __HTTP__ _E_ John Lewis said about my inauguration It will be the first one that I've missed. WRONG (or lie)! He boycotted Bush 43 also because he... _E_ For those that constantly say that "global warming" is now "climate change"—they changed the name. The name global warming wasn't working _E_ When I broadly proclaimed Mitt "choked" – and would do it again—everybody said yeah he's right. _E_ 'True blue collar billionaire Donald Trump shows Hillary Clinton is out of touch' __HTTP__ _E_ Remember what I previously said Obama will someday attack Iran in order to show how tough he is. _E_ The failing @nytimes talks about anonymous sources and meetings that never happened. Their reporting is fiction. The media protects Hillary! 
_E_ I will be interviewed by @TheBrodyFile on @CBNNews tonight at 11pm. Enjoy! _E_ I have been a guest on The View many times when it was successful show. Now the show is dying for lack of ratings. Too bad! _E_ I'm not going to be watching much NFL football anymore. Too time consuming too boring too many flags and too soft. Focus on other things! _E_ I just had to fire someone he didn't have a clue he reminded me of Obama on Wednesday night. _E_ Put the glamour beauty & mystery back in the Oscars and the ratings will zoom. Also & most importantly the Oscars need credibility. _E_ New York Yankees President Randy Levine: 'End of the Republican Party' If Donald Trump Not Nominated. __HTTP__ _E_ Thank you for your support last night Iowa! #VoteTrump #Trump2016 #IACaucus #FITN #IAPolitics __HTTP__ _E_ #TrumpVine Where is the money @MacMiller? __HTTP__ _E_ Prior to the end of the year I will be traveling to Israel. I am very much looking forward to it. _E_ Kirsten Powers: Anti Trump Operative was Aggressively Shopping Cruz Story via the Gateway Pundit: __HTTP__ _E_ Wow just in John Beale the top person in government on climate change (EPA) is a total fraud and just admitted it! What can they say now _E_ Jeb Bush just talked about my border proposal to build a fence. It's not a fence Jeb it's a WALL and there's a BIG difference! _E_ It appears that @THEGaryBusey is entranced with @MELANIATRUMP and rightly so! #CelebApprentice _E_ On June 1st. near Washington D.C. I will be opening the greatest championship golf course in the U.S. All holes front on the Potomac River _E_ #ICYMI: John Podesta's Brother Pocketed $180000 from Putin's Uranium Company: __HTTP__ __HTTP__ _E_ The dying NY Daily News put out a false report about my kids not wanting me to criticize Obama...totally false! _E_ .@NikWallenda #Skywire As much credit as he's been given he wasn't given enough credit for his incredible feat over Grand Canyon. _E_ Our next President must stop China's Rip off of America. 
_E_ .@NBCNews is bad but Saturday Night Live is the worst of NBC. Not funny cast is terrible always a complete hit job. Really bad television! _E_ New The Next Generation videos @donaldjtrumpjr __HTTP__ @ivankatrump __HTTP__ @erictrump __HTTP__ _E_ Via @DMRegister by @newsmanone: "'Moon cycle' can't defeat @ShawnJohnson on @ApprenticeNBC" __HTTP__ _E_ Happy Anniversary to my wonderful wife @MELANIATRUMP a truly great decision by me! __HTTP__ _E_ Via @globegazette by @GGMaryP: "Trump: We'll bring American dream back" __HTTP__ _E_ RT @realDonaldTrump: #USA #Japan __HTTP__ _E_ Just returned to Bedminster NJ from Camp David. GREAT meeting on National Security the Border and the Military! #MAGA __HTTP__ _E_ Get rich quick! Crooked Hillary Clinton's pay to play guide: __HTTP__ _E_ Only a fool would believe that the meeting between Bill Clinton and the U.S.A.G. was not arranged or that Crooked Hillary did not know. _E_ A political commentator for @cnn which I no longer watch said Trump showed some weakness in the Repub Primaries. I set all time record! _E_ Loved Dallas and the tremendous crowd last night. Will be back! _E_ Via @CNNMoney: Donald Trump gets into crowdfunding __HTTP__ #FundAnything _E_ .@megynkelly Will be on Fox now. Watch and enjoy! _E_ Do you think @THEGaryBusey will be able to "step up" as PM? I know @lisarinna is hoping so. #CelebApprentice _E_ No surprise Assad is not destroying his chemical weapons. He never intended to in the first place. _E_ ....And it will get even better with Tax Cuts! __HTTP__ _E_ TRUMP'S BIG ANNOUNCEMENT: HE'LL GIVE $5 MILLION TO CHARITY OF OBAMA'S CHOICE IF... __HTTP__ By @billyhallowell @theblaze _E_ I will be interviewed by @jessebwatters on @oreillyfactor tonight at 8pm. Enjoy! _E_ Putin is not feeling too nervous or scared. #DemDebate _E_ The media has not reported that the National Debt in my first month went down by $12 billion vs a $200 billion increase in Obama first mo. 
_E_ When I left Conference Room for short meetings with Japan and other countries I asked Ivanka to hold seat. Very standard. Angela M agrees! _E_ Will be joining @GovMikeHuckabee tonight at 8pmE on @TBN. Enjoy! __HTTP__ _E_ Today's groundbreaking at the Old Post Office Building in D.C. was amazing. Great people great dedication. @usgsa __HTTP__ _E_ As I have said the Tea Party is alive and well and fighting hard for the USA. BIG WIN TODAY! _E_ I will be doing the @TODAYshow with my wife Melania and the rest of my family in a major Town Hall. Hopefully it will be fun! Enjoy.7A.M. _E_ Thank you. __HTTP__ _E_ Will be on Bill O'Reilly @oreillyfactor tonight at 8 PM. Enjoy! _E_ I hope that Derek Jeter has such a fantastic year with @Yankees that he changes his mind about retiring. Great guy! _E_ With fellow inductees in front of the sold out crowd at MSG. #WWEHOF __HTTP__ _E_ RT @RealEagleBites: @realDonaldTrump It is the height of hypocrisy. Obama and Clinton in effect gave nuclear weapons to North Korea by thei... _E_ Entrepreneurs: Be prepared and be tough. Cover your bases! There are a lot of ups and downs but you can ride them out if you're prepared. _E_ ... and many others. Drop to your knees Sugar and say thank you Mr. Trump. _E_ .@BretBaier I will be interviewed by Bret (on Fox) tonight at 6:00. Watch it will be good! _E_ .@BreitbartNews is much smarter than sleepy eyes @chucktodd @nbc __HTTP__ Thanks to Steve Bannon & real reporters. _E_ I wonder what the great generals like Patton the big M or Robert E. LEE would have thought about our stupid broadcasting of an attack? _E_ Jodi Arias jury is having a hard time with the death penalty judge just sent them back for further deliberatuon. _E_ The unbiased reporters and attendees said mine was the best and most well received speech at CPAC THANK YOU! _E_ Congratulations to Mitt Romney. He was not only good he was absolutely fantastic tonight! _E_ Government needs to stop pick pocketing your wallet. 
Every time it does it slows growth and kills jobs. It's (cont) __HTTP__ _E_ Our Southern border is unsecure. I am the only one that can fix it nobody else has the guts to even talk about it. __HTTP__ _E_ He @FLGovScott handled the Zimmerman matter very well. I am glad to see there will be a trial. Justice. Now let's wait for a fair trial. _E_ I do not understand how so many of my Jewish friends backed Obama in the last election. He is a TOTAL DISASTER FOR ISRAEL AND ALWAYS WILL BE _E_ True! __HTTP__ _E_ I had a great day campaigning in Connecticut. Looking for a big vote on Tuesday! _E_ Wrong! Under @BarackObama's watch @Israel is not being invited to NATO summit in Chicago this month __HTTP__ _E_ My @FoxNews interview with @gretawire discussing the 2012 GOP primary and ObamaCare's attack on the Catholic Church. __HTTP__ _E_ Minorities Line Up Behind Donald Trump #Trump2016 __HTTP__ _E_ Does everyone see that the Democrats and President Obama are now because of me starting to deport people who are here illegally. Politics! _E_ Illegal immigrant children non Mexicans surge across border at record rate __HTTP__ _E_ The stimulus is a net negative effect on the growth of GDP over 10 years as admitted by @BarackObama's own CBO __HTTP__ _E_ Via @si_golf: "Donald Trump Rory McIlroy and Matt Kuchar are guys to watch at @DoralResort" __HTTP__ @CadillacChamp _E_ Don't find fault. Find a remedy. Henry Ford _E_ What a great night. Thank you South Carolina a special place with truly amazing people! LOVE _E_ Will be back soon Virginia. We are going to MAKE AMERICA GREAT AGAIN! #TrumpPence16 __HTTP__ _E_ #IACaucus #CaucusForTrump#iCaucused #iVoted __HTTP__ _E_ There is nothing @TrumpSoHo did not think about for the holidays @RobbReport dives in: __HTTP__ _E_ Why does@ Bill O'Reilly keep putting Karl Rove on his show a total waste of time. Rove spent $400 000 000 and didn't win a race pathetic! 
_E_ RT @CPACnews: ACU Announces @realDonaldTrump will be a featured speaker at #CPAC2013! Get tickets today at __HTTP__ _E_ I will be interviewed by @LouDobbs tonight on @FoxBusiness 7pm ET _E_ .@JebBush is a sad case. A total embarrassment to both himself and his family he just announced he will continue to spend on Trump hit ads! _E_ Can @pennjillette @lisarinna and @THEGaryBusey continue to co exist? Find out on this Sunday's Celebrity All Star @ApprenticeNBC. _E_ ObamaCare is an absolute disaster which will destroy 16% of the economy and ultimately more! _E_ It is actually hard to believe how naive (or dumb) the Failing @nytimes is when it comes to foreign policy...weak and ineffective! _E_ Wow @CNN Town Hall questions were given to Crooked Hillary Clinton in advance of big debates against Bernie Sanders. Hillary & CNN FRAUD! _E_ #NeverForget __HTTP__ _E_ I look forward to my press conference @TrumpTurnberry Scotland this Wednesday lots of great people attending. _E_ Happy Easter to all have a great day! _E_ People ask me every day to pose for pictures but the camera never works the first time they are never prepared or maybe just very nervous! _E_ Don't worry THE UNITED STATES WILL BE GREAT AGAIN! _E_ Fines and penalties against Wells Fargo Bank for their bad acts against their customers and others will not be dropped as has incorrectly been reported but will be pursued and if anything substantially increased. I will cut Regs but make penalties severe when caught cheating! _E_ Congratulations to the 7 @TrumpCollection properties who made @USNewsTravel's Best Hotels List: __HTTP__ _E_ Life always presents new opportunities you would never expect. I hosted @WrestleMania & then I starred in one which sold most PPVs. _E_ Watch @BarackObama admit Obamacare is a TAX __HTTP__ The GOP must continue to Disrupt Dismantle & Repeal! _E_ The U.S. 
under my administration is completely rebuilding its military and they're spending hundreds of billions of dollars to the newest and finest military equipment anywhere in the world being built right now. I want peace through strength! __HTTP__ _E_ Romney Ryan Slam Obama Administration on China Currency Manipulation __HTTP__ via @ABC _E_ Just learned that Jon @Ossoff who is running for Congress in Georgia doesn't even live in the district. Republicans get out and vote! _E_ I settled the Trump University lawsuit for a small fraction of the potential award because as President I have to focus on our country. _E_ Are Republicans suicidal? Now they want to push amnesty through Congress. Allowing Democrats into the country. _E_ President Obama campaigned hard (and personally) in the very important swing states and lost.The voters wanted to MAKE AMERICA GREAT AGAIN! _E_ Republicans should just REPEAL failing ObamaCare now & work on a new Healthcare Plan that will start from a clean slate. Dems will join in! _E_ If Jon Stewart is so above it all & legit why did he change his name from Jonathan Leibowitz? He should be proud of his heritage! _E_ I'll be on THe Willis Report @GerriWillisFBN today at 5 pm EST _E_ Via @fitsnews by @TaylahhKane: Donald Trump's Refreshing Lack Of A Filter __HTTP__ _E_ Remember Celebrity Apprentice tonight on CNBC at 9. Amazing episode watch Omarosa get the boot! _E_ Entrepreneurs: Set your mind on winning and losing won't have a chance. See yourself as victorious! _E_ It's about time Italy recognized the innocence of @AmandaKnox great news! _E_ Journalists shower Hillary Clinton with campaign cash __HTTP__ __HTTP__ _E_ Under @BarackObama 5 major banks now control 56% of economy from 43% in 2007 __HTTP__ Another catastrophe is brewing. _E_ Just watched recap of #CrookedHillary's speech. Very short and lies. She is the only one fear mongering! _E_ WOW! 
I just heard that the previously unknown singer Mac Miller has received over 67 million hits on his song Donald Trump. _E_ Via @washingtonpost by @OConnellPostbiz:"Bidding to stay at Trump's hotel for '17 inauguration?Pick the next POTUS. __HTTP__ _E_ Thank you @RepLouBarletta! __HTTP__ __HTTP__ _E_ This is the definition of ransom ⬇ __HTTP__ _E_ My representatives had a great meeting w/ the Hispanic Chamber of Commerce at the WH today. Look forward to tremendous growth & future mtgs! _E_ Video of my day at The Old Post Office soon to be the most fabulous hotel! __HTTP__ _E_ "When you're at a meeting monitor your behavior and work at being an observer – of yourself and of others." – Think Like a Billionaire _E_ Some great quotes from the legendary and courageous Winston Churchill: Never never never give up. ... _E_ #SuperBowl Vote for me and @CENTURY21 __HTTP__ _E_ Dangerous The USC ObamaCare ruling means the government can now tax you for inactivity. _E_ Sometimes by losing a battle you find a new way to win the war. Don't ever get down on yourself just keep fighting in the end you WIN! _E_ Yesterday was @BarackObama's favorite day of the year he collects our taxes to redistribute. _E_ Going to North Carolina to make keynote speech sold out crowd! _E_ Wow I hear @Morning_Joe has gone really hostile ever since I said I won't do or watch the show anymore.They misrepresent my positions! _E_ While all agree the U. S. President has the complete power to pardon why think of that when only crime so far is LEAKS against us.FAKE NEWS _E_ Phylis Schlafly: 'Marco Rubio Betrayed Us All' __HTTP__ _E_ Getting rdy to leave for France @ the invitation of President Macron to celebrate & honor Bastille Day and 100yrs since U.S. entry into WWI. _E_ Six hours left to #VoteTrump Connecticut! __HTTP__ _E_ Patience is the greatest of all virtues. Cato _E_ One of the best moves I ever made was staying out of last decade's artificial real estate boom. 
But I used the downturn to my advantage. _E_ True! __HTTP__ _E_ I'll be interviewed by Greta Van Susteren @Gretawire tonight at 10 pm ET on Fox. _E_ Must read @AmSpec article by Jeffrey Lord: "The Ruling Class Liberty Medal" __HTTP__ _E_ Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_ Reuters just announced that Secret Service never spoke to me or my campaign. Made up story by @CNN is a hoax. Totally dishonest. _E_ Donald E. Ballard on behalf of the people of the United States THANK YOU for your courageous service. YOU INSPIRE US ALL! #ALConv2017 __HTTP__ _E_ Thanks. __HTTP__ _E_ Now the Chinese are hacking @nytimes __HTTP__ & Twitter __HTTP__ When will we hold these thieves accountable? _E_ .@Richard_Meier a highly overrated architect has had many problems with buildings he has designed. _E_ With great patriots in Mason City who also want to bring the American Dream back! We can Make America Great Again! __HTTP__ _E_ Obama talks about what he is going to do why the hell didn't he just do it especially in the first 2 years when he had all votes necessary _E_ Thank you Las Vegas Review Journal! EDITORIAL: 'Donald Trump for president' __HTTP__ via @reviewjournal _E_ Pat Buchanan gave a fantastic interview this morning on @CNN way to go Pat way ahead of your time! _E_ Wow interview released by Wikileakes shows quid pro quo in Crooked Hillary e mail probe.Such a dishonest person & Paul Ryan does zilch! _E_ .@BarbaraJWalters called my office to ask me to do election night coverage with her sadly I won't be able to do it. _E_ The US is always getting ripped off! China gets cheap oil from Iran and Iraq as US pays for Hormuz Patrols to (cont) __HTTP__ _E_ Why is @MittRomney the only guy who talks about getting tough with China and their currency manipulation? _E_ Entrepreneurs: Keep your focus and keep your momentum. Believe in yourself if you don't no one else will either. 
_E_ A spectacular lake front club w/ dramatic course designed by @SharkGregNorman @Trump_Charlotte is NC's top club __HTTP__ _E_ Thank you to @AmSpec & Jeffrey Lord for the lovely article "Governor Trump? The conservative Nelson Rockefeller." __HTTP__ _E_ Congratulations to Martin Kaymer for winning the 2014 #USOpen. #USGA Great playing from beginning to end! _E_ My @CNN interview with @wolfblitzercnn yesterday discussing by meeting with @MittRomney __HTTP__ _E_ "As you go through life you've got to see the valleys as well as the peaks." – Neil Young _E_ The SECRET meeting between Bill Clinton and the U.S.A.G. in back of closed plane was heightened with FBI shouting go away no pictures. _E_ Marco Rubio is a member of the Gang Of Eight or very weak on stopping illegal immigration. Only changed when poll numbers crashed. _E_ This is a time for big ideas. This is a time for real reform for a real recovery. @PaulRyanVP _E_ Just received a standing ovation at #NCGOPCon when I said We need to bring the American Dream back better and stronger than ever before! _E_ The Trump Organization continues to expand internationally at a record pace. Many new announcements to come soon. _E_ Thank you Louisville Kentucky. Together we will MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ _E_ One of the big problems facing Atlantic City are the ridiculously high real estate taxes which I fought for years before leaving.Corruption! _E_ RT @benshapiro: Pope on Trump: A person who thinks only about building walls...is not Christian. This is Vatican City. __HTTP__ _E_ .@Mayor_Nutter of Philadelphia who is doing a terrible job should be ashamed for using such a disgusting word in referring to me.Low life! _E_ I am supportive of Lamar as a person & also of the process but I can never support bailing out ins co's who have made a fortune w/ O'Care. _E_ The only Forbes 5 Star & 5 Diamond hotel with a 5 Star & 5 Diamond restaurant @TrumpNewYork offers elite luxury __HTTP__ _E_ Age wrinkles the body. 
Quitting wrinkles the soul. General Douglas MacArthur _E_ Stopped by @TrumpDC to thank all of the tremendous men & women for their hard work! __HTTP__ _E_ Hillary just said that she will not use the term radical Islamic but was incapable of saying why. She is afraid of Obama & the e mails! _E_ Obama said he never met his uncle Oscar who was arrested for whatever. Turns out he lived with his uncle in Boston. SO MANY LIES! _E_ Watch a powerful and frank interview with Donald Trump about the economy on Greta Van Susteren's On The Record: __HTTP__ _E_ America is proud to stand shoulder to shoulder w/a free & ind UK. We stand together as friends as allies & as a people w/a shared history. _E_ The invention of email has proven to be a very bad thing for Crooked Hillary in that it has proven her to be both incompetent and a liar! _E_ At the debate the President kept talking of what he is going to do. I kept saying why didn't he do it? He lost me a long time ago. _E_ I didn't start the fight with Lyin'Ted Cruz over the GQ cover pic of Melania he did. He knew the PAC was putting it out hence Lyin' Ted! _E_ Pres. O a bump in the road in reference to our Ambassador's (and others) killing in Libya _E_ 'Trump lays out policies for first 100 days in White House' __HTTP__ _E_ Today both @BarackObama and @MittRomney are giving speeches on their economic policies in Ohio. The choice is (cont) __HTTP__ _E_ ...owed to Wall Street and the banks which sadly must be dealt with. Food water and medical are top priorities and doing well. #FEMA _E_ Why did the failing @nytimes refuse to use any of the names given to them that I was so proud to have helped with their careers. DISHONEST! _E_ I love @LibertyUniversity such great people! _E_ ...(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong).To bad the Dems have no one who can change tones! _E_ 'U.S. 
Consumer Comfort Just Reached Its Highest Level in a Decade' __HTTP__ __HTTP__ _E_ Trump Nat'l Jupiter's @jacknicklaus designed course is a challenging & innovative 7531 yds w/special features __HTTP__ _E_ Does anybody think that @CNBC will get their fictitious polling numbers corrected sometime prior to the start of the debate. Sad! _E_ Happy to have @ralphreed and the FFC's endorsement of the Newsmax @iontv debate. FFC is a great organization. _E_ If the Senate Democrats ever got the chance they would switch to a 51 majority vote in first minute. They are laughing at R's. MAKE CHANGE! _E_ I will bring jobs back and get wages up. People haven't had a real wage increase in almost twenty years. Clinton killed jobs! _E_ Will be interviewed by @StephenAtHome tonight by phone a late show first @CBS @colbertlateshow. Enjoy! #Colbert #LSSC _E_ How dumb is our president to send thousands of poorly trained and ill equipped soldiers over to West Africa to fight Ebola. Stop all flights _E_ Thank you Attorney General Gonzales so many people feel this way. __HTTP__ _E_ While I'm beating my opponents in the polls I'm also beating lobbyists special interests & donors that are supporting them with billions. _E_ I told you whenever I go to a @Yankees game the @Yankees win. _E_ Must read article: "Conservative Fury at Rove Erupts" __HTTP__ By Jeffrey Lord @AmSpec _E_ Gov. Gary Johnson pulling votes from @MittRomney Don't waste your vote. Obama must go! _E_ "We have a system that increasingly taxes work and subsidizes nonwork." Milton Friedman _E_ Via @BreitbartNews: "EXCLUSIVE TRUMP COUNSEL 'CANNOT CONFIRM OR DENY' INTEREST IN BUYING NEW YORK TIMES" __HTTP__ _E_ 62 years ago this week a brave seamstress in Montgomery Alabama uttered one word that changed history... __HTTP__ _E_ My friend Ronald Kessler explains in @washingtonpost that Secret Service problems are much bigger than prostitutes __HTTP__ _E_ Big poll comes out today on Face The Nation at 10:30 on @CBSNews. 
_E_ 26000 sexual assaults in the military last year way up from previous years. Armed Forces are in total turmoil! _E_ Just landed in the Philippines after a great day of meetings and events in Hanoi Vietnam! _E_ Keep up the GREAT work. I am with you 100%! ISIS is losing its grip... Army Colonel Ryan DillonCJTF–OIR __HTTP__ __HTTP__ _E_ President Obama I have an idea! Pretend that West Africa is Israel and then you will be able to stop the Ebola area flights. _E_ Don't tread water. Get out there and go for it. There's nothing wrong with bringing your talents to the surface. _E_ My @KWWL int. from @WartburgCollege discussing how politicians have failed us & Making America Great Again! __HTTP__ _E_ 'Hillary Clinton Had Gun Control Supporters Planted In Town Hall Audience' __HTTP__ _E_ Bernie Sanders gave Hillary the Dem nomination when he gave up on the e mails. That issue has only gotten bigger! _E_ With Spitzer & Anthony Weiner running for office New York is pervert central! Pathetic _E_ Join me tomorrow in Sanford or Tallahassee Florida!Sanford at 3pm: __HTTP__ at 6pm: __HTTP__ _E_ Join me tonight in Fayetteville North Carolina at 7pm! #ThankYouTour2016 Tickets: __HTTP__ __HTTP__ _E_ My warmest condolences and sympathies to the victims and families of the terrible Las Vegas shooting. God bless you! _E_ Ed Gillespie will be a great Governor of Virginia. His opponent doesn't even show up to meetings/work and will be VERY weak on crime! _E_ My @foxandfriends interview duscussing my meeting with @newtgingrich the Newsmax @iontv debate and #TimeToGetTough __HTTP__ _E_ Unemployment is down to 4.1% lowest in 17 years. 1.5 million new jobs created since I took office. Highest stock Market ever up $5.4 trill _E_ Remember when you hear the words sources say from the Fake Media often times those sources are made up and do not exist. _E_ Also tomorrow night I will be going to Boone and Ames. Really look forward to seeing all of my friends in Iowa. 
_E_ RT @EricTrump: .@LaraLeaTrump and I look forward to being on @JudgeJeanine tonight at 9pm! @FoxNews #MakeAmericaGreatAgain __HTTP__ _E_ "Actions are the seed of fate deeds grow into destiny." – Harry S. Truman _E_ A message for Hollywood __HTTP__ _E_ A friend is one who has the same enemies as you have. Pres. Abraham Lincoln _E_ On Taxes: "This is the biggest corporate rate cut ever going back to the corporate income tax rate of roughly 80 years ago.This is a huge pro growth stimulus for the economy. Every year the Obama WH overstated how the economy would grow. Now real economics and jobs." @WSJ Report _E_ Another radical Islamic attack this time in Pakistan targeting Christian women & children. At least 67 dead400 injured. I alone can solve _E_ Via @SteveKingIA's Steve King for Congress Facebook Page: "Donald Trump has a special announcement!" __HTTP__ _E_ If you are interested in balancing work and pleasure you will never succeed! _E_ Why would college graduates want Crooked Hillary as their President? She will destroy them! __HTTP__ _E_ Brits spent $57.8M on the royal family. Obamas cost us $1.4B in expenses including entertainment __HTTP__ Living large on us. _E_ Really looking forward to watching The Masters this weekend one of THE GREATEST SHOWS ON EARTH! _E_ Crooked Hillary refuses to say that she will be raising taxes beyond belief! She will be a disaster for jobs and the economy! _E_ I was viciously attacked by Mr. Khan at the Democratic Convention. Am I not allowed to respond? Hillary voted for the Iraq war not me! _E_ Baltimore had a really tough night only great leadership can solve the many inner city problems facing our country. Jobs jobs jobs! _E_ Crooked Hillary Clinton overregulates overtaxes and doesn't care about jobs. Most importantly she suffers from plain old bad judgement! _E_ It's Friday. How many millions has the White House wasted on the ObamaCare website today? _E_ Watch me on Late Night with Jimmy Fallon tomorrow night at 12:35 a.m. 
on NBC I'll be making a big announcement! _E_ Via @USATODAY Amateur hour with the Iran nuclear deal __HTTP__ _E_ #WeeklyAddress __HTTP__ _E_ I'm very proud of my daughter Ivanka. Great interview. __HTTP__ _E_ Isis terror group has now fully taken over large sections of Iraq and will soon have control of massive oil reserves. I told you so. _E_ Don't underestimate yourself or your possibilities keep your focus intact and focus on the positives. _E_ Not looking good for our great Military or Safety & Security on the very dangerous Southern Border. Dems want a Shutdown in order to help diminish the great success of the Tax Cuts and what they are doing for our booming economy. _E_ We are being embarrassed by Russia and China on Snowden (and much more) yet Obama is talking about global warming on Tuesday. _E_ Global warming is based on faulty science and manipulated data which is proven by the emails that were leaked __HTTP__ _E_ What I am saying is that we never should have been in Iraq in the first place. Bush was terrible Obama is worse! Make America GREAT again. _E_ Visited some very beautiful golf courses this weekend...this is one... __HTTP__ _E_ Steve Bannon will be a tough and smart new voice at @BreitbartNews...maybe even better than ever before. Fake News needs the competition! _E_ "All Star Celebrity Apprentice" ranked #1 for the 10 o'clock hour among ABC CBS and NBC with a season high 19% margin. _E_ Thank you @kayleighmcenany for your nice words great knowledge and style! We are doing really well in South Carolina. @CNN @donlemon _E_ Just got to the #USWomensOpen in Bedminster New Jersey. People are really happy with record high stock market up over 17% since election! _E_ Made in America? @BarackObama argues that his long form birth certificate is irrelevant in court. __HTTP__ _E_ Even if you're on the right track you'll get run over if you just sit there. Will Rogers _E_ #CelebrityApprentice Who will win? 
__HTTP__ Find out tonight live Season Finale at 9PM ET on NBC. _E_ Congratulations to @TrumpCollection's @TrumpPanama for receiving the Certificate of Excellence & Top 10 Hotels in Panama from @TripAdvisor! _E_ Gas prices are the lowest in the U.S. in over ten years! I would like to see them go even lower. _E_ Via @kcautv: "Donald Trump Coming to Sioux City in May" __HTTP__ _E_ Remarks from the Roosevelt Room with @SenateMajLdr Mitch McConnell @SpeakerRyan and Secretary of Defense General James Mattis. __HTTP__ _E_ Looking forward to being with @SenTedCruz at our big rally in D.C. on Wednesday (1:00 P.M. at the Capitol) to protest insane Iran nuke deal! _E_ "The worst thing you can possibly do in a deal is seem desperate to make it." – The Art of The Deal. _E_ I'm sick of always reading about outsourcing. Why aren't we talking about onshoring ? We need to bring manufa... (cont) __HTTP__ _E_ #Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_ WH refused a meeting with the Israeli Defense Minister. If only Obama hated Iran as much as he dislikes Israel. _E_ The only American who has met with the North Korean man child is Dennis Rodman. Isn't that frightening and sad? _E_ #ObamacareFail __HTTP__ _E_ RT @TeamTrump: A @realDonaldTrump Administration will bring JOBS BACK! #Debates2016 __HTTP__ _E_ Someone just wrote that "you predicted every single major event that's now happening—and they knock you instead of giving you credit." _E_ The only one to fix the infrastructure of our country is me roads airports bridges. I know how to build pols only know how to talk! _E_ Russia must be laughing up their sleeves watching as the U.S. tears itself apart over a Democrat EXCUSE for losing the election. _E_ Representative Devin Nunes a man of tremendous courage and grit may someday be recognized as a Great American Hero for what he has exposed and what he has had to endure! 
_E_ .@BarackObama sent over 100000 jobs and Canadian oil to China all because he would not approve Keystone XL. _E_ Interesting reading re September 11th __HTTP__ _E_ Read Ivanka's blog about last night's Apprentice on Entertainment Weekly ... __HTTP__ _E_ This is the summer of box office bombs. Who is green lighting this garbage? The scripts are terrible. _E_ Thank you Greta. __HTTP__ _E_ 08 09 2011 19:33:31 _E_ Trump shows complete domination of Facebook conversation __HTTP__ _E_ Free enterprise is still the greatest force for upward mobility economic security and the expansion of the middle class. @MittRomney _E_ ObamaCare Tragedy Primed to Further Explode the Deficit __HTTP__ And @Obama transferred $500 billion from Medicare to fund it! _E_ A great honor to welcome President Juan Manuel Santos of Colombia to the White House today! Joint Press Conf... __HTTP__ _E_ My sons Don and Eric are in Ireland looking at my new club. It will be phenomenal! @LodgeatDoonbeg _E_ Together we are MAKING AMERICA GREAT AGAIN! __HTTP__ _E_ Dennis—Thank you for being honest. Somebody put words in your mouth & you wouldn't take it. Great! @dennisrodman _E_ .@JebBush today said he didn't want to be the front runner he would rather be where he is now 2%. That is the talk of a loser can't win! _E_ On @FoxNews at 7:00 P.M. Special: Meet the Trumps Hope you enjoy! _E_ Tony Romo just made a great play Giants are getting killed! _E_ Iraq was one of our biggest mistakes. We got absolutely nothing for our sacrifices.The country will collapse (cont) __HTTP__ _E_ Hillary Clinton just had her 47% moment. What a terrible thing she said about so many great Americans! _E_ Via @usweekly: "Donald Trump Sounds Off on Joan Rivers' Death: 'I Think The Doctors Made a Terrible Mistake'" __HTTP__ _E_ .@davidaxelrod use Buffet Icahn Sam Zell Leon Black Kravis Caesars and many more when talking about using the bankruptcy laws not me! 
_E_ Via @ WSOC_TV: "Blair Miller talks with Donald Trump about Charlotte ventures" __HTTP__ _E_ Derek must move back into one of my buildings immediately. It will be lucky for him like in past. _E_ I'm in Scotland getting ready for a major news conference on the Great Dunes of Scotland announcing the second North Sea course amazing! _E_ Get ready for @Oreillyfactor tonight at 8 always interesting! _E_ Amazing. People are sending letters of support for @TrumpChicago's sign to my other properties including even @TrumpScotland. Thank you! _E_ Get your ballots in Colorado I will see you soon and we will win!#MakeAmericaGreatAgain __HTTP__ _E_ After 200 days rarely has any Administration achieved what we have achieved..not even close! Don't believe the Fake News Suppression Polls! _E_ Any deal completed before the fiscal curb must have tangible cuts on expenditures in baseline spending so we can get our credit back. _E_ China has so much of our debt that they can't put us in default w/o killing themselves US needs our toughest negotiator and fast! _E_ Melania and I will be appearing on The View tomorrow at 11 a.m. on CBS. Tune in for some great fun! _E_ RT @IvankaTrump: Very proud of Arabella and Joseph for their performance in honor of President Xi Jinping and Madame Peng Liyuan's official... _E_ Even with lower profit projections American firms are still throwing money into China __HTTP__ Obama is killing investment. _E_ Great meeting w/ coal miners & leaders from the Virginia coal industry thank you! #MAGA __HTTP__ __HTTP__ _E_ Look when it comes to China America better stop messing around. China sees us as a naive gullible foolish (cont) __HTTP__ _E_ With long gas lines & total disarray from storm the hurricane may yet be a negative for Obama. _E_ As a favor to my friends at EXTRA I am co hosting tonight at 7 p.m. on @nbc _E_ The American dream is back. We're going to create an environment for small business like we haven't had in many ma... 
__HTTP__ _E_ Thanks to @SteveKingIA for the kind introduction at the IA Freedom Summit & congrats to @David_Bossie & @Citizens_United on a great success! _E_ Be sure to watch "The History of WrestleMania" on @netflix. My interview explains how I supported the event early on. I'm proud of it. _E_ The failing @NYDailyNews which just raised its prices because it's dying said I wear a "wig" when they know I don't. Dishonest. _E_ Even though I have the legal right to use Steven Tyler's song he asked me not to. Have better one to take its place! _E_ Thx Mark I appreciate your words about the school. You sound like you're doing well happy for you. @businessinsider __HTTP__ _E_ The Establishment and special interests are absolutely killing our country. Stop them now: __HTTP__ _E_ Thank you for joining me this afternoon New Hampshire! Will be back soon. #FollowTheMoneySpeech transcript:... __HTTP__ _E_ Thank you New Orleans Louisiana!#MakeAmericaGreatAgain #VoteTrump __HTTP__ __HTTP__ _E_ Weak and low energy @JebBush whose campaign is a disaster is now doing ads against me where he tries to look like a tough guy. _E_ If you like having the world collapse and being told America is leading from behind vote Obama. _E_ I will be on @foxandfriends at 7:00 A.M. So much to talk about but not much good news for the U.S.A. MAKE AMERICA GREAT AGAIN! _E_ I told Rex Tillerson our wonderful Secretary of State that he is wasting his time trying to negotiate with Little Rocket Man... _E_ The ring announcers are working hard to justify the Mayweather victory. They should be ashamed of themselves! A TOTAL JOKE. _E_ Many of the Syrian rebels are radical jihadi Islamists who are murdering Christians. Why would we ever fight with them? _E_ Just arrived at Camp David where I am closely watching the path and doings of Hurricane Harvey as it strengthens to a Category 3. BE SAFE! _E_ It is amazing how rude much of the media is to my very hard working representatives. 
Be nice you will do much better! _E_ America's relationship with China is at a crossroads. We only have a short window of time to make the tough (cont) __HTTP__ _E_ Develop your gut instincts and act on them. You will have your biggest successes when you go with your gut but be very smart & careful. _E_ I will be interviewed by @oreillyfactor tonight on @FoxNews at 8pm. Enjoy! _E_ My interview with @IngrahamAngle discussing the real unemployment number and how the 7.8% number is a fraud __HTTP__ _E_ Negotiations on DACA have begun. Republicans want to make a deal and Democrats say they want to make a deal. Wouldn't it be great if we could finally after so many years solve the DACA puzzle. This will be our last chance there will never be another opportunity! March 5th. _E_ Congratulations to Roy Moore on his Republican Primary win in Alabama. Luther Strange started way back & ran a good race. Roy WIN in Dec! _E_ #AmericaFirst #RNCinCLE __HTTP__ _E_ .@AScottPGA Really solid playing keep going! _E_ I always enjoy watching young entrepreneurs enter the business world. I can tell who reads my books and who doesn't. #MidasTouch _E_ THe 2012 election is the most important in my lifetime. We must nominate a candidate who will win and will roll back @BarackObama's damage. _E_ So many false and phony T.V. commercials being broadcast in Indiana. Reminds me of Florida where thousands were put up I won in a landslide! _E_ Mitt's got it right: @RickSantorum's attacks on @MittRomney's pro growth tax cut proposal are foolish. _E_ I am the only one who can beat Hillary Clinton. I am not a Mitt Romney who doesn't know how to win. Hillary wants no part of Trump _E_ "The world is changing very fast. Big will not beat small anymore. It will be the fast beating the slow." @rupertmurdoch _E_ Such a great honor! __HTTP__ _E_ Why should we have any defense cuts in any deal? America must remain strong. _E_ RT @seanhannity: HRC mishandles and destroys classified info NO PROBLEM! 
Pay/play on Uranium one NO PROBLEM! Lynch BC tarmac: it's a matte... _E_ Wow! What a great honor from @DRUDGE_REPORT __HTTP__ _E_ Congratulations to our new #VASecretary Dr. David Shulkin. Time to take care of Veterans who have fought to protect... __HTTP__ _E_ Via @BreitbartNews TRUMP WINS NASHVILLE GRASSROOTS STRAW POLL WITH 52 PERCENT __HTTP__ _E_ Great thanks. __HTTP__ _E_ ...If you plan for the worst—if you can live with the worst—the good will always take care of itself. _E_ Looking forward to speaking @acnnews International Convention tomorrow morning in Charlotte NC __HTTP__ _E_ Enjoy the #SuperBowl and then we continue: MAKE AMERICA GREAT AGAIN! _E_ You can listen to my interview today with Jay Sekulow Live and the @JordanSekulow show here __HTTP__ @12PM EST. _E_ Thank you Ohio! VOTE so we can replace Obamacare and save healthcare for every family in the United States! Watch:... __HTTP__ _E_ I let @pennjillette come back on the record 13th season of 'All Star' @CelebApprentice after he relentlessly begged me to good t.v. _E_ When a country is no longer able to say who can and who cannot come in & out especially for reasons of safety &.security big trouble! _E_ #MakeAmericaGreatAgain! __HTTP__ _E_ My @TMZ interview with @HarveyLevinTMZ discussing how I will see my $5M lawsuit against @billmaher to the end __HTTP__ _E_ #MakeAmericaGreatAgain! __HTTP__ _E_ We should tell China that we don't want the drone they stole back. let them keep it! _E_ The President of the U.S. is the leader of the Free World. He should dress like it at all times. Wear a suit and a tie for major interviews. _E_ I wonder how much our leaders have promised or given Russia in order for them to behave and not make the U.S. look even worse? _E_ Magician extraordinaire @pennjillette is back in the All Star @ApprenticeNBC. This time he has even more tricks up his sleeve. _E_ Americans who can afford to buy enough food is now at a 3 year low. Is this @BarackObama's 'recovery'? 
__HTTP__ _E_ Entrepreneurs: Realize that success requires 100% effort and 100% focus. Nothing less. _E_ I fully support the @NYPD @MayorBloomberg and @CommissionerKelly. They should all be honored for protecting us since 9/11 not demonized. _E_ The bus driver who saved the woman from jumping off the bridge was really cool great guy. I'm going to send him $10 000 he deserves it! _E_ Hillary Clinton failed all over the world. LIBYA SYRIA IRAN IRAQ ASIA PIVOT RUSSIAN RESET BENGHAZI... __HTTP__ _E_ Wow Hillary Clinton was SO INSULTING to my supporters millions of amazing hard working people. I think it will cost her at the Polls! _E_ I will be in California this weekend making a speech for Clint Eastwood. Then to Arizona and Vegas. Big crowds. Discussing illegals & more! _E_ "There are no environments where you're only going to win because life just isn't like that." Bobby Orr _E_ It is my great honor to send $25000 to Sgt. Andrew Tahmooressi. #marinefreed _E_ Despite major outside money FAKE media support and eleven Republican candidates BIG R win with runoff in Georgia. Glad to be of help! _E_ Six months in it is the hope of GROWTH📈that is making AmericaFOUR TRILLION DOLLARS RICHER. Stuart @VarneyCo __HTTP__ __HTTP__ _E_ Another solar company @BarackObama funded with our money has filed for bankruptcy __HTTP__ One (cont) __HTTP__ _E_ My @gretawire interview re: how the debt ceiling is key point the fiscal curb & why we must & can make a great deal. __HTTP__ _E_ The biggest winner of Obama's '08 win Vladimir Putin. Ultimately he could be tied with Iran after Tehran becomes a nuclear power. _E_ Once Iran has nuclear weapons they will shut down the Strait of Hormuz. Oil will be over $300/Barrel. Iran'... (cont) __HTTP__ _E_ People of Ohio are fantastic. Thank you so much. What an evening! __HTTP__ _E_ 2 million more people just dropped out of ObamaCare. It is in a death spiral. Obstructionist Democrats gave up have no answer = resist! 
_E_ Via @NewsInTheBurg: "@chefjoseandres to open restaurant in Trump Int'l Washington D.C." __HTTP__ _E_ Merry Christmas & Happy Holidays!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Once again under@BarackObama the US has fallen down the ranks of global competitiveness __HTTP__ We must do better. _E_ No American should be separated from their loved ones because of preventable crime committed by those illegally in our country. Our cities should be Sanctuaries for Americans – not for criminal aliens! __HTTP__ _E_ My thoughts on Gadhafi's death @BarackObama and the misery index... __HTTP__ #trumpvlog _E_ Big meetings today at the United Nations. So many interesting leaders. America First will MAKE AMERICA GREAT AGAIN! _E_ Mariano Rivera is one of top @Yankees of all time. Greatest closer of all time. A true warrior. Last night's MVP award well deserved. _E_ Congratulations to my children Don and Tiffany on having done a fantastic job last night. I am very proud of you! _E_ #CNNDebate __HTTP__ _E_ Look Snowden is bad done tremendous damage to our country and standing but we have far worse in our government (guess who?). _E_ Thank you to all of the men and women who protect & serve our communities 24/7/365! #LawEnforcementAppreciationDay... __HTTP__ _E_ Why does HI Revised Statute 338 17.8 allow an HI resident who doesn't have to be US citizen to procure an official Hawaii birth certificate? _E_ Trump: Rove 'Made a Fool Out of Himself' __HTTP__ via @cnsnews _E_ Heading to New Hampshire. Will be talking about the disaster known as ObamaCare! _E_ Russia beat the United States in the Olympics another Obama embarrassment! Isn't it time that we turn things around and start kicking ass? _E_ Kim Jong Un of North Korea who is obviously a madman who doesn't mind starving or killing his people will be tested like never before! _E_ The racial divide in our country is almost at an all time high and getting worse every time you turn on the television. 
_E_ Make the Boston killer talk before our doctors make him better. Once he is well he will say speak to my lawyers. _E_ Last night's @extratv 's interview by @MarioLopezExtra of gorgeous 2012 @MissUniverse @oliviaculpo __HTTP__ Great job! _E_ Once again someone we were told is ok turns out to be a terrorist who wants to destroy our country & its people how did he get thru system? _E_ I just left the Doral in Miami it is going to be amazing! __HTTP__ _E_ The Clinton's are the real predators... __HTTP__ _E_ For all of today's voters please remember that I am the only candidate that is self funding my campaign I am not bought and paid for! _E_ .@MarkHalperin showed a focus group on @Morning_Joe me using a very bad word. I never said the word left an open blank. Please apologize! _E_ Watching @loudobbsnews fantastic show! Has very interesting take on Paul Ryan. _E_ I am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ _E_ When I intelligently turned down The Club For Growth crazy request for $1000000 they got nasty.What a waste of money that would have been _E_ My joint @seanhannity int. on @FoxNews with @GeraldoRivera recapping @ApprenticeNBC & discussing the 2016 election __HTTP__ _E_ Dine With The Donald and Mitt: __HTTP__ _E_ Will be participating in a Town Hall tonight on @SeanHannity at 10pmE from Austin Texas. Enjoy! __HTTP__ _E_ Weiner says many more pictures may be out there—this is just what NYC needs a pervert Mayor. _E_ Wow two candidates called last night and said they want to go to my event tonight at Drake University. _E_ Let's be honest if Obama thought he could get away with campaigning during the storm then he would have been in Ohio on Monday. _E_ 38 stories high @TrumpWaikiki's 462 luxury guest rooms & suites offer exceptional services __HTTP__ _E_ .@VanityFair's 2013 dwindling sales continue to sink at an even faster record rate under Graydon Carter __HTTP__ Disaster! 
_E_ Wall Street paid for ad is a fraud just like Crooked Hillary! Their main line had nothing to do with women and they knew it. Apologize? _E_ I'm getting ready to be inducted tonight into the WWE Hall of Fame at Madison Square Garden a great honor for me and the Trump family! _E_ As a presidential candidate I have instructed my long time doctor to issue within two weeks a full medical report it will show perfection _E_ TRUMP: GOP MUST DUMP 'USELESS' ROVE TO WIN PRESIDENTIAL ELECTIONS __HTTP__ by @mboyle1 @BreitbartNews _E_ Another new post debate poll. THANK YOU! #VoteTrump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Paul Ryan is far from my first choice but a very nice guy. The Republicans should go for tough and (very) smart this time no games! _E_ RT @markets: What Is Trump worth to Twitter? One analyst estimates $2 billion __HTTP__ __HTTP__ _E_ Sue them Tom! #TrumpVlog __HTTP__ _E_ Today is a big day for us and for Toronto: Trump International Hotel & Tower Toronto opens today. (cont) __HTTP__ _E_ For all those sick degenerates contemplating a knockout attack please remember the late great Charles Bronson no more crime! _E_ I will be interviewed by Anderson Cooper at 8pm on @CNN from New Hampshire. Should be very interesting! _E_ The Republicans never discuss how good their healthcare bill is & it will get even better at lunchtime.The Dems scream death as OCare dies! _E_ Just leaving Florida. Big crowds of enthusiastic supporters lining the road that the FAKE NEWS media refuses to mention. Very dishonest! _E_ As our Country rapidly grows stronger and smarter I want to wish all of my friends supporters enemies haters and even the very dishonest Fake News Media a Happy and Healthy New Year. 2018 will be a great year for America! _E_ Hillary Clinton conceded the election when she called me just prior to the victory speech and after the results were in. Nothing will change _E_ Too many people rely on auto correct...an assistant of mine apologizes! 
_E_ HAPPY 4TH OF JULY TO EVERYONE! MAKE AMERICA GREAT AGAIN! _E_ I will implement effective missile defenses to protect against threats. On this there will be no flexibility with Vladimir Putin. Mitt _E_ Happy Birthday President Reagan #FlashbackFriday __HTTP__ _E_ Buy American & hire American are the principles at the core of my agenda which is: JOBS JOBS JOBS! Thank you @exxonmobil. _E_ The reality is that no gun bill will ever stop tragedies. And as we have learned from ObamaCare Washington only makes things worse! _E_ I WILL DEFEAT ISIS. THEY HAVE BEEN AROUND TOO LONG! What has our leadership been doing?#DrainTheSwamp __HTTP__ _E_ Can you imagine the Boston killer being lovingly tended to in a hospital room right next to his victims who lost their arms legs and worse! _E_ The single greatest Witch Hunt in American history continues. There was no collusion everybody including the Dems knows there was no collusion & yet on and on it goes. Russia & the world is laughing at the stupidity they are witnessing. Republicans should finally take control! _E_ A detainee released from Gitmo has killed an American. When will our so called leaders ever learn! _E_ I don't want to be the only billionaire in America I want all Americans to be rich. _E_ Frank "FX" Giaccio On behalf of @FLOTUS Melania & myself THANK YOU for doing a GREAT job this morning! @NatlParkService gives you an A+! __HTTP__ _E_ Talking with @SammartinoBruno backstage __HTTP__ #WWEHOF _E_ .@realDonaldTrump will PROTECT and DEFEND the Constitution #Debate #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_ My son @EricTrump is in Memphis at St. Jude Children's Research Hospital... _E_ Flashback – Jeb Bush received a $4M tax payer bailout in 1990 __HTTP__ Guess who was POTUS then? _E_ "Donald Trump to headline SC Tea Party Convention" __HTTP__ via @wyffnews4 _E_ El Chapo comes to the U.S. often thru our border—it's been revealed he has CA drivers license. 
__HTTP__ _E_ Members from Obama's own job council are endorsing @MittRomney __HTTP__ Not surprising _E_ Watch Coach Mike Ditka a great guy and supporter tonight at 8pmE on #WattersWorld with @jessebwatters @FoxNews. _E_ It never ends! __HTTP__ _E_ It was only after I informed NBC that I wouldn't do the Apprentice that they became upset w/ me. They couldn't care less about "inclusion _E_ Malfeasance at Fannie Mae and Freddie Mac helped cause our current financial meltdown. _E_ Does anybody remember when Bill Clinton in 2008 worked long and hard for Hillary? She LOST! Now Bill is at it again. Just watch. _E_ The greatest commodity to own is land. It is finite. God is not making any more of it. _E_ Glad to hear my @foxandfriends' Monday interview continues to get big ratings. Great way to start your week _E_ I will be interviewed on @FaceTheNation this morning at 10:00 A.M. Have a great day! _E_ .@KarlRove is a failed Jeb Bushy. Never says anything good & never will even after I beat Hillary. Shouldn't be on the air! _E_ Little Marco Rubio treated America's ICE officers like absolute trash in order to pass Obama's amnesty. __HTTP__ _E_ The Iranians are sure happy with Obama's nomination of Hagel. Already praising Hagel as 'Anti Israel' __HTTP__ _E_ How long will it take for chants bring back the replacement refs when a bad call is made? _E_ When will @davidaxelrod realize he is on a fool's errand trying to defend @BarackObama's ineptitude? _E_ Hypocrite! @HillaryClinton claims she needs a "public and a private stance" in discussions with Wall Street banks. #Debate _E_ "The road to Easy Street goes through the sewer." – John Madden _E_ How can Hillary run the economy when she can't even send emails without putting entire nation at risk? _E_ So what will happen to the Big O on Celebrity Apprentice tonight. Remember I only fire people when it is deserved not for other reasons! _E_ We will fight the #FakeNews with you! __HTTP__ _E_ Diligence is the mother of good luck. 
Benjamin Franklin _E_ Meeting with "Chuck and Nancy" today about keeping government open and working. Problem is they want illegal immigrants flooding into our Country unchecked are weak on Crime and want to substantially RAISE Taxes. I don't see a deal! _E_ 'Presidential Executive Order on the Establishment of Presidential Advisory Commission on Election Integrity'... __HTTP__ _E_ If ObamaCare should not be repealed then why has Obama & Congress exempted their staffs? _E_ We ALL must be united & condemn all that hate stands for. There is no place for this kind of violence in America. Lets come together as one! _E_ I wonder why @BarackObama is not going to the NAACP Convention. Is it because he can't answer questions about 14.7% Black unemployment? _E_ .@FoxNews is much better and far more truthful than @CNN which is all negative. Guests are stacked for Crooked Hillary! I don't watch. _E_ Just had a great legal victory in Ft. Lauderdale won trial now will receive tremendous $ in legal fees from losers. Love it! _E_ Wow just announced that Lyin' Ted and Kasich are going to collude in order to keep me from getting the Republican nomination. DESPERATION! _E_ Via @NOLAnews by @DaveWalkerTV: Donald Trump praises @Joan_Rivers as 'strong' 'vibrant' in @ApprenticeNBC return __HTTP__ _E_ National Review is a failing publication that has lost it's way. It's circulation is way down w its influence being at an all time low. Sad! _E_ I will be interviewed on @megynkelly's The Kelly File tonight. Be sure to watch on @FoxNews! _E_ "Donald Trump to visit metro Detroit in May" __HTTP__ via @wxyzdetroit _E_ I am continuing to get rid of costly and unnecessary regulations. Much work left to do but effect will be great! Business & jobs will grow. _E_ After watching all about the horror story that is A Rod I realized again that it is time to let Pete Rose into the Baseball Hall of Fame! _E_ I will be doing Fox and Friends at 7 A.M. this morning. 
_E_ My #TrumpTuesday @SquawkCNBC interview discussing golf VP choices the real estate market & healthcare reform __HTTP__ _E_ Looking forward to tonight's conversation w/ David Rubenstein @TheEconomicClub. Airing live on @cspan at 7PM EST __HTTP__ _E_ Wow @Macys shares are down more than 40% this year. I never knew my ties & shirts not being sold there would have such a big impact! _E_ RT @DonnaWR8: @realDonaldTrump I wonder what this BRAVE American would give to stand on his OWN two legs just ONCE MORE for our #Anthem?... _E_ Watch me explain on the @Late_Show how my charitable offer to Obama changes the election and is about transparency __HTTP__ _E_ He @BarackObama believes that the War on Terror is over __HTTP__ Who does he think won? _E_ both countries will perhaps work together to solve some of the many great and pressing problems and issues of the WORLD! _E_ Join me for my #WeeklyAddress __HTTP__ __HTTP__ _E_ Pocahontas bombed last night! Sad to watch. _E_ Thank you for your incredible support Wisconsin and Governor @ScottWalker! It is time to #DrainTheSwamp & #MAGA!... __HTTP__ _E_ .@JebBush's opening and closing in the debate were said by all to be terrible fumbled around incoherent. _E_ .@SenJohnMcCain should be defeated in the primaries. Graduated last in his class at Annapolis dummy! _E_ The Fed's pumping is great news in the short term but it can't last forever. Be prudent in your market investing. _E_ Looks like my work here is done bringing a close to the first ever #NBC #SweepsTweet. Keep watching @ApprenticeNBC every Sunday 9/8c. _E_ .@Israel could very well be close to attacking Iran. Could be this election's big October surprise... _E_ Thank you to Bob Woodward who said That is a garbage document...it never should have been presented...Trump's right to be upset (angry)... 
_E_ Via @BreitbartNews by @j_strong: "Obama Administration Quietly Prepares 'Surge of Millions' of New Immigrant Ids" __HTTP__ _E_ Via @FoxSportsGolf: Trump's protégé earns US Open spot __HTTP__ _E_ Some jerk fraudulently tweeted that his parents said I was a big inspiration to them + pls RT—out of kindness I retweeted. Maybe I'll sue. _E_ I am going to repeal and replace ObamaCare! Read more about my positions on healthcare reform here: __HTTP__ _E_ This is one of the COLDEST WINTERS ever freezing all over the country for long periods of time! So much for GLOBAL WARMING. _E_ Discussing the 9/11 attack and coverage with @kingsthings while hosting the 25th anniversary of his @CNN show __HTTP__ _E_ Doing Fox and Friends in two minutes! _E_ Will be doing Fox & Friends at 7 2 minutes. _E_ The Trans Pacific Partnership will lead to even greater unemployment. Do not pass it. _E_ I guess I have reached yet another ceiling 49.7% with four people. My highest Reuters poll yet! Thank you! __HTTP__ _E_ Even the @NYTimes and @WashingtonPost Editorial Boards condemned Justice Ginsburg for her ethical and legal breach. What was she thinking? _E_ Via @ConMonitorNews by @CMonitor_JVF: Donald Trump guest speaker at event honoring James Foley __HTTP__ _E_ Donald Trump returns to the 'Apprentice' boardroom __HTTP__ via @BW _E_ I campaigned on creating a merit based immigration system that protects U.S. workers & taxpayers. Watch: __HTTP__ #RAISEAct __HTTP__ _E_ Nation's Immigration And Customs Enforcement Officers (ICE) Make First Ever Presidential Endorsement: __HTTP__ _E_ Congratulations to @PiersMorgan on winning @BritishGQ TV Personality Of The Year. Piers deserves his success! _E_ .@DannyZuker Danny—Let your bosses on Modern Family lend you the money to play the game. Show courage! _E_ I never want someone working for me who doesn't want to be there and in the same way you shouldn't want to be there either. 
_E_ If elected POTUS I will stop RADICAL ISLAMIC TERRORISM in this country! In order to do this we need to... __HTTP__ _E_ Getting ready for the big news conference in Dubai. It should all be happening in the U.S. but it isn't SAD! _E_ President Obama has a personal responsibility to visit & embrace all people in the US who contract Ebola! _E_ Melania and I just had interview with the legendary @BarbaraJWalters. Watch #abc2020 this Friday. Tonight we talk ISIS @WNTonight _E_ WaPo attack on alleged high school incidents by @MittRomney is a hit job to me. Where are @BarackObama's high school and college records? _E_ Obama should meet with Putin snd convince him to do what is good for the U.S. It's called good dealmaking or simply leadership! Cajole. _E_ Great leaders listen to and support law enforcement officials. Police discuss no go areas: __HTTP__ __HTTP__ _E_ Diane Black of Tennessee the highly respected House Budget Committee Chairwoman did a GREAT job in passing Budget setting up big Tax Cuts _E_ Hopefully Republican Senators good people all can quickly get together and pass a new (repeal & replace) HEALTHCARE bill. Add saved $'s. _E_ Since November 8th Election Day the Stock Market has posted $3.2 trillion in GAINS and consumer confidence is at a 15 year high. Jobs! _E_ No cuts to welfare no cuts to food stamps & NOT A SINGLE CUT TO OBAMACARE yet the new budget cuts military benefits. Sad! _E_ "Build your reputation on intelligence responsibility and results. That's building the right way." – Think Like a Champion _E_ Iran's nuclear program must be stopped – by any and all means necessary. _E_ Before I bought the site the Sun Times had the biggest ugliest sign Chicago has ever seen. Mine is magnificent and popular. _E_ THANK YOU SYRACUSE! #NYPrimary __HTTP__ __HTTP__ _E_ Set the example. You can motivate others as well as yourself by remembering that you are setting the example. 
_E_ Not since Watergate have we been going thru a time like this Benghazi IRS wiretapping of @AP... _E_ My @CNN interview with @wolfblitzercnn where I discuss @BarackObama's 'birth certificate' and why @CNN has low ratings __HTTP__ _E_ There must be a higher standard of accuracy in the media. Incredible that some so called journalists can make up lies and get away with it _E_ RT @TrumpInaugural: Counting down the days until the swearing in of @realDonaldTrump & @mike_pence. Check in here for the latest updates. #... _E_ Rowanne Brewer the most prominently depicted woman in the failing @nytimes story yesterday joined @foxandfriends. __HTTP__ _E_ Why do we always know how the four liberals are going to rule but have to think about which side the Republican judges will go. _E_ In the span of two months @BarackObama the habitual vacationer has called America soft and lazy. He loves to criticize America. _E_ Must read @IBDinvestors editorial: "Child Alien Crisis Obama's Fault But GOP Won't Pounce" __HTTP__ _E_ .@CPACnews had its largest ever ticket sales the day of my announcement. Really an honor. Can't wait to see everyone. _E_ Congrats to the new Gov. of Texas @GregAbbott_TX for taking a tough & bold stance at the border. Should have been done long ago by Perry. _E_ Sadly there is no way that Ted Cruz can continue running in the Republican Primary unless he can erase doubt on eligibility. Dems will sue! _E_ Entrepreneurs: Keep an open mind! Business is a creative endeavor. There are always opportunities and possibilities. _E_ Lyin'Ted Cruz is weak & losing big so now he wants to debate again. 
But according to DrudgeTime and on line polls I have won all debates _E_ My father Fred Trump left me a relatively small amount of money (compared to where I am today over $10 billion) but vast amount of knowledge _E_ Why isn't Hillary Clinton 50 points ahead?#DebateNight __HTTP__ _E_ Despite the false @nytimes story about Jeb Bush being happy with the Trump surge he fell more than anybody & is miserable. _E_ Emails prove WH knew ObamaCare website wouldn't work in October why didn't they delay the launch? __HTTP__ _E_ I look forward to attending Saturday Night Live on Sunday night. I am sure it will be a great show. @nbcsnl __HTTP__ _E_ .@megynkelly used this poll (nobody else did) when I was down—wonder if she'll use it now that I'm up? __HTTP__ _E_ My @FoxNews interview last night with @Gretawire __HTTP__ _E_ To all the Bernie voters who want to stop bad trade deals & global special interests we welcome you with open arms. People first. _E_ Intrinsic means basic inborn elemental. If you have an intrinsic value it cannot be taken away. Think Like a Champion _E_ The Chinese must still be laughing at Kerry's trip to China. He got nothing gave them everything and promised even more. _E_ The Donald J. Trump Signature Collection available @Macys offers this fall's top styles in ties shirts & suits __HTTP__ _E_ I am in IstanbulTurkey. Just opened magnificent #TrumpTowers a big hit. _E_ RT @DonaldJTrumpJr: Someone please fact check her coal comments. Give me a break. #debates _E_ RT @netanyahu: Ever Strongerחזקים תמיד 🇱 __HTTP__ _E_ Thank you Hilton Head South Carolina! @SCTeamTrump #Trump2016 __HTTP__ __HTTP__ _E_ Upstate New York needs jobs. Frack Now & Frack Fast! Pay off NY State debt. _E_ Via Politico: Trump Extends Lead in New Hampshire Poll __HTTP__ _E_ RT @JaydaBF: VIDEO: Islamist mob pushes teenage boy off roof and beats him to death! __HTTP__ _E_ Chuck Jones who is President of United Steelworkers 1999 has done a terrible job representing workers. 
No wonder companies flee country! _E_ Despite the Fake News Media in conjunction with the Dems an amazing job is being done in Puerto Rico. Great people! _E_ Everything you can imagine is real. —Pablo Picasso _E_ Isn't it interesting that now that I'm #1 in the polls the networks show polls that are a month old! _E_ More radical Islam attacks today it never ends! Strengthen the borders we must be vigilant and smart. No more being politically correct. _E_ I am at the Saturday Night Live Studio electricity all over the place. We will be doing a tweeting skit so stay tuned! _E_ Trump National Golf Club Los Angeles on the Palos Verdes Peninsula overlooking the Pacific Ocean spectacular! __HTTP__ _E_ We should start an immediate investigation into @SenSchumer and his ties to Russia and Putin. A total hypocrite! __HTTP__ _E_ Will be doing a joint press conference in Hanoi Vietnam then heading for final destination of trip the Phillipines. _E_ I know some of you may think l'm tough and harsh but actually I'm a very compassionate person (with a very high IQ) with strong common sense _E_ .@WSJ Editorial says Clinton primary vote total is 8646551.Trump's is 7533692 a knock. But she had only 3 opponents I had 16.Apologize _E_ Bob Schieffer will do a great job tonight. Always treated me fairly. _E_ RT @foxandfriends: Yesterday's hearings provided zero evidence of collusion between our campaign and the Russians because there wasn't any... _E_ With all of the Fake News coming out of NBC and the Networks at what point is it appropriate to challenge their License? Bad for country! _E_ How quality a woman is Rowanne Brewer Lane to have exposed the @nytimes as a disgusting fraud? Thank you Rowanne. _E_ Maybe Boehner will stop this one sided deal in the House...I hope so! _E_ NEW FBI TEXTS ARE BOMBSHELLS! _E_ Great news! Just out the highly respected USA Today/Suffolk University Poll. Enjoy! __HTTP__ _E_ RT @MollyCBraswell: WHAT?! @realDonaldTrump is speaking at #CPAC2013? 
This conference just became like a hundred times more awesome! _E_ Thank you so nice. _E_ The Supercommittee is a disaster. The Republicans made a crucial mistake agreeing to this debt deal. They hat... (cont) __HTTP__ _E_ Looking over New York City with luxurious 5 Star hotel rooms @TrumpNewYork top dining & amenities __HTTP__ _E_ Wow just in ObamaCare projected to cause large scale drop in jobs even Dems are shocked by 2.5 million number. DISASTER! _E_ The U.S. needs to protect our intelligence assets especially in China. If the Chinese want to spy on us then we need to return the favor. _E_ Paula Broadwell's book on Gen. Petreus is titled All In. Did she know something? _E_ It's time for politicians to be reminded they work for us! We can get it done. Let's Make America Great Again! __HTTP__ _E_ RT @RSBNetwork: LIVE Stream now: Donald Trump press conference #TrumpTrain #Trump2016 __HTTP__ _E_ Interesting polls on who won the GOP debate. __HTTP__ _E_ At the Old Post Office __HTTP__ _E_ Congratulations to Miss Mexico Jimena Navarrete our new Miss Universe 2010 and congratulations to everyone for a fantastic show. _E_ Donald Trump Leads Polls in Florida __HTTP__ _E_ Even though I beat him in the first six debates especially the last one Ted Cruz wants to debate me again. Can we do it in Canada? _E_ .@StephenBaldwin7 and me at a press event for All Star @ApprenticeNBC earlier today at @TrumpTowerNY.... __HTTP__ _E_ Obamacare premiums continue to rise and bend up the cross curve. And the back end of the website does not even work. _E_ I was on CNN last night with @ErinBurnett. _E_ .@WhoopiGoldberg had better surround herself with better hosts than Nicole Wallace who doesn't have a clue. The show is close to death! _E_ If Trump became president he would do an amazing job if Obama took over Celebrity Apprentice he'd fail. What's your opinion? I agree! _E_ Looking forward to a big rally in Nashville Tennessee tonight. Big crowd of great people expected. Will be fun! 
_E_ Economic growth can save Social Security Medicare and America. _E_ THANK YOU Council Bluffs Iowa! The silent majority is silent no more!#Trump2016 #FITN __HTTP__ __HTTP__ _E_ The Council is concerned over the health & safety for the village of Blackdog w/ placement of sub station. @AlexSalmond @pressjournal _E_ .@MarkHalperin I totaly won the RJC meeting yesterday. Know many members who said not even close. Only FULL standing O. But don't want $'s _E_ "I don't think you should ever run from history. You should learn from it and embrace it." @LAClippers Coach Doc Rivers _E_ The failing @NYTimes would do much better if they were honest! __HTTP__ _E_ Yesterday there was yet another massive intelligence leak by the @BarackObama administration. __HTTP__ _E_ MILITARY LIVES MATTER! END GUN FREE ZONES! OUR SOLDIERS MUST BE ABLE TO PROTECT THEMSELVES! THIS HAS TO STOP! _E_ New and great selection of ties shirts and cufflinks@Macy's check them out! _E_ I was right—TV ratings for US Open are way down from last year. People don't want to look at a burned out ugly course! _E_ He @RickSantorum should get out of the race so Republicans can focus on @BarackObama. _E_ Please send a psychiatrist to help @Rosie she's in a bad state. To @Rosie's girlfriend's parents get (cont) __HTTP__ _E_ I will be on @foxandfriends this morning at 8:30. Enjoy! _E_ The ruling @GOP consultant class of losers like @KarlRove have no respect for the Tea Party. They do this at their own peril! _E_ .@Yankees are making a big mistake sending the doping @AROD to rehab assignment. Should suspend him until investigation is over. _E_ .@MittRomney will create 2 million new jobs if elected POTUS. If reelected @BarackObama will create over $12T in new debt. Easy choice. _E_ Brian Williams who is not the nice guy that people think he is has now become totally irrelevant. He will never again hold court! 
_E_ My @greta interview discussing why we do not need another Bush __HTTP__ _E_ Absentee Governor Kasich voted for NAFTA and NAFTA devastated Ohio a disaster from which it never recovered. Kasich is good for Mexico! _E_ This season's cast of @ApprenticeNBC brings excitement to the Board Room. Lots of surprises & great tasks. Enjoy – Jan. 4th! _E_ Celebrating New Year's Eve in the Windy City? Join @TrumpChicago for the chic & elegant Cirque Soiree Celebration __HTTP__ _E_ Via @eonline: Donald Trump wants Katherine Webb for Miss USA judge __HTTP__ _E_ .@MacMiller has over 79M hits on YouTube & just hit platinum with his Donald Trump song—screw you Mac! _E_ I want to thank my @Cabinet for working tirelessly on behalf of our country. 2017 was a year of monumental achievement and we look forward to the year ahead. Together we are delivering results and MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ Great news Chinese companies who were fixing prices and accounting are leaving the US stock market __HTTP__ #TimeToGetTough _E_ Good news is Melania's speech got more publicity than any in the history of politics especially if you believe that all press is good press! _E_ In game 7 of the World Series tonight the Giants are making a big mistake in not starting their ace against K.C. even with two days rest. _E_ .@GMA at 7:00 A.M. _E_ If your actions inspire others to dream more learn more do more and become more you are a leader. – John Quincy Adams _E_ So nice thank you very much. __HTTP__ _E_ .@AlexSalmond Heatwave in Scotland makes wind turbines useless. Big problem expensive mess. _E_ .@THR The Donald Trump Ratings Bump: Who's Benefiting Most? __HTTP__ _E_ My major hotel conversion of The Old Post Office on Pennsylvania Avenue in D.C. is under budget and ahead of schedule. Should be U.S.A. _E_ I visited our Trump Tower campaign headquarters last night after returning from Ohio and Arizona and it was packed with great pros WIN! 
_E_ Via @washingtonpost's @goingoutguide by @timcarman: " @gzchef open the National at the Old Post Office Pavilion" __HTTP__ _E_ Great day today in South Carolina. Fantastic capacity crowd amazing people! _E_ Today Judge St. Eve ruled in my favor on the two remaining claims brought by Goldberg in Chicago. The case is now officially over... _E_ My @Newsmax_Media int. with @SteveMTalk on my Iowa @theFAMiLYLEADER speech @jonkarl 2016 & Benghazi __HTTP__ _E_ RT @shawgerald4: @realDonaldTrump Thank you President TRUMP!! __HTTP__ _E_ Leaving today for California to inspect my fantastic golf course & club on the Palos Verdes peninsula. Big success. __HTTP__ _E_ Dave Brubeck was great and will be missed! _E_ Jeb Bush should stop trying to defend his brother and focus on his own shortcomings and how to fix them. Also Rubio is hitting him hard! _E_ Via @Newsmax_Media by "Donald Trump: Don't Give Obama Fast Track Trade Authority" __HTTP__ _E_ A strong Poland is a blessing to the nations of Europe and a strong Europe is a blessing to the West and to the world. __HTTP__ _E_ The measure of who we are is what we do with what we have. Vince Lombardi _E_ Wow Jeb Bush just lost three of his top fundraisers they quit! _E_ I've helped pass and signed 38 Legislative Bills mostly with no Democratic support and gotten rid of massive amounts of regulations. Nice! _E_ Thank you Indiana we were just projected to be the winner. We have won in every category. You are very special people I will never forget! _E_ Miss Universe Paulina Vega criticized me for telling the truth about illegal immigration but then said she would keep the crown Hypocrite _E_ "Strive for wholeness and keep your sense of wonder intact & you will find yourself ready for a grand slam." Think Like A Champion _E_ All eyes are on Florida today. I will be watching the GOP primary results very closely. We need the right candidate to beat @BarackObama. 
_E_ Why do you need a photo ID to buy a drain cleaner __HTTP__ not to vote? _E_ "Trump Tiger Team Up to Create 'Stunning' Golf Course in Dubai" __HTTP__ via @Newsmax_Media by @Jlorenz _E_ Via @fitsnews by Will Folks: "'THE DONALD' REBUKES OBAMATRADE" __HTTP__ _E_ THANK YOU! #AmericaFirst __HTTP__ _E_ Clinton campaign & DNC paid for research that led to the anti Trump Fake News Dossier. The victim here is the President. @FoxNews _E_ Arriving to check out the border. __HTTP__ _E_ Leaking and even illegal classified leaking has been a big problem in Washington for years. Failing @nytimes (and others) must apologize! _E_ We have got to take our country back. It's time! _E_ Major League Baseball: The best thing you can do is let @PeteRose_14 your all time hits leader into the Hall of Fame. It's time! _E_ Eric Trump on @foxandfriends now! _E_ The jury was not told the killer of Kate was a 7 time felon. The Schumer/Pelosi Democrats are so weak on Crime that they will pay a big price in the 2018 and 2020 Elections. _E_ Beyond eliminating the wasteful spending we need to get tough in cracking down on the hundreds of billions of (cont) __HTTP__ _E_ I have been saying it for sometime now!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Exclusive Davi: Trump The Lion We Need __HTTP__ _E_ .@TraceAdkins the winner of @ApprenticeNBC after last night's victory __HTTP__ _E_ "Go as far as you can see when you get there you'll be able to see farther." J. P. Morgan _E_ RT @rupertmurdoch: As predicted Trump reaching out to make peace with Republican establishment . If he becomes inevitable party would be... _E_ .@seanhannity should have corrected Jeb Bush when he said that I ran for president twice. Never ran merely considered running! _E_ "Always remember: Dress for the job you want not the job you have." – Think Like a Billionaire _E_ Thank you to everyone who came out & joined us @TrumpTurnberry yesterday! 
@EricTrump @IvankaTrump @DonaldJTrumpJr __HTTP__ _E_ Losers and haters are invited to watch Celebrity Apprentice along with the many great and productive people in the hope that you will learn. _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Bowe Bergdahl walked off the base after he was told not to. Solders died looking for him. U.S. should NEVER have made the deal! PUNISHMENT? _E_ Obama's promise to build an international coalition against ISIS is already broken. No one trusts him at home or abroad. _E_ ...It's old electrical grid which was in terrible shape was devastated. Much of the Island was destroyed with billions of dollars.... _E_ My sense is that people are far angrier at the President than they are at Congress re the shutdown—an interesting turn! _E_ Thanks @SherriEShepherd 4 your nice comments today on The View. U were terrific! _E_ He's saddled our children with more debt than we accumulated in 225 years in America. @BarackObama has done an (cont) __HTTP__ _E_ From The Desk Of Donald Trump two new videos up at __HTTP__ and __HTTP__ _E_ You're hired! The @CENTURY21 ad is airing during the #SuperBowl and you need to get voting! Vote for me & @CENTURY21: __HTTP__ _E_ Congrats to Team USA & Capt. @AllenWronowski for retaining the PGA Cup! Well done and well deserved! _E_ Met a big fan today! __HTTP__ _E_ Talks on Repealing and Replacing ObamaCare are and have been going on and will continue until such time as a deal is hopefully struck. _E_ Keep lightweight Marco and his friends out of the White House. #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ If we want to renew our PROSPERITY restore OPPORTUNITY & re establish our economic DOMINANCE then we need tax reform that is pro growth.. __HTTP__ _E_ Thank you @DallasPD! __HTTP__ _E_ 'Dem Operative Who Oversaw Trump Rally Agitators Visited White House 342 Times' #DrainTheSwamp __HTTP__ _E_ Thanks. 
__HTTP__ _E_ Getting ready to watch the debate as they say let's get ready to rumble ! _E_ Remember the harder you work the luckier you get! _E_ Some people dream of great accomplishments while others stay awake and do them." Anonymous _E_ ObamaCare is a broken mess. Piece by piece we will now begin the process of giving America the great HealthCare it deserves! _E_ Legal immigrants want border security. It is common sense. We must build a wall! Let's Make America Great Again! __HTTP__ _E_ Our country is totally divided and our enemies are watching. We are not looking good we are not looking smart we are not looking tough! _E_ Thank you to Prime Minister of Australia for telling the truth about our very civil conversation that FAKE NEWS media lied about. Very nice! _E_ Together we will show the world that the forces of destruction and extremism are NO MATCH for the BLESSINGS of PROSPERITY and PEACE! __HTTP__ _E_ Will be joining @jimmyfallon on @FallonTonight at 11:35pmE tonight. Enjoy! _E_ My @amtalker int. on @whoradio w/@SteveKingIA discussing my upcoming campaign visit for Steve this Sat. in Iowa __HTTP__ _E_ Via @TV3Xpose: "@IvankaTrump: Think pink in the boardroom." __HTTP__ _E_ Boston incident is terrible. We need energy and passion but we must treat each other with respect. I would never condone violence. _E_ Last night William Shatner had more airtime than any winner. It should have been called the William Shatner show... _E_ Massive crowds already forming in Jacksonville will be and incredible day 12 noon! MAKE AMERICA GREAT AGAIN! _E_ Hope we all enjoy @60Minutes tomorrow night. I do believe they will treat me fairly! _E_ Thank you to Sue Kruczek who lost her wonderful and talented son Nick to the Opioid scourge for your kind words while on @foxandfriends. We are fighting this terrible epidemic hard Nick will not have died in vain! 
_E_ Still time to get out and VOTE!#WIPrimary #Trump2016 #MAGA __HTTP__ _E_ I told the Republicans the debt ceiling talks should come before election & we would have a Republican president—they wouldn't listen. _E_ Q/A @stalkinpeople Yes I'd give the real numbers. _E_ I'll be in one of my favorite places this morning Staten Island. Big crowd will be fun! _E_ Join us in Salt Lake City Utah tonight!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ In getting the endorsement of the 16500 Border Patrol Agents (thank you) the statement was made that the WALL was very necessary! _E_ Via @Newsmax_Media by @wandacarruthers: Donald Trump: US Defeating ISIS Only in John Kerry's Imagination __HTTP__ _E_ I hope the @GOP realizes that if they blow this election the Tea Party won't be with them next time. _E_ When candidate John Kasich on the @oreillyfactor talked about dismantling Medicare and Medicaid he was referring to Ben Carson. _E_ Amazing rally in Florida this is a MOVEMENT! Join us today at __HTTP__ __HTTP__ _E_ Fascinating to watch people writing books and major articles about me and yet they know nothing about me & have zero access. #FAKE NEWS! _E_ Today @FLOTUS hosted a Military Mother's Day Event in the East Room of the WH. It was an honor to stop by say hel... __HTTP__ _E_ "A leader does not deserve the name unless he is willing occasionally to stand alone." Henry A. Kissinger _E_ We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_ Failing @GlennBeck lost all credibility. Not only was he fired @ FOX he would have voted for Clinton over McCain. __HTTP__ _E_ Will be on @Morning_Joe in 5 minutes at 7:00. Enjoy! _E_ The @CelebApprentice will be broadcast tonight on @CNBC at 9 PM. _E_ Tomorrow is #TrumpTuesday on @SquawkCNBC 7:30 AM. Tune in! _E_ Ashley Judd has just thanked Karl Rove for all the attention he has given her—unreal!—how stupid can we get? _E_ Waste! 
With a $16T debt and $1T budget deficit @BarackObama is sending $770M overseas to fight global warming __HTTP__ _E_ Seems to be the next election must be about jobs and gas prices not birth control. _E_ Vote for your favorite TRUMP HOTEL COLLECTION hotels in Travel + Leisure's 2012 World's Best Awards Survey __HTTP__ _E_ How long did it take your staff of 823 people to think that up and where are your 33000 emails that you deleted? __HTTP__ _E_ I truly LOVE all of the millions of people who are sticking with me despite so many media lies. There is a great SILENT MAJORITY looming! _E_ Now @BarackObama has decided there are 5 million Palestinian refugees __HTTP__ He always goes against @Israel's interest. _E_ Marco Rubio lost big last night. I even beat him in Virginia where he spent so much time and money. Now his bosses are desperate and angry! _E_ Do you notice that because of Ebola ISIS etc. ObamaCare has gone to the back burner despite horrible results coming out. A disaster! _E_ The massive TAX CUTS/REFORM that I have submitted is moving along in the process very well actually ahead of schedule. Big benefits to all! _E_ The Democrats in the Southwest part of Virginia have been abandoned by their Party. Republican Ed Gillespie will never let you down! _E_ Gas prices are at crazy levels fire Obama! _E_ I hope everybody is having a FANTASTIC Christmas! No matter how tough things may seem remember that you will ride it out & go on to victory! _E_ Is everyone seeing how incompetently our country is being run by watching the mess with Syria? Our leaders don't know what they are doing! _E_ .@AlexSalmond –the man who let terrorist (Pan Am Flight 103) al Megrahi go lost another battle over ugly wind turbines in Blackdog. _E_ Via @FOXSports: Trump 'blowing up' @DoralResort after WGC @CadillacChamp __HTTP__ by @AP _E_ A fantastic day and evening in Washington D.C.Thank you to @FoxNews and so many other news outlets for the GREAT reviews of the speech! 
_E_ I employ many people in Hawaii at my great hotel in Honolulu. I'll be there very soon. Vote for me Hawaii! _E_ It is my great honor to support our Veterans with you! You can join me now. Thank you! #Trump4Vets __HTTP__ _E_ Tune in tonight at 1 AM EST to the QVC network to watch Melania Trump debut her first 2012 Melania Timepieces & (cont) __HTTP__ _E_ VOTE 4 @mariamenounos & derekhough#01 tonight! She's doing a great job on Dancing with the Stars #DWTS (& a good person). 1 800 868 3401 _E_ The Yankees really have to be embarrassed losing all four games to the Mets my great friend George Steinbrenner would be going nuts! _E_ Getting ready to lift off for Laredo. Will land at 1:OO P.M. Should be exciting and informative! _E_ It was a great honor to represent the United States at the magnificent #BastilleDay parade. Congratulations President @EmmanuelMacron! __HTTP__ _E_ .@MELANIATRUMP @IvankaTrump @EricTrump @DonaldJTrumpJr & I thank our loyal fans for another great season of @ApprenticeNBC! _E_ RT @WhiteHouse: #Obamacare has led to higher costs and fewer health insurance options for millions of Americans. It has failed the American... _E_ Later today I'm being honored at the Park Hyatt in Washington D.C. by the Wharton Club. The Joseph Wharton Award Dinner. A great honor. _E_ My son Donald will be interviewed by @seanhannity tonight at 10:00 P.M. He is a great person who loves our country! _E_ No wonder Sony is doing so badly. Really stupid leadership that wants Al Sharpton to help. Watch him turn the tables on chief Amy Pascal. _E_ RT @charliekirk11: Incredible video: @CBS does a special on the GOP tax plan The result?Every middle class family they sat down with SA... _E_ Verlander is great but very beatable. Does not have a good ERA in playoff games _E_ Heading to D.C. to speak at Faith and Freedom Coalition and visit OPO. _E_ "No matter how good you get you can always get better and that's the exciting part." 
@TigerWoods _E_ Yesterday was amazing—5 victories. Lyin' Ted Cruzhad zero. Things are going very well! _E_ Thank you for your support!#AmericaFirst #ImWithYou __HTTP__ _E_ Thank you Louisiana! #Trump2016#SuperSaturday _E_ I am greatly honored to receive Sarah Palin's endorsement tonight. Video: __HTTP__ __HTTP__ _E_ ObamaCare is one of the worst political disasters of all time 4992343 AMERICANS LOSING COVERAGE LESS THAN 50OOO NEW SIGNUPS. _E_ .... but you only want to talk about 10 years later when I still win 10PM in all key demos.@DannyZuker _E_ Via @todayshow by @ReeHines: "Donald Trump reveals new @ApprenticeNBC cast talks Joan Rivers' role on show" __HTTP__ _E_ Sometimes your best investments are the ones you don't make. _E_ I will be visiting Trump Int'l Golf Links in Scotland tomorrow. Always great to see the Great Dunes of Scotland. __HTTP__ _E_ I'll soon be leaving for Washington where @AmSpec will give me the T. Boone Pickens Entrepreneur Award. Very exciting! _E_ Crooked Hillary Attacks Foreign Government Donations While Ignoring Her Own: __HTTP__ _E_ I think that both candidates Crooked Hillary and myself should release detailed medical records. I have no problem in doing so! Hillary? _E_ Yogi Berra was not only a great baseball player he was a great guy. Yogi will be missed. __HTTP__ _E_ RT @piersmorgan: Trump makes a funny obvious joke about Russia going after Hillary's emails & U.S. media goes insane with fury.He plays t... _E_ Back by popular demand @latoyajackson returns to the 13th season of All Star @CelebApprentice. She is fierce in the Board Room! _E_ Anyone reading this profile of Marco Rubio would never vote for him. Never made ten cents & is totally controlled! __HTTP__ _E_ Join me in Sacramento California tomorrow evening @ 7pm! #Trump2016 __HTTP__ __HTTP__ _E_ LIVE on #Periscope __HTTP__ _E_ The three great essentials to achieve anything worth while are: Hard work stick to itiveness and common sense. Thomas A. 
Edison _E_ A test: tweet me the reason @billmaher was fired from @ABC (other than his bad ratings). _E_ The @USNavy is conducting search and rescue following aircraft crash. We are monitoring the situation. Prayers for all involved. __HTTP__ _E_ Meryl Streep one of the most over rated actresses in Hollywood doesn't know me but attacked last night at the Golden Globes. She is a..... _E_ Can you believe that the U.S. will be sending 3000 troops to Africa to help with Ebola.They will come home infected? We have enough problems _E_ Honored to welcome Georgia Prime Minister Giorgi Kvirikashvili to the @WhiteHouse today with @VP Mike Pence.... __HTTP__ _E_ .@BernieSanders who blew his campaign when he gave Hillary a pass on her e mail crime said that I feel wages in America are too high. Lie! _E_ The failing @politico news outlet which I hear is losing lots of money is really dishonest! _E_ I am happy to announce that theoriginal Apprentice which will offer job opportunities to those in need is coming back. _E_ If Obama keeps pushing wind turbines our country will go down the tubes economically environmentally & aesthetically. _E_ Trust yourself. Create the kind of self that you will be happy to live with all your life. Golda Meir _E_ RT @IvankaTrump: Beautiful article about @realDonaldTrump written by my friend the incredibly talented golfer Natalie Gulbis: __HTTP__ _E_ Guess what folks the ObamaCare website just went down again. What a disaster. _E_ As we are learning the hard way both domestically & internationally hope is not a strategy. _E_ My @gretawire int. on Obama scandals not resonating no retribution on Benghazi Obama not being engaged __HTTP__ _E_ Jerry Buss was a great guy and friend. He will be missed! _E_ Deportations are now at a record low. Obama manipulated the numbers to lie to the public that they were at a record high. Secure the border! _E_ .@AGSchneiderman has never once said that he didn't ask for campaign contributions during the investigation. 
_E_ ...and will be very embarrassed unless they get smart fast. _E_ The US is stupidly closing all of its coal fired plants while at the same time we're selling our coal to (cont) __HTTP__ _E_ Democrats are holding our Military hostage over their desire to have unchecked illegal immigration. Can't let that happen! _E_ #timetogettough The White House Correspondents' Dinner in my new book Time To Get Tough .....watch the #trumpvlog __HTTP__ _E_ Tim Kaine is and always has been owned by the banks. Bernie supporters are outraged was their last choice. Bernie fought for nothing! _E_ Always nice to see the terrific @mariamenounos at the #WWEHOF. __HTTP__ _E_ Obama told @NBC that Egypt is no longer an ally. They used to be until he pushed out Mubarak. _E_ Her instincts are suboptimal. __HTTP__ _E_ After 14 years U.S. beef hits Chinese market. Trade deal an exciting opportunity for agriculture. __HTTP__ _E_ Donald Trump appeared on the final episode of The Jay Leno Show to deliver a very special message: __HTTP__ _E_ .@AndreBauer Great job and advice on @CNN @jaketapper Thank you! _E_ .@johnboehner—if you can't make a great deal go over the cliff & negotiate new deal along with debt ceiling in February!—Trump 101. _E_ Ben Bradlee was truly one of the greats. What an amazing life he led. My warmest condolences to Dino & the whole family. #BenBradlee _E_ .@RealDonaldTrump wants a SAFE America w/ stronger borders no amnesty and an END to sanctuary cities. He is... __HTTP__ _E_ If the @yankees can somehow beat Verlander tonight then they can still salvage the series. And I will go to games 6& 7 so they will win! _E_ The replacement refs are getting blamed for everything. I've seen many bad sports calls over the years. _E_ Bernie Sanders has been treated terribly by the Democrats—both with delegates & otherwise. He should show them and run as an Independent! _E_ "A list golfing buddy! 
@Tegan__Martin enjoys golf w/Donald Trump ahead of@MissUniverse" __HTTP__ via @DailyMailCeleb _E_ It's Thursday how many more bias press reports will be released against @MittRomney? _E_ Congratulations to Bill O'Brien on being named the Republican Speaker of the NH House. Well earned & well deserved. A great guy. _E_ Statement Regarding Recent Executive Order Concerning Extreme Vetting: __HTTP__ _E_ The electric power grid in Puerto Rico is totally shot. Large numbers of generators are now on Island. Food and water on site. _E_ We fully support @SaveCulzean in Turnberry great for beauty & tourism. Wind turbines are death to environment. __HTTP__ _E_ It is that time of the year. The Trump Wollman Skating Rink is open to the public in Central Park. The greatest ice rink in the country. _E_ Failed presidential candidate @MittRomney was made to look like a fool by Senator Harry Reid & didn't release his tax returns until 9/21/12. _E_ .@Deadspin will never make it—they don't understand graciousness or money—and best guy is leaving? _E_ Will be on @foxandfriends now. _E_ "You have to have a good reason for doing what you're doing because people connect with the why." – Midas Touch _E_ .@tuckercarlson is doing a really good job on Fox especially when talking politics. He has come a long way fast! _E_ Thank you James Freeman of the @WSJ for the very nice words. All polls said I won the debate except NBC (3rd). Explain to Daniel Henninger! _E_ Why would a very low ratings radio talk show host like Hugh Hewitt be doing the next debate on @CNN. He is just a 3rd rate gotcha guy! _E_ Obama thinks he can just laugh off the fact that he refuses to release his records to the American public. He can't. _E_ By folding Penn State leadership made things worse. The deal is ridiculous & punishes the wrong people. I hope the alumni sue to overturn. _E_ What does.Obama know about the VA or business nothing just look at the five billion dollar ObamaCare website. We need a real leader! 
_E_ Trump Chicago was featured in Transformers 3. Trump Tower was featured in Dark Knight Rises. Both are summer blockbusters. #MidasTouch _E_ #TrumpVlog Be careful with Iran. __HTTP__ _E_ THe WH should not have hosted the Muslim Brotherhood. @BarackObama's friends are enemies of the US and @Israel. The Islamist winter is here. _E_ Make your NYC getaway memorable @TrumpNewYork provides both true luxury and top access to Midtown West __HTTP__ _E_ Unemployment is up in 44 states showing July's unemployment numbers to be broad based __HTTP__ @BarackObama is a job killer. _E_ Brian @kilmeade wrote a wonderful book called George Washington's Secret Six that is truly worth reading. __HTTP__ _E_ "If you strike out nobody is going to help you not your friends not the government. You have to look to look out for yourself." Think Big _E_ I will be interviewed on @NewDay @CNN at 7:00 A.M. _E_ Illegal immigration is a wrecking ball aimed at US taxpayers. Washington needs to get tough and fight for W... (cont) __HTTP__ _E_ Entrepreneurs: Difficulties mistakes & setbacks are an inevitable part of business & life. Remember to keep your equilibrium intact. _E_ How can Ted Cruz be an Evangelical Christian when he lies so much and is so dishonest? _E_ Fact – while Jeb was governor & Rubio was House Majority Leader Florida's debt more than doubled. Conservatives? _E_ "I love it when people doubt me. It makes me work harder to prove them wrong." – Derek Jeter _E_ NEVADA! Tomorrow is the deadline to register Republican.Visit: __HTTP__ from @IvankaTrump: __HTTP__ _E_ John Sununu was more right than he even knew yesterday @BarackObama indeed needs to learn how to be an American. _E_ Must read @WSJ column by Senator Phil Gramm "The Multiple Distortions of Wind Subsidies" __HTTP__ _E_ Played golf today with Prime Minister Abe of Japan and @TheBig_Easy Ernie Els and had a great time. Japan is very well represented! _E_ Such a great experience in New Hampshire amazing people! 
Will be leaving for a big event in South Carolina today. _E_ AN AMERICA FIRST ENERGY PLAN#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Amazing NH poll released! We are getting ready to Make America Great Again! __HTTP__ _E_ A general is just as good or just as bad as the troops under his command make him." Gen. Douglas MacArthur _E_ Welcome to the United States @IsraeliPM Benjamin & Sara!#ICYMI 🇱Joint Press Conference: __HTTP__ __HTTP__ _E_ Would love to send the NYC terrorist to Guantanamo but statistically that process takes much longer than going through the Federal system... _E_ This is what we can expect from #CrookedHillary. More Taxes. More Spending. #BigLeageTruth #DrainTheSwamp #Debates __HTTP__ _E_ More $ thrown away @BarackObama gave $20M to Amonix and praised its success in '10. It just filed for bankruptcy __HTTP__ _E_ 19 firefighters killed in Arizona terrible tragedy! _E_ Trump Links will be a great championship golf course that will host many major tournaments and bring tremendous $'s & prestige to N.Y.C.! _E_ Printing money is neither a short or long term solution to our country's economic woes. The Fed is destroyin... (cont) __HTTP__ _E_ Republicans and Democrats must come together now to make America great again! _E_ Thank you! #MAGA #AmericaFirst __HTTP__ _E_ I will rebuild the military take care of vets and make the world respect the US again! Join me today. Info: __HTTP__ _E_ "Listen to others but never negate your own instincts." – Trump Never Give Up _E_ Do you believe this Iran wants to trade our 3 prisoners (not 4) for 19 prisoners held by the U.S. Should have been let go with last deal! _E_ Low energy candidate @JebBush has wasted $80 million on his failed presidential campaign. Millions spent on me. He should go home and relax! _E_ #NEPrimary #VoteTrump #Trump2016 __HTTP__ __HTTP__ _E_ .@SecShulkin's decision is one of the biggest wins for our VETERANS in decades. Our HEROES deserve the best! ... 
__HTTP__ _E_ Join me in Oklahoma tomorrow night!#MakeYoutubeGreatAgain #Trump2016 __HTTP__ _E_ Go to Trump National Doral Miami and watch Tiger Phil Ernie Rory and all of the other great players compete in The WGC Cadillac Champ! _E_ If Hillary thinks she can unleash her husband with his terrible record of women abuse while playing the women's card on me she's wrong! _E_ RT @IsraelUSAforevr: @realDonaldTrump __HTTP__ _E_ The Herschel Walker interview on The Tim McCarver Show was fantastic much can be learned from watching. Congrats to Herschel and Tim! _E_ I have been drawing very big and enthusiastic crowds but the media refuses to show or discuss them. Something very big is happening! _E_ Entrepreneurs: See yourself as an organization. Pay attention to every facet of your life. What's strong? What's weak? What's missing? _E_ Sadly because president Obama has done such a poor job as president you won't see another black president for generations! _E_ Entrepreneurs: See yourself as victorious. Look at the solution not the problem. Keep your focus positive. _E_ An investment in knowledge pays the best interest. Benjamin Franklin _E_ Meeting w/ Washington D.C. @MayorBowser and Metro GM Paul Wiedefeld about incoming winter storm preparations here... __HTTP__ _E_ Veterans please call 855 VETS 352 or email address veterans@donaldtrump.com to share your stories about the need to reform the VA. _E_ #CelebApprentice What do you think of the new teams/PMs? _E_ Just got back from Las Vegas. @TrumpLasVegas Hotel was fantastic in every way but the fight was a total waste of time. The aggressor lost? _E_ WWII vs. Now! During the 3 1/2 years of World War II that started with the Japanese bombing of Pearl Harbor (cont) __HTTP__ _E_ Thank you Denver Colorado! #MakeAmericaGreatAgain! __HTTP__ _E_ ... The Republicans just didn't resonate with the people—but they will have better days. 
_E_ The White House is continuing to be openly uncooperative with the Fast and Furious investigation. American lives were lost. We need answers. _E_ ObamaCare premiums could jump as high as 51% __HTTP__ Terrible for economy. Repeal & Replace with free market solution! _E_ Just arrived at @trumpdoral for the @cadillacchamp starting tomorrow __HTTP__ _E_ Iron Mike Tyson was not asked to speak at the Convention though I'm sure he would do a good job if he was. The media makes everything up! _E_ Keep an open mind business is a creative endeavor. Strive for innovative ideas. _E_ RT @CLewandowski_: Please watch @foxandfriends today at 7:30 AM to watch me discuss @realDonaldTrump. _E_ Thank you. __HTTP__ _E_ "The more predictable the business the more valuable it is. Predictability also means consistency of brand experience." Midas Touch _E_ The failing @nytimes hates the fact that I have developed a great relationship with World leaders like Xi Jinping President of China..... _E_ Crooked Hillary Clinton has zero natural talent she should not be president. Her temperament is bad and her decision making ability zilch! _E_ Pres. Obama was touting Yemen as a great success story it just fell. Obama doesn't know what he is doing. Saudi Arabia is in big trouble. _E_ Eliot Spitzer has failed at everything he has ever done and now he wants to be comptroller. Thrown out of politics and off of TV CRAZY! _E_ It is so important to audit The Federal Reserve and yet Ted Cruz missed the vote on the bill that would allow this to be done. _E_ .@Gracematters Thank you a very wise bet! Best wishes. _E_ Via @trdmiami: "@TrumpDoral project will boast 800 hotel rooms" __HTTP__ $250M renovation on 800 acres in sunny Miami. _E_ Michelle Obama made a terrible mistake in Iowa. When endorsing Bruce Braley before a large crowd she called him Bruce Bailey seven times. _E_ .@CarlyFiorina Carly not just you I also told Gov. 
Kasich to "let Jeb talk give him a chance" because Kasich was constantly cutting in. _E_ It was great seeing @Schwarzenegger at the #WWEHOF. __HTTP__ _E_ How come the @TODAYshow & @chucktodd show the new @NBCNews Poll for Hillary vs Bernie but do not show the SAME poll where I am killing Cruz? _E_ JFK Files are being carefully released. In the end there will be great transparency. It is my hope to get just about everything to public! _E_ Looks like @bwilliams is having some problems with his Rock Center with Brian Williams show I hate to see such bad ratings for @NBC. _E_ .@CNN is #FakeNews. Just reported COS (John Kelly) was opposed to my stance on NFL players disrespecting FLAG ANTHEM COUNTRY. Total lie! _E_ China's Financial Institutions are expanding overseas. __HTTP__ They will own everything if we don't stop them now. _E_ I will be going to Trump National Doral in Miami early today to check on the construction of the hotel and the new Blue Monster. AMAZING! _E_ I can't believe that President Obama isn't able or willing to make just one phone call to the family of Kate Steinle.Come on Pres MAKE CALL! _E_ On Tuesday I visited with the incredible men & women of @ICEgov & @DHSgov Border Patrol in Yuma AZ. Thank you. We respect & cherish you! __HTTP__ _E_ Stock Market at all time high unemployment at lowest level in years (wages will start going up) and our base has never been stronger! _E_ My prayers are with the victims and hostages in the horrible Paris attacks. May God be with you all. _E_ I cannot imagine that Congress would dare to leave Washington without a beautiful new HealthCare bill fully approved and ready to go! _E_ For the haters out of hundreds of deals or transactions I have used the bankruptcy laws 4 times in order to cut better deals. 
_E_ Watch video of Ivanka Trump sharing business advice with 4 entrepreneurial women on GMA: __HTTP__ _E_ The people of Scotland love the golf course I have built it is now considered perhaps the greatest ever built! Thank you also to Robb Report _E_ So nice to get an endorsement from the founder and owner of Pizza Ranch in Iowa! A great guy and great places! #CaucusForTrump _E_ Have we ever had a POTUS before @BarackObama who earned over 1/3 of his income from foreign sources and paid taxes to another country? _E_ Who's the outsourcer? @BarackObama's campaign is using a travel company with outsourced jobs in China and India. __HTTP__ _E_ this election. That is a direct threat to our democracy. She then said We have to accept the results and look to the future Donald _E_ Never let them see you sweat! __HTTP__ _E_ Pakistani intelligence had full knowledge that Bin Laden was living in Abbottabad. They were sheltering him. _E_ The @NBCNews story has just been totally refuted by Sec. Tillerson and @VP Pence. It is #FakeNews. They should issue an apology to AMERICA! _E_ In case you missed it last week's @extratv interview with @AJCalloway discussing Tiger Woods & much more __HTTP__ _E_ Obama can open the Mall for illegals to protest our country yet he continues to barricade WWII memorial. That's an absolute disgrace. _E_ Prior to the election it was well known that I have interests in properties all over the world.Only the crooked media makes this a big deal! _E_ Right now we are running a massive $300 billion trade deficit with China. That means every year. China is (cont) __HTTP__ _E_ Re: immigration. Do the Republicans not realize that Dems will get 100% of 11 million votes no matter what they do? _E_ After the way I beat Gov. Scott Walker (and Jeb Rand Marco and all others) in the Presidential Primaries no way he would ever endorse me! 
_E_ WELCOME HOME AYA!#GodBlessTheUSA __HTTP__ _E_ RT @foxandfriends: Wall Street hits record highs after Trump pulls out of Climate pact __HTTP__ _E_ Obama should stop talking about wind turbines they are a disaster for a country or community & are very expensive & unreliable. _E_ I always felt I would be running and winning against Bernie Sanders not Crooked H without cheating I was right. _E_ I will be making the announcement of my Vice Presidential pick on Friday at 11am in Manhattan. Details to follow. _E_ #TBT Here I am with @gwenstefani and @donaldjtrumpjr __HTTP__ _E_ Subject to the receipt of further information I will be allowing as President the long blocked and classified JFK FILES to be opened. _E_ Working in Bedminster N.J. as long planned construction is being done at the White House. This is not a vacation meetings and calls! _E_ .@kimguilfoyle just watched you on @OutnumberedFNC thank you! _E_ This week we saw what Obama Care actually does when implemented. It is a losing issue for @BarackObama and must be repealed. _E_ I'll be on @foxandfriends this morning at 7:00. So much to talk about! _E_ High above the city @TrumpLasVegas' pool deck mixes business & pleasure over a soaring bar of sky bound gold __HTTP__ _E_ Obama called Reverend Wright his friend counselor & great leader then dumped him like a dog! _E_ even those registered to vote who are dead (and many for a long time). Depending on results we will strengthen up voting procedures! _E_ Why did @BarackObama let Iran keep our drone? Now it is going straight to the Chinese. He should have taken it out. _E_ Major rescue operations underway! _E_ Can't believe I finally got a good story in the @washingtonpost. It discusses the enthusiasm of Trump voters through campaign.... _E_ Truly honored to receive the first ever presidential endorsement from the Bay of Pigs Veterans Association. #MAGA... 
__HTTP__ _E_ Treasury has refused to name China a currency manipulator even though the yuan "remains significantly undervalued" __HTTP__ _E_ Why would I call China a currency manipulator when they are working with us on the North Korean problem? We will see what happens! _E_ Trump Virginia Office Announces Statewide TV Ad Strategy and Leadership Team: __HTTP__ __HTTP__ _E_ He @RickSantorum is now losing in the latest @ppppolls to @MittRomney in Pennsylvania __HTTP__ Rick is wasting everyone's time. _E_ Van Jones: 'There Is A Crack in the Blue Wall' — It Has to Do With Trade: __HTTP__ _E_ Via @BreitbartNews @biggovt: DONALD TRUMP TO SPEAK AT CPAC __HTTP__ by @michaelpleahy _E_ The United Nations Security Council just voted 15 0 in favor of additional Sanctions on North Korea. The World wants Peace not Death! _E_ Wife Huma wants @RepWeiner to pull a @billclinton by giving a tell all interview. Unlike Clinton Anthony is a sick puppy. _E_ Under his administration oil and gas production on public land is down over 10% __HTTP__ Obama did not tell truth last night. _E_ Via @nypost by @GeoffEarle: "Polls show 'President Trump' may not be so far fetched" __HTTP__ _E_ Obama's own top donor is now laying employees off and lowering hours in anticipation of Obama Care __HTTP__ The new reality. _E_ Entrepreneurs: You have to have passion. If you love your work success will follow. _E_ Big things going on today at Trump National Westchester! _E_ Together we will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_ With 3.5 million Americans receiving bonuses or other benefits from their employers as a result of TAX CUTS 2018 is off to great start!✅Unemployment rate at 4.1%.✅Average earnings up 2.9% in the last year.✅200000 new American jobs.✅#MAGA __HTTP__ _E_ The polls are really looking good—#1 everywhere despite all lobbyist & special interest $ being spent against me. I'm turning down millions. 
_E_ The White House Correspondents' dinner was so boring this year I guess that's because I didn't attend(even... __HTTP__ _E_ "Never think of learning as a burden. It may require some discipline but it prepares you for a new beginning."– Think Like a Champion _E_ So General Flynn lies to the FBI and his life is destroyed while Crooked Hillary Clinton on that now famous FBI holiday "interrogation" with no swearing in and no recording lies many times...and nothing happens to her? Rigged system or just a double standard? _E_ I want to thank Elizabeth Steve Brian and all of the great folks of @foxandfriends for the long and successful run we had together. NICE! _E_ It's Wednesday how much money is China stealing from us today? _E_ ObamaCare has cut workers' pay by over $22B & eliminated 350000+ small business jobs __HTTP__ Repeal before it's too late! _E_ Great new poll thank you!#MakeAmericaGreatAgain __HTTP__ _E_ I try to learn from the past but I plan for the future by focusing exclusively on the present. That's where the fun is. ~Donald Trump _E_ The Fed should not bail out the EU. Europe's financial mess is their problem not our problem! _E_ Democrats are laughingly saying that McCain had a moment of courage. Tell that to the people of Arizona who were deceived. 116% increase! _E_ I knew disgusting and unwanted porn star @REPWEINER was a sleazebag the first time I met him. Thank goodness he was revealed (so to speak). _E_ If @amazon ever had to pay fair taxes its stock would crash and it would crumble like a paper bag. The @washingtonpost scam is saving it! _E_ China is raising its defense budget by 11% __HTTP__ @BarackObama wants to cut ours by over $1Trillion. Wrong policy. _E_ "Big jobs usually go to the men who prove their ability to outgrow small ones." Theodore Roosevelt _E_ Was in Iowa yesterday great people. Record crowds at both speeches. Something big is happening. Pols are all talk. Make America great again! _E_ Wow! 
Does Eliot Spitzer have a girlfriend? This is getting exciting. _E_ First Minister of Scotland released bomber of Pan Am flight #103 on compassionate grounds. Do you believe? _E_ Thank you Kevin. With unification of the party Republican wins will be massive! __HTTP__ _E_ YOU NEED BOTH A PUBLIC AND A PRIVATE POSITION @HillaryClinton #Debates2016 __HTTP__ _E_ .@MRbelzer is a stone cold loser with no talent why did they ever put him on Law and Order? _E_ Weekly jobless claims are up once again. The economy cannot recover with Obama in office. _E_ Thank you Springfield Ohio. Get out and #VoteTrumpPence16!#ICYMI watch here: __HTTP__ __HTTP__ _E_ I have long stated that Brian Williams was not a very smart guy all you have to do is look at his past. Now he has proven me correct! _E_ I will be on ON THE RECORD @gretawire tonight at 10 pm _E_ Donald Trump's back with 14 'Apprentice' All Stars __HTTP__ via @AP _E_ Thank you for your support! Together we can #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_ #MakeAmericaWorkAgain #TrumpPence16 #RNCinCLE __HTTP__ __HTTP__ _E_ In Charlottesville VA @trumpwinery is Virginia's largest winery with 200 acres of French vinifera varieties __HTTP__ _E_ country and with the massive cost reductions I have negotiated on military purchases and more I believe the people are seeing big stuff. _E_ Derek Jeter's baseball and more in today's #trumpvlog... __HTTP__ _E_ I believe @BarackObama is manipulating the jobless numbers __HTTP__ _E_ While Hillary said horrible things about my supporters and while many of her supporters will never vote for me I still respect them all! _E_ ObamaCare does indeed ration care. Seniors are now restricted to comfort care instead of brain surgery. Repeal now! __HTTP__ _E_ I am doing On the Record With Greta Van Susteren at 10 P.M. on Fox. We will be talking about the bad economy and other subjects of interest! _E_ How ironic that @BarackObama's campaign would call me a charlatan. 
Have they looked at their boss's record? _E_ And the FAKE NEWS winners are... __HTTP__ _E_ Lightweight Senator Marco Rubio is VERY weak on immigration knows nothing about finance and would be incapable of making great trade deals! _E_ I have determined that it is time to officially recognize Jerusalem as the capital of Israel. I am also directing the State Department to begin preparation to move the American Embassy from Tel Aviv to Jerusalem... __HTTP__ _E_ .@NRO @JonahNRO Wow just looked at the stats for National Review. Dying fast doing very little business. Save this conservative voice! _E_ Robin Williams was a truly wonderful actor & comedian. One of the few people who could make me laugh. Very tragic. _E_ Great going to all of Dubai in winning what will be a fantastic #Expo2020 we will all be there! _E_ .@redcross CEO's salary in 2011 was $951957. Where is the outrage? _E_ Where is the outrage for this Disney book? Is this the 'Star of David' also? Dishonest media! #Frozen __HTTP__ _E_ .@JonahNRO You should be totally focused on trying to save the badly failing National Review instead of focusing on me. Work hard! @NRO _E_ Via @WBJonline by @WBJHolan: "Donald Trump hints at presidential run promises 'great luxury hotel' for D.C." __HTTP__ _E_ The Chinese are smart. They bought up over $7B in US housing last year __HTTP__ U.S. is busy making China even richer. _E_ A general is just as good or just as bad as the troops under his command make him. Douglas MacArthur _E_ Have you ever seen our country look weaker or more pathetic: Snowden ObamaCare VA Russia jobs decimated military debt and so much more _E_ The now $1.2B ObamaCare website is as bad as ever insurers not getting the proper data. __HTTP__ _E_ Have a happy successful and healthy New Year! _E_ Congratulations to the 2016 @ClemsonFB Tigers!Full ceremony: __HTTP__ __HTTP__ _E_ Congratulations to @AmericansElect for winning a spot on the California 2012 ballot. A major feat! 
__HTTP__ _E_ "See yourself as victorious! That will focus you in the right direction." – Trump Never Give Up _E_ Will be in Alabama tonight. Luther Strange has gained mightily since my endorsement but will be very close. He loves Alabama and so do I! _E_ Michelle Nunn will be a solid vote for Obama. She supports ObamaCare & opposes 2nd Amendment. Vote for @Perduesenate to change things! _E_ President Obama is losing on so many fronts in fact all fronts that I am concerned he will do something totally irrational. He can't lead! _E_ We just have to get tough get smart and get a president willing to stand up for America and stick it to the (cont) __HTTP__ _E_ Michelle Nunn will be a rubber stamp for Barack Obama. @Perduesenate. GOTV for David this Tuesday! _E_ I will be in Maryland this afternoon for a major rally. Things are looking good for Tuesday! _E_ I will be interviewed on @greta at 7:00 P.M. @FoxNews _E_ Re build the United.States not places that hate our country and everything we stand for! _E_ The Obama's Spain vacation cost taxpayers over $476K __HTTP__ They love to spend money. _E_ I watched the last two minutes of the @dallasmavs game last night I just loved watching them lose. _E_ Not only does ObamaCare have at least 21 new taxes but it will lead to a tremendous doctor shortfall. _E_ I'm a star maker Adrian has continued to receive many fans in @TrumpTowerNY and @AmandaTMiller is definitely on the map! #CelebApprentice _E_ China's new AND ADVANCED currency manipulation is killing the U.S. Help! _E_ Whitney Houston was a great friend and an amazing talent. We will all miss her and send our prayers to her family. _E_ We have millions in our country unemployed yet we are wasting millions arming Syrian 'rebels.' What is wrong with Washington?! _E_ Good luck to my new friends on your testimony in DC. You are amazing people doing something so important stopping illegal immigration! _E_ Via @worldnetdaily: JAILED U.S. 
PASTOR'S WIFE PRAISES TRUMP: 'I hope more people like him will speak out' __HTTP__ _E_ Crooked Hillary says we must call on Saudi Arabia and other countries to stop funding hate. I am calling on cont'd: __HTTP__ _E_ I've always been a fan of Steve Jobs especially after watching Apple stock collapse w/out him – but the yacht he built is truly ugly. _E_ Have to go now to sign a great and job producing deal! Good night. _E_ Hope to see you tomorrow in Trump Tower (5th Ave betw 56 and 57) I'll be signing copies of my book #TimeToGetTough from noon until 2 pm _E_ Nobody will protect our Nation like Donald J. Trump. Our military will be greatly strengthened and our borders will be strong. Illegals out! _E_ Old Post Office Building in DC will be a world class Trump property. Honored to be doing this historic building Washington will be proud. _E_ Move slowly carefully and then strike like the fastest animal on the planet! _E_ Thank you to our amazing law enforcement officers! #MAGA __HTTP__ _E_ American steel & American hands have constructed a 100000 ton message to the world: American MIGHT IS SECOND TO NONE!#USSGeraldRFord #USA __HTTP__ _E_ Hurricane is good luck for Obama again he will buy the election by handing out billions of dollars. _E_ We will never cut spending until we actually work off of a budget. The Democrats haven't passed one in over 3 years. What a joke. _E_ The Better Business Bureau report with an A rating for Trump University. #GOPDebate __HTTP__ __HTTP__ _E_ RT @DRUDGE_REPORT: RICE ORDERED SPY DOCS ON TRUMP? __HTTP__ _E_ .@RNC leadership should not be afraid of a government shutdown. They should be afraid of not defunding ObamaCare. _E_ Trump's Campaign Hat Becomes an Ironic Summer Accessory The New York Times. __HTTP__ _E_ Even though I have a very biased and unfair judge in the Trump U civil case in San Diego I have thousands of great reviews & will win case! _E_ Will be speaking with Italy this morning! 
_E_ Tremendous investment by companies from all over the world being made in America. There has never been anything like it. Now Disney J.P. Morgan Chase and many others. Massive Regulation Reduction and Tax Cuts are making us a powerhouse again. Long way to go! Jobs Jobs Jobs! _E_ There's only one candidate who cut medicare and that's Barack Obama. Cut over $700M to move into ObamaCare. _E_ My @gretawire interview where I discuss fixing the economy killing Bin Laden the John Edwards trial and fair trade. __HTTP__ _E_ Isn't the WORLD tired of hearing President Obama say he knew nothing about anything time to take responsibility for all of your mistakes! _E_ I am happy to see the majority of the GOP candidates agree with me that the tax code must be simplified and the rates dropped. _E_ Will be doing Fox and Friends in two minutes! _E_ The only reason I am critical of the Pinehurst look is because I'm a lover of golf—and that look on TV hurts golf badly. _E_ Vegas' top destination @TrumpLasVegas is a 64 story tower of golden glass __HTTP__ What goes on there stays there! _E_ .@ThisWeekABC with @GStephanopoulos had fantastic numbers last Sunday Trump interview. Nice! _E_ #TBT Taking piano lessons from my friend Elton John. __HTTP__ _E_ Just got a great new selection of ties & shirts @Macys. Go buy them now for Father's Day—they're beautiful! _E_ Going to Charlotte NC to speak before more than 20000 people on Saturday morning—total sellout crowd—will be great! _E_ I hear @billmaher really bombed in Springfield people were leaving show way early stupid guy! _E_ Great POLL numbers are coming out all over. People don't want another four years of Obama and Crooked Hillary would be even worse. #MAGA _E_ I'll be on @foxandfriends on Monday at 7:30 AM. _E_ "Invincibility lies in the defence the possibility of victory in the attack." Sun Tzu _E_ They found Jessica in Colorado body was mutilated death to the pervert killer. 
_E_ Looking forward to being the special guest at tonight's Dutchess County #GOP dinner to a SOLD OUT crowd. It will be great fun. _E_ Just landed in the Philippines after a great day of meetings and events in Hanoi Vietnam! __HTTP__ _E_ Respected Morning Consult poll just out. I lead all Republicans and beat Hillary head to head by a wide margin 45 to 40! _E_ 'Presidential Executive Order on Promoting Agriculture and Rural Prosperity in America'Executive Order:... __HTTP__ _E_ I received calls from the President of Mexico and the Prime Minister of Canada asking to renegotiate NAFTA rather than terminate. I agreed.. _E_ 'Scandals surround Clinton's gatekeeper at State'#DrainTheSwamp __HTTP__ _E_ Yesterday Obama compared Nelson Mandela to George Washington in Africa. Do you think he really believes it? _E_ Had a meeting with the terrific @GovPenceIN of Indiana. So excited to campaign in his wonderful state! __HTTP__ _E_ I will be in Cincinnati Ohio tomorrow night at 7:30pm join me! #OhioVotesEarly #VoteTrumpPence16 Tickets:... __HTTP__ _E_ I will be on @oreillyfactor at 8:00 P.M. Enjoy! _E_ My comments on a larger screen iPhone were in addition to existing unit not a replacement. Screen should be 10% larger than Samsung. _E_ Why is the UN condemning @Israel and doing nothing about Syria? What a disgrace. _E_ The podium in the Oval Office looks odd! Not good but the words will be the key. _E_ Majority of Independents want Obamacare overturned __HTTP__ The best way to do it is by voting out @BarackObama _E_ He made a great contribution to the press @AndrewBreitbart will be missed. _E_ Our President should stop trying to be an economist to the world and start fighting for our economy. Instead (cont) __HTTP__ _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ I am honored that the great men and women of the @Teamsters have created a movement from within called Teamsters for Trump! Thank you. _E_ I knew Chris Matthews when he was sane and quite honestly wonderful. 
Now he's gone off the deep end as an Obama surrogate. @hardball_chris _E_ Looking forward to visiting the Trump Vineyard Estates today in Charlottesville VA for a press conference and the grand opening. _E_ Exclusively @Macys The Donald J. Trump Signature Collection features the best ties & shirts at the best prices. __HTTP__ _E_ ... Is a third party coming? I hope not. _E_ My conversation from ON THE RECORD @Gretawire __HTTP__ _E_ .@GovernorPerry stopped by to say hello. __HTTP__ _E_ If @BarackObama had to use the same labor participation he had when he entered office then the unemployment number would be 11.2% _E_ "Competitive golf is played mainly on a five and a half inch course... the space between your ears." Bobby Jones _E_ Former Obama White House economic adviser @Austan_Goolsbee gave his old boss a 'C' on the economy __HTTP__ Pretty generous! _E_ The @MissUSA 2012 contestants pose for a picture with me at Trump Tower in New York City __HTTP__ _E_ It's Friday. How much money has been wasted on defunct ObamaCare website today? _E_ .@KevinHart4real joined @woodmank104 @katek104 @K1047 & was asked about his thoughts on @realDonaldTrump #Trump2016 Thanks Kevin so nice! _E_ There is no question who will handle the threat of terrorism best as #POTUS. #Trump2016 __HTTP__ __HTTP__ _E_ President Obama has a major meeting on the N.Y.C. Ebola outbreak with people flying in from all over the country but decided to play golf! _E_ Fake News is at an all time high. Where is their apology to me for all of the incorrect stories??? _E_ I will be on @SeanHannity tonight at 10pmE talking about my new book #CrippledAmerica and much more! #MakeAmericaGreatAgain #Trump2016 _E_ Further proof that Gang of Eight member Marco Rubio is weak on illegal immigration is Paul Singer's Mr. Amnesty endorsement.Rubs can't win _E_ .@DanaPerino & @BradThorThank you so much for the wonderful compliment. Working hard! 
#MAGA __HTTP__ _E_ For what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_ Now the world is looking to China for an economic 'lift' __HTTP__ @BarackObama has ruined our economic hegemony. _E_ Economics behind ugly bird killing wind turbines do not work will destroy Scotland's beautiful coastline. (cont) __HTTP__ _E_ .@aaronschock Aaron it was great to meet you at Trump Tower. Also really good job on television! _E_ The Greater Miami area and numerous others are fighting hard to get the Miss Universe Pageant. A decision will be made very soon! _E_ RT @MichaelCohen212: I have never been to Prague in my life. #fakenews __HTTP__ _E_ Should not pass bad deal! __HTTP__ _E_ The only thing that can stop this corrupt machine is YOU. The only force strong enough to save our country is US.... __HTTP__ _E_ People forget it was Club for Growth that asked me for $1 million. I said no & they went negative. Extortion! __HTTP__ _E_ I will be on the @colbertlateshow tonight at 11:30 __HTTP__ _E_ Negotiation is persuasion more than power. Be reasonable and flexible and never let anyone know exactly where you're coming from. _E_ Thank you @TheFix Chris Cillizza. It is a true person of character that can change his opinion & do what is right. __HTTP__ _E_ I think somebody should pick Johnny Football he will be a star. _E_ As long as we have faith in each other and confidence in our values then there is no challenge too great for us to conquer! #ALConv2017 __HTTP__ _E_ Looking forward to receiving the T. Boone Pickens Entrepreneur Award at tomorrow's @AmSpec Robert L. Bartley Gala dinner. _E_ Beautiful weather all over our great country a perfect day for all Women to March. Get out there now to celebrate the historic milestones and unprecedented economic success and wealth creation that has taken place over the last 12 months. Lowest female unemployment in 18 years! 
_E_ .@AlexSalmond Ireland just ended the bird killing wind farm near my great resort on the Atlantic Ocean. The reason would hurt tourism! _E_ ...they do NOTHING for us with North Korea just talk. We will no longer allow this to continue. China could easily solve this problem! _E_ Photo from @IvankaTrump of Trump International Golf Links & Hotel Ireland __HTTP__ _E_ Let's get out of Afghanistan. Our troops are being killed by the Afghanis we train and we waste billions there. Nonsense! Rebuild the USA. _E_ The best vision is insight. Malcolm Forbes _E_ Reporter @AlHunt is one boring and low vision guy! _E_ We all have the capability to read or sense what's happening with others. It can often give you the edge (cont) __HTTP__ _E_ .@GiulianaRancic & @nickjonas are co hosting Miss USA 2013 Sunday night at 9 PM ET on NBC. @JonasBrothers will be performing. Tune in! _E_ .@THEGaryBusey is definitely different. #CelebApprentice _E_ Democrats are smiling in D.C. that the Freedom Caucus with the help of Club For Growth and Heritage have saved Planned Parenthood & Ocare! _E_ GOP Voters Trust Donald Trump to Keep Our Country Safe __HTTP__ _E_ I am the only potential owner of the @buffalobills who will keep the team in Buffalo where it belongs! _E_ Congratulations to @Yankees Derek Jeter on passing Eddie Murray last night to become the 11th all time @MLB hit leader. _E_ Does anyone really believe that President Obama found out about Petraeus immediately after the election? _E_ ... among ABC CBS and NBC in the key news demo of adults.... _E_ I'm with YOU! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Sleazy Adam Schiff the totally biased Congressman looking into Russia spends all of his time on television pushing the Dem loss excuse! _E_ Political strategist Stuart Stevenswho led Romney down the tubes in what should have been an easy victoryhas terrible political instincts! 
_E_ My interview yesterday with @foxandfriends discussing the failure of the Super Committee and GOP 2012.... __HTTP__ _E_ Biggest Tax Bill and Tax Cuts in history just passed in the Senate. Now these great Republicans will be going for final passage. Thank you to House and Senate Republicans for your hard work and commitment! _E_ In times of tragedy the bonds that sustain us are those of family faith community and country. These bonds are stronger than the forces of hatred and evil and these bonds grow even stronger in the hours of our greatest need. __HTTP__ __HTTP__ _E_ NBC Wall St Journal Poll of African American voters: 94% @BarackObama 0% @MittRomney.Even worse than Hillary's old numbers. Is that racism? _E_ I am impressed with how clearly @PaulRyanVP explains the challenges we face and the solutions @MittRomney will bring as President. _E_ Have a great game today @USArmy and @USNavy I will be watching. We love our U.S. Military. On behalf of an entire Nation THANK YOU for your sacrifice and service! #ArmyNavyGame #USA __HTTP__ _E_ "Golfer bids $130000 for round with Donald Trump" in Scotland for charity __HTTP__ via Evening Express _E_ Via @Reuters: Donald Trump takes steps toward 2016 presidential run __HTTP__ _E_ Entrepreneurs: Keep your eyes on your ideals as well as reality. Accentuate the positive without being blind to the negative. _E_ Chinese demand is raising the price of oil to$123/Barrel __HTTP__ We need to use our own energy resources. _E_ 2011 #CelebrityApprentice winner @JohnRich and @MarleeMatlin interviewed the final four in this week's episode __HTTP__ _E_ .@HuffingtonPost is doing very badly. Also very inaccurate stories. Like AOL when will they fail? _E_ Why is @BarackObama delaying the sale of F 16 aircraft to Taiwan? Wrong message to send to China. #TimeToGetTough _E_ I am offering the chance for Barack Obama to redistribute $5M to any charity of his choice. Everyone wins. Take the deal. _E_ China is pushing North Korea! 
_E_ #WVPrimary #VoteTrump #Trump2016 __HTTP__ __HTTP__ _E_ ObamaCare is a major threat to America's entrepreneurial spirit and competitiveness. Small businesses will b... (cont) __HTTP__ _E_ An Iranian nuclear scientist's car exploded in Tehran yesterday lots of problems to come @BarackObama we need real leadership. _E_ If our border is not secure we can expect another attack. A country with open borders is open to the terrorists. _E_ Do not allow our very stupid leaders to sign a deal that keeps us in Afghanistan through 2024 with all costs by U.S.A. MAKE AMERICA GREAT! _E_ Yesterday was Veterans Day. I hope our armed service members felt appropriately honored. This nation loves and respects all of you. _E_ Just had a very nice meeting with @Reince Priebus and the @GOP. Looking forward to bringing the Party together and it will happen! _E_ Crooked's top aides were MIRED in massive conflicts of interests at the State Dept. We MUST #DrainTheSwamp __HTTP__ #Debate _E_ Someone unknown tweeted incorrectly that I'm for Sen. Mitch @McConnellPress for speaker. I'm supporting him for Senate Majority Leader _E_ Time Magazine called to say that I was PROBABLY going to be named "Man (Person) of the Year" like last year but I would have to agree to an interview and a major photo shoot. I said probably is no good and took a pass. Thanks anyway! _E_ Donald Trump plans return to Iowa __HTTP__ via @KCCINews _E_ "Iowa hirings suggest Donald Trump serious about 2016 White House bid" __HTTP__ via @WashTimes by @SethMcLaughlin1 _E_ ...It's called intellectual property rights something they know nothing about. _E_ The ObamaCare website still is not complete. $5 billion and no progress. Scary and sad! _E_ David Brooks of the New York Times is closing in on being the dumbest of them all. He doesn't have a clue. _E_ These Tsarnaev brothers did not work alone. They had help and assistance from other cell members. Be vigilant and on the lookout. 
_E_ The tragedy in Newtown really makes you understand how life is so fragile. Must appreciate every minute! _E_ Trump vows to fight 'epidemic' of human trafficking __HTTP__ _E_ It all begins today! I will see you at 11:00 A.M. for the swearing in. THE MOVEMENT CONTINUES THE WORK BEGINS! _E_ Honor to have been interviewed by the very wonderful @bishopwtjackson in Detroit last week tune in at 9pmE. Enjoy! __HTTP__ _E_ The New Black Panthers are back at the same Philly polling station from '08 __HTTP__ Don't let them intimidate you! _E_ Time to start building in our country with American workers & with American iron aluminum & steel. It is time to... __HTTP__ _E_ "To keep your momentum going you must have intrinsic values as well as monetary values. Know when to give back." – Think Big _E_ I will be going to the funeral of my friend Joan Rivers today. I got to know her really well when she became the winner of The Apprentice! _E_ Wouldn't it be nice if our government could build a wall on the border under budget and ahead of schedule?! my @SRQRepublicans speech. _E_ Excited that @OurCountryPAC's @Amy Kremer has endorsed the Newsmax @iontv debate. The Tea Party Express is a great group. _E_ Why is the @GOP congress focusing on amnesty when so many Americans are unemployed? _E_ Thank you Andrew Jackson! #POTUS7 #USA __HTTP__ _E_ Hillary Clinton made a speech today using the biggest teleprompter I have ever seen. In fact it wasn't even see through glass it was black _E_ RT @Reince45: Promise kept. @POTUS exits flawed #ParisAccord to seek better deal for U.S. workers & economy. This WH will always put #Ameri... _E_ Many people walked out on Madonna's concert when she told them to vote for Obama. Years ago I walked out because the concert was terrible! _E_ It was an honor to welcome President Al Sisi of Egypt to the @WhiteHouse as we renew the historic partnership betwe... __HTTP__ _E_ Watch to see the new cast of @ApprenticeNBC __HTTP__ _E_ A big day for the U.S. 
at the United Nations! _E_ Just out: TRUMP GOP DEBATE 18000000. CLINTON DEMOCRAT DEBATE 6700000. And they were on major network vs. cable! _E_ Be weak on immigration and ensure Democratic victory. _E_ Everyone should cancel HBO until they fire low life dummy Bill Maher! Get going now and feel good about yourself! _E_ Rickie Fowler @therealrickiefowler Instagram photos | Websta __HTTP__ via @websta _E_ Even NY Democrats are avoiding @BarackObama's convention __HTTP__ He is dragging his own party down with him _E_ Getting ready to take off for Nashua New Hampshire. Big crowd will be there soon. Fun! _E_ .@cyndilauper Condolences on the passing of your uncle and best wishes. _E_ ...really hard to help but many have lost their homes. Military is now on site and I will be there Tuesday. Wish press would treat fairly! _E_ Jeff Sessions is an honest man. He did not say anything wrong. He could have stated his response more accurately but it was clearly not.... _E_ I have many great people but also an amazing number of haters and losers responding to my tweets why do these lowlifes follow nothing to do! _E_ Less than ten days until I keynote @bobvanderplaats' @theFAMiLYLEADER Leadership Summit. Tix going fast. __HTTP__ _E_ Everybody is arguing whether or not it is a BAN. Call it what you want it is about keeping bad people (with bad intentions) out of country! _E_ In the last 2 weeks I had $35M of negative ads against me in Florida & I won in a massive landslide.The establishment should save their $$! _E_ I really liked everyone at the @WWE Hall of Fame ceremony fantastic people! _E_ Amazing various celebrities were far harsher than me with political statements but media doesn't care about... __HTTP__ _E_ RT @realDonaldTrump: I will be interviewed tonight on @FoxNews by @SeanHannity at 9pmE. Enjoy! _E_ Putin is having such a good time. Our President is making him look like the genius of all geniuses. 
Do not fearwe are a NATION OF POTENTIAL _E_ RT @EricTrump: Friends: If you live in AL AK AR CO GA MA MN OK TN TX VT or VA get out and VOTE on Tuesday! #Trump2016 __HTTP__ _E_ Here I am with Whitney Houston at a party at Mar a Lago. __HTTP__ _E_ Thank you! #GOPDebate MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Here I am with @IvankaTrump and erictrump presenting the WGC @CadillacChamp Trophy to Tiger Woods at... __HTTP__ _E_ Rated Toronto's #1 hotel the 65 story 5 Star @TrumpTO is located in the heart of the city's finest attractions __HTTP__ _E_ Thank you for having me! I enjoyed the tour and spending time with everyone. See you soon. #MAGA __HTTP__ _E_ Dishonest reporters knowingly write lies that I said "children should not get vaccinated." I believe fully (cont) __HTTP__ _E_ RT @MoskowitzEva: .@BetsyDeVos has the talent commitment and leadership capacity to revitalize our public schools and deliver the promise... _E_ If @RepMarkMeadows @Jim_Jordan and @Raul_Labrador would get on board we would have both great healthcare and massive tax cuts & reform. _E_ Home ownership is at a 19 year low. If you can buy now. You will thank me later. _E_ Via @AmSpec by Jeffrey Lord: "Donald Trump: America's Entrepreneur" __HTTP__ Wow thank you to Jeffrey Lord & @AmSpec! _E_ While @BarackObama tries to push gun control __HTTP__ He still has not answered for Project Gun Runner __HTTP__ _E_ Fight is over Mayweather lost big but lets see what judges say! _E_ If Obama mentions Mitt's tax returns in tomorrow's debate then Mitt should immediately ask for Obama's college records & applications _E_ The White House never looked more beautiful than it did returning last night. Important meetings taking place today. Big tax cuts & reform. _E_ Congratulations to @DanaPerino on your book going to number one on Amazon. Great book Great job! _E_ Can someone explain to me how a Chechnyan permanent resident non citizen in our country is planning Jihad while on welfare? 
_E_ Ed Gillespie worked hard but did not embrace me or what I stand for. Don't forget Republicans won 4 out of 4 House seats and with the economy doing record numbers we will continue to win even bigger than before! _E_ Watch my live book signing now! __HTTP__ _E_ Just watched lightweight Marco Rubio lying to a small crowd about my past record. He is not as smart as Cruz and may be an even bigger liar _E_ Thank you @BillKristol. I am going to Make America Great Again! _E_ American corporations and entrepreneurs are masters of technological and business innovation but the Chinese (cont) __HTTP__ _E_ Clive Davis gave a great eulogy at my friend Whitney Houston's funeral absolutely amazing! _E_ RT @DonaldJTrumpJr: Thanks New Hampshire!!! #NH #NewHampshire #MAGA __HTTP__ _E_ Which team do you think has the edge in this interactive photo experience task assignment? _E_ Via @EW: "@CelebApprentice All Stars' first trailer" __HTTP__ _E_ Everyone join me tomorrow at 11 AM in Trump Tower atrium. _E_ Great Tax Cut rollout today. The lobbyists are storming Capital Hill but the Republicans will hold strong and do what is right for America! _E_ RT @DRUDGE_REPORT: GREAT AGAIN: FEDS ARREST MURDER SUSPECT IN 'FAST AND FURIOUS' SCANDAL... __HTTP__ _E_ Since the Democrats decided to kill the filibuster they now own it.Republicans should keep the new rule when they're in the majority. _E_ Broken promises. A broken billion dollar website. ObamaCare can't be fixed. Repeal! _E_ Serious stuff IRS Commissioner visited White House 157 times far more than Sec. of State or Defense. What a big story this is! _E_ Taking a photo with my family on the opening day of Trump International Golf Links Scotland __HTTP__ _E_ The U.S. is now begging Russia to give back Edward Snowden. In a letter they promised no death penalty for the traitor. No respect! _E_ So great to have the endorsement and support of Paul Ryan. We will both be working very hard to Make America Great Again! 
_E_ We should be focused on magnificently clean and healthy air and not distracted by the expensive hoax that is global warming! _E_ According to new WPOST ABC poll Obama has just lost 14 points on public trust with economy _E_ On my way to South Carolina. Big Crowd look forward to it! _E_ Dee Dee Sorvino @deedeegop I am betting on Trump _E_ If someone made a nasty or controversial statement about me to the president do you really think he would come to my rescue? No chance! _E_ Via @ABCPolitics by @rickklein: Trump Blasts Romney Bush Says GOP Has 'Nobody Like Trump' __HTTP__ _E_ I have a judge in the Trump University civil case Gonzalo Curiel (San Diego) who is very unfair. An Obama pick. Totally biased hates Trump _E_ Apple must make the IPhone screen bigger. Losing major market share. _E_ Many think that the Championship Course at Turnberry home of The Duel In The Sun will be the worlds best after the renovation. _E_ To @TigerWoods He is truly a great champion and we were honored to have him at Trump National Doral. @DoralResort #Trump _E_ While the Republicans and Democrats in Congress are working hard to come up with a solution to DACA they should be strongly considering a system of Merit Based Immigration so that we will have the people ready willing and able to help all of those companies moving into the USA! _E_ MUST READ @IBDeditorials: "President Obama's Amnesty At Any Price" __HTTP__ Congress Use the Power of Purse! Defund Amnesty! _E_ My speech at yesterday's @SteveKingIA @Citizens_United Iowa Freedom Summit __HTTP__ via @FoxNews _E_ Just the beginning & it is going to get worse. Rates & deductibles are so high nobody is going to be able to use it. __HTTP__ _E_ Today Obama will give another speech on the economy. Tomorrow our country will still be $17T+ in debt with 18% real unemployment. _E_ Went to the Yankees game last night with Bill O'Reilly we had a great time watching the Yankees win! 
_E_ Wow Kasich didn't qualify to run in the state of Pennsylvania not enough signatures. Big problem! _E_ On @seanhannity show @FoxNews now. ENJOY! _E_ Join me on Wednesday May 25th at the Anaheim Convention Center!#Trump2016 #MAGA Tickets: __HTTP__ __HTTP__ _E_ The failing @nytimes reporters don't even call us anymore they just write whatever they want to write making up sources along the way! _E_ Thank you @FLGovScott. __HTTP__ _E_ I will be interviewed on @foxandfriends at 8:30 A.M. ENJOY! _E_ Business is an art in itself and powerful negotiation skills are one of the techniques necessary to facilitate success. _E_ Put big game trophy decision on hold until such time as I review all conservation facts. Under study for years. Will update soon with Secretary Zinke. Thank you! _E_ As the phony Russian Witch Hunt continues two groups are laughing at this excuse for a lost election taking hold Democrats and Russians! _E_ .@TrumpNewYork on CPW in NYC is the home of the globe that has become an icon in the city. #CelebApprentice _E_ 90 stories over midtown New York Trump World Tower's glass curtain wall is a true landmark __HTTP__ _E_ Realize that an entrepreneur's most important gift to the world is jobs security and well being for others. Midas Touch _E_ The #MissUniverse Pageant is the biggest pageant of them all—by far! _E_ Behind the scenes photo of @Gretawire and I filming an interview __HTTP__ Watch tonight at 10PM ET on @FoxNews. _E_ When the economy is bad @BarackObama wants to raise taxes. When the economy is good @BarackObama wants to raise taxes. Notice a trend? _E_ If these scandals happened before the election Obama could not have won. _E_ RT @billoreilly: Hannity crushing MSNBC at 9. Good for him! Check the No Spin News on __HTTP__ Killing England a huge bests... _E_ A great honor to visit the 9/11 Memorial Museum with my wife @MELANIATRUMP today. #NewYorkValues __HTTP__ _E_ I am surprised that Hugo Chavez can keep power in his weak physical condition! 
_E_ Leaving Hamburg for Washington D.C. and the WH. Just left China's President Xi where we had an excellent meeting on trade & North Korea. _E_ House of Representatives shouldn't give anything to Obama unless he terminates Obamacare. _E_ The money losing @politico is considered by many in the world of politics to be the dumbest and most slanted of the political sites. Losers! _E_ Michelle Obama likes to be addressed as Your Excellency. __HTTP__ She is an excellent spender of taxpayer money on herself. _E_ All new @ApprenticeNBC starts right now! __HTTP__ _E_ Who should win Celebrity Apprentice on Monday night? Show will be telecast LIVE! _E_ If a player wants the privilege of making millions of dollars in the NFLor other leagues he or she should not be allowed to disrespect.... _E_ It's a shame the ruling class of Republicans don't attack Obama and the Democrats the way they hit Senators Cruz & Lee. _E_ .@DavidGregory got thrown off of TV by NBC fired like a dog! Now he is on @CNN being nasty to me. Not nice! _E_ Make sure to catch @history's season finale of "The Men Who Built America" on Sun November 11th. Great show. _E_ Thank you Dan I agree! Best wishes. __HTTP__ _E_ Remember don't believe sources said by the VERY dishonest media. If they don't name the sources the sources don't exist. _E_ Thank you for the incredible support Melania Barron Ivanka Jared Tiffany Don Vanessa Eric and Lara! __HTTP__ _E_ I don't blame China I blame the incompetence of past Admins for allowing China to take advantage of the U.S. on trade leading up to a point where the U.S. is losing $100's of billions. How can you blame China for taking advantage of people that had no clue? I would've done same! _E_ When @BarackObama is not vacationing he is hosting his top donors in the White House __HTTP__ Always having a good time! _E_ There is no way my friend Bob Kraft agreed not to appeal the NFL decision without making a deal to at least get something. 
We love Tom Brady _E_ If a person is #1 at Harvard and comes from Europe or Asia they can't get into the U.S. From Mexico etc. with a criminal record no problem _E_ "What you dream about is what you will do. If you cannot even dream of doing big things you will never do anything big in life." Think Big _E_ Our thoughts and prayers are with everyone in the path of California's wildfires. I encourage everyone to heed the advice and orders of local and state officials. THANK YOU to all First Responders for your incredible work! __HTTP__ _E_ Obama has not passed a single budget in 4 years. Democrats don't even vote them in Congress. He has failed to lead! _E_ RT @DonaldJTrumpJr: Great pic from a friend on @CBPflorida @CustomsBorder who have been helping with #harvey recovery and now with #irma. T... _E_ Thank you Governor @Mike_Pence!Lets MAKE AMERICA SAFE AND GREAT AGAIN with the American people. #AmericaFirst... __HTTP__ _E_ Leaving for New Hampshire now. Will be doing the @TODAYshow there live at 7:00 A.M. New @CBSNews Poll of New Hampshire: Trump 38 Carson 12! _E_ Huge crowd expected tomorrow night! VT Police say first come first serve. Arrive early! _E_ I have offered DACA a wonderful deal including a doubling in the number of recipients & a twelve year pathway to citizenship for two reasons: (1) Because the Republicans want to fix a long time terrible problem. (2) To show that Democrats do not want to solve DACA only use it! _E_ The #G20Summit was a wonderful success and carried out beautifully by Chancellor Angela Merkel. Thank you! _E_ RT @DRUDGE_REPORT: Fears of new terror attack after van 'mows down 20 people' on London Bridge... _E_ Together we dream of a Korea that is free a peninsula that is safe and families that are reunited once again! __HTTP__ _E_ A NEW ERA IN AMERICAN ENERGY! #MadeInTheUSAWatch here: __HTTP__ __HTTP__ _E_ Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight! 
_E_ #ICYMI: Governor @mike_pence and I were in Valley Forge Pennsylvania today. You can watch it here:... __HTTP__ _E_ Via @AP: "Donald Ivanka Trump say DC's Old Post Office Pavilion will be 1 of country's finest hotels" __HTTP__ _E_ Donald Trump bids to buy the Oreo Double Stuf Racing League. Check it out: __HTTP__ _E_ Do not go back into Iraq unless they agree in a signed formal instrument to give the U.S. 50% of their oil reserves.Make the deal dummies! _E_ What is the standard for which you want to be known? Identify that standard and follow it. _E_ Recently opened @TrumpToronto it's beautiful and here is a video of the ribbon cutting ceremony.. __HTTP__ _E_ Via @beforeitsnews: "WATCH: See How Trump Just Torched Obama Biden Kerry For Snubbing Paris Anti Terror March" __HTTP__ _E_ Crooked Hillary is wheeling out one of the least productive senators in the U.S. Senate goofy Elizabeth Warren who lied on heritage. _E_ Every day Mexico continues to hold Sgt. Tahmooressi is an insult to our country. _E_ Of course @hardball_chris attacked 'birthers' in praising @CondoleezzaRice's speech. Chris has completely lost it. _E_ Honor of a lifetime to meet His Holiness Pope Francis. I leave the Vatican more determined than ever to pursue PEAC... __HTTP__ _E_ Read about my @LibertyU speech in @jameshohmann's @politico Morning Score __HTTP__ _E_ "Don't find fault. Find a remedy." – Henry Ford _E_ ObamaCare premiums are going up up up just as I have been predicting for two years. ObamaCare is OWNED by the Democrats and it is a disaster. But do not worry. Even though the Dems want to Obstruct we will Repeal & Replace right after Tax Cuts! _E_ "Trump hails liberation of Raqqa as critical breakthrough in anti ISIS campaign" __HTTP__ _E_ I am pleased to inform you that I have just named General/Secretary John F Kelly as White House Chief of Staff. He is a Great American.... _E_ Big news just out NEW @CNN POLL TRUMP 39 and leads in every major category. Likeability way up. 
CRUZ 18 CARSON 10 RUBIO 10 _E_ 'Must Act Immediately': Clinton Charity Lawyer Told Execs They Were Breaking The Law __HTTP__ _E_ Watch me tonight at 9PM ET on @CNN full hour. @Piersmorgan won @ApprenticeNBC before taking over Larry King's slot should be interesting. _E_ .@BarbaraJWalters Barbara—get better fast & stay healthy forever. _E_ Despite all the statements to the contrary Obama's policies will increase taxes on everyone __HTTP__ Enjoy! _E_ Another great charity that the $5M could go to just a recommendation to the Pres. the Wounded Warriors represented so well by @TraceAdkins _E_ Crooked Hillary Clintons foreign interventions unleashed ISIS in Syria Iraq and Libya. She is reckless and dangerous! _E_ It's hard to believe that we are rationing gas in NYC. OPEC is laughing all the way to the bank. _E_ I will be asking for a major investigation into VOTER FRAUD including those registered to vote in two states those who are illegal and.... _E_ A big part of the country even the southern states is under massive attack from snow and freezing cold. Global warming anyone? _E_ "Image is important and speaks more than the words or fine print that goes along with the product." – Midas Touch _E_ Iran is rapidly taking over more and more of Iraq even after the U.S. has squandered three trillion dollars there. Obvious long ago! _E_ Last night's live show was so much fun. Congrats to the entire cast they are all winners! From beginning over $13 million for charity. _E_ Via @ConroeCourier by @StephenGreen91:"Trump talks 2016 run jobs at @TXPatriotsPAC" __HTTP__ _E_ ICYMI via @PageSix by @Mohris: "Donald Trump honored at Marine Corps charity gala" __HTTP__ _E_ Thank you #Biloxi #Mississippi! Remember this night & spread the word to get out & #VoteTrump2016! __HTTP__ _E_ Democrat Jon Ossoff would be a disaster in Congress. VERY weak on crime and illegal immigration bad for jobs and wants higher taxes. 
Say NO _E_ Staff at Trump Park Avenue disliked A Rod to put it mildly The staff at Trump World Tower loves Derek Jeter. _E_ Karl Rove is a total loser. Money given to him might as well be thrown down the drain. _E_ By the way if Russia was working so hard on the 2016 Election it all took place during the Obama Admin. Why didn't they stop them? _E_ Thanks. __HTTP__ _E_ Iran's threats are no excuse for the 9 month high price of oil. OPEC is ripping us off while @BarackObama watches. __HTTP__ _E_ Entrepreneurs: Success is good. Success with significance is even better. Work on what you will be proud to be associated with. _E_ My thoughts and prayers are with those affected by the tragic storms and tornadoes in the Southeastern United States. Stay safe! _E_ Entrepreneurs: Having a product requires something very important you have to think about the market. Do your due diligence. _E_ I just saw my new tie & shirt collection—it's fantastic—unbelievable look. Go to Macy's now to buy! _E_ China owes us money.... __HTTP__ #trumpvlog _E_ Now AP is banning the term illegal immigrants What should we call them? 'Americans'?! This country's political press is amazing! _E_ .@HillaryClinton's Careless Use Of A Secret Server Put National Security At Risk: __HTTP__ #VPDebate#BigLeagueTruth _E_ Just landed in D.C. __HTTP__ _E_ Glad to see no charges against Greg Kelly. His accusers' charges never made sense! _E_ Will go back on for a final question now! _E_ We are going to bring steel and manufacturing back to Indiana! _E_ Obamacare is far toooo expensive far toooo complicated (thousands of pages) and most importantly doesn't work. WE CAN DO MUCH BETTER! _E_ Be yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_ Why didn't the writer of the twelve year old article in People Magazine mention the incident in her story. Because it did not happen! _E_ Thank you for a wonderful evening in Washington D.C. 
#Inauguration __HTTP__ _E_ Leightweight @Lord Sugar virtually begged my reps to have me stop mocking him. Every time this dope goes on Apprentice I make money too easy _E_ Thank you Wilkes Barre Pennsylvania! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ More of your questions answered in today's video at __HTTP__ here is my appearance on Neil Cavuto __HTTP__ _E_ Churches in Texas should be entitled to reimbursement from FEMA Relief Funds for helping victims of Hurricane Harvey (just like others). _E_ .@Lord_Sugar If you think ugly windmills are good for Scotland you are an even worse businessman than I thought... _E_ Meeting with African American Pastors at Trump Tower was amazing. Wonderful news conference followed. Now off to Georgia for big speech! _E_ You have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_ Occupy Wall Street is at it again go out and get a job. It's actually easier work and far more rewarding. _E_ Performing live on the Miss Universe Pageant from the Mandalay Bay Resort & Casino will be Telemundo Orianthi John Legend and The Roots. _E_ South Korea must in some form pay for our help the U.S. must stop being stupid! _E_ ...and people like Ms. Heyer. Such a disgusting lie. He just can't forget his election trouncing.The people of South Carolina will remember! _E_ Dummy @mcuban made up a story about a visit to Mar a Lago last night on Leno. It never happened—I don't talk that way. _E_ RT @KellyannePolls: more media #polls showing @realDonaldTrump ahead in states Pres Obama won twice. __HTTP__ _E_ 9 million fewer people voted for Obama this election than last & yet the Republicans lost—do you think they might be doing something wrong! _E_ Receiving @AmericanCancer Lifetime Achievement Award & chairing @FollowLola debut @CarnegieHall on Jan.19 __HTTP__ _E_ I'm with you! I will work hard and never let you down. Make America Great Again! 
__HTTP__ _E_ .@JonahNRO You stated that I started "relentlessly tweeting like a 14 year old girl..." Horrible insult to women. Resign now or later! _E_ The S&P are losers. They did this for personal publicity in order to straighten out their terrible reputatio... (cont) __HTTP__ _E_ Will be on @megynkelly tonight at 9:00. WILL BE TALKING ABOUT EVERYTHING! _E_ Ebola patient Duncan lied on his exit papers by saying he never came into contact with a person with Ebola. He knew he did and person died. _E_ All the contestants have arrived to compete in Trump Miss Universe Pageant in Las Vegas. Today's welcoming ceremony will be terrific! _E_ .@MittRomney must ask for Obama's college records & applications why is he not doing this? _E_ If Americans understood just how many hidden government fees and taxes are absorbed into the prices of the (cont) __HTTP__ _E_ There is an incredible spirit of optimism sweeping the country right now—we're bringing back the JOBS! __HTTP__ _E_ Graydon Carter is laughing at the stupidity of Chuck Townsend on his contract renewal even he doesn't believe it! @CondeNastCorp _E_ The @nytimes states today that DJT believes more countries should acquire nuclear weapons. How dishonest are they. I never said this! _E_ Best Apprentice episode EVER tonight at 8:00. _E_ Experience knowledge and prescience are a formidable combination of powers. Do not underestimate them. Think Like a Champion _E_ NBC news is #FakeNews and more dishonest than even CNN. They are a disgrace to good reporting. No wonder their news ratings are way down! _E_ Thank you Bobby Bowden for the intro tonight and your support! I hope I can do as well for Florida as you have done! __HTTP__ _E_ .@_KatherineWebb with some of my memorabilia. __HTTP__ _E_ RT @EricTrump: Thank you to @GolfDigest for this incredible feature! Golfer in Chief @RealDonaldTrump __HTTP__ __HTTP__ _E_ Was there another loan that Ted Cruz FORGOT to file. Goldman Sachs owns him he will do anything they demand. 
Not much of a reformer! _E_ Rick Perry a good man a great family and a patriot. _E_ .@ForbesInspector 5 Star & @TripAdvisor #1 Luxury Hotel @TrumpToronto offers style luxury & impeccable service __HTTP__ _E_ Will be on Fox & Friends at 7.00 Enjoy! _E_ "Whether you realize it or not your brand can be many times more valuable than your business." – Midas Touch _E_ I'll be on @foxandfriends at 7:30 AM Monday _E_ .@CNN just doesn't get it and that's why their ratings are so low and getting worse. Boring anti Trump panelists mostly losers in life! _E_ Rick Perry failed at the border. Now he is critical of me. He needs a new pair of glasses to see the crimes committed by illegal immigrants. _E_ We _E_ RT @IvankaTrump: Thank you Angie Phillips for inviting me to tour your plant Middletown Tube Works. #Ohio __HTTP__ _E_ I will be speaking at the NRA event today in Nashville. Many friends will be there. _E_ I guess @BillMaher saw my ratings on the @Late_Show the other night where Letterman beat Leno. Bill you are no Letterman. _E_ From my first day in office we've taken swift action to lift the crushing restrictions on American energy. Remarks... __HTTP__ _E_ I explained to the President of China that a trade deal with the U.S. will be far better for them if they solve the North Korean problem! _E_ If Scotland would have gone independent predicated on $100 $150 oil they would now be bust! _E_ "ACU ANNOUNCES DONALD TRUMP TO ADDRESS CPAC 2013" __HTTP__ via @CPACnews _E_ Briarcliff Manor should get a better town manager. Philip Zegarelli has no clue—bad roads a total puppet of the mayor? @westchestergov _E_ Amazing that Crooked Hillary can do a hit ad on me concerning women when her husband was the WORST abuser of woman in U.S. political history _E_ Will be signing the biggest ever Tax Cut and Reform Bill in 30 minutes in Oval Office. Will also be signing a much needed 4 billion dollar missile defense bill. 
_E_ Autism Speaks' Bob and Suzanne Wright will address the Pontifical Council on Health Care Workers at the Vatican in Rome. November 20 22 _E_ Marco Rubio is totally weak on illegal immigration & in favor of easy amnesty. A lightweight choker bad for #USA! _E_ Via @WashTimes by @dsherfinski __HTTP__ _E_ President Obama strongly considering a plan to bring non U.S. citizens with Ebola to the United States for treatment. Now I know he's nuts! _E_ RT @DanScavino: 'Trump as Commander in Chief Making the Hard Decisions' by LTG (Ret) Kellogg a highly decorated Vietnam War Vet: __HTTP__ _E_ Honolulu's best @TrumpWaikiki features a dozen distinct tropically decorated Hawaii hotel rooms and suite layouts __HTTP__ _E_ The Tea Party delivered the House for @GOP so they could be fiscally responsible. Instead they have been irresponsible! _E_ .@JebBush had a tiny 300 person crowd at Senator Tim Scott's forum. I had thousands and they had real passion! __HTTP__ _E_ Thank you Arkansas! #Trump2016#SuperTuesday _E_ I will take full credit for Mitt Romney dropping out of the race—looks like he won't be endorsing Trump any time soon. _E_ RT @foxandfriends: FOX NEWS ALERT: U.S. flexes its defense muscles destroys incoming test missile off coast of Alaska __HTTP__ _E_ Via @todayshow: Trump: Attorney general behind lawsuit a 'total lightweight' __HTTP__ _E_ Join me in Colorado Springs Colorado tomorrow at 1:00pm! #MAGA Tickets: __HTTP__ _E_ Happy Lá Fheile Phadraig to all of my great Irish friends! _E_ Will CNN send its cameras to the border to show the massive unreported crisis now unfolding or are they worried it will hurt Hillary? _E_ How can General Martin Dempsey tell Obama that delaying the Syria bombardment will have no consequences? He is no Patton or MacArthur. _E_ Great to see @RandPaul looking well and back on the Senate floor. He will help us with TAX CUTS and REFORM! _E_ Thank you America! 
#Trump2016 __HTTP__ __HTTP__ _E_ Tune in at __HTTP__ and get the word out #BigLeagueTruth #Debate Help us spread the TRUTH stop the... __HTTP__ _E_ While Bernie has totally given up on his fight for the people we welcome all voters who want a better future for our workers. _E_ Sally Yates made the fake media extremely unhappy today she said nothing but old news! _E_ If only the morons @AP were as concerned with Obama's inconsistent statements on the Embassy attacks as they are (cont) __HTTP__ _E_ #AmericaFirst! __HTTP__ _E_ Thank you to Joe Passov (Travelin' Joe) of Golf Magazine for the great article... __HTTP__ __HTTP__ _E_ Philly FOP Chief On Presidential Endorsement: Clinton 'Blew The Police Off' __HTTP__ _E_ Can't wait to meet patriotic small business owners next week in Sarasota and Tampa! Hey @BarackObama We Did Build It! _E_ So what did you think of my decision? What would you have done? #CelebApprentice _E_ Happy Birthday @IvankaTrump! You are an amazing daughter! _E_ Our views trump the rest for the #Thanksgiving #MacysParade. Stay @TrumpNewYork for exclusive parade access __HTTP__ _E_ See Charles Gasparino's article in today's NYPost about Eric Schneiderman's witch hunt against Republicans __HTTP__ _E_ Time for Sebelius to be fired. She has admitted that the Administration did not vet the ObamaCare website __HTTP__ _E_ The Democrats sent a very political and long response memo which they knew because of sources and methods (and more) would have to be heavily redacted whereupon they would blame the White House for lack of transparency. Told them to re do and send back in proper form! _E_ My ties and shirts are doing very big numbers @Macy's beyond my wildest thoughts! Thanks @GoAngelo and the rest of the losers for mentions! _E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ Syrian ceasefire seems to be holding. Many lives can be saved. Came out of meeting. Good! 
_E_ Thank you to Matt Boyle @BreitbartNews for analytical & well written piece on sleazebag blogger @mckaycoppins & irrelevant @BuzzFeed _E_ 27 days until America's greatest test since our founding. In this election we decide whether we become great again. _E_ ...didn't do it so now we have a big deal with Dems holding them up (as usual) on Debt Ceiling approval. Could have been so easy now a mess! _E_ I will be working late into the evening closing a big real estate deal—soon to be announced. Happy Easter and/or Holiday to all. _E_ Let Pete Rose into the Hall of Fame now 35 years is enough! _E_ Governor John Kasich of the GREAT GREAT GREAT State of Ohio called to congratulate me on the win. The people of Ohio were incredible! _E_ Via @limbaugh: "Trump Doubled Down and It Worked" __HTTP__ _E_ You've got something unique to offer find out what it is. Ask yourself: What can I provide that does not yet exist? Innovation can follow.. _E_ Welcome to the new ObamaCare reality – Doctor spent 2 hours on hold w/insurance company to get approval for surgery __HTTP__ _E_ Capital isn't scarce vision is. Sam Walton _E_ I wonder how @JoeBiden feels after last night's love fest between Obama and Hilary on @60Minutes. Can't be too happy. _E_ .@MileyCyrus – don't worry about Liam. You can do much better and you have plenty of time—remain strong! _E_ .@BrandenRoderick returns in All Star @ApprenticeNBC 2001 Playmate of the Year is a determined competitor. She is terrific! _E_ I will be interviewed at 7:00 A.M. on @foxandfriends Enjoy. _E_ My new book #TimeToGetTough is the best present of the holiday season. A great gift for anyone who cares about this country. _E_ Will be interviewed by @ainsleyearhardt on @foxandfriends Enjoy! _E_ Thank you. __HTTP__ _E_ .@DeeSnider @StephenBaldwin7 and the rest of your favorites are back! All Star @ApprenticeNBC premieres Sunday... __HTTP__ _E_ Jeb's policies in Florida helped lead to its almost total collapse. 
Right after he left he went to work for Lehman Brothers—wow! _E_ Be sure to watch my interview on @Gretawire tonight! _E_ RT @Carl_C_Icahn: 2/2 How many of our presidents even our great presidents would have handled the antics that went on in that auditorium... _E_ "Intellectuals solve problems geniuses prevent them." – Albert Einstein _E_ I laugh when I see Marco Rubio and Jeb Bush pretending to love each other with each talking of their great friendship. Typical phony pols _E_ I received such a nice letter today from someone who took refuge in Trump Tower during Sandy. It was my pleasure to help. _E_ ... Also if they're at home who the hell knows what they're doing (a second job maybe). _E_ No matter what happens in the election @davidaxelrod deserves a lot of credit. He has kept Obama in it even with his terrible record. _E_ Maybe Derek Jeter should ask A Rod about renting his apartment next year. Very soon A Rod won't need a place in NYC. _E_ UL has lost all credibility under Joe McQuaid w circulation dropping to record lows. They aren't worthy of representing the great people NH. _E_ Met with @RepCummings today at the @WhiteHouse. Great discussion! _E_ Two years ago I told everybody to start looking & buying houses—I hope you listened! (but there is still time). _E_ HAPPY THANKSGIVING your Country is starting to do really well. Jobs coming back highest Stock Market EVER Military getting really strong we will build the WALL V.A. taking care of our Vets great Supreme Court Justice RECORD CUT IN REGS lowest unemployment in 17 years....! _E_ For the sake of transparency @BarackObama should release all his college applications and transcripts both from Occidental and Columbia. _E_ #ArizonaPrimary message from @IvankaTrump! #AZPrimary #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ This is a terrible deal for the country and an embarrassment for Republicans! _E_ Crooked Hillary has been fighting ISIS or whatever she has been doing for years. Now she has new ideas. 
It is time for change. _E_ New poll WOW 53% say President Obama is not honest & trustworthy. What took them so long. Go back and look at his house purchase in Chicago _E_ Lots of people are asking whether or not I should have run for President—stay tuned for the answer. _E_ Do you think I will get credit for keeping Ford in U.S. Who cares my supporters know the truth. Think what can be done as president! _E_ This will be the biggest TAX CUT in the history of our country and we need it! #TaxReform Read more: __HTTP__ __HTTP__ _E_ "Sure the home field is an advantage but so is having a lot of talent." @DanMarino _E_ Wow Trump International Hotel & Tower Toronto was just ranked #1 out of 138 hotels in Toronto! @TrumpToronto _E_ The Clintons spend millions on negative ads on me & I can't tell the truth about her husband? Don't feel sorry for crooked Hillary! _E_ My thoughts on The O'Reilly Factor and more here... __HTTP__ _E_ Very proud to announce that Mar a Lago was awarded top Historic building in the state by the illustrious (cont) __HTTP__ _E_ A letter written to one of my many critics! __HTTP__ _E_ N.Y.C. has the worst Mayor in the United States. I hate watching what is happening with the dirty streets the homeless and crime! Disgrace _E_ While Jon Stewart is a joke not very bright and totally overrated some losers and haters will miss him and his dumb clown humor. Too bad! _E_ Join me in Phoenix Arizona tomorrow at 4pm! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_ So nice thank you! __HTTP__ _E_ Thousands of fans have been sending letters to Trump Tower in anticipation of @CelebApprentice. Really good show. _E_ Morning Consult poll: Trump Leads __HTTP__ _E_ #trumpvlog My thoughts on the State of the Union address Apple and a great @WSJ article.... 
__HTTP__ _E_ Outrageous @BarackObama has spent over $2.7B on implementing @ObamaCare since the oral arguments at SCOTUS __HTTP__ _E_ You mean the fact that my father left me some money (as a good father will) and I multiplied it many many times to over $10 billion is bad? _E_ Don't be fooled. In 2008 @BarackObama promised immigration reform in his 1st yr of his 1st term. Now promising (cont) __HTTP__ _E_ I really like Chelsea Clinton an amazing young woman. She got the best of both parents. (@IvankaTrump agrees) _E_ Trade between China and North Korea grew almost 40% in the first quarter. So much for China working with us but we had to give it a try! _E_ When Obama took office in 2009 employer provided premiums cost $13375. Today they are $18142. Thanks Obama. _E_ Watch my interview with @ericbolling on @FoxNews today at 11:30AM ET _E_ #sweepstweet @johnrich and @marleematlin were on #CelebrityApprentice—and they're back! _E_ Check out the last webisode in our 3 part series featuring me with Serta. Which one was your favorite? www.youtube.com/user/mattressserta _E_ Remember Republicans are 5 0 in Congressional Races this year. The media refuses to mention this. I said Gillespie and Moore would lose (for very different reasons) and they did. I also predicted "I" would win. Republicans will do well in 2018 very well! @foxandfriends _E_ The Carson story is either a total fabrication or if true even worse trying to hit mother over the head with a hammer or stabbing friend! _E_ I'm glad President Obama followed my lead and lowered the flags half staff. It's about time! _E_ RE: FB Vanity URLs: SF Chronicle David Beckham was one of the first along with Britney Spears & Donald Trump. __HTTP__ _E_ __HTTP__ _E_ FoxNewsInsider with comments on my speech at CPAC in Washington DC __HTTP__ _E_ It was a great honor to be with President @EmmanuelMacron of France this afternoon with his delegation. Great bilateral meeting! 
#UNGA __HTTP__ _E_ Join me in Cedar Rapids Iowa tomorrow at 7:00pm! #MAGA __HTTP__ __HTTP__ _E_ Great poll numbers out of @UMassAmherst. Thank you! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Strong debate by @PerdueSenate. No question he won. We need more business leaders with bold vision to fix Washington. #GASen _E_ Had dinner this week at @MEGUNYC (at Trump World Tower) opposite the United Nations—fantastic food! 212.964.7777 _E_ Massive crowds expected in Mississippi tomorrow night. Look forward to it! 2015 IN PHOTOS: __HTTP__ __HTTP__ _E_ Obama once again just missed a self imposed deadline with Iran. Our leadership is weak & ineffective. Double the sanctions! _E_ Congrats @LindseyGrahamSC. You just got 4 points in your home state of SC—far better than zero nationally. You're only 26 pts behind me. _E_ #CrookedHillary = Obama's third term which would be terrible news for our economic growth seen below. __HTTP__ _E_ Just arrived in Syracuse NY. Big crowd great place! We will bring back the desperately needed jobs. #NYPrimary __HTTP__ _E_ You get what you vote for. 21% of small business owners planning to cut their workforce in 2013 __HTTP__ _E_ Watched Crooked Hillary Clinton and Tim Kaine on 60 Minutes. No way they are going to fix America's problems. ISIS & all others laughing! _E_ You have all been waiting the response has been amazing! Watch my announcement now press release to follow at 12:15. __HTTP__ _E_ Spoke with Governor @PatMcCroryNC of North Carolina today. He is doing a tremendous job under tough circumstances. _E_ My thoughts on last night's Celebrity Apprentice __HTTP__ also an observation I made recently __HTTP__ _E_ Today I hosted an immigration roundtable ahead of two votes taking place in Congress tomorrow. Watch and read more... __HTTP__ _E_ Obamacare will bankrupt our country and lead to socialized medicine. We must all focus now on electing @MittRomney this November. 
_E_ Don't talk about Rolling Stone Magazine but most importantly don't buy it. This degenerate killed and maimed so many wonderful people! _E_ "Home Prices Reach New All Time Highs in August" Read more: __HTTP__ __HTTP__ _E_ Which team is your favorite? _E_ .@jamieaydt Happy Birthday Jamie! _E_ On my way to Cedar Falls Iowa now. Will be great I love the people of Iowa! _E_ Remember Russia still has Snowden. When are we going to bring that piece of human garbage back home to stand trial? He caused great damage! _E_ .@TigerWoods is playing like his old self in the Farmers Insurance Open. He will have a great year. _E_ ...We cannot keep FEMA the Military & the First Responders who have been amazing (under the most difficult circumstances) in P.R. forever! _E_ Republicans seem intent on negotiating against themselves. Many senior Senators are doing Obama's bidding. Can't win this way. _E_ Really dumb @CheriJacobus. Begged my people for a job. Turned her down twice and she went hostile. Major loser zero credibility! _E_ The #SOTU speech is really boring slow lethargic very hard to watch! _E_ Tomorrow!Las Vegas NV 11a: __HTTP__ CO 4p: __HTTP__ NM 7p: __HTTP__ _E_ Great to see Tony La Russa manage one last game last night. Congratulations to the National League on winning the @MLB All Star Game. _E_ China is worried. The polls are trending for @MittRomney. They won't be able to steal from us anymore. _E_ Deportations are "plummeting" __HTTP__ while Obama continues to grant amnesty. _E_ Great reporting by @foxandfriends and so many others. Thank you! _E_ It's record cold all over the country and world where the hell is global warming we need some fast! _E_ Unbelievable how he gets away with it: @BarackObama is flying around on Air Force One laughing at everybod... (cont) __HTTP__ _E_ July U.S. construction had biggest drop in 12months. Bad indicator on economic numbers for rest of the year. 
_E_ I am so honored by all the great NY State Repubs who came to my office called & wrote for me to run for Governor. If I do I will win. _E_ Congress must defund ObamaCare. It is destroying Medicare and breaking promises to our Seniors including veterans. _E_ Via @dallasnews' @neighborsgo by Heather Noel: Shelton School graduate receives handwritten note from Donald Trump __HTTP__ _E_ ATTN: @HillaryClinton Why did five of your staffers need FBI IMMUNITY?! #BigLeagueTruth #Debates _E_ .@TrumpChicago's river lake and skyline views in each of its deluxe 5 Star guestrooms __HTTP__ _E_ Thank you for all of the nice compliments and reviews on the State of the Union speech. 45.6 million people watched the highest number in history. @FoxNews beat every other Network for the first time ever with 11.7 million people tuning in. Delivered from the heart! _E_ If I were president Sgt. Andrew Tahmooressi would be let out of jail with one phone call. If notMexico would pay a price like never before! _E_ Morning Joe Panel is stealing many of my statements and ideas to better America without giving credit the story of my life! _E_ Obama & the Democrats want this shutdown. They think it helps their electoral prospects for 2014. Don't believe! _E_ Anybody who watched all of Ted Cruz's far too long rambling overly flamboyant speech last nite would say that was his Howard Dean moment! _E_ Via @wbtwnews13 by @elizabethk_wbtw: "Donald Trump will deliver keynote address to the SC Tea Party Convention" __HTTP__ _E_ Our athletes in the Olympics are proving once again to be the greatest competitors in the world. Makes us proud to be Americans. _E_ I will be releasing the full interview with a guy named Baxter @antbaxter only to show the bias and stupidity of him and @BBCWorld. Clowns! _E_ .@LaToyaJackson & @Omarosa are not likely to become friends –ever! #CelebApprentice _E_ Sorry for such silence—spent weekend at closing of Ritz Carlton in Jupiter Florida—just bought it will be great! 
_E_ So how and why are they so sure about hacking if they never even requested an examination of the computer servers? What is going on? _E_ Honoring the men and women who made the ultimate sacrifice in service to America. Home of the free because of the brave. #MemorialDay _E_ ...Based on that the Military has hit ISIS much harder over the last two days. They will pay a big price for every attack on us! _E_ I am thrilled to nominate Dr. @RealBenCarson as our next Secretary of the US Dept. of Housing and Urban Development... __HTTP__ _E_ Senator @lisamurkowski of the Great State of Alaska really let the Republicans and our country down yesterday. Too bad! _E_ Congratulations to @ScottKWalker of Wisconsin a great victory. A smart and tough guy. Great going. _E_ A general is just as good or just as bad as the troops under his command make him. General Douglas MacArthur _E_ "Take calculated risks. That is quite different from being rash." George S. Patton _E_ New Reuters Poll just came out and has me at 32% highest number yet.The silent majority is back and we will MAKE AMERICA GREAT AGAIN! _E_ Pervert alert serial sexter @repweiner is polling to test the waters for NYC political run. __HTTP__ _E_ RT @FLOTUS: Had a wonderful visit from @JBA_NAFW children today at the @whitehouse! #WhiteHouseChristmas __HTTP__ _E_ The rescue icebreaker trying to free the ship of the GLOBAL WARMING scientists has turned back the ice is massive (a record). IRONIC! _E_ A government shutdown will be devastating to our military...something the Dems care very little about! _E_ No surprise Saudis turned down spot on UN Security Council. They don't want responsibility. Just have us do their heavy lifting. _E_ The brand new Blue Monster Golf Course at Trump National Doral is doing fantastic business. Also the new driving range is open at night! _E_ This will be a big week for Infrastructure. 
After so stupidly spending $7 trillion in the Middle East it is now time to start investing in OUR Country! _E_ With the run on our dollar about to take place commodity prices will rise. Gold silver & timber will spike also certain real estate. _E_ Someone should look into who paid for the small organized rallies yesterday. The election is over! _E_ I am on my way! See you all soon! __HTTP__ _E_ Wow great news from Wisconsin. Just made two speeches there with a big one coming tonight. Thank you! __HTTP__ _E_ RT @SpeakerRyan: For individuals and families the final Tax Cuts & Jobs Act:✔lowers individual taxes✔nearly doubles the standard deducti... _E_ I had a great time answering your questions in the latest #AskTheDonald. Watch and see if your question made it in __HTTP__ _E_ Congrats to Roger Clemens he showed great courage. This case never should have been brought to trial. Andy Pettitte did the right thing. _E_ Press conference at the opening of the @GaryPlayer Villa at @TrumpDoral . __HTTP__ _E_ .@FoxNews is MUCH more important in the United States than CNN but outside of the U.S. CNN International is still a major source of (Fake) news and they represent our Nation to the WORLD very poorly. The outside world does not see the truth from them! _E_ Interesting...the last time a Democrat succeeded a two term Democratic pres. was in 1836 when Martin Van Buren succeeded Andrew Jackson. _E_ Thank you Louisiana! #Trump2016 __HTTP__ _E_ People having a great time in the Trump Tower atrium unlike others I stayed open. __HTTP__ _E_ What has happened in Orlando is just the beginning. Our leadership is weak and ineffective. I called it and asked for the ban. Must be tough _E_ The meeting with Republican Senators yesterday outside of Flake and Corker was a love fest with standing ovations and great ideas for USA! _E_ Not so smart after all ... Man with name on Duke law library must pay me legal fees after Trump trial victory. 
_E_ I'm watching Knicks game I'd bet all of those guys with the terrible tattoos wish they never got them too bad too late! _E_ My @gretawire interview discussing @BarackObama's economic failures attack on capitalism and playing class warfare. __HTTP__ _E_ Statement on John McCain __HTTP__ _E_ .... Do I get the credit for this? Thank you! __HTTP__ _E_ I agreed to take the worst spot at CPAC because nobody else wanted it and it was the only time I could be there it was great fun! _E_ Our country needs to reestablish the work ethic. In NY welfare pays better than jobs __HTTP__ Zero incentive. _E_ To all @MittRomney supporters make sure you have taken advantage of early voting now so you can GOTV on election day. _E_ The UK is seriously thinking about halting wind turbine subsidies. Good news killing country. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ My thoughts are with all those observing Yom Kippur the holiest day of the Jewish year. __HTTP__ _E_ Join me LIVE on my Facebook page in St. Augustine Florida! Lets #DrainTheSwamp & MAKE AMERICA GREAT AGAIN!... __HTTP__ _E_ There will be NO change to your 401(k). This has always been a great and popular middle class tax break that works and it stays! _E_ RT @WhiteHouse: President Trump proclaims today as #WorldAIDSDay: __HTTP__ __HTTP__ _E_ WHO IS GOING TO GET IRAQ'S OIL??????? _E_ Together we will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ We are making great progress with healthcare. ObamaCare is imploding and will only get worse. Republicans coming together to get job done! _E_ It was my great honor to join our wonderful Veterans at AMVETS Post 44 in Youngstown Ohio this evening. A grateful nation salutes you! __HTTP__ _E_ #WeeklyAddress __HTTP__ __HTTP__ _E_ .@NicolleDWallace Your father is a brilliant man with wonderful sense therefore you must be good! _E_ Looking forward to receiving 2015 Statesman of the Year award tonight by @SRQRepublicans. 
A record 2000+ sell out __HTTP__ _E_ My @foxandfriends interview discussing @BarackObama's #WHCD lowering tax rates Republic of Georgia & (cont) __HTTP__ _E_ .@DJohnsonPGA We are so proud of you Dustin. Your reaction under pressure was amazing. First of many Majors. You are a true CHAMPION! _E_ I will be on @foxandfriends at 7.00 45 minutes. Talking about Ebola Obama and other strange U.S. happenings! _E_ My @gretawire interview discussing @MittRomney debate responses Obama's hidden records my tweets and unemployment __HTTP__ _E_ Is President Obama going to finally mention the words radical Islamic terrorism? If he doesn't he should immediately resign in disgrace! _E_ Does anybody really believe that a reporter who nobody ever heard of went to his mailbox and found my tax returns? @NBCNews FAKE NEWS! _E_ .@IsraeliPM @netanyahu delivered an excellent speech yesterday at the UN. Too bad @AmbassadorRice wasn't there. _E_ My @SquawkCNBC interview discussing today's primary contests @MittRomney's lead and my stock picks __HTTP__ _E_ I will be the best by far in fighting terror. I'm the only one that was right from the beginning & now Lyin' Ted & others are copying me. _E_ Watched @davidaxelrod on @oreillyfactor and the dog hit me even after I made a big contribution to his charity. I never went bankrupt! _E_ I really enjoy doing @foxandfriends every Monday at 7 AM. @sdoocy @ehasselbeck and @kilmeade are great people. _E_ "Manufacturing Optimism Rose to Another All Time High in the Latest @ShopFloorNAM Outlook Survey" __HTTP__ _E_ Very nice article from Daily Mail __HTTP__ _E_ The statement about leaving the base came directly from CBS Evening News. _E_ Gallup poll proves that @BarackObama's regulation and Obamacare are stopping small business owners from hiring __HTTP__ SHOCK! _E_ #MakeAmericaGreatAgain __HTTP__ _E_ Then ask: What am I pretending not to see? These two simple questions can pave the way for some very clear answers. 
_E_ The 9.11.12 attack on the Benghazi consulate was a sophisticated multi prong wave attack. When will all the 50+ fighters face justice? _E_ Lyin' Ted Cruz can't win with the voters so he has to sell himself to the bosses I am millions of VOTES ahead! Hillary would destroy him & K _E_ "Never judge someone by their job title. You'd be surprised at the talents people can have." – Midas Touch _E_ The plane I saw on television was the hostage plane in Geneva Switzerland not the plane carrying $400 million in cash going to Iran! _E_ Whatever you are doing right now make sure to stop for a minute focus and ask yourself "Am I thinking BIG?" _E_ .@David_Cameron Why do you give Scotland so much money to destroy their magnificent land with wind turbines causing massive taxes & E bills _E_ We can make Washington work for us. It's time for real leadership. Let's #MakeAmericaGreatAgain! __HTTP__ _E_ Iran has been formally PUT ON NOTICE for firing a ballistic missile.Should have been thankful for the terrible deal the U.S. made with them! _E_ How will Mitt Romney defend his record on jobs and Romneycare in tonight's debate? _E_ Thank you Las Vegas Nevada I love you! Departing for Greeley Colorado now. Get out & VOTE! #ICYMI watch here:... __HTTP__ _E_ My @foxandfriends interview re: @SuperBowl blackout @BobbyJindal's stupid comment & suing @billmaher f/$5M __HTTP__ _E_ Obama is trying to block sequester layoff notices in Virginia __HTTP__ Another example of sleazy politics! _E_ As dishonest as @RollingStone is I say @HuffingtonPost is worse. Neither has much money sue them and put them out of business! _E_ President Obama's literary agent (in 1991) promoted a book about the first African American president of the (cont) __HTTP__ _E_ Entrepreneurs – be tough resolute & trustworthy. The most crucial time to build your reputation is when you start making deals. _E_ The NFL has decided that it will not force players to stand for the playing of our National Anthem. 
Total disrespect for our great country! _E_ Countries charge U.S. companies taxes or tariffs while the U.S. charges them nothing or little.We should charge them SAME as they charge us! _E_ The lights went out at the White House today __HTTP__ Symbolic of the Obama presidency. _E_ I wonder why somebody doesn't do something about the clowns @politico and their totally dishonest reporting. _E_ Press release. Video response to follow. __HTTP__ _E_ The banks were bailed out by us. They should start lending to private entrepreneurs. The banks are slowing American growth. _E_ Unlike the other Republican candidates I will be in Nevada all day and night I won't be fleeing in and out. I love & invest in Nevada! _E_ Will be in Bangor Maine today! Join me 4pmE at the Cross Insurance Center! __HTTP__ __HTTP__ _E_ I hope the Republicans are happy. Just as I predicted that stupid deal they voted for only whetted Obama's appetite for more taxes. _E_ "President Trump?" __HTTP__ via @MiamiHerald by Wayne E. Williams _E_ Via @HorsetalkNZ: "Florida's Trump Invitational to kick off showjumping year" __HTTP__ Mar a Lago's 3rd ann. Trump Grand Prix! _E_ The Government spends 30% more than it admits __HTTP__ @BarackObama is out of control with his deficit spending. _E_ Obama administration fails to screen Syrian refugees' social media accounts: __HTTP__ _E_ Via @DailyCaller: Trump on Obama and Congress: 'Lock them up' in a room like Vatican conclave __HTTP__ by @NicholasBallasy _E_ Very tacky set! _E_ With Boston terrorist cell widening in suspects it's now clear that it was a mistake to read the bomber the Miranda warning so early. _E_ I love the Lakers and when you love the Lakers you want them to win so badly that you will work tirelessly. Dr. Jerry Buss _E_ #JointSession #MAGA __HTTP__ _E_ With taxes set to go up and Obama about to cut the mortgage deduction now is the time to buy a house if you can. Can get a great deal. 
_E_ Does anyone know that Crooked Hillary who tried so hard was unable to pass the Bar Exams in Washington D.C. She was forced to go elsewhere _E_ Bill Clinton has been Obama's most effective surrogate out on the trail. _E_ It doesn't cost any money to think bigger. The Art of the Deal _E_ "Did you know that with the natural gas reserves we have in the United States we could power America's (cont) __HTTP__ _E_ Via @MarketWatch: "@TrumpSoHo New York Unveils $50 Million Presidential Penthouse" __HTTP__ _E_ Thank you. __HTTP__ _E_ .@redstate I miss you all and thanks for all of your support. Political correctness is killing our country. weakness. _E_ RT @ricardorossello: Briefed @POTUS @realDonaldTrump in #SituationRoom and thanked him for his leadership quick response & commitment to o... _E_ FEMA & First Responders are doing a GREAT job in Puerto Rico. Massive food & water delivered. Docks & electric grid dead. Locals trying.... _E_ RT @ScottAdamsSays: Trump's speech today is the best persuasion I have ever seen. Game over. Now running unopposed: __HTTP__ _E_ .@mkhammer a Fox contributor isn't smart enough to know what is going on at the border. @TheJuanWilliams made the point far better! _E_ Such a great honor to be the Republican Nominee for President of the United States. I will work hard and never let you down! AMERICA FIRST! _E_ RT @EricTrump: Who has voted today??? Feedback from the polls? I'm like a kid on Christmas! #SuperTuesday #MakeAmericaGreatAgain __HTTP__ _E_ .@timkaine is wrong for defense: __HTTP__ #BigLeagueTruth #VPDebate _E_ Leaving for Mobile Alabama right now can't be late! _E_ It is going to be a long and tough road to turn around CNN they are looking at the wrong people! 
_E_ Great read by @VDHanson: "Mexico's Hypocrisy Is Evident In Its Own Strict Policy Toward Immigrants" __HTTP__ _E_ Via @DailyCaller by @rpollockDC: "NYC Mayor Action Against Donald Trump Is 'Not the American Way'" __HTTP__ _E_ Do you believe Obama just said that America would be less safe with a travel ban from West Africa.This is the thinking of a total mad man! _E_ The Solyndra Scandal @BarackObama's $500Million photo op. He loves wasting our money. _E_ If you love it own it. @TrumpCondosLV bring unparalleled style elegance and world class amenities to Las Vegas __HTTP__ _E_ Join me at 2:00pmEST today live from Trump Tower via Facebook & Periscope! __HTTP__ _E_ The Boston Bomber got immediate emergency surgery for a gunshot yet our vets die on waiting lines at the VA. We must do better! _E_ What team would you choose to win? #CelebApprentice _E_ .@AllenWest Great seeing you last night at record setting Mar a Lago Republican event. The crowd loved you! _E_ Presidential Memorandum for the @CommerceGov @SecretaryRoss re: Aluminum Imports and Threats to National Security:... __HTTP__ _E_ Thank you Minnesota! It is time to #DrainTheSwamp & #MAGA! #ICYMI watch: __HTTP__ __HTTP__ _E_ Wow Lyin' Ted Cruz really went wacko today. Made all sorts of crazy charges. Can't function under pressure not very presidential. Sad! _E_ I feel sure that my friend @RandPaul will come along with the new and great health care program because he knows Obamacare is a disaster! _E_ I strongly pressed President Putin twice about Russian meddling in our election. He vehemently denied it. I've already given my opinion..... _E_ I will be doing Fox & Friends tomorrow morning at 7.00. Will be discussing all sorts of current disasters! _E_ Wow @CNN ratings are up 75% because it's all Trump all the time. The networks are making a fortune off of me! MAKE AMERICA GREAT AGAIN! _E_ Al Qaeda taking over Libya after we made it possible really amazing. 
_E_ Obama has unilaterally & unconstitutionally drawn 4 ObamaCare exemptions for his friends. All @GOP wants is (cont) __HTTP__ _E_ RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_ Thank you Marco I agree! __HTTP__ _E_ In 2008 @BarackObama campaigned against $3.50 gas __HTTP__ It is now $6 in Florida and on the rise. He is a disaster! _E_ Happy Canada Day to all of the great people of Canada and to your Prime Minister and my new found friend @JustinTrudeau. #Canada150 _E_ via Bloomberg: Fox News Couldn't Kill Trump's Momentum Made Him Stronger @FoxNews @business __HTTP__ _E_ If the Republicans need a chief negotiator I am always available or can recommend some really good ones! _E_ .@BillRancic Bill fantastic job this morning on @foxandfriends you are a total winner and I am proud of you as first Apprentice CHAMP! _E_ Isn't it sad the way Putin is toying with Obama regarding Snowden. We look weak and pathetic. Could not happen with.a strong leader! _E_ Great players in sports make the game fun to watch. @DerekJeter has continued to impress with another amazing season. Absolute professional. _E_ Our ally Canada wants to send their oil down south to us. @BarackObama is forcing Canada to send it west to China. _E_ Sorry folks I'm just not a fan of sharks and don't worry they will be around long after we are gone. _E_ I will be interviewed by Chris Wallace at 2:00 P.M. on @FoxNews Turn off the football for 15 minutes Make America Great Again! _E_ Via @pbpost: "Faldo calls team up for golf course with Trump 'entertaining'" __HTTP__ _E_ Great job on the Larry King Live Gulf Telethon last night $1.3 million was raised in 2 hours. _E_ I wish everyone including the haters and losers a very happy Easter! _E_ RT @TheFive: Media bias is not just about what they report it's also about what they don't report. @jessebwatters #thefive _E_ We need to fix our broken education system! #StopCommonCore #MakeAmericaGreatAgain Video: __HTTP__ __HTTP__ _E_ WOW great new poll New York! 
Thank you for your support! #Trump2016#NewYorkValues __HTTP__ __HTTP__ _E_ If dummy Bill Kristol actually does get a spoiler to run as an Independent say good bye to the Supreme Court! _E_ The national debt continues to rise at record levels and today @BarackObama is in Disney World. He lives in a fantasy. _E_ Photo from yesterday's USGA announcement that Trump National Golf Club Bedminster will host the 2017 U.S. Women's Open __HTTP__ _E_ The special interests and people who control our politicians (puppets) are spending $25 million on misleading and fraudulent T.V. ads on me. _E_ Not only is @Toure a racist (and boring) he's a really dumb guy! _E_ Join me in San Jose California tomorrow evening at 7pm! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Thanks for your support! __HTTP__ _E_ I hear the very ungrateful @ArsenioHall has a show that is absolutely dying in the ratings. Really too bad! _E_ Best wishes to the Republic of Korea on hosting the @Olympics! What a wonderful opportunity to show everyone that you are a truly GREAT NATION! __HTTP__ _E_ I don't know Putin have no deals in Russia and the haters are going crazy yet Obama can make a deal with Iran #1in terror no problem! _E_ .@ArsenioHall just got "fired"—the people spoke ratings were terrible. The Apprentice brought him back from the dead but he blew it! _E_ Great conversations with President @EmmanuelMacron and his representatives on trade military and security. _E_ Just out the new nationwide @FoxNews poll has me alone in 2nd place closely behind Jeb Bush but Bush will NEVER Make America Great Again! _E_ Hillary's 33000 deleted emails about her daughter's wedding. That's a lot of wedding emails. #debate _E_ I can't believe he would choose @OMAROSA as his first choice! She is hard to handle. #CelebApprentice _E_ Crowd is booing the hell out of that phony decision place is angry and going wild. Fight was not even close! DISGUSTING. 
_E_ Unfortunately@BarackObama's continued attack on the US $ will lead to ever rising gas prices at the pump and lots of other really bad things _E_ Gas prices are up 30 cents this month rising 21 days in a row __HTTP__ Don't worry @BarackObama has a solution ALGAE! _E_ Heading to Pennsylvania for a big rally tonight. We will MAKE AMERICA GREAT AGAIN! _E_ Middle Eastern countries must participate militarily (no running away) and big league financially in order for us to go in and save them! _E_ RT @TeamTrump: ONLY @realDonaldTrump will end what even @BillClinton called a CRAZY SYSTEM. #BigLeagueTruth #Debate __HTTP__ _E_ HAPPY BIRTHDAY to our 40th President of the United States of America Ronald Reagan! __HTTP__ _E_ "In the end you're measured not by how much you undertake but what you finally accomplish. The Art of the Deal _E_ I have NOTHING to do with The Apprentice except for fact that I conceived it with Mark B & have a big stake in it. Will devote ZERO TIME! _E_ Donald Trump Partners with TV1 on New Reality Series Entitled Omarosa's Ultimate Merger: __HTTP__ _E_ Today we remember and honor our fellow Americans and NYPD and FDNY who fell 11 years ago. _E_ .@chucktodd is so dishonest in his reporting...and to think he was going off the air until I came along no ratings. I will beat Hillary! _E_ I hope everyone (especially the haters and losers) goes to Macy's today and buys some DJT ties shirts and suits and SUCCESS Fragrance love! _E_ Has anyone looked at the really poor numbers of @VanityFair Magazine. Way down big trouble dead! Graydon Carter no talent will be out! _E_ .#RogerStone was just banned by @CNN their loss! Tough loyal guy. _E_ I just had an amazing day in Mumbai India. Building an almost 80 story building super luxury which is doing great! Press is going wild. _E_ Going to South Carolina now great place SRO crowd. Iowa was amazing yesterday! _E_ Join me in Wilmington Ohio tomorrow at 4:00pm! It is time to #DrainTheSwamp! 
Tickets: __HTTP__ __HTTP__ _E_ Where's the press? 1484: 72% of Afghan Casualties Have Occurred Under Obama __HTTP__ _E_ Hillary can never win over Bernie supporters. Her foreign wars NAFTA/TPP support & Wall Street ties are driving away millions of votes. _E_ Great day of bilateral meetings at #ASEANSummit on trade which we are turning around to be great deals for our country! __HTTP__ _E_ I wonder if Apple is upset with me for hounding them to produce a large screen iPhone. I hear they will be doing it soon—long overdue. _E_ RT @SLandinSoCal: When you kneel for our #NationalAnthem you aren't protesting a specific issue you are protesting our Nation and EVERYTH... _E_ When will the unemployment numbers be corrected? Sadly after the election! _E_ Melania and I will be appearing on Larry King Live tonight 9 p.m. on CNN. Be sure to tune in for some great conversation! _E_ Do you believe that Secy. KERRY just went to Egypt to talk about human rights problems and this as everything is being blown up around him _E_ Are you a young professional getting ready for a big meeting? Pick up a #Trump suit @Macys __HTTP__ Look your best! _E_ THANK YOU! #Trump2016#IACaucus finder: __HTTP__ __HTTP__ _E_ Tweet me tonight who your favorite is during the live telecast of the Miss Universe Pageant. _E_ "60 Minutes" treats President Obama with kid gloves Mike Wallace is spinning in his grave! _E_ Proud to welcome our great Cabinet this afternoon for our first meeting. Unfortunately 4 seats were empty because S... __HTTP__ _E_ .@CarlyFiorina Ben Carson said in his own book that he has a pathological temper & pathological disease. I didn't say it he did. Apology? _E_ I will be announcing my decision on Paris Accord Thursday at 3:00 P.M. The White House Rose Garden. MAKE AMERICA GREAT AGAIN! _E_ .@TheAlabamaBand was great last night in D.C. playing for 147 Diplomats and Ambassadors from countries around the world. Thanks Alabama! 
_E_ If @rihanna is dating @chrisbrown again then she has a death wish. A beater is always a beater just watch! _E_ Via @KingwoodNews by @JayRJordan:: "@TXPatriotsPAC gives public a chance to hear Trump speak" __HTTP__ _E_ .@oreillyfactor Please correct I WON Virginia! _E_ Now @BarackObama wants us to believe the Republicans cancelled Keystone and are responsible for $4 gas. He (cont) __HTTP__ _E_ How badly will the Country be hurt by the three scandals and the very poor implementation and cost of Obama Care? _E_ But @mcuban is physically weak he has no clubhead speed or game! _E_ The Democrats' solution is the same solution they have for everything tax tax tax. Just one problem: it doesn't work #TimeToGetTough _E_ I turned down going to the debate tonight so that I could do live tweets to my many followers. _E_ I'm very proud of my new crystal collection. Here's a sneak peak of my favorite collection Elmsford __HTTP__ (cont) __HTTP__ _E_ My shirts ties & suits (and fragrance Success) are doing great go over & check out Macy's now—beautiful new selection! _E_ ObamaCare is on LIFE SUPPORT it will soon be DEAD ON ARRIVAL A bad concept that was imcompetently administered! _E_ Remember I am the only candidate who is self funding. While I am given little credit for this by the voters I am not bought like others! _E_ The only people who are not interested in being the V.P. pick are the people who have not been asked! _E_ Happy 94th birthday to Nelson Mandela! _E_ Ask yourself Is this a blip or is it a catastrophe? and your equilibrium will be kept in check if/when hard times hit. _E_ Funny if you listen to @FoxNews the Democrats did not have a good day. If you listen to the other two they are fawning. What a difference _E_ "Trump on Obama: 'I never thought it would be this bad'" __HTTP__ via @breitbarttv _E_ Obama Care's taxes vest in 2014. Conveniently after the 2012 election. Coincidence? _E_ Wishing everyone a wonderful Independence Day weekend. 
We have a lot to be thankful for. _E_ Last Saturday A Rod was 0 3 and left 6 stranded. But he was still hitting on girls from the dugout __HTTP__ He is very selfish! _E_ Despite my great respect for King Abdullah II I will not be visiting Jordan at this time. This is in response to the false @AP report. _E_ The W.H. is functioning perfectly focused on HealthCare Tax Cuts/Reform & many other things. I have very little time for watching T.V. _E_ On Fifth Avenue the iconic @TrumpTowerNY is one of NYC's most heavily visited tourist attractions __HTTP__ _E_ Awarded the prestigious 2014 @ForbesInspector 5 Star Guide @TrumpToronto is located in beautiful downtown Toronto __HTTP__ _E_ I wish the 23 million who are unemployed were able to celebrate like the Democrats in Charlotte right now. _E_ Great first day with world leaders at the #G20Summit here in Hamburg Germany. Looking forward to day two! #USA __HTTP__ _E_ My opponents big bosses lobbyists and donors are trying to do damage. They will fail! Money down the drain! _E_ .@TomLlamasABC cannot report the news truthfully. Why not apologize for your fraudulent story on World News Tonight.Gang members & criminals _E_ Cadillac Championship at Doral a great success I just bought Doral it will be amazing. Cadillac a great American car. _E_ No president in history has lied to the American people more than President Obama in fact it is not even close! _E_ Innovation distinguishes between a leader and a follower. Steve Jobs _E_ Great poll out of Nevada thank you! See you soon. #MAGA #AmericaFirst __HTTP__ __HTTP__ _E_ .@Rosie get better fast. I'm starting to miss you! _E_ Congrats @marklevinshow on fantastic article when "B" writes so nicely about you it really means something. __HTTP__ _E_ I will be on Face The Nation this morning at various times across the U.S. @CBSNews Enjoy! _E_ .@NBCNews is so knowingly inaccurate with their reporting. The good news is that the PEOPLE get it which is really all that matters! 
Not #1 _E_ What controversy? 2 'active' @BarackObama supporters at Bain have confirmed that @MittRomney left in '99 __HTTP__ No story here. _E_ Why Donald Trump Won't Touch Your Entitlements: DES MOINES Iowa—Donald Trump says if he runs for p... __HTTP__ _E_ Thank you! #ImWithYou __HTTP__ _E_ Young entrepreneurs – Remember that your first deals are the most important of your career. Win & gain confidence. _E_ Why does Obama continue to release the worst of the worst from Gitmo?! Look at Paris and wake up! _E_ I know of no more encouraging fact than the unquestionable ability of man to elevate his life by conscious endeavor. Henry David Thoreau _E_ Waste! The CBO now estimates that @BarackObama's stimulus cost $831B and a ridicuous $4.1M per job created __HTTP__ _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Behind the scenes video with "Uncle Sam" (eagle's name) and me. __HTTP__ _E_ Great night in Denver Colorado thank you! Together we will MAKE AMERICA GREAT AGAIN! #ICYMI watch rally here:... __HTTP__ _E_ Iraq is far worse and of more danger to the U.S. now than it ever was under Saddam Hussein and this after $2 trillion and so many lives! _E_ I had a great time at @TwitterNYC #AskTrump __HTTP__ _E_ If Bernie Sanders after seeing the just released e mails continues to look exhausted and done then his legacy will never be the same. _E_ That said the rich Arab countries should get involved with the Syrian mess not us.We should start rebuilding our own country & military. _E_ Looking like my 5 victories on Tuesday will be just as good as if I won Ohio. Two more days and Ohio was mine! _E_ The Democrats want to shut down the Government over Amnesty for all and Border Security. The biggest loser will be our rapidly rebuilding Military at a time we need it more than ever. We need a merit based system of immigration and we need it now! No more dangerous Lottery. 
_E_ This is the time for the United States to be strengthening all important military components not rolling over and dealing from weakness! _E_ Notice the first word in my Think Big credo: Think = the 1st step. Use everything in your power to utilize & develop that capacity. _E_ Why doesn't Fake News talk about Podesta ties to Russia as covered by @FoxNews or money from Russia to Clinton sale of Uranium? _E_ .@BLTPrimeMiami @TrumpDoral's signature restaurant has set the standard for today's modern steakhouse __HTTP__ _E_ The Mar a Lago Club has the best meatloaf in America. Tasty. __HTTP__ _E_ RT @ProgressPolls: Who is a better President of the United States? #ObamaDay _E_ .@Graeme_McDowell Great playing this weekend. You are a true winner. We look forward to having you back to Trump National Doral. _E_ RT @RealJamesWoods: I've never witnessed such hatred for a man who is willing to work for free to make his beloved country a better place.... _E_ Why isn't @chucktodd using the much newer @CNN Poll when discussing how well I am doing instead of the older Q Poll? CNN even better! _E_ They just called this the biggest scandal in U.S. sports history (GMA). Roger looks really weak and indecisive must put on a better front! _E_ The failing @nytimes was forced to apologize to its subscribers for the poor reporting it did on my election win. Now they are worse! _E_ Gov. @BobbyJindal referred to the Republicans as "the stupid party". Now he has given Dems a phrase to use. _E_ #WheresHillary? Sleeping!!!!! _E_ Chuck Hagel's nomination has been held up for at least 12 more days. A lot can happen. _E_ Restoring American wealth will require that we get tough. The next president must understand that America's (cont) __HTTP__ _E_ Spent full day with contractors at Trump National Doral it will be amazing! __HTTP__ _E_ Double digit premium increases because of ObamaCare. Dems trying to delay showing numbers until after election but news is spreading fast! 
_E_ "TRUMP HITS BACK AT CHRIS MATTHEWS' BIRTHER RANT: 'HE USED TO BE A MUCH MORE INTELLIGENT MAN' __HTTP__ @MadeleineBlaze _E_ The media must denigrate ISIS at all levels or youth will continue to be drawn to it. These are low level degenerates NOT masterminds! _E_ Thank you! __HTTP__ _E_ On rugged Aberdeenshire coastline@TrumpScotland's Par 72 course is 7400 picturesque yds. threaded through dunes __HTTP__ _E_ RT @ericbolling: Good morning friends! The Swamp out today. President Trump has a copy... get yours here __HTTP__ #maga... _E_ Donna Brazile Shreds Obama Economy Acting DNC chair says 'people are more in despair about how things are' __HTTP__ _E_ We cannot solve our problems with the same thinking we used when we created them. Albert Einstein _E_ My @SteveDeaceShow int. on China the economy and my upcoming keynote at @theFAMiLYLEADER Leadership Summit __HTTP__ _E_ The number of unemployed Americans has increased over 60% during Obama's term. The economy can't survive another 4 years. _E_ Know when to walk away from the table. The Art of the Deal _E_ Many people have commented that my fragrance "Success" is the best scent & lasts the longest. Try it & let me know what you think! _E_ Another horrific attack this time in Nice France. Many dead and injured. When will we learn? It is only getting worse. _E_ I am a Republican...but the Republicans may be the worst negotiators in history! _E_ Jeff Flake with an 18% approval rating in Arizona said a lot of my colleagues have spoken out. Really they just gave me a standing O! _E_ Who is paying for that tedious Smokey Bear commercial that is on all the time enough already! _E_ Story will be released today at 12 noon EST on Twitter and Facebook. _E_ I can't wait to read this...RT @Newsmax_Media: SEAL Book Explodes Obama Furious __HTTP__ _E_ Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_ .@MittRomney was a disaster candidate who had no guts and choked! 
Romney is a total joke and everyone knows it! _E_ Tomorrow we celebrate Independence Day America's 236th birthday. Here is America's actual birth certificate __HTTP__ _E_ My interview with @BarbaraJWalters in her @ABC special 'Most Fascinating People of 2011' __HTTP__ _E_ Welcome to Twitter @melaniatrump! _E_ RT @TrumpNV: #NVcaucus locator > __HTTP__ _E_ Via @NRO by @JOELMENTUM: "Matchless Name Recognition and Deep Pockets Make Trump a Threat in Iowa" __HTTP__ _E_ Much as it pays to emphasize the positive there are times when the only choice is confrontation. #TheArtofTheDeal _E_ Just got back from New Hampshire. Great crowd great people! Will be back soon! _E_ RT @LouDobbs: #AmericaFirst @KellyannePolls: The Middle Class & businesses will benefit from @POTUS' historic tax revolution. #Dobbs #MAGA... _E_ My interview on @ASavageNation discussing why @MittRomney will defeat @BarackObama with a strong campaign. __HTTP__ _E_ Via @espn: Donald Trump would fire A Rod __HTTP__ _E_ Enter the contest.... __HTTP__ stay at Trump International Hotel Las Vegas _E_ Huma calls it a MESS the rest of us call it CORRUPT! WikiLeaks catches Crooked in the act again.#DrainTheSwamp __HTTP__ _E_ RT @TomOdell: .@FoxNews Pope who lives in a Vatican city fortified with huge walls thinks it's wrong to build walls? Really? __HTTP__ _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ All Star Celebrity @ApprenticeNBC continues to surprise our loyal viewers each and every week. More and bigger coming... _E_ For someone who demanded 20 years of Mitt's tax returns you would think my offer to donate $5M to charity for his records is an easy go. _E_ "Don't emphasize the problem so much emphasize the solution. It's a mindset that works." – Think Like a Champion _E_ Just finished my second speech. 20K in Dayton & 25K in Cleveland perfectly behaved crowd. Thanks I love you Ohio! 
__HTTP__ _E_ A number of months ago I was not expected to win South CarolinaTed Cruz was and yet I won in a landslide every group and category. WIN! _E_ Asking why my dislike of A Rod dishonorable dealings with me on an apartment deal _E_ The day after @BarackObama blocks a Texas voter photo ID law @JamesOkeefeIII exposes more dead people getting ballots __HTTP__ _E_ Nothing ever happened with any of these women. Totally made up nonsense to steal the election. Nobody has more respect for women than me! _E_ I look forward to going to the Land Investment Expo in Iowa on Jan. 23. Record crowd—sold out venue. @LandExpo @PeoplesCompany _E_ Getting ready to go to Las Vegas (Freedom Fest) great crowd. Then on to amazing Phoenix that will be a total happening! Love America. _E_ #TrumpVlog Quarantine the nurse! __HTTP__ _E_ Storm turned Hurricane is getting much bigger and more powerful than projected. Federal Government is on site and ready to respond. Be safe! _E_ Be sure to watch @IvankaTrump's @FoxBusiness @FBNATB interview from the @NYU #HospitalityConference __HTTP__ _E_ Via @Entrepreneur by @MDMSEO : "8 Lessons for Entrepreneurs That @ApprenticeNBC Emphasizes" __HTTP__ _E_ Dress your best. Trump Signature Collection exclusively available @Macys tops all male business attire __HTTP__ _E_ I am promising you a new legacy for America. We're going to create a new American future. Thank you OHIO! #ImWithYou __HTTP__ _E_ .@FinancialTimes writes that @BarackObama should pray that China overtakes US __HTTP__ Don't worry he is making it happen. _E_ .@ariannahuff is unattractive both inside and out. I fully understand why her former husband left her for a man he made a good decision. _E_ The WALL which is already under construction in the form of new renovation of old and existing fences and walls will continue to be built. _E_ Remember when guns are outlawed only outlaws will have guns! 
_E_ .@CelebApprentice wins 10 11 o'clock hour in all key ratings demographics including most importantly the 18 49 age group. _E_ Asians are very offended that JEB said that anchor babies applies to them as a way to be more politically correct to hispanics. A mess! _E_ Mar a Lago my club in Palm Beach and one of the greatest mansions ever built has been nominated as one of (cont) __HTTP__ _E_ Obama will send troops back into Iraq combat zone. Don't believe anything he says. Just covering for his mistakes. _E_ With America's debt topping $21T by the end of his presidency Obama will have effectively bankrupted our country. @davidaxelrod _E_ RT @Scavino45: #WeThePeople#USAatUNGA #UNGA __HTTP__ _E_ Trump International Golf Links was just rated one of the greatest courses in the world. Virtually all reviews are saying the same thing. _E_ Via @eonline: @IvankaTrump Wears @MissUniverse 2014's $300000 Crown Nails Beauty Pageant Winner Look __HTTP__ _E_ Mexico sent USMC Andrew Tahmooressi back to jail after court hearing. Mexico does not respect our border or U.S. Boycott! #FreeOurMarine _E_ Signing The Facebook Wall __HTTP__ _E_ Ted Cruz Was For Welcoming Syrian Refugees Before He Was Against It __HTTP__ _E_ You know the world is crazy when New York gets hit by a hurricane and Florida doesn't. _E_ We will have to see what Russia's next move will be. They may have given him an out of an embarrassing situation or drove into deeper mess! _E_ "In this game knowledge is the key to power." Think Big _E_ My @SquawkCNBC #TRUMPTUESDAY interview discussing the upcoming debates the real state of unemployment & bias media __HTTP__ _E_ I highly recommend the just out book THE FIELD OF FIGHT by General Michael Flynn. How to defeat radical Islam. _E_ .@megynkelly the most overrated anchor at @FoxNews worked hard to explain away the new Monmouth poll 41 to 14 or 27 pt lead. She said 15! 
_E_ My @SquawkCNBC interview discussing Jamie Dimon the Doral Miami purchase OPEC's output & @PGATOUR Open __HTTP__ _E_ The Democrats are turning down services and security for citizens in favor of services and security for non citizens. Not good! _E_ Donald Trump could again defy the conventional wisdom of the chattering class in November. @Newsmax_Media's cover The Trump Effect _E_ How come Snowden and ObamaCare have access to all records and information but don't have even the smallest tidbits on President Obama? _E_ This Sunday at 9/8C the real playoffs begin with the premiere of @apprenticenbc! Game on in the Boardroom... __HTTP__ _E_ Marco Rubio had no idea what he was doing on Chris Wallace show. Said Iraq was not a mistake. He looked clueless! _E_ A great evening in Iowa! Thank you Des Moines Area Community College for a great forum! #Trump2016 #IAForums __HTTP__ _E_ 'President Donald J. Trump Proclaims September 3 2017 as a National Day of Prayer' #HurricaneHarvey #PrayForTexas __HTTP__ __HTTP__ _E_ We need to secure our borders ASAP. No games we must be smart tough and vigilant. MAKE AMERICA GREAT AGAIN & MAKE AMERICA STRONG AGAIN! _E_ My @gretawire int. discussing business difficulties with ObamaCare & how it is stopping businesses from hiring __HTTP__ _E_ They changed the name from "global warming" to "climate change" after the term global warming just wasn't working (it was too cold)! _E_ Ted Cruz is totally unelectable if he even gets to run (born in Canada). Will loose big to Hillary. Polls show I beat Hillary easily! WIN! _E_ Congratulations to @IvankaTrump and Jared on the big news. I will have yet another grandchild this fall! _E_ The reason lyin' Ted Cruz has lost so much of the evangelical vote is that they are very smart and just don't tolerate liars a big problem! _E_ Do you think @SenTedCruz knows about @bobvanderplaats dealings? Actually I doubt it! _E_ ....and Japan will put up with this much longer. 
Perhaps China will put a heavy move on North Korea and end this nonsense once and for all! _E_ Record crowd in Tampa Florida thank you! We will WIN FLORIDA #DrainTheSwamp in Washington D.C. and MAKE AMERICA... __HTTP__ _E_ Nothing great was ever achieved without enthusiasm. Ralph Waldo Emerson _E_ "Stay confident even when something bad happens. It is just a bump in the road. It will pass." – Think Big _E_ China is already preparing to benefit economically from this mess. They will pick up the pieces and make yet another fortune & laugh at us! _E_ Olympic Gold Medalist Evan Lysacek just left my office. He is in town and wanted to meet me he's a fanastic guy and a true champion. _E_ Thank you for all of your support South Carolina! #Trump2016 __HTTP__ _E_ Watch @DonaldJTrumpJr and @EricTrump accept my #ALSIceBucketChallenge __HTTP__ _E_ Once the Bloomberg administration selected Trump to take over the very expensive and years late project I kicked ass and got it done fast _E_ Truth will ultimately prevail where there is pains to bring it to light. George Washington _E_ Goofy Elizabeth Warren has been one of the least effective Senators in the entire U.S. Senate. She has done nothing! _E_ Ted Cruz lifts the Bible high into the air and then lies like a dog over and over again! The Evangelicals in S.C. figured him out & said no! _E_ Via @FoxNews: Donald Trump sends Bill Maher birth certificate awaits $5 million __HTTP__ _E_ Thank you Maryland! #Trump2016 __HTTP__ _E_ ObamaCare is clearly unconstitutional. Hopefully the USC rules correctly but in the end repealing ObamaCare requires a political solution. _E_ Today we continued a wonderful American Tradition at the White House. Drumstick and Wishbone will live out their days in the beautiful Blue Ridge Mountains at Gobbler's Rest... __HTTP__ _E_ HISTORIC rainfall in Houston and all over Texas. Floods are unprecedented and more rain coming. Spirit of the people is incredible.Thanks! 
_E_ Thank you Miss Katie's Diner!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ The Democrats are in a total meltdown but the biased media will say how great they are doing! E mails say the rigged system is alive & well! _E_ Creativity and control can go hand in hand. Brainpower is the ultimate leverage. _E_ "Never Ignore Donald Trump" __HTTP__ by Jeffrey Lord @AmSpec _E_ Now the @BarackObama campaign is fundraising off of me. I should get a tax rebate! __HTTP__ _E_ .@chelseahandler stop calling my office for me to do your rather gross show I have less interest in you than Andre. _E_ Pervert Weiner is dead in his race for mayor of NYC but WOW Eliot Spitzer has dropped way down in recent poll for comptroller. SLEAZE! _E_ Big wins against ISIS! _E_ Will be interviewed on @SeanHannity on @FoxNews from #Wisconsin tonight. My wife Melania will join me for the entire show. _E_ I suggest that we add more dollars to Healthcare and make it the best anywhere. ObamaCare is dead the Republicans will do much better! _E_ In a clumsy move to get out of his anchor babies dilemma where he signed that he would not use the term and now uses it he blamed ASIANS _E_ My deepest condolences to the victims of the terrible shooting in Douglas County @DCSheriff and their families. We love our police and law enforcement God Bless them all! #LESM _E_ I will be interviewed on @foxandfriends at 7:00 30 minutes. Some very interesting topics. _E_ .@acuconservative's #CPACCO kept up the momentum from the debate. @MittRomney even made a surprise appearance. Now go win CO! _E_ Why did @BarackObama liberate Libya and do nothing for the Iranian protestors? Iran is a threat to our national security. _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ "I've found that people work harder when they are held accountable and their confidence rises along with that." 
– Midas Touch _E_ It is now a FACT that President Obama lied in order to get ObamaCare passed that is fraud and the legislation should be recinded INTERESTING _E_ Flying to New Hampshire to keynote the @LoebSchool First Amendment Award Ceremony. Always great to visit the Granite State! _E_ The RNC which is probably not on my side just illegally put out a fundraising notice saying Trump wants you to contribute to the RNC. _E_ Our country has become so politically correct that it has lost all sense of direction or purpose. We are unable to move forward painlessly. _E_ I predict that President Obama will at some point attack Iran in order to save face! _E_ .@GOP congress needs to actually defund ObamaCare not waste time passing non binding resolutions. _E_ I have raised/given a tremendous amount of money to our great VETERANS and have got nothing but bad publicity for doing so. Watch! _E_ So sad that Burt Reynolds has lost all of his money. I wish he came to me for advice he would be rich as hell! _E_ #CrookedHillary sending U.S. intelligence info. to Podesta's hacked email is 'unquestionably an OPSEC violation' __HTTP__ _E_ ...do the typical political thing and BLAME. The fact is ObamaCare was a lie from the beginning. Keep you doctor keep your plan! It is.... _E_ The House of Representatives seeks contempt citations(?) against the JusticeDepartment and the FBI for withholding key documents and an FBI witness which could shed light on surveillance of associates of Donald Trump. Big stuff. Deep State. Give this information NOW! @FoxNews _E_ The constant interruptions last night by Tim Kaine should not have been allowed. Mike Pence won big! _E_ Entrepreneurs: See yourself as victorious: Look at the solution not the problem. _E_ Won't be a buyer's market for long. If you can purchase a home but remember I told you this three years ago. _E_ It was my great honor to welcome President @JC_Varela & Mrs. Varela from Panama this afternoon. ... 
__HTTP__ _E_ Great news that @FoxNews has cancelled the additional debate. How many times can the same people ask the same question? I beat Cruz debating _E_ ...hasn't worked agreements violated before the ink was dry makings fools of U.S. negotiators. Sorry but only one thing will work! _E_ "@MissUSA Nia Sanchez of Nevada is ready for @missuniverse" __HTTP__ via @lasvegassun by @Robin_Leach _E_ The weak illegal immigration policies of the Obama Admin. allowed bad MS 13 gangs to form in cities across U.S. We are removing them fast! _E_ RT @SecretService: Our thoughts & prayers are with the families friends & colleagues of #Virginia's @VSPPIO Lt Cullen & Tpr Bates #Charlot... _E_ Sadly Vanity Fair is a rapidly dying magazine. Needs new blood and fast! Going the way of SPY Magazine. _E_ A nation without borders is no nation at all. We must build a wall. Let's Make America Great Again! __HTTP__ _E_ Entrepreneurs: Work on what you will be proud to be associated with. Make your work count. _E_ Trump National Golf Club Los Angeles will be the host in October for the @PGAGrandSlam. __HTTP__ _E_ We deserve all the answers on Benghazi! __HTTP__ @RepWOLFPress _E_ Nice mention by Brian Kelly __HTTP__ Conservative Action Alerts _E_ Why didn't A.G. Sessions replace Acting FBI Director Andrew McCabe a Comey friend who was in charge of Clinton investigation but got.... _E_ "My advice to you regarding momentum is definitive: Get yours going!" – Think Like a Champion _E_ I love watching these poor pathetic people (pundits) on television working so hard and so seriously to try and figure me out. They can't! _E_ Revisionist history. Now Obama claims he never told us that everyone could keep their healthcare plans. Crazy! _E_ Never never never give up. Winston Churchill _E_ My friend Larry King @kingsthings asked me to do an interview with him—he was always great to me—& I agreed. Watch tonight 9 PM on RTV. 
_E_ There is a good possibility that a person who treated patients in West Africa and who FLEW into New York has Ebola. Touched many bedlam! _E_ I am self funding my campaign & don't owe anybody anything! I only owe it to the American people! #Trump2016Watch: __HTTP__ _E_ .@BarackObama said he doesn't take the Navy Seals campaigning against him too seriously. _E_ The failing @nytimes has been wrong about me from the very beginning. Said I would lose the primaries then the general election. FAKE NEWS! _E_ Will Barack Obama personally read the Boston terrorist his Miranda Rights? _E_ About to begin a rally here in Henderson Nevada. New Reuters poll just out thank you! Join the MOVEMENT:... __HTTP__ _E_ Our worst threat to unemployment is @ObamaCare. It will also destroy our country's basic standards. _E_ Where's the leadership? Obama only met with Sebelius ONCE since ObamaCare passed __HTTP__ His signature legislation... _E_ People don't know that Eliot's father is very rich. Eliot likes to pretend he's poor to appeal to voters. _E_ I will make my final decision on the Paris Accord next week! _E_ Sun Sentinel says: Rubio lacks the experience work ethic and gravitas needed to be president. HE HAS NOT EARNED YOUR VOTE! _E_ Good luck to Derek on his operation. I know it will be a success he is a great champion. _E_ Life is very fragile and success doesn't change that. If anything success makes it more fragile. The Art of the Deal _E_ The Green Party just dropped its recount suit in Pennsylvania and is losing votes in Wisconsin recount. Just a Stein scam to raise money! _E_ Do not settle for remaining in your comfort zone. Being complacent is a good way to get nowhere. Take control and move forward every day. _E_ "The object of war is not to die for your country but to make the other bastard die for his." Gen. George S. Patton _E_ We better be vigilant careful and strong. __HTTP__ _E_ RT @NRA: .@RealDonaldTrump is right. 
If @HillaryClinton gets to pick her anti #2A #SCOTUS judges there's nothing we can do. #NeverHillary _E_ .@foxandfriends at 7:00 A.M. _E_ In January '12 3 turbines were wrecked in rough weather ... __HTTP__ _E_ .@FoxNews legal analyst & former prosecutor @kimguilfoyle destroyed hack Schneiderman's suit on @FNTheFive yesterday.She's very sharp! _E_ The EPA is caught saying that their philosophy is to crucify oil companies __HTTP__ That will sure lower the price of gas. _E_ Will be doing @seanhannity at 10 PM on @FoxNews. As always with Sean will be interesting! _E_ RT @realDonaldTrump: Unemployment is down to 4.1% lowest in 17 years. 1.5 million new jobs created since I took office. Highest stock Mark... _E_ South Carolina rally last night was so unbelievably exciting (and fun). I am now off to Iowa for two big rallies packed houses. Love it! _E_ Thank you very much for the nice story I greatly appreciate it __HTTP__ _E_ We need a tax system that is FAIR to working FAMILIES & that encourages companies to STAY in AMERICA GROW in AMERICA and HIRE in AMERICA! __HTTP__ _E_ Sarah Jessica Parker voted "unsexiest woman alive" – I agree. She said "it's beneath me to comment on the... __HTTP__ _E_ Thank you Naples Florida! Get out and VOTE #TrumpPence16 on 11/8. Lets #MakeAmericaGreatAgain! Full Naples rally... __HTTP__ _E_ Greece's financial calamity should serve as a warning. @BarackObama's massive deficit spending is unsustainable. _E_ Home of the 2022 @PGAChampionship Trump Nat'l Bedminster features 36 holes designed by famed architect Tom Fazio __HTTP__ _E_ #ImWithYou #AmericaFirst __HTTP__ _E_ If Vera Coking had taken my millions of $'s like she should have she would have lived for many years in Palm Beach Florida. _E_ Our Q1 GDP was 2.9%. Worst in memory ObamaCare killing jobs stopping growth and making small business insecure. _E_ Just arrived in Italy for the G7. Trip has been very successful. We made and saved the USA many billions of dollars and millions of jobs. 
_E_ .@KieranLalor I created far more jobs and success in Dutchess than you you should be Fired. _E_ In other words our military has a very big problem! _E_ Why is lightweight A.G. Eric Schneiderman allowed to ask for campaign contributions from my people during settlement negotiations? _E_ Scary Obama's budget deficits are so out of control that he has to borrow 40 cents on every dollar he spends. _E_ Via @KCRG by @markwcarlson: "Donald Trump stops in Coralville" __HTTP__ _E_ Lightweight @JebBush is spending a fortune of special interest against me in SC. False advertising desperate and sad! _E_ How amazing the State Health Director who verified copies of Obama's "birth certificate" died in plane crash today. All others lived _E_ American Exceptionalism and the Navy Yard shooting do not go hand in hand. Foreign countries in particular Russia are mocking the U.S. _E_ Oil should not cost more than $40 a barrel. Ideally it should be $25. Cheap to produce and we protect the OPEC countries. _E_ Wing bangers the name given to wind turbines by bird lovers for the thousands of birds they kill in the U.S. _E_ Good Morning America weather headline for U.S. NEVER ENDING COLD _E_ Spoke to Jerry Jones of the Dallas Cowboys yesterday. Jerry is a winner who knows how to get things done. Players will stand for Country! _E_ Just landed in Paris France with @FLOTUS Melania. __HTTP__ _E_ I love Mexico but not the unfair trade deals that the US so stupidly makes with them. Really bad for US jobs only good for Mexico. _E_ Hope everyone is watching the Finale rerun of Celebrity Apprentice on CNBC especially the haters and losers! It is on right now. _E_ I am in Iowa. Will be making two speeches today. Good luck to all of the great folks on the East coast. Enjoy the beauty of the storm! 
_E_ My @SquawkCNBC interview discussing why the Fed shouldn't do a QE3 @BarackObama's college records & 2012 election __HTTP__ _E_ My condolences to those involved in today's horrible accident in NJ and my deepest gratitude to all of the amazing first responders. _E_ Was the brother of John Podesta paid big money to get the sanctions on Russia lifted? Did Hillary know? _E_ Good response on jobs by @MittRomney. _E_ In addition to doing a lousy job in taking care of our Vets John McCain let us down by losing to Barack Obama in his run for President! _E_ Lightweight @AGSchneiderman is pushing for the Moreland Commission to be disbanded immediately—because he is being looked at! _E_ The ObamaCare website is in the news again it is turning out to cost even more than previously thought AND IT DOESN'T WORK! Big trouble! _E_ It's Monday. How many fundraisers will Obama hold today? _E_ "The only place success comes before work is in the dictionary." – Vince Lombardi _E_ The person that Hillary Clinton least wants to run against is by far me. It will be the largest voter turnout ever she will be swamped! _E_ I thought and felt I would win big easily over the fabled 270 (306). When they cancelled fireworks they knew and so did I. _E_ Hillary Clinton strongly stated that there was absolutely no connection between her private work and that of The State Department. LIE! _E_ The polls are now showing that I am the best to win the GENERAL ELECTION. States that are never in play for Repubs will be won by me. Great! _E_ I look forward to tonight's debate but look far more forward to making America great again. It can happen! _E_ THEBillMcGee @realDonaldTrump after a year of wear your shirts still look great! Glad I made the purchase! Thank you. _E_ The Trump Organization Finalizes Purchase of Legendary Turnberry Resort in Scotland. It's absolutely... __HTTP__ _E_ THANK YOU Baton Rouge Louisiana! WE will #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_ We are now at 1001 delegates. 
We will win on the first ballot and are not wasting time and effort on other ballots because system is rigged! _E_ The Consumer Financial Protection Bureau or CFPB has been a total disaster as run by the previous Administrations pick. Financial Institutions have been devastated and unable to properly serve the public. We will bring it back to life! _E_ The Unaffordable Care Act will soon be history! _E_ ...big dollars ($700000) for his wife's political run from Hillary Clinton and her representatives. Drain the Swamp! _E_ Can you envision Jeb Bush or Hillary Clinton negotiating with 'El Chapo' the Mexican drug lord who escaped from prison? .... _E_ .@timkaine oversaw unemployment INCREASE by 179249 while @mike_pence DECREASED unemployment in Indiana by 113826.... __HTTP__ _E_ Both of our New York hotels are on the Top Ten list of the most luxurious hotels in NYC... __HTTP__ Congrats to all! _E_ WSJ covers Ride of Fame __HTTP__ _E_ Just made the point at #NCGOPcon that we have to protect our border & I think everyone here knows nobody can build a wall like Trump! _E_ Obama was very disloyal to Wisconsin Democrats. @BarackObama never showed up to help them even though he (cont) __HTTP__ _E_ I find it really hard to listen to @BarackObama's speeches. He doesn't have a clue. _E_ In the ridiculous @JebBush ad about me Jeb is speaking to me during the debate but doesn't allow my answer which destroys him SO SAD! _E_ .@mcuban Mark okay with me but don't start your bullshit again! _E_ Change before you have to. Jack Welch _E_ My interview yesterday with @MyFoxNY __HTTP__ _E_ The Miss Universe Pageant will be on August 23 (9 11 p.m. on NBC ET) with Bret Michaels and Natalie Morales to co host live from Las Vegas _E_ Look great for Thanksgiving. Trump Signature Collection exclusively available @Macys offers top men's styles __HTTP__ _E_ I'm in Moscow for Miss Universe tonight picking a winner is very hard they are all winners. Total sellout of arena. Big night in Russia! 
_E_ .@THEGaryBusey as Project Manager... is Team Power in trouble?? #CelebApprentice _E_ First Obama says Egypt is not an ally. Then he promises to keep handing over aid __HTTP__ Incompetent and unqualified. _E_ #MakeAmericaGreatAgain I will be in Cedar RapidsIA this Saturday. Get your tickets __HTTP__ _E_ There's definitely no love lost between Piers and Omarosa. _E_ .@KyleStephens30 #asktrump __HTTP__ _E_ .@MannyPacquiao and friends at @TrumpDoral __HTTP__ _E_ Our country is in a major crisis of incompetent leadership. We cannot continue to go on with these politicians who do nothing but talk. _E_ Gov. Cuomo's Moreland Comm should be looking at AG Schneiderman shaking down those under investigation/ in litigation for campaign $$$ _E_ .@RollingStone admitted their scam. Phony @HuffingtonPost and others are no better total joke! _E_ The Fed is destroying the dollar. When inflation hits the economy then even more jobs will go overseas. _E_ Everyone is talking about how Trump Tower is the exterior for Wayne Enterprises in Dark Knight Rises it's true. __HTTP__ _E_ Mark Udall was the deciding vote for ObamaCare & now 250000 Coloradans were dropped from their plans. Vote @CoryGardner! _E_ #ObamacareFail #HillarycareFail __HTTP__ _E_ Jeb why did your brother attack and destabalize the Middle East by attacking Iraq when there were no weapons of mass destruction? Bad info? _E_ Entrepreneurs: Be a cautious optimist. I call it positive thinking with a lot of reality checks. _E_ The only place where our border is protected is from Europeans. We educate them in our finest institutions & then have them deported. _E_ Thank you Vermont! #VoteTrumpVT __HTTP__ _E_ Thank you High Point NC! I will fight for every neglected part of this nation & I will fight to bring us together... __HTTP__ _E_ A tough negotiator can make the Chinese back off. We've done it before. #TimeToGetTough __HTTP__ __HTTP__ _E_ Continued success is built on building a brand people know will deliver. 
Unless you're @KarlRove. Then you just blame the Tea Party. _E_ Via @Newsmax_Media by @JAGERFILE: "Donald Trump: 'Morally Unfair' to Use Soldiers in Ebola Fight" __HTTP__ _E_ THANK YOU! #VoteTrump __HTTP__ _E_ Entrepreneurs: See yourself as victorious. Look at the solution not the problem. Be tough be strong be tenacious. _E_ Charlie Hebdo reminds me of the satirical rag magazine Spy that was very dishonest and nasty and went bankrupt. Charlie was also broke! _E_ China is stealing our jobs. We need to demand China stop manipulating its currency and end its rampant corporate espionage. _E_ Our country does not feel 'great already' to the millions of wonderful people living in poverty violence and despair. _E_ Doesn't want to remove Assad worries what comes next. _E_ .@DiamondandSilk Just watched you on #WattersWorld with a large group of people. Everybody loves you two amazing people! #Trump2016 _E_ Watch this great behind the scenes video of @IvankaTrump's Spring 2013 photo shoot __HTTP__ _E_ Join me! #Trump20166/10: Richmond __HTTP__ Tampa __HTTP__ Pittsburgh __HTTP__ _E_ The reporter that called Kevin Durant Mr. Unreliable should be fired or at least apologize. He is a truly great player and a winner! _E_ Anytime you see someone talking about celebrity weight loss on my twitter it is a total scam! _E_ I appreciate the kind words of Mike Huckabee a fine American __HTTP__ _E_ The #2A to our Constitution is clear. The right of the people to keep & bear Arms shall not be infringed upon. __HTTP__ _E_ Those who lack courage will always find a philosophy to justify it. Albert Camus _E_ Today I signed an Executive Order @ the U.S. Dept. of @Interior: 'Review of Designations Under the Antiquities Act... __HTTP__ _E_ Is President Obama trying to destroy Israel with all his bad moves? Think about it and let me know! _E_ Watch the 63rd Annual @MissUniverse Pageant tomorrow on NBC at 8PM! __HTTP__ _E_ Thank you Wayne Root we will #MakeAmericaGreatAgain! 
__HTTP__ _E_ Congratulations to Gabby Douglas on winning the Gold for the USA in gymnastics. She is terrific! _E_ Very sad & dangerous that soon to be ex Intelligence Chair Dianne Feinstein released the CIA report. Glad she is losing her Comm. Chair. _E_ The National Border Patrol Council (NBPC) said that our open border is the biggest physical & economic threat facing the American people! _E_ The 2013 @NJPGA Course of the Year Trump Nat'l Bedminster is honored to be hosting the 2022 @PGAChampionship __HTTP__ _E_ I just don't know why some of these NFL teams with lousy quarterbacks don't give Tim Tebow a chance what do they have to lose? _E_ When you're "hot" the lowlifes really shoot at you... and they try hitting from every angle! Never let the bastards get you down. _E_ UK is freezing through longest & coldest winter in over 50 years __HTTP__ Where's the global warming? @gatewaypundit _E_ Watching the show. #WWEHOF __HTTP__ _E_ On top of the disrespect shown by Russia don't forget they still have Snowden who has given them (& everyone) massive US secrets. _E_ We.signed our deal to take over the historic Old Post Office on Pennsylvania Ave. from the U.S. and convert it into super luxury hotel jobs! _E_ ... I will soon start naming magazines that I think will fold I predicted Newsweek. _E_ Libya is adopting a more radical form of Sharia Law now under their new leadership. Is this what @BarackObama wanted? _E_ Lucky for New York highly respected John Cahill is running for NY State AG against incumbent lightweight dope @AGSchneiderman @CahillForAG _E_ If you read my last number of tweets only one opinion can be formed that our President and therefore leader is grossly incompetent! _E_ The golden rule of negotiation: He who has the gold makes the rules. _E_ When will Sleepy Eyes Chuck Todd and @NBCNews start talking about the Obama SURVEILLANCE SCANDAL and stop with the Fake Trump/Russia story? _E_ .@KirschnerDavid @realDonaldTrump Congrats Mr. 
Trump on making @Forbes list of wealthiest in the world. Thanks! _E_ ...expect the country to be further downgraded in the future. The rich are all leaving! _E_ I am being investigated for firing the FBI Director by the man who told me to fire the FBI Director! Witch Hunt _E_ The WGC @CadillacChamp leadership board is available here: __HTTP__ @DoralResort _E_ All those politicians in Washington and not one good negotiator. _E_ China is not our friend. They are not our ally. They want to overtake us and if we don't get smart and tough soon they will. _E_ As bad as Qaddafi was what comes next in Libya will be worse just watch. _E_ Bill Kristol has been wrong for 2yrs an embarrassed loser but if the GOP can't control their own then they are not a party. Be tough R's! _E_ Thank you Maine New Hampshire and Iowa. The waiting is OVER! The time for change is NOW! We are going to... __HTTP__ _E_ .@SharkGregNorman @Trump_Charlotte Looking great love the improvements to the buildings and grounds! not to mention course Thank you. _E_ The Islamists are taking over Egypt through the election. __HTTP__ Why did @BarackObama force Mubarak out? He was an ally. _E_ When someone attacks me I always attack back...except 100x more. This has nothing to do with a tirade but rather a way of life! _E_ Graydon Carter whose reign over failing @VanityFair has been a disaster has acted in two movies both bombed & got bad reviews. _E_ The @BarackObama recovery US unemployment is 9.1% US underemployment is 19.1% __HTTP__ Businesses won't hire under Obama. _E_ For many years our country has been divided angry and untrusting. Many say it will never change the hatred is too deep. IT WILL CHANGE!!!! _E_ Libyan Rebels should have given us 50% of the oil in return for our military support we don't even ask! _E_ #CelebApprentice I will be live tweeting(no spoilers) during tonight's all new @ApprenticeNBC at 8PM ET. _E_ Alabama will shine tomorrow. It will be a big and glorious day! 
_E_ .@stuartpstevens made some of the dumbest political decisions of all time in helping Romney to get destroyed by Obama. Should have won! _E_ .@Toure when you are fired from MSNBC for your bad ratings and racist coverage stop by and say hello. _E_ Thank you @SenatorSessions!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ I see where Mayor Stephanie Rawlings Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! _E_ I see @FLGovScott poll numbers are improving. Good man doing a good job. _E_ We must stop outsourcing our jobs overseas and end our multi billion dollar trade deficits. _E_ Wow in the new CBS Poll I went way up into the forties! Thank you! _E_ I will be making a big speech tomorrow to discuss the failed policies and bad judgment of Crooked Hillary Clinton. _E_ Incredible progress at @trumptowerpde – Punta del Este Uruguay the views are going to be fantastic! __HTTP__ _E_ Via @WalidShoebat: "Watch Donald Trump: He Is Patriotic And He Can Fix America" __HTTP__ _E_ Remember as a senator Obama did not vote for increasing the debt ceiling __HTTP__ I guess things change when President?! _E_ California gas prices going thru the roof others to follow. An election losing event for Obama. _E_ ...a real loser named Tim O'Brien and it's never recovered. _E_ Naghmeh Abedini the lovely wife of the Christian Pastor Saeed being held in an Iranian jail just left my office. #savesaeed _E_ Will be on @jimmykimmel in 20 minutes on @ABC. #Kimmel #Trump2016 #MakeAmericaGreatAgain _E_ The United States must immediately institute strong travel restrictions or Ebola will be all over the United States a plague like no other! _E_ Democrats jeopardizing the safety of our troops to bail out their donors from insurance companies. It is time to put #AmericaFirst _E_ Do you think the three UCLA Basketball Players will say thank you President Trump? They were headed for 10 years in jail! 
_E_ #MadeInAmerica📸 __HTTP__ __HTTP__ _E_ A great day in Puerto Rico yesterday. While some of the news coverage is Fake most showed great warmth and friendship. _E_ A lot of comments re @MELANIATRUMP vs. Milania last week. I think spelling has taken on a new significance. #CelebApprentice _E_ I won the debate if you decide without watching the totally one sided spin that followed. This despite the really bad microphone. _E_ It's amazing that people can say such bad things about me but if I say bad things about them it becomes a national incident. _E_ My thoughts and prayers are with the @KissimmeePolice and their loved ones. We are with you!#LESM _E_ #PeaceOfficersMemorialDay and#PoliceWeek Proclamation: __HTTP__ __HTTP__ _E_ Trump right: Illegal families crossing border set to double 51152 so far __HTTP__ _E_ Joe thanks for not running! __HTTP__ _E_ A sneak peek at Sunday's episode of The Celebrity Apprentice... __HTTP__ #trumpvlog _E_ .@KarlRove is a biased dope who wrote falsely about me re China and TPP. This moron wasted $430 million on political campaigns and lost 100% _E_ Via @peoplemag by @amandamichl: "@IvankaTrump: @Joan_RiversWas 'Very Warm' During Appearance on @ApprenticeNBC" __HTTP__ _E_ "Don't toss off your problems and don't dwell on them either. Deal with them!" – Think Like a Champion _E_ Via David Ebner re Stanley Cup & Trump poster: "If you're going to be thinking anything you might as well think big" __HTTP__ _E_ One of @GolfWorldUS top private clubs @TrumpNationalNY features a Jim Fazio designed 7291 yd par 72 course __HTTP__ _E_ .@ritter1025 Wishing your wife a Happy Birthday _E_ Standing room only in Mason City Iowa! Thanks to the record crowd of over 400 supporters! __HTTP__ _E_ .@karlrove's ad is the best thing that ever happened to Ashley Judd—simply increases her profile. 
_E_ I was on @SquawkBox this morning __HTTP__ _E_ I would like to wish everyone including all haters and losers (of which sadly there are many) a truly happy and enjoyable Memorial Day! _E_ It's important that we help poor people to become independent self sufficient individuals who gain the benefits of work. #TimeToGetTough _E_ He @newtgingrich is sounding more and more like a real team player...he is a really good guy! _E_ .... I only respond to people that register more than 1% in the polls. I never thought he had a chance and I've been proven right. _E_ Just got back from Wisconsin. Great day great people! _E_ Karzai of Afghanistan is not sticking with our signed agreement. They are dropping us like dopes. Get out now and re build U.S.! _E_ China's top academics are working w/ PLA in cyber espionage of our state secrets & R&D __HTTP__ They are laughing at us! _E_ Congratulations to @DavidWright of the #Mets. What a great season he is having batting over 400 and clutch hitting. Also a fantastic guy. _E_ Why the nation's debt keeps growing a Dept of Agriculture employee made over $242K with a $63K bonus __HTTP__ Ridiculous. _E_ If Karl Rove & @GOP Establishment continue to attack the Tea Party who delivered in 2010 then there will be a 3rd Party in 2016. _E_ Will be on Fox & Friends at 7.00 this morning ENJOY! _E_ On Jimmy Fallon tonight. _E_ Remember when you vote Obamacare is a disaster! _E_ Get ready this should be informative and fun! #VPDebate _E_ Trump Int'l Golf Club Turnberry Scotland. A legendary course ... and rightly so. __HTTP__ _E_ Golf is a brain game & is a great way to improve your business skills. Concentrationassessment technique & passion...it's all there. _E_ The Fed continues to recklessly flood the market with dollars. This will eventually create record inflation. It has to stop. #TimeToGetTough _E_ What a night! 10000 amazing supporters in Greenville South Carolina! THANK YOU!VOTE on Saturday! 
#VoteTrumpSC __HTTP__ _E_ Had @SenScottBrown asked me to do a robo call for him I would have done it and he would have won. _E_ Haim Saban: Hillary Clinton's Top Hollywood Donor Demands Racial Profiling of Muslims __HTTP__ _E_ A letter to @CNN President Jeff Zucker __HTTP__ _E_ It would be really nice if the Fake News Media would report the virtually unprecedented Stock Market growth since the election.Need tax cuts _E_ Thanks Dave! __HTTP__ _E_ Jay Sekulow on @foxandfriends now. _E_ Thrilled to hear that @RakutenTravelJP has awarded @TrumpWaikiki the 'Rakuten Diamond Award' for the 4th consecutive year! Congrats! _E_ Here's the deal: when your secretary of defense tells you that your proposed cuts will erode America's military (cont) __HTTP__ _E_ Entrepreneurs: Remember the golden rule of negotiating he who has the gold makes the rules. _E_ RT @realDonaldTrump: Thank you to our GREAT Military/Veterans and @PacificCommand.Remember #PearlHarbor. Remember the @USSArizona!A day... _E_ It's springtime and it just started snowing in NYC. What is going on with global warming? _E_ Many NATO countries have agreed to step up payments considerably as they should. Money is beginning to pour in NATO will be much stronger. _E_ Made additional remarks on Charlottesville and realize once again that the #Fake News Media will never be satisfied...truly bad people! _E_ The lobbyist and political hack that President Obama just appointed as the Ebola Czar just missed his first major meeting on Ebola A joke _E_ I was on The View this morning. We talked about The Apprentice. Tonight's episode is a great one tough exciting and surprising. 10 pm/NBC _E_ Today it was my pleasure and great honor to announce my nomination of Jerome Powell to be the next Chairman of the @FederalReserve. __HTTP__ _E_ Via @TheStreet by @swan_investor: Trump Tees Up Another 'Hole in One' in Scotland __HTTP__ _E_ Great defense by the @nyjets this weekend—congratulations to @woodyjohnson4—only 6 points allowed! 
_E_ wanting to sell their product cars A.C. units etc. back across the border. This tax will make leaving financially difficult but..... _E_ Some day when things calm down I'll tell the real story of @JoeNBC and his very insecure long time girlfriend @morningmika. Two clowns! _E_ .@SheriffClarke Great insight in dealing with the media today. You are a wonderful representative of calm and reason a real pro! _E_ Donald Trump's commercial free WWE Raw does big rating: __HTTP__ _E_ .@antbaxter I tried watching but fell asleep. _E_ There are many editorial writers that are good some great & some bad. But the least talented of all is frumpy Gail Collins of NYTimes. _E_ Hillary Clinton may be the most corrupt person ever to seek the presidency. Donald J. Trump _E_ Great news! Thank you Governor Ralph DLG Torres! #Trump2016 __HTTP__ _E_ Thank you to our amazing Wounded Warriors for their service. It was an honor to be with them tonight in D.C.... __HTTP__ _E_ Shirley B did a very good job singing Goldfinger! Not easy. _E_ So I have spent almost nothing on my run for president and am in 1st place. Jeb Bush has spent $59 million & done. Run country my way! _E_ One of the most accurate polls last time around. But #FakeNews likes to say we're in the 30's. They are wrong. Some people think numbers could be in the 50's. Together WE will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Do you think I made the right decision? #CelebApprentice _E_ "To succeed one must be creative and persistent." John H. Johnson _E_ You can be an @nfl player with murder charges and not be suspended. Yet with NO EVIDENCE @nfl targeted Tom Brady. B.S.! _E_ Obama is finally stopping the Chinese from buying something in America – windfarms __HTTP__ What a joke! _E_ Many meetings today in Bedminster including with Secretary Linda M and Small Business. Job numbers are looking great! _E_ Learn work and think in equal proportions and you'll be going in the right direction. 
_E_ Here we go again via @timesunion.com __HTTP__ ... another bad deal. _E_ The first ever All Star Celebrity @ApprenticeNBC premieres Sunday March 3rd! __HTTP__ _E_ This is a crossroads in the history of our civilization that will determine whether or not We The People reclaim c... __HTTP__ _E_ I am deeply committed to preserving our strong relationship & to strengthening America's long standing support for... __HTTP__ _E_ To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment. Ralph Waldo Emerson _E_ Trump Tower at Century City brings luxury to Makati the financial & social capital of Philippines __HTTP__ _E_ It is finally happening for our great clean coal miners! __HTTP__ _E_ Things will work out fine between the U.S.A. and Russia. At the right time everyone will come to their senses & there will be lasting peace! _E_ Amazing that while I lead by big numbers in the new Q and and USA Today polls the the press only wants to report on the phony WSJ/NBC poll. _E_ ... will happen when you go against the tide when you take a risk and it works. Think Big _E_ "Failure is simply the opportunity to begin again this time more intelligently." Henry Ford _E_ .... to do The Apprentice but I approved you anyway. Without my show you'd be nothing! _E_ ...design or negotiations yet. When I do just like with the F 35 FighterJet or the Air Force One Program price will come WAY DOWN! _E_ Thank you @FrankLuntz for saying I was a winner tonight. It is my great honor. #Trump2016 _E_ Jeb has been confused for forty years __HTTP__ _E_ Good news. Voters give @MittRomney the edge over @BarackObama on handling the economy according to @gallupnews __HTTP__ _E_ I will be on Fox & Friends at 7 A.M. 10 minutes. Much to talk about enjoy! _E_ THANK YOU @MayorGimenez for following the RULE OF LAW! Sanctuary cities make our country LESS SAFE! 
Full remarks: __HTTP__ __HTTP__ _E_ RT @TeamTrump: RT if you agree @HillaryClinton & @timkaine are WRONG for America! #VPDebate #MAGA __HTTP__ _E_ I am going to keep our jobs in the U.S. and totally rebuild our crumbling infrastructure. Crooked Hillary has no clue! @Teamsters _E_ New polls are good because the media has deceived the public by putting women front and center with made up stories and lies and got caught _E_ #FlashbackFriday #CrookedHillary __HTTP__ _E_ Why would anyone in Kentucky listen to failed presidential candidate Rand Paul re: caucus. Made a fool of himself (1%.)KY his 2nd choice! _E_ The Donald J. Trump Signature Collection exclusively available @Macys offers the finest style in menswear __HTTP__ _E_ The shale boom is saving our economy __HTTP__ Good for jobs national security & trade balance. Frack Now & Frack Fast! _E_ ...In other words Secy John Kerry is so out of his element... _E_ 'Presidential Executive Order on Identifying and Reducing Tax Regulatory Burdens' Executive Order:... __HTTP__ _E_ THE HARDER YOU WORK THE LUCKIER YOU GET! _E_ I am really beginning to respect Mark Halperin and John Heilemann as political reporters they truly get why Trump poll numbers are high. _E_ Incompetent Hillary despite the horrible attack in Brussels today wants borders to be weak and open and let the Muslims flow in. No way! _E_ China has a business tax rate of 15%. We should do everything possible to match them in order to win with our economy. Jobs and wages! _E_ The $85B sequester is just 2% of Obama's $3.5T record deficit spending budget. Our leaders are ruining our children's future. _E_ Should I do the #GOPdebate? __HTTP__ _E_ "Success depends...on how effectively you learn to manage the game's two ultimate adversaries: the course and yourself." @jacknicklaus _E_ All because of me people don't care about you Cher. @cher My week on twitter 1k retweets 29 new listings 15k new followers 2k mentions. 
_E_ In analyzing the Alabama Primary race Fake News always fails to mention that the candidate I endorsed went up MANY points after Election! _E_ Barack Obama has everything to gain. Why would anyone ever deny $5M to charity? _E_ Sorry there is no STAR on the stage tonight! _E_ "I succeeded by saying what everyone else is thinking." @Joan_Rivers _E_ Our law enforcement officers deserve our appreciation for the incredible job they do. Video: __HTTP__ __HTTP__ _E_ "It's not that I'm so smart it's just that I stay with problems longer." Albert Einstein _E_ I am a registered Republican. __HTTP__ With @MittRomney as the nominee we can defeat @BarackObama. _E_ Democrats don't want massive tax cuts how does that win elections? Great reviews for Tax Cut and Reform Bill. _E_ Our country's debt crisis cannot be solved by tax increases. We must cut government spending. _E_ Central America's tallest building @TrumpPanama's sleek design evokes a majestic sail fully deployed in the wind __HTTP__ _E_ Don't let them build a wind turbine in your backyard (or near your house). It will destroy your property value. _E_ More poll results from last nights Commander in Chief Forum. #AmericaFirst #TrumpTrain __HTTP__ _E_ Appreciate the congrats for being right on radical Islamic terrorism I don't want congrats I want toughness & vigilance. We must be smart! _E_ Great to see @MittRomney being well received in Poland __HTTP__ The Poles understand the value of freedom through strength _E_ Dems had a very good and professional convention. The Republicans must be smart and tough and fast! _E_ So many people who know nothing about me are commenting all over T.V. and the media as though they have great D.J.T. insight. Know NOTHING! _E_ California shooting looks very bad. Good luck to law enforcement and God bless. This is when our police are so appreciated! _E_ "If you like your healthcare plan you can keep it." = "I was born in Hawaii." 
_E_ Another false story this time in the Failing @nytimes that I watch 4 8 hours of television a day Wrong! Also I seldom if ever watch CNN or MSNBC both of which I consider Fake News. I never watch Don Lemon who I once called the "dumbest man on television!" Bad Reporting. _E_ I win awards for speaking but the enemies either won't comment or will say only bad...leave Clint alone! _E_ Thank you @EricTrump! __HTTP__ _E_ No investor would be stupid enough to pour their money into the bottomless Vattenfall pit. They totally gave up __HTTP__ _E_ 'It's just a 2 point race Clinton 38% Trump 36%' __HTTP__ _E_ The American work ethic is what led generations of Americans to create our once prosperous nation. (cont) __HTTP__ _E_ He @MittRomney had another impressive win last night in Illinois. His delegate lead is insurmountable. It is (cont) __HTTP__ _E_ .@TrumpChicago is the Windy City's sole skyscraper to feature a 4 star hotel 4 star restaurant & spa __HTTP__ _E_ Radical Islamic Terrorism must be stopped by whatever means necessary! The courts must give us back our protective rights. Have to be tough! _E_ Over $1T in annual deficit spending and adding over $6T to the debt for what? May jobless numbers are horrendous. The great Obama recovery. _E_ A working dinner tonight with Prime Minister Abe of Japan and his representatives at the Winter White House (Mar a Lago). Very good talks! _E_ Where were all the @VanityFair exposes on When Rev. Wright disciples go to Washington? Sad! _E_ Jonah Goldberg @JonahNRO of the once great @NRO #National Review is truly dumb as a rock. Why does @BretBaier put this dummy on his show? _E_ Shoplifting is a very big deal in China as it should be (5 10 years in jail) but not to father LaVar. Should have gotten his son out during my next trip to China instead. China told them why they were released. Very ungrateful! _E_ I hope everyone had a great Memorial Day! _E_ Americans never quit. 
General Douglas MacArthur _E_ I'll be going to the Old Post Office Building on Pennsylvania Avenue in D.C. today. Will create one of world's great hotels. Lots of jobs! _E_ 7 of 10 Americans prefer 'Merry Christmas' over 'Happy Holidays' __HTTP__ No surprise. _E_ I wish tonight's debate would cover more than foreign policy. _E_ RT @RightlyNews: What's a high priced Clinton attorney doing representing a low level IT staffer for the Democrats? @jessebwatters on t... _E_ Check out Trump International Hotel & Tower New York spectacular! __HTTP__ _E_ If you're interested in 'balancing' work and pleasure stop trying to balance them. Instead make your work more pleasurable. _E_ The Miss Universe Pageant raked in some great ratings! A great job by everyone. _E_ ...Senators should focus their energies on ISIS illegal immigration and border security instead of always looking to start World War III. _E_ One of @GolfWorldUS' top public courses @TrumpGolfLA's course stands as a testament to the greatness of golf __HTTP__ _E_ Read this @BarackObama's birth certificate cannot survive judicial scrutiny because of phantom numbers __HTTP__ _E_ Looks like yet another terrorist attack. Airplane departed from Paris. When will we get tough smart and vigilant? Great hate and sickness! _E_ It's often necessary to boast but it's even better if others do it for you." – Think Like A Billionaire _E_ Great new poll. Thank you Texas! #VoteTrump #MakeAmericaGreatAgain __HTTP__ _E_ "On 1/20 the day Trump was inaugurated an estimated 35000 ISIS fighters held approx 17500 square miles of territory in both Iraq and Syria. As of 12/21 the U.S. military estimates the remaining 1000 or so fighters occupy roughly 1900 square miles..." via @jamiejmcintyre __HTTP__ _E_ Statement on House Passage of Kate's Law and No Sanctuary for Criminals Act. 
__HTTP__ _E_ ICYMI you can watch my full press conference with @SteveKingIA on @shanevanderhart's @CaffThoughts __HTTP__ _E_ "Even such traits as who makes the most eye contact in conversation can be an indication of who seeks to dominate." Think Like A Billionaire _E_ What a great day it was yesterday showing the public Trump Links at Ferry Point. I took over a disaster and made it GREAT! Good job to all! _E_ Via @thestate by @andyshain: "Donald Trump joins other 2016 prospects speaking at SC Tea Party Convention" __HTTP__ _E_ ...Bad decisions can be devastating. _E_ "In N.H. Trump says his business experience would play well in government" __HTTP__ via @ConMonitorNews by @AP _E_ My prayers and condolences to the families of the victims of the terrible Florida shooting. No child teacher or anyone else should ever feel unsafe in an American school. _E_ A vote to CUT TAXES is a vote to PUT AMERICA FIRST. It is time to take care of OUR WORKERS to protect OUR COMMUNITIES and to REBUILD OUR GREAT COUNTRY! __HTTP__ __HTTP__ _E_ Under @MittRomney Bain had an 80% success rate with annual returns of over 50%. Under @BarackObama America has added over $6T in debt. _E_ Obamacare has to be killed now before it grows into an even bigger mess as it inevitably will. #TimeToGetTough _E_ Trump: Something 'mentally wrong' with Weiner __HTTP__ via @hilltube by @DanielStrauss4 _E_ They changed the name global warming to climate change because the concept of global warming just wasn't working! _E_ New National Rasmussen Poll: __HTTP__ _E_ "Winning takes talent to repeat takes character." John Wooden _E_ Just as I have been predicting for years Iraq will fall to the people that hate the U.S. the most just outside of Baghdad. Keep the oil _E_ Crazy Joe Scarborough and dumb as a rock Mika are not bad people but their low rated show is dominated by their NBC bosses. Too bad! 
_E_ What people don't know about Kasich he was a managing partner of the horrendous Lehman Brothers when it totally destroyed the economy! _E_ I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss! _E_ Good morning Wisconsin! The polls are now open! #VoteTrump today & we will MakeAmericaGreatAgain! __HTTP__ _E_ Great decision by @SpeakerBoehner in placing @TGowdySC as chairman of the Benghazi select committee. Gowdy is a seasoned prosecutor. _E_ "Discovery breeds discovery as in success breeds success. Questions are thoughts with a quest." – Think Like a Champion _E_ ICYMI my speech this past Monday at the South Carolina Tea Party Convention in Myrtle Beach __HTTP__ #SCTeaParty15 _E_ Benghazi is now a full blown training center for jihadists __HTTP__ Congratulations to the Obama administration. _E_ Do you think Crooked Hillary will finally close the deal? If she can't win Kentucky she should drop out of race. System rigged! _E_ Congratulation to Adam Scott and all of the folks at Trump National Doral on producing a really great WGC Tournament. Amazing finish! _E_ New York State's lightweight A.G. is driving business & jobs out of N.Y. Look into his past he shouldn't even be allowed to hold office! _E_ "He who defends everywhere defends nowhere." – Sun Tzu _E_ When people find out how bad a job Scott Walker has done in WI they won't be voting for him. Massive deficit bad jobs forecast a mess. _E_ Join me in Manheim Pennsylvania on Saturday at 7pm! #TrumpRallyTickets: __HTTP__ __HTTP__ _E_ When you have exhausted all possibilities remember this you haven't. Thomas A. 
Edison _E_ Report raises questions about 'Clinton Cash' from Russians during 'reset' __HTTP__ _E_ #CelebApprentice fans watch today's #trumpvlog __HTTP__ to find out about our new App __HTTP__ _E_ A family in Las Vegas just stopped a violent home invasion by shooting one of the perpetrators the other fled and will be captured. Great! _E_ .....Guy in front asked for picture said he was the biggest fan never saw the guy in back. _E_ I spoke with Fox and Friends today watch here __HTTP__ _E_ The Trump Signature Collection exclusively available @Macys offers high end fashion for men. Dress your best. __HTTP__ _E_ #sweepstweet @lisalampanelli wins $100000 for her charity and that's a nice gift. _E_ The Prayer Breakfast was used by @BarackObama to say that the Bible commands higher income taxes. That's not the way it is! _E_ Today is April 15th Obama's favorite day of the year. T E A. TAXED ENOUGH ALREADY! _E_ The attack on Mosul is turning out to be a total disaster. We gave them months of notice. U.S. is looking so dumb. VOTE TRUMP and WIN AGAIN! _E_ A woman is suing one of my businesses despite the fact that she loved her classes. Our legal system is a mess. Watch __HTTP__ _E_ My daughter @IvankaTrump will be on @Greta tonight at 7pm. Enjoy! __HTTP__ _E_ Great story on @TrumpToronto in @globeandmail about our new Sky and Wellness Suites: __HTTP__ _E_ Mr. President tell Iran to immediately free the CHRISTIAN PASTOR as a sign of good faith & if they refuse break off talks big sanctions _E_ Re Negotiation: Know what you want & think about what the other side wants. Know where they're coming from & do not underestimate them. _E_ Donald Trump CPAC Speech: U.S. Is Run By 'Very Stupid People' __HTTP__ via @HuffPostPol by @elisefoley _E_ For all of the morons who have been complaining about my comment on sexual assault & rape in the military (cont) __HTTP__ _E_ Celebrity Apprentice on in 5 minutes on CNBC it's great! 
_E_ I think Joe Biden made correct decision for him & his family. Personally I would rather run against Hillary because her record is so bad. _E_ No complaints but how many people would be watching these really dumb but record setting debates if I wasn't in them? Interesting question! _E_ Thank you New York! I love you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ The news about our beautiful Miss Venezuela Monica Spear is devastating to all who knew her. A spectacular woman she will be missed. _E_ Be tough be smart be personable but don't take things personally. That's good business. _E_ .@BenFergusonShow just watched you on @CNN. Thank you for your nice comments. _E_ Congratulations to Bret Michaels the new Celebrity Apprentice. Bret's a true champion all of us were happy to see him and to see him win! _E_ My @foxandfriends int. on Benghazi cover up the ObamaCare mess & firing @TheRealMarilu on @ApprenticeNBC __HTTP__ _E_ Enthusiasm is a vital element in individual success. ― Conrad Hilton _E_ While everyone is waiting and prepared for us to attack Syria maybe we should knock the hell out of Iran and their nuclear capabilities? _E_ HAPPY THANKSGIVING! __HTTP__ _E_ I will be doing @foxandfriends at 8.00 a.m. _E_ Via @TheYBF: "@msvivicafox Attends A Private Screening + Donald Trump DONATES $25K To @peachespulliam'S Kamp Kizzy" __HTTP__ _E_ .@CNBC Titans: Donald Trump' is available to live stream on @netflix and @hulu. Watch! _E_ Main Street is BACK! Strongest Holiday Sales bump since the Great Recession beating forecasts by BILLIONS OF DOLLARS. __HTTP__ _E_ Today it was my honor to welcome President Nursultan Nazarbayev of Kazakhstan to the @WhiteHouse! __HTTP__ _E_ Just returned from Ireland Scotland and Dubai. Amazing trip great places but always good to be back. _E_ .@BarackObama's Super PAC has continually called @MittRomney a murderer __HTTP__ Ironic since Obama is destroying Medicare. 
_E_ Joy Behar who was fired from her last show for lack of ratings is even worse on @TheView. We love Barbara! _E_ MUST READ – via @IBDinvestors: "VA Scandal Grows As Bonuses Went To Worst Hospitals" __HTTP__ _E_ I believe @BarackObama made a deal with the Saudis to increase oil production until after the election. Then (cont) __HTTP__ _E_ China is expanding its military bases abroad. We must expand our naval fleet. Now is no time for defense cuts. (cont) __HTTP__ _E_ A great honor to spend time with our brave HEROES at the @USMC Air Station Yuma. THANK YOU for your service to the United Staes of America! __HTTP__ _E_ Big win in Montana for Republicans! _E_ Just as I predicted ObamaCare is a complete disaster which is failing on its own. May never be fully implemented. _E_ .@TrumpNationalNY a great place! __HTTP__ @TrumpGolf _E_ I never gave anybody hell! I just told the truth and they thought it was hell. Harry S. Truman _E_ It's Tuesday. How many terrible predictions and advice will Karl 1.6% Rove make today? _E_ Entrepreneurs: Ask yourself: What can I provide that does not yet exist? Be open to new ideas. Be innovative! _E_ Big announcement by Ford today. Major investment to be made in three Michigan plants. Car companies coming back to U.S. JOBS! JOBS! JOBS! _E_ ...Never let yourself be pushed around but treat the good folks great. _E_ Loved doing the debate last night on @CNBC. Check out all of the polls! Everyone agrees that Harwood bombed! _E_ War on the families. Price of electricity hit record high in October __HTTP__ Terrible especially during holiday season. _E_ $5 a gallon gas and we have yet to approve the Keystone XL Pipeline. OPEC is laughing at us. _E_ Flashback from October 2013: "Donald Trump demands larger iPhone screen" __HTTP__ You're welcome! Apple listened. _E_ .@HillaryClinton has been a foreign policy DISASTER for the American people. I will #MakeAmericaStrongAgain #Debate... __HTTP__ _E_ Looks like another great day for the Stock Market. 
Consumer Confidence is at Record High. I guess somebody likes me (my policies)! _E_ The Blue Monster at Trump National Doral. __HTTP__ _E_ The U.S. is going to substantialy reduce taxes and regulations on businesses but any business that leaves our country for another country _E_ I WILL BE ON @foxandfriends AT 7:30 NOW! _E_ They should have got Darrell Hammond as the Donald Trump impersonator. #CelebApprentice _E_ Great article by Chris Ruddy @Newsmax_Media: @AnnDRomney and Jackie's Example __HTTP__ _E_ Watch my speech at CPAC in Washington DC yesterday ... __HTTP__ _E_ Read about how this hotel came into being in my book "Never Give Up"—it's quite a story. #CelebApprentice @TrumpNewYork _E_ Obama's speech on climate change was scary. It will lower our standard of living and raise costs of fuel & food for everyone. _E_ One hit wonder @DannyZuker I notice you are not disputing all of the failures that I said you had. Let's talk about it! _E_ #CrookedHillary "was at center of negotiating $12M commitment from King Mohammed VI of Morocco" to Clinton Fdn. __HTTP__ _E_ My @USATOpinion piece: Trump: I don't need to be lectured __HTTP__ _E_ We should be building up our military and our missile defense systems to their highest levels ever. Must be very strong to prosper & survive _E_ As hard as it is to believe sexting pervert Anthony Weiner is leading in some polls for Mayor of NYC. _E_ .@JordanSpieth Great job you are a true champion! See you soon. _E_ will only get higher. Car companies and others if they want to do business in our country have to start making things here again. WIN! _E_ The time has come to take action to IMPROVE access INCREASE choices and LOWER COSTS for HEALTHCARE! __HTTP__ __HTTP__ _E_ Prayers and condolences to all of the families who are so thoroughly devastated by the horrors we are all watching take place in our country _E_ RT @billoreilly: FNC dominated ratings last night. MSNBC disaster demonstrating folks don't trust the network. 
__HTTP__ toni... _E_ When Ted Cruz quits the race and the field begins to clear I will get most of his votes no problem! _E_ Perhaps this is the kind of thinking we need in Washington ... __HTTP__ _E_ Wow! Letterman show @Late_Show won the ratings last night big time and guess who was his guest? DJT _E_ Congratulations to Obama and the @DNC. The federal deficit has topped $1T for a fourth year in a row __HTTP__ Nice work! _E_ Winning isn't everything but the will to win is everything. Vince Lombardi _E_ Today will be a great day at work have only one word in mind VICTORY! _E_ Look here's the deal: @BarackObama has been a total disaster. He has spent this country into the ground and destroyed jobs #TimeToGetTough _E_ Dine with The Donald and Mitt __HTTP__ _E_ Thank you @RepReneeEllmers! __HTTP__ __HTTP__ _E_ Will be on @oreillyfactor tonight. Signing a copy of Crippled America for Bill! __HTTP__ _E_ The evidence continues to mount against lightweight @AGSchneiderman. It is time for JCOPE and Moreland Commissions to act. _E_ "Being true to yourself...will give you a lot of power over any negatives thrown your way." – Midas Touch _E_ April is Autism Awareness Month join me in raising awareness get your "Light It Up Blue" sign here! #LIUB __HTTP__ _E_ Howard Stern will do a great job on @America'sGotTalent. He's very smart and really gets what talent is. @HowardStern _E_ Via @AmSpec by @JeffJlpa1: Exclusive: Trump Says Obama Shows 'Total Desperation' on Iran __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ .@antbaxter Only the stupid @BBC would air your garbage—no wonder they are in such deep trouble. _E_ A lot of people have imagination but can't execute you have to execute with the imagination. Donald J. Trump __HTTP__ _E_ Numerous polls have me beating Hillary Clinton. In a race with her voter turnout will be the highest in U.S. history I get most new voters! _E_ RT @mike_pence: History teaches us that weakness arouses evil. 
America needs to be strong on the world stage. #VPDebate __HTTP__ _E_ .@Morning_Joe is so off on Iowa which I am leading big in new @CNN poll. I will win Iowa. Also I beat Hillary easily! _E_ Well the year has officially begun. I have many stops planned and will be working very hard to win so that we can turn our country around! _E_ .@jessebwatters Watching your show from Arizona where we just had a big rally. It is fantastic everybody loves it!#MakeAmericaGreatAgain _E_ Obama and the Democrats are laughing at the deal they just made...the Republicans got nothing! _E_ Obama's budget spends $2B making our navy ships algae powered __HTTP__ The strong world is laughing at us. _E_ "My view is that not only has Trump been vindicated in the last several weeks about the mishandling of the Dossier and the lies about the Clinton/DNC Dossier it shows that he's been victimized. He's been victimized by the Obama Administration who were using all sorts of....... _E_ I've just released my position papers on The Second Amendment. __HTTP__ _E_ Fast and Furious put semi automatics in the hands of Mexican drug lords that killed Americans @BarackObama should answer all questions. _E_ China has been unfairly subsidizing the export of cars & auto parts. I've been saying this for 3 years... _E_ "@Algemeiner Honors @Joan_Rivers Donald Trump @YuliEdelstein at Second Annual 'Jewish 100′ Gala" __HTTP__ via @Algemeiner _E_ Don Jr. will present the Keynote Address in South Africa on Dec. 1st @TheInvestShow _E_ I look forward to being in Lowell Massachusetts today. I hear a very big crowd is expected we will have lots of fun! _E_ Only 15 days until ObamaCare is implemented. Congress must waive the monstrosity for regular Americans. Why should they be punished? _E_ My sons Don and Eric are on @foxandfriends now 7:35. Great kids enjoy! _E_ Getting the support of @DanaWhite of UFC means a lot. A total winner who has done an amazing job. Just ordered his fight to watch tonight! 
_E_ Shining over Fifth Avenue @TrumpTowerNY (a NY icon) offers a full service restaurant bar cafe ice cream parlor and Gucci. _E_ "It's a tough game and you never want to take that aspect out of the game." – @NYRangers Stanley Cup Champion Mark Messier _E_ Better off? The $16T US debt works out to $136260 per household a 50% increase since @BarackObama took office. _E_ Tune in tonight at 9 pm on TV One for The Ultimate Merger starring the one and only Omarosa and twelve brave bachelors ... _E_ Tune in for #TrumpTuesday on @SquawkCNBC tomorrow morning. _E_ .@foxandfriends in five minutes. Enjoy! _E_ Sgt.Thamooressi has been held in Mexico for 115 Days. Mexico has zero respect for our border & our servicemen. Boycott! #freeourmarine _E_ The Veterans of our country have been treated like third class citizens for many years... _E_ Many people have said I'm the world's greatest writer of 140 character sentences. _E_ "Build up your weaknesses until they become your strong points." Knute Rockne _E_ My @foxandfriends interview discussing Chuck Hagel nomination Republicans terrible deal making & where we go next __HTTP__ _E_ Interview with David Muir of @ABC News in 10 minutes. Enjoy! _E_ Via @BreitbartNews: GAME ON: TRUMP RESPONDS TO JEB __HTTP__ _E_ Remember save your evening to watch Celebrity Apprentice tonight at 9 increased to a full two hours great episode watch Gary B. _E_ "Age is whatever you think it is. You are as old as you think you are." @MuhammadAli _E_ 40 days until the election. Crunch time. @MittRomney must stay on offense and take the fight to Obama. _E_ Announced 3 years ago that Scottish course would close in winter like Kingsbarns and others too cold. 
_E_ The shirts and ties at Macy's are so good beautiful and do so well that guys like the one that sued me wrongly want a piece l kicked his ass _E_ RT @APCampaign:Trump to Obama: $5 million donation to charity if you release passport and college records __HTTP__ #Election2012 _E_ See you in Arizona on Friday and Saturday. __HTTP__ _E_ Learn more about @TrumpIntRealty's @Mgriffithnyc and some of our spectacular real estate in NYC __HTTP__ _E_ How many more of our soldiers have to be shot by the Afghanis they are training? Let's get the hell out of there and focus on U.S. _E_ Bernie Madoff and Tony La Russa in today's #trumpvlog... __HTTP__ _E_ RT @RealBHorowitz: @VinceMcMahon @realDonaldTrump @WWE My two favorite billionaires! _E_ By @BarackObama's design the middle class will be hit with record taxes under ObamaCare through inflation __HTTP__ REPEAL! _E_ China is going to complete 59 new theme parks by 2020 over $23B in expansion. That would take over 100 years in our country. _E_ Congratulations to @IvankaTrump on being named @FoxNewsSunday Power Player of the Week. Ivanka is doing a great job w/ DC Post Office. _E_ Florida Power & Light has disgusting rotting utility poles outside Doral in Miami. They should put in new ones or will be sued. _E_ Yet another weak hit by a candidate with a failing campaign. Will Jeb sink as low in the polls as the others who have gone after me? _E_ The people of Scotland have spoken—a great decision. I wish @AlexSalmond well & look forward to playing golf with him at Aberdeen! _E_ .@EdGoeas thank you for your support tonight on @JudgeJeanine. _E_ .@FrankLuntz is a total clown. Has zero credibility! @FoxNews @megynkelly _E_ Congratulation to Roy Moore and Luther Strange for being the final two and heading into a September runoff in Alabama. Exciting race! _E_ .@marthamaccallum Martha great interview with my son @EricTrump smart tough & professional. Thank you! @FoxNews _E_ My golf club @TrumpNationalNY in Westchester a great place! 
__HTTP__ _E_ #sweepstweet @3nVMusic I very much rely on my own 'take' of the situation and people involved. My instincts (cont) __HTTP__ _E_ Congressman John Lewis should spend more time on fixing and helping his district which is in horrible shape and falling apart (not to...... _E_ Very few people read the National Review because it only knows how to criticize but not how to lead. _E_ By the end of this year China will be the number one economic power on earth and the U.S. will owe 20 trillion dollars much of it to China! _E_ Amtrak crash near Philadelphia train derails many hurt some badly. Our country has horrible infrastructure problems. Pols can't solve! _E_ Great job Kevin we are all proud of you! __HTTP__ _E_ Don't miss my Fabulous World of Golf now in its second season on Golf Channel beginning tonight at 9 pm ET __HTTP__ _E_ #MerryChristmas __HTTP__ _E_ The GOP primary schedule is a disaster. Not enough time. _E_ What do we get from our economic competitor South Korea for the tremendous cost of protecting them from North Korea? NOTHING! _E_ Be focused be disciplined be patient there are very few cases of instant gratification. _E_ Freedom is never more than one generation away from extinction. Ronald Reagan #MakeDCListen #DefundObamaCare _E_ If the U.S. does not win this case as it so obviously should we can never have the security and safety to which we are entitled. Politics! _E_ Nominating Chuck Hagel for SOD is the wrong move for Obama. He doesn't need the fight. Too much political capital will be wasted. _E_ Entrepreneurs: Believe in yourself. If you don't no one else will either. _E_ .@ericbolling did a fantastic job on O'Reilly tonight. Way to go Eric! _E_ Entrepreneurs: Set the example and you'll be a magnet for the right people. Great leaders determine the teams they assemble. _E_ When I look at all of the money the special interests and lobbyists are giving to candidates beware the candidates are mere puppets $$$$! 
_E_ Wow really nice and unexpected from Ed Schultz. Thank you Ed! @edshow __HTTP__ _E_ I would like to wish everyone A HAPPY AND HEALTHY NEW YEAR. WE MUST ALL WORK TOGETHER TO FINALLY MAKE AMERICA SAFE AGAIN AND GREAT AGAIN! _E_ SEAL who shot Bin Laden is unemployed & can't feed his family __HTTP__ Everyone can get welfare but this SEAL can't eat! _E_ Heading over to the Miss USA Pageant. The young women participating are amazing and accomplished. Competition is very tough. ENJOY THE SHOW! _E_ Great meeting with @NaghmehAbedini the wonderful wife of Christian Pastor Saeed who is in Iranian prison. #savesaeed __HTTP__ _E_ Just arrived in Las Vegas for a packed house speech tomorrow. Big poll results today Leading big everywhere. MAKE AMERICA GREAT AGAIN! _E_ I'll bet Obama goes down just like Washington because he doesn't use our(this country's) best people to win. _E_ Heading to South Carolina really big crowd! Will be back in New Hampshire tomorrow.#MakeAmericaGreatAgain _E_ .@seanhannity Carly whose campaign is dead is making false statements about me in order to salvage hope! Sad. _E_ "There is a point in every contest when sitting on the sidelines is not an option." Dean Smith _E_ I got George Zimmerman right watch __HTTP__ _E_ Romney campaign used me in 6 primary states and won every one they should have used me in Florida and Ohio & he would be President. _E_ .@AP and @HuffingtonPost should change their fraudulent story to say THAT I DROPPED @NBC & The Apprentice to run for President! _E_ For the sake of New York City all recent sexting victims of Anthony 'Carlos Danger' Weiner should come forward. _E_ Lucky to have been chosen for the purchase of the magnificent The Point Lake and Golf Club on Lake Norman in (cont) __HTTP__ _E_ "President Trump is not getting the credit he deserves for the economy. Tax Cut bonuses to more than 2000000 workers. Most explosive Stock Market rally that we've seen in modern times. 
18000 to 26000 from Election and grounded in profitability and growth. All Trump not 0... _E_ RT @Team_Trump45: @realDonaldTrump We won. Move on. __HTTP__ _E_ THANK YOU for another wonderful evening in Washington D.C. TOGETHER we will MAKE AMERICA GREAT AGAIN __HTTP__ _E_ "Failure has a thousand explanations. Success doesn't need one." Alec Guinness _E_ I can't believe that the judge in the Oscar Pistorious case has found him not guilty of murder. No one has been more guilty since O.J.! _E_ I've always defended @jayleno but he never defends me. He's not a loyal person & I now understand why everybody dumped him. Jay sucks! _E_ The Republican Establishment has been pushing for lightweight Senator Marco Rubio to say anything to hit Trump.I signed the pledge careful _E_ Save your time @rosie and focus on your horrible ratings and don't mention my name on talk shows anymore or you will get more of the same. _E_ I do not know the reporter for the @nytimes or what he looks like. I was showing a person groveling to take back a statement made long ago! _E_ ...is all of the illegal leaks of classified and other information. It is a total witch hunt! _E_ Sleepy eyes Chuck Todd a man with so little touch for politics is at it again.He could not have watched my standing ovation speech in N.C. _E_ Thank you New Hampshire! #MakeAmericaGreatAgain #FITN __HTTP__ _E_ Why did @BarackObama and his family travel separately to Martha's Vineyard? They love to extravagantly spend on the taxpayers' dime. _E_ ...the entire World WAS laughing and taking advantage of us. People like liddle' Bob Corker have set the U.S. way back. Now we move forward! _E_ The Fake News is working overtime. As Paul Manaforts lawyer said there was no collusion and events mentioned took place long before he... _E_ The protesters blocked a major highway yesterday delaying entry to my RALLY in Arizona by hours and the media blames my supporters! 
_E_ Congratulations to Emmanuel Macron on his big win today as the next President of France. I look very much forward to working with him! _E_ Not only did Egypt destroy its civil society w/ the Muslim Brotherhood now it is a complete economic mess __HTTP__ _E_ Major grudge match this weekend between @nyjets & @Patriots. I have a dilemma I am good friends w/ both Woody (cont) __HTTP__ _E_ Building a brand is like building a skyscraper the foundation comes first. The bigger the building the deeper the foundation needs to be _E_ The final Wisconsin vote is in and guess what we just picked up an additional 131 votes. The Dems and Green Party can now rest. Scam! _E_ .@CNN Poll just came out amazing numbers for those who want to MAKE AMERICA GREAT AGAIN! TRUMP 36% a 20 point lead over 2nd place. Thanks. _E_ Join me at Clemson University on Wednesday February 10th! #MakeAmericaGreatAgain __HTTP__ _E_ The YouTube of the 2012 Miss USA contestants @GiulianaRancic and me singing Call Me Maybe __HTTP__ has over 2M views. _E_ Trump Invitational at Mar a Lago was a huge success. Raised millions for charity and was the 1st equestrian event held in Palm Beach. _E_ Thank you Henderson NV. This is a MOVEMENT like never seen before! Watch some of the rally via my Facebook page:... __HTTP__ _E_ Entrepreneurs: Everything starts with you. Realize that you're in charge. Whatever happens you're responsible. _E_ The Federal government has increased its employment by 12% since 2007. We need to stop replacing retired workers unless position is needed. _E_ During the GOP convention CNN cut away from the victims of illegal immigrant violence. They don't want them heard. __HTTP__ _E_ Big progress being made in ridding our country of MS 13 gang members and gang members in general. MAKE AMERICA SAFE AGAIN! _E_ Somebody got rich building the ObamaCare website which doesn't even come close to working where has the money gone? _E_ Working hard to get the Olympics for the United States (L.A.). 
Stay tuned! _E_ Just out but lightly reported: Fewest jobless claims since 1973 show firm U.S. Job Market Lowest since March 1973. @bpolitics _E_ People do business with those people they like and trust. Ralph J. Roberts Founder of Comcast _E_ So nice when media properly polices media. Thank you @BreitbartNews. __HTTP__ _E_ Weiner and Spitzer are on top of the latest polls. A sad day for the greatest city on earth! They will spend lots of time together. _E_ While I greatly appreciate the efforts of President Xi & China to help with North Korea it has not worked out. At least I know China tried! _E_ We need strong tough and brilliant leadership now more than ever! MAKE AMERICA GREAT AGAIN! _E_ Together we are going to MAKE AMERICA GREAT AGAIN!#AmericaFirst __HTTP__ _E_ Had a fantastic dinner last night at Quattro in the Trump SoHo Hotel. It's already one of the hottest new restaurants in the city. _E_ Raising the capital gains tax in this fragile economic time is the dumbest thing Washington could do. So they will probably do it. _E_ Just heard that crazy and very dumb @morningmika had a mental breakdown while talking about me on the low ratings @Morning_Joe. Joe a mess! _E_ Thank you Kansas! Thousands of people inside and thousands outside who couldn't get into the hall. Really amazing! #CaucusForTrump _E_ Back by popular demand TV personality @TheRealMarilu returns in the record 13th season of 'All Star' @CelebApprentice. Marilu does great! _E_ For those of you that have conveniently fotgotten dummy Jon Stewart is a bad filmmaker. His last effort was a real bomb (in all ways)! _E_ Our heartfelt prayers go out to our fellow Americans suffering from the storms & tornadoes. _E_ My @foxandfriends interview discussing how @BarackObama should release his college applications & records __HTTP__ _E_ Really bad ratings for Lawrence O'Donnell on MSNBC O'Reilly is killing him! 
_E_ Hope you enjoy the story in the highly respected Real Estate Weekly __HTTP__ _E_ See you in D.C. tomorrow at 1:00 P.M. at the Capitol to protest the horribly negotiated deal with Iran. Really sad! _E_ Is it even slightly possible that Jodi Arias could be set free wow what a miscarriage of justice that would be! _E_ Five people killed in Washington State by a Middle Eastern immigrant. Many people died this weekend in Ohio from drug overdoses. N.C. riots! _E_ I just arrived at Trump National Doral in Miami where I'll spend the day checking work just completed by contractors. This place is amazing! _E_ Mitt Romney who was one of the dumbest and worst candidates in the history of Republican politics is now pushing me on tax returns. Dope! _E_ about that...Those Intelligence chiefs made a mistake here & when people make mistakes they should APOLOGIZE. Media should also apologize _E_ Thanks to @SenateMajLdr McConnell and the @SenateGOP we are appointing high quality Federal District... _E_ 2. The celebrity with the highest totals by Tuesday noon ET gets an extra donation to his or her charity... _E_ The Democrats want to shut government if we don't bail out Puerto Rico and give billions to their insurance companies for OCare failure. NO! _E_ Just arrived in New Hampshire. Thank you to all of my supporters!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Congratulations to the dedicated professionals of the USSS as they celebrate their 152nd anniversary. Thank you! __HTTP__ __HTTP__ _E_ I made a fortune in Atlantic City got out years ago (great timing) and havn't been back in many years. I have NOTHING to do with A.C. _E_ Thank you Rep. @MarshaBlackburn! __HTTP__ __HTTP__ _E_ I would gain a whole new respect for President Obama if he would say look we made a big mistake sorry! No more lies or deception. _E_ RT @ABCNewsRadio: Global fund championed by Ivanka Trump to help women entrepreneurs begins operations __HTTP__ __HTTP__ _E_ Thank you to our U.S. 
Navy for protecting our country both in times of peace & war. Together WE WILL MAKE AMERICA... __HTTP__ _E_ With the high prices of corn to continue expect even more inflation on the price of food. _E_ Happy Birthday to my wonderful daughter @IvankaTrump. _E_ In politics and in life ignorance is not a virtue. This is a primary reason that President Obama is the worst president in U.S. history! _E_ Watch Gary B tonight on Celebrity Apprentice some really crazy things happen! _E_ Because of me the Republican Party has taken in millions of new voters a record. If they are not careful they will all leave. Sad! _E_ Crooked Hillary Clinton made up facts about me and forgot to mention the many problems of our country in her very average scream! _E_ It's amazing how many people still come up to me to thank me for 'The Art of The Deal.' The book has changed a lot of lives. _E_ Obama has blocked ICE officers and BP from doing their jobs. That ends when I am President! _E_ What did you think of @THEGaryBusey's mechanical dog idea? _E_ My kids never negatively discussed my criticism of President Obama with me or anyone...it's not in their nature! _E_ The least number of hurricanes in the U.S. in decades. So they change global warming (too cold) to climate change now what will they call it _E_ Tonight's episode of The Apprentice has a big surprise at the top of the show don't miss it! 10 p.m. on NBC. _E_ If ObamaCare is so amazing then why is Obama delaying significant parts of the bill before the election? #MakeDCListen _E_ The voting booth process was a total disaster—it could and should be much better and more efficient—tremendous room for error! _E_ Within the heart of beautiful Somerset County Trump Nat'l Bedminster is the proud host of the 2022 @PGAChampionship __HTTP__ _E_ Very proud of Trump Int'l Golf Links in Aberdeen Scotland. Just got the five star award from @VisitScotNews __HTTP__ _E_ Thank you North Carolina! 
#MAGA __HTTP__ _E_ PM Sarah Westcot Williams incompetence should not be rewarded. You should vote for anyone who runs against her—loser! @PrimeMinisterSX _E_ Glad to see RomneyCare/ObamaCare architect Gruber being eviscerated on the Hill today. He should return all taxpayer money he was paid. _E_ #MakeAmericaGreatAgainVideo: __HTTP__ __HTTP__ _E_ North Carolina's most exclusive club @Trump_Charlotte's features @SharkGregNorman designed golf course which fronts the biggest lake in NC _E_ If you accept the expectations of others especially negative ones then you will never change the outcome. Michael Jordan _E_ When will we see @BarackObama's passport records (sealed)? _E_ Thank you for all of the great comments on the debate last night. Very exciting! _E_ "@BrandiGlanville @KenyaMoore Talk @ApprenticeNBC Feud" __HTTP__ via @ChristianPost by Virnelli Mercader _E_ My experience yesterday in Poland was a great one. Thank you to everyone including the haters for the great reviews of the speech! _E_ This very expensive GLOBAL WARMING bullshit has got to stop. Our planet is freezing record low tempsand our GW scientists are stuck in ice _E_ The developer of the Scottish wind monstrosities Vattenfall just laid off 2500 people & has serious financial difficulties. _E_ My lawyers want to sue the failing @nytimes so badly for irresponsible intent. I said no (for now) but they are watching. Really disgusting _E_ .@GiulianaRancic & @nickjonas both did a wonderful job hosting @MissUSA! Everyone loved @JonasBrothers & @DJPaulyD's performances! _E_ .@SteveRattner While I think you should have gone to prison for what you did I guess Obama saved you. But watch – I will win! _E_ Has everyone forgotten our marine who now sits in a Mexican prison because we have a president too incompetent or too lazy to make a call? _E_ Thank you to Donald Rumsfeld for the endorsement. Very much appreciated. Clinton's conduct has been disqualifying. 
_E_ Back by popular demand @GiulianaRancic and @BravoAndy are co hosting tonight's #MissUniverse pageant. They are great! _E_ A great afternoon in Tampa Florida. Thank you! #TrumpPence16 __HTTP__ _E_ Let me sum this up for you... __HTTP__ _E_ People are struggling to get gasoline for their cars we are like a third world country. _E_ All I heard in the SOTU was proposals for more govt more spending and more bureaucrats. Very bad! _E_ Congrats to Jim Lipton and Inside the Actors Studio for winning the Emmy Award for the 250th Episode. I was honored to appear in it. _E_ A Rod's forgery defense is blown __HTTP__ The more he lies the worse it's going to get. @yankees want out of his contract _E_ I will be live on all of the major morning talk shows. Enjoy! _E_ Rick Santorum making a strong point on the Newsmax @iontv debate: @RickSantorum. __HTTP__ _E_ Trump Int'l Palm Beach offers an award winning par 72 Championship measuring 7326 yards. Florida's top course __HTTP__ _E_ Senator Dicky Durbin totally misrepresented what was said at the DACA meeting. Deals can't get made when there is no trust! Durbin blew DACA and is hurting our Military. _E_ Thank you to the great crowd of supporters in Newtown Pennsylvania. Get out & VOTE on 11/8/16. Lets #MAGA! Watch:... __HTTP__ _E_ The World is falling apart around us but we don't have people who know how to play the game. The U.S. is in big trouble no leadership! _E_ Great line from @TheGaryBusey: "I am an angel in an earth suit." Do you agree? #CelebApprentice _E_ Always remember I was the one who got Obama to release his birth certificate or whatever that was! Hilary couldn't McCain couldn't. _E_ Donald Trumps Speech Is a Game Changer. __HTTP__ __HTTP__ _E_ Now that African Americans are seeing what a bad job Hillary type policy and management has done to the inner cities they want TRUMP! _E_ Today's announcement by @BarackObama on immigration was done for reelection. He is using the office of the presidency as a campaign tool. 
_E_ The cast and producers of Hamilton which I hear is highly overrated should immediately apologize to Mike Pence for their terrible behavior _E_ A legitimate article about me... __HTTP__ _E_ Fraud lightweight Marco made a TV ad on TrumpU featuring 2 people who signed these letters: __HTTP__ _E_ I will hold a press conference in the near future to discuss the business Cabinet picks and all other topics of interest. Busy times! _E_ How does Ben Carson survive this problem – really big. Similar story on front page of New York Times. __HTTP__ _E_ Also the more desperate you are to close a deal the less likely it will happen. Stay calm and focused on your ultimate goals. Be smart! _E_ #VoteTrump #SuperTuesday✅Florida✅Illinois✅Missouri✅North Carolina✅Ohio #TrumpTrain __HTTP__ __HTTP__ _E_ All former Bush administration officials should have zero standing on Syria. Iraq was a waste of blood & treasure. _E_ The fact is right now and for the foreseeable future the planet runs on oil and that means we need to get (cont) __HTTP__ _E_ Tonight be sure to watch Melania and Ivanka on Larry King Live for a Celebrity Relief Telethon __HTTP__ _E_ So I speak badly of China but I speak the truth and what do the consumers in China want? They want Trump. (cont) __HTTP__ _E_ Jackie Evancho's album sales have skyrocketed after announcing her Inauguration performance.Some people just don't understand the Movement _E_ Join @mike_pence at the University of Northwestern Ohio tonight at 7pm. Tickets: __HTTP__ _E_ When it comes to Iran's nuclear weapons program here's my advice: Distrust dismantle and verify. @IsraeliPM @netanyahu _E_ Looking forward to keynoting the Nackey S. Loeb School of Communications First Amendment Awards event tomorrow in New Hampshire. _E_ Strange statement by Bob Corker considering that he is constantly asking me whether or not he should run again in '18. Tennessee not happy! _E_ The only place where success comes before work is in the dictionary. 
Vidal Sassoon _E_ Via @BBCNews: "Donald Trump golf clubhouse at Menie approved" __HTTP__ _E_ Thank you to all of the television viewers that made my speech at the Republican National Convention #1 over Crooked Hillary and DEMS. _E_ The Keystone pipeline will create 20000 jobs and lower gas prices. But Obama says No. Dumb. _E_ Our trade deficit continues to rise at record rates __HTTP__ The US manufacturing sector is being (cont) __HTTP__ _E_ Trading Shots with Donald Trump a great article in the Wall Street Journal __HTTP__ _E_ I feel sorry for the 4000 soldiers who are being forced to go the West Africa to fight Ebola. Their families are up in arms. Not trained. _E_ The Fed should not do another 'stimulus.' We can't keep spending our children's future away on waste. _E_ I am having a great time in Iowa at Jack Trice Stadium! Unbelievable people. _E_ .@mike_pence and I will defeat #ISIS. __HTTP__ #VPDebate _E_ RT @OCChoppers: Bike we built for @realDonaldTrump. The gold flakes in the paint out in the sunlight looked amazing! __HTTP__ _E_ Lines for my @CPACnews address start at 7:00AM outside the Potomac Ballroom. ACU has asked that you get there early. #CPAC2013 _E_ Rex Tillerson never threatened to resign. This is Fake News put out by @NBCNews. Low news and reporting standards. No verification from me. _E_ For those on TV defending my use of the word schlonged bc #MSM is giving it false meaning tell them it means beaten badly. Dishonest #MSM _E_ While not at all presidential I must point out that the Sloppy Michael Moore Show on Broadway was a TOTAL BOMB and was forced to close. Sad! _E_ Via @reviewjournal "Event offers glimpse of Trump high life" by Holly Ivy Dore __HTTP__ Great interview @EricTrump! _E_ Does anyone remember this @BillMaher clip when he got fired from ABC in fact fired like a dog! __HTTP__ _E_ .@THEGaryBusey doesn't need instructions. Couch time is more fun. 
#CelebApprentice _E_ Ice storm rolls from Texas to Tennessee I'm in Los Angeles and it's freezing. Global warming is a total and very expensive hoax! _E_ Watch @JudgeJeanine on @FoxNews tonight at 9:00 P.M. _E_ Remembering the fallen heroes on #DDay June 6 1944. __HTTP__ _E_ Thank you! WE will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ "You can't put a limit on anything. The more you dream the farther you get." @MichaelPhelps _E_ Wow despite the switch to Monday night @ApprenticeNBC ratings were higher than even the Sunday night show. _E_ "America is too great for small dreams." — Pres. Ronald Reagan _E_ Many of life's failures are people who did not realize how close they were to success when they gave up. Thomas A. Edison _E_ We need leaders who can negotiate great deals for Americans. It is common sense. Let's Make America Great Again! __HTTP__ _E_ He @BarackObama made a deal with Saudi Arabia to pump the hell out of oil until after the election. Watch what (cont) __HTTP__ _E_ Donald Trump retains national lead in new ABC News/WaPo poll with 37%: __HTTP__ __HTTP__ _E_ I promoted the hell out of Trump Tower but I also had a great product. The Art of the Deal _E_ We need a President who isn't a laughing stock to the entire World. We need a truly great leader a genius at strategy and winning. Respect! _E_ People are smart. They know you can't be for jobs but against those who create them. It doesn't work. (cont) __HTTP__ _E_ Statement on Clinton Foundation: __HTTP__ _E_ These politicians like Cruz and Graham who have watched ISIS and many other problems develop for years do nothing to make things better! _E_ Lawyer Elizabeth Beck was easy for me to beat. Ask her clients if they are happy with her results against me. Got total win and legal fees. _E_ In business you make decisions that are in your best interests. Time for the US gov't to do the same. Let's Make America Great Again! _E_ I will be on @piersmorganlive tonight at 9PM. 
__HTTP__ _E_ Just watched the very incompetent Mitt Romney Campaign Strategist Stuart Stevens. Now I know why Mitt lost so badly. Stevens is a clown! _E_ Looking forward to being interviewed by Sam Clovis tomorrow at @MorningsideEdu in Sioux City at 10AM CT! Let's Make America Great Again! _E_ Via @scotsmandotcom: Via Donald Trump makes plans for Menie Estate marquee __HTTP__ _E_ .@OMAROSA You were fantastic on television this weekend. Thank you so much – you are a loyal friend! _E_ Thank you Charlotte North Carolina! We are going to have an AMAZING victory on November 8th...because this is all... __HTTP__ _E_ Great article in @torontodotcom @DonaldJTrumpJr: the original apprentice __HTTP__ _E_ Congratulations to @NYCParks on quickly repairing the Lasker Rink. Record skaters this past Thanksgiving! _E_ Congratulations on the GREAT job done by POLICE and law enforcement on the California shootings. Give credit where credit is due. _E_ Hey @POTUS WE AGREE!#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_ It's more important to be smart than tough. I know businessmen who are brutally tough but they're not smart." – Think Like A Billionaire _E_ Why do losers & haters always say I wear a "wig" when they know I don't. Like it or not it's all mine—just ask Barbara Walters. _E_ Our thoughts and prayers are w/ the families of the 19 brave firefighters who died fighting the Arizona wildfire. God bless them. _E_ Now Sebelius is "'urging' insurers to cover people who haven't paid" __HTTP__ Complete mess. Enrollment Numbers are a sham. _E_ .@billmaher says that the Iraelis are controlling our government __HTTP__ @HBO. Let's fire him a second time. _E_ Thank you for all of your support! Most importantly we need to get everyone out to VOTE! #VoteTrump2016 __HTTP__ _E_ Arriving at Joint Base Andrews with @SecretaryPerry @SecretaryZinke and @SecPriceMD..... __HTTP__ _E_ It's amazing @hardball_chris has completely lost all connections to reality. He is a complete shill for Obama. 
_E_ John McCain couldn't get him to release "it" and neither could Hillary Clinton—but Donald did! _E_ Will be doing Fox and Friends in 10 minutes at 7.05 enjoy! _E_ USA should take oil from Iraq in repayment for their liberation. __HTTP__ _E_ Thank you to Brad Blakeman on @FoxNews for grading year one of my presidency with an "A" and likewise to Doug Schoen for the very good grade and statements. Working hard! _E_ So if Iran is going to take over the oil I say we take over the oil first by hammering out a cost sharing plan with Iraq. #TimeToGetTough _E_ What is better advice The Art of the Deal or Rules for Radicals ? I know which one @BarackObama prefers. _E_ .@WayneDupreeShow A fantastic guy! _E_ Ex Presidential Pollster Pat Cadell says most voters sick of both parties and their failure. _E_ I would have done even better in the election if that is possible if the winner was based on popular vote but would campaign differently _E_ HILLARY FAILED ALL OVER THE WORLD. #BigLeagueTruth LIBYA SYRIA IRAN IRAQ ASIA PIVOT RUSSIAN RESET BENGHAZI... __HTTP__ _E_ It is really too bad that the scientists studying GLOBAL WARMING in Antarctica got stuck on their icebreaker because of massive ice and cold _E_ He's back! @THEGaryBusey returns to cause even more trouble in the13th season of All Star @CelebApprentice. _E_ Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat __HTTP__ _E_ Thank you Iowa! #Trump2016#MakeAmericaGreatAgain #FITN __HTTP__ _E_ Via @dcexaminer: @realDonaldTrump to speak at @LibertyU __HTTP__ _E_ President Obama has made one mistake after another for a very long time and the people of the United States are just plain tired of it! _E_ ISIS is advancing even against Obama's airstrikes. Obama is disengaged and making the Middle East even more dangerous. _E_ Just got to listen to Rush Limbaugh the guy is fantastic! _E_ I hope voters in Mississippi cast their ballot for @senatormcdaniel. 
He is strong he is smart & he wants things to change in Washington. _E_ This is just not the right time for Jeb Bush. His campaign is in total disarray too much staff being paid way too much money = U.S. GOVT. _E_ What do you think of my suing @billmaher for $5M for charity? He made an offer I accepted. _E_ Do you really believe our once great country can continue to survive with incompetent leadership. The answer is no and we better move fast! _E_ Tell Saudi Arabia and others that we want (demand!) free oil for the next ten years or we will not protect their private Boeing 747s.Pay up! _E_ How is ABC Television allowed to have a show entitled Blackish ? Can you imagine the furor of a show Whiteish ! Racism at highest level? _E_ Re Kerry admitting to "working" for Pastor Abedini's release why has US already released Iranian spies & nuclear scientist? Dumb! _E_ A GREAT day in South Carolina. Record crowd and fantastic enthusiasm. This is now a movement to MAKE AMERICA GREAT AGAIN! _E_ THANK YOU Arkansas! Get out & #VoteTrump on Tuesday. We will MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ _E_ The Trump Tower atrium is such a great place & kept thousands of people warm & safe during the storm thanks staff! _E_ 2016 Republican Primary Morning Consult Poll was just released. TRUMP 32 CARSON 12 BUSH 11 FIORINA 6 RUBIO 5 CRUZ 5. Taken after debate _E_ A winning attitude will put everything in perspective. Keep negative thoughts and people where they belong out of the big picture. _E_ Where's the transparency? Despite Obama's denial @sfchronicle stands by report he just talked with Jeremiah Wright. _E_ Phoenix Convention Center officials did not want to have thousands of people standing outside in the heat so they let them in. A GREAT day! _E_ I would bet that we have many great American technology companies that would build and fix the pathetic ObamaCare website for ZERO dollars! 
_E_ ...Hence I would fully expect Corker to be a negative voice and stand in the way of our great agenda. Didn't have the guts to run! _E_ Just put in ad for a real estate executive: "Hard work low pay mean boss!" _E_ Great time last night in Louisiana. Big and energetic crowd. Go out and vote now polls open. MAKE AMERICA GREAT AGAIN! _E_ Dr. Ben Carson blasted Ted Cruz for deceit and dirty tricks and lies. _E_ Spoke at the Congressional @GOP Retreat in Philadelphia PA. this afternoon w/ @VP @SenateMajLdr @SpeakerRyan. Th... __HTTP__ _E_ EXCLUSIVE: FBI Agents Say Comey 'Stood In The Way' Of Clinton Email Investigation: __HTTP__ _E_ Left Paris for U.S.A. Will be heading to New Jersey and attending the#USWomensOpen their most important tournament this afternoon. _E_ I don't know if President Obama isn't stopping the flights from Ebola torn West Africa because he is stubborn stupid or just doesn't care! _E_ "When you can't make them see the light make them feel the heat." – Ronald Reagan _E_ "Concentration comes out of a combination of confidence and hunger." Arnold Palmer _E_ China is buying our shale and gas fields __HTTP__ & Obama still won't approve Keystone __HTTP__ Pathetic! _E_ Thanks to @pnehlen for your kind words very much appreciated. _E_ Dummy @GoAngelo who had 11 people show up for 15 min. at his "massive" rally at Macy's is trying to get publicity for self by using me _E_ WRONG!@BarackObama capitulated to China by releasing Chen Guangcheng out of the US Embassy __HTTP__ China really has our number _E_ In terms of energy we need to be exploring and developing numerous approaches...and I also include in that (cont) __HTTP__ _E_ On Saturday a great man Elie Wiesel passed away.The world is a better place because of him and his belief that good can triumph over evil! _E_ Most people can learn from their own experiences quite well but many ignore the experiences and lessons of others. 
The Way To The Top _E_ To all young entrepreneurs entering the business world stay positive focused and remember everything has its ups and downs. _E_ .@FoxNews Outgoing CIA Chief John Brennan blasts Pres Elect Trump on Russia threat. Does not fully understand. Oh really couldn't do... _E_ Union Leader refuses to comment as to why they were kicked out of the ABC News debate like a dog. For starters try getting a new publisher! _E_ Just terrible! #Oscars _E_ Celebrity Apprentice tonight at 9 on NBC some amazing things happen! _E_ .@KarlRove is far more to blame for Obama's victory than the Tea Party. _E_ Did @BarackObama try to bribe Rev. Wright with $150K? __HTTP__ I am sure the media will be all over this. _E_ .@realDonaldTrump will do more in the first 30 days in office than Hillary has done in the last 30 years! #Debate... __HTTP__ _E_ Can you imagine if @billmaher said about Obama what he said about me (orangutan etc)—the press would run him out of the country... _E_ Must watch @IvankaTrump interview on @gma discussing #Girlpower __HTTP__ _E_ Will be on @foxandfriends at 7:00 A.M. Enjoy! _E_ It was an honor to welcome @GLFOP to the @WhiteHouse today with @VP Pence & Attorney General Sessions. THANK YOU fo... __HTTP__ _E_ The U.S. has gained more than 5.2 trillion dollars in Stock Market Value since Election Day! Also record business enthusiasm. _E_ The only place success comes before work is in the dictionary. Vince Lombardi _E_ If you can't say great things about yourself who do you think will? Think Like a Champion _E_ LIVE on #Periscope: Tax Plan Press Conference#Trump2016 __HTTP__ _E_ .@ashleycam2883 Re: Libya Hillary took the blame for Obama. _E_ W/ signature Trump amenities 5 star rooms & world class restaurants @TrumpWaikiki brings excellence to Hawaii __HTTP__ _E_ It's the Democrats' total weakness & incompetence that gave rise to ISIS not a tape of Donald Trump that was an admitted Hillary lie! 
_E_ STATEMENT ON MELANIA SPEECH __HTTP__ _E_ In 1999 @BarackObama said that he didn't support Welfare Reform __HTTP__ He just gutted the entire program. _E_ Do you think Putin will be going to The Miss Universe Pageant in November in Moscow if so will he become my new best friend? _E_ US Army Reserve @leezeldin will bring Conservative solutions to DC. Next Tuesday vote for Lee in the NY 1 primary. #zeldinforcongress _E_ .@ralphreed is doing a great job! _E_ For the record I have ZERO investments in Russia. _E_ Just arrived in Scotland. Place is going wild over the vote. They took their country back just like we will take America back. No games! _E_ Chelsea Clinton will be very successful in the world of politics. She's always been a great person a winner. (cont) __HTTP__ _E_ Thank you Arizona! #Trump2016#MakeAmericaGreatAgain #TrumpTrain __HTTP__ __HTTP__ _E_ The $200M in renovations of Trump Int'l Washington DC are on track. The Old Post Office is being transformed into true luxury. _E_ To be a winner think like a winner. Practice positive thinking with reality checks. _E_ Many of life's failures are people who did not realize how close they were to success when they gave up. Thomas A. Edison _E_ Join us at 10pmE on @ABC2020 @ABC with @BarbaraJWalters! #MeetTheTrumps #ABC2020 __HTTP__ _E_ Today it was my honor to join the great men and women of @DHSgov @CustomsBorder @ICEgov and @USCIS at the U.S. Customs and Border Protection National Targeting Center in Sterling Virginia. Fact sheet: __HTTP__ __HTTP__ _E_ .@IvankaTrump and I are looking forward to visiting Vancouver next week. Big announcement... _E_ RT @TeamTrump: .@realDonaldTrump is here to talk about the REAL issues #BigLeagueTruth #Debates2016 __HTTP__ _E_ #FullRepeal: Stopping Obamacare is now up to the American people. We must elect @MittRomney this November. _E_ Looking forward to RALLY in the Great State of Pennsylvania tonight at 7:30. Big crowd big energy! _E_ their country (the U.S. 
doesn't tax them) or to build a massive military complex in the middle of the South China Sea? I don't think so! _E_ My @SquawkCNBC interview discussing the GOP primary gas prices the Doral purchase and my outlook on the economy. __HTTP__ _E_ Shock @BarackObama's DNC Convention has a $27M deficit and events are starting to be canceled. __HTTP__ _E_ Great going. _E_ RT @IvankaTrump: It was an honor to meet with you Prime Minister Modi. Thank you for co hosting the 8th annual Global Entrepreneurship Summ... _E_ Donald Trump Jr. Ivanka Trump Eric Trump and myself in front of The Old Post Office D.C. on Pennsylvania... __HTTP__ _E_ Glad to hear @ehasselbeck will be staying on @theviewtv. Elizabeth has great presence & doesn't back down from sharing her views. _E_ Oil would be $25 a barrel if our government would let us drill. Our country would be rich again who needs OPEC. _E_ My fragrance Success is flying off the shelves @Macys. The perfect Christmas gift! _E_ "Successful people don't have fewer problems.They have determined that nothing will stop them from going forward." Dr. Benjamin Carson _E_ Always be prepared to start." @JoeMontana _E_ Clinton betrayed Bernie voters. Kaine supports TPP is in pocket of Wall Street and backed Iraq War. _E_ Entrepreneurs: always remember that deals are fluid. Terms are always negotiable and time can be the best option for success. _E_ The Miss USA Pageant #MissUSA was a big ratings hit for @nbc NBC won the evening. Thank you Donald. _E_ First candidate in Virginia with over 16000 validated signatures for the ballot. An honor thank you! #Trump2016 #MakeAmericaGreatAgain _E_ Passion gives great momentum and can be the catalyst for great achievement. _E_ A TRULY GREAT CHAMPION WILL SELDOM FAIL AND ALWAYS COME BACK. NEVER UNDERESTIMATE THE POWER OF GREATNESS! _E_ Waterboarding KSM gave us the intelligence that lead to Bin Laden. _E_ Tune in to The Marriage Ref onThursday night at 10 p.m. 
on NBC I'm on the panel of experts along with Gloria Estefan & Adam Carolla. _E_ In order to get elected @BarackObama will start a war with Iran. _E_ Via @EWErickson: "Stop Complaining About Donald Trump" __HTTP__ _E_ 'S&P 500 Edges Higher After Trump Renews Jobs Pledge' __HTTP__ _E_ I just got Mike Leach's new book Swing Your Sword. He's a great coach and he's written a great book. It's definitely worth reading. _E_ Lyin' Ted Cruz steals foreign policy from me and lines from Michael Douglas— just another dishonest politician. _E_ Trump National Golf Club Jupiter is close to Palm Beach and designed by Jack Nicklaus a masterpiece of a course. __HTTP__ _E_ How did Obama go to a Las Vegas fundraiser on 9.12 the day after he refused to send help to Americans in Benghazi? _E_ It's good to see that @FLGovScott is protecting the sanctity of this November's elections Voter fraud must be broken. _E_ Everybody loves @bretmichaels! He's a great champion and this is where he should be. He agrees! _E_ Wonderful weekend at Camp David. A very special place. A lot of very important work done. Heading back to the @WhiteHouse now. __HTTP__ _E_ We will stop heroin and other drugs from coming into New Hampshire from our open southern border. We will build a WALL and have security. _E_ Our American comeback story begins 11/8/16. Together we will MAKE AMERICA SAFE & GREAT again for everyone! Watch:... __HTTP__ _E_ Thank you Illinois! Great news! #VoteTrumpIL on 3/15!Trump 28%Cruz 15%Rubio 14%Kasich 13%Bush 8%Carson 6%Simon Poll/SIU _E_ It's amazing how people can talk about me but I'm not allowed to talk about them. _E_ If you look at the horrible picture on the front page of the NY Times of the rebels executing prisoners you would say forget the rebels! _E_ America's primary goal with Iran must be to destroy its nuclear ambitions. 
Let me put this as plainly as I know (cont) __HTTP__ _E_ Don't miss my fabulous World of Golf now in its second season on Golf Channel beginning January 31 at 9 pm ET. Celebrity matches and more... _E_ Got the endorsement of Brian France and @NASCAR yesterday in Georgia. Also many of the sports great drivers. Thank you Nascar and Georgia! _E_ General Motors is sending Mexican made model of Chevy Cruze to U.S. car dealers tax free across border. Make in U.S.A.or pay big border tax! _E_ It seems there is never a problem for which @BarackObama cannot find a reason for another speech and another tax. _E_ #TBT With Tommy Lee Jones at Mar a Lago. __HTTP__ _E_ Big increase in traffic into our country from certain areas while our people are far more vulnerable as we wait for what should be EASY D! _E_ "Failure isn't fatal but failure to change might be" – John Wooden _E_ Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_ Happy Birthday to my friend @garyplayer... __HTTP__ _E_ Don't forget next Friday December 9th: I'll be signing my new book @HowToGetTough in Trump Tower from 11 a.m... (cont) __HTTP__ _E_ Army officer who led a sexual abuse prevention unit was just fired after being charged with violently going after his wife.What is going on? _E_ Remember get TIME magazine! I am on the cover. Take it out in 4 years and read it again! Just watch... _E_ Still waiting to hear from @billmaher. Every day he dodges me is one less day that $5M is being used for charity. _E_ All civilized nations must join together to protect human life and the sacred right of our citizens to live in safety and in peace. _E_ A big fat hit job on @oreillyfactor tonight. A total waste of time to watch boring and biased. @brithume said I would never run a dope! _E_ Leaving now for a one night trip to Scotland in order to be at the Grand Opening of my great Turnberry Resort. Will be back on Sat. night! 
_E_ Celebrity Apprentice on CNBC tonight at 9. _E_ Boycott all Apple products until such time as Apple gives cellphone info to authorities regarding radical Islamic terrorist couple from Cal _E_ Cover your bases know everything you can about what you're doing. Keep your focus by being well informed on a daily basis. _E_ "If we ever forget that we are One Nation Under God then we will be a nation gone under." Ronald Reagan (Feb. 6 1911–June 5 2004) _E_ Congratulations to my brother Robert & Ann Marie on the success of @MontesKitchen in Dutchess County New York (Amenia.) Great food! _E_ Entrepreneurs: See yourself as victorious look at the solution not the problem. _E_ The @nytimes sent a letter to their subscribers apologizing for their BAD coverage of me. I wonder if it will change doubt it? _E_ The Chicago machine is scared. @PaulRyanVP shows that @MittRomney will run on a conservative & coherent platform. 85 days until victory! _E_ Q1 GDP has just been revised down to 1.9% __HTTP__ The economy is in deep trouble. _E_ ISIS is in retreat our economy is booming investments and jobs are pouring back into the country and so much more! Together there is nothing we can't overcome even a very biased media. We ARE Making America Great Again! _E_ Unsustainable. With our $17T debt & $90T in unfunded liabilities government "blatantly" wasted $30B this year __HTTP__ _E_ RT @LouDobbs: We are Watching A Leader Who for the First Time in Three Presidencies Will Put America and Americans First! @realDonaldTrump... _E_ The highly respected Suffolk University poll just announced that I am alone in 2nd place in New Hampshire with Jeb Bust (Bush) in first. _E_ Fidel Castro is dead! _E_ I hope everyone that read @DanAmira's reprehensible statement will cancel their subscription to @NYMag in protest. Let me know. _E_ from Donald Trump: I saw Lady Gaga last night and she was fantastic! _E_ .@lisarinna is the last lady standing in All Star Celebrity @ApprenticeNBC. 
Watch out men she's sharp and tough. _E_ The legendary @BarbaraJWalters will be asking me questions about the Presidential campaign on @WNTonight at 6:30 PM. _E_ #Trump360 Watch this 360 video of my speech last night at Trump Tower __HTTP__ _E_ Entrepreneurs: In the best negotiations everyone wins. This is a possibility and it's the ideal situation to strive for. _E_ President Obama & Putin fail to reach deal on Syria so what else is new? Obama is not a natural deal maker. Only makes bad deals! _E_ Via @PVPatch by Paige Austin: "Trump to Donate 12 Acres for Conservation in Palos Verdes" __HTTP__ _E_ RT @SecretarySonny: Serious @Cabinet meeting today called by @POTUS at Camp David. Reports on #Irma's track potential impact fed & state... _E_ A Rod is now looking for an expensive home in Beverly Hills why aren't the @Yankees terminating his contract for misrepresentation? _E_ Via @DMRegister by @KObradovich: "Donald Trump: Next president needs to be 'a great one'" __HTTP__ _E_ Third quarter GDP was lowered to 2% . There won't be any economic recovery until @BarackObama is defeated. _E_ George Will one of the most overrated political pundits (who lost his way long ago) has left the Republican Party.He's made many bad calls _E_ Wow  I never saw the Petraeus thing coming. A straight laced guy! Very sad for him and his family. _E_ Join me live in Toledo Ohio. Time to #DrainTheSwamp & #MAGA! __HTTP__ _E_ Go to the website for the Judge's full decision re Trump University: __HTTP__ _E_ Vera Coking made a big mistake in Atlantic City by turning down many millions of $'s years ago for property that just sold for $530000. _E_ I will be interviewed by @GStephanopoulos on @ABC at 10:00 A.M. _E_ As I predicted 1 year ago gasoline prices hit a record high today...OPEC is having a ball at our expense. _E_ The President has until tomorrow at 12 noon to pick up $5M for his favorite charity. Looking like he won't be doing it. What is he hiding? 
_E_ I will be talking about my wonderful experience in Iowa and the simultaneous unfair treatment by the media later in New Hampshire. Big crowd _E_ Hope he won't spend too much time ripping apart the 2nd. Amendment! _E_ Today we heard the experiences of law enforcement professionals and community leaders working to combat the threat of MS 13 and the reforms we need from Congress to defeat it. Watch here: __HTTP__ __HTTP__ _E_ RT @MikeHolden42: @foxandfriends @realDonaldTrump He's a fascist so not unusual. _E_ Thank you Albany New York!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ .@MelindaDC Don't misrepresent in order to make a point. I was always tough on ISIS as you'll find out after I get elected. _E_ The United States has been reminded time and again in recent years that economic security is not merely RELATED to national security economic security IS national security. It is vital to our national strength. #APEC2017 __HTTP__ _E_ I was never a fan of Bush 2 FOR MANY REASONS including the fact that we should never have gone into Iraq but once there kept the oil! DUMB _E_ Fact – the tighter the gun laws the more violence. The criminals will always have guns. _E_ I hope everybody goes to Macy's today to get Donald J. Trump shirts ties suits and cufflinks they are really beautiful at low price _E_ Heroin overdoses are taking over our children and others in the MIDWEST. Coming in from our southern border. We need strong border & WALL! _E_ Gas prices are still too high. We really need to pressure OPEC to lower the price of oil. _E_ Texas is lucky to have him @GovernorPerry is a great guy! _E_ Why does @FoxNews give @KarlRove so much airtime. He (and other Fox pundits) is so biased. Still thinks Romney won. Unfair coverage of Trump _E_ .@IvankaTrump and @PiersMorgan will be wonderful advisors. #CelebApprentice _E_ ObamaCare will cost 3 times as much as Obama promised – $2.6T __HTTP__ It is not sustainable. (h/t @gatewaypundit) _E_ Thank you Pennsylvania! 
#Trump2016 __HTTP__ __HTTP__ _E_ Watching the returns at 9:45pm. #ElectionNight #MAGA __HTTP__ _E_ "If you're still in school pay attention. Education is a money machine." – Think Like a Billionaire _E_ Isn't it time that Obama release his college records and applications? Boy would that create a mess! He is not who you think. _E_ Re run of O'Reilly on Fox NOW! _E_ We're singlehandedly transferring hundreds of billions of dollars a year... _E_ Last night in Orlando Florida was incredible massive crowd THANK YOU FLORIDA! Today at 3:00 P.M. I will be in Alabama for last rally! _E_ Now Obama is having our army coordinate with Iran against ISIS. What's next? _E_ Oh wow lightweight Governor @BobbyJindal who is registered at less than 1 percent in the polls just mocked my hair. So original! _E_ Good messaging and staying on point. @MittRomney called @BarackObama anti investment anti business anti jobs __HTTP__ _E_ Congrats everyone we topped 4 million today on Twitter and heading up fast! _E_ Happy #VeteransDay to all. And it is nice to have Sgt. Andrew Tahmooressi back home. _E_ I was nice to loser @rosie and she attacked me it just shows never let up with a bully. They only fade when you hit them hard! _E_ Looking forward to seeing the World Champion Yankees today on opening day! _E_ Only a grossly incompetent government led by an equally incompetent president could have made the terrible trade for Bergdahl. #OrangeRoom _E_ A must watch: Legal Scholar Alan Dershowitz was just on @foxandfriends talking of what is going on with respect to the greatest Witch Hunt in U.S. political history. Enjoy! _E_ Few people know that @FortuneMagazine is still in business. Tell your writer Alisa Soloman that I left The Apprentice to run for president _E_ If America unlocked its energy potential we would once again be the most powerful country in the world. Washington is holding us back. _E_ Welcome to the new reality. 
23116928 US households on food stamps __HTTP__ Obama's Hope & Change. _E_ In '08 @BarackObama said that Bush adding $4T to the debt was unpatriotic. __HTTP__ @BarackObama has already added $6T. _E_ Thank you Peter if elected I will think big for our country & never let the American people down! #AmericaFirst __HTTP__ _E_ Justice Roberts did the Republican Party and @MittRomney a great favor. He essentially said ObamaCare is a tax (cont) __HTTP__ _E_ ...Such poor leadership ability by the Mayor of San Juan and others in Puerto Rico who are not able to get their workers to help. They.... _E_ Jack Nicklaus II gave the best tribute to a parent I have ever heard at yesterday's Congressional Gold Medal Ceremony honoring @jacknicklaus _E_ Ask Sally Yates under oath if she knows how classified information got into the newspapers soon after she explained it to W.H. Counsel. _E_ China taxing imports from the US 22% why aren't we taxing China? _E_ Egypt is going the exact opposite of what it was. They will soon be very strongly against Israel. Thanks President Obama. @BarackObama _E_ Just met with General Petraeus was very impressed! _E_ I am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ ... _E_ Tell me which is "cooler"—my induction into the @WWE Hall of Fame or my Star on the Hollywood Walk of Fame? _E_ The Republican Party is racking up record amounts of small dollar donations fueled by Trump supporters..... @nypost Thank you! _E_ Our major airports are decaying. It's embarrassing. We need to have them renovated by competent professionals and fast. _E_ "You don't necessarily need the best location. What you need is the best deal." – The Art of The Deal _E_ .@NBA Hall of Famer @dennisrodman rebounds for a tremendous performance in his return to this year's All Star @ApprenticeNBC! Great guy! _E_ Big day in Texas tomorrow! Having a rally in Fort Worth. Tremendous crowd. Will be exciting! 
#Trump2016 __HTTP__ _E_ Bloomberg News Spain's renewable projects lead by money losing wind turbines facing bankruptcy. Hopefully Scotland is watching! _E_ I was just told by a television pro thay @DannyZucker is one of the truly dumbest guys in the business he's obsessed with T so many flops! _E_ Reports say #ISIS now has a passport machine to have its believers infiltrate our country. I told you so. __HTTP__ _E_ "Faldo to rework two Doral courses" __HTTP__ via @FOXSports _E_ New Q poll out we are going to win the whole deal and MAKE AMERICA GREAT AGAIN! #Trump2016 __HTTP__ _E_ Via @AFP: Trump tees off on new golf course in Scotland __HTTP__ _E_ Biden @VP Spends $1 Million Annually for Weekend Trips __HTTP__ _E_ My wife Melania will be interviewed tonight at 8:00pm by Anderson Cooper on @CNN. I have no doubt she will do very well. Enjoy! _E_ I am no fan of President Obama but to show you how dishonest the phony Washington Post is: __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Congrats to the Senate for taking the first step to #RepealObamacare now it's onto the House! _E_ Must read column by Bob Woodward explaining how Obama pushed for sequestration & promised no tax increase __HTTP__ _E_ Wow I just had two very good Iowa polls and a phenomenal just out National Poll from @ABC @washingtonpost 38%. MAKE AMERICA GREAT AGAIN! _E_ WEST VIRGINIA #VoteTrump TODAY!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Thank you Texas! 10000 amazing supporters! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ I hope @Official1MCD is recuperating well in LA. Get better! @OMAROSA _E_ .@montgomeriefdn Your commentary this weekend was fantastic. People love what you say and how you say it. _E_ Is it legal for @BarackObama to make campaign donor calls from Air Force One? __HTTP__ Obama is always fundraising on our dime. _E_ Via @DMRegister by @brianneDMR:"Trump: Bring back jobs from overseas" __HTTP__ Let's Make America Great Again! 
_E_ Money pouring into Insurance Companies profits under the guise of ObamaCare is over. They have made a fortune.Dems must get smart & deal! _E_ I will be on The Tonight Show with Jimmy Fallon tonight at 11:30. Should be fun! @jimmyfallon _E_ Via RealClear Politics __HTTP__ _E_ "Do more be more give more and everyone will benefit." – Think Like a Champion _E_ It was my great honor to celebrate the opening of two extraordinary museums the Mississippi State History Museum & the Mississippi Civil Rights Museum. We pay solemn tribute to our heroes of the past & dedicate ourselves to building a future of freedom equality justice & peace. __HTTP__ _E_ I know you don't like to hear this @DannyZuker but the biggest nights of The Apprentice were far bigger than the biggest nights of Mod Fam _E_ Entrepreneurs: See yourself as victorious. Look at the solution not the problem. _E_ The Democrats in Congress don't want ObamaCare for themselves or big businesses. So why are they forcing it on the American people? _E_ Gas prices have doubled under Obama. Over $5/gallon now in California. We must start drilling from our own resources to become independent. _E_ Passion motivates. Passionate people don't give up their zeal eliminates fear. Passion can also create business opportunities. _E_ ...then he who continues the attack wins." Ulysses S. Grant _E_ The Scottish windfarm was conceived by the same mind that released terrorist al Megrahi for humanitarian reasons. .. _E_ Everyone loves @AmandaTMiller here she is with @Joan_Rivers and me. __HTTP__ _E_ Oprah will end up doing just fine with her network she knows how to win. @Oprah _E_ How do you take care of our people if you don't make anything? We don't make anything. We are rapidly losing our manufacturing to China etc. _E_ Obama now just wants to save face Russia is now telling him don't do it . He waited too long and the other side is much better prepared. 
_E_ Wow a really great review of my golf club in Scotland @TrumpScotland in todaysgolferco.uk. Thank you! __HTTP__ _E_ Negotiation tip: Think about what the other side wants. Know where they're coming from. Try to create a win/win situation. _E_ Put Kathleen Sebelius out of her misery and lovingly say YOU'RE FIRED! Let her go home to her family and rest. BRING IN TOP FLIGHT PEOPLE! _E_ The United Kingdom is trying hard to disguise their massive Muslim problem. Everybody is wise to what is happening very sad! Be honest. _E_ To be a big success in any field you need to build momentum. Momentum is all about energy and timing. Think BIG _E_ Like it or not haters and losers everybody is talking about Miss U.S.A. and Miss Utah. By the way she is a fine young woman unfair to her. _E_ True! __HTTP__ _E_ White House relaxes penalty for canceled health policies a major blow to the sustainability (and concept) of ObamaCare! They are desperate _E_ We will repeal & replace #Obamacare which has caused soaring double digit premium increases. It is a disaster! __HTTP__ _E_ The incompetence of our current administration is beyond comprehension. TPP is a terrible deal. _E_ MY PRO GROWTH Econ Plan:✅Eliminate excessive regulations! ✅Lean government!✅Lower taxes!#Debates ... __HTTP__ _E_ Based on the shoots which silent film do you think will be better? #CelebApprentice _E_ Just left news conference at @TrumpTowerNY with @TheGaryBusey people love @TheGaryBusey! __HTTP__ _E_ Thoughts and prayers with the victims and their families along with everyone at the Berrien County Courthouse in St. Joseph Michigan. _E_ The record 13th season of 'All Star' @CelebApprentice features the return of the beautiful @BrandenRoderick. The fans love her! _E_ The Republicans should use everything against @BarackObama just as @BarackObama is going to use everything (cont) __HTTP__ _E_ RT @IvankaTrump: We're working to make tax cuts & the expanded Child Tax Credit a reality for American families. 
The time is now! #TaxRefor... _E_ With multiple space options @TrumpChicago is the ideal venue to hold your dream wedding __HTTP__ _E_ Puerto Rico is devastated. Phone system electric grid many roads gone. FEMA and First Responders are amazing. Governor said great job! _E_ Crooked Hillary Clinton who I would love to call Lyin' Hillary is getting ready to totally misrepresent my foreign policy positions. _E_ On behalf of @FLOTUS Melania and myself thank you Poland!🇱#ICYMI watch here __HTTP__ #POTUSinPoland __HTTP__ _E_ Crooked Hillary Clinton likes to talk about the things she will do but she has been there for 30 years why didn't she do them? _E_ ...are now fighting back like never before. There is so much GUILT by Democrats/Clinton and now the facts are pouring out. DO SOMETHING! _E_ I am in Toronto checking the great Trump International Hotel highest rated hotel in Canada. It is a beauty! _E_ .@BradPaisley came up to see me. A really nice and talented guy. __HTTP__ _E_ Wow @GolfMagazine just rated the renovation of The Blue Monster the best of the year. Even better they stated it may be best of all time! _E_ Congrats to R. Emmett Tyrrell Jr of @AmSpec for the fantastic piece on Benghazi. __HTTP__ _E_ I have had the pleasure of getting to know @AnnDRomney & @MittRomney this past year. They love America. Let's push them over the top today. _E_ Via @baltimoresun by @ErinatTheSun: "Maryland GOP book Trump for major fundraiser" __HTTP__ _E_ Thank you to our wonderful team @USUN and their families. Keep up the GREAT work! #USA __HTTP__ _E_ I hate hearing after all of the hard work that @MittRomney never wanted to become President. _E_ If The Art of the Deal is a must read then #TimeToGetTough is my opus. It is available Dec 5th! _E_ Thank you Pennsylvania! Together we are going to MAKE AMERICA GREAT AGAIN! Watch here: __HTTP__ __HTTP__ _E_ Thank you Kirkwood Community College. Heading to the U.S. Cellular Center now for an 8pmE MAKE AMERICA GREAT AGAIN... 
__HTTP__ _E_ Crooked's State Dept gave special attention to Friends of Bill after the Haiti Earthquake. Unbelievable! __HTTP__ _E_ Big new @ABC Poll to be announced at 9:00 A.M. on This Week with @GStephanopoulos. I will be interviewed on show! _E_ Credit the Bloomberg administration for having the foresight and courage to get this decades old project finished will be BIG for NY. _E_ Join us tomorrow in Kiawah South Carolina! #SCPrimary #VoteTrumpSC#Trump2016 __HTTP__ _E_ Thank you for all of your support Iowa!#MakeAmericaGreatAgain #Trump2016#IACaucus finder: __HTTP__ __HTTP__ _E_ I feel sorry for Rosie 's new partner in love whose parents are devastated at the thought of their daughter being with @Rosie a true loser. _E_ .@CNN is looking at Jeff Zucker to lead them out of the forest Jeff would be a great choice. _E_ Top searched candidate by state as seen in the #GOPDebate media filing center. WE WILL MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ RT @realDonaldTrump: Democrats are holding our Military hostage over their desire to have unchecked illegal immigration. Can't let that hap... _E_ Trump Int. Hotel & Tower Vancouver will transform the skyline w/ its 616 ft twisting & beautiful tower __HTTP__ _E_ Did Hillary Clinton ever apologize for receiving the answers to the debate? Just asking! _E_ Even though every poll Time Drudge etc. has me winning the debate by a lot @FoxNews only puts negative people on. Biased a total joke! _E_ Money is really cheap so this is a great time to buy a house but be sure to lock in long term financing (without which don't buy). _E_ A suggestion for the dishonest media __HTTP__ _E_ "If you can accept losing you can't win." Vince Lombardi _E_ China has been taking out massive amounts of money & wealth from the U.S. in totally one sided trade but won't help with North Korea. Nice! _E_ Melania and I were thrilled to join the dedicated men and women of the @USEmbassyFrance members of the U.S. Military and their families. 
__HTTP__ _E_ I said don't invade Iraq from the very beginning. my @SRQRepublicans speech _E_ fires its employees builds a new factory or plant in the other country and then thinks it will sell its product back into the U.S. ...... _E_ RT @GMA: WATCH: @IvankaTrump on women who work empowering campaign celebrates modern women. __HTTP__ _E_ Great column by @howardfineman on @HuffPostPol: Karl Rove Is Done __HTTP__ _E_ Via @ChristianToday: "Donald Trump vows to be the 'greatest representative of Christians' if he wins White House" __HTTP__ _E_ Remember we don't get any oil from Iraq China gets whatever ISIS hasn't already taken. So why isn't China sending the troops? Too smart! _E_ New National GOP Zogby Poll#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ So disgraceful that a person illegally in our country killed @Colts linebacker Edwin Jackson. This is just one of many such preventable tragedies. We must get the Dems to get tough on the Border and with illegal immigration FAST! _E_ AMAZING how the press protected President Obama when he did the so called comedy routine with Zach G. He looked like a fool they said cute _E_ The failing @nytimes story is so totally wrong on transition. It is going so smoothly. Also I have spoken to many foreign leaders. _E_ I hear the Rickets family who own the Chicago Cubs are secretly spending $'s against me. They better be careful they have a lot to hide! _E_ Remember politicians are all talk and no action they will never be able to MAKE OUR COUNTRY GREAT AGAIN! Controlled by lobbyists & donors _E_ We are going through contentious primaries now but the GOP must unite. Let's take the Senate and stop Obama's dangerous agenda. _E_ But while Dallas dropped to its knees as a team they all stood up for our National Anthem. Big progress being made we all love our country! _E_ So many people are angry at my comments on Mexico—but face it—Mexico is totally ripping off the US. Our politicians are dummies! 
_E_ ...instead of biting the hand that feeds you! Don't bother just keep making me money! _E_ Wow! Ted Cruz received $487K in campaign contributions $11M from a NY hedge fund mogul & $1M low int. loan from Goldman Sachs. Hypocrite _E_ We need to worry about the American worker first! _E_ Direct view of crane from apartment window. Crane was never properly secured blowing in the breeze. __HTTP__ _E_ With the fantastic ratings last weekend @meetthepress & @ThisWeekABC I think it's only fair that I go on @FoxNewsSunday w/ Chris Wallace. _E_ I was sorry to decline headlining the Reagan Dinner last Saturday due to a prior business commitment. Pres. Reagan was one of the greats. _E_ RT @ColumbiaBugle: @realDonaldTrump Love our @FLOTUS! __HTTP__ _E_ Vote for the next Miss USA... __HTTP__ #VEGASusa11 #MissUSA _E_ America deserves a commander in chief who respects the challenges and realities our Armed Forces face in our (cont) __HTTP__ _E_ Thank you to Jack Morgan Tamara Neo Cheryl Ann Kraft and all of my friends and supporters in Virginia. GREAT JOB! _E_ The NFL is now thinking about a new idea keeping teams in the Locker Room during the National Anthem next season. That's almost as bad as kneeling! When will the highly paid Commissioner finally get tough and smart? This issue is killing your league!..... _E_ If you don't have passion everything you do will ultimately fizzle out or at best be mediocre. Is that how (cont) __HTTP__ _E_ Fox News Sunday With Chris Wallace will be re broadcast on @FoxNews at 6:00 P.M. _E_ The FDA must immediately stop allowing massive dose vaccinations in babies. It is mind boggling that they allow this practice to continue. _E_ China is ripping wealth out Africa and yet as usual refuses to put anything back to help with Ebola. Let the stupid Americans do it! SAD _E_ Thx and from a better quotation source: You miss 100% of the shots you don't take. 
Wayne Gretzky _E_ RT @foxandfriends: Former President Obama's $400K Wall Street speech stuns liberal base Sen. Warren saying she was troubled by that __HTTP__ _E_ The top leadership of the New York State Republican Party is totally dysfunctional they haven't won a major election in many years. _E_ "Donald Trump on 'Brutal' New Season of @ApprenticeNBC" __HTTP__ via @YahooTV _E_ Great job today by the NYPD in protecting the people and saving the climber. _E_ CNBC poll: Trump won #GOPDebate #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Great job @AdamScott you deserve it! _E_ Why would the people of Texas support Ted Cruz when he has accomplished absolutely nothing for them. He is another all talk no action pol! _E_ Leaving now for Tennessee. Big crowd! _E_ Departing for Long Island now. An area under siege from #MS13 gang members. We will not rest until #MS13 is eradicated. #LESM __HTTP__ _E_ If Obama resigns from office NOW thereby doing a great service to the country—I will give him free lifetime golf at any one of my courses! _E_ A wonderful afternoon in Iowa! Great people! Heading now to Florida tomorrow South Carolina! #MakeAmericaGreatAgain #Trump2016 _E_ Congrats to Charles @krauthammer for his statements on climate change formerly known as global warming! _E_ I'm leading big in every poll and we are going to WIN! Remember Trump NEVER gives up! _E_ Great meeting with a wonderful woman today former Secretary of State Condoleezza Rice! #USA __HTTP__ _E_ Entrepreneurs: Trust your instincts even after you've honed your skills. They're there for a reason. _E_ A country must enforce its borders. Respect for the rule of law is at our country's core. We must build a wall! __HTTP__ _E_ The new Ebola czar will report to the WH & NSA adviser Susan Rice. More mismanagement & duplicity with CDC. Obama is terrible executive. _E_ I hope Tom Brady sues the hell out of the @nfl for incompetence & defamation. They will drop the case against him and he will win. 
_E_ Congrats to @KarlRove on blowing $400 million this cycle. Every race @CrossroadsGPS ran ads in the Republicans lost. What a waste of money. _E_ RT @charliekirk11: 100 days ago a new message leader & movement took the Oval Office! A government FOR the people BY the people. This is... _E_ Jeb Bush is weak on illegal immigration in favor of common core bad on women's health issues and thinks the Iraq war was a good thing. _E_ Do you believe that highly overrated political pundit @krauthammer said this is the best Republican field in 35 years. What a dope! _E_ To the people of Kentucky Rand Paul didn't want you. Now he runs back due to his presidential failure. #VoteTrump #MakeAmericaGreatAgain _E_ A MUST READ! @AndrewBreitbart's last article The Vetting Part I: @BarackObama's Love Song to Alinsky __HTTP__ _E_ Hillary's staff thought her email scandal might just blow over. Who would trust these people with national security? __HTTP__ _E_ It's very sad that the administration isn't sending anyone to Margaret Thatcher's funeral. She was a big U.S. supporter. _E_ With Jemele Hill at the mike it is no wonder ESPN ratings have tanked in fact tanked so badly it is the talk of the industry! _E_ .@Graeme_McDowell Great playing Graeme you are a true champion! _E_ Canada has made business for our dairy farmers in Wisconsin and other border states very difficult. We will not stand for this. Watch! _E_ Black politicians are in prison based on Shirley Huntley's statements but not white @AGSchneiderman RACISM! __HTTP__ _E_ .@USATODAY Poll and @QuinnipiacPoll say that I beat both Hillary and Bernie and I havn't even started on them yet! _E_ Cruz did not renounce his Canadian citizenship as a US Senator only when he started to run for #POTUS. He could be Canadian Prime Minister. _E_ Negotiation: It is persuasion more than power. _E_ Will be doing Fox & Friends at 7 A.M. 20 minutes. ENJOY! _E_ Thank you Colorado Springs. 
If I'm elected President I am going to keep Radical Islamic Terrorists out of our count... __HTTP__ _E_ RT @TeamTrump: .@HillaryClinton just claimed she has a positive optimistic view for America. #Debates __HTTP__ _E_ Just as I predicted people are going to be shocked by the rise in premium prices thanks to Obama Care __HTTP__ Enjoy! _E_ Part 2 of my @jimmyfallon interview giving away some @CelebApprentice spoilers & discussing 2012 Miss Universe Pageant __HTTP__ _E_ Thank you Plymouth New Hampshire! #FITN #NHPrimary __HTTP__ _E_ and knew they were in big trouble which is why they cancelled their big fireworks at the last minute.THEY SAW A MOVEMENT LIKE NEVER BEFORE _E_ .@Jrprotalker Thanks Judy for the wonderful statements on @TrumpTurnberry. Great seeing you there & you did a fabulous job on commentary. _E_ Five U.S. soldiers killed in Afghanistan by so called friendly fire. What are we doing? _E_ Inflation is here. Record beef prices are hitting consumers pockets __HTTP__ Bad for family grills. _E_ Today I am standing with patriots in Arizona for border security! Build a wall! Let's Make America Great Again! __HTTP__ _E_ The harder you work the harder it is to surrender. @ProFootballHOF @buffalobills Head Coach Marv Levy _E_ Wow Macy's numbers just in Trump is doing better than ever thanks for your great support! _E_ I am asking the chairs of the House and Senate committees to investigate top secret intelligence shared with NBC prior to me seeing it. _E_ Why is it that when Warren Buffett uses the bankruptcy laws to his benefit nobody cares but with me they go nuts! _E_ Letterman @Late_Show begging me to go back on his low rated show calls lots must apologize for racist comment. _E_ Thank you @FoxNews Huge win for President Trump and GOP in Georgia Congressional Special Election. _E_ The World Bank is tying poverty to 'climate change' __HTTP__ And we wonder why international organizations are ineffective. 
_E_ Goofy Elizabeth Warren who may be the least productive Senator in the U.S. Senate must prove she is not a fraud. Without the con it's over _E_ {Crooked Hillary Clinton} created this mess and she knows it. #DrainTheSwamp __HTTP__ __HTTP__ _E_ .@secupp who can't believe that her candidate has bombed so badly is one of the dumber pundits on T.V. Hard to watch zero talent! @CNN _E_ Former President Vicente Fox who is railing against my visit to Mexico today also invited me when he apologized for using the f bomb. _E_ It seems @BarackObama had our tax dollars buy guns for Mexican drug lords that were used to kill Americans. We need answers now. _E_ I have been leading big in all polls with two more today @nbc and @CNN. The NBC poll is more than double next at 29%. Fiorina has 11%. _E_ Pres. Obama should leave the baseball game in Cuba immediately & get home to Washington where a #POTUS under a serious emergency belongs! _E_ Read about my victory against sleazebag @AGSchneiderman. More people should fight when they're right! __HTTP__ _E_ Young entrepreneurs – in an economic climate like this only the strong survive. You can do it. Think Big! _E_ Boycott @Macys no guts no glory. Besides there are far better stores! _E_ .@Graeme_McDowell You are the toughest guy there is. If you were a boxer you'd be the champ. Great going! _E_ Via @scotsmandotcom: "Awards for Trump's golf course" __HTTP__ _E_ Visit the highly acclaimed Trump International Hotel & Tower Chicago and its exceptional 'Sixteen' restaurant __HTTP__ _E_ Will be interviewed on @foxandfriends at 7:00 5 minutes. Then I head to New Hampshire great people! _E_ After destroying the Middle East & our economy the Bushes last gift was having Justice Roberts legalize ObamaCare. No more Bushes! _E_ Get respect and do not give a damn if people like you. Think Big _E_ The worst thing you can possibly do in a deal is seem desperate to make it. 
#TheArtofTheDeal _E_ A new INTELLIGENCE LEAK from the Amazon Washington Postthis time against A.G. Jeff Sessions.These illegal leaks like Comey's must stop! _E_ "TRUMP TO CPAC: BUILD A GREAT ECONOMY" __HTTP__ via @BreitbartVideo _E_ #TBT Filming an Oreo commercial with Eli Manning Peyton Manning and Darrell Hammond __HTTP__ _E_ Had a great time in Myrtle Beach and Charleston this past Saturday and Monday. Looking forward to going back soon. _E_ The majority of Americans agree with @MittRomney's comments on @Israel and Iran. _E_ LinkedIn Workforce Report: January and February were the strongest consecutive months for hiring since August and September 2015 _E_ #FlashbackFriday Just after I did my renovation in Central Park of @TrumpRink __HTTP__ _E_ The dishonest media likes saying that I am in Agreement with Julian Assange wrong. I simply state what he states it is for the people.... _E_ Via @NorthvillePatch: Donald Trump to Speak in Novi This May __HTTP__ _E_ The GOP doesn't waste an opportunity to waste an opportunity. Defunding Obamacare should be central to any deal. _E_ Just found out I won the Rockingham County Republican Booth Straw Poll at the Deerfield Fair in New Hampshire this past weekend. 39% Wow! _E_ My thoughts and prayers are with the two police officers shot in Sebastian County Arkansas. #LESM _E_ The Central Park Five documentary was a one sided piece of garbage that didn't explain the.horrific crimes of these young men while in park _E_ Since the first day I took office all you hear is the phony Democrat excuse for losing the election Russia RussiaRussia. Despite this I have the economy booming and have possibly done more than any 10 month President. MAKE AMERICA GREAT AGAIN! _E_ My father's 4 step formula for success: Get in get it done get it done right and get out. Fred C. Trump _E_ Watch @AC360 on NOW! @CNN _E_ Crooked Hillary Clinton is spending a fortune on ads against me. I am the one person she doesn't want to run against. 
Will be such fun! _E_ Join me in Reno Nevada on Wednesday at 3:30pm at the Reno Sparks Convention Center! #MAGATickets:... __HTTP__ _E_ Congrats to @greggutfeld on his new @FoxNews show! Greg makes great TV and is a terrific guy. _E_ W/state of the art Clubhouse & our signature amenities @Trump_Charlotte brings true luxury to The Tar Heel State __HTTP__ _E_ Looking forward to speaking at @Citizens_United & @SteveKingIA's "Iowa's Freedom Summit" on January 24th __HTTP__ _E_ Paulina @MissUniverse Vega will be introduced tonight at the Finale of Celebrity Apprentice.She is a great beauty and a monster star in S.A. _E_ .@TrumpDoral offers multiple award winning dining options in our all new signature restaurant and lounges __HTTP__ _E_ Fracking poses ZERO health risks __HTTP__ In fact it increases our national security by making us energy independent. _E_ My speech is right now on C SPAN 1 _E_ I will be meeting with Henry Kissinger at 1:45pm. Will be discussing North Korea China and the Middle East. _E_ Watch @extratv's spot covering the first annual Trump Invitational at Mar a Lago __HTTP__ _E_ Congratulations to my head pro of Trump International Golf Club (Florida) John Nieporte for qualifying for the U.S. Open! @usopengolf _E_ Obama stop the flights to and from West Africa NOW before it is too late! Can't you see what's happening? Can you be that thick (stupid)? _E_ Does @BarackObama ever work? He is constantly campaigning and fundraising on both the taxpayer's dime and time not fair! _E_ When it comes to the future of America's energy needs we will FIND IT we will DREAM IT and we will BUILD IT.... __HTTP__ _E_ Thank you Roseanne very much appreciated. __HTTP__ _E_ .@politico has no power but so dishonest! _E_ Entrepreneurs: Take responsibility for yourself. It's a very empowering attitude. _E_ Govt. collapsing in Iraq only 2 weeks after withdrawal of our troops. Sadly I called this one and please remember I alone called it. 
_E_ I think @megynkelly should take another eleven day unscheduled vacation. _E_ Is Jon Stewart a racist? See video that includes clip... __HTTP__ #thedailyshow _E_ .@kevinjonas was great but he brought the wrong person into the boardroom. Had he brought Lorenzo in he would not have been fired. _E_ .@antbaxter should really be ashamed about his massive box office disaster. Take a hint and get out of the film (cont) __HTTP__ _E_ Statement Regarding British Referendum on E.U. Membership __HTTP__ _E_ In three years people won't be building wind turbines anymore they are obsolete & totally destroy the environment in which they sit. _E_ Get ready for fireworks...@Joan_Rivers & @THEGaryBusey face off in the Board Room this Sunday on All Star Celebrity @ApprenticeNBC. _E_ Lots of autism and vaccine response. Stop these massive doses immediately. Go back to single spread out shots! What do we have to lose. _E_ Still looking to give away a RECORD $1M reward on @fundanything for a crowd funding campaign __HTTP__ _E_ In addition to those without health coverage those that have disastrous #Obamacare are seeing MASSIVE PREMIUM INCR... __HTTP__ _E_ .@Neilyoung A few months ago Neil Young came to my office looking for $$ on an audio deal & called me last week to go to his concert. Wow! _E_ Why won't Obama release his college applications? Is there something 'foreign' about them? _E_ An HR solutions company polled 1000 employed adults to find out who would make ideal bosses... __HTTP__ _E_ If the Palestinians want statehood then why are they run by the terrorist group Hamas? _E_ Great time in Burlington Vermont. Crowd was amazing. _E_ My heartfelt condolences to the family of Kathryn Steinle. Very very sad! _E_ Watch this behind the scenes video of @IvankaTrump's Fall 2012 collection photo shoot __HTTP__ _E_ Be a yardstick of quality. Some people aren't used to an environment where excellence is expected. 
Steve Jobs _E_ Failing @NYTimes will always take a good story about me and make it bad. Every article is unfair and biased. Very sad! _E_ RT @FoxNews: .@POTUS: I'm not against the media. I'm against the FAKE media. #CashinIn __HTTP__ _E_ Met @newtgingrich at Trump Tower today. He's a big thinker. _E_ Outrageous @BarackObama is suing to suppress the military vote in Ohio __HTTP__ Our Commander in Chief should be ashamed. _E_ Be sure to watch #MissUniverse tonight at 8PM on @nbc with its first simulcast on @Telemundo! _E_ I'd bet the horrible look of Pinehurst translates into poor television ratings. This is not what golf is about! _E_ Right To Play uses the power of play to educate and empower children facing adversity. A great cause check it out. __HTTP__ _E_ The Mar a Lago Club was amazing tonight. Everybody was there the biggest and the hottest. Palm Beach is so lucky to have best club in world _E_ Mullet Bay Golf Course looks like a slum on the beautiful island of St. Maarten. @PrimeMinisterSX should be ashamed for allowing this. _E_ My @IngrahamAngle interview discussing @JebBush's comments a united 2012 GOP #CelebApprentice & Trump#Miss Universe __HTTP__ _E_ "If you can't explain it simply you don't understand it well enough." Albert Einstein _E_ I spoke with President Moon of South Korea last night. Asked him how Rocket Man is doing. Long gas lines forming in North Korea. Too bad! _E_ Just as I predicted today Obama called for even more tax increases. The Republicans played right into his hands and blew their cards. _E_ "Always strive to outdo yourself." – Think Big _E_ No wonder the @nytimes is failing—who can believe what they write after the false malicious & libelous story they did on me. _E_ "Attitude is a little thing that makes a big difference." Winston Churchill _E_ Do you believe that Obama is giving weapons to moderate rebels in Syria.Isn't sure who they are. 
What the hell is he doing.Will turn on us _E_ I really enjoyed doing the show circuit this AM discussing lightweight AG Eric Schneiderman & the terrible job he has done for NY. _E_ One of the reasons I am no fan of John McCain is that our Vets are being treated so badly by him and the politicians. I will fix VA quickly. _E_ The banks need to start lending again otherwise the economy will continue its downturn. This is why we bailed the banks out! _E_ Puerto Rico Governor Ricardo Rossello just stated: The Administration and the President every time we've spoken they've delivered...... _E_ Lightweight @AGSchneiderman is driving business out of New York for his own public relations benefit. A real dope! _E_ Why does @FoxNews keep George Will as a talking head? Wrong on so many subjects! _E_ So Obama's top people responsible for ObamaCare think the American Public is stupid! All based on lies and deception! Repubs should sue. _E_ We should not attack Syria but if they make the stupid move to do so the Arab Leaguewhose members are laughing at us should pay! _E_ Thanks to everyone who has waited in the long lines at the #TimeToGetTough book signings. It is great to meet fellow patriots. _E_ Conde Nast Traveler Readers' Choice Awards Best Resorts in Europe: Trump Int'l Hotel & Golf Links Doonbeg voted #1. __HTTP__ _E_ Thomas Jefferson wrote the Senate filibuster rule. Harry Reid & Obama killed it yesterday. Rule was in effect for over 200 years. _E_ .@ericbolling you can do much better than you did tonight on @oreillyfactor. Better luck tomorrow! _E_ .@BarackObama Hood: Rob our children's future by borrowing from the Chinese to pay for socialist programs that will bankrupt us. _E_ The Church is yet another victim to his liberal agenda: @BarackObama lied to his Catholic supporters to pass ObamaCare. _E_ ...the beauty that is being taken out of our cities towns and parks will be greatly missed and never able to be comparably replaced! 
_E_ Horrible and cowardly terrorist attack on innocent and defenseless worshipers in Egypt. The world cannot tolerate terrorism we must defeat them militarily and discredit the extremist ideology that forms the basis of their existence! _E_ Without passion you don't have energy without energy you have nothing! _E_ The Dallas event in two weeks at the American Airlines Center is filling up fast. Get your tickets fast before it is too late! _E_ So impt Rep Senators under leadership of @SenateMajLdr McConnell get healthcare plan approved. After 7yrs of O'Care disaster must happen! _E_ The @BarackObama administration is far more enthusiastic about boosting food stamp enrollment than about preventing fraud. #TimeToGetTough _E_ VERY IRONIC: In 2010 video Clinton lectured underlings on cybersecurity and guarding 'sensitive information' __HTTP__ _E_ .@VattenfallGroup will never solve the issues with the Ministry of Defense. Besides they smartly just left the project. _E_ Highly overrated & crazy @megynkelly is always complaining about Trump and yet she devotes her shows to me. Focus on others Megyn! _E_ Hillary whose decisions have led to the deaths of many accepted $ from a business linked to ISIS. Silence at CNN. __HTTP__ _E_ The U.S. Consumer Confidence Index for December surged nearly four points to 113.7 THE HIGHEST LEVEL IN MORE THAN 15 YEARS! Thanks Donald! _E_ .@CNN is so negative getting even worse as I get closer. Just had two anti Trump losers with zero rebuttal from my team. Turning off! _E_ Join me this Saturday at Ladd–Peebles Stadium in Mobile Alabama! #ThankYouTour2016 Tickets:... __HTTP__ _E_ .@chucktodd said today on @meetthepress that attacking Bill to get to Hillary has never worked before. Wrong attacked him in '08 & won! _E_ ...long he doesn't know how to win anymore just look at the mess our country is in bogged down in conflict all over the place. Our hero.. 
_E_ We are one step closer to delivering MASSIVE tax cuts for working families across America. Special thanks to @SenateMajLdr Mitch McConnell and Chairman @SenOrrinHatch for shepherding our bill through the Senate. Look forward to signing a final bill before Christmas! __HTTP__ _E_ Entrepreneurs: Set the example. You can motivate others as well as yourself by remembering that you are setting the example. _E_ Thank you Colorado! #MAGA __HTTP__ __HTTP__ __HTTP__ _E_ Just as I predicted @Joe_Biden was a complete disaster in China. He condoned the Chinese one child policy an... (cont) __HTTP__ _E_ Via @DC_Decoder: Donald Trump to 'surprise' GOP convention. What might he do? __HTTP__ Answer: Something major! _E_ What is happening in Atlantic City casino closures is very sad but does anybody give me credit for getting out before its demise? Timing _E_ How can Jeb Bush expect to deal with China Russia + Iran if he gets caught doing a "plant" during my speech yesterday in NH? _E_ The best thing you can do is deal from strength and leverage is the biggest strength you have. Leverage is (cont) __HTTP__ _E_ Watching my beautiful wife Melania speak about our love of country and family. We will make you all very proud.... __HTTP__ _E_ Today there were terror attacks in Turkey Switzerland and Germany and it is only getting worse. The civilized world must change thinking! _E_ Tomorrow I will be tweeting on only one subject! _E_ Looking forward to the 2010 Miss USA Pageant Sunday May 16 on NBC 7 p.m. ET hosted by Curtis Stone and Natalie Morales. _E_ My interview on @gretawire last night Our Leaders Are Leading Us Into 'Oblivion' __HTTP__ _E_ Who is @Macys to pretend innocence when they "racial profile" all over the place? Paid big fine! _E_ I will be commenting LIVE on Sunday night (9 to 11) on TWITTER Celebrity Apprentice will be great this season amazing cast! _E_ I'll be turning the table on Larry King this Saturday night. 
I'll be interviewing him in honor of the 25th Anniversary of his show. _E_ China is primed to continue to rob us and steal our jobs through their exports __HTTP__ We need @MittRomney to rein them in. _E_ Take a chance! All life is a chance. The man who goes farthest is generally the one who is willing to do and dare. Dale Carnegie _E_ The public is learning (even more so) how dishonest the Fake News is. They totally misrepresent what I say about hate bigotry etc. Shame! _E_ A Rod has disgraced the blessed @Yankees organization lied to the fans & embarrassed NYC. He does not deserve to wear the pinstripes. _E_ THANK YOU IOWA!#ThankYouTour2016 __HTTP__ _E_ .@EricTrump was FANTASTIC on @foxandfriends this morning. He may be my son but he is a special guy! _E_ Being politically correct takes too much time. We have too much to get done! #Trump2016 __HTTP__ __HTTP__ _E_ RT @JerryTravone: @realDonaldTrump __HTTP__ _E_ Getting ready to leave for Michigan will be an amazing evening! See you there. _E_ Join me at 4pm over at the Lincoln Memorial with my family!#Inauguration2017 __HTTP__ _E_ The Fake News is at it again this time trying to hurt one of the finest people I know General John Kelly by saying he will soon be..... _E_ do this under the law I feel it is visually important as President to in no way have a conflict of interest with my various businesses.. _E_ LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_ When the Super Committee fails @BarackObama will get exactly what he really wants automatic cuts in defense spending. This is his plan. _E_ RT @dcexaminer: EXCLUSIVE: How Donald Trump's 30 million followers are crashing the Internet __HTTP__ __HTTP__ _E_ Charles McCullough the respected fmr Intel Comm Inspector General said public was misled on Crooked Hillary Emails. "Emails endangered National Security." Why aren't our deep State authorities looking at this? Rigged & corrupt? 
@TuckerCarlson @seanhannity _E_ Just finished speaking in Jacksonville Florida. Incredible crowd fantastic people. Thank you! _E_ Rio de Janeiro joins the @TrumpCollection in 2016. It's going to be a spectacular hotel! __HTTP__ _E_ This George Zimmerman is really a mess he really has to just disappear! (He attacked his wife last night). _E_ Congratulations to Woody Johnson and @nyjets on acquiring @TimTebow.@TimTebow is not only a winner but a leader. (cont) __HTTP__ _E_ Looking forward to the 2010 Miss USA Pageant Sunday May 16 on NBC 7 p.m. ET hosted by Curtis Stone & Natalie Morales live from Las Vegas. _E_ Thank you Oregon! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Via CNET: Donald Trump Bests Jeb Bush in Website Performance Experts Say __HTTP__ _E_ Only 109 people out of 325000 were detained and held for questioning. Big problems at airports were caused by Delta computer outage..... _E_ ObamaCare is dead and the Democrats are obstructionists no ideas or votes only obstruction. It is solely up to the 52 Republican Senators! _E_ Apple is finally considering a large screen for the I Phone they better get moving fast. When I told them to do this last year they scoffed _E_ Thoughts and prayers for those in the floods affecting the great people of South Carolina. _E_ Will be on @foxandfriends at 7:02 A.M. Enjoy. _E_ I certainly hope the Democrats do not force Nancy P out. That would be very bad for the Republican Party and please let Cryin' Chuck stay! _E_ N.Y.Times headline states Obama suffers setbacks in Japan trade deal. Can somebody please tell him that with all they sell us WE HAVE CARDS _E_ .@mcuban has less TV persona than any other person I can think of. He's an arrogant crude dope who met some very stupid people... _E_ With few exceptions only really smart people are able to make a lot of money. Hard work is also important but brains will supersede. _E_ A terrible deal with Iran! 
__HTTP__ _E_ Egypt's Muslim Brotherhood just made its first visit to Hamas led Gaza. Why did @BarackObama promote the Arab Spring ? _E_ My @gretawire int. on Leon Panetta's critique of Obama Ebola rise of ISIS Obama's lack of common sense & 2016 __HTTP__ _E_ "Positive thinking is not merely wishful thinking... _E_ Dress for success. The Donald J. Trump Signature Collection exclusively available @Macys.com __HTTP__ _E_ Interestingly the hurricane may now be a disaster for Obama's reelection because of his grandstanding. _E_ A great honor from somebody that knows how to win! __HTTP__ _E_ Great to hear that @nfl legend and hall of famer John Elway has endorsed @MittRomney in Colorado. CO is a must win state for Mitt. _E_ .@TheJuanWilliams you never speak well of me & yet when I saw you at Fox you ran over like a child and wanted a picture. Please share pic! _E_ Congratulations to our great resident of Chicago Trump Tower Patrick Kane @88PKane for the #StanleyCup win & winning MVP of series. _E_ With @stuartpstevens expected to represent @GovChristie in the Presidential race Chris will have a very hard time winning. _E_ ....your release possible and HAVE A GREAT LIFE! Be careful there are many pitfalls on the long and winding road of life! _E_ We need your vote. Go to the POLLS! Let's continue this MOVEMENT! Find your poll location: __HTTP__ __HTTP__ _E_ .@serenawilliams is a special player. After winning the Gold for the US in the Olympics it looks like she will (cont) __HTTP__ _E_ They are great people! __HTTP__ _E_ Congratulations to @spurs on their @NBA championship. Well deserved. _E_ I will be interviewed by @donlemon tonight on @CNN at 10PM. _E_ .@bwilliams knows that I think his newscast has become totally boring so he took a shot at me last night. _E_ The passage of the @DeptVetAffairs Accountability and Whistleblower Protection Act is GREAT news for veterans! I lo... 
__HTTP__ _E_ Trump: Obama is 'Unlucky President' __HTTP__ via @Newsmax_Media _E_ If traveling to the Windy City to celebrate 100th anniversary of Wrigley Field @TrumpChicago is Chicago's #1 hotel __HTTP__ _E_ .@SenMikeLee refuted every point Karl 1.6% Rove made on the need to defund ObamaCare.Must listen __HTTP__ @TheRightScoop _E_ Was with great people last night in Fort Myer Virginia. The future of our country is strong! _E_ #Trump2016 #TrumpInstagram: __HTTP__ __HTTP__ _E_ Do you notice that the polling establishment doesn't put me in polls but put in folks who hardly register. MAKE AMERICA GREAT AGAIN! _E_ Often times being 'innovative' is simply putting together pre existing elements into something new. Be resourceful & expect success. _E_ STATEMENT IN RESPONSE TO PRESIDENT OBAMA'S FAILED LEADERSHIP: __HTTP__ _E_ The economy cannot take four more years of these same failed policies.#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_ Thank you Willie Robertson! #VoteTrump #MakeAmericaGreatAgain __HTTP__ _E_ You miss 100% of the shots you don't take. Wayne Gretzky _E_ Perhaps Miss USA can lure Snowden back? _E_ DACA is probably dead because the Democrats don't really want it they just want to talk and take desperately needed money away from our Military. _E_ I never quit trying. I never felt that I didn't have a chance to win. Arnold Palmer @KingdomMag _E_ My @Gretawire interview where I discuss why @BarackObama is an economic ignoramus and how OPEC is inflating gas prices. __HTTP__ _E_ Via @digitaljournal: Donald Trump tweets Obama is 'an incompetent President' __HTTP__ _E_ Great evening with the @AmSpec & the T. Boone Pickens Entrepreneur Award. Amazing crowd—thank you! _E_ .@MissUSA Erin Brady is doing a fantastic job representing Trump Miss USA. Smart gorgeous a really positive force! _E_ Any and all weather events are used by the GLOBAL WARMING HOAXSTERS to justify higher taxes to save our planet! They don't believe it $$$$! 
_E_ I hope you can go to @oreillyfactor and vote for Donald Trump in order to Make America Great Again! Thanks. _E_ Thank you Rep. Collins! #Trump2016 __HTTP__ _E_ Time Warner cable out AGAIN in Manhattan no television. They have a real problem! _E_ Via @ConservReview by @JeffJlpa1: Why Donald Trump is Right __HTTP__ _E_ "Polling strong Donald Trump starting to get serious" __HTTP__ via @bostonherald by @JaclynCashman _E_ Via @lohud by @hoopsmbd: "Buzz builds for @TrumpFerryPoint" __HTTP__ _E_ Trump Towers Istanbul Sisli will be one of the country's top landmarks __HTTP__ _E_ Dopey @GeorgeWill the most overrated political pundit in the business continues to downgrade the Republican (cont) __HTTP__ _E_ Republicans had all the cards but not the guts to make a great deal! _E_ How does this cast look to you? Pretty amazing. #CelebApprentice _E_ The ratings for the Republican National Convention were very good but for the final night my speech great. Thank you! _E_ I'm going to be live with @ericbolling and @kimguilfoyle to ring in the New Year 2016. Everybody should tune in to @foxnews tomorrow night! _E_ Unprecedented success for our Country in so many ways since the Election. Record Stock Market Strong on Military Crime Borders & ISIS Judicial Strength & Numbers Lowest Unemployment for Women & ALL Massive Tax Cuts end of Individual Mandate and so much more. Big 2018! _E_ Prime Minister @David_Cameron is very foolish in giving @AlexSalmond so much money to build wind turbines which r destroying Scotland. _E_ Watch my interview on @CBSNews Face The Nation now and also the new CBS POLLS which if good for me the media won't report! _E_ The Wall is a very important tool in stopping drugs from pouring into our country and poisoning our youth (and many others)! If _E_ While Hillary profits off the rigged system I am fighting for you! Remember the simple phrase: #FollowTheMoney... __HTTP__ _E_ Do you believe @algore is blaming global warming for the hurricane? 
_E_ Congrats @JanineTurner on new book A Little Bit Vulnerable you're a breath of fresh air in the political forum __HTTP__ _E_ The hedge fund guys (gals) have to pay higher taxes ASAP. They are paying practically nothing. We must reduce taxes for the middle class! _E_ What's with this rap stuff with me and Ebenezer Scrooge? __HTTP__ _E_ The Chinese want to steal our jobs and technology that includes so called green energy which they make but (cont) __HTTP__ _E_ Join @TeamTrump on Facebook & watch tonight's rally from Geneva Ohio our 3rd rally of the day. #AmericaFirst #MAGA __HTTP__ _E_ With the $635 million dollar website fiasco getting caught tapping phones of WORLD LEADERS and so much more U.S. is looking really stupid! _E_ We are taking care of hundreds of people in the Trump Tower atrium they are seeking refuge. Free coffee and food. _E_ .@MacMiller's Donald Trump just hit 60 million hits. Maybe I should go into a new business. _E_ Thoughts & prayers with the millions of people in the path of Hurricane Matthew. Look out for neighbors and listen... __HTTP__ _E_ Talent is cheaper than table salt. What separates the talented individual from the successful one is a lot of hard work. Stephen King _E_ MAKE AMERICA GREAT AGAIN!#AmericaFirst #Trump2016 __HTTP__ _E_ .@TrumpGolfLA has panoramic Pacific Ocean views features a 7242 yard public course designed by Pete Dye __HTTP__ _E_ Congratulations to @nyknicks on winning their first Atlantic Division title since 1994. @carmeloanthony is a great New Yorker and Knick! _E_ .@KarlRove stated clearly that he wants to repeal the 2nd Amendment. I thought @FoxNews was going to fire that jerk after his Romney fiasco? _E_ .@GolfMonthly re: my Scottish course "Quite simply this is not the best new links course in the UK it is the best links course full stop _E_ Today I signed the Holocaust Remembrance Proclamation: __HTTP__ #ICYMI My statement last night at... __HTTP__ _E_ My parents: Trust in God and be true to yourself. 
Mary MacLeod Trump Know everything you can about what you're doing. Fred C. Trump _E_ Great poll numbers all over and beating Hillary Clinton one on one. Thank you! _E_ It is time for Iran to face serious consequences. This regime is a threat to our national security. _E_ .@CNBC has just agreed that the debate will be TWO HOURS. Fantastic news for all especially the millions of people who will be watching! _E_ Don't like @SamuelLJackson's golf swing. Not athletic. I've won many club championships. Play him for charity! _E_ The Democratic National Committee would not allow the FBI to study or see its computer info after it was supposedly hacked by Russia...... _E_ NEW @MittRomney TV AD Dream For these small businesses hope and change was not so kind: __HTTP__ #tcot _E_ Why are Democrats fighting massive tax cuts for the middle class and business (jobs)? The reason: Obstruction and Delay! _E_ I hope we never find life on other planets because there's no doubt that the U.S. Government will start sending them money! _E_ The ratings for the Celebrity Apprentice were fantastic and everyone had a great time. It was a terrific season congrats to everyone! _E_ A Rod @Yankees had hip surgery & will be out 6 months. Do you notice all the "druggies" have bad hips. _E_ Turnberry one of the most beautiful places in the world.... soon to be Trump Turnberry a Luxury... __HTTP__ _E_ RT @DRUDGE_REPORT: GREAT AGAIN: +235000 __HTTP__ _E_ Thank you @JeffJlpa1 and @AmSpec for the wonderful and very true article "Total Desperation on Iran" __HTTP__ _E_ I hear that dopey political pundit Lawrence O'Donnell one of the dumber people on television is about to lose his show no ratings?Too bad _E_ I'd bet the lawyers for the Central Park 5 are laughing at the stupidity of N.Y.C. when there was such a strong case against their clients _E_ Scary – in the past 90 days Obama has set over 6125 regulatory burdens __HTTP__ Terrible for the economy. _E_ Thanks. 
__HTTP__ _E_ Ted Cruz attacked New Yorkers and New York values we don't forget! __HTTP__ _E_ Standing with Jamiel Shaw Sabine Durdin Don Rosenberg Lupe Moreno Brenda Sparks Robin Hvidston & their spouses. __HTTP__ _E_ Crooked Hillary Clinton will be a disaster on jobs the economy trade healthcare the military guns and just about all else. Obama plus! _E_ Via @USNewsTravel: "Best New York City Hotels: @TrumpNewYork" __HTTP__ _E_ Muhammad Ali is dead at 74! A truly great champion and a wonderful guy. He will be missed by all! _E_ Negotiation tip: View any conflict as an opportunity this will expand your mind as well as your horizons. Persistence can go a long way. _E_ Word is spreading that I got a tattoo no way I am not a fan! _E_ Obama sent weapons through Benghazi to ISIS yet he is holding up shipments to Israel. What is he thinking? _E_ The Patch a total loser for @AOL will be a good deal compared to @HuffingtonPost. @ariannahuff laughs at "stupid" Armstrong! _E_ As I have always said let ObamaCare fail and then come together and do a great healthcare plan. Stay tuned! _E_ Thank you Ohio! #AmericaFirst __HTTP__ _E_ Remember THE HARDER YOU WORK THE LUCKIER YOU GET! _E_ Happy New Year to all including to my many enemies and those who have fought me and lost so badly they just don't know what to do. Love! _E_ Saudis just cut oil supplymaking prices rise "immediately" while we are fighting ISIS for them __HTTP__ What are we doing! _E_ "It is hard to fail but it is worse never to have tried to succeed." Theodore Roosevelt _E_ The U.S. once again condemns the brutality of the North Korean regime as we mourn its latest victim. Video: __HTTP__ _E_ Thank you so many people have given me credit for winning the debate last night. All polls agree. It was fun and interesting! _E_ The latest update on Bret Michaels is that he's making every effort to attend the live finale of Celebrity Apprentice on Sunday so tune in! 
_E_ Great piece by @EWErickson @RedState exposing how Karl 1.6% Rove cooked a poll in support of ObamaCare __HTTP__ _E_ Thank you! #VoteTrump __HTTP__ _E_ .@CNN @jaketapper at 9:00 A.M. _E_ Keep talking about me: use #TrumpRoast to tweet about how good I look on @ComedyCentral tonight at 10:30/9:30c __HTTP__ _E_ Tonight I will be signing copies of #TimeToGetTough in Westbury at Costco 1250 Old Country Rd from 6 pm to 8 pm _E_ Thank you Iowa! #ImWithYou __HTTP__ _E_ .@FoxNews will be re running Objectified: Donald Trump the ratings hit produced by the great Harvey Levin of TMZ at 8:00 P.M. Enjoy! _E_ Congratulations to @MittRomney on Tuesday night's sweep. He also delivered a 'Killer Speech' __HTTP__ _E_ I love that in addition to everything else so much money is raised for such great causes on Celebrity Apprentice all proud of that! _E_ My @gretawire interview discussing @BarackObama's USC comments insurance premiums @SarahPalinUSA on the (cont) __HTTP__ _E_ When you think big you will automatically trigger more details because details are the major component of making anything big. _E_ Via @eonline by @BrettMalec: "2014 @MissUniverse Contestants" __HTTP__ _E_ RT @TeamTrump: RT if you believe @HillaryClinton is the one who owes America an apology! #BigLeagueTruth #Debates __HTTP__ _E_ Lying traitor Snowden now claims that he did not give any information to the Russians or Chinese. Why doesn't he come home then? _E_ I can't believe that Mitt Romney would run for president again. He had his chance and blew it in the last weeks of the race. _E_ Why is Douglas Durst allowed to use the World Trade Center to get out of a lease with Conde Nast? _E_ Another new poll. Thank you for your support! Join the MOVEMENT today! #ImWithYou __HTTP__ __HTTP__ _E_ not anymore. The beginning of the end was the horrible Iran deal and now this (U.N.)! Stay strong Israel January 20th is fast approaching! 
_E_ We can create jobs in the American economy by protecting our own manufacturing sector. _E_ Budget that just passed is a really big deal especially in terms of what will be the biggest tax cut in U.S. history MSM barely covered! _E_ Thank you @JerryJrFalwell will see you soon. #TrumpPence16 __HTTP__ _E_ .@HillaryClinton channels John Kerry on trade: she was for bad trade deals before she was against them. #TPP #Debates2016 _E_ Goofy Elizabeth Warren and her phony Native American heritage are on a Twitter rant. She is too easy! I'm driving her nuts. _E_ Why is it that the horrendous protesters who scream curse punch shut down roads/doors during my RALLIES are never blamed by media? SAD! _E_ .kimguilfoyle great job tonight! _E_ More of my #TRUMPTUESDAY @SquawkCNBC interview discussing how the US gets killed negotiating with other countries __HTTP__ _E_ Trump Int'l Hotel & Tower New York includes Central Park views & our signature restaurant Jean Georges. Perfection! __HTTP__ _E_ Lance Armstrong just got sued by the Federal Government they want their money back I told you so! What was he thinking when he did that int? _E_ Entrepreneurs: See yourself as victorious. Look at the solution not the problem. And never give up! _E_ RT @FoxNews: Jobs added during @POTUS' time in office. __HTTP__ _E_ #CelebApprentice Photo from last night's boardroom. __HTTP__ _E_ RT @Reince: With a strong candidate in @POTUS & @GOP revolutionary data program Republicans carried WI for 1st time in 30 years __HTTP__ _E_ Marine Plane crash in Mississippi is heartbreaking. Melania and I send our deepest condolences to all! _E_ It is my opinion that many of the leaks coming out of the White House are fabricated lies made up by the #FakeNews media. 
_E_ RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_ My robocall on behalf of @MittRomney playing across the state of Michigan __HTTP__ _E_ President Obama our great leader wants to declare martial law in New York City as a means of helping out with the massive storm. _E_ Hagel's performance yesterday was the worst I have ever seen before a committee of any kind! _E_ Our great project in South America Trump Tower Punta Del Este in Uruguay will have spectacular views and the... __HTTP__ _E_ Congratulations to @MittRomney on getting the @DMRegister @NewYorkPost @NewYorkObserver & @NashuaTelegraph endorsements! _E_ Huckabee is a nice guy but will never be able to bring in the funds so as not to cut Social Security Medicare & Medicaid. I will. _E_ With an elite course designed by @SharkGregNorman @Trump_Charlotte is North Carolina's most desirable club __HTTP__ _E_ WRONG: A China court ordered @apple to pay $60M to a Chinese company that registered iPad before @apple __HTTP__ _E_ One of the things that has been lost in the politics of this situation is that the Russians collected and spread negative information..... _E_ The Senate should be more concerned about actually passing a budget than spreading lies about @MittRomney's taxes. _E_ Yesterday was a referendum on ObamaCare & all other Obama fiascos. Republicans can now rein him in. _E_ Logic will get you from A to B. Imagination will take you everywhere. Albert Einstein" _E_ Honored to have received the endorsement of Lou Holtz a great guy! #INPrimary #Trump2016 __HTTP__ _E_ Ted Cruz went down big in just released Reuters poll what's going on? Is it Goldman Sachs/Citi loans or Canada? _E_ Rush is right. @limbaugh and I have both created more jobs than @BarackObama...in fact far more jobs! _E_ Tuesday will be a big day for our country to do a complete turnaround. MAKE AMERICA GREAT AGAIN! 
_E_ A message to my fellow Americans#IrmaHurricane2017 __HTTP__ __HTTP__ __HTTP__ _E_ The ONLY bad thing about winning the Presidency is that I did not have the time to go through a long but winning trial on Trump U. Too bad! _E_ Why is that Hillary Clintons family and Dems dealings with Russia are not looked at but my non dealings are? _E_ .@politico covers me more inaccurately than any other media source and that is saying something. They go out of their way to distort truth! _E_ My @FoxNews @megynkelly int. on why I am considering running for POTUS negotiations & making America great again __HTTP__ _E_ With the great vote on Cutting Taxes this could be a big day for the Stock Market and YOU! _E_ I am no fan of Bill Cosby but never the less some free advice if you are innocent do not remain silent. You look guilty as hell! _E_ Immigration reform is fine—but don't rush to give away our country! Sounds like that's what's happening. _E_ There's nothing like fall in #NewYorkCity. See where @TrumpCollection recommends you take in the season's beauty: __HTTP__ _E_ without retribution or consequence is WRONG! There will be a tax on our soon to be strong border of 35% for these companies ...... _E_ We were let down by all of the Democrats and a few Republicans. Most Republicans were loyal terrific & worked really hard. We will return! _E_ Don't let the fake media tell you that I have changed my position on the WALL. It will get built and help stop drugs human trafficking etc. _E_ Obama said not optimal to Ambassador & embassy killings bad word usage for a Harvard graduate. _E_ See you tonight Huntington West Virginia!#MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_ .@Franklin_Graham: Great job on @foxandfriends this morning. You beautifully stated what most people are thinking! Say hi to all. 
_E_ Via CNN: Trump now leads in odds to win GOP nomination __HTTP__ _E_ Nancy Pelosi and Fake Tears Chuck Schumer held a rally at the steps of The Supreme Court and mic did not work (a mess) just like Dem party! _E_ Will be interviewed on the @oreillyfactor tonight at 8:00 P.M. Will be talking about the debate and more! _E_ The writer of the now proven false story in the @nytimes Michael Barbaro who was interviewed on CBS this morning was unable to respond. _E_ No matter how far down a path you go if it's the wrong path turn around and go back home before it is too late. _E_ Mar a Lago is Florida's most lavish and exclusive private club and spa with world class amenities __HTTP__ _E_ Departing NH now great morning with record crowd in Portsmouth in a snow storm! Thank you! __HTTP__ __HTTP__ _E_ You must be kidding zero chance he is innocent! _E_ Be sure to tune in for Melania's second QVC show for Melania Timepieces & Jewelry tonight live from 9 10 pm on QVC __HTTP__ _E_ RT @IvankaTrump: Check out my May Redbook magazine cover. Very exciting! #Redbook __HTTP__ _E_ A rough night for Hillary Clinton ABC News. _E_ Pay attention to global news and developments in today's world that is a requirement not an elective. _E_ Join me in San Jose California tonight!#MakeAmericaGreatAgain #Trump2016Tickets: __HTTP__ __HTTP__ _E_ Explain to @brithume and @megynkelly who know nothing that I will beat Hillary and win states (and dem indie votes) that no other R can! _E_ #CrookedHillary #PayToPlay __HTTP__ _E_ If the U.S. Government doesn't give the money necessary for the burials of our military personnel I will.The U.S. under Obama's leadership! _E_ She's back! Champion @Joan_Rivers returns to the boardroom in this year's All Star @ApprenticeNBC. Joan is ferocious. _E_ A clip from last night's @Late_Show where I detail my charitable offer to Obama and Dave describes his terrible grades __HTTP__ _E_ In making any decision you need all the facts. 
But after exhausting all due diligence in the end you have to go with your gut! _E_ Thank you @CarlHigbie. Great work on @CNN. #Trump2016 _E_ President Obama and other world leaders don't know how close they were to being seriously injured (or worse) standing next to psycho in SA. _E_ Sorry folks but Donald Trump is far richer and much better looking than dopey @mcuban! _E_ #ICYMI: Joint Statement with Prime Minister Shinzo Abe on North Korea. __HTTP__ _E_ Remember but for Conservatives Bush would have given us not only Roberts but also Harriet Miers. Face it Bush was terrible! _E_ It would have been much easier for me to win the so called popular vote than the Electoral College in that I would only campaign in 3 or 4 _E_ "@DonaldJTrumpJr: 'We want to build everything in Dubai" __HTTP__ via @CWO_dotcom _E_ Thank you for your support at this mornings Town Hall in Salem New Hampshire. #FITN #NHPrimary __HTTP__ _E_ The Spa @TrumpWaikiki offers unique treatments that use traditional Hawaiian botanicals & healing techniques __HTTP__ _E_ My @SquawkCNBC interview discussing interest rates the deficit @RepPaulRyan's timing @TimTebow and the Doral __HTTP__ _E_ That's what I find so morally offensive about welfare dependency: it robs people of the chance to improve. Work (cont) __HTTP__ _E_ TIME #DebateNight poll over 800000 votes. Thank you! #AmericaFirst #MAGA __HTTP__ _E_ RT @AnnCoulter: RUMSFELD: Trump has a touched a nerve in our country...in a way that most politicians have not been able to do. __HTTP__ _E_ John @CahillForAG is one of the most respected people in politics. Dopey @AGSchneiderman is one of the least respected! _E_ Great op ed from @RepKenBuck. Looks like some in the Freedom Caucus are helping me end #Obamacare. __HTTP__ _E_ Lots of great new polls big leads! __HTTP__ __HTTP__ __HTTP__ _E_ Obama our Welfare & Food Stamp President is praising himself for expanding welfare __HTTP__ He doesn't believe in work. 
_E_ I am very supportive of the Senate #HealthcareBill. Look forward to making it really special! Remember ObamaCare is dead. _E_ RT @Realjmannarino: @realDonaldTrump The ungratefulness is something I've never seen before. If you get someone's son out of prison he sho... _E_ Entrepreneurs: Set the example. You can motivate others as well as yourself by remembering you are setting the example. _E_ For all of those that were hoping I was wrong and this is a very unimportant subject to me Dwight Howard just officially announced Houston _E_ The only place success comes before work is in the dictionary. Vince Lombardi _E_ People must remember that ObamaCare just doesn't work and it is not affordable 116% increases (Arizona). Bill Clinton called it CRAZY _E_ .@CharlesGKoch is looking for a new puppet after Governor Walker and Jeb Bush cratered. He now likes Rubio next fail. _E_ Looking forward to speaking at the @ARGOP Reagan Rockefeller Dinner tonight! Record crowd. We are no longer silent! #MAKEAMERICAGREATAGAIN! _E_ RT @hughhewitt: #NeverTrumpers elite MSMers and virtue signalers are persuading themselves that @realDonaldTrump supporters are deserting.... _E_ "Learn work and think in equal proportions and you'll be going in the right direction." – Think Like a Champion _E_ Do executives at @msnbc know that the business of TV centers on viewers & ratings? @msnbc is #19 on cable __HTTP__ Sad. _E_ I had a great time today visiting Facebook NY. __HTTP__ _E_ It's Monday how many more excuses will Obama make today about the economy? _E_ .@SouthJerseyMag "According to the Pros" just named Trump National Golf Club Philadelphia the #1 private club. Thanks! _E_ I'm really saddened to see that @Cher was voted "the 4th ugliest celebrity" according to @listverse.... _E_ We are leaving Iraq after expending a tremendous amount of blood and treasure. We should be reimbursed with oil! Don't give it to Iran. 
_E_ Watch Late Night with Jimmy Fallon on NBC at 12:35 EST tonight I'll be bringing a couple of surprises with me. _E_ Business is no place for stream of consciousness babbling. Keep it short fast and direct. Think Like a Champion _E_ NY State Republican Party must unify or November will be another disaster. _E_ .@MichaelRCaputo Thank you for all of your support you have been amazing! _E_ People who lost money when the Stock Market went down 350 points based on the False and Dishonest reporting of Brian Ross of @ABC News (he has been suspended) should consider hiring a lawyer and suing ABC for the damages this bad reporting has caused many millions of dollars! _E_ After settling for a ridicilous 13 billion dollars J.P.Morgan's lawyer is critical of the amount of the fine why did they settle then DUMB! _E_ Terrific response to my previous tweet: I'll be in Dallas at the American Airlines Center on Sept 14th at 6 PM. __HTTP__ ... _E_ Hypocrite! in '06 @BarackObama called private equity the best opportunity for long term economic vitality __HTTP__ _E_ Emmys telecast is way down & lowest telecast on record among young adults. Emmys have no credibility Should have nominated Apprentice again! _E_ #CelebrityApprentice ranked #1 among ABC CBS and NBC in all key demos from 10 11PM. It won the 10PM hour by a 53% margin in 18 49 rating. _E_ If crazy @megynkelly didn't cover me so much on her terrible show her ratings would totally tank. She is so average in so many ways! _E_ West Virginia was incredible last night. Crowds and enthusiasm were beyond GDP at 3% wow!Dem Governor became a Republican last night. _E_ : @realDonaldTrump @HelpUServe When we have people eating out of trash cans in this country we have no business helping any other country _E_ I will be going to Sarasota Florida today for a big rally with amazing people! I have one goal on mind: MAKE AMERICA GREAT AGAIN! _E_ Made a speech in Arkansas last night before a record GOP crowd. Great spirit and amazing people. 
MAKE AMERICA GREAT AGAIN! _E_ Jeffrey Lord @AmSpec—Thank you for the presentation—terrific job! _E_ I am making a major speech in West Palm Beach Florida at noon. Tune in! _E_ We just had the worst jobs report since 2010. _E_ The ObamaCare enrollment numbers are a lie.They will be 'readjusted' by the White House at an opportune time probably after '14 election _E_ We are one nation. When one state hurts we all hurt. We must all work together to lift each other up. __HTTP__ _E_ HAPPY BIRTHDAY to my son @DonaldJTrumpJr! Very proud of you! #TBT __HTTP__ __HTTP__ _E_ There is only one way to avoid criticism: do nothing say nothing and be nothing. – Aristotle _E_ Win a dinner with @MittRomney and me in New York this June 28th. It's selling like hotcakes! __HTTP__ _E_ Bob Corker who helped President O give us the bad Iran Deal & couldn't get elected dog catcher in Tennessee is now fighting Tax Cuts.... _E_ Leaving South Korea now heading to China. Looking very much forward to meeting and being with President Xi! _E_ Mitt Romney had his chance and blew it. Lindsey Graham ran for president got ZERO and quit! Why are they now spokesmen against me? Sad! _E_ Crooked Hillary Clinton knew everything that her servant was doing at the DNC they just got caught that's all! They laughed at Bernie. _E_ As families prepare for summer vacations in our National Parks Democrats threaten to close them and shut down the government. Terrible! _E_ .CNN & @CNNPolitics Lawyer Elizabeth Beck did a terrible job against me she lost (I even got legal fees). I loved beating hershe was easy _E_ Remember when Jeb gave Hillary a medal on the 1 year anniversary of Benghazi?! __HTTP__ Guess he would have invaded Libya too! _E_ Congratulations to the House of Representatives for passing the #TaxCutsandJobsAct — a big step toward fulfilling our promise to deliver historic TAX CUTS for the American people by the end of the year! __HTTP__ _E_ The Mexican legal system is corrupt as is much of Mexico. 
Pay me the money that is owed me now and stop sending criminals over our border _E_ If Michael Bloomberg ran again for Mayor of New York he wouldn't get 10% of the vote they would run him out of town! #NeverHillary _E_ China is building 50 brand new airports while our country continues to rott! Very sad. _E_ Congratulations to @KingJames on winning Athlete of the Year in last night's @ESPYS. LeBron is also a great guy! _E_ I'm thrilled to announce that my new tailored clothing line has officially launched at Macy's. In business it'... (cont) __HTTP__ _E_ I will be at the Cadillac World Golf Championship @TrumpDoral in Miami tomorrow! Rory Phil Bubba Adam and Dustin all at the top! _E_ If ObamaCare is hurting people & it is why shouldn't it hurt the insurance companies & why should Congress not be paying what public pays? _E_ Thank you! #Trump2016 __HTTP__ _E_ RT @DeptofDefense: VIDEO: Elements of the #DoD and @FEMA are providing humanitarian relief for #PuertoRico and #USVI 🇻 . __HTTP__ _E_ Placing the ball in the right position for the next shot is eighty percent of winning golf. Ben Hogan _E_ My @CNBCClosingBell interview discussing America's financial uncertainty due to @BarackObama and the job report __HTTP__ _E_ Policy towards our enemies: Hit them hard hit them fast hit them often & then tell them it was because they are the enemy! _E_ You must promise that you will never cheat off Manti Te'o's test papers. _E_ ...goodwill and friendship was formed but only time will tell on trade. _E_ The FAKE NEWS media (failing @nytimes @NBCNews @ABC @CBS @CNN) is not my enemy it is the enemy of the American People! _E_ ObamaCare is imploding. It is a disaster and 2017 will be the worst year yet by far! Republicans will come together and save the day. _E_ Twitter is on @BarackObama's enemies list __HTTP__ _E_ While Putin is scheming and beaming on how to take over the World President Obama is watching March Madness (basketball)! 
_E_ RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_ Wow big lines in Kansas. _E_ The Republicans should NOT give @BarackObama the authority to raise the debt another $1.2Trillion (cont) __HTTP__ _E_ .@lancearmstrong really blew it went down in flames too bad! _E_ Good advice from my mother: Trust in God and be true to yourself. Mary Trump _E_ It was a great honor to welcome Prime Minister Najib Abdul Razak of Malaysia and his distinguished delegation to the @WhiteHouse today! __HTTP__ _E_ Why aren't the same standards placed on the Democrats. Look what Hillary Clinton may have gotten away with. Disgraceful! _E_ .@CNN should stop apologizing for the mistake they made the other day & get back to reporting! _E_ My @FoxNews interview with @gretawire discussing how @BarackObama is delusional and how a 3rd party candidate can win. __HTTP__ _E_ Being at the Army Navy Game was fantastic. There is nothing like the spirit in that stadium. A wonderful experience and congrats to Army! _E_ Just did Howard Stern Show great time. Now doing The Today Show with Ivanka. ENJOY! _E_ What are Hillary Clinton's people complaining about with respect to the F.B.I. Based on the information they had she should never..... _E_ Which National Costume do you think should win? __HTTP__ _E_ Congratulations to our Olympic team for by far winning the most medals including first place gold. _E_ It's freezing and snowing in New York we need global warming! _E_ Polls close at 6pm! #INPrimary #Trump2016 #VoteTrump __HTTP__ _E_ .@foxandfriends in five minutes! _E_ After consultation with my Generals and military experts please be advised that the United States Government will not accept or allow...... _E_ .@FLOTUS & I were honored to host our first WH Congressional Picnic. A wonderful evening & tradition. @MarineBand:... 
__HTTP__ _E_ The Trump Signature Collection exclusively available at @Macys is the pinnacle of style and prestige __HTTP__ _E_ Pathetic @BarackObama is 'sweetening' his offer to the Taliban __HTTP__ Read 'The Art of The Deal.' _E_ Great going @themichellewie –you showed the world that all of that amazing talent is for real. We love you at Trump Jupiter @TNGCJ _E_ .@DonaldJTrumpJr's @CNBC interview discussing the starving demand that is fueling high end luxury __HTTP__ _E_ The safest way to preserve Medicare is with a robust and vibrant economy. We should lower corporate and capital gain taxes immediately. _E_ Clinton is trying to wash away her bad judgement call on BREXIT with big dollar ads. Disgraceful! _E_ I had a great time doing press interviews with @LisaLampanelli and @Teresa_Giudice earlier today __HTTP__ _E_ 70 years ago today the National Security Council met for the first time. Great history of advising Presidents then & now! Thanks NSC Staff! _E_ Congratulations to the 2016 #StanleyCup Champions Pittsburgh @Penguins! _E_ I will be in beautiful Burlington Vermont tonight for a rally. Will be great fun. MAKE AMERICA GREAT AGAIN! _E_ Canada will now sell its oil to China because @BarackObama rejected Keystone. At least China knows a good deal when they see it. _E_ Very good speech by @MichelleObama and under great pressure Dems should be proud! _E_ Why was @BarackObama selling guns to Mexican drug dealers? _E_ Trump Turnberry is a spectacular place and home to four of the greatest Open Championships of all time. __HTTP__ _E_ Join us via our new #AmericaFirst APP! #TrumpPence16 __HTTP__ __HTTP__ _E_ Thank you Pennsylvania! Going to New Hampshire now and on to Michigan. Watch PA rally here: __HTTP__ __HTTP__ _E_ Paul Ryan said that I inherited something very special the Republican Party. Wrong I didn't inherit it I won it with millions of voters! _E_ Congratulations to @Yankees Derek Jeter on being named to 2014 @MLB @AllStarGame! 
_E_ I don't like the opening even a little bit! _E_ Today is Donald Trump's Birthday! Send him your B'day wishes here: __HTTP__ _E_ Do your homework. Wasting other people's time due to poor planning or thoughtlessness leaves a bad impression. – Think Like a Billionaire _E_ Congress now has 6 months to legalize DACA (something the Obama Administration was unable to do). If they can't I will revisit this issue! _E_ Will be interviewed on @foxandfriends at 8:30 A.M. Eastern. ENJOY! _E_ Wow one of the all time greats in fashion OSCAR DE LA RENTA has just died at 82. Great fashion achievements but also a really nice guy! _E_ Via @dcexaminer by @rebeccagberg: "Trump: 'I'm the only one who can beat' Hillary" __HTTP__ _E_ Last week's episode of the Celebrity Apprentice set the stage for a great new season. Tune in this Sunday on NBC for even more excitement. _E_ I have over seven million hits on social media re Crooked Hillary Clinton. Check it out Sleepy Eyes @MarkHalperin @NBCPolitics _E_ Obama has no problem leaking national security secrets. Why can't he release his records? Especially when $5M is going to charity. _E_ Via @reason: Donald Trump: I Can Fix America __HTTP__ _E_ ....we need to keep America safe including moving away from a random chain migration and lottery system to one that is merit based. __HTTP__ _E_ To be really successful it is always good to have A COOL HEAD WARM HEART AND BEAUTIFUL COMMON TOUCH! _E_ My @FoxNews interview on @gretawire discussing The China Curse __HTTP__ _E_ Yes Arnold Schwarzenegger did a really bad job as Governor of California and even worse on the Apprentice...but at least he tried hard! _E_ Join me in Indianapolis Indiana tomorrow at 3pm! #Trump2016#MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_ Remember official campaign merchandise (hats apparel etc.) can only be bought at __HTTP__ Be careful don't get ripped off _E_ Marco Rubio would keep Barack Obama's executive order on amnesty intact. See article. 
Cannot be President. __HTTP__ _E_ Big game trophy decision will be announced next week but will be very hard pressed to change my mind that this horror show in any way helps conservation of Elephants or any other animal. _E_ #TrumpVlog @Rosie needs to rest and relax. It's not working. __HTTP__ _E_ Happy Halloween! __HTTP__ _E_ By failing to prepare you are preparing to fail. Benjamin Franklin _E_ First there was the Declaration of Independence then there was the Constitution. Now there is #TimeToGetTough. Available today. _E_ Who would have thought that an @ApprenticeNBC champion would return to compete? @bretmichaels returns to All Star @CelebApprentice... _E_ My new radio ad airing today in Wisconsin! See you soon!#WIPrimary #Trump2016 __HTTP__ _E_ Come on goAngelo don't give up now just because your rally at Macy's drew only eleven people for twenty minutes! I love@ Macy's. _E_ Solyndra's government loan and subsequent bankruptcy prove that @BarackObama is both corrupt and inept. _E_ Trump Int'l Golf Links Ireland in County Clare fronts the Atlantic Ocean & is #1 Resort in Europe/Conde Nast Traveler __HTTP__ _E_ The Democrats have been told and fully understand that there can be no DACA without the desperately needed WALL at the Southern Border and an END to the horrible Chain Migration & ridiculous Lottery System of Immigration etc. We must protect our Country at all cost! _E_ I've sent a 10 wheeler filled with 358 master cases of food and supplies to my hometown of Queens today #TrumpCares _E_ .@RalphGilles of Chrysler should focus on design rather than filthy language not very professional. _E_ Young Entrepeneurs: Think Big Stay Motivated & Always Remain Confident. The Sky is the Limit. _E_ House GOP better get its act together.Defund ObamaCare. Out negotiate on debt ceiling. Form commissions on Benghazi & IRS. No excuses! _E_ It's freezing in New York—where the hell is global warming? 
_E_ Congratulations to @sethmeyers on "Emmy's Rating Tumble" __HTTP__ Just as I predicted Seth bombed! . _E_ Is this true about Univision and Fusion? Wow!?! __HTTP__ _E_ Via @thehill by @timdevaney: Donald Trump: GOP nominee 'can't be Mitt can't be Bush' __HTTP__ _E_ My @Live5News int. with @WilliamLive5 in South Carolina with @citadelgop cadets on my 757 discussing 2016 __HTTP__ _E_ Today the House votes on two crucial bills:#NoSanctuaryForCriminalsAct #KatesLaw Pass these bills & lets... __HTTP__ _E_ Had dinner with @RickPerry last night great guy straight shooter impressive record. _E_ Is it the Neil Patrick Harris show or the Emmy Awards?How was he ever put in this position to start with? CRAZY! _E_ Everytime someone tweets that I wear a wig realize to yourself that you are dealing with them just another sad & lonely hater and loser! _E_ Do you notice that Hillary spews out Jeb's name as often as possible in order to give him status? She knows Trump is her worst nightmare. _E_ Voting for @GovGaryJohnson is voting for Obama don't waste your vote! _E_ Our country must get very strong and very tough and fast before it is too late. We have zero leadership and never WIN! We want victory. _E_ "How much money can you stand to lose? That's how much risk you should assume." – Think Like a Billionaire _E_ .@BarackObama wants to see 10 yrs of @MittRomney's tax returns tell him ok but we want to see your college applications first.' _E_ I hope @TGowdySC does better for Rubio than he did at the #Benghazi hearings which were a total disaster for Republicans & America! _E_ Canada's legal immigration plan starts with a simple and smart question: How will any immigrant applying fo... (cont) __HTTP__ _E_ Remember victims of Hurricane Sandy during Thanksgiving. Many will not be celebrating the holiday in comfort.Their lives are in turmoil! 
_E_ Thank you Northern Mariana Islands!#SuperTuesday #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Join us Monday February 8th @ the Verizon Wireless Arena in Manchester New Hampshire! #FITN #NHPolitics #Trump2016 __HTTP__ _E_ Via @fitsnews: "Donald Trump: John McCain Is 'A Loser'" __HTTP__ _E_ Signing orders to move forward with the construction of the Keystone XL and Dakota Access pipelines in the Oval Off... __HTTP__ _E_ Thank you Indiana! Will be back soon!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ If I am elected President I will immediately approve the Keystone XL pipeline. No impact on environment & lots of jobs for U.S. _E_ Watch What's America Worth? hosted and narrated by me this Sunday at 9PM on @Discovery __HTTP__ _E_ Think. That's the first step. Use all your power to utilize and develop that capability Donald J. Trump __HTTP__ _E_ Crooked Hillary no longer has credibility too much failure in office. People will not allow another four years of incompetence! _E_ RT @SLandinSoCal: @foxandfriends @realDonaldTrump Nothing can stop the #TrumpTrain __HTTP__ _E_ Goofy political pundit George Will spoke at Mar a Lago years ago. I didn't attend because he's boring & often wrong—a total dope! _E_ Mitt Romney had his chance to beat a failed president but he choked like a dog. Now he calls me racist but I am least racist person there is _E_ Coincidence? More than half of @BarackObama's 47 biggest fundraisers have been given administration jobs. __HTTP__ _E_ The Celebrity Apprentice delivers the goods and the puppets Sunday at 9 pm on NBC __HTTP__ _E_ "Trump Gives 'Em Hell" __HTTP__ via @limbaugh _E_ The dummies left Iraq (and Libya) without the oil! _E_ Did you know Donald Trump is on Facebook? __HTTP__ Become a fan today! _E_ We should immediately stop sending our beautiful American tax dollars to countries that hate us and laugh at our President's stupidity! _E_ Always remember that as your success grows you will be asked for more favors. 
Learn how to say 'No.' It is critical. _E_ Death spiral!'Aetna will exit Obamacare markets in VA in 2018 citing expected losses on INDV plans this year' __HTTP__ _E_ 84% of US troops wounded & 70% of our brave men & women killed in Afghanistan have all come under Obama. Time to get out of there. _E_ Wow just saw an ad Cruz is lying on so many levels. There is nobody more against ObamaCare than me will repeal & replace. He lies! _E_ "You have to have confidence in yourself and confidence to know that what you are doing is right." – Think Big _E_ I will not let you down! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ "If you want the best you'd better be the best – in all aspects of business." – Think Like a Billionaire _E_ Obama still will keep all military recruitment centers & bases Gun Free Zones! It has to stop. MILITARY LIVES MATTER! _E_ Hillary Clinton is a major national security risk. Not presidential material! _E_ Well this is it the final debate let's see how it goes. I'll be tweeting live. _E_ RT @EricTrump: #MakeAmericaGreatAgain!!! __HTTP__ _E_ Check out Serta's Counting Sheep (and me) at the Trump International Hotel New York __HTTP__ _E_ My thoughts and prayers are with the victims and families of those affected by two powerful earthquakes in Italy and Myanmar. _E_ Sen. @DavidVitter & @David_Bossie w/@seanhannity __HTTP__ demand 'Congress Live By Your Laws' __HTTP__ _E_ Today's assignment: read Chapter 7 'Trump Tower: The Tiffany Location' of The Art of the Deal. Focus on how I marketed the property. _E_ Obama lied when he said "you can keep your plan" so why would anyone believe his bogus ObamaCare enrollment numbers?! _E_ By popular request I will also be tweeting live during the Vice Presidential debate Thursday night. It will be very interesting I promise. _E_ The United States is considering in addition to other options stopping all trade with any country doing business with North Korea. 
_E_ Of the 9 battleground states we only carried North Carolina. I'm proud of @NCGOP & glad I delivered keynote at their state convention. _E_ Yankees should have dropped A Rod long ago not even bothered with arbitration. They would have saved a fortune! _E_ Going to D.C. for big groundbreaking on Old Post Office site. Will be spectacular new hotel. Lots of jobs! _E_ Disproven and paid for by Democrats "Dossier used to spy on Trump Campaign. Did FBI use Intel tool to influence the Election?" @foxandfriends Did Dems or Clinton also pay Russians? Where are hidden and smashed DNC servers? Where are Crooked Hillary Emails? What a mess! _E_ I like Mexico and love the spirit of Mexican people but we must protect our borders from people from all over pouring into the U.S. _E_ On @FallonTonight with @jimmyfallon at 11:30 PM. Enjoy! _E_ Father's Day is Sunday. Find the perfect gift.Trump Signature Collection is exclusively available @Macys __HTTP__ _E_ I'm sending lots of bottled water out to Staten Island & Long Island. _E_ Little Marco Rubio the lightweight no show Senator from Florida is set to be the puppet of the special interest Koch brothers. WATCH! _E_ Attorney General Bill Schuette will be a fantastic Governor for the great State of Michigan. I am bringing back your jobs and Bill will help _E_ I'm turning down millions of dollars of campaign contributions—feel totally stupid doing so but hope it is appreciated by the voters. _E_ THANK YOU IOWA! Highly respected @OANN @GravisMarketing poll just released. #VoteTrump #IowaCaucus __HTTP__ _E_ The Trump Hotel Collection is currently nominated for Conde Nast Traveler Readers Choice Awards Travel & Leisure and World Travel Awards. _E_ Getting ready to land in Hawaii. Looking so much forward to meeting with our great Military/Veterans at Pearl Harbor! _E_ I spent Friday campaigning with John Kennedy of the Great State of Louisiana for the U.S.Senate. The election is over JOHN WON! 
_E_ The great workers who just completed the skylight at Trump International Hotel D.C. (Old Post Office) __HTTP__ _E_ Obama should work on a ceasefire in Chicago as well as Gaza. _E_ Canadians: My ultra luxury private plane will be featured on Sunday's episode of #MightyPlanes on @DiscoveryCanada don't miss it at 8 ET! _E_ When I think big which is often you can be sure I'm aware of the enormous amount of little things that we will have to account for. _E_ I'm saying that the Tea Party perhaps by another name will soon have another big moment and will be a major factor in victory! _E_ The fastest way we can start saving Social Security is to get Americans back to work. #TimeToGetTough (cont) __HTTP__ _E_ To Jamie Dimon—I love kicking lightweight @AGSchneiderman's ass. Stop settling and fight! _E_ "There is no worse feeling than being trapped in a job you do not enjoy. You have to love what you do." Think Big _E_ My @SquawkCNBC #TrumpTuesday interview discussing how @MittRomney can win the first debate & the last 35 days __HTTP__ _E_ Thank you @FaithandFreedom Coalition! An honor joining you today to discuss our shared values.#RTM2016 #Trump2016 __HTTP__ _E_ "Partner with people who share your values attitude and drive." – Midas Touch with @theRealKiyosaki _E_ "House votes on controversial FISA ACT today." This is the act that may have been used with the help of the discredited and phony Dossier to so badly surveil and abuse the Trump Campaign by the previous administration and others? _E_ Interview with @LouDobbs coming up at 7pmE on @FoxBusiness. Enjoy! __HTTP__ _E_ RT @foxandfriends: Head of the NYPD union slams Mayor de Blasio for skipping vigil for assassinated cop Miosotis Familia __HTTP__ _E_ ...So far he has been a complete failure at doing so. He should read The Art of the Deal and use his energy to focus on a new career. _E_ .@GovernorPerry in my office last cycle playing nice and begging for my support and money. Hypocrite! 
__HTTP__ _E_ It's not enough that we do our best sometimes we have to do what's required. Winston Churchill _E_ Crude is at $85 right now – isn't even worth half that. OPEC is ripping us off. _E_ Crooked Hillary Clinton lied to the FBI and to the people of our country. She is sooooo guilty. But watch her time will come! _E_ I've realized that success requires 100% effort and 100% focus. Nothing less. Get out there and go for it. _E_ Only very stupid people think that the United States is making good trade deals with Mexico.Mexico is killing us at the border and at trade! _E_ Check out my most recent interview with CNN... __HTTP__ _E_ Coach W to his basketball players BE QUICK BUT DON'T HURRY! _E_ The Democrats are all talk and no action. They are doing nothing to fix DACA. Great opportunity missed. Too bad! _E_ Stop flights into the U.S. from West Africa immediately! _E_ Tom Ridge should be focused on trying to bring the party together rather than ripping it apart w/ your faulty thought process. I will win! _E_ The Miss Universe Pageant will be broadcast live from MOSCOW RUSSIA on November 9th. A big deal that will bring our countries together! _E_ Two dozen NFL players continue to kneel during the National Anthem showing total disrespect to our Flag & Country. No leadership in NFL! _E_ Great honor to receive today's endorsement of @RickSantorum. Really nice! #Trump2016 _E_ .@DanaPerino wrote a wonderful book "And the Good News is.. Dana has a fabulous perspective on life & politics—go get it! _E_ The upcoming All Star season of @CelebApprentice has @lisarinna returning to compete. She doesn't disappoint! _E_ If we could force Russia China and other competitors to use ObamaCare we would be able to instantly destroy their great economic success! _E_ Heading to Myrtle Beach South Carolina. Really big crowd—so much to talk about! _E_ Agreed @piersmorgan says he and @OMAROSA have a "communication malfunction." 
#CelebApprentice _E_ Our trade deficit is still on pace to be over $500B. This is killing our manufacturing sector and sending jobs overseas. _E_ Ebola's spread is 'unprecedented' says CDC chief __HTTP__ _E_ Central Park's top locale @TrumpRink is open throughout the holidays. Our Skating School is excellent & acclaimed __HTTP__ _E_ Almost every television network wants me badly—but I stay loyal to @NBC. _E_ Thank you @CharlesHurt for the nice words on @seanhannity. I will win and Make America Great Again! _E_ The real war on women over 175000 fewer held jobs in July & 94000 dropped out of labor force __HTTP__ We must do better. _E_ Thank you @JerryJrFalwell! __HTTP__ _E_ Lance Armstrong fought for 7 years & then just ran out of energy. Very sad story although they caught him red handed.He definitely cheated! _E_ In one of the biggest stories in a long time the FBI now says it is missing five months worth of lovers Strzok Page texts perhaps 50000 and all in prime time. Wow! _E_ The failing @nytimes wrote yet another hit piece on me. All are impressed with how nicely I have treated women they found nothing. A joke! _E_ Never quit and always hit back The Art of the Comeback _E_ Remember I said Derek don't sell your Trump World Tower apartment...its been lucky for you. The day after he sold it he broke his foot. _E_ Signing a recent tax return isn't this ridiculous? __HTTP__ _E_ The electoral college is a disaster for a democracy. _E_ I think it was terrible that Tim Cook of Apple apologized to China. What the hell is he apologizing for? Steve Jobs wouldn't. _E_ The 48000 sq. ft. Spa @TrumpDoral boasts 33 treatment rooms and over 100 signature spa services and treatments __HTTP__ _E_ During the campaign I promised to MAKE AMERICA GREAT AGAIN by bringing businesses and jobs back to our country. I am very proud to see companies like Chrysler moving operations from Mexico to Michigan where there are so many great American workers! 
__HTTP__ _E_ Thank you Farmington New Hampshire! #FITN #Trump2016 __HTTP__ _E_ Where are the 50000 important text messages between FBI lovers Lisa Page and Peter Strzok? Blaming Samsung! _E_ FAKE NEWS media knowingly doesn't tell the truth. A great danger to our country. The failing @nytimes has become a joke. Likewise @CNN. Sad! _E_ To be a visionary you have to chase impossibilities. Few ever get rich easily. Think Like a Billionaire _E_ I will be doing @foxandfriends this morning at 8 (not 7). _E_ I want to see people make lots of $$ and live better lives. I really think they can do that through TheTrumpNetwork __HTTP__ _E_ Today we remember the crew of the Space Shuttle Challenger 31 years later. #NeverForget __HTTP__ _E_ The cast for next season looks really good! _E_ All recent Presidents have released their transcripts. What is @BarackObama hiding? _E_ Congratulations to @SenScottBrown on running an aggressive & fair campaign. Vote for Scott today New Hampshire! _E_ Remember new environment friendly lightbulbs can cause cancer. Be careful the idiots who came up with this stuff don't care. _E_ Trump Tuesday: I'll be on @SquawkCNBC tomorrow morning at 7:30 AM. Be sure to tune in. _E_ ....getting great border security and healthcare. #VoteRalphNorman tomorrow! _E_ The new line of Trump ties shirts and cufflinks are out at Macy's and are really beautiful at a really reasonable.price. Go check them out! _E_ I will be going to Aberdeen Scotland today to help my team celebrate the great success of Trump International Golf Links press conference. _E_ #TrumpVlog Obama stop chewing gum! __HTTP__ _E_ Iran looks like it is toying with John Kerry on nuclear talks he is begging for a deal to save face. Negotiation is just not his thing! _E_ Attending Chief Ryan Owens' Dignified Transfer yesterday with my daughter Ivanka was my great honor. To a great and brave man thank you! _E_ Traitor Snowden has requested asylum in Russia. Why would Russia grant it? 
Snowden already gave them all the intel he stole! _E_ While I hear the Koch brothers are in big financial trouble (oil) word is they have chosen little Marco Rubio the lightweight from Florida _E_ Obama was beaten but not knocked out. He lives to fight another day. But in the real world presidents are not given a second chance... _E_ .@oreillyfactor called me a master marketeer last night I am not. I am a great builder I build great things & people come. _E_ RT @FLOTUS: I had a wonderful time with the students at the American International School #Riyadh today. #SaudiaArabia __HTTP__ _E_ "Exclusive: Donald Trump wants to build a luxury hotel in Dubai" __HTTP__ via @itp_ab by @ctrenwith _E_ Politician @SenatorCardin didn't like that I said Baltimore needs jobs & spirit. It's politicians like Cardin that have destroyed Baltimore. _E_ Workers of firm involved with the discredited and Fake Dossier take the 5th. Who paid for it Russia the FBI or the Dems (or all)? _E_ Discovery breeds discovery as in success breeds success. Questions are thoughts with a quest. Think Like a Champion _E_ #TrumpVlog Hagel quits __HTTP__ _E_ While I was in Moscow I see that President Obsma apologized for his lie I mean statement on ObamaCare! How nice of him to be so forthright _E_ So @BarackObama is celebrating his 'birthday' with a fundraiser in his home he bought with the help of Rezko __HTTP__ _E_ Great to see @SarahPalinUSA back on @FoxNews. She's a wonderful woman and commentator. _E_ How much money are the lawyers for the Central Park Five getting out of the 40 million dollars or are they paid by the City (or both)? _E_ RT @DanScavino: .@NikkiHaley in 2012 w/ Romney on tax returns🤔(political ploy.) Fast forward..2016 w/ Robot Rubio🤖#FAIL👎#Politician __HTTP__ _E_ RT @JacobAWohl: @realDonaldTrump The #MAGA great again movement is WINNING and the left wing media can't stand it! _E_ We need economic growth and jobs not blue ribbon panels to study the problem. 
_E_ Looks like @OMAROSA is up to the challenge. #CelebApprentice _E_ Florida Power & Light did a fantastic job of providing service & energy during the big storm in Palm Beach. @insideFPL _E_ Everybody is asking why the Justice Department (and FBI) isn't looking into all of the dishonesty going on with Crooked Hillary & the Dems.. _E_ Thank you @SenatorDole very kind! __HTTP__ _E_ Just in—all efforts to stop sexual abuse in the military have totally failed—in fact the stoppers have become the abusers. _E_ Dopey Sugar @Lord_Sugar I'm worth more than $8 billion acknowledged almost no debt ... _E_ RT @CLewandowski_: The Scrum: Video Emerges to Suggest WaPo Reporter Ben Terris Misidentifies Lewandowski in Fields Incident Breitbart __HTTP__ _E_ I can't believe @VanityFair would renew Graydon Carter's contract...... _E_ "Trump: 'Very much inclined' to enter GOP White House race" __HTTP__ via @McClatchyDC by @LightmanDavid _E_ A great day in New Hampshire and Maine. Fantastic crowds and energy! #MAGA _E_ FBI Deputy Director Andrew McCabe is racing the clock to retire with full benefits. 90 days to go?!!! _E_ Great news @BarbaraJWalters has fully recovered and will be back on @theviewtv this coming Monday. Barbara is wonderful! _E_ Will be having many meetings this weekend at The Southern White House. Big 5:00 P.M. speech in Melbourne Florida. A lot to talk about! _E_ .@_Just_Mads_ #asktrump __HTTP__ _E_ Terrible attacks in NY NJ and MN this weekend. Thinking of victims their families and all Americans! We need to be strong! _E_ #trumpvlog @BarackObama's dismal record in today's video blog.... __HTTP__ _E_ Gov Mike Pence has just stated that Donald Trump has taken a strong stance on Hoosier jobs and he thanks me! I will bring back jobs to USA. _E_ Had a great time with @MittRomney last night. He is focused and ready for the battle ahead. Lots of money was raised. _E_ Lightweight @AGSchneiderman is fighting with @NYGovCuomo –Cuomo wins that one easily. 
Schneiderman is a total loser. _E_ I was referring to the fact that Jeb Bush wants to keep common core. _E_ The countdown is on. The 13th season of All Star @ApprenticeNBC premieres this Sunday March 3rd at 9PM EST on @nbc. Big! _E_ I'll be speaking at the first ever National Achievers Congress at the San Jose Convention Center (San Jose CA) (cont) __HTTP__ _E_ Very little reporting about the GREAT GDP numbers announced yesterday (3.0 despite the big hurricane hits). Best consecutive Q's in years! _E_ Even Usain Bolt from Jamaica one of the greatest runners and athletes of all time showed RESPECT for our National Anthem! 🇲 __HTTP__ _E_ Billions of dollars in investments & thousands of new jobs in America! An initiative via Corning Merck & Pfizer: __HTTP__ __HTTP__ _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ I will bring our jobs back to the U.S. and keep our companies from leaving. Nobody else can do it. Our economy will sing again. _E_ Happy belated birthday wishes to @BarbaraJWalters. Barbara is terrific! _E_ Gov. John Kasich has really failed on the campaign trail. I thought he would have been far more talented. He is just wasting time & money! _E_ "If you want to succeed you should strike out on new paths rather than travel worn paths of accepted success." John D. Rockefeller _E_ Very interesting election currently taking place in France. _E_ A great honor to host PM Paolo Gentiloni of Italy at the White House this afternoon! #ICYMI Joint Press Conference... __HTTP__ _E_ Announced w/ @pgaofamerica that we will bring @seniorpgachamp to @TrumpGolfDC & @pgachampionship to Trump Bedminster _E_ Attention to detail is critical choose scents that exude sophistication & confidence. Find out more 4/18 5:30 pm @Macys Herald Square. 
_E_ Great event last night @trumpwinery with @GovernorVA to support @TheVFoundation @UVA @VCU __HTTP__ _E_ A beautiful article by @IvankaTrump on my newly opened golf course in NYC Trump Links Ferry Point __HTTP__ _E_ I will be having lunch at the White House today with Republican Senators concerning healthcare. They MUST keep their promise to America! _E_ Adversity is a fact of life. Be bigger than the problems be ready to fight for your rights & all will be well – Trump Never Give Up _E_ If you have any doubt that @BarackObama must be defeated see @DineshDSouza's '2016: Obama's America.' Amazing film! _E_ .@MittRomney if Obama gets wise tonight just ask for his college records & transcripts he will quiet down quickly. _E_ While I have never met @nytdavidbrooks of the NY Times I consider him one of the dumbest of all pundits he has no sense of the real world! _E_ Our debt finances China's military. It's time to get tough – we hold all the cards. Let's Make America Great Again! __HTTP__ _E_ There's only only one person who has defunded Medicare. His name is @BarackObama. _E_ ....People are angry. At some point the Justice Department and the FBI must do what is right and proper. The American public deserves it! _E_ Via @DailyCaller by @AlexPappas: "Donald Trump To Blast Obama Trade Pact In Radio Ads: 'A Bad Bad Deal'" __HTTP__ _E_ Without momentum there's a lack of energy that can lead the best of ideas to nowhere. Get your momentum going and keep it going. _E_ Coming up soon: The two hour premiere of The Apprentice. Next Thursday September 16th at 9 pm on NBC. __HTTP__ _E_ Who handed Iraq over to Iran yesterday? @BarackObama. We have gotten nothing from the Iraqis we should have them pay us back with oil. _E_ A @senatormcdaniel win is a victory for our country. Chris is a Constitutional Conservative who'll make a difference in Washington. _E_ President Obama you have a big job to do. Go to Baltimore and bring both sides together. 
With proper leadership it can be done! Do it. _E_ Today I announced an Air Traffic Control Initiative to take American air travel into the future finally!... __HTTP__ _E_ Be sure to read my column in @cnni "Europe is terrific place for investment" __HTTP__ _E_ ...whether there are tapes or recordings of my conversations with James Comey but I did not make and do not have any such recordings. _E_ Even @PiersMorgan is impressed by @THEGaryBusey. #CelebApprentice _E_ Just received from @PeteRose_14. Thank you Pete! #VoteTrump on Tuesday Ohio! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_ A sad day for America with Snowden being granted asylum in Russia. Putin is laughing at Obama. _E_ RT @Jim_Jordan: President Trump did the right thing by withdrawing us from Paris treaty it would hurt American companies and American wor... _E_ All successful people are high energy people who are passionate about what they do. Find a passion that energizes you. Think Big _E_ Ted Cruz said on @oreillyfactor that illegals sent out of country by my administration would come right back as citizens. Another lie crazy! _E_ The onus of the Chicago teachers' strike falls squarely on the teachers & their union. Inexcusable to leave children without school. _E_ Get the big picture but be prepared for the picture to change. Be persistent and alert every single day. _E_ Great honor to be endorsed by popular & successful @gov_gilmore of VA. A state that I very much want to win THX Jim! __HTTP__ _E_ Leaving Superior Wisconsin now. Thank you! #Trump2016 #WIPrimary __HTTP__ __HTTP__ _E_ Kasich has helped decimate the coal and steel industries in Ohio. I will bring them back! #MakeAmericaGreatAgain _E_ Russia just said the unverified report paid for by political opponents is A COMPLETE AND TOTAL FABRICATION UTTER NONSENSE. Very unfair! _E_ Newsmax is a great news org and and its pres debate in IA on 12 27 will be fair balanced and informative. @ralphreed _E_ I'll be on @foxandfriends Monday at 7:30 AM. 
Tune in! _E_ "@Letterman to Donald Trump: 'Fire @geraldorivera'" __HTTP__ via @Mediaite by @TheMattWilstein _E_ A very good NBC/Wall Street Journal Poll was just released wherein I went up from last month and am in the lead. Nice! _E_ 11AM #MakeAmericaGreatAgain __HTTP__ _E_ A true piece about the standing ovations I got yesterday __HTTP__ _E_ After 13 seasons @ApprenticeNBC easily beat Shark Tank in ratings last year better demos as well. _E_ Must watch for all Georgians @Perduesenate's new ad "Secure Our Border __HTTP__ Michelle Nunn supports amnesty & ObamaCare _E_ Today I spoke @LibertyU Convocation a great crowd... __HTTP__ _E_ I attended @Aerosmith concert last night in Newark NJ. Doesn't get any better than that. @IamStevenT was fantastic great energy! _E_ Dope Frank Bruni said I called many people including Karl Rove losers true! I never called my friend @HowardStern a loser he's a winner! _E_ It's Wednesday. How many times will A Rod sue the @Yankees today? A Rod has no one to blame but himself for his predicament. _E_ Watch @MissUSA Olivia Culpo crowned as @MissUniverse 2012 in the Trump #MissUniverse Pageant __HTTP__ _E_ I will be interviewed on @foxandfriends at 6:00 A.M. Enjoy! _E_ You have to set higher and higher goals. You have to want more or you will start slipping backwards fast. Think Big _E_ Congratulations to @FLGovScott on winning access to federal database __HTTP__ He is making FL a safe & legal election for 2012 _E_ .@lancearmstrong revise your decision to quit go back and fight. _E_ Entrepreneurs: Be ready for problems you'll have them every day. Keep open to new ideas that's where innovation begins. _E_ #DemDebate was really boring but had a lot of fun live tweeting and picked up by far the most followers. _E_ Secretary Kerry cannot get other nations to join us in fighting ISIS. They are afraid and he is a poor salesman who reps a pathetic leader! 
_E_ After Solyndra @BarackObama is stil intent on wasting our tax dollars on unproven technologies and risky companies. He must be accountable. _E_ Presidential Proclamation Honoring the Victims of the Tragedy in Parkland Florida: __HTTP__ __HTTP__ _E_ Frankly for a writer I don't think @DannyZuker's stuff is good. In fact it's terrible. _E_ Whether we like it or not oil is the axis on which the world's economies spin. It just is. When the price o... (cont) __HTTP__ _E_ The only problem I have with Mitch McConnell is that after hearing Repeal & Replace for 7 years he failed!That should NEVER have happened! _E_ The new reality China and Japan are warning us not to default __HTTP__ Reckless government spending has made us weak. _E_ Strong leader: @IsraeliPM Netanyahu explained at AIPAC the threat Israel faces from Iran's nuclear drive. He is (cont) __HTTP__ _E_ Via @Law360: "Trump's $200M Old Post Office Project Gets Early Approval" __HTTP__ _E_ .@alexsalmond @pressjournal RT @GailLorene Ask our Canadian neighbors who abhor the windfarms. And poor Scotland _E_ I'll be on The Late Show with David Letterman tonight be sure to tune in for a great show. 11:30 pm on CBS. _E_ Isn't it interesting that immediately after September 11th everybody was asking for and indeed demanding torture of any kind. No reports! _E_ RT @foxandfriends: .@jasoninthehouse: Comey went silent when I asked him about his memos which raised a lot of eyebrows. __HTTP__ _E_ Great night in Iowa special people. Thank you! _E_ My heart goes out to the people of Boston on this terrible day! _E_ Donald Trump Has Given Millions To Pro Romney SuperPACs and His Whole Family Is Cutting Checks to Mitt's Campaign __HTTP__ _E_ I will be interviewed on @seanhannity tonight at 10:00. Many things mostly bad to talk about! _E_ Ohio Gov.Kasich voted for NAFTA from which Ohio has never recovered. Now he wants TPP which will be even worse. Ohio steel and coal dying! 
_E_ .@VattenfallGroup wants out of their Aberdeen windfarm fiasco so badly but @AlexSalmond won't let them—he's (cont) __HTTP__ _E_ With all of its phony unnamed sources & highly slanted & even fraudulent reporting #Fake News is DISTORTING DEMOCRACY in our country! _E_ Congratulations to @PGA_JohnDaly on his big win yesterday. John is a great guy who never gave up and now a winner again! _E_ The era of division is coming to an end. We will create a new future of #AmericanUnity. First we need to... __HTTP__ _E_ Interesting that @Macys criticized me but just paid $650000 in fines for racial profiling. Are they racists? _E_ New York Times Apologizes to Donald TrumpA recent story in the New York Times incorrectly stated that Donald (cont) __HTTP__ _E_ In that @TimeWarner has @HBO with really dumb racist Bryant Gumbel(and I mean dumb) and no CBS (which fired Bryant) I am switching bldgs. _E_ Happy 102nd birthday to President Ronald Reagan. Every day that passes Reagan's presidency looks better and better. _E_ Why did @AGSchneiderman have to fill out 3 successive ballots on Election Day? And this is our A.G. _E_ #ThankYouTour2016 Tue: West Allis WI. Thur: Hershey PA. Fri: Orlando FL. Sat: Mobile AL. Tickets:... __HTTP__ _E_ Steps away from Waikiki's famous beaches @TrumpWaikiki is Hawaii's top destination w/our signature amenities __HTTP__ _E_ The Republican platform is most pro Israel of all time! _E_ My @FoxNews interview last night on @hannityshow discussing OWS and @BarackObama's incompetent leadership. __HTTP__ _E_ Arriving @TrumpScotland with @DonaldJTrumpJr & @EricTrump. Back to New York tonight. Video: __HTTP__ _E_ ...massive regulation cuts 36 new legislative bills signed great new S.C.Justice and Infrastructure Healthcare and Tax Cuts in works! 
_E_ Have some fun with this __HTTP__ _E_ Check out my speech from last Friday __HTTP__ as well as my appearance this morning on @foxandfriends __HTTP__ _E_ MAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_ The success of Shark Tank over the years is a total joke compared to the success of The Apprentice one of the biggest hits in T.V. history. _E_ To the geniuses at 'Americans United for Change': the more you tax me the less people I employ. Get it? _E_ My friend @AriEmanuel of @IMG bought the Miss Universe pageants from me and they are on tonight on #Fox! Tune in! _E_ The past 4 years have seen the weakest multiyear recovery since WWII __HTTP__ Need to loosen regulations and lower taxes. _E_ Thank you Pittsburgh Pennsylvania! Will be back soon! #AmericaFirst __HTTP__ _E_ Iraq should be paying us while we fight ISIS. Give the money to the families of our brave soldiers. _E_ Everyone makes mistakes but it's what you do with them and what you learn from them that matters. Midas Touch _E_ Thank you South Dakota! #Trump2016 __HTTP__ __HTTP__ _E_ Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_ Woody Johnson's comments that he would rather have @MittRomney win the election than his @nyjets win games shows real patriotism. _E_ Afghani soldiers those on our side killed 7 Marines last month. __HTTP__ They don't want us what (cont) __HTTP__ _E_ My thoughts and prayers are with all of the victims involved in this mornings train collision in South Carolina. Thank you to our incredible First Responders for the work they've done! _E_ Too bad @morningmika did not allow her interview with @SpitzerForNYC to go on another few minutes...would have been interesting... _E_ Plans to build wind farm near Trump Turnberry in Scotland have been dropped. GREAT! 
@GolfDigest @GolfweekMag @GolfChannel @ESPNGolf _E_ Via @BreitbartNews: TRUMP TO REPUBLICANS: 'PLAY THE DEBT CEILING CARD' __HTTP__ by @joelpollak _E_ Not only does the media give a platform to hate groups but the media turns a blind eye to the gang violence on our streets! __HTTP__ _E_ Autism rates through the roof why doesn't the Obama administration do something about doctor inflicted autism. We lose nothing to try. _E_ .@AlexSalmond sought my support after he released terrorist Al Megrahi who blew up Pan Am #103 killing all aboard. I said "no way!" _E_ The Iranians are having 'difficulties' with their nuclear program __HTTP__ But no thanks to us! _E_ I am proud to announce our newest project Trump Tower Mumbai. Together with the Lodha Group it will be incredible! __HTTP__ _E_ Will be interviewed on Media Buzz with Howie Kurtz on Fox Sunday at 11:00 A.M. _E_ Way to go @serenawilliams you are a true champion. Proud of you! _E_ #MakeAmericaGreatAgain From my speech in South Carolina yesterday __HTTP__ _E_ Great boardroom! What did you think? #CelebApprentice _E_ Big win by @Yankees last night to take control of AL East. Jeter & company now control their destiny. _E_ The Republican House Freedom Caucus was able to snatch defeat from the jaws of victory. After so many bad years they were ready for a win! _E_ Karen Handel's opponent in #GA06 can't even vote in the district he wants to represent.... _E_ I'll bet Lance Armstrong wishes he didn't do the interview with Oprah he's saying to himself what was I thinking? _E_ .@NBC really happy with how well the #MissUniverse pageant went. _E_ The Fake Media is working overtime today! _E_ Not under my watch __HTTP__ _E_ There are great campaigns on @fundanything __HTTP__ Be sure to take a look and donate to one today. _E_ Heading to rally with Bobby now! See you soon! __HTTP__ _E_ RT @IvankaTrump: Thank you New Hampshire! 
__HTTP__ _E_ RT @ChuckGrassley: Jerusalem Embassy Act of '95 (Senate vote 93 5 & I voted for it) states embassy should be in Jerusalem by 5/31/99. For 1... _E_ Today is the day that ObamaCare website was supposed to be up and working. WRONG website is closed down a total disaster! 90 million doomed _E_ That's right we need a TRAVEL BAN for certain DANGEROUS countries not some politically correct term that won't help us protect our people! _E_ I absolutely support Kate's Law—in honor of the beautiful Kate Steinle who was gunned down in SF by an illegal immigrant. _E_ Crooked Hillary Clinton deleted 33000 e mails AFTER they were subpoenaed by the United States Congress. Guilty cannot run. Rigged system! _E_ At your request I will be doing live tweeting during tonight's @ApprenticeNBC. #CelebApprentice _E_ I just beat a lawyer from Yale and a lawyer from Harvard who teamed up against me in a major case worth millions ($). They were so dumb! _E_ " Pennies don't fall from heaven they have to be earned here on earth. – PM Margaret Thatcher (October 13 1925 – April 8 2013) _E_ Luther Strange has been shooting up in the Alabama polls since my endorsement. Finish the job vote today for Big Luther. _E_ Border Patrol Officer killed at Southern Border another badly hurt. We will seek out and bring to justice those responsible. We will and must build the Wall! _E_ "In every battle there comes a time when both sides consider themselves beaten... _E_ The rules DID CHANGE in Colorado shortly after I entered the race in June because the pols and their bosses knew I would win with the voters _E_ Sorry won't be doing Fox & Friends this morning will be in India on a couple of major business deals! _E_ Trump volunteers were out early today to offload cases of food and supplies for hard hit Rockaways residents #Sandy _E_ #FlashbackFriday At Military Academy second from left. __HTTP__ _E_ It's Tuesday. How many jobs has ObamaCare cost the economy today? 
_E_ Just saw the phony ad by Cruz totally false more dirty tricks. He got caught in so many lies is this man crazy? _E_ China's domestic economic and political problems prove how pathetic our leadership is in allowing China to rip us off __HTTP__ _E_ As we come together to celebrate the extraordinary contributions of African Americans to our nation our thoughts turn to the heroes of the civil rights movement whose courage and sacrifice have inspired us all. Proclamation: __HTTP__ __HTTP__ _E_ #JFKFiles __HTTP__ _E_ Thank you for your support! __HTTP__ _E_ Certain people are ruining their reputations tonight really sad! #Oscars _E_ Good news @AFPhq is going to fight back against Rove's attack on the Tea Party __HTTP__ Go get em! @marklevinshow _E_ Congratulations to my friend @seanhannity on @hannityshow 1000th show consecutively #1 in his time slot! Great going! _E_ Just spoke to President XI JINPING of China concerning the provocative actions of North Korea. Additional major sanctions will be imposed on North Korea today. This situation will be handled! _E_ The @EricTrumpFDN is doing amazing work helping the children... _E_ especially how to get people even with an unlimited budget out to vote in the vital swing states ( and more). They focused on wrong states _E_ From 1954 to 1960 there were 10 major hurricanes that hit the East Coast. _E_ This morning I will be going to the Commissioning Ceremony for the largest aircraft carrier in the world The Gerald R. Ford. Norfolk Va. _E_ We spend billions of dollars helping nations all over the World but with hurricane Sandy and Oklahoma tornado not one nation helped us! _E_ Thank you Washington! Together WE will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_ .@Cher attacked @MittRomney. She is an average talent who is out of touch with reality. Like @Rosie O'Donnell a total loser! _E_ What could be better than dinner with @MittRomney and me? 
__HTTP__ _E_ #IceBucketChallenge For those of you who wanted a picture here it is __HTTP__ _E_ .@VattenfallGroup has topped Carbon Data's rankings of the most carbon intensive companies in the EU's emissions trading scheme. _E_ Via @worldnetdaily by @MichaelCarl7: "Trump: Obama blew chance to free U.S. pastor" __HTTP__ _E_ Crooked Hillary Clinton wants to flood our country with Syrian immigrants that we know little or nothing about. The danger is massive. NO! _E_ I was saddened to see how bad the ratings were on the Emmys last night the worst ever. Smartest people of them all are the DEPLORABLES. _E_ ... and pay per view records with "Battle of the Billionaires" in Detroit. It was a wild day! _E_ Being the best requires full time attention and application." – Midas Touch _E_ .@AJDelgado13 Thank you so much for the nice words and support really enjoy listening to your ideas and thoughts. _E_ "Deals are my art form. I like making deals preferably big deals." – The Art of The Deal _E_ The FAKE NEWS media (failing @nytimes @CNN @NBCNews and many more) is not my enemy it is the enemy of the American people. SICK! _E_ Mariano Rivera Yankee pitcher is the greatest ever. Get well fast. _E_ With magnificent views @TrumpChicago is the perfect venue to host impact events & business meetings __HTTP__ _E_ The Obama Administration has a very important duty to provide a budget and then negotiate! OUR COUNTRY is a laughingstock! _E_ Are you ready for the All Star @CelebApprentice? @TraceAdkins is back in the upcoming season...which is the best yet! _E_ Thank you! #Trump2016 #WIPrimary __HTTP__ _E_ I was speaking with Don Imus this morning.... __HTTP__ _E_ The Democrats have said some of the worst things about James Comey including the fact that he should be fired but now they play so sad! _E_ They call it climate change now because the words global warming didn't work anymore. Same people fighting hard to keep it all going! 
_E_ Amazingly with all of the money I have raised for the vets I have got nothing but bad publicity from the dishonest and disgusting media. _E_ Scary. Over 8332000 Americans left the work force during Obama's first term __HTTP__ How did Romney lose that election? _E_ .@ARealSuperMan #asktrump __HTTP__ _E_ We should stop talking stay out of Syria and other countries that hate us rebuild our own country and make it strong and great again USA! _E_ The Cruz campaign issued a dishonest and deceptive get out the vote ad calling voters in violation. They are now under investigation. Bad! _E_ "Trump to build second Scottish course" __HTTP__ via @UPI _E_ Staff Sgt. Salvatore A. Giunta received the Medal of Honor from Pres. Obama this month. It was a great honor to have him visit me today. _E_ Amazing story in @BreitbartNews about the sleazebag blogger Coppins who fabricated nonsense about me for irrelevant @BuzzFeed. CONGRATS! _E_ Merry Christmas to all. Have a great day and have a really amazing year. Together we will MAKE AMERICA GREAT AGAIN! It will be done! _E_ Sorry to hear of yesterday's passing of General Norman Schwarzkopf. He was a terrific general and leader we could use more like him. _E_ Really great numbers on jobs & the economy! Things are starting to kick in now and we have just begun! Don't like steel & aluminum dumping! _E_ Thank you New Jersey! #Trump2016 __HTTP__ __HTTP__ _E_ I am being proven right about massive vaccinations—the doctors lied. Save our children & their future. _E_ My shirts ties & cufflinks @Macys have never been better or more beautiful. Great holiday gifts great price. _E_ Colorado was amazing yesterday! So much support. Our tax trade and energy reforms will bring great jobs to Colorado and the whole country. _E_ The Fake Media (not Real Media) has gotten even worse since the election. Every story is badly slanted. We have to hold them to the truth! _E_ The deficits under @BarackObama are the highest in America's history. 
Why is he bankrupting our country? _E_ Rumor has it Pataki Kasich & Senator Lindsey Graham are dropping out of the race very soon. Hope it's not true they're so easy to beat! _E_ #Trump2016 #MakeAmericaGreatAgain #ECONOMY VIDEO: __HTTP__ __HTTP__ _E_ Lincoln never sounded like that! _E_ The Countryside Party just formed in Scotland to fight ugly wind turbines & @AlexSalmond. Congrats to Jim Crawford & Countryside Party. _E_ Trump National Hudson Valley's 7693 yd par 72 course features one of the country's great golf courses. __HTTP__ _E_ #MakeAmericaGreatAgain #NYPrimary __HTTP__ _E_ Via @MailOnline: "But did his hair survive? @MissUniverse & @MissUSA dump water over Donald Trump" __HTTP__ _E_ To all young (and old) entrepreneurs: Believe in yourself talk yourself up! Energize yourself and you'll energize others. _E_ Wow a really nice lead in New Hampshire an increase since my last poll! __HTTP__ _E_ Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_ Mike Bloomberg is doing a great job as Mayor of New York City. Ray Kelly is a great Police Commissioner. @MikeBloomberg _E_ Really bad shooting in Orlando. Police investigating possible terrorism. Many people dead and wounded. _E_ Re Negotiation: Patience is an enormous virtue & needs to be cultivated for successful negotiations on any level. _E_ ...have it. Fake News said 17 intel agencies when actually 4 (had to apologize). Why did Obama do NOTHING when he had info before election? _E_ While @BarackObama watches China is trying to have the yuan overtake our dollar as the international (cont) __HTTP__ _E_ Have a great and peaceful Memorial Day but remember there are people out there who don't want us to have peace. WE MUST BE STRONG!!!! _E_ The press is so totally biased that we have no choice but to take our tough but fair and smart message directly to the people! 
_E_ The only reason Obama gave a speech last night was because it was on the schedule Putin is laughing and the reviews have been really bad! _E_ Brian Thanks dummy I picked up 70000 twitter followers yesterday alone. Cable News just passed you in the ratings. @NBCNightlyNews _E_ Pathetic excuse by London Mayor Sadiq Khan who had to think fast on his no reason to be alarmed statement. MSM is working hard to sell it! _E_ Attended last night's @Yankees game Derek Jeter is both a great player and a great guy. _E_ Roger Goodell of NFL just put out a statement trying to justify the total disrespect certain players show to our country.Tell them to stand! _E_ Entrepreneurs: Learn to trust yourself. Being an entrepreneur is not a group effort. _E_ THANK YOU Daytona Beach Florida!#MakeAmericaGreatAgain __HTTP__ _E_ Thank you Florida Ohio and Pennsylvania! #CrookedHillary is not qualified. #ImWithYou __HTTP__ _E_ "@WestJournalism Exclusive – We Asked Donald Trump What Jobs He Would Offer ISIS" __HTTP__ _E_ The brand new season of @CelebApprentice starts filming in less than 5 weeks. The 'All Star' cast will be announced very soon. _E_ Will be working all weekend in choosing the great men and women who will be helping to MAKE AMERICA GREAT AGAIN! _E_ If Democrats do not start opposing ObamaCare and fast Republicans will have a massive victory in 2014 far greater than any predictions! _E_ CRIPPLED AMERICA is the perfect gift for friends & family. Order signed copy & join me at 7:30pm live streaming! __HTTP__ _E_ Crazy Maureen Dowd the wacky columnist for the failing @nytimes pretends she knows me well wrong! _E_ I only wish my wonderful father Fred gave me $200 million to start my business like lightweight Rubio says. He didn't total fabrication! _E_ Bush is pretending that the Trump surge is great for him and the @nytimesworld is reporting Bush delight con job a Bush nightmare! 
_E_ The @BarackObama campaign keeps highlighting a web video of John McCain being nice & respectful. I'll bet John (cont) __HTTP__ _E_ I look forward to Tuesday night's presidential debate I wonder if Obama will use my name again. _E_ Bobby Jindal did not make the debate stage and therefore I have never met him.... _E_ JUST IN: A jury awarded a complete and total victory in buyer's remorse lawsuit against me in Ft. Lauderdale. _E_ Jay Carney won't answer reporters questions of Why Obama won't release his college transcripts Come on Jay! _E_ We should stay the hell out of Syria the rebels are just as bad as the current regime. WHAT WILL WE GET FOR OUR LIVES AND $ BILLIONS?ZERO _E_ .@lancearmstrong should immediately reconsider or his legacy is ruined. _E_ "I believe anybody who is not afraid to fail is a winner." @JoeTorre _E_ 'Better Be Careful':Donald Trump Warns GOP On Immigration Creating '12 Million' New Dem Voters __HTTP__ via @Mediaite _E_ Thank you for the support South Carolina! #USSYorktown #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Spain's government is closing down wind turbines the maintenance is higher than the income. _E_ Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_ Check out OAN and compare to what you are watching now! _E_ Have a GREAT weekend everybody enjoy yourself but always keep your goals and aspirations in mind. Never lose sight of the victory ahead! _E_ The Democrats have zero intention of coming to any deal on the fiscal cliff. They will raise taxes and blame it on the Republicans. _E_ THANK YOU WISCONSIN! #VoteTrump next Tuesday April 5th! #WIPrimary __HTTP__ __HTTP__ _E_ Wow thank you Pensacola FL. See you Friday at 7pm join me! __HTTP__ __HTTP__ _E_ Celebrated doctor @BillCassidy will be a tremendous Senator. Louisiana – send a Conservative to the Senate vote for Bill this November! 
_E_ Corrupt @BarackObama's largest bundlers are fundraisers linked to the Obama Solyndra boondoggle __HTTP__ Chicago cronyism _E_ The Fake News Media has never been so wrong or so dirty. Purposely incorrect stories and phony sources to meet their agenda of hate. Sad! _E_ "Advertising is totally unnecessary. Unless you hope to make money." Jef I. Richards _E_ .@TrumpCollection continues to deliver the goods __HTTP__ _E_ RT @FLOTUS: The decorations are up! @WhiteHouse is ready to celebrate! Wishing you a Merry Christmas & joyous holiday season! __HTTP__ _E_ .@MikeAndMike I will be on the Mike & Mike Show at 7.05 a.m. (ESPN) 10 minutes. Will be fun great guys! Radio and T.V. _E_ What is wrong with the @GOP? Now they want to give all authority on the sequester cuts to Obama __HTTP__ Pathetic. _E_ My interview with Michael Patrick Shiels on WJIM in Lansing on behalf of @MittRomney __HTTP__ _E_ Please read __HTTP__ and watch a recent trip made to Trump Vineyard Estates by @EricTrump __HTTP__ _E_ years as a pol in Connecticut Blumenthal would talk of his great bravery and conquests in Vietnam except he was never there. When.... _E_ The reason I originally endorsed Luther Strange (and his numbers went up mightily) is that I said Roy Moore will not be able to win the General Election. I was right! Roy worked hard but the deck was stacked against him! _E_ So generous and pious! After spending millions of our tax dollars on his campaign through travel @BarackObama donated to himself. _E_ Remarks by President Trump on the Policy of the U.S.A. Towards Cuba Video: __HTTP__ __HTTP__ _E_ .@foxandfriends in 15 minutes! _E_ Via @Inc by @steelwire: "Donald Trump – To Micromanage or Not To Micromanage?" __HTTP__ _E_ This is a buyers' market. Buy now. You will thank me in 3 years. _E_ Congratulations to @foxandfriends on its unbelievable ratings hike. _E_ Lightweight A.G. Eric Schneiderman sued school with a 98% approval rating while billions in corruption goes unpunished. 
A total crook? _E_ Best book ever on dealmaking (or so they say) TRUMP: THE ART OF THE DEAL. Go get it and others Washington you really can do better! _E_ Via @scotsmandotcom: Trump joins with Chandler in bid to attract events __HTTP__ _E_ Thank you Georgia! See you soon!#Trump2016 __HTTP__ _E_ Sad.@BarackObama has already exempted major oil importers on Iranian sanctions and is negotiating a waiver with China. __HTTP__ _E_ A really bad night for President Obama. Now the Republicans have to get together and get the job done! _E_ Stocks rose yesterday during the first day of government shutdown. Markets like being left alone for a day. _E_ Which is worse and which is more dishonest the #Oscars or the Emmys? _E_ Today's @WSJ Editorial is WRONG again. I know that China is not in the new T.P.P. trade deal but would come in latter through a back door. _E_ John Kasich fell right into President Obama's trap on ObamaCare and the people of Ohio are suffering for it. Shame! _E_ #TrumpAdvice __HTTP__ _E_ Benghazi was a massive cover up. _E_ "Lifestyle unveils Trump Home brand in GCC" __HTTP__ via @TradeArabia _E_ In order to stop the Ebola outbreak in Africa perhaps the President should put all Africans on ObamaCare rather than sending the troops! _E_ Ron Estes is running TODAY for Congress in the Great State of Kansas. A wonderful guy I need his help on Healthcare & Tax Cuts (Reform). _E_ THANK YOU PORTLAND Maine!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ The United States troops which were sent to West Africa have only gotten 4 hours of Ebola training very unfair to them and their families! _E_ Wow—Golf Magazine just named Trump Scotland "best new course." __HTTP__ _E_ Congratulations to the @NYRangers on bringing the series home last night. _E_ Karl Rove's ads are the worst in political history! _E_ I am calling on Congress to TERMINATE the diversity visa lottery program that presents significant vulnerabilities to our national security. 
__HTTP__ _E_ Will be interviewed on @SquawkCNBC by @JoeSquawk coming up at 6:00amE from Davos Switzerland. Enjoy! #WEF18 __HTTP__ _E_ Thank you Georgia! #AmericaFirst#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Turnberry in Scotland is a far superior golf course to Pinehurst and it isn't even close! Likewise the Blue Monster at Doral. _E_ Trump SoHo opens this Friday and it is fantastic! Check out the Trump Hotel Collection... __HTTP__ _E_ .@WestwoodLee Great going this weekend. You are a true champion! _E_ Trump Int'l Hotel & Tower Chicago has won many awards & accolades as has Sixteen its signature restaurant. __HTTP__ _E_ Is Anthony Weiner a jerk or what! _E_ In the general course of human nature a power over a man's subsistence amounts to a power over his will. Alexander Hamilton _E_ During my trip to Saudi Arabia I spoke to the leaders of more than 50 Arab & Muslim nations about the need to confront our shared enemies.. __HTTP__ _E_ An attack on our Embassy is an attack on our soil. We have been attacked by Libya. Go into Libya & take the (cont) __HTTP__ _E_ .@HillaryClinton loves to lie. America has had enough of the CLINTON'S! It is time to #DrainTheSwamp! Debates __HTTP__ _E_ "Learn to think continentally." Alexander Hamilton _E_ While in Charlotte this weekend will visit my Trump National Golf Club on Lake Norman—a magnificent place & doing really well! _E_ Just received a copy of @SarahPalinUSA new book a great read! Sarah is a terrific person. _E_ I spoke with other candidates to a Jewish group many friends in D.C. I said I'm a negotiator like you Got standing O rated best of day! _E_ The price of greatness is responsibility. Winston Churchill _E_ British Prime Minister May was very angry that the info the U.K. gave to U.S. about Manchester was leaked. Gave me full details! _E_ Great meeting with automobile industry leaders at the @WhiteHouse this morning. Together we will #MAGA! 
__HTTP__ _E_ I wonder what the late great Vince Lombardi would say about the Rutgers football player who says he is being bullied because coach yelled? _E_ A hurricane will be coming to Tampa. My @RNC convention surprise hits Monday night! _E_ Establishment flunky @KarlRove is going crazy with the just released CBS poll that has me way ahead. New Fox poll has me beating Hillary. _E_ There is nothing I would be happier to do than to donate the $5M to a charity of Obama's choice once he releases all of his records. _E_ So wonderful to be in Las Vegas yesterday and meet with people from police to doctors to the victims themselves who I will never forget! _E_ Current @NYMag really sad not only boring but highly inaccurate. Use better paper product looks like a death march (which it is!). _E_ House Republicans should be doing everything possible to defund ObamaCare. Instead Leadership is funding it __HTTP__ _E_ Greatly dishonest of @TedCruz to file a financial disclosure form & not list his lending banks then pretend he is going to clean up Wall St _E_ There's no bigger name in America than Donald Trump political or nonpolitical. Sarasota GOP Chair Joe Gruters _E_ No surprise welfare spending is up over 30% under Obama. __HTTP__ He is the food stamp & welfare king _E_ Obama now wants to deny due process to the police. He'll give all constitutional rights to the terrorists but not our cops. _E_ Last week was a first in #CelebApprentice when I fired 2 celebrities at once. Wish I could FIRE @RickSantorum! (cont) __HTTP__ _E_ Who else could take 16 vacations play over 100 rounds of golf and hold over 300 fundraisers while serving as (cont) __HTTP__ _E_ If Obama is concerned about the border he should stop vacationing. Gov't will save millions which it can use to stop illegal migration. _E_ Great now Supreme Court Justices are talking about a constitutional right to a cell phone __HTTP__ Obama just stop already. 
_E_ Take responsibility for yourself it's a very empowering attitude. _E_ Mitt Romney did great in the debate last night. _E_ Tucson killer Loughner should be given the death penalty not his plea bargained life in prison which will cost (cont) __HTTP__ _E_ Billions of dollars spent on Baltimore and it's still a total mess. Leadership is needed not dollars. Our whole country is going to hell! _E_ This is the simple fact about @HillaryClinton: she is a typical politician all talk no action. #Debates2016 _E_ Thank you! #MakeAmericaGreatAgain __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Again more dead people voted in the last election than enrolled in ObamaCare. Congratulations America! _E_ We're going to use American steel we're going to use American labor we are going to come first in all deals. ... __HTTP__ _E_ Obama's second term is going to very tough for the Republicans. The Republicans must pick their battles wisely and play smart. _E_ Letter to @Univision Re: @TrumpDoral __HTTP__ _E_ Will be on @CNBC at @7:22. Enjoy! _E_ Will be another Sean success! __HTTP__ _E_ Wow...NYT reports @celebrityapprentice was the number 1 show in branding on television for all of 2012. _E_ Thoughts and prayers are with everyone in West Virginia dealing with the devastating floods. #ImWithYou _E_ The $25 Billion settlement with the banks on mortgages will slow the housing market down even more and create higher user fees. Stupid! _E_ Job tip: If you were the employer what kind of person would you most desire as an employee? Be that person. _E_ I will be on @letterman tonight. Be sure to watch! Always a great time. @LateShow _E_ We are now at a time perhaps more than ever before when the World needs GREAT leadership! _E_ Entrepreneurs: See yourself as victorious and the best way to be victorious is to be passionate. Find something you love doing! _E_ The radar defense shipping and civil aviation problems will stop the ugly windfarm. 
#EOWDC _E_ The media coverage this morning of the very average Clinton speech and Convention is a joke. @CNN and the little watched @Morning_Joe = SAD! _E_ It is a shame that the biased media is able to so incorrectly define a word for the public when they know that the definition is wrong. Sad! _E_ A clip from @KatieShow where I take @katiecouric's audience on the Katie Coach __HTTP__ _E_ When the achiever achieves it's not a plateau it's a beginning. Donald J. Trump __HTTP__ _E_ Every business has surprises hidden dangers beneath the surface and little known opportunities that can lead to huge success. _E_ Fans like winners. They come to watch stars great exciting players who do great exciting things. #TheArtofTheDeal _E_ Imagine how much money the average American would save if we busted the OPEC cartel. (cont) __HTTP__ _E_ Here we go! I stated long ago that we should cancel all flights from West Africa. Now we have Ebola in U.S. AND IT WILL ONLY GET WORSE! _E_ Will be cutting ribbon at 10 A.M. with Mayor Bloomberg and Jack Nicklaus for the opening of TRUMP LINKS at FERRY POINT. _E_ "Spend your time enjoying your big dreams." Think Big _E_ Do you ever notice that lightweight @megynkelly constantly goes after me but when I hit back it is totally sexist. She is highly overrated! _E_ Wow new polls just came out from @CNN Great numbers especially after total media hit job. Leading Ohio 48 44. _E_ Mitt Romney matches sitting President in fundraising for April not an easy thing to do. Bad news for (cont) __HTTP__ _E_ Designed by @IvankaTrump @TrumpDoral's New Villa Deluxe Guestrooms include vintage artwork of golf legends __HTTP__ _E_ Tomorrow at 11AM #MakeAmericaGreatAgain __HTTP__ _E_ We need the Wall for the safety and security of our country. We need the Wall to help stop the massive inflow of drugs from Mexico now rated the number one most dangerous country in the world. If there is no Wall there is no Deal! 
_E_ Yesterday on the same day I had meetings with Russian Foreign Minister Sergei Lavrov and the FM of Ukraine Pavlo... __HTTP__ _E_ Turn to QVC now to watch Melania really good stuff! _E_ With eleven Republican candidates running in Georgia (on Tuesday) for Congress a runoff will be a win. Vote R for lower taxes & safety! _E_ The Failing New York Times foiled U.S. attempt to kill the single most wanted terroristAl Baghdadi.Their sick agenda over National Security _E_ Will be doing Fox and Friends at 7 A.M. (in 20 minutes). _E_ What a shame that @msnbc's ratings have sunk even lower in 2013. Prime time down 50%. @TheRevAl's are (cont) __HTTP__ _E_ "Surround yourself with people who are smarter than you." @UncleRUSH _E_ The Democrats are most angry that so many Obama Democrats voted for me. With all of the jobs I am bringing back to our Nation that number.. _E_ Must read editorial today about lightweight New York State Attorney General Eric Schneiderman. Is he a crook? __HTTP__ _E_ Thank you West Virginia!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ The polling numbers show a close race. @MittRomney needs all of our support. _E_ .@BarbaraJWalters @theviewtv will apologize to me just like she did when I was right about @Rosie. Besides I get great ratings on The View. _E_ Great evening with President @EmmanuelMacron & Mrs. Macron. Went to Eiffel Tower for dinner. Relationship with France stronger than ever. __HTTP__ _E_ The Obama administration gives better medical care to Al Qaeda at Gitmo than to our vets. _E_ The great @MarianoRivera in my office with my son @EricTrump __HTTP__ _E_ Response to @LindseyGrahamSC: __HTTP__ _E_ Do you notice the silence lately on wind turbine monstrosities? The people of Scotland & many other countries are fighting back. _E_ "You owe it to yourself and to your community to make your property the best it can be." – Think Like a Billionaire _E_ The MOVEMENT in Lakeland Florida. Voter registration extended to 10/18. 
REGISTER ASAP @ __HTTP__ &... __HTTP__ _E_ This is not a media event or about Donald J. Trump this is about the United States of America. I will be (cont) __HTTP__ _E_ If @BarackObama wanted the Super Committee to succeed he would have lead. Instead he has been campaigning. Where is the leadership? _E_ It is important to think positively. Negative thinking will kill your focus and destroy any chance you have of being successful. _E_ Little @MacMiller you illegally used my name for your song "Donald Trump" which now has over 75 million hits. _E_ Now that Mitt is gone all we have to do is get Bush to drop out and Trump to run—and we will win! _E_ I will be in Washington D.C. on Wednesday1 P.M.in front of the Capitol to protest the horrible and incompetent deal being made with Iran. _E_ DACA has been made increasingly difficult by the fact that Cryin' Chuck Schumer took such a beating over the shutdown that he is unable to act on immigration! _E_ I am happy to hear how badly the @nytimes is doing. It is a seriously failing paper with readership which is way down. Becoming irrelevant! _E_ Rapidly failing @VanityFair magazine hits me for my strong stance against Obama's brilliant 5 killers for 1 deserter trade. Amazing! _E_ The Iran deal is terrible. Why didn't we get the uranium stockpile it was sent to Russia. #SOTU _E_ Looking forward to the Florida rally tomorrow. Big crowd expected! _E_ Ratings starved @CNN and @CNNPolitics does not cover me accurately. Why can't they get it right it's really not that hard! _E_ W/ the ransom Obama paid for deserter Bergdhal getting Mexico to release USMC Sgt Andrew Tahmooressi is much harder. #BringBackOurMarine _E_ .@katyperry must have been drunk when she married Russell Brand @rustyrockets – but he did send me a really nice letter of apology! _E_ Jusr arrived at the studio the place is going wild! LIVE AT 8 P.M. 
#CELEBRITYAPPRENTICE _E_ Now China is threatening our allies who share defense pacts with us the latest is the Philippines __HTTP__ Very aggressive _E_ "If it doesn't sell it isn't creative." David Ogilvy _E_ My @FoxNews interview from yesterday with @TeamCavuto discussing the economy my trip to Australia @MittRomne... (cont) __HTTP__ _E_ Those five hotels includeTrump International Hotel & Tower New York Trump Soho New York Trump International Hotel & Tower Chicago... _E_ More reports of voting machines switching Romney votes to Obama. Pay close attention to the machines don't let your vote be stolen _E_ Moving forward f/tonight's competitive primaries it is crucial that the Tea Party & @GOP remain united towards November. Take the Senate! _E_ #trumpvlog Why I cancelled the great debate ..... __HTTP__ _E_ "The problem is that no government can create real jobs. Only entrepreneurs can do that." – Midas Touch _E_ Speech on Veterans' Reform: __HTTP__ _E_ I am tired of @BarackObama talking about @MittRomney's father. Why don't we discuss Barack Obama Sr.! _E_ Thank you to @GolfweekMag for naming Trump International Golf Links Scotland #1 GB&I Best Modern Course A great honor! _E_ Pete Rose should now be allowed in The Baseball Hall of Fame. The all time hits leader has paid the price already! _E_ Looking forward to keynoting the @NCGOP #NCGOPcon dinner tomorrow night! @NCGOP is a top state party! _E_ The woman who is the Secret Service Director looks like she is way over her head.Why can't the president appoint the best and the brightest? _E_ The only way for Medicare and Social Security to remain solvent is if our economy is healthy. @BarackObama doesn't get it. _E_ Wow great Ohio poll. Shows me leading by 5 points beating K! _E_ I find hope in the darkest of days and focus in the brightest. Dalai Lama _E_ Sooner or later those who win are those who think they can. 
Paul Tournier _E_ My interview yesterday on the S&P downgrade with Wolf Blitzer on CNN __HTTP__ _E_ "Courageous people do not fear forgiving for the sake of peace." – Nelson Mandela _E_ On behalf of @FLOTUS Melania & myself THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief @RedCross & @SalvationArmyUS __HTTP__ _E_ The Honolulu accommodations of @TrumpWaikiki are the perfect merger of beauty and function __HTTP__ _E_ Just as I predicted @Rosie would fail on The View __HTTP__ _E_ .@BillMaher's so called show on HBO must be the cheapest special produced in the history of television it sucks! _E_ via __HTTP__ Donald Trump announces launch of his first Indian project in Pune __HTTP__ _E_ It was great being on @MikeAndMike in the Morning (ESPN)—two great guys fantastic show! _E_ It will be interesting to see how Jenna Talackova does as Miss Universe Canada. We all wish her luck. _E_ Make sure to enjoy your time with your family during the holiday. It is a special time. Love and appreciate your family. _E_ Getting China to stop playing its currency charades can begin whenever we elect a president ready to take (cont) __HTTP__ _E_ Now @RonWyden is also "concerned" about ObamaCare along with @MaxBaucus __HTTP__ Program may fold through its own doing. _E_ AGAIN TO OUR VERY FOOLISH LEADER DO NOT ATTACK SYRIA IF YOU DO MANY VERY BAD THINGS WILL HAPPEN & FROM THAT FIGHT THE U.S. GETS NOTHING! _E_ Just arrived in Indianapolis Indiana to make an announcement on #TaxReform! Together we are going to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Big WIN today for building the wall. It will secure the border & save lives. Now the full House & Senate must act! __HTTP__ __HTTP__ _E_ Other than a small group of people who have suffered massive and embarrassing losses the party is VERY united. Great love in the arena! 
_E_ As President of the United States of America I will ALWAYS put #AmericaFirst#UNGAFull remarks: __HTTP__ __HTTP__ _E_ I commend Roger Ailes for publicly supporting @FoxNews' employees against the Obama administration's intimidation of its reporters. _E_ Congratulations to John Rich and to Marlee Matlin for a terrific job throughout the season. You are both great! __HTTP__ _E_ Entrepreneurs: Cover your bases. Know everything you can about what you're doing. Then go with your gut. Your instincts r there for a reason _E_ Prayers go out to the victims of the terrible fire in New Jersey. Stay strong and remember it will soon get better. _E_ Problem with @GOP is not their message it's that they are incapable of controlling the message. _E_ Jimmy Fallon show will be great tonight I'm on! _E_ Fewer Americans are now insured through their employers due to higher premiums. Obamacare must be fully repealed. __HTTP__ _E_ The Washington Post calls out #CrookedHillary for what she REALLY is. A PATHOLOGICAL LIAR! Watch that nose grow! __HTTP__ _E_ I promise you that I'm much smarter than Jonathan Leibowitz I mean Jon Stewart @TheDailyShow. Who by the way is totally overrated. _E_ Once again @Cher tweets nonsense about @MittRomney. She needs to stop tweeting & start worrying about some of her many problems. _E_ Joint Press Conference with Prime Minister Saad Hariri of Lebanon beginning shortly. Join us live! __HTTP__ __HTTP__ _E_ Spent the full day at meetings and a major rally yesterday in South Carolina. Great people and spirit. Today will be more of the same. _E_ I don't know whether I will win or lose the @billmaher lawsuit but had an obligation to sue for charity. _E_ We grieve for the officers killed in Baton Rouge today. How many law enforcement and people have to... __HTTP__ _E_ .@nbc did a great job last night with the @GoldenGlobes! 
_E_ Via @nytpolitics by @AshleyRParker: "Strong Showings for Donald Trump in Iowa and New Hampshire Polls" __HTTP__ _E_ Windmills are destroying every country they touch and the energy is unreliable and terrible. __HTTP__ _E_ Watch my wife Melania Trump tonight on @QVC at 1 a.m. So proud of her! _E_ Via @bpolitics by @emtitus: "Defying Doubters Donald Trump Makes Presidential Bid Official" __HTTP__ _E_ Via @newhampshirecom:"Tickets on sale for Loeb School Event featuring Donald Trump" __HTTP__ _E_ I have accepted @billmaher's $5 million offer paid to me for charity (made on the @jayleno show). _E_ That was a great football game. _E_ Are you a Democrat running in a race you should lose? Get @KarlRove to run an ad against you and you will win. _E_ Looking forward to speaking at @ralphreed's @FaithandFreedom Gala Dinner on Friday in D.C. His staff has been great! _E_ Melania and I saw American Idiot on Broadway last night and it was great. An amazing theatrical experience! _E_ Congrats to @BarbaraJWalters on winning the @MadeinNY Mayor's Award for Lifetime Achievement! I love Barbara! _E_ Happy 5th Anniversary to @TrumpWaikiki< __HTTP__ ! Can't believe it's already been 5 years.. _E_ On this Memorial Day holiday we honor our fallen soldiers who have made the greatest sacrifice for freedom. They are our country's finest. _E_ #trumpvlog Windfarms in today's video blog... __HTTP__ _E_ Wow some new and even greater polls thank you! _E_ The U.S. cannot allow EBOLA infected people back. People that go to far away places to help out are great but must suffer the consequences! _E_ The Council was shocked by the exuberance of the demonstration in Blackdog. @AlexSalmond @pressjournal _E_ Via @Newsmax_Media: "Trump: @KarlRove 'The Most Over rated Man in Politics'" __HTTP__ _E_ .@VattenfallGroup had no answers at demonstration last night. It's a failing company. Aberdeen windmills will destroy it. 
_E_ Hillary Clinton has been involved in corruption for most of her professional life! _E_ Entrepreneurs: Get and keep your momentum going. Without momentum a lot of great ideas go nowhere. _E_ Great article in the @guardian Donald Trump opens £100m golf course __HTTP__ _E_ In one of the biggest stories in a long time the FBI says it is now missing five months worth of lovers Strzok Page texts perhaps 50000 all in prime time. Wow! _E_ Join me in Henderson Nevada on Wednesday at 11:30am! #MAGA Tickets: __HTTP__ _E_ Wind Farms are not only disgusting to look at but also cause tremendous damage to their local ecosystems. __HTTP__ _E_ Via @Slate: Who won the #GOPDebate? __HTTP__ _E_ Fracking will lead to American energy independence. With price of natural gas continuing to drop we can be at a tremendous advantage. _E_ The Voter Violation certificate gave poor marks to the unsuspecting voter(grade of F) and told them to clear it up by voting for Cruz. Fraud _E_ The Saudis are taking credit for a meager 2% drop on crude __HTTP__ They always play this game (cont) __HTTP__ _E_ I enjoyed meeting with @MattBlunt @TrumpTowerNY to discuss why our government must address currency manipulation. Many US jobs are at stake. _E_ The signature restaurant of @TrumpNewYork @jeangeorges is both Forbes Five Star & AAA Five Diamond restaurant __HTTP__ _E_ Via @BreitbartNews by @mboyle1: Donald Trump Slams Liberals In 'Dishonest Press': 'I'm Going To Start Naming Names' __HTTP__ _E_ Thank you for joining us at the Lincoln Memorial tonight a very special evening! Together we are going to MAKE AM... __HTTP__ _E_ A signed copy of CRIPPLED AMERICA makes a great gift. Order & join my live streaming book signing event on 12/3 __HTTP__ _E_ Congratulations to @BarackObama he is the first POTUS to run trillion dollar deficits in all four years of his term! _E_ A mediocre person tells. A good person explains. A superior person demonstrates.... 
_E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ Thank you to my great supporters in Wisconsin. I heard that the crowd and enthusiasm was unreal! _E_ Dummy I'm asking a question look at the question mark at the end of the sentence! Use your head. _E_ John Heilemann the lightweight reporter begging to be on@morning joe looks like a timebomb waiting to explode he's a nervous and sad mess! _E_ Gary as the Cat in the Hat? He can work it out. _E_ How much is New York State spending on that obnoxious T.V. commercial that is being played endlessly for a tax incentive that doesn't work? _E_ I have directed that U.S. Cyber Command be elevated to the status of a Unified Combatant Command focused on....cont: __HTTP__ _E_ USSS did an excellent job stopping the maniac running to the stage. He has ties to ISIS. Should be in jail! __HTTP__ _E_ Via The Brody File: The Lesson Evangelicals Can Learn From Donald Trump Thank you David & CBN News so nice. __HTTP__ _E_ If the Wall Street protesters are upset about the economy then they should really be protesting @BarackObama at the White House. _E_ New CNN/WMUR New Hampshire poll just released. Thank you! #FITN #Trump2016 __HTTP__ _E_ to the U.S. but had nothing to do with TRUMP is more FAKE NEWS. Ask top CEO's of those companies for real facts. Came back because of me! _E_ Virtually no one has spent more money in helping the American people with disabilities than me. Will discuss today at my speech in Sarasota _E_ Great to see that Dr. Kelli Ward is running against Flake Jeff Flake who is WEAK on borders crime and a non factor in Senate. He's toxic! _E_ The failing @nytimes should focus on fair and balanced reporting rather than constant hit jobs on me. Yesterday 3 boring articles today2! _E_ The Chinese are illegally dumping bird killing wind turbines on our shores. Only one of many grievances we should act. _E_ Happy Thanksgiving to everyone. 
We will together MAKE AMERICA GREAT AGAIN! _E_ Photo from a recent episode of @ApprenticeNBC saying those two famous words! __HTTP__ _E_ ...Spread shots out over long period and watch positive result. _E_ Featuring @BLTPrime & Palm Grill @TrumpDoral offers a wide array of acclaimed top dining options __HTTP__ _E_ This morning @nbc @todayshow played some of the @RNC video I filmed for the Tampa Convention __HTTP__ _E_ Raised a lot of money for the Republican Party. There will be a big gasp when the figures are announced in the morning. Lots of support! Win _E_ See you tonight in North Carolina. Making keynote for the Republican party will be fun. _E_ ObamaCare will increase individual market premiums by 99% for men and 62% for women __HTTP__ DEFUND!! #MakeDCListen _E_ #AmericaFirst #ImWithYou __HTTP__ _E_ RT @Carl_C_Icahn: 1/2 Believe Trump gave a great speech. _E_ Russia and the world has already started to respect us again! __HTTP__ _E_ Big win for Republicans as Democrats cave on Shutdown. Now I want a big win for everyone including Republicans Democrats and DACA but especially for our Great Military and Border Security. Should be able to get there. See you at the negotiating table! _E_ Will be spending the day campaigning in Connecticut another state where jobs are being stolen by other countries. I will stop this fast! _E_ Regardless of the USC's ruling ObamaCare can only be defeated politically. It must be legislatively repealed or America will go bankrupt. _E_ Via @HotelierME: Olympic golf course designer named by Trump Damac __HTTP__ _E_ ...al Megrahi was the man who blew up Pan Am Flight 103 over Lockerbie Scotland. _E_ .@oreillyfactor please explain to the very dumb and failing @glennbeck that I supported John McCain big league in 2008 not Obama! _E_ As I predicted long ago the war in Iraq was a disaster for the U.S. 
Heading for civil war there are bombings all over the place.Iran happy _E_ Without passion you don't have energy without energy you have nothing. Find work that you love and the energy will be there. _E_ Entrepreneurs: Never give up. Be tough. Apply your skills and talent but above all be tenacious. _E_ Great to be back in Arizona!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Hillary Clinton wants to create the most liberal Supreme Court in history #debate #DrainTheSwamp __HTTP__ _E_ I try to learn from the past but I plan for the future by focusing exclusively on the present. That's where the fun is! _E_ The event with me and @V4SA in L.A on 9/15 is turning out to be huge. Get your tickets before they're gone __HTTP__ _E_ While @JoeBIden is a gaffe machine yesterday's comments that @MittRomney will put y'all back in chains was not at all proper. _E_ The people of Cuba have struggled too long. Will reverse Obama's Executive Orders and concessions towards Cuba until freedoms are restored. _E_ All these polls released by news outlets are oversampling Democrats. They want to influence public perception of the race. _E_ Great event in Columbus taking off for Cincinnati now. Great new Ohio poll out thank you!OHIO NBC/WSJ/MARIST POLLTrump 42% Clinton 41% _E_ Bad move @BarackObama released $147M in aid to the Palestinians __HTTP__ That money is going to Hamas. _E_ Wow with all this talk @MissUniverse is going to Russia on November 9th __HTTP__ _E_ Soaring 92 stories @TrumpChicago boasts a @ForbesInspector 5 Star rating for both its hotel & restaurant __HTTP__ _E_ I always said Obama is lucky for himself but unlucky for the country. The storm could be very good for him as he (cont) __HTTP__ _E_ Somebody please inform Jay Z that because of my policies Black Unemployment has just been reported to be at the LOWEST RATE EVER RECORDED! _E_ The decision on Sergeant Bergdahl is a complete and total disgrace to our Country and to our Military. 
_E_ President Obama has absolutely no control (or respect) over the African American community they have fared so poorly under his presidency. _E_ I totally respect that Angelina Jolie has shown such great bravery in the face of danger she has really come a long and positive way! _E_ Good news @MittRomney is now leading in North Carolina according to @ppppolls. The NC GOP is united after their (cont) __HTTP__ _E_ There is incredible progress on the site of Trump Tower Punta del Este Uruguay situated on the sands of Playa Brava __HTTP__ _E_ Interesting that certain Middle Eastern countries agree with the ban. They know if certain people are allowed in it's death & destruction! _E_ "Keep your focus global and you may very well find yourself ahead of the game." – Trump Never Give Up _E_ Join me in Bedford New Hampshire tomorrow at 3:00pm. Can't wait to see everyone! #AmericaFirst #MAGA... __HTTP__ _E_ Another example of the destruction caused by wind turbines. Unnecessary waste horrible! __HTTP__ _E_ ....because he doesn't live there! He wants to raise taxes & kill healthcare. On Tuesday #VoteKarenHandel. _E_ Will be participating in a town hall event hosted by @SeanHannity tonight at 10pmE on @FoxNews. Enjoy! __HTTP__ _E_ Instead of driving jobs and wealth away AMERICA will become the world's great magnet for INNOVATION & JOB CREATION. __HTTP__ _E_ I can't believe that @CNN would allow the very nice Jeffrey Lord to be savaged by a panel of seven Trump haters. 7 to 1 Don't watch CNN! _E_ The election is still close but trending toward @MittRomney. He leads all national polls and Obama's likeability is imploding. VOTE! _E_ Newly released documents show Geithner to be laughing as the financial crisis loomed __HTTP__ _E_ If once you forfeit the confidence of your fellow citizens you can never regain their respect and their esteem. Abraham Lincoln _E_ To all haters and losers: I am NOT anti vaccine but I am against shooting massive doses into tiny children. 
Spread shots out over time. _E_ In Nashville Tennessee! Lets MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Has Charles @krauthammer ever apologized for being so totally wrong on Iraq? I called it right in every way—Make America Great Again! _E_ A great morning with everyone @LibertyU! Thank you! Off to New Hampshire now. #Trump2016 __HTTP__ __HTTP__ _E_ Will be interviewed on @ThisWeekABC this morning. Enjoy! _E_ 'The goal is to be the winner': Donald Trump's campaign is for real. Via The Guardian __HTTP__ _E_ .@AGSchneiderman must take a drug test immediately—make results public. NY Attorney General cannot be a cokehead. _E_ Rubio was very disloyal to Bush his mentor when he decided to run against him. Both said they love each other.They don't word is hate! _E_ Toyota Motor said will build a new plant in Baja Mexico to build Corolla cars for U.S. NO WAY! Build plant in U.S. or pay big border tax. _E_ Once again @RickSantorum proves he can't run a professional campaign. He is ineligible in large section of (cont) __HTTP__ _E_ Many people are saying that the Iranians killed the scientist who helped the U.S. because of Hillary Clinton's hacked emails. _E_ RT @Scavino45: Florida Governor Rick @FLGovScott. #HurricaineIrma __HTTP__ _E_ ObamaCare enrollment lie: Obama counts an enrollee as a web user putting a plan in "their online shopping carts" __HTTP__ _E_ "There are 2 things I've found I'm very good at: overcoming obstacles and motivating good people to do their best work."–The Art of The Deal _E_ Iran admits to aiding the Libyan Rebels and Ahmadinejad received a letter of thanks when will Washington learn? __HTTP__ _E_ The Great State of Michigan was just certified as a Trump WIN giving all of our MAKE AMERICA GREAT AGAIN supporters another victory 306! _E_ David Wright of the NY Mets should have been on the 1st Team All Stars. He's having a great year. _E_ I always enjoy speaking to young aspiring entrepreneurs. They are hungry motivated and eager to learn. 
Proves America can still be great. _E_ I had a great time in Texas yesterday. A tremendous crowd of wonderful and enthusiastic people. Will be back soon! _E_ Together we will Make America Great Again!#AmericaFirst __HTTP__ _E_ Weakness of attitude becomes weakness of character. Albert Einstein _E_ After four years of getting the run around America needs a turnaround and the man for the job is Governor Mitt Romney. @PaulRyanVP _E_ ....This is real collusion and dishonesty. Major violation of Campaign Finance Laws and Money Laundering where is our Justice Department? _E_ Check out Donald Trump's new iGoogle Showcase page: __HTTP__ _E_ HAPPY THANKSGIVING EVERYONE ENJOY YOUR DAY! _E_ Another health insurer is pulling back due to 'persistent financial losses on #Obamacare plans.' Only the beginning! __HTTP__ _E_ I like thinking big. I always have. To me it's very simple: if you're going to be thinking anyway you might as (cont) __HTTP__ _E_ Great visit to Detroit church fantastic reception and all @CNN talks about is a small protest outside. Inside a large and wonderful crowd! _E_ .@AlexSalmond's insane release of the terrorist—for humanitarian reasons will go down as a better decision.. _E_ Tonight despite everything put A Rod in the lineup. _E_ Entrepreneurship is engine of American success. I bring it to crowdfunding w/ @fundanything's $1M RECORD reward __HTTP__ _E_ Just read the nice remarks by President Jimmy Carter about me and how badly I am treated by the press (Fake News). Thank you Mr. President! _E_ Via @THESHARKTANK1: Donald Trump's Controversial Mexican Comments Are Accurate __HTTP__ _E_ No person who is enthusiastic about his work has anything to fear from life. Samuel Goldwyn _E_ Word is that despite a record amount spent on negative and phony ads I had a massive victory in Florida. Numbers out soon! _E_ Isn't it amazing that the U.S. and NSA can listen to the highly protected phone conversations of world leaders but can't get O's records! 
_E_ #MakeAmericaGreatAgain #GOPdebate __HTTP__ _E_ Senator Bob Corker begged me to endorse him for re election in Tennessee. I said NO and he dropped out (said he could not win without... _E_ Great article by @AmSpec's Jeffrey Lord: "The Reagan Revolution. And now... the @realDonaldTrump Revolution?" __HTTP__ _E_ Nice interview in the @The Atlantic of Sarasota GOP Chair Joe Gruters on my 2012 'Statesman of the Year' award __HTTP__ _E_ The Arab League stated that it wants nothing to do with an attack on Syria but they want us to attack.Are our leaders insane or just stupid _E_ "You have to keep going and moving forward no matter what is happening around you or to you." – Think Like a Champion _E_ RT @DailyCaller: Guam Governor To Trump: I've Never Felt Safer Than 'With You At The Helm' __HTTP__ __HTTP__ _E_ Along with two championship courses on the Potomac River @TrumpGolfDC's also offers limitless social events __HTTP__ _E_ I will be having a general news conference on JANUARY ELEVENTH in N.Y.C. Thank you. _E_ RT @FoxNews: .@EricTrump: My father was elected for one reason and that's because he actually believes in putting America first which is... _E_ Honored to sign S.442 today. With this legislation we support @NASA's scientists engineers and astronauts in the... __HTTP__ _E_ Via @gatewaypundit: "Please Pray for Me... I Am Losing My Insurance" __HTTP__ Just one of the millions of cases like this... _E_ More Bush cronyism – "Jeb Bush and the Common Core Money Trail" __HTTP__ It's the Bush way! _E_ That would mean that Eliot Spitzer has failed at everything he's done politics TV & even real (cont) __HTTP__ _E_ Good timing: @TraceAdkins won big for American Red Cross last night on @ApprenticeNBC. Now the Red Cross is in Oklahoma doing a great job. _E_ Remember the old saying The more you learn the more you realize you don't know it's true. Learning is a daily challenge. 
_E_ My heartfelt thoughts and prayers are with the 7 @USNavy sailors of the #USSFitzgerald and their families. ... __HTTP__ _E_ A suicide bomber has just killed U.S. troops in Afghanistan. When will our leaders get tough and smart. We are being led to slaughter! _E_ Great job to Missy Franklin. She's got a smile that can take over the world. She's also a major talent. Great going Missy! _E_ America will have record growth and prosperity during his adminstration: @MittRomney's success in the private sector is a tremendous asset. _E_ Just spoke to President Macri of Argentina about the five proud and wonderful men killed in the West Side terror attack. God be with them! _E_ Today I announced another historic breakthrough for the VA. We are working tirelessly to keep our promises to our GREAT VETERANS! #USA __HTTP__ _E_ RT @EricTrump: So proud to be out on the campaign trail with @realDonaldTrump thanks for an amazing night #Biloxi #Trump2016 __HTTP__ _E_ I am watching the NFL DRAFT will be interesting! A lot of talent but only a few will become STARS. _E_ "What America Needs: The Case for Trump" Great new book by the esteemed Jeffrey Lord @JeffJlpa1 Available now. __HTTP__ _E_ Entrepreneurs: Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_ The ultimate Golf experience @TrumpTurnberry is a unique destination located on the beautiful Ayrshire coastline __HTTP__ _E_ 'President Elect Donald J. Trump Intends to Nominate Congressman Tom Price and Seema Verma.' __HTTP__ __HTTP__ _E_ Book on Bin Laden is a terrible violation of code makes @BarackObama's story a big lie. _E_ Lots of comments—Do you really believe these two brothers operated alone without influence of others? _E_ Congratulations to @Likud_Party MK @dannydanon on being offered Deputy Defense Minister of IDF by @IsraeliPM @netanyahu. _E_ THANK YOU AMERICA! 
#Trump2016#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Heading for Ohio really big crowd of amazing people! Much to talk about! _E_ Rupert Murdoch is a great guy who likes me much better as a very successful candidate than he ever did as a very successful developer! _E_ Heading to South Carolina now meeting with fantastic people! _E_ Obama says a WALL at our southern border won't enhance our security (wrong) and yet he now wants to build a much bigger wall (fence) at W.H. _E_ Rick Perry is a good guy who had a really tough evening. @RickPerry _E_ Between Iraq war monger @krauthammer dummy @KarlRove deadpan @GeorgezWill highly overrated @megynkelly among others @FoxNews not fair! _E_ Peter Navarro: 'Trump the Bull vs. Clinton the Bear' #DrainTheSwamp __HTTP__ _E_ The man made climate change that our great president should be focused on is of the NUCLEAR variety brought upon us because of weakness! _E_ I love watching dummy @mcuban promote on ok show named Shark Tank—but he is just a small part of that show. _E_ The Architect @KarlRove is directly responsible for losing both houses & @BarackObama becoming President. Ignore him. _E_ RT @IvankaTrump: Thank you for the warm welcome. I'm excited to be in Hyderabad India for #GES2017. __HTTP__ _E_ I will be on State of the Union @CNN with @jaketapper at 9am. Enjoy! _E_ Scary & Unsustainable: On Monday the US added more debt than from 1776 through Pearl Harbor __HTTP__ _E_ Trump International Hotel & Tower Vancouver will include Vancouver's first pool bar nightclub & Trump Spa __HTTP__ _E_ Joe Biden called America the Problem vis a vis Iran __HTTP__ He never wastes an opportunity to say something stupid.@JoeBiden _E_ Also great comeback by the New York Jets. That game was over until a really dumb defensive play by Tampa. Amazing. _E_ Brainpower is the ultimate leverage. Keep your focus intact! _E_ To vote for me and CENTURY 21 for the best #Superbowl commercial click the following link and "Like" the page. 
__HTTP__ _E_ Our thoughts are with the forces fighting ISIS in Iraq. We must never back down against this extreme radical Islami... __HTTP__ _E_ Tried watching low rated @Morning_Joe this morning unwatchable! @morningmika is off the wall a neurotic and not very bright mess! _E_ Scenes from last night's episode of @OCChoppers where @DonaldJTrumpJr and I visit the OCC HQ __HTTP__ _E_ Rosechem1 One of the reasons that I like you is because I feel that old American greatness in your mentality. It makes me feel hope! Thx. _E_ Important editorial by John Faso in @nydailynews: "Spitzer's reckless leadership" __HTTP__ _E_ Dopey @Lord_Sugar I'm worth $8 billion and you're worth peanuts...without my show nobody would even know who you are. _E_ .@MissUSA Olivia Culpo has been a star a young Audrey Hepburn. _E_ A true honor. @PressSec considers asking for @BarackObama's college transcripts a Donald Trump question. __HTTP__ Release it! _E_ Why do we always try to destroy our true champions and winners in this country while at the same time leaving the losers alone? STUPID! _E_ The damage that Democrats weak Repubicans and this disaster of a president have inflicted on America has put (cont) __HTTP__ _E_ I will once again write a $1 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_ The #IranDeal is a catastrophe that must be stopped. Will lead to at least partial world destruction & make Iran a force like never before. _E_ We need to bring manufacturing jobs back home where they belong. #TimeToGetTough __HTTP__ _E_ THE ROLLOUT OF OBAMACARE IS A TOTAL DISASTER AND AN EMBARRASSMENT TO OUR COUNTRY. THE WORLD IS WATCHING AND LAUGHING.$635000000 WEBSITE! _E_ Concentration is a fine antidote to anxiety. Jack Nicklaus _E_ I will end common core. It's a disaster. __HTTP__ #Trump2016 __HTTP__ _E_ Thank you Pastor Robert Jeffress! 
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ It is amazing that after lambasting Donald Sterling on @foxandfriends some DISHONEST press only reported my GIRLFRIEND FROM HELL statement! _E_ My @SquawkCNBC interview discussing the close election the enthusiasm gap between Mitt & Obama & the fiscal cliff __HTTP__ _E_ Why won't @BarackObama repeal the Defense of Marriage Act if he supports gay marriage? __HTTP__ He is gaming the issue. _E_ I never heard of @DannyZucker until his very dumb and endless tweets started pouring out of insecure mind but I have a great deal for him! _E_ Likewise the primary victims of violent crimes are in the African American and Hispanic communities. These people want LAW AND ORDER now! _E_ Today's job report is dismal. Now a record 88921000 Americans are no longer in the work force. _E_ Again immigration reform is fine—but don't rush to give away our country. That's what's happening! _E_ Just finished reading a poorly written & very boring book on the General Motors Building by Vicky Ward. Waste of time! @WileyBiz _E_ CEO's most optimistic since 2009. It will only get better as we continue to slash unnecessary regulations and when we begin our big tax cut! _E_ .@BarackObama's dismal job record is reason alone that he must be defeated this November. _E_ On Anthony Wiener I TOLD YOU SO! _E_ I am getting great credit for my press conference today. Crooked Hillary should be admonished for not having a press conference in 179 days. _E_ I am always on the front page of the failing @nytimes but when I won the GOP nomination I'm in the back of the paper. Very dishonest! _E_ Via @nypost's @PageSix: "Trump researching 2016 run" __HTTP__ _E_ Is Chris Jackson as dumb as I hear but I still like that he follows me like a good little soldier! _E_ Here we go! _E_ Losers and haterseven you as low and dumb as you are can learn from watching Apprentice and checking out my tweets you can still succeed! 
_E_ The Christmas Story begins 2000 years ago with a mother a father their baby son and the most extraordinary gift of all—the gift of God's love for all of humanity.Whatever our beliefs we know that the birth of Jesus Christ and the story of his life... __HTTP__ _E_ If Scotland doesn't stop insane policy of obsolete bird killing wind turbines country will be destroyed. @AlexSalmond @AberdeenCC _E_ Look what the President of NBC sent me recently about his stay in my Las Vegas hotel. Very loyal guy. __HTTP__ _E_ I've just started blocking out some of the repetitive and boring (& dumb) haters and losers. They are a waste of time and energy! _E_ ObamaCare Horror Story: "Navigators Tell Applicants To Lie Like Administration" __HTTP__ @JamesOKeefeIII strikes again! _E_ #VoteTrumpMI! #Trump2016 __HTTP__ _E_ Tom Brady has done a great job tonight amazing New England comeback. Good game not over yet! _E_ After foolishly spending two trillion $'s and losing so many great young people the U.S. will be the only one who won't get the oil in Iraq _E_ Here's a sneak peek at the @DNC convention theme: It's not our fault. Blame Bush. Oh and government built it. _E_ At least 24 players kneeling this weekend at NFL stadiums that are now having a very hard time filling up. The American public is fed up with the disrespect the NFL is paying to our Country our Flag and our National Anthem. Weak and out of control! _E_ To entrepreneurs: Watching you could be the motivation for your employees so make it an example that will best serve your success. _E_ The lightweight @JonHuntsman used my name in a debate for gravitas it didn't work. Sad! _E_ Economists on the TAX CUTS and JOBS ACT:"The enactment of a comprehensive overhaul complete with a lower corporate tax rate will IGNITE our ECONOMY with levels of GROWTH not SEEN IN GENERATIONS..." 
__HTTP__ _E_ Congratulations to Paul Ryan Kevin McCarthy Kevin Brady Steve Scalise Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes! _E_ The Democrats lead by head clown Chuck Schumer know how bad ObamaCare is and what a mess they are in. Instead of working to fix it they.. _E_ ICYMI: PENCE: I RAN A STATE THAT WORKED KAINE RAN A STATE THAT FAILED. __HTTP__ _E_ I wonder if President Obama would have attended the funeral of Justice Scalia if it were held in a Mosque? Very sad that he did not go! _E_ Check out a list of Donald Trump's books for summer reading at the Trump University Blog: __HTTP__ _E_ Congratulations to @BillCassidy on a decisive win this past Saturday. Bill will be a pro growth & pro energy Senator. _E_ Melania will be on QVC tomorrow night at 9 p.m. ET to introduce her beautiful and inspiring Melania Timepieces & Fashion Jewelry collection. _E_ Twisted Sister frontman @deesnider shines in the record 13th season of 'All Star' @CelebApprentice. The Iron Man of Rock and Roll is great! _E_ Backstage with @jimmyfallon before opening skit great fun! @fallontonight __HTTP__ _E_ My interview with WMUR's @JoshMcElveen at #NHFreedomSummit __HTTP__ _E_ Honored to meet this years @SenateYouth delegates w/ @VP Pence in the East Room of the @WhiteHouse. Congratulations... __HTTP__ _E_ Be tenacious. Being tenacious means you're tough and patient at once so it's a formidable combination. _E_ Ted Cruz should be disqualified from his fraudulent win in Iowa. Weak RNC and Republican leadership probably won't let this happen! Sad. _E_ FLASHBACK October 9 2012: "Donald Trump: Jobs Numbers Are 'A Lot Of Monkey Business'" __HTTP__ Proven right again! _E_ Thank you to FEMA our great Military & all First Responders who are working so hardagainst terrible oddsin Puerto Rico. See you Tuesday! _E_ Join me LIVE with @VP @SecretaryPerry @SecretaryZinke and @EPAScottPruitt. 
#UnleashingAmericanEnergy __HTTP__ _E_ The Iran nuclear deal is a terrible one for the United States and the world. It does nothing but make Iran rich and will lead to catastrophe _E_ Big day tomorrow in Georgia and South Carolina. ObamaCare is dead. Dems want to raise taxes big! They can only obstruct no ideas. Vote R _E_ The terrorists in Syria are calling themselves REBELS and getting away with it because our leaders are so completely stupid! _E_ Every poll done on debate last night from Drudge to Newsmax to Time Magazine had me winning in a landslide. #MakeAmericaGreatAgain! _E_ For all of those (DACA) that are concerned about your status during the 6 month period you have nothing to worry about No action! _E_ I'm on @foxandfriends every Monday morning at 7:30... _E_ .@Omarosa's new name via @DennisRodman: "Ms. Saboteur" sounds rather elegant. #CelebApprentice _E_ Just arrived in Arizona! #ImWithYou __HTTP__ _E_ The entire village of Blackdog in Scotland protested to the Council last night about the ugly windmills. @AlexSalmond @pressjournal _E_ My interview in @politico with @pwgavin discussing being awarded the 2012 Statesman of the Year by Sarasota GOP __HTTP__ _E_ There's a reason @mcuban's partners can't stand him and on top of that the team sucks! _E_ Whenever you see the words 'sources say' in the fake news media and they don't mention names.... _E_ "Donald Trump: $200 Million D.C. Hotel Will Be Among World's Best" __HTTP__ via @WNEW _E_ So @BarackObama will attack @MittRomney's career at Bain Capital but won't return donations from Bain executives __HTTP__ _E_ RT @realDonaldTrump: Consumer confidence soars to highest level since 2004 📈 __HTTP__ __HTTP__ _E_ Thank you! We are at 35% in new Reuters poll with #2 coming in at 12%. Time to #MakeAmericaGreatAgain!#Trump2016 __HTTP__ _E_ I made a great deal of money in Atlantic City but left years ago when I saw so many political mistakes being made. I have ZERO involvement! _E_ Just got back from Georgia. 
The crowds and love for U.S. was so amazing! We all had a great day together will be back soon! _E_ "The biggest doers often suffer the biggest setbacks in life. So if you want to aim high you have be able to handle the bumps."–Think Big _E_ If Republicans are going to pass great future legislation in the Senate they must immediately go to a 51 vote majority not senseless 60... _E_ Why isn't Mexico releasing our Marine. U.S. should come down really hard on them. They have ZERO respect for our so called leader _E_ Via @Ammoland by Fredy Riehl: "Donald Trump Talks: Gun Control Assault Weapons Gun Free Zones & Self Defense" __HTTP__ _E_ Thank you! Vote in 2016! #MakeAmericaGreatAgain __HTTP__ _E_ The VA scandal shows the fatal ineptitude of big central planning government. When will we learn? _E_ #FraudNewsCNN #FNN __HTTP__ _E_ .@Morning_Joe: Marco only won the debate in the minds of desperate people. I won every on line poll even crazy @CNBC. Marco good looking? _E_ He @BarackObama wants to release 5 senior Taliban detainees back to the Taliban. __HTTP__ The Taliban out negotiates him! _E_ The Club For Growth said in their ad that 465 delegates (Cruz) plus 143 delegates (Kasich) is more than my 739 delegates. Try again! _E_ Once again we will have a government of by and for the people. Join the MOVEMENT today! __HTTP__ __HTTP__ _E_ Keep it fast short and direct whatever it is. Donald J. Trump __HTTP__ _E_ The middle class has become the new poor in this country and our incompetent politicians are unable to do anything about it.They don't care! _E_ "Consider the fact that for every gallon of gas you put in your car you pay 45.8 cents in state local and federal taxes." #TimeToGetTough _E_ Thank you for such a beautiful welcome Hawaii. My great honor to visit @PacificCommand upon arrival. Heading to Pearl Harbor w/ @FLOTUS now. __HTTP__ _E_ Just arrived at the Pensacola Bay Center. Join me LIVE on @FoxNews in 10 minutes! #MAGA __HTTP__ _E_ HAPPY EASTER HAVE A GREAT DAY! 
_E_ .@katyperry I watched Russell Brand and I think his mind is fried he looks really bad. Russell is a total joke a dummy who is lost! _E_ With the complete Ft. Lauderdale victory I will now sue for millions of $'s in attorney fees for which plaintiffs are liable. _E_ Remember @dannyzuker you are not even the real boss of Modern Family no big $$$$$$'s for you! _E_ Check out a picture of the custom made Trump Bike that Paul Teutul Sr. presented to me today in Trump Tower __HTTP__ _E_ Such a wonderful statement from the great @LouDobbs. We take up what may be the most accomplished presidency in modern American history. _E_ Young entrepreneurs across the US are trying to make deals & build businesses daily. Stay positive think big & big things will happen _E_ The @nfl games are so boring now that actually I'm glad I didn't get the Bills. Boring games too many flags too soft! _E_ Please remember I am the ONLY candidate who is self funding his campaign. Kasich Rubio and Cruz are all bought and paid for by lobbyists! _E_ You are right the media is always offending Donald Trump they have no limits but they will do anything not to offend the Boston killer! _E_ Ernie Els and myself at Trump National Doral. __HTTP__ _E_ We must fix our education system for our kids to Make America Great Again. Wonderful day at Saint Andrew in Orlando. __HTTP__ _E_ Spoke yesterday with the King of Saudi Arabia about peace in the Middle East. Interesting things are happening! _E_ Entrepreneurs: Use your imagination. Use your intelligence to execute what your imagination presents to you. _E_ Why isn't the House Intelligence Committee looking into the Bill & Hillary deal that allowed big Uranium to go to Russia Russian speech.... _E_ .@JRubinBlogger one of the dumber bloggers @washingtonpost only writes purposely inaccurate pieces on me. She is in love with Marco Rubio? 
_E_ I am now in Texas doing a big fundraiser for the Republican Party and a @FoxNews Special on the BORDER and with victims of border crime! _E_ #TBT With my friend @muhammadali __HTTP__ _E_ Just won the lawsuit on leadership of Consumer Financial Protection Bureau CFPB. A big win for the Consumer! _E_ Congratulations to @SixteenChicago @TrumpChicago for being honored with a @MichelinGuideChi two star rating again this year! _E_ John Menard of Menards home improvement stores in Midwest treats employees horribly should they form a union? __HTTP__ _E_ It's a national embarrassment that an illegal immigrant can walk across the border and receive free health care and one of our Veterans..... _E_ #ObamacareFail __HTTP__ _E_ Honored to serve as Commander in Chief to the courageous men and women of our U.S. Armed Forces. A grateful nation thanks you! __HTTP__ _E_ Video: Trump Golf Links at Ferry Point @TrumpFerryPoint __HTTP__ _E_ Weak JEB getting thrown out by management during speech. Do you think he will be this tough on Putin & others? __HTTP__ _E_ RT @paulsperry_: Fusion GPS firm behind disputed Russia dossier retracts its claim of FBI mole in Trump camp __HTTP__ _E_ Just got back from Iowa. Fantastic evening with truly fabulous people. Will be back again soon. Thanks! _E_ .@katyperry will do much better __HTTP__ _E_ Entrepreneurs: We win in our daily lives by being careful with every day every moment. _E_ Go with your gut. Take chances. If you think you have the ingredients that you need take chances because your biggest successes... _E_ Via @nypost: Trump's links getting green __HTTP__ _E_ ICYMI This week we hosted a #MadeInAmerica event right here at the @WhiteHouse! If it is MADE IN AMERICA it is the BEST! USA __HTTP__ _E_ The Answer to both Social Security and Medicare is a robust growing economy not cuts on the elderly. _E_ Learning to expect problems saved me from a lot of wasted energy. Winners see problems as just another way to prove themselves. 
_E_ The premier landmark in midtown NYC Trump Tower features our signature amenities w/a magnificent waterfall __HTTP__ _E_ Will be interviewed on @Morning_Joe at 7:20. Great crowd in Las Vegas yesterday! _E_ National Review @NRO may be going out of business because of the really pathetic job being done by @JonahNRO. No talent means death sad! _E_ RT @TeamTrump: .@HillaryClinton is RAISING your taxes to a disastrous level. @realDonaldTrump is going to LOWER your taxes BIG LEAGUE! #D... _E_ "Obama's promises on the Iran deal are like him promising 'if you like your healthcare plan you can keep it'" @marklevinshow _E_ If Mitt Romney were in the private sector & he suffered the horrendous loss of 2012 do you think he'd rehire himself for 2016?—I don't! _E_ If other countries benefit from our armed forces protecting them those countries should pay for the protection. #TimeToGetTough _E_ Re CRIPPLED AMERICA I am signing books for the next two weeks. Order yours for holiday gifts! __HTTP__ _E_ $1B down another $1B to go. ObamaCare website is 40% unfinished. This is beyond pathetic. _E_ Trump locks down Delaware GOP delegates. #Trump2016 #MAGA __HTTP__ _E_ Money was never a big motivation for me except as a way to keep score. The real excitement is playing the game! _E_ This Sunday's @CelebApprentice will shock you! Big Development...Be sure to tune in on @NBC this Sunday at 9PM EST! _E_ Republicans Senators are working hard to get their failed ObamaCare replacement approved. I will be at my desk pen in hand! _E_ #TimeToGetTough: Making America #1 Again my new book available today. The book both China and OPEC do NOT want you to read. _E_ Think positively. There are always opportunities. Keep your focus and don't give up! _E_ I still don't get how @KarlRove spent $400 million & lost all. 
_E_ I had a fun time doing the #CallMeMaybe video featuring  the @MissUSA contestants @BravoAndy and @GiulianaRancic __HTTP__ _E_ What is vital now is a swift restoration of law and order and the protection of innocent lives.#Charlottesville __HTTP__ _E_ In light of Newtown our country has to pull together. _E_ How can Crooked Hillary put her husband in charge of the economy when he was responsible for NAFTA the worst economic deal in U.S. history? _E_ "I can accept failure everyone fails at something. But I can't accept not trying." – Michael Jordan _E_ Home of the iconic Ailsa a four time @The_Open course @TrumpTurnberry is a landmark on the Ayrshire coastline __HTTP__ _E_ ...the Uranium to Russia deal the 33000 plus deleted Emails the Comey fix and so much more. Instead they look at phony Trump/Russia.... _E_ Europe and the U.S. must immediately stop taking in people from Syria. This will be the destruction of civilization as we know it! So sad! _E_ Will be on @FallonTonight with @JimmyFallon on @NBC at 11:35pmE. Enjoy! #Trump2016 __HTTP__ _E_ .@Rosie If America's Got Talent uses you the show will fail like all your others! _E_ Congrats to Congress on their 112 'gold tier' healthcare plans __HTTP__ Why should they suffer like regular Americans? _E_ Life is very fragile and success doesn't change that. If anything success makes it more fragile. Anything can (cont) __HTTP__ _E_ In 2016 the Old Post Office will be fully transformed into an iconic destination Trump Int'l Washington DC __HTTP__ _E_ Bernie Sanders supporters have every right to be apoplectic of the complete theft of the Dem primary by Crooked Hillary! _E_ Thank you Redding California!#MakeAmericaGreatAgain #CAPrimary __HTTP__ _E_ The last thing our country needs is another BUSH! Dumb as a rock! _E_ Lyin' Ted and Kasich are mathematically dead and totally desperate. Their donors & special interest groups are not happy with them. Sad! 
_E_ Crooked Hillary Clinton who called BREXIT 100% wrong (along with Obama) is now spending Wall Street money on an ad on my correct call. _E_ Obama's wind turbines kill "13 39 million birds and bats every year!" __HTTP__ Save our bald eagles symbol of our nation! _E_ Entrepreneurs: Have your own vision and stick with it. Don't be afraid to be unique. Don't tread water get out there and go for it. _E_ "Winning is habit. Unfortunately so is losing." Vince Lombardi _E_ I can't believe @Denver_Broncos allowed final touchdown—dumbest defensive play I have ever seen in football. _E_ The Al Qaeda flag is now flying over Benghazi. @BarackObama spent over $3Billion of our money for this? _E_ Beautiful rally in Albuquerque New Mexico this evening thank you. Get out & VOTE! #DrainTheSwampWatch rally:... __HTTP__ _E_ "You can attack defend counterattack sell or ignore. " Roger Ailes to Pres. Reagan during prep for 2nd Mondale debate/ '84 election _E_ The CIA report should not be released. Puts our agents & military overseas in danger. A propaganda tool for our enemies. _E_ Entrepreneurs: Vision remains vision until you focus do the work and bring it down to earth where it will do some good. _E_ I will be meeting General Kelly General Mattis and other military leaders at the White House to discuss North Korea. Thank you. _E_ Then we attended the Scottish fashion show that benefits veterans Dressed to Kilt 2010 which I co hosted with Sir Sean and Lady Connery. _E_ God bless the people of Mexico City. We are with you and will be there for you. _E_ Here we go A healthcare worker who treated Thomas Duncan the man who flew into the U.S. from West Africa infected with Ebola caught it! _E_ Located in beautiful Briarcliff NY @TrumpNationalNY features a 7291 yard course just 25 minutes outside NYC __HTTP__ _E_ Reality TV's #1 Bad Girl @OMAROSA is back on the upcoming 13th season of All Star @CelebApprentice. She is great as always. _E_ The Boston killer applying today for ObamaCare. 
He demands that medical bills be taken care of immediately. Does this include dental? _E_ I am attracting the biggest crowds by far and the best poll numbers also by far. Much of the media is totally dishonest. So sad! _E_ Snowden is a liar.and a fraud! _E_ RT @TeamTrump: When @realDonaldTrump is POTUS families are going to be safe and secure. Law and order will be RESTORED! #MAGA #Debates #De... _E_ Thanks Geraldo you're a champion. __HTTP__ _E_ SECURE THE BORDER! BUILD A WALL! _E_ Hope you liked it. Tune in tomorrow night at 8:00 and 9:00 for two episodes and two boardrooms! Will be a great evening of television! _E_ The LIVE FINALE of @ApprenticeNBC is this Sunday at 9/8C. Watch and see who will be the first ever All Star Celebrity Apprentice. _E_ I'll be discussing a variety of topics tonight with Greta Van Susteren 10 p.m. on Fox News. It will be the first of a two part series. _E_ Iowa was amazing today. Great crowd great people. Thanks will be back soon! _E_ There is no comparison between @ApprenticeNBC and Shark Tank in the ratings. The Apprentice beats Shark Tank hands... __HTTP__ _E_ New report from DOJ & DHS shows that nearly 3 in 4 individuals convicted of terrorism related charges are foreign born. We have submitted to Congress a list of resources and reforms.... _E_ The top course on the west coast @TrumpGolfLA overlooks Pacific Ocean & offers a luxurious public golf experience __HTTP__ _E_ President Obama do not attack Syria. There is no upside and tremendous downside. Save your powder for another (and more important) day! _E_ We will always ENFORCE our laws PROTECT our borders and SUPPORT our police! #LESMHarrisburg Pennsylvania #FlashbackFriday #MS13 __HTTP__ _E_ In presidential voting so far John Kasich is ZERO for 22. So why would he be a good candidate? Hillary would beat him I will beat Hillary! _E_ Always great to see the wonderful people of South Carolina. Thank you for the beautiful welcome at Greenville Spartanburg Int'l Airport! 
__HTTP__ _E_ Tens of millions of dollars in airstrikes had no impact because key leaders fled after hearing ON NEWS REPORTS the strikes were coming. DUMB _E_ This memo totally vindicates "Trump" in probe. But the Russian Witch Hunt goes on and on. Their was no Collusion and there was no Obstruction (the word now used because after one year of looking endlessly and finding NOTHING collusion is dead). This is an American disgrace! _E_ I am truly honored and grateful for receiving SO much support from our American heroes... __HTTP__ __HTTP__ _E_ Since Election Day on November 8 the Stock Market is up more than 25% unemployment is at a 17 year low & companies are coming back to U.S. _E_ Entrepreneurs: Successful negotiation means knowing what the other side wants. You've got to know where they're coming from. Pay attention! _E_ much worse just look at Syria (red line) Crimea Ukraine and the build up of Russian nukes. Not good! Was this the leaker of Fake News? _E_ I'm doing The David Letterman Show tonight should be interesting! _E_ What will happen to Omarosa tonight? One of our all time great episodes! _E_ Another terrorist attack in Paris. The people of France will not take much more of this. Will have a big effect on presidential election! _E_ The new Dark Knight Rises Trailer is great __HTTP__ The movie filmed scenes in Trump Tower last October. _E_ Hey Missouri let's defeat Crooked Hillary & @koster4missouri! Koster supports Obamacare & amnesty! Vote outsider Navy SEAL @EricGreitens! _E_ Vanity Fair Magazine which used to be one of my favorites is failing badly. Newsstand sales are plummeting (cont) __HTTP__ _E_ My @todayshow show interview with @IvankaTrump discussing the fierce competition in All Star @CelebApprentice __HTTP__ _E_ Using Alicia M in the debate as a paragon of virtue just shows that Crooked Hillary suffers from BAD JUDGEMENT! Hillary was set up by a con. 
_E_ I want to express our support and extend our prayers to all those affected by the vile terror attack in Spain last month. __HTTP__ _E_ "When everyone works with the same energy loyalty and focus it makes for smooth sailing all around." – Midas Touch _E_ Fact – Amnesty lowers wages and invites more lawlessness. Obama has unilaterally cancelled any chance of immigration reform. _E_ The winner of Best in Show at the Westminster Kennel Club Show Miss P will be coming to my office this morning. _E_ Remember that Marco Rubio is very weak on illegal immigration. South Carolina needs strength as illegals and Syrians pour in. Don't allow it _E_ Have the right mindset for the job. See your work as an art form which means paying attention to every detail. _E_ Could this be my newest apprentice? __HTTP__ ...Enter the contest .. . __HTTP__ _E_ Enjoy the Super Bowl! _E_ Government is shut down yet Obama is now harassing the privately owned @Redskins to change its name.He needs to focus on his job! _E_ Looking forward to being in Council Bluffs Iowa later today. Despite weather rally is on will be fantastic! #MakeAmericaGreatAgain! _E_ People are pouring into Washington in record numbers. Bikers for Trump are on their way. It will be a great Thursday Friday and Saturday! _E_ Our country and our leaders are getting dumber all the time. Now they are about to release full documentation on torture. Will destroy CIA _E_ Bob Dole Warns of 'Cataclysmic' Losses With Ted Cruz and Says Donald Trump Would Do Better via New York Times: __HTTP__ _E_ President Obama spoke for me and every American in his remarks in #Newtown Connecticut. _E_ RT @AnnCoulter: Trump's speech today was Churchillian only better. You can tell by the spluttering hysteria on TV about @realDonaldTrump. _E_ Entrepreneurs: Ask yourself is this a blip or is it a catastrophe? and your equilibrium will be kept in check if/when hard times hit. 
_E_ If you want to be successful two important considerations are passion and efficiency. Think Like a Champion _E_ Today on Earth Day we celebrate our beautiful forests lakes and land. We stand committed to preserving the natural beauty of our nation. _E_ Heading to Washington this morning. Much work to do. Focus on trade and military. #MAGA _E_ I just filed a major ethics complaint against crooked New York State Attorney General Eric Schneiderman he should resign from office! _E_ We are winning and the press is refusing to report it. Don't let them fool you get out and vote! #DrainTheSwamp on November 8th! _E_ .@alexsalmond @pressjournal RT @rdowns @realdonaldtrump Margaret Thatcher NEVER would have allowed those wind mill monstrosities. _E_ .@williebosshog watched you on @foxandfriends. You were great and I appreciate the nice statements. I'm sending out for your new book now! _E_ Another @BarackObama investment triumph the $500Billion American funded Finnish plug in cars are all being recalled __HTTP__ _E_ Obama and all others have been so weak and so politically correct that terror groups are forming and getting stronger! Shame. _E_ .@washingtonpost by @OConnellPostbiz:"Donald Trump lands @chefjoseandres for Old Post Office flagship restaurant" __HTTP__ _E_ I am in Dubai with Damac. PLACE IS BOOMING AMAZING! Major news conference in two hours. Announcing luxury villas and major golf course. _E_ If you want to be successful in business you must take risks. Make sure each risk is calculated and can have a positive fallback. _E_ Congrats @TrumpToronto for being ranked #1 on @TripAdvisor and a Travellers' Choice 2013 Winner! _E_ Big week coming up! _E_ There usually is an easy solution to every problem. For instance a lot of our country's problems can be solved in next year's election. _E_ Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. 
Edison _E_ .@MattBevin: As someone well versed in job creation and the Private Sector if you lie on your resume You're Fired! _E_ Judge Jeanine Slams GOP Establishment: __HTTP__ _E_ March 5th is rapidly approaching and the Democrats are doing nothing about DACA. They Resist Blame Complain and Obstruct and do nothing. Start pushing Nancy Pelosi and the Dems to work out a DACA fix NOW! _E_ Bad break for @TigerWoods hits a great shot which hits the pin and kicks into the water gets a bogey on hole with another great shot Champ! _E_ Just spoke with @NYGovCuomo and @NYCMayor de Blasio to let them know that the federal government... _E_ It begins Republican Party of Virginia controlled by the RNC is working hard to disallow independent unaffiliated and new voters. BAD! _E_ Re Lance Armstrong—not only was it a big lie but a big lie that lasted too long! _E_ The primary plaintiff in the phony Trump University suit wants to abandon the case. Disgraceful! _E_ Dems failed in Kansas and are now failing in Georgia. Great job Karen Handel! It is now Hollywood vs. Georgia on June 20th. _E_ Record crowd and standing ovation at Simpson College in Iowa lots of fun wonderful audience! _E_ A great night in Raleigh North Carolina! THANK YOU! #Trump2016 __HTTP__ _E_ I'm a skeptical guy but I don't believe Petraeus used this to get out of the Benghazi hearings. _E_ It's Thursday. I wonder how much money @BarackObama drained from Medicare today to finance ObamaCare. _E_ .@billmaher has not yet sent me the $5M he owes which I am giving to various charities. Come on Bill—you made a deal. _E_ Wow I'm at 2200000 followers but I'd love to get rid of the haters & losers—they're such a waste of time! _E_ Third Gun Linked to 'Fast and Furious' Identified at Border Agent's Murder Scene. When will the White House come clean? _E_ Vanity Fair which looks like it is on its last legs is bending over backwards in apologizing for the minor hit they took at Crooked H. 
Anna Wintour who was all set to be Amb to Court of St James's & a big fundraiser for CH is beside herself in grief & begging for forgiveness! _E_ Now China is publicly supporting the OWS protests __HTTP__ It's time for the protesters to go home. _E_ Entrepreneurs: Brainpower is the ultimate leverage. Don't underestimate yourself or your possibilities. _E_ Trump International Golf Club Turnberry Scotland has been home to four of the greatest Open Championships in history __HTTP__ _E_ Congratulations to my Catholic friends on the selection of Pope Francis I to lead the Catholic Church. People that know him love him! _E_ Brent Musburger did himself a great favor by saying what everyone was thinking he is much more popular now than before. _E_ I am a handwriting analyst. Jack Lew's handwriting shows while strange that he is very secretive—not necessarily a bad thing. _E_ Via @mrctv by Ben Graham: Border Reports Back Up Trump's 'Rapists' Claim __HTTP__ _E_ RT @realDonaldTrump: As the phony Russian Witch Hunt continues two groups are laughing at this excuse for a lost election taking hold Dem... _E_ A rare case where the U.S. should help __HTTP__ _E_ I hate when the news media so afraid to offend anyone always refers to the BOSTON KILLER as the suspect . _E_ Empty pockets never held anyone back. Only empty heads and empty hearts can do that. Norman Vincent Peale _E_ "He who is not courageous enough to take risks will accomplish nothing in life." Muhammad Ali _E_ Making speech tonight in New Hampshire leaving now. Fantastic people fantastic crowd! _E_ The San Fran crash was totally the pilot's fault may be too late for drug testing RIDICULOUS! _E_ The Mayweather decision is a disgrace! _E_ Thank you Nevada! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ We need someone with experience to rebuild America. #MakeAmericaGreatAgain __HTTP__ _E_ What did our very stupid & ineffective A.G. 
Eric Schneidean during his trips to MY office tell me about President Obama & Governor Cuomo? _E_ Thank you Wisconsin! Tuesday was a great success for #WorkforceWeek at @WCTC w/ @IvankaTrump & @GovWalker. Remarks... __HTTP__ _E_ Josh8J4 @realDonaldTrump I have a dream that you will be president to make this country great again. #USA Thank you. _E_ When will Mayor Vescio and manager Zegarelli repave Pine Road in @BriarcliffManor? It is a disgrace! _E_ Nobody knows for sure that the Republicans & Democrats will be able to reach a deal on DACA by February 8 but everyone will be trying....with a big additional focus put on Military Strength and Border Security. The Dems have just learned that a Shutdown is not the answer! _E_ He @MittRomney gets the China problem why don't the others? _E_ Patrick Reed—We are proud to have you as our champion at Doral. Love the attitude & the play. See you in March at the Cadillac WGC. _E_ I'll be at Liberty University Monday 10 AM for speech. Looking forward to meeting students all sold out! _E_ RT @foxnation: .@SenTedCruz: I want to Get to a 'Yes' Vote: __HTTP__ _E_ "The best luck of all is the luck you make for yourself." – General Douglas MacArthur _E_ Mr. President it is time to lead on the Korean crisis. Make a statement from the Rose Garden and send a strong message to the man child! _E_ Today is the day! Knock on doors and make calls with us on National Day of Action! #TrumpTrain #MAGA... __HTTP__ _E_ Why is Obama playing basketball today? That is why our country is in trouble! _E_ Consumer confidence is at a 16 year high....and for good reason. Much more regulation busting to come. Working hard on tax cuts & reform! _E_ The Bernie Sanders supporters are furious with the choice of Tim Kaine who represents the opposite of what Bernie stands for. Philly fight? 
_E_ Information is being illegally given to the failing @nytimes & @washingtonpost by the intelligence community (NSA and FBI?).Just like Russia _E_ A Rod must be dropped in the Yankees line up tonight if they want to win. He simply can't perform without drugs. _E_ October has a 7% foreclosure increase last month. Is this @BarackObama's economic recovery? _E_ Wise words from my father: "Know everything you can about what you're doing." Fred C. Trump _E_ A @aahs5star Diamond & Green Star Diamond Award winner @TrumpGolfLA is the nation's top public course __HTTP__ _E_ 70 stores above Punta Pacifica's pristine peninsula @TrumpPanama offers fine dining five pools & luxury rooms __HTTP__ _E_ I will be doing @foxandfriends at 7.00 (15 minutes). _E_ Another great cause Obama could send my $5M donation to is a charity for 9/11 First Responders. They are American heroes. _E_ Such bad reporting: A puff piece on Ben Carson in the @nytimes states that Carson is trying to solidify his lead. But I am #1 easily! Sad _E_ Our next Vice President of the United States of America Gov. @Mike_Pence!#GOPinCLE #GOPConvention#AmericaFirst __HTTP__ _E_ I look forward to paying my respects to our brave men and women on this Memorial Day at Arlington National Cemetery later this morning. _E_ They are saying that tickets to tonight's Saturday Night Live are the hardest to get in the history of this great show! Off to a good start! _E_ I've gotten many letters from people fighting autism thanking me for stating how dangerous 38 vaccines on a (cont) __HTTP__ _E_ Wow Ted Cruz falsely suggested Marco Rubio mocked the Bible and was just forced to fire his Communications Director. More dirty tricks! _E_ I enjoy meeting tourists in #TrumpTower. People travel from across the world to see the five level Atrium & waterfall. _E_ I will be going to Indiana on Thursday to make a major announcement concerning Carrier A.C. staying in Indianapolis. Great deal for workers! 
_E_ I have a feeling the emphasis by @johnrich and @marleematlin will be on the charities and the money raised. (cont) __HTTP__ _E_ Snow and freezing weather all over mid section of Country. Global warming specialists better start thinking fast! _E_ Obama:"I will destroy ISIS" = Obama: "If you like your healthcare plan you can keep your plan." _E_ Watch me get inducted into the #WWEHOF tonight at 10PM on USA. I will be posting exclusive behind the... __HTTP__ _E_ Scotland is beautiful. I spent several years looking for the right place visiting over 200 sites and this is absolutely the right place! _E_ George Will was pushing for @JonHuntsman for the GOP nomination in December...said he was going to win. (cont) __HTTP__ _E_ Thank you. __HTTP__ _E_ China is closing a massive oil deal w/ Russia taking advantage of the Ukraine conflict __HTTP__ Smart unlike our leaders. _E_ I havn't seen @tonyschwartz in many years he hardly knows me. Never liked his style. Super lib Crooked H supporter. Irrelevant dope! _E_ It was my great honor to welcome Prime Minister Alexis Tsipras of Greece to the WH today! __HTTP__ 📸 __HTTP__ __HTTP__ _E_ Crooked Hillary Clinton perhaps the most dishonest person to have ever run for the presidency is also one of the all time great enablers! _E_ Join me LIVE in South Korea🇰 #NationalAssembly #POTUSinAsia __HTTP__ __HTTP__ _E_ My @ WCNC News interview w/ @DianneG touring the magnificent Trump National Charlotte course & facilities __HTTP__ _E_ Thank you Arizona. Beautiful turnout of 15000 in Phoenix tonight! Full coverage of rally via my Facebook at: __HTTP__ __HTTP__ _E_ #TheArsenioHallShow Well it had to happen. People that are disloyal in the long run never make it. Arsenio was just cancelled! _E_ ... at St. Jude Children's Research Hospital __HTTP__ I am proud of you Eric. _E_ Hillary Clinton has bad judgment and is unfit to serve as President. __HTTP__ _E_ The Middle East is blowing up we didn't back Egypt and now they riot against us. 
Iran is using Iraqi airspace (cont) __HTTP__ _E_ For those asking the Republicans only have 51 votes in the Senate and they need 60. That is why we need to win more Republicans in 2018 Election! We can then be even tougher on Crime (and Border) and even better to our Military & Veterans! _E_ Join me live as we recognize the first responders to the June 14th shooting involving @SteveScalise. #TeamScalise __HTTP__ __HTTP__ _E_ .@DannyZuker Don't lie @ApprenticeNBC was #1 in all major demos at 10. Do not lie! _E_ RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_ Sadly the overwhelming amount of violent crime in our major cities is committed by blacks and hispanics a tough subject must be discussed. _E_ #DrainTheSwamp! __HTTP__ _E_ With Hillary and Obama the terrorist attacks will only get worse. Politically correct fools won't even call it what it is RADICAL ISLAM! _E_ Thank you so much for the wonderful article Robert Davi. __HTTP__ _E_ Sen. McCain should not be talking about the success or failure of a mission to the media. Only emboldens the enemy! He's been losing so.... _E_ I just released my financial disclosure forms the largest numbers in the history of the F.E.C. Even the dishonest media thinks great! _E_ So professional of @ABC news to throw out the failing @UnionLeader newspaper from their debate. Paper won't survive highly unethical! _E_ A great afternoon. Thank you South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Today's job report is not a good sign & we could be facing another recession. No real job growth. We need over 300K new jobs a month. _E_ We should have gotten more of the oil in Syria and we should have gotten more of the oil in Iraq. Dumb leaders. _E_ Thank you to all for your wonderful comments on my speech. I could feel the electricity in thr air. Great reviews most votes ever recieved _E_ What do you think @amandatmiller is writing? 
#CelebApprentice _E_ Our country is being torn apart from the inside it's getting nasty out there. _E_ It was a pleasure to have President Ashraf Ghani of Afghanistan with us this morning! #USAatUNGA #UNGA __HTTP__ _E_ RT @NYPDnews: Many supported NYers when Sandy hit. Now our NY Task Force 1 can be there to help others during Harvey & #HurricaneIrma. Here... _E_ Based on the ovation last night from the Letterman @Late_Show audience I believe it will be hard for Obama to throw $5M down the drain.... _E_ Why would the USChamber be upset by the fact that I want to negotiate better and stronger trade deals or that I want penalties for cheaters? _E_ I am on @greta now! _E_ In one hour I will be making a major announcement from Trump Tower. Watch it live on Periscope! __HTTP__ _E_ Do you agree with the client's decision? #CelebApprentice _E_ Wow Bernie Sanders just admitted that the real unemployment rate is 10% (it is actually over 20%) and for African American youth 51%. _E_ Dopey Mort Zuckerman owner of the worthless @NYDailyNews has a major inferiority complex. Paper will close soon! _E_ 92 stories above North Michigan Avenue @TrumpChicago's 5 Star @Forbes rated rooms have the best views of Chicago __HTTP__ _E_ Don't believe Chrysler (if Obama wins) see how fast @Jeep production will be moved to China and I'll be watching! _E_ Via @wsoctv by @BlairWSOC9: EXCLUSIVE: Donald Trump talks possible presidential run __HTTP__ _E_ I AM PLEASED TO INFORM YOU THAT CELEBRITY APPRENTICE HAS BEEN RENEWED FOR ANOTHER SEASON BY NBC. SEE YOU AT THE NBC UPFRONTS TOMORROW. _E_ It was Rosie O'Donnell who ate the cake in the vicious Hillary commercial about me not Crooked Hillary! @marthamaccallum _E_ Today I will meet with Canadian PM Trudeau and a group of leading business women to discuss women in the workforce. __HTTP__ _E_ Hillary Clinton is not qualified to be president because her judgement has been proven to be so bad! Would be four more years of stupidity! 
_E_ The middle class has worked so hard are not getting the kind of jobs that they have long dreamed of and no effective raise in years. BAD _E_ Just announced that because of Trump advertising rates for debate on @CNN are going from $5000 to $200000 a 4000% increase.PAY CHARITY? _E_ State Senator Shirley Huntley ratted on black politicians & was believed when she ratted on @AGSchneiderman nobody listened. Racism! _E_ Response to Huffington Post __HTTP__ _E_ Crowd gathers to hear Trump speech in Las Vegas __HTTP__ _E_ Thank you Des Moines Iowa! Governor @Mike_Pence and I appreciate your support! #MAGA #TrumpTrain __HTTP__ _E_ FEMA and first responders are working hard (yet again) on Hurricane Nate. Military helping. Very much under control! _E_ Exciting news—After massive construction the Blue Monster at Trump National Doral is open for business today. __HTTP__ _E_ Iran and the United States just pushed deadline back SEVEN MONTHS on working out a nuclear deal. Iran is tapping along our bad negotiators! _E_ Thank you South Carolina! #Trump2016 __HTTP__ _E_ .@VP Mike Pence is working hard on HealthCare and getting our wonderful Republican Senators to do what is right for the people. _E_ Entrepreneurs: Apply your skills and talent but above all be tenacious. See yourself as victorious which means never giving up. _E_ Hagel has been endorsed by China __HTTP__ & Iran __HTTP__ for SOD. Welcome to Obama's second term! _E_ Johnny Miller correctly very critical of greens at Pinehurst. Said they should be redone _E_ .@Peggynoonannyc Interesting article but I will beat Hillary easily. People that have given up on the system will come out to vote for me! _E_ The GOP primary is getting very nasty. The candidates need to remember that @BarackObama is the main target. He must not be reelected. 
_E_ Via @DMRegister by @BylineAndyDavis: "Donald Trump speaks to veterans residents in Coralville" __HTTP__ _E_ Little @MacMiller sent me an expensive plaque for making his song "Donald Trump" such a big hit. Mac you still... __HTTP__ _E_ The debates are going to have a big impact on the election. @MittRomney has proved in Florida he delivers under pressure. _E_ Even Crazy Jim Acosta of Fake News CNN agrees: "Trump World and WH sources dancing in end zone: Trump wins again...Schumer and Dems caved...gambled and lost." Thank you for your honesty Jim! _E_ The @erictrumpfdn Golf Invitational featuring a performance by @BretMichaels was a great event. Enjoy the video.... __HTTP__ _E_ #TrumpVine Opinion on Egypt __HTTP__ _E_ Only a Reagan or a Trump like figure in the White House will achieve this goal. __HTTP__ _E_ This election is a choice between law order & safety or chaos crime & violence. I will make America safe again for everyone. #ImWithYou _E_ .@Ed_Klein's book 'The Amateur' is out in paper back. Lots of insights. _E_ If you have passion confidence resilience & vision you could become an entrepreneur. Add focus to the list & you're off to a good start _E_ Russia is sending a fleet of ships to the Mediterranean. Obama's war in Syria has the potential to widen into a worldwide conflict. _E_ We will never forget the 241 American service members killed by Hizballah in Beirut. They died in service to our nation. __HTTP__ _E_ China is taking the oil from Iraq after we spent 1.5 trillion dollars and thousands of lives for their freedom . Our leaders are so stupid! _E_ I will be on Fox and Friends at.7.00 A.M. Enjoy! _E_ Look forward to being in Tampa this afternoon. Wonderful crowds. Thank you Florida! _E_ NYC's top cop acted wisely and legally to monitor activities of some in the Muslim community. Vigilance keeps us (cont) __HTTP__ _E_ Do you believe what is going on in Washington with respect to Syria these people don't have a clue! _E_ Just left Sioux Center Iowa. 
My speech was very well received. Truly great people! Packed house overflow! _E_ If the great Si Newhouse were still running @CondeNastCorp he would fire Graydon Carter immediately circulation tanking. _E_ RT @Scavino45: Hurricane force winds hit Florida Keys. 390 shelters have been opened in Florida. Shelters near you __HTTP__ _E_ Be sure to keep following announcements on the development of Trump International Golf Club Dubai. Will be spectacular. _E_ It is impossible for the FBI not to recommend criminal charges against Hillary Clinton. What she did was wrong! What Bill did was stupid! _E_ Great @Esquiremag piece '@DonaldJTrumpJr: What I've Learned' __HTTP__ _E_ Highly respected Constitutional law professor Mary Brigid McManamon has just stated Ted Cruz is not eligible to be President. Big problem _E_ When will @BarackObama release his college and law school transcripts? __HTTP__ _E_ The last thing we need in Alabama and the U.S. Senate is a Schumer/Pelosi puppet who is WEAK on Crime WEAK on the Border Bad for our Military and our great Vets Bad for our 2nd Amendment AND WANTS TO RAISES TAXES TO THE SKY. Jones would be a disaster! _E_ Call @MELANIATRUMP today on @QVC at 5 PM EST say hello and buy buy buy! _E_ ...can't change history but you can learn from it. Robert E Lee Stonewall Jackson who's next Washington Jefferson? So foolish! Also... _E_ Amazing comeback by The Heat your friends at your favorite golf club Trump National Doral are proud of you. NOW for game 7! _E_ Departing New York with General James 'Mad Dog' Mattis for tonight's rally in Fayetteville North Carolina! See you... __HTTP__ _E_ Don't ever forget we will together MAKE AMERICA GREAT AGAIN! _E_ ....Also there is NO COLLUSION! _E_ If the disgusting and corrupt media covered me honestly and didn't put false meaning into the words I say I would be beating Hillary by 20% _E_ Ebola patient will be brought to the U.S. in a few days now I know for sure that our leaders are incompetent. 
KEEP THEM OUT OF HERE! _E_ The winner of Best In Show of the 139th @WKCDOGS Miss P visited @TrumpTowerNY today __HTTP__ _E_ Watch @PaulRyanVP explain how 'It's irrefutable' that President Obama is damaging Medicare' __HTTP__ _E_ What the hell is going on with GLOBAL WARMING. The planet is freezing the ice is building and the G.W. scientists are stuck a total con job _E_ Looking forward to Friday night in the Great State of Alabama. I am supporting Big Luther Strange because he was so loyal & helpful to me! _E_ Today I was thrilled to announce a commitment of $25 BILLION & 20K AMERICAN JOBS over the next 4 years. THANK YOU... __HTTP__ _E_ Obama wanted Putin to reset. Instead Putin laughed at him and reloaded. _E_ Mexico doesn't respect our border hourly __HTTP__ Release USMC Tahmooressi NOW! Time for a boycott? #SaveOurMarine _E_ I don't know @SamuelLJackson to best of my knowledge haven't played golf w/him & think he does too many TV commercials—boring. Not a fan. _E_ Thank you @SeanHannity & @BoDiet! #MakeAmericaGreatAgain _E_ June 16th __HTTP__ _E_ The election is absolutely being rigged by the dishonest and distorted media pushing Crooked Hillary but also at many polling places SAD _E_ This Tweet from @realDonaldTrump has been withheld in response to a report from the copyright holder. _E_ America's Olympic uniforms are manufactured in China. Burn the uniforms!#U.S.OlympicCommittee _E_ Some low life journalist claims that I made a pass at her 29 years ago. Never happened! Like the @nytimes story which has become a joke! _E_ "@IvankaTrump: 'Trump Estates Dubai unlike anything else in the region'" __HTTP__ via @aawsat_eng by Musaid Al Zayani _E_ The wimps that run Penn State should be forced to resign (and be sued) for the pathetic settlement they made and destruction of great legacy _E_ Worst ever issue of @VanityFair magazine—bad food Graydon Carter should be fired! _E_ 1. 
Each week you the audience can choose an MVP among the celebrities @CelebApprentice using Twitter...... _E_ Donald Trump Announcement: $5 Million for Obama College Records __HTTP__ via @Newsmax_Media _E_ .@bobbyjindal watched you on @TeamCavuto. Made some excellent points. Best Wishes. _E_ Via @HorsetalkNZ: "NY's Central Park Horse Show a huge success" __HTTP__ _E_ It is time to #DrainTheSwamp in Washington D.C! Vote Nov. 8th to take down the #RIGGED system! __HTTP__ _E_ I loved watching Clint Eastwood last night he was terrific! _E_ Obama's attack on the internet is another top down power grab. Net neutrality is the Fairness Doctrine. Will target conservative media. _E_ Another great poll result! Thank you! __HTTP__ _E_ Obama is the most profligate deficit & debt spender in our nation's history. Doubled debt (cont) __HTTP__ _E_ Now a small country like Sudan tells Obama he can't send any more Marines __HTTP__ We are a laughing stock. _E_ 15K in OK! Had to turn away 5k but we are coming back soon to take care of them! So much love in the crowd! Thanks! __HTTP__ _E_ If only the illegals were Tea Party members then Obama would get them out of the country immediately. _E_ Via @Newsmax_Media by @melaniebatley: Donald Trump: France's Strict Gun Laws Enabled Attack __HTTP__ _E_ Look here's the deal: @BarackObama has been a total disaster. He has spent this country into the ground and (cont) __HTTP__ _E_ .@MarissaMayer is right to expect Yahoo employees to come to the workplace vs. working at home. She is doing a great job! _E_ Leaving for Liberty University. I'll be speaking today in front of a record crowd. #Trump2016 _E_ By continuing to give massive subsidies to Scotland's ugly wind turbines @David_Cameron is playing right into @AlexSalmond's hands. _E_ China has 5 oil projects in Iraq and we didn't get anything from the Iraqis except asked to leave. Iraq is going (cont) __HTTP__ _E_ Crooked Hillary Clinton is unfit to serve as President of the U.S. 
Her temperament is weak and her opponents are strong. BAD JUDGEMENT! _E_ This week we came one step closer to reaching the goal of aligning the skills taught in our nation's classrooms with the jobs of the future. __HTTP__ _E_ America needs @MittRomney and @PaulRyanVP and we need them right now. @GovChristie _E_ ...these days...we could all use a little of the power of Trumpative thinking. –BarnesandNoble.com __HTTP__ _E_ How come nobody mentions that the Nielsen Ratings of the Apprentice after 12 seasons as shown by Howard Stern totally blow away... _E_ The phony Club For Growth which asked me in writing for $1000000 (I said no) is now wanting to do negative ads on me. Total hypocrites! _E_ New York City's iconic architectural masterpiece @TrumpTowerNY houses prime commercial residential & retail space __HTTP__ _E_ .@IvankaTrump's @FoxNewsSunday "Power Player of the Week" interview with Chris Wallace __HTTP__ _E_ Senator Mitch McConnell said I had excessive expectations but I don't think so. After 7 years of hearing Repeal & Replace why not done? _E_ Via @BreitbartNews by @mboyle1: "EXCLUSIVE — DONALD TRUMP TO SPEAK AT CPAC" __HTTP__ @CPACnews _E_ Watching other networks and local news. Really good night! Crazy @megynkelly is unwatchable. _E_ This is more than a campaign it is a movement. #MakeAmericaGreatAgainSIGN UP TODAY & WE WILL WIN! __HTTP__ _E_ Join me in Pueblo Colorado on Monday afternoon at 3pm! #TrumpRally __HTTP__ _E_ FACT – the reason why Americans have to worry about a government shutdown is because Obama refuses to pass a budget. _E_ Be sure to stop by Trump Tower today I'll be signing copies of my new book Time To Get Tough from 11 am to 2 pm. _E_ My warmest condolences to the families of the horrible Roseburg Oregon shootings. _E_ Trump Int'l Golf Links Scotland awarded 5 star status by Scottish Tourism chiefs. 
Via MailOnline __HTTP__ _E_ I know our complex tax laws better than anyone who has ever run for president and am the only one who can fix them. #failing@nytimes _E_ The POLICE in Paris did a fantastic job. Very brave not easy! _E_ Have confidence work hard and keep your focus on the small things that matter while keeping the big picture in mind. _E_ Many of the released Guantanamo detainees are now fighting for ISIS and other enemy groups.We need proper leadership before it is too late! _E_ According to @pewresearch 2/3 of Mexican LEGAL immigrants do not pursue citizenship because of 'no interest' __HTTP__ _E_ US interest payments on the debt have already passed $375B this year __HTTP__ China is laughing at us as usual. _E_ Now A Rod doesn't even show up to his single A rehab games. Maybe the @Yankees will get lucky and @MLB will suspend A Rod. _E_ 1/5 households is on food stamps __HTTP__ We must do better. Americans need to have a work ethic. _E_ Crooked Hillary's brainpower is highly overrated.Probably why her decision making is so bad or as stated by Bernie S she has BAD JUDGEMENT _E_ Via @LatinoVoices by @CaritoJuliette: "Meet The Latina 2014 @MissUniverse Candidates" __HTTP__ _E_ President Obama wants @MittRomney to hand over even more past tax returns he should when @BarackObama reveals his college applications. _E_ My @CNN interview with @piersmorgan explaining why Mitt should not apologize __HTTP__ _E_ All the haters & losers must admit that unlike others I never attacked dopey Jon Stewart for his phony last name. Would never do that! _E_ Disaster! The @BarackObama tax hikes set for 2013 are going to throw us back into a recession according to the CBO __HTTP__ _E_ Looking forward to speaking at prestigious @TheEconomicClub on December 15th __HTTP__ _E_ Thank you Illinois! #SuperTuesday #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ It's almost like the United States has no President we are a rudderless ship heading for a major disaster. Good luck everyone! 
_E_ I can't believe Republican leadership allowed such a stupid deal to be made. They are rapidly giving up all of their cards. _E_ .@foxandfriends will be showing much of our successful trip to Asia and the friendships & benefits that will endure for years to come! _E_ Democrat Jon Ossoff who wants to raise your taxes to the highest level and is weak on crime and security doesn't even live in district. _E_ Must read editorial via @IBDeditorials: ObamaCare's Bitter Irony: It May Increase Number Of Uninsured __HTTP__ _E_ Sexual assault and rape in the Armed Forces is a Massive problem that nobody wants to talk about or do anything about the big dark secret! _E_ Herman Cain handled the pressure of the debate really well. @THEHermancain _E_ Rosie is back on the View which tells you how desperate they must be. It is the standard short term fix and long term disaster. _E_ Diligence is the mother of good luck. Benjamin Franklin _E_ I delivered a speech in Charlotte North Carolina yesterday. I appreciate all of the feedback & support. Lets #MAGA... __HTTP__ _E_ Rev. @BillyGraham is a great man and so is his son Franklin Graham. _E_ .@ByronYork Great numbers from @CBSNews Poll. Also from ABC Washington Post Poll. Thank you! @CNN _E_ The Failing @nytimes has totally gone against the Social Media Guidelines that they installed to preserve some credibility after many of their biased reporters went Rogue! @foxandfriends _E_ The Daily Snooze publishes lies about me. They should be ashamed but it will die very soon. _E_ Think positively. Zap negativity immediately. Focus on the solution not the problem. Be persistent and alert every single day. Momentum! _E_ Don't think my statement on @ariannahuff was harsh if you knew her and the phony Huffington Post you would understand more to follow. _E_ It was an honor to welcome the Prime Minister of Vietnam Nguyễn Xuân Phúc to the @WhiteHouse this afternoon. 
__HTTP__ _E_ The @MissUniverse contestants review their amazing stay at @TrumpDoral __HTTP__ _E_ .@ErinBurnett should have stayed at CNBC—she was never smart but people liked her. @OutFrontCNN Jeff Zucker's got problems! _E_ I have been hitting Obama and Crooked Hillary hard on not using the term Radical Islamic Terror. Hillary just broke said she would now use! _E_ Was President Obama in charge of this years Academy Awards they remind me of the ObamaCare website! #Oscars. _E_ .@MattGinellaGC Have you ever seen Trump National/Bedminster or Trump International Golf links in Scotland. Both far better than Pinehurst! _E_ Demand by China continues to raise the price of oil __HTTP__ We must become energy independent through our vast resources. _E_ .@danawhite Great job last night very exciting! You have come a long way from those difficult early days I am proud of you. _E_ Every economic climate whether an uptick or downturn presents new opportunities and challenges. _E_ Join me! 6/10: Richmond VA 8pm6/11: Tampa FL 11am6/11: Pittsburgh PA 3pm6/13: Portsmouth NH 2:30pm __HTTP__ _E_ Congratulations to @David_Bossie & his team @Citizens_United on their important court win for the First Amendment! __HTTP__ _E_ .@BretBaier Why do you have George Will on your show he's exhausted boring and not even a little relevant! Waste of good air time! _E_ Bernie Sanders is being treated very badly by the Dems. The system is rigged against him. He should run as an independent! Run Bernie run. _E_ .@nypost: "Dozens of key staffers fleeing @AGSchneiderman's office" __HTTP__ _E_ There's nothing wrong with bringing your talents to the surface. Having an ego and acknowledging it is a healthy choice. _E_ I don't think Ted Cruz can even run for President until he can assure Republican voters that being born in Canada is not a problem. Doubt! _E_ Looking forward to meeting with Prime Minister @Netanyahu shortly. Peace in the Middle East would be a truly great legacy for ALL people! 
_E_ Congratulations to THE MOVEMENT we have just won THE GREAT STATE OF OREGON. The vote percentage is even higher than anticipated! Thank you. _E_ In last night's #CNNDebate @MittRomney proved once again why he is the steady conservative who can restore America's future. _E_ "Arrests of MS 13 Members Associates Up 83% Under Trump" __HTTP__ _E_ In making big money knowledge is far more important than any other ingredient including money itself! _E_ Wonderful @pastormarkburns was attacked viciously and unfairly on @MSNBC by crazy @morningmika on low ratings @Morning_Joe. Apologize! _E_ Mayor Bill Vescio of Briarcliff Manor Westchester is doing a terrible job. Horrible roads high taxes housing down. @westchestergov _E_ If you can't handle the hard times that come with business then you will never be able to celebrate the successes. Focus & Stay Positive. _E_ Wacky @NYTimesDowd who hardly knows me makes up things that I never said for her boring interviews and column. A neurotic dope! _E_ Ben Carson has never created a job in his life (well maybe a nurse). I have created tens of thousands of jobs it's what I do. _E_ Chicago don't forget tix for @EricTrumpFdn Wine Tasting Fundraiser @TrumpChicago 11/22. Proceeds benefit @StJude __HTTP__ _E_ Wow television ratings just out: 31 million people watched the Inauguration 11 million more than the very good ratings from 4 years ago! _E_ The American people have waited long enough. There has been enough talk and no action for seven years. Now is the time for action! __HTTP__ _E_ "Take the time to move yourself forward. In other words think work and be lucky." – Think Like a Champion _E_ Via @shinysheet: Mar a Lago to host top equestrian jumpers: Trump Invitational will benefit 90 area charities. __HTTP__ _E_ Congrats to @AlCardenasACU and @CPACnews. I really enjoyed being there—the response was so terrific! _E_ RT @Scavino45: #USNSComfort en route to #PuertoRico from Norfolk Virginia to support Hurricane Maria relief efforts. 
__HTTP__ _E_ Thank you Florida can't wait to see you Friday in Miami! Join me: __HTTP__ __HTTP__ _E_ President Obama created a VERY BAD precedent by handing over five Taliban prisoners in exchange for Sgt. Bowe Bergdahl. Another U.S. loss! _E_ Entrepreneurs: Be curious. Discovery breeds discovery just as success breeds success. Don't sell yourself short. _E_ My interview with @EWErickson of @RedState discussing #TimeToGetTough GOP primary and my 2012 options __HTTP__ _E_ Via @BreitbartNews: "DONALD TRUMP AT SUMMIT: OBAMACARE A 'FILTHY LIE' CAN BUILD 'A BEAUTY' OF A BORDER FENCE" __HTTP__ _E_ Constitutional law expert #Laurence Tribe of Harvard says wrong to say it (natural born citizen) is a settled matter it isn't settled). _E_ Thank you Governor @TerryBranstad! #AmericaFirst #Debates2016 __HTTP__ _E_ "No one remembers who came in second." – Walter Hagen _E_ With so many scandals plaguing Obama it seems that they all hit him at the right time. Could help him get away w/ all of them. _E_ He @BarackObama is using the IRS to sabotage the Tea Party __HTTP__ What about the Occupy Wall Street groups? _E_ Congratulations to Roy Moore on his Republican Primary win in Alabama. Luther Strange started way back & ran a good race. Roy WIN in Nov! _E_ Played the Trump International Golf Club in Palm Beach last weekend. One of the best golf courses in the country. Perfect weather. _E_ Congratulations to the Rolling Stones on marking their 50th anniversary in London. _E_ There is no instance of a nation benefitting from prolonged warfare. Sun Tzu _E_ I will also be going to a wonderful state Missouri that I won by a lot in '16. Dem C.M. is opposed to big tax cuts. Republican will win S! _E_ Such great support in New Hampshire. So many people are working so hard to #MakeAmericaGreatAgain! _E_ Via @BreitbartNews by @LarryOConnor: TRUMP: NY MAG AILES STORY 'TOTAL BULLS**T' __HTTP__ It was total bullshit! _E_ The @USCHAMBER must fight harder for the American worker. 
China and many others are taking advantage of U.S. with our terrible trade pacts _E_ Just watched Hillary deliver a prepackaged speech on terror. She's been in office fighting terror for 20 years and look where we are! _E_ FLASHBACK – "Donald Trump Blasts Obama for Failing to Secure Christian Pastor's Freedom in Iran __HTTP__ via @theblaze' _E_ "You're never a loser until you quit trying." Mike Ditka _E_ Democrats are trying to bail out insurance companies from disastrous #ObamaCare and Puerto Rico with your tax dollars. Sad! _E_ #CrookedHillary __HTTP__ _E_ "The President has accomplished some absolutely historic things during this past year." Thank you Charlie Kirk of Turning Points USA. Sadly the Fake Mainstream Media will NEVER talk about our accomplishments in their end of year reviews. We are compiling a long & beautiful list. _E_ Snowden is sitting in China and taunting the U.S. He is mocking us as a Country. Great time to place a tax on China trade if not turned over _E_ What's more important? Rebuilding our military or bailing out insurance companies? Ask the Democrats. _E_ Thank you Geneva Ohio. If I am elected President I am going to keep RADICAL ISLAMIC TERRORISTS OUT of our countr... __HTTP__ _E_ I will do far more for women than Hillary and I will keep our country safe something which she will not be able to do no strength/stamina! _E_ I put @DonnyDeutsch on Apprentice at his request I did his failed cable show as a favor to him then he knocks me for my Obama announcement. _E_ How can George Osborne reduce UK debt while spending billions to subsidize Scotland's garbage wind turbines that are destroying the country? _E_ 'Remarks by President Trump at Signing of H.J. Resolution 41' __HTTP__ __HTTP__ _E_ We pause today to remember the 2403 American heroes who selflessly gave their lives at Pearl Harbor 75 years ago... __HTTP__ _E_ Our wonderful new Healthcare Bill is now out for review and negotiation. 
ObamaCare is a complete and total disaster is imploding fast! _E_ Thanks to @TheRealMarilu a great woman for her wonderful defense of the Miss USA pageant. _E_ We need a dealmaker in the White House who knows how to think innovatively and make smart deals. #TimeToGetTough. _E_ Wow honored to just pass 2.5M followers on @twitter. Thanks to all my followers. We are going to have a great year together. _E_ Crooked Hillary Clinton said she is used to dealing with men who get off the reservation. Actually she has done poorly with such men! _E_ Congratulations to Michelle and Barack Obama on their 20th anniversary. _E_ I will be doing @hannityshow tonight on Fox at 9 o'clock. Will be interesting and tough! _E_ We call for the full restoration of democracy and political freedoms in Venezuela and we want it to happen very very soon! __HTTP__ _E_ Let Pete into the Hall of Fame __HTTP__ @PeteRose_14 _E_ Ron Fournier: Clinton Used Secret Server To Protect #CircleOfEnrichment" __HTTP__ _E_ Big day for healthcare. Working hard! _E_ Great job by @EricTrump on interview with @BillHemmer on @FoxNews. #ImWithYou #TrumpTrain _E_ Everybody is talking about the protesters burning the American flags and proudly waving Mexican flags. I want America First so do voters! _E_ Trump Nat'l Golf Club Philadelphia is a 360 acre beauty and an award winning Tom Fazio designed course fantastic! __HTTP__ _E_ In order to preserve my options and guarantee that @BarackObama is defeated I changed my voter registration to independent. _E_ Tonight at 8:00 is a really big one for a double episode of Celebrity Apprentice. Watch you won't believe what happens! _E_ Statement by me last night in Florida: "Honestly I don't think the Democrats want to make a deal. They talk about DACA but they don't want to help..We are ready willing and able to make a deal but they don't want to. They don't want security at the border they don't want..... 
_E_ Looking forward to speaking at 1:30PM tomorrow in Nashua at @NHGOP @FITNsummit!. Let's Make America Great Again! #FITN _E_ On beautiful Lake Norman @Trump_Charlotte offers a state of the art Clubhouse to complement its championship course __HTTP__ _E_ Great poll numbers! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Wow the ObamaCare website which President Obama said would be working TODAY is a total mess with many functions not even thought about! _E_ Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_ Be sure to set exceptional goals for your 2015 resolutions. Push yourself you can do it. Think Big! _E_ The unemployment numbers are tragic. We are letting the world take our jobs. It has to stop! _E_ Had a great time on the @HowardStern show this morning—he will and should never change! _E_ Why doesn't the failing @nytimes write the real story on the Clintons and women? The media is TOTALLY dishonest! _E_ Hillary Clinton's weakness while she was Secretary of State has emboldened terrorists all over the world..cont: __HTTP__ _E_ #SweepsTweet @clayaiken might get some use out of the Chi Touch digital hairdryer. Not the same for @arsenioofficial. _E_ Wow! @FoxNews poll just came out. #1 with 26%! Almost as importantly I am the strongest on economic issues by far! #Trump2016 _E_ HRC is using the oldest play in the Dem playbook when their policies fail they are left w/this one tired argument! __HTTP__ _E_ Never let the fear of striking out get in your way. Babe Ruth _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ .@IamStevenT gave one of the greatest endings to a show ever @MissUniverse. Standing ovation! _E_ Fast trial and death penalty for maniac in Colorado immediately pass speed up legislation. _E_ Last week's boardroom was truly epic and the dust hasn't settled yet. #CelebApprentice _E_ Thank you Jonathan. Greatly appreciated! 
__HTTP__ _E_ My twitter has become so powerful that I can actually make my enemies tell the truth. _E_ My great honor! __HTTP__ _E_ China's leadership is sneaky and underhanded they significantly underreport their actual defense budget and (cont) __HTTP__ _E_ He's back and causing more trouble than ever before! @THEGaryBusey returns in the record 13th season of 'All Star' @CelebApprentice. _E_ Off to Nashville and the NRA. _E_ ... It is very effective and a commonly used business tool. _E_ Obama's economic policies are causing inflation on hard working families. The price of corn alone has risen over 200% since he was elected. _E_ I cannot believe the Republicans are extending the debt ceiling—I am a Republican & I am embarrassed! _E_ See you soon Arizona! #Trump2016 __HTTP__ _E_ Congratulations to @FLGovScott the state is really making progress and fast! _E_ Part 1 of my @jimmyfallon interview discussing my $5M offer to Obama #TRUMP Tower atrium my tweets & 57th st. crane __HTTP__ _E_ Think of it 20% of our country is essentially unemployed. _E_ Quote of the Day: Donald Trump Decrees Boycott on Glenfiddich Scotch __HTTP__ via @Zagat _E_ Great news that the New York Stock Exchange won't be owned by a German company. European regulators turned the (cont) __HTTP__ _E_ Are we talking about the same cyberattack where it was revealed that head of the DNC illegally gave Hillary the questions to the debate? _E_ Hillary's Aides Urged Her to Take Foreign Lobbyist Donation And Deal With Attacks: __HTTP__ _E_ A storied franchise with a loyal fanbase @buffalobills should remain in Buffalo. _E_ Video from Michigan last night. After asking for months the media panned their cameras! __HTTP__ __HTTP__ _E_ It's hard to read the Failing New York Times or the Amazon Washington Post because every story/opinion even if should be positive is bad! _E_ RT @MittRomney: For nearly 4 years Barack Obama has refused to crack down on China's cheating & American workers have paid the price. 
_E_ Thank you Windham New Hampshire! #TrumpPence16 #MAGA __HTTP__ _E_ Problems are never truly hardships to winners & if you haven't got any then you must not have a business to run. _E_ Trump: Zimmerman Trial 'Traumatic Period for Country' __HTTP__ @Newsmax_Media _E_ .@NY_POLICE Commissioner Ray Kelly has done a top job keeping NYC safe. Stop & Frisk has been a critical tool for the NYPD. _E_ Why does Conde Nast allow dopey Graydon Carter to run bad food restaurants while running failing @VanityFair magazine? _E_ A commander in chief has to possess the right instincts. That's one of the biggest problems with @BarackObama: (cont) __HTTP__ _E_ Thank you Iowa! I appreciate all of your support @IowaCentral & @ethanolbyPOET this evening! #Trump2016 #IACaucus __HTTP__ _E_ Tune in for my interview with @gretawire tonight at 10 pm @FoxNews _E_ Looking forward to the GOP debate and the outcome of the Ames straw poll. We must get a real leader. _E_ I am interviewed on This Week on @ABC this morning. Enjoy! _E_ In answer to your questions about my favorite impersonator the answer is Darrell Hammond. _E_ ...(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong). To bad the Dems have no one who can change tones! _E_ Less than one week away from implementation ObamaCare's small business exchanges are not ready! __HTTP__ A disaster! _E_ I am so disappointed that the Yankeed haven't terminatrd A Rod's contract. There is no way they would not win in court! Hard to believe. _E_ Thank you Kenansville North Carolina! Remember on November 8th that special interest gravy train is coming to a... __HTTP__ _E_ ObamaCare strikes again. Major insurer announced that over 53000 New Yorkers will be dropped from their plans __HTTP__ _E_ My thoughts on the Republican Party in today's #trumpvlog... __HTTP__ _E_ Wow. Unbelievable. __HTTP__ _E_ Some dope said I deleted a tweet about James G. 
There was no tweet and there was no delete a totally fabricated story (nobody saw tweet). _E_ The current tax code is a burden on American taxpayers & harmful to job creators. Americans need #TaxReform! More: __HTTP__ __HTTP__ _E_ Via @Newsmax_Media: "Trump Iowa Visit Raises 2016 Speculation" __HTTP__ _E_ Let's go! #CelebApprentice _E_ Via @NYDailyNews: Joan Rivers' last work for @ApprenticeNBC will run on two shows next season says Donald Trump __HTTP__ _E_ Congratulations to Tom Brady @Patriots he is a great quarterback and a great champion! _E_ China just agreed that the U.S. will be allowed to sell beef and other major products into China once again. This is REAL news! _E_ The lawyer I just beat in Chicago was a buffoon but was a lot smarter and sharper than @DannyZuker. Come on Danny make the bet! _E_ .@TheRevAl came to my Trump Tower office to apologize for calling me a racist very nice apology accepted! _E_ Interesting that Roberts said it was a tax in order to come out with his good public relations decision when (cont) __HTTP__ _E_ President Obama will go down as perhaps the worst president in the history of the United States! _E_ Congrats @GretchenCarlson's new Fox show debuts w/ very strong ratings __HTTP__ Guess who her first guest was? Donald Trump. _E_ Tune into the legendary @BarbaraJWalters at 10pmE on@ABC2020 tonight. #MeetTheTrumps for a full hour @ABC #ABC2020! __HTTP__ _E_ Good advice from my father: Know everything you can about what you're doing. Fred C. Trump _E_ Unbelievable evening. Just made a speech in front 17000 amazing New Yorkers in Bethpage Long Island great to be home! _E_ Obama asked a 7 yr old for his birth certificate. He's in your face because the Republicans dropped the ball. (cont) __HTTP__ _E_ WE ARE WITH YOU FLORIDA!Emergency Information 1 800 342 3557 __HTTP__ 1 800 FL HELP 1 __HTTP__ __HTTP__ _E_ What a 'nice guy' 97% of @BarackObama's campaign ads have been negative attacks on @MittRomney __HTTP__ Give it back Mitt! 
_E_ .@MissTeenUSA visited today __HTTP__ _E_ 'Good Chance' Trump Will Run for President __HTTP__ via @Newsmax_Media by @melaniebatley _E_ .@CNN is in a total meltdown with their FAKE NEWS because their ratings are tanking since election and their credibility will soon be gone! _E_ It is an exciting time for our country!#WeeklyAddress #ConfirmGorsuch __HTTP__ _E_ The threat from radical Islamic terrorism is very real just look at what is happening in Europe and the Middle East. Courts must act fast! _E_ Yes All Star @ApprenticeNBC contestant @THEGaryBusey is a little out there. But he uses his 'uniqueness' to his advantage. _E_ Via @STVAberdeen: Donald Trump reveals first image of his new Aberdeenshire hotel __HTTP__ _E_ New jobs report: 432000 left workforce manufacturing & durable goods go __HTTP__ We need leaders who understand business. _E_ Our big and very popular Tax Cut and Reform Bill has taken on an unexpected new source of "love" that is big companies and corporations showering their workers with bonuses. This is a phenomenon that nobody even thought of and now it is the rage. Merry Christmas! _E_ Obama just stated It's always good to ignore Donald Trump. I state he is right especially when the truth is against him. _E_ First responders have been doing heroic work. Their courage & devotion has saved countless lives – they represent the very best of America! __HTTP__ _E_ He says he will spend $1 B to get re elected: @BarackObama. I can match him preserving my options. _E_ Justice Kennedy should be proud of himself for sticking to his principles in light of Justice Roberts' bullshit! _E_ Going to The Citadel tonight getting The Nathan Hale Patriot Award. Very nice! _E_ Negotiation tip #1: The worst thing you can possibly do in a deal is seem desperate to make it. 
_E_ Winner of the 5 Star Diamond Award @TrumpGolfLA brings luxury & elite amenities to LA's top public golf course __HTTP__ _E_ John McCain called thousands of people crazies when they came to seek help on illegal immigration last week in Phoenix. He owes apology! _E_ Trump International Hotel & Tower New York has received great acclaim as has our signature restaurant Jean Georges __HTTP__ _E_ Sheena Monnin acted terribly...she got what she deserved! _E_ .@EliseChristine #asktrump __HTTP__ _E_ My @SquawkCNBC interview from this morning discussing the price of oil windfarmsDoral Hotel & Country Club and more... __HTTP__ _E_ Art Laffer just said that he doesn't know how a Democrat could vote against the big tax cut/reform bill and live with themselves! @FoxNews _E_ I'm not proud of my locker room talk. But this world has serious problems. We need serious leaders. #debate #BigLeagueTruth _E_ Via @nypost by @JonathonTrugman: Donald Trump's resume backs his run for president __HTTP__ _E_ Thank you Lake Worth Florida. @foxandfriends _E_ All Star @ApprenticeNBC has done the impossible. TV's greatest villain @OMAROSA & @THEGaryBusey are in competition. Fireworks! _E_ Hawaii: __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_ Please to inform that the Champion Pittsburgh Penguins of the NHL will be joining me at the White House for Ceremony. Great team! _E_ I hope the Fake News Media keeps talking about Wacky Congresswoman Wilson in that she as a representative is killing the Democrat Party! _E_ Stock Market at new all time high! Working on new trade deals that will be great for U.S. and its workers! _E_ Very excited to be addressing the @RepLeadConf next Friday in New Orleans. There is much to discuss! _E_ Via @BreitbartNews by @ASwoyer: Exclusive: Trump Slams Obamatrade Stands Up For American Jobs __HTTP__ _E_ America will never be destroyed from the outside.If we falter and lose our freedomsit will be because we destroyed ourselves. A. 
Lincoln _E_ RomneyCare/ObamaCare architect Gruber apologized for his comments. He should apologize for the $2T monstrosity & return all taxpayer money. _E_ It was great to appear on Piers Morgan Tonight last night as his first live guest. Piers won the Celebrity Apprentice and he's fantastic. _E_ Jay Leno and his people are constantly calling me to go on his show. My answer is always no because his show sucks. They love my ratings! _E_ Congrats to @Reince Priebus a really good and talented man. We're proud of you Reince! __HTTP__ _E_ Thousands of great people showed up from Liberty University yesterday. I love standing ovations! __HTTP__ _E_ I'm helping the Serta Counting Sheep get back to work. Enter the contest __HTTP__ and win a trip to Las Vegas.. _E_ Hurricane Irma is raging but we have great teams of talented and brave people already in place and ready to help. Be careful be safe! #FEMA _E_ ObamaCare will explode and we will all get together and piece together a great healthcare plan for THE PEOPLE. Do not worry! _E_ .@HillaryClinton Obama #ISIS Strategy Has Allowed It To Expand To Become A Global Threat #DebateNight __HTTP__ _E_ What is never said is that people take a big risk with their money and can lose it all. We should be given credit for taking this risk. _E_ "You had Hillary Clinton and the Democratic Party try to hide the fact that they gave money to GPS Fusion to create a Dossier which was used by their allies in the Obama Administration to convince a Court misleadingly by all accounts to spy on the Trump Team." Tom Fitton JW _E_ ....the wall is not built which it will be the drug situation will NEVER be fixed the way it should be!#BuildTheWall _E_ Does Bush's library have a wing featuring Supreme Court Justice Jon Robert's ObamaCare ruling? Roberts was his prize appointee! _E_ If taxes are raised to avoid the fiscal cliff then they must be accompanied by tangible hard cuts on spending everywhere. 
_E_ Rapper Mac Miller's song Donald Trump has reached close to 72 million hits. He owes me big! _E_ Why doesn't somebody study the horrible charges brought against @Macys for racial profiling? Terrible hypocrites! _E_ Chris Ruddy is always on point: Trump Opens 'Greatest Golf Course In the World' __HTTP__ via @Newsmax_Media _E_ The biggest story yesterday the one that has the Dems in a dither is Podesta running from his firm. What he know about Crooked Dems is.... _E_ SHOCK Hugo Chavez endorses @BarackObama __HTTP__ Will he be in Chicago on election night too? _E_ Republicans are always saying Obama is such a nice guy. When will they learn that he is not? _E_ All are very scripted and rehearsed two (at least) should not be on the stage. _E_ RT @FoxNews: TUNE IN: @EricTrump joins @seanhannity TONIGHT at 9p ET on @FoxNews Channel! #Hannityat9 __HTTP__ _E_ .@Joan_Rivers Get well soon Joan keep fighting! _E_ Via @bostonherald by @ ChrisCassidy_BH: Donald Trump says Jeb Bush is wrong about Iraq __HTTP__ _E_ Congratulations to @RealSheriffJoe on his successful Cold Case Posse investigation which claims @BarackObama's 'birth certificate' is fake _E_ I am happy that The Job on CBS the 16th. knockoff of the Apprentice was just cancelled. I love to see my opponents lose (not nice)! _E_ I always said that Debbie Wasserman Schultz was overrated. The Dems Convention is cracking up and Bernie is exhausted no energy left! _E_ Crooked Hillary colluded w/FBI and DOJ and media is covering up to protect her. It's a #RiggedSystem! Our country d... __HTTP__ _E_ Take a tour of this amazing residence at Trump World Tower..... __HTTP__ _E_ Lyin' Hillary Clinton told the FBI that she did not know the C markings on documents stood for CLASSIFIED. How can this be happening? _E_ 13 BILLION 4.5 BILLION these are the stupid settlements that J.P.Morgan just made. Why don't they FIGHT? No wonder they keep getting sued. 
_E_ Word is that little Morty Zuckerman's @NYDailyNews loses more than $50 million per year can that be possible? _E_ Join me in Wisconsin tomorrow or Colorado on Tuesday!Green Bay 6pm __HTTP__ Springs 1pm... __HTTP__ _E_ There is no way that Carly Fiorina can become the Republican Nominee or win against the Dems. Boxer killed her for Senate in California! _E_ RT @gatewaypundit: The Trump Hotel Waikiki looks like a lovely resort @realDonaldTrump #Hawaii _E_ RT @foxandfriends: Sen. Ted Cruz: Trump's air traffic control plan is a 'win win' for Democrats and Republicans __HTTP__ _E_ "Do your homework before you invest. A dumb investor is a dead investor." – Think Like a Billionaire _E_ Great new poll from NH. Thank you! We need to keep this country safe! #Trump2016 __HTTP__ __HTTP__ _E_ Obama friend got a no bid $635M contract to build website __HTTP__ And now she will get more to fix it. _E_ I had thousands join me in New Hampshire last night! @HillaryClinton had 68. The #SilentMajority is fed up with what is going on in America! _E_ Just out the POLAR ICE CAPS are at an all time high the POLAR BEAR population has never been stronger. Where the hell is global warming? _E_ RT @brunelldonald: I thought about jobs that went overseas failing schools open borders not my skin color when I voted @realDonaldTrump! I... _E_ I always believed @BretMichaels was making a mistake in coming back as a competitor. I disagree with him but... __HTTP__ _E_ Will be on @Morning_Joe live from New Hampshire 7:00 A.M. Talking about the debate and more! _E_ We have wasted an enormous amount of blood and treasure in Afghanistan. Their government has zero appreciation. Let's get out! _E_ Millions of $'s of false ads paid for by lobbyists special interests of cheater @SenTedCruz and sleepy @JebBush are now running in S.C. _E_ What is Mitch McConnell thinking?...make the big deal! _E_ Wow CNN had to retract big story on Russia with 3 employees forced to resign. 
What about all the other phony stories they do? FAKE NEWS! _E_ Another one of my predictions just came true Iraq is a total disaster with government losing all control—so sad. _E_ RT @GovChristie: .@POTUS has done more to combat the addiction crisis than any other President. __HTTP__ _E_ We should not cut any aid to Egypt. Their country is in chaos and now they must form a normal civil government. _E_ I have clearly stated that if the New York State Republican Party is able to unify I would run for Governor and win. They can't unify SAD! _E_ My appearance this morning on Good Morning America... __HTTP__ _E_ Refugees from Syria are now pouring into our great country. Who knows who they are some could be ISIS. Is our president insane? _E_ The women played great today at the @USGA #USWomensOpen I look forward to being there tomorrow for the final round! __HTTP__ _E_ Heading to New Hampshire will be talking about Hillary saying her brain SHORT CIRCUITED and other things! _E_ 26000 unreported sexual assults in the military only 238 convictions. What did these geniuses expect when they put men & women together? _E_ That Seth Meyers is hosting the Emmy Awards is a total joke. He is very awkward with almost no talent. Marbles in his mouth! _E_ The one positive from the plunge in household wealth is that we are in a buyer's market. This is the time to buy! _E_ Paul Begala the dopey @CNN flunky and head of the Pro Hillary Clinton Super PAC has knowingly committed fraud in his first ad against me. _E_ Dem Senator Schumer hated the Iran deal made by President Obama but now that I am involved he is OK with it. Tell that to Israel Chuck! _E_ I am so happy that I was able to do something really good for the Bronx and lots of jobs! _E_ Watch yesterday Obama continued to evade questions on his security failures in the Benghazi consulate attack. __HTTP__ _E_ Totally unauthorized do not pay. I am self funding my campaign! Notice has just been withdrawn. 
#Trump2016#MakeAmericaGreatAgain _E_ Just leaving for @LandExpo in Iowa standing room only. My great honor. @PeoplesCompany __HTTP__ _E_ Really enjoyed discussing @yankees yesterday with @RealMicihaelKay. I am a long time Yankee fan. _E_ Wow 25000 in San Diego California!Thank you!! #Trump2016 __HTTP__ _E_ The virtually incompetent Republican Strategist who has had a failed career Cheri Jacobus is incoherent with anger that her puppets died! _E_ RT @seanhannity: BOOM!! Tick Tock __HTTP__ _E_ Thank you to all of those who gave me such wonderful reviews for my performance on @nbcsnl Saturday Night Live. Best ratings in 4 years! _E_ As your President I have no higher duty than to protect the lives of the American people. __HTTP__ _E_ The Republicans must get Virgil Goode out of the race in Virginia. He will take votes away from @MittRomney. _E_ I am proud of the Rep. House & Senate for working so hard on cutting taxes {& reform.} We're getting close! Now how about ending the unfair & highly unpopular Indiv Mandate in OCare & reducing taxes even further? Cut top rate to 35% w/all of the rest going to middle income cuts? _E_ ObamaCare/RomneyCare architect Gruber was paid over $6M with our tax dollars yet Obama only claims he 'was some adviser.' _E_ My @gretawire interview discussing @IvankaTrump wanting me to run for POTUS @BarackObama's SOTU and his China policy __HTTP__ _E_ .@Omarosa has another meltdown ... while giving a check for $40000 to Michael's charity the Sue Duncan Center. #CelebApprentice _E_ According to @RasmussenPoll @MittRomney has a 12 point advantage over @BarackObama on the economy __HTTP__ Look for it to grow. _E_ Big day for HealthCare. After 7 years of talking we will soon see whether or not Republicans are willing to step up to the plate! _E_ .@GovernorPerry is a terrific guy and I wish him well I know he will have a great future! _E_ Secure your place at the National Achievers Congress in London. 
It will be an amazing event with a great surprise. __HTTP__ _E_ The United States condemns the terror attack in Barcelona Spain and will do whatever is necessary to help. Be tough & strong we love you! _E_ Gen. Petraeus has agreed to testify in the Senate on Benghazi. I will be watching. _E_ Australia New Zealand and more. I am always available to them. @nytimes is just upset that they looked like fools in their coverage of me. _E_ A smart negotiator would use the leverage of our dollars our laws and our armed forces to get a better deal (cont) __HTTP__ _E_ Looking forward to keynoting the South Carolina Tea Party Convention in Myrtle Beach on Monday at 3:20PM! __HTTP__ _E_ .@Omarosa on the cover of Soap Opera Digest? That's a credential... #CelebApprentice _E_ It was an honor to welcome President @MarianoraJoy of Spain. Thank you for standing w/ us in our efforts to isolate the brutal #NoKo regime. __HTTP__ _E_ "Most entrepreneurs do not realize that wealth does not come from work but from the assets they build." – Midas Touch _E_ RT @realDonaldTrump: DACA is probably dead because the Democrats don't really want it they just want to talk and take desperately needed m... _E_ .@KatyTurNBC & @DebSopan should be fired for dishonest reporting. Thank you @GatewayPundit for reporting the truth. #Trump2016 _E_ "Trump on Romney: 'You Just Can't Give Him Another Chance':Some golfers can't sink the 3 ft. putt." __HTTP__ via @PJMedia_com _E_ Hopefully the House of Representatives can hold our country together for four more years...stay strong and never give up! _E_ We are TRYING to fight ISIS and now our own people are killing our police. Our country is divided and out of control. The world is watching _E_ Lolo Jones our beautiful Olympic athlete wants to remain a virgin until she gets married she is great. @Followlolo _E_ Thanks. __HTTP__ __HTTP__ _E_ "Appreciate your property and your property will appreciate for you." 
– Think Like a Billionaire _E_ What people don't know about @BillMaher is that he was a terrible student and not considered smart in his early (cont) __HTTP__ _E_ If the people of Massachusetts found out what an ineffective Senator goofy Elizabeth Warren has been she would lose! _E_ Entrepreneurs: Identify your goals and see each day as an opportunity to show what you can do at the highest level. _E_ Eliot Spitzer was a horrible Governor and A.G. who ruined many good people and cost the Country billions of dollars in losses (and jobs). _E_ "The greatest discovery of all time is that a person can change his future by merely changing his attitude." @Oprah _E_ Check out my new book Time To Get Tough: Making America #1 Again __HTTP__ _E_ In @oreillyfactor's No Spin Zone re: ObamaCare causing unemployment negotiating with China & my $5M court win __HTTP__ _E_ Erin Burnett who has no ratings on CNN in prime time now wants more money to move to the morning slot. @CNN should say no way . _E_ General John Kelly is doing a great job as Chief of Staff. I could not be happier or more impressed and this Administration continues to.. _E_ Impossible is a word to be found only in the dictionary of fools. Napoleon Bonaparte _E_ People very unhappy with Crooked Hillary and Obama on JOBS and SAFETY! Biggest trade deficit in many years! More attacks will follow Orlando _E_ The Stock Market is setting record after record and unemployment is at a 17 year low. So many things accomplished by the Trump Administration perhaps more than any other President in first year. Sadly will never be reported correctly by the Fake News Media! _E_ WSJ/NBC Poll: Donald Trump Widens His Lead in Republican Presidential Race. #Trump2016 __HTTP__ _E_ We have spent over $1 Billion on the Libya operation. What are we getting back? _E_ Trump organisation backs community battle against substation __HTTP__ via @STVNews _E_ The more predictable the business the more valuable it is. 
Predictability also means consistency of brand experience. Midas Touch _E_ Thank you to everyone for the wonderful reviews of my speech on Thursday night. From the heart! _E_ .@antbaxter Your documentary died many deaths. You have in my opinion zero talent. _E_ Wow Senator Luther Strange picked up a lot of additional support since my endorsement. Now in September runoff. Strong on Wall & Crime! _E_ Heading to a packed house in Waterloo Iowa! Will celebrate today's great poll numbers together. See you soon! _E_ Shouldn't there have been increased security at our embassies on the anniversary of 9/11? _E_ "Study: Insurance costs to soar under Obamacare" __HTTP__ Men in NC get 305% hike. Women in NE suffer an average 237% hike. _E_ #BuyAmericanHireAmericanWatch __HTTP__ __HTTP__ _E_ .@antbaxter—Heard your documentary cost you less than $3000 to make—where did you get that kind of money? _E_ Give great credit to @GeorgeClooney for exposing the atrocities taking place in Sudan. _E_ My support of Anna Wintour for Ambassador got a lot of coverage. She is smart and will be a strong advocate for the US. _E_ President Obama Gruber and all of the other Obama cronies got ObamaCare passed by lies and fraudulent statements. Courts should overturn! _E_ Thank you @IvankaTrump for the kind words. I am very proud of the role model you are for so many. NH & IA radio ad: __HTTP__ _E_ A new radical Islamic terrorist has just attacked in Louvre Museum in Paris. Tourists were locked down. France on edge again. GET SMART U.S. _E_ Sad. Our food stamp rolls now surpass the entire population of Spain __HTTP__ We must do better or we will be Greece. _E_ Boardroom time which team do you think had the best presentation? #CelebApprentice _E_ .@BretBaier's newly released book 'Special Heart' brings a message of hope. All sales donated to heart charities __HTTP__ _E_ ... Icahn Kravis Apollo and most others but nobody says they went bankrupt! 
_E_ Dear @kimguilfoyle Thank you so much for your nice words today on @TheFive. Will not be forgotten! In Iowa now. Packed house! _E_ US Gov't is on the hook for more than a third of the world's entire debt & we wonder why China & OPEC are laughing all the way to the bank! _E_ ObamaCare must be fully repealed or it will destroy America's small businesses. _E_ In Las Vegas getting ready to speak! _E_ Country music star @TraceAdkins returns to All Star @CelebApprentice. Competing for @RedCross Trace is great! _E_ Congratulations are in order! @TrumpPanama ranks #5 Top Hotel in Panama by @TripAdvisor's #TravelersChoice Awards! __HTTP__ _E_ Via @bizjournals by @BrandonSawalich: 3 lessons about loyalty that I learned from Donald Trump __HTTP__ _E_ My thoughts on Anthony Weiner in today's #trumpvlog... __HTTP__ _E_ ISIS threatens us today because of the decisions Hillary Clinton has made along with President Obama. Donald J. Trump _E_ America has lost its AAA rating and gained over $6T in debt under @BarackObama and now he wants to raise the debt ceiling SCARY! _E_ Working hard on the biggest tax cut in U.S. history. Great support from so many sides. Big winners will be the middle class business & JOBS _E_ Last night's horrific execution style shootings of 12 Dallas law enforcement officers... __HTTP__ _E_ America's debt is greater than our GDP. Time for new thinking. _E_ #TBT @DonaldJTrumpJr @IvankaTrump @EricTrump and I 20 years ago __HTTP__ _E_ Thank you Florida. My Administration will follow two simple rules: BUY AMERICAN and HIRE AMERICAN! #ICYMI Watch:... __HTTP__ _E_ I believe the James Comey leaks will be far more prevalent than anyone ever thought possible. Totally illegal? Very 'cowardly!' _E_ Thank you Bridgeport Connecticut!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ The President must get Congressional approval before attacking Syria big mistake if he does not! _E_ No surprise serial sexter Anthony continues to be a sick pervert. 
He was sexting a 'young' girl last summer __HTTP__ _E_ Will be in Missouri today with Melania for the funeral of a wonderful and truly respected woman Phyllis S! _E_ Good investors are good students. It's as simple as that. Think Like a Billionaire _E_ Houston TX: __HTTP__ Vegas NV __HTTP__ AZ: __HTTP__ __HTTP__ _E_ Congress use the power of the purse. STOP AMNESTY! _E_ The failing @nytimes which has made every wrong prediction about me including my big election win (apologized) is totally inept! _E_ "Donald Trump on Fiscal Cliff and Obama" __HTTP__ via @Livetradingnews _E_ The @HuffingtonPost is a total joke & laughing stock of journalism as is gross Arianna Huffington. They don't report the facts! _E_ Tom Ridge is a failed 'Bushy' & PA Governor. Him & his friend @KarlRove shouldn't be allowed to do their bias commentary nobody listens! _E_ If @BarackObama really loved this country he wouldn't be destroying it. He has ruined our credit and killed jobs with ObamaCare. _E_ I took a failed club in Dutchess County & made it a great success plus many jobs. @KieranLalor should be thankful. _E_ Too bad I don't get this for political speeches they cost me a fortune! __HTTP__ _E_ Club for Growth is the group that came to my office seeking $1 million dollars. I told them no and now they are doing negative ads. _E_ No deal is better than a bad deal. America out negotiated again. #Iran _E_ Join me live from the @WhiteHouse via #Periscope __HTTP__ _E_ Just got back from Tampa. It was an amazing evening with an even more amazing crowd fantastic people! Will be in South Carolina tomorrow. _E_ US government's foreign indebtedness has grown over 72% under @BarackObama. He is bleeding us dry to China. _E_ DAMAC & #Trump Organization are developing a 2nd Trump #golf course Trump World Golf Club #Dubai at AKOYA Oxygen! __HTTP__ _E_ We must stop Common Core from controlling state & local curriculums. It is a federal grab of education. Keep education local! 
_E_ If @BarackObama's policies are so advantageous then why is he constantly invoking Ronald Reagan on the Stump? __HTTP__ _E_ During my recent trip to the Middle East I stated that there can no longer be funding of Radical Ideology. Leaders pointed to Qatar look! _E_ Our relationship with Russia is at an all time & very dangerous low. You can thank Congress the same people that can't even give us HCare! _E_ Trump Vineyard Estates is a breathtaking location to hold special events for all occasions. Watch the video for a look __HTTP__ _E_ My int. on @FoxNews' @oreillyfactor: "Donald Trump presidential politics and 'The Factor'" __HTTP__ _E_ Why do we keep broadcasting when we are going to attack Syria. Why can't we just be quiet and if we attack at all catch them by surprise? _E_ #TrumpVlog Trouble in paradise for Clintons __HTTP__ _E_ RT @greta: Thank you @realDonaldTrump this is important to so many of us __HTTP__ _E_ Does he look sharp smart and presidential his hands keep hitting the podium making a loud and distracting noise microphone too sensitive. _E_ Ugly industrial wind turbines are ruining the beauty of parts of the country and have inefficient unreliable energy to boot. _E_ Daily Caller: Trump Surpasses Field Flirts With 40 Percent in Alabama Poll __HTTP__ _E_ Just arrived for the #GOPdebate #MakeAmericaGreatAgain __HTTP__ _E_ Had a great time on @gretawire last night. Greta always does great interviews. _E_ RT @FLOTUS: Looking forward to hosting the annual Easter Egg Roll at the @WhiteHouse on Monday! __HTTP__ _E_ "Donald Trump: The View Will be Better without Joy Behar (Video)" __HTTP__ via @gatewaypundit _E_ Honored to be attending Rev. @BillyGraham's 95th birthday. His life & work has brought hope & faith to millions worldwide. _E_ Always good to have @ArsenioHall back as advisor as well as @DonaldJTrumpJr. They have their own fan clubs at this point. #CelebApprentice _E_ In '08 America voted for Hope & Change. Instead we got incompetency. 
Now it is time to put a real job creator in office. Vote 4 Mitt! _E_ RT @PressSec: .@POTUS historic tax cuts + doubling of the child tax credit will do infinitely more to empower working moms than liberals' p... _E_ Our hearts & prayers go out to the people of London who suffered a vicious terrorist attack.... __HTTP__ _E_ Thank you Diamond and Silk! __HTTP__ _E_ Wow @GeorgeWill said some very nice things about me today on @FoxNewsSunday with Chris Wallace. I am making progress thanks George! _E_ Delusional @BarackObama claims that his economic plan worked __HTTP__ Is the 16% real unemployment part of the plan? _E_ "If we get tough and make the hard choices we can make America a rich nation—and respected—once again." – Time to Get Tough _E_ A top firm like Cooley will only submit a case they believe in and can win. _E_ As one of Miamii's largest landowners I am pulling for the @MiamiHEAT in the @NBA finals. Lebron's time is now! @KingJames _E_ So excited to have @SantanaCarlos performing at the 2015 #CadillacChampionship at @TrumpDoral: __HTTP__ _E_ Getting ready to go to the great State of Michigan. Big crowd tonight. Make America Great Again! _E_ I still can't believe we left Iraq without the oil. _E_ "Money was never a big motivation for me except as a way to keep score.The excitement is playing the game."–The Art of The Deal _E_ Great optimism for future of U.S. business AND JOBS with the DOW having an 11th straight record close. Big tax & regulation cuts coming! _E_ "Get to know yourself.You can't improve upon something you don't understand.The more you ask the better you'll know." Vince Lombardi _E_ Via @paramuspost: "@TrumpSoHo New York Debuts Sizzling Summer Offerings" __HTTP__ _E_ Obama & his people did a brilliant job of delaying these scandals until after the election. Mitt must be going wild thinking about it! _E_ "Design your business from the start so that it is leverageable expandable predictable and financeable." 
– Midas Touch _E_ Obama weak on immigration. All words no action. He's been Prez 4 years. _E_ .@washtimes @BrettMDecker: Five Questions w/ @realDonaldTrump 'Lack of Leadership is the biggest threat to America' __HTTP__ _E_ Taking risks & making mistakes is the best way to learn something new. Most of the time you will surprise yourself Trump Never Give Up _E_ Who wants the endorsement of a guy (@EricCantor) who lost in perhaps the greatest upset in the history of Congress? _E_ .@MittRomney and I are working out a great dinner for someone I hope it's you! __HTTP__ _E_ CLINTON REFUGEE PLAN COULD BRING IN 620000 REFUGEES IN FIRST TERM AT LIFETIME COST OF OVER $400 BILLION. __HTTP__ _E_ Being nice to Rocket Man hasn't worked in 25 years why would it work now? Clinton failed Bush failed and Obama failed. I won't fail. _E_ The bend in the road is not the end of the road unless you refuse to take the turn. – Anonymous _E_ Zogby Poll: Trump Widens Lead After GOP Debate __HTTP__ _E_ Formerly of the New York Times @frankrichny was a poor theatre critic who was forced out. Sadly he is an even (cont) __HTTP__ _E_ We are delivering HISTORIC TAX RELIEF for the American people!#TaxCutsandJobsAct __HTTP__ _E_ I am convinced that if @AlexSalmond had not pushed ugly wind turbines all over Scotland the vote would have been much better for him! _E_ Thank you to Ford for scrapping a new plant in Mexico and creating 700 new jobs in the U.S. This is just the beginning much more to follow _E_ Deals are my art form. Other people paint beautifully or write poetry. I like making deals preferably big deals. That's how I get my kicks. _E_ If @DannyZuker competed against me and.won (which not too many people do) he could win millions of $'s for himself or his charity! _E_ It's really cold outside they are calling it a major freeze weeks ahead of normal. Man we could use a big fat dose of global warming! 
_E_ .@NRO Not much is as dead or irrelevant as National Review thanks to guidance of Goldberg a total loser! Get some real talent or fold! _E_ Thank you @GolfMagazine for putting my Scotland course on your cover and a Top 100 course in the world. __HTTP__ _E_ Join the MOVEMENT to #MAGA! __HTTP__ __HTTP__ _E_ The dopes at the @nytimes bought the Boston Globe for $1.3 billion and sold it for $1.00. Their great old headquarters gave it away! So dumb _E_ #VoteTrump at clerk's offices & 185 ballot drop boxes in #ORPrimary!Closes at 8pm! __HTTP__ _E_ Happy 4th of July! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_ Via @Newsmax_Media: 14 Reasons Donald Trump Is Really Running — and Doing Well __HTTP__ _E_ To put on your calendar for May: Miss USA 2010 live from Las Vegas on May 16th 7 p.m. ET on NBC. I'll be there tune in for a great show! _E_ Had a great meeting at CIA Headquarters yesterday packed house paid great respect to Wall long standing ovations amazing people. WIN! _E_ RT @DRUDGE_REPORT: DEAD HEAT: CLINTON VS TRUMP __HTTP__ _E_ China is now given preference to buy US debt by going directly to Treasury. I don't believe @BarackObama knows that he selling us out. _E_ Today's final round of the WGC Cadillac Championship will be amazing. A lot of pressure on leader who has played great. Big names hunting! _E_ Trump Tycoon App for iPhone & iPod Touch It's $2.99 but the advice is priceless! __HTTP__ _E_ Obama will grant amnesty to millions of illegals yet he has not lifted a finger for USMC Sgt. Tahmooressi! . #BringBackOurMarine _E_ Via @Zawya: "Trump home partners with lifestyle to launch an exclusive collection of home décor" __HTTP__ _E_ RT @foxnation: . @TuckerCarlson : #Dems Don't Really Believe #Trump Is a Pawn of #Russia That's Just Their Political Tool __HTTP__ _E_ Our FIFTH 1K milestone of 2017!#DOW24K #MAGA __HTTP__ _E_ .@stephenfhayes: I heard you were a joke on the media panel this weekend in New Hampshire. You just don't have what it takes! 
@JoeNBC _E_ My son @EricTrump will be interviewed by @SeanHannity tonight at 10pm on @FoxNews. Enjoy! _E_ Imagine how much stronger economic shape we would be in if we made the Iraqi government agree to a cost sharing (cont) __HTTP__ _E_ being a movie star and that was season 1 compared to season 14. Now compare him to my season 1. But who cares he supported Kasich & Hillary _E_ Via @UnionLeader by @tuohy: "Trump says he will decide on a presidential run by June" __HTTP__ _E_ I am getting bad marks from certain pundits because I have a small campaign staff. But small is good flexible save money and number one! _E_ Ted Cruz has been playing an ad about me that is so ridiculously false no basis in fact. Take ad down Ted. Biggest liar in politics! _E_ ObamaCare is one of the greatest threats our country faces. It is unsustainable and will lead America into complete insolvency. _E_ Ukrainian efforts to sabotage Trump campaign quietly working to boost Clinton. So where is the investigation A.G. @seanhannity _E_ .@Univision cares far more about Mexico than it does about the U.S. Are they controlled by the Mexican government? _E_ Clinton Aides: 'Definitely' Not Releasing Some HRC Emails: __HTTP__ _E_ As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_ CNN anchors are completely out of touch with everyday people worried about rising crime failing schools and vanishing jobs. _E_ .@morning_joe Wow Ticket sales go through the roof after Trump asked to speak at CPAC _E_ The New York Times/Bill Carter/Sept.26 2011: On MSNBC meanwhile Lawrence O'Donnell has lost 100000 viewers (cont) __HTTP__ _E_ Thank you New Hampshire!#Trump2016 __HTTP__ _E_ Republicans want to fix DACA far more than the Democrats do. The Dems had all three branches of government back in 2008 2011 and they decided not to do anything about DACA. They only want to use it as a campaign issue. Vote Republican! _E_ A house divided against itself cannot stand. 
Abraham Lincoln _E_ Chance favors the prepared mind. Louis Pasteur _E_ .@MeghanMcCain was terrible on @TheFive yesterday. Angry and obnoxious she will never make it on T.V. @FoxNews can do so much better! _E_ Thanks! __HTTP__ _E_ "The true competitors are the ones who always play to win." – Tom Brady @Patriots _E_ People believe CNN these days almost as little as they believe Hillary....that's really saying something! _E_ .@alexsalmond @pressjournal @BBCNews RT @DanScavino one would think the photo & caption says it all.... __HTTP__ _E_ Via @washingtonpost: Donald Trump will speak at CPAC by @rachelweinerwp __HTTP__ @CPACnews @AlCardenasACU @RGreggKeller _E_ Trump Organization's first project in India Trump Towers Pune will epitomize inspired living and timeless elegance __HTTP__ _E_ Order a signed copy of CRIPPLED AMERICA & submit a question for my live streaming book signing on 12/3 at 7:30 pm __HTTP__ _E_ We are not retreating we are advancing in another direction. Douglas MacArthur _E_ Missouri just confirmed #Trump2016 as the official winner with an additional 12 delegates. #MakeAmericaGreatAgain __HTTP__ _E_ Thank you New Hampshire! Great people see you next week! __HTTP__ _E_ Here's to a safe and happy Independence Day for one and all Enjoy it! Donald J. Trump _E_ If the wind will not serve take to the oars. Latin Proverb _E_ Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government! _E_ Great honor Rev. Jerry Falwell Jr. of Liberty University one of the most respected religious leaders in our nation has just endorsed me! _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ The Republicans must face reality & create a strong & positive immigration policy if not they will continue to lose elections. _E_ We are rebuilding other countries while our own country is going to HELL. Time to rebuild the U.S.A.! 
Tell our stupid politicians ENOUGH _E_ Must read @ConservReview article by @JeffJlpa1: "Jeb Bush and the Outsiders" __HTTP__ _E_ My transition team which is working long hours and doing a fantastic job will be seeing many great candidates today. #MAGA _E_ North Korea just stated that it is in the final stages of developing a nuclear weapon capable of reaching parts of the U.S. It won't happen! _E_ Thank you Travis County Texas!#MakeAmericaGreatAgain __HTTP__ _E_ China is advocating on behalf of Iran's nuclear program the Chinese oppose both sanctions and any militar... (cont) __HTTP__ _E_ The United States made some of the worst Trade Deals in world history.Why should we continue these deals with countries that do not help us? _E_ Hillary Clinton lied when she said that ISIS is using video of Donald Trump as a recruiting tool. This was fact checked by @FoxNews: FALSE _E_ People have been forced to resign positions for far less than @JonahNRO's "tweeting like a 14 year old girl" _E_ Via @BW: Donald Trump Vows to Fight Scottish Wind Farm Plan in Courts __HTTP__ _E_ Thank you South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Via @inventorspot by Myra Per Lee: "Got A Great Idea? Get Donald Trump To Fund It" __HTTP__ _E_ In the end you're measured not by how much you undertake but by what you finally accomplish! _E_ New Hampshire vote today MAKE AMERICA GREAT AGAIN! _E_ Just sat down for a great interview with @PHussionWYFF in Greenville today. Watch at 5pm. An amazing day in South Carolina! #VoteTrumpSC _E_ Make it special! No better place to celebrate St. Patrick's Day in the Windy City than @TrumpChicago __HTTP__ _E_ Glad to hear Bella Santorum is recovering. @RickSantorum has a beautiful family. _E_ I was invited to be with Mitt Romney tonight win lose or draw I'll be there! _E_ Our very weak and ineffective leader Paul Ryan had a bad conference call where his members went wild at his disloyalty. 
_E_ An: Media fell all over themselves criticizing what DonaldTrump may have insinuated about @POTUS. But he's right: __HTTP__ _E_ Celebrity Apprentice is nearing the end of a wonderful and very successful season. Watch tonight at 8:00. _E_ Just received a wonderful letter from a new father who bought his son his first book The Art of the Deal. Great parent! _E_ Thank you Vermont! #Trump2016#SuperTuesday _E_ An incredible honor to receive the endorsement of a person Ihave such tremendous respect for. Thank you Sheldon! __HTTP__ _E_ .@bobvanderplaats is a total phony and con man. When I wouldn't give him free hotel rooms and much more he endorsed Cruz. @foxandfriends _E_ It is hard to believe I am winning by so much when I am treated so badly by the media. New @CNN Poll amazing in ALL categories. 21 pt. Lead _E_ Great afternoon in Ohio & a great evening in Pennsylvania departing now. See you tomorrow Virginia! __HTTP__ _E_ Final #'s just announced in the GREAT State of MO. TRUMP WINS! New certified #'s show a 365 vote increase for me @ least 12 more delegates! _E_ Record setting cold and snow ice caps massive! The only global warming we should fear is that caused by nuclear weapons incompetent pols. _E_ Can you believe we still have not gotten our Marine out of Mexico. He sits in prison while our PRESIDENT plays golf and makes bad decisions! _E_ .@Morning_Joe just went off the rails. I will beat Hillary easily she does not want to run against me. I am tuning them out waste of time _E_ China controls North Korea. So now besides cyber hacking us all day they are using the Norks to taunt us. China is a major threat. _E_ Why would they announce a finding of the grand jury in Ferguson at 9:00 in the evening a prime time for riots! Not smart. _E_ Rexnord of Indiana made a deal during the Obama Administration to move to Mexico. Fired their employees. Tax product big that's sold in U.S. 
_E_ After all of these years of suffering thru ObamaCare Republican Senators must come through as they have promised! _E_ Pervert alert! Sexter Anthony Weiner will be running for Mayor of New York City. _E_ My @foxandfriends interview discussing my possible GOP endorsement @MittRomney's taxes and the Florida primary. __HTTP__ _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ The weak jokers who so badly hurt great Penn State University should have fought the NCAA instead of making a deal __HTTP__ _E_ ObamaCare is such a national treasure that @BarackObama has waived over 1200 companies from the law __HTTP__ _E_ See problems as a mind exercise. Enjoy the challenge and remember to keep focused on your goals. _E_ Newt attacks on @MittRomney record at Bain an attack on free enterprise and entrepreneurship. Mistake! _E_ Government waste fraud and abuse should be immediately addressed. This will help solve our deficit crisis both short and long term. _E_ China steals United States Navy research drone in international waters rips it out of water and takes it to China in unprecedented act. _E_ We are what we repeatedly do. Excellence therefore is not an act but a habit. Aristotle _E_ Thank you for your support of my candidacy! #MAGA #ImWithYou __HTTP__ _E_ Wow! I hear you Warren Michigan. Streaming live join us America. It is time to DRAIN THE SWAMP!Watch: __HTTP__ _E_ We are getting reports from many voters that the Cruz people are back to doing very sleazy and dishonest pushpolls on me. We are watching! _E_ In Texas now leaving soon for BIG rally in Florida! _E_ With @C_Soules from #TheBachelor in Iowa __HTTP__ _E_ The new unemployment numbers are terrible. 522000 more people are out of the labor force to 88419000. __HTTP__ _E_ Happy to announce we are awarding $1M to Las Vegas in order to help local law enforcement working OT to respond to last Sunday's tragedy. _E_ I will be making a major statement from the @WhiteHouse upon my return to D.C. Time and date to be set. 
_E_ When will Pakistan apologize to us for providing safe sanctuary to Osama Bin Laden for 6 years?! Some ally. _E_ Via @BW: Thomas Jefferson Donald Trump Share Love of Grapes in Virginia __HTTP__ @trumpwinery @EricTrump _E_ Editorial by @DonaldJTrumpJr in the DailyCaller: Defending Innovation in America __HTTP__ _E_ I promise that our administration will ALWAYS have your back. We will ALWAYS be with you! __HTTP__ _E_ Remember go vote we need real change this time. _E_ RT @FoxBusiness: .@JerryJrFalwell: I was so impressed by [@realDonaldTrump's] speech yesterday. He was the best I've ever seen him. __HTTP__ _E_ #CrookedHillary is nothing more than a Wall Street PUPPET! #BigLeagueTruth #Debate __HTTP__ _E_ Obama projected a 2012 budget deficit of $557B. It is actually double that at $1.1T __HTTP__ We can't afford four more years. _E_ All the guys that said @MittRomney would lose are rapidly coming on board. Mitt will remember the early helpers. _E_ Via @theblaze: Falwell on Trump: He 'was willing to say publicly' what conservatives said 'privately' __HTTP__ _E_ "Had the information (Crooked Hillary's emails) been released there would have been harm to National Security.... Charles McCulloughFmr Intel Comm Inspector General __HTTP__ _E_ Find out what Success smells like. I'll be @Macys Herald Square April 18 5:30pm to sign my new fragrance first (cont) __HTTP__ _E_ "Winners embrace hard work." @ESPNDrLou _E_ On my way to @TrumpSoHo to receive the AAA Five Diamond Award. _E_ It's very sad that Republicans even some that were carried over the line on my back do very little to protect their President. _E_ All 50 of the WORLD'S TOP 50 PLAYERS will be at TRUMP NATIONAL DORAL on Thursday Sunday for the Cadillac World Golf Championship. _E_ RT @NWSHouston: Historic flooding is still ongoing across the area. If evacuated please DO NOT return home until authorities indicate it i... 
_E_ My daughter Ivanka is being honored by the Wharton School of Finance with the 2012 Young Leadership Award. Also (cont) __HTTP__ _E_ RT @FoxNews: Geraldo Blasts 'Fake News' Reports About Trump's Visit to Puerto Rico __HTTP__ _E_ Do not underestimate the UNITY within the Republican Party! _E_ 'Hillary Clinton Deleted Emails With Her Email Server Technician' __HTTP__ _E_ China is buying so many of our companies it's really getting bad. _E_ Another historic first under Obama businesses are collapsing faster than they're being formed __HTTP__ New leadership now! _E_ The contract to build the ObamaCare website was given to a CANADIAN company for $55 744 081. It then bloated to $292 071067 INCOMPETENCE _E_ Via @Newsmax_Media by @OwenTew: "Trump on 2016 Run: I Would Self Fund Appoint Wall Street Experts" __HTTP__ _E_ One of the dumber and least respected of the political pundits is Chris Cillizza of the Washington Post @TheFix. Moron hates my poll numbers _E_ Baltimore just set a record for the coldest day in March in a long recorded history 4 degrees. Other places likewise. Global warming con! _E_ Live tweeting during tonight's VP debate...should be a great time _E_ Thank you Faith and Freedom Forum & @UrbandaleSchool. I had a great time in Iowa today! __HTTP__ _E_ We want our companies to hire & grow in AMERICA to raise wages for AMERICAN workers & to help rebuild our AMERICAN cities & towns! #USA __HTTP__ _E_ Arrived in Palm Beach drove by a gas staion $4.50 a gallon. Result of failed @BarackObama leadership. _E_ Small business owners are the DREAMERS & INNOVATORS who are powering us into the future!Read more and watch here: __HTTP__ __HTTP__ _E_ Thank you Colorado Springs. Get out & VOTE #TrumpPence16 in November! __HTTP__ _E_ To show you how shallow politicians can be many are jealous of my @CPACnews speaking slot & also their fellow Republicans! Not good! _E_ RT @DRUDGE_REPORT: Trump: 'Is the Boston Killer Eligible for Obama Care to Bring Him Back to Health?' 
__HTTP__ _E_ It is very sad to see what @BarackObama has done with NASA. He has gutted the program and made us dependent on the Russians. _E_ It was great having @ArsenioHall back on this week's @ApprenticeNBC! __HTTP__ _E_ Obama will be going on @theviewtv & fundraising while in NYC for the UN Assembly... _E_ Not good or smart for Obama to be calling Russia a regional power or to mention the concept of a nuclear weapon going off in NYC. _E_ RT @realDonaldTrump: ATTN: @HillaryClinton Why did five of your staffers need FBI IMMUNITY?! #BigLeagueTruth #Debates _E_ I'm leaving now for Ireland Spain Scotland and elsewhere crazy life! _E_ People are LOVING the Trump sign on the Chicago building. Big league tweets letters and calls... _E_ Just leaving Virginia really big crowd great enthusiasm! _E_ #2. Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_ "Destiny has a part to play in your life and in your business so give it a chance to work." – Think Like a Champion _E_ I beat Hillary in the new @FoxNews Poll head to head. SHE HAS NO STRENGTH OR STAMINA both of which are needed to MAKE AMERICA GREAT AGAIN! _E_ Thank you to the Robb Report The Best of the Best issue for just naming Trump International Golf Links the Best New Golf Course In World! _E_ Sad to see the history and culture of our great country being ripped apart with the removal of our beautiful statues and monuments. You..... _E_ Thank you for all of the really nice comments and reviews concerning my speech today at the National Press Club. It was my great honor! _E_ Chinese spies stole our F 35 Joint Strike Fighter design __HTTP__ We should offset the cost from our Chinese debt _E_ Never seen such Republican ANGER & UNITY as I have concerning the lack of investigation on Clinton made Fake Dossier (now $12000000?).... _E_ Crooked Hillary has zero imagination and even less stamina. ISIS China Russia and all would love for her to be president. 4 more years! 
_E_ Bernie should pull his endorsement of Crooked Hillary after she decieved him and then attacked him and his supporters. _E_ Just at a news conference from Trump Turnberry in Scotland. Everybody was there & will be all over television tonite. Back on trail Saturday _E_ .@StephenBaldwin7 You were fabulous on CNN last night I greatly appreciate your support. Best wishes. _E_ "Don't bunt. Aim out of the ball park. Aim for the company of immortals." David Ogilvy _E_ As usual Hillary & the Dems are trying to rig the debates so 2 are up against major NFL games. Same as last time w/ Bernie. Unacceptable! _E_ Politicians are all talk and no action. Bush and Rubio couldn't answer simple question on Iraq. They will NEVER make America great again! _E_ More and more reporters are using the word TRUMP when referring to winning just used on Bloomberg News. Gee I wonder why? _E_ #LawandOrder #ImWithYouVideo: __HTTP__ __HTTP__ _E_ Downtown Manhattan's trendiest hotel @TrumpSoHo 46 stories of luxurious rooms fine dining & The Spa __HTTP__ _E_ I don't know if Hillary will be able to run she is a walking time bomb! _E_ Our country and it's leadership has to be so careful and so smart these are treacherous times like no other. The world is a crazy place! _E_ It was an honor to be @GretchenCarlson's inaugural guest on her new show 'The Real Story.' Gretchen will be a big success! _E_ As Bernie Sanders said Hillary Clinton has bad judgement. Bill's meeting was probably initiated and demanded by Hillary! _E_ "I'm a great believer in asking everyone for an opinion before I make a decision. It's a natural reflex." – The Art of The Deal _E_ My @foxandfriends interview discussing #MissUSA Olivia Culpo the job numbers & the waste of the Obama stimulus __HTTP__ _E_ .@genesimmons really great job handling the wise guys so easy for you such talent! I won't forget. 
_E_ .@NJPGA Club of the Year Trump Nat'l Bedminster is NJ's top family country club with two award winning courses __HTTP__ _E_ Thank you for a great afternoon South Carolina! See you next Tuesday! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ You mean George Bush sends our soldiers into combat they are severely wounded and then he wants $120000 to make a boring speech to them? _E_ "Do not allow fear to settle into place in any part of your life. It is a defeating attitude & a negative emotion Think Like a Champion _E_ Will Plan B miss Trace? _E_ .... It is a very effective & commonly used business tool. _E_ Welcome to the new reality. @BarackObama is now letting China buy US banks __HTTP__ The US government is selling us out. _E_ Crooked Hillary should not be allowed to run for president. She deleted 33000 e mails AFTER getting a subpoena from U.S. Congress. RIGGED! _E_ I never did a day's work in my life. It was all fun. Thomas A. Edison _E_ The @nytimes purposely covers me so inaccurately. I want other nations to pay the U.S. for our defense of them. We are the suckers no more! _E_ George Steinbrenner would have done a major number on A Rod there is no way he would have gotten paid even with the help of the union! _E_ Labor Unions Giving Serious Thought to Endorsing Trump via Washington Examiner __HTTP__ _E_ I will be interviewed on @oreillyfactor tonight at 8:00 P.M. Enjoy! _E_ Now the Chinese are planning a war game w/ the IraniansSyrians & Russians along Syrian coast. __HTTP__ Laughing at @BarackObama _E_ Why doesn't OPEC lower the price of crude to help avert the European crisis? Crude keeps rising during the dow... (cont) __HTTP__ _E_ My @FoxNews @TeamCavuto interview discussing the @RNC Convention businesses making products in China and unemployment __HTTP__ _E_ Today in Bedminster I signed the Harry W. Colmery Veterans Educational Assistance Act of 2017 joined by @DeptVetAffairs @SecShulkin. 
__HTTP__ _E_ The TIME Magazine cover showing late age breast feeding is disgusting sad what TIME did to get noticed. @TIME _E_ Always make a total effort even when the odds are against you. Arnold Palmer @KingdomMag _E_ As President I will bring jobs back and get wages up for Americans who need it most. __HTTP__ _E_ Bernie Sanders says that Hillary Clinton is unqualified to be president. Based on her decision making ability I can go along with that! _E_ Wind energy is a complete economic disaster.... __HTTP__ @AlexSalmond @AberdeenCC @David_Cameron @Aberdeenshire @ScotParl _E_ Weakness cow towing and not standing firm is provocative. We are getting pushed around and robbed under this President. _E_ Just found out that @tedcruz is spending a fortune on Iowa push polls negative to me. Not nice but OK! New polls are great. _E_ Visiting LA? Be sure to make a reservation at Trump National Golf Club __HTTP__ The #1 public course in the country! _E_ Congratulations to @bostonpolice on yesterday's successful and safe @bostonmarathon. The entire country is proud. _E_ Thank you Pueblo Colorado! #TrumpRally #AmericaFirst __HTTP__ __HTTP__ _E_ Great seeing @TheLeeGreenwood and Kimberly at this evenings VP dinner! #GodBlessTheUSA __HTTP__ _E_ The @washingtonpost which loses a fortune is owned by @JeffBezos for purposes of keeping taxes down at his no profit company @amazon. _E_ Always leave your ego at the door during negotiations. Remember it's only business and there will always be another day! _E_ .@meetthepress and @chucktodd did a 1 hour hit job on me today – totally biased and mostly false. Dishonest media! _E_ Is it the same Kaine that took hundreds of thousands of dollars in gifts while Governor of Virginia and didn't get indicted while Bob M did? _E_ We are now leading in many polls and many of these were taken before the criminal investigation announcement on Friday great in states! _E_ .@megynkelly recently said that she can't be wooed by Trump. 
She is so average in every way who the hell wants to woo her! _E_ I told everybody the Oscars were no good—Nielsen ratings confirmed one of the lowest ratings in history. _E_ Entrepreneurs: A winning attitude will put things in perspective. Keep negative thoughts & people where they belong out of the big picture. _E_ Russian officials must be laughing at the U.S. & how a lame excuse for why the Dems lost the election has taken over the Fake News. _E_ Our GDP has been growing less than 2% for the last 5 years. ObamaCare will slow us down even more. Has to be repealed. _E_ My thoughts and prayers are with the two police officers their families and everybody at the @WestervillePD. __HTTP__ _E_ The Oscar Pistorius disaster is a really interesting story to me—a very sad situation for everyone! _E_ RT @IvankaTrump: The Administration is committed to supporting military spouses in the workforce. Thanks Kim for sharing your story! __HTTP__ _E_ America needs a tough negotiator not a community organizer. _E_ Wow new polls just out have Trump up and Cruz down he is a nervous wreck! _E_ I am seriously considering Dr. Ben Carson as the head of HUD. I've gotten to know him well he's a greatly talented person who loves people! _E_ Can you imagine a Canadian company developing our website? Terrible way to put Americans back to work. _E_ RT @MeetThePress: Watch our interview with @KellyannePolls: Russia did not succeed in attempts to sway election __HTTP__ #... _E_ If we did all the things we are capable of we would literally astound ourselves. Thomas Edison _E_ RT @TeamTrump: .@timkaine's Abortion Flip Flops: From Valuing The Sanctity of Life > Pro Abortion Demagogue #VPdebate __HTTP__ _E_ Why does Barack Obama's ring have an arabic inscription? __HTTP__ Who is this guy? _E_ Mitt Romney called to congratulate me on the win. Very nice! 
_E_ Make sure to follow me on @periscopeco #MakeAmericaGreatAgain _E_ Just arrived at #ASEAN50 in the Philippines for my final stop with World Leaders. Will lead to FAIR TRADE DEALS unlike the horror shows from past Administrations. Will then be leaving for D.C. Made many good friends! _E_ Boeing is building a brand new 747 Air Force One for future presidents but costs are out of control more than $4 billion. Cancel order! _E_ Crooked Hillary wants a radical 500% increase in Syrian refugees. We can't allow this. Time to get smart and protect America! _E_ What will we get for bombing Syria besides more debt and a possible long term conflict? Obama needs Congressional approval. _E_ Via @SunSentinel by @JoanieCox: "In Palm Beach nothing trumps the Trump Invitational" __HTTP__ _E_ I believe Lance Armstrong had death wish when he did interview w/Oprah—as I predicted everybody is suing him he'll have nothing left _E_ People should be proud of the fact that I got Obama to release his birth certificate which in a recent book he "miraculously" found. _E_ "I have a very strict gun control policy: if there's a gun around I want to be in control of it." Clint Eastwood _E_ .@pastormarkburns You were great last night and we all very much appreciate it! Thank you! _E_ .@foxandfriends in 5 minutes. _E_ After decades of our leaders allowing China to steal our jobs & R&D the Chinese will 'overtake America' in 2016 ... _E_ Sadly I will no longer be doing @foxandfriends at 7:00 A.M. on Mondays. This is because I am running for president and law prohibits. LOVE! _E_ Iran is threatening to shut the Strait of Hormuz and @BarackObama won't approve the Keystone pipeline. His energy policy makes America weak. _E_ Golf match? I've won 18 Club Championships including this weekend. @mcuban swings like a little girl with no power or talent. Mark's a loser _E_ When is South Korea going to start paying us for the massive amounts of money we are spending to protect them from the North? 
_E_ Bought @JohnDeere stock a year ago for old fashioned reason—I love their product and service. _E_ Will be on @foxandfriends at 7:00 15 minutes! Enjoy. _E_ "When you're at a meeting monitor your behavior and work at being an observer – of yourself and others." – Think Like a Billionaire _E_ Why are we giving China foreign aid? Couldn't the Super Committee have agreed to at least cut that outlay? #TimeToGetTough _E_ I'm going to the BORDER tomorrow. Will be seeing some really brave people. Look forward to a big day! _E_ The reason that Ted Cruz lost the Evangelicals in S.C. is because he is a world class LIAR and Evangelicals do not like liars! _E_ Taking a helicopter to New Hampshire boarding now. Amazing activity planned. New UMASS poll very nice! __HTTP__ _E_ Only 1 mill. dollars @mcuban? Offer me real money and I'd consider it. Your team and networks lose so much money I doubt you have much left! _E_ Just like @Yankee organization I can't wait for @MLB to suspend A Rod. Will be a great day for the sport. _E_ How come every time I show anger disgust or impatience enemies say I had a tantrum or meltdown—stupid or dishonest people? _E_ Iraq in political turmoil one day after we leave I told you so. _E_ My interview on @theviewtv discussing #TimeToGetTough the GOP primary and the Newsmax @iontv debate(starts at 23:00) __HTTP__ _E_ The U.S. manufacturing sector has suffered its greatest order losses under @BarackObama. He has stood idle while China steals our jobs. _E_ Be prepared for a sensational episode of The Apprentice tomorrow night 10 pm on NBC. _E_ The new e mail release is a disaster for Hillary Clinton. At a minimum how can someone with such bad judgement be our next president? _E_ What a great group! __HTTP__ With @Schwarzenegger @SammartinoBruno and @TripleH. 
#WWEHOF _E_ WikiLeaks: 'Clinton Kaine Even Lied About Timing of Veep Pick' __HTTP__ _E_ The rigged Dem Primary one of the biggest political stories in years got ZERO coverage on Fake News Network TV last night. Disgraceful! _E_ ...and says something is seriously wrong. He will never go down as great! _E_ Michele Bachmann just dropped out of prez race when she didn't do the Newsmax debate it showed great disloyalty and people rejected her. _E_ .@mcuban Baseball commissioner and owners were smart when they didn't want you to buy a team but I don't think you have the money anyway. _E_ He @BarackObama wants 23 years of @MittRomney's tax returns __HTTP__ Let's see BHO's school (cont) __HTTP__ _E_ ISIS exploded on Hillary Clinton's watch she's done nothing about it and never will. Not capable! _E_ I am encouraged by President Moon's assurances that he will work to level the playing field for American workers b... __HTTP__ _E_ Who knew this innocent kid would grow into a monster? #TBT #Trump __HTTP__ _E_ See what I have to say about the Occupy Wall Street protestors in today's #trumpvlog.... __HTTP__ _E_ I will be ON THE RECORD with Greta Van Susteren @gretawire tonight at 7 pm eastern/FOX News Channel _E_ Democracy cannot succeed unless those who express their choice are prepared to choose wisely... _E_ TAX CUTS will increase investment in the American economy and in U.S. workers leading to higher growth higher wages and more JOBS! __HTTP__ _E_ National Black Republican Association Endorses Donald J. Trump #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Obama Clinton inherited $10T in debt and turned it into nearly $20T. They have bankrupted... __HTTP__ _E_ Americans by & large hate ObamaCare. They see Obama lied to get it passed. They see big business & gov't got waivers. Defund! _E_ Thank you for the great rallies all across the country. Tremendous support. Make America Great Again! _E_ So many people who have children with autism have thanked me—amazing response. 
They know far better than fudged up reports! _E_ Thank you Sanford Florida. Get out & VOTE #TrumpPence16! #ICYMI watch this afternoons rally here:... __HTTP__ _E_ An investment in knowledge pays the best interest. Benjamin Franklin _E_ I am happy to donate $5 million to a charity Barack Obama chooses. All I am asking is that he is transparent with the American people _E_ It is not freedom of the press when newspapers and others are allowed to say and write whatever they want even if it is completely false! _E_ Thank you! #TrumpWon #MAGA __HTTP__ _E_ Another example of @BarackObama's diplomatic triumphs he gave the Queen of England an iIPod filled with his speeches. _E_ Wow the respected Monmouth University poll has me ahead of most Republican candidates nationwide and most people don't think I'm running! _E_ Congratulations to @TrumpSoHo for once again receiving the AAA Five Diamond Award for another year! _E_ After many years of LEAKS going on in Washington it is great to see the A.G. taking action! For National Security the tougher the better! _E_ Just watched @meetthepress and how totally biased against me Chuck Todd and the entire show is against me.The good news the people get it! _E_ Afghanistan's so called leader Karzai is toying with the U.S. _E_ The same people that built the ObamaCare website used as the face of the website someone who is not a US citizen. Incompetent. _E_ When will @CNN get some real political talent rather than political commentators like Errol Louis who doesn't have a clue! Others bad also. _E_ Very resource rich Canada our neighbor is looking to China for its growth. Just another sad commentary on the U.S. __HTTP__ _E_ Just got back from Colorado. The love and enthusiasm at two rallies was incredible. Big crowds! _E_ Let the Arab League take care of Syria. Why are these rich Arab countries not paying us for the tremendous cost of such an attack? _E_ Full transcript of economic plan delivered to the Economic Club of New York. 
#MAGA __HTTP__ __HTTP__ _E_ Orders for U.S. factory goods in March record biggest decline in 3 years __HTTP__ China is eroding the US manufacturing sector. _E_ Why haven't they released the final Missouri victory for us yet? Could it be because Cruz's guy runs Missouri? _E_ Watch – Obama will not fix the illegal immigrant loophole. Instead he will sign another executive action giving more amnesty. _E_ Thank you Iowa! #FITN #IACaucus#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ My wife Melania will be speaking in Pennsylvania this afternoon. So exciting big crowds! I will be watching from North Carolina. _E_ Some @OWS protesters are sincere people frustrated with the system others just in for the party. _E_ After seven horrible years of ObamaCare (skyrocketing premiums & deductibles bad healthcare) this is finally your chance for a great plan! _E_ Where is the President? It is time for him to come on TV and show strength against the repeated threats from North Korea and others. _E_ The F 35 program and cost is out of control. Billions of dollars can and will be saved on military (and other) purchases after January 20th. _E_ The United States has been reminded time and again in recent years that economic security is not merely RELATED to national security economic security IS national security. It is vital to our national strength. #APEC2017 __HTTP__ _E_ Honor Memorial Day by thinking of and respecting all of the great men and women that gave their lives for us and our country! We love them. _E_ Wow no longer Saturday delivery from U.S. Postal Service no money our poor poor Country! _E_ I can't believe that 60 Minutes is right now showing our nuclear facilities for the world to see (at request of U.S. leadership). STUPID! _E_ ..... He knows I don't respect him. _E_ I would like to offer Vice President Biden my warmest condolences on the loss of his wonderful son Beau. Met him once great guy! _E_ Join me on @greta from Indianapolis Indiana at 7pmE! Enjoy! 
#Trump2016 __HTTP__ _E_ Host of the 2022 @PGAChampionship & 2017 #USWomensOpen Trump Nat'l Bedminster offers 36 holes of world class golf __HTTP__ _E_ RT @IvankaTrump: Ivanka is joining @realDonaldTrump to outline an innovative new child care policy to support American families. Tune in to... _E_ Will @anthonyweiner be fully clothed in his mayoral ads? _E_ David Letterman's show has become so boring and mundane. Somehow every time I look I can't help thinking of (cont) __HTTP__ _E_ Flattering. Over 500 upset people called Mar a Lago disappointed I am not running for President but Mitt Romney will do a great job. _E_ I never equated wind farms to the Pan Am Lockerbie disaster only stated that @AlexSalmond should never have released the terrorist BAD! _E_ Man did JEB throw his brother under the bus last night on @colbertlateshow . Probably true but not nice! _E_ Enough is Enough no more Bushes! __HTTP__ _E_ .@DLoesch played great audio from my @CPACnews press conference on her radio show. Glad she made it! _E_ The Oscars are a sad joke very much like our President. So many things are wrong! _E_ Thank you! __HTTP__ _E_ The press is going out of their way to convince people that I do not like or respect women when they know that it is just the opposite! _E_ For the NY State Repubs to waste time energy and money on a primary—then go against 3 1 Dems—is insane. _E_ Well we all did it together! I hope the MOVEMENT fans will go to D.C. on Jan 20th for the swearing in. Let's set the all time record! _E_ How did Snowden with not even a high school education get access to top secret U.S. records. He then gave or sold those records traitor! _E_ MAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_ On my way to New Hampshire expecting a big and spirited crowd! #FITN #Trump2016 __HTTP__ __HTTP__ _E_ Crooked Hillary will NEVER be able to handle the complexities and danger of ISIS it will just go on forever. We need change! _E_ It is time to #DrainTheSwamp! 
__HTTP__ _E_ Is this the New York that Ted Cruz is talking about & demeaning? __HTTP__ _E_ Wow Now experts are calling #Harvey a once in 500 year flood! We have an all out effort going and going well! _E_ #TrumpTODAY Watch my appearance on the @TODAYshow from this morning __HTTP__ _E_ The first book signing at Trump Tower for #TimeToGetTough was so popular that I'm doing another one today from noon to 2pm/Trump Tower _E_ Personally I think Douglas Durst's brother got screwed by Douglas—no wonder he's angry! _E_ Far more killed than anticipated in radical Islamic terror attack yesterday. Get tough and smart U.S. or we won't have a country anymore! _E_ Please explain to the dummies at the @WSJ Editorial Board that I love to debate and have won according to Drudge etc. all 11 of them! _E_ Crazy @megynkelly is now complaining that @oreillyfactor did not defend her against me yet her bad show is a total hit piece on me.Tough! _E_ My @SteveDeaceShow interview discussing Ebola Obama's incompetence & my trip to Iowa for @SteveKingIA on Sat. __HTTP__ _E_ Jeb Bush and Ted Cruz are not electable presidential candidates Hillary would destroy them. Ted may not be eligible to run born in Canada _E_ Thank you Cleveland. We love you and will be back many times! _E_ My @foxandfriends re: the sequestration failure of leadership in DC China playing us & taking over in 2016 __HTTP__ _E_ Iowa was amazing last night. The event could not have worked out better. We raised $6000000 for our great vets. They were so happy & proud _E_ Obamacare is a disaster as I've been saying from the beginning. Time to repeal & replace! #ObamacareFail __HTTP__ _E_ I see my friend @FlaGovScott is speaking at CPAC. Solid guy wonderful job. #sayfie @marcaputo _E_ RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_ "Be ready for problems you'll have them every day. Keep your focus and be as big as your daily challenges." – Trump Never Give Up _E_ The U.S. 
recorded its slowest economic growth in five years (2016). GDP up only 1.6%. Trade deficits hurt the economy very badly. _E_ Are all the illegals pouring into our country vaccinated? I don't think so. Great danger to U.S. _E_ I was on CNN yesterday..... __HTTP__ _E_ I will be interviewed on @foxandfriends at 8:00 A.M. So much to talk about! _E_ Funny that Jeb(!) didn't want help from his family in his failed campaign and didn't even want to use his last name.Then mommy now brother! _E_ A very interesting piece by a very good writer @KirstenPowers of @USATODAY and @FoxNews. __HTTP__ _E_ Thanks. __HTTP__ _E_ Just won IOWA @CNN Poll BIG: Trump 33% Cruz 20% Rubio 11% but @WSJ reported Cruz momentum but nothing about the fact that I easily won! _E_ Rubio lied about my meeting w/ Hispanic activists. I didn't change my opinion but treated them w/ respect. Shame! __HTTP__ _E_ A simplified tax code would spur economic growth and help create jobs. Unfortunately Washington is incapable of simplifying anything. _E_ Via @Newsmax_Media by "Poll: Trump Surges Among GOP Hopefuls in NH" __HTTP__ _E_ Just got back from South Carolina. Going to Alabama tomorrow! _E_ Just did an interview with my friend @MarkSimoneNY. Congratulations to Mark on his new show on @WOR710. _E_ My sons Don and Eric are right now at Doonbeg in Ireland. There will be nothing like it! _E_ Entrepreneurs: Keep the big picture in mind. There are always opportunites and possibilities & thinking too small can negate a lot of them. _E_ .@scottienhughes you were fantastic on CNN. Thank you for the nice words. See you at the #GOPDebate. _E_ I want talented people to come into this country—to work hard and to become citizens. Silicon Valley needs engineers etc. _E_ THE SYSTEM IS RIGGED! _E_ Michigan Mississippi Idaho & Hawaii: Get out to VOTE and join the movement today! Video: __HTTP__ __HTTP__ _E_ I worked hard with Bill Ford to keep the Lincoln plant in Kentucky. 
I owed it to the great State of Kentucky for their confidence in me! _E_ A 34 story luxury highrise @TrumpParc offers elite amenities with residences that maximize every inch of space __HTTP__ _E_ Everyone is asking me to cover The Apprentice LIVE on twitter. I will do so. Tonight 9 to 11. IT WILL BE A GREAT EVENING OF TELEVISION! _E_ Fun to watch the Democrats working so hard to win the great State of South Carolina when I just won the Republican version amazing people! _E_ Fox & Friends going on now enjoy! _E_ Sleepy eyes @chucktodd whenever you mention me unfairly I will likewise mention you. _E_ Adopt the Arts campaign at @fundanything ensures that an underfunded public school has music and arts programs __HTTP__ _E_ RT @DanScavino: Join President elect Trump LIVE from Mobile Alabama via his #Facebook page! #ThankYouTour2016 Watch: __HTTP__ _E_ The outer boroughs of Manhattan are still devasted by Sandy. How would the press cover this if a Republican was President. _E_ "If you put the federal government in charge of the Sahara Desert in 5 years there'd be a shortage of sand." – Milton Friedman _E_ Via @thehill: Trump warns GOP moving too fast on immigration reform __HTTP__ by @JonEasley _E_ Between Libya the national security leaks and Fast & Furious Obama has had more national security scandals than any other President. _E_ The Miami Heat looked great tonight congratulations from all of your friends at your favorite place in Miami Trump National Doral. _E_ We should have a contest as to which of the Networks plus CNN and not including Fox is the most dishonest corrupt and/or distorted in its political coverage of your favorite President (me). They are all bad. Winner to receive the FAKE NEWS TROPHY! _E_ Thank you @DailyMail for setting the failing @NYTimes story straight. This is what the NYT's should have written! __HTTP__ _E_ As a show of support for our Armed Forces I will be going to The Army Navy Game today. Looking forward to it should be fun! 
_E_ Just got back from Iowa had a great time with amazing people. Will be back soon! _E_ Despite the upcoming election the cover of paper thin Time Magazine looks like an ad for the movie Lincoln sad! _E_ Thanks @MickyArison for your nice statement @BLTPrimeMiami @TrumpDoral. I just want to do as well as you have with @MiamiHEAT. See u soon _E_ Never make a concession during negotiations that could lead to more demands. Be prudent. It's best to have your concessions predetermined _E_ RT @GregAbbott_TX: Spoke with Pres. Trump & heads of Homeland Security & FEMA. They're helping Texas respond to #HurricaneHarvey. __HTTP__ _E_ Temperature at record lows in many parts of the country. 50 degrees below zero with wind chill in large area. Global warming folks iced in! _E_ I believe that Crooked Hillary sent Bill to have the meeting with the U.S.A.G. So Bill is not in trouble with H except that he got caught! _E_ Truly weird Senator Rand Paul of Kentucky reminds me of a spoiled brat without a properly functioning brain. He was terrible at DEBATE! _E_ We must change the laws of our land and seek fair but rapid trials for the perpetrators of terrorist acts (Boston) with harsh punishment! _E_ I've been warning about China since as early as the 80's. No one wanted to listen. Now our country is in real trouble. #TimetoGetTough _E_ My daughter Ivanka will be representing me today at the opening of our campaign office in Manchester NH #MakeAmericaGreatAgain! _E_ The public is about to learn a lot more information on Barack Obama and his true background in the coming weeks... _E_ RT @EricTrump: I will be always be incredibly proud of my work for @StJude raising $16.3+ million dollars over the last 10 years at a 9.2%... _E_ Corey Lewandowski Senior Political Adviser: Mr.Trump has the vision and leadership skills to bring our country back to greatness. 
_E_ Via @UrbanTurf_DC: Trump Releases Renderings For Old Post Office Building __HTTP__ _E_ Via @nydailynews: @IvankaTrump oversees new healthy room service menu at Trump Hotels __HTTP__ _E_ I will be interviewed on @60Minutes tonight after the NFL game 7:00 P.M. Enjoy! _E_ President Donald J. Trump and @FLOTUS Melania Participate in the Pardoning of the National Thanksgiving Turkey at the White House. __HTTP__ _E_ When do we sue the company for billions that robbed us in creating the hapless ObamaCare website? _E_ I never made the ridiculous comment about James G. and Obama Care somebody else put it out and attributed it to me. Not my style! _E_ If last night's election proved anything it proved that we need to put up GREAT Republican candidates to increase the razor thin margins in both the House and Senate. _E_ The polls show that I picked up many Jeb Bush supporters. That is how I got to 46%. When others drop out I will pick up more. Sad but true _E_ Snowden is showing how weak the U.S. has become. _E_ .@TraceAdkins says @Joan_Rivers is a gem. I agree. We all agree. #CelebApprentice _E_ Excited to host two great championships at two of our best properties @seniorpgachamp at Trump DC & @pgachampionship at Trump Bedminster _E_ Great optimism in America – and the results will be even better! __HTTP__ _E_ #CrookedHillary is unfit to serve. __HTTP__ _E_ Now Obama is keeping our soldiers in Afghanistan for at least another year. He is losing two wars simultaneously. _E_ .@mcuban Mark—nice picture thanks for the invite to the Mavs/Nets game. Next time I'll go and you'll win! _E_ .@TrumpSoho has just been awarded the AAA Five Diamond Award. Congratulations to the team for this great recognition of their amazing work. _E_ Thank you!! #Trump2016 __HTTP__ _E_ I'm at Trump Int'l Hotel in Las Vegas tallest/most beautiful building in town. 
Speaking to another great crowd at Treasure Island (12 noon) _E_ Idiot @billmaher always forgets to mention that I am suing him to collect the $5M for charity that he expressly offered. _E_ The Trump Spa @TrumpNewYork is a serene sanctuary featuring luxurious spa treatment rooms saunas and steam rooms __HTTP__ _E_ Thank you Ted. __HTTP__ _E_ Thank you Sean McGarvey & the entire Governing Board of Presidents for honoring me w/an invite to speak. #NABTU2017... __HTTP__ _E_ Dow Passes 23000 for the First Time Fueled by Strong Earnings #Dow23K📈 __HTTP__ __HTTP__ _E_ Via @starpulse: Donald Trump Calls Barack Obama 'Incompetent' __HTTP__ _E_ Meet the amazing mother whose letter I read during my speech. She lost her son to policies supported by Clinton. __HTTP__ _E_ Just saw Crooked Hillary and Tim Kaine together. ISIS and our other enemies are drooling. They don't look presidential to me! _E_ An honor to meet with the Polish American Congress in Chicago this morning! #ImWithYou Video:... __HTTP__ _E_ JOBS JOBS JOBS! __HTTP__ _E_ Via @IBTimes: Under Fire From Donald Trump Jeb Bush Focuses On 9/11 Even Though Hijackers Got Florida Licenses __HTTP__ _E_ I am thrilled to share that the Trump Home furniture collection by @doryainteriors just opened a new... __HTTP__ _E_ THANK YOU AMERICA!#MakeAmericaGreatAgain __HTTP__ _E_ I will be interviewed on @foxandfriends at 7:00 A.M. Enjoy! _E_ My @foxandfriends interview discussing ObamaCare the Romney Trump fundraiser & my plans for Jones Beach __HTTP__ _E_ My beautiful wife Melania will be appearing on QVC this evening from 8 to 9 pm. _E_ May God have mercy upon my enemies because I won't General George S. Patton _E_ My @TeamCavuto interview re: 2016 the need for leadership in our country Syria & China hacking our military __HTTP__ _E_ Presidency. Two of my children Don and Eric plus executives will manage them. No new deals will be done during my term(s) in office. 
_E_ Wow the ALIS just nominated my purchase of Doral in Miami as Transaction of the Year—thanks! _E_ Re Real Estate: You don't necessarily need the best location. What you need is the best deal... _E_ Join me in Westfield Indiana tomorrow night at 7:30pm! #Trump2016 Tickets: __HTTP__ __HTTP__ _E_ Forty seven million now on food stamps. When he came to office there were 32 million. He's added 15 million people. @MittRomney _E_ Brought to you by @HillaryClinton & her campaign in Chicago Illinois. #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_ The Tax Cut/Reform Bill including Massive Alaska Drilling and the Repeal of the highly unpopular Individual Mandate brought it all together as to what an incredible year we had. Don't let the Fake News convince you otherwise...and our insider Polls are strong! _E_ Saying goodbye to some of my great workers at @TrumpDoral in Miami. __HTTP__ _E_ .@PiersMorgan and @OMAROSA really hate each other. #CelebApprentice _E_ We are getting rid of all Glenfiddich garbage alcohol from Trump properties. _E_ Control your own destiny or someone else will. Jack Welch _E_ A great evening in Springfield Illinois. Thank you for all of the support! #Trump2016 __HTTP__ _E_ Our thoughts and prayers remain with Bret Michaels and his family and for his speedy recovery. _E_ Because of Rodolfo Rosas Moya who owes me lots of money Mexico will never again host the Miss Universe Pageant. _E_ China talks about the so called carbon footprint and then behind our leaders backs they laugh. They could (cont) __HTTP__ _E_ It has just been confirmed by the City of Mobile Alabama that there were 30000 people at last nights event making it #1for pol season. _E_ James Clapper and others stated that there is no evidence Potus colluded with Russia. This story is FAKE NEWS and everyone knows it! _E_ Throughout my travels I've had the pleasure of sharing the good news from America. 
I've had the honor of sharing our vision for a free & open Indo Pacific a place where sovereign & independent nations w/diverse cultures & many different dreams can all prosper side by side. __HTTP__ _E_ "As someone once put it 'Marriage is the greatest 'anti poverty' program God ever created.'" #TimeToGetTough _E_ The joint statement of former presidential candidates John McCain & Lindsey Graham is wrong they are sadly weak on immigration. The two... _E_ THANK YOU America! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ If you are planning to visit the world famous Trump Tower Atrium be sure to come early. During the holiday season it is packed by 10AM _E_ Senator Tom Cotton was great on Meet the Press yesterday. Despite a totally one sided interview by Chuck Todd the end result was solid! _E_ I have long given the order to help Argentina with the Search and Rescue mission of their missing submarine. 45 people aboard and not much time left. May God be with them and the people of Argentina! _E_ I am returning to the Pensacola Bay Center in Florida Friday 9/9/16 at 7pm. Join me! __HTTP__ __HTTP__ _E_ Via @WSJ by @SenAlexander: Wind Power Tax Credits Need to be Blown Away __HTTP__ @alexsalmond _E_ Caught RED HANDED very disappointed that China is allowing oil to go into North Korea. There will never be a friendly solution to the North Korea problem if this continues to happen! _E_ They had no definitive proof against Tom Brady or #patriots. If Hillary doesn't have to produce Emails why should Tom? Very unfair! _E_ What's incredible is that @Obamacare hasn't even kicked in yet and aleady it's doing tremendous damage. (cont) __HTTP__ _E_ Donald Trump's __HTTP__ Breaks $1M for Comedian Adam Carolla: New crowdfunding site sets record __HTTP__ _E_ RT @EricTrump: #JournalismIsDead __HTTP__ _E_ I love the White House one of the most beautiful buildings (homes) I have ever seen. But Fake News said I called it a dump TOTALLY UNTRUE _E_ The @nytimes is so dishonest. 
Their hit piece cover story on me yesterday was just blown up by Rowanne Brewer who said it was a lie! _E_ Isn't it ridiculous starting today new Ebola screenings go into effect for people coming from West Africa. Just stop the flights dummies! _E_ Watch #CelebApprentice this Sunday at 9PM ESTon @NBC it has received many 4 star reviews. _E_ More questions answered... __HTTP__ #trumpvlog _E_ .@GStephanopoulos stupidly believes that Hillary wants to run against me because she said so. She says that so people believe it opposite! _E_ Some of your most popular questions answered in today's video __HTTP__ _E_ Bob & Suzanne Wright co founders of @autismspeaks have done an absolutely fantastic job—two real winners. __HTTP__ _E_ After hearing the news that they would not be able to extort $1M from me they went hostile w/ a series of incorrect & ill informed ads. _E_ Autism WAY UP I believe in vaccinations but not massive all at once shots. Too much for small child to handle. Govt. should stop NOW! _E_ On International Women's Day join me in honoring the critical role of women here in America & around the world. _E_ Isn't it ironic that President Obama of all people is pushing for 'universal background checks?!' _E_ Just returned from Mississippi a great evening. _E_ Throwing out the first pitch a few years ago at Fenway in Boston Boston will be better than ever. __HTTP__ _E_ In the 1920's people were worried about global cooling it never happened. Now it's global warming. Give me a break! _E_ President Obama must remember that the worst thing you can do in a deal is seem desperate to make it. Be cool move slowly and think! IRAN _E_ .@GeorgeTakei is doing really well & soon coming to Broadway. _E_ I will be on @LateNightJimmy tonight. Always have a good time with @jimmyfallon. Now we know he will get high ratings tonight. _E_ Maybe the millions of people who voted to MAKE AMERICA GREAT AGAIN should have their own rally. It would be the biggest of them all! 
_E_ The Veterans Administration is in shambles and our veterans are suffering greatly. John McCain has done nothing to help them but talk. _E_ Football coaches are no longer allowed to scream and yell at their players because it is discriminatoryracist and can be viewed as bullying _E_ Join me tomorrow in Plymouth New Hampshire! #FITN #NHPrimary __HTTP__ _E_ Inner city crime is reaching record levels. African Americans will vote for Trump because they know I will stop the slaughter going on! _E_ In analyzing the Alabama Primary raceFAKE NEWS always fails to mention that the candidate I endorsed went up MANY points after endorsement! _E_ Great news Former Mayor of Dallas Tom Leppert has just endorsed me! Thank you! Tomorrow is a big day VOTE! #VoteTrump #SuperTuesday _E_ Almost universal support that Trump won the debate. Only @FoxNews is consistantly fighting the Trump win and I got them the ratings! _E_ McAllen Texas 8 miles from U.S. Mexico border. #Trump2016 Video: __HTTP__ __HTTP__ _E_ Experience is the teacher of all things. Julius Caesar _E_ .@deedeesorvino was GREAT today on @FoxNews She gets what is going on in politics and sees it very clearly. Have her on more! _E_ As I stated at the press conference on Friday regarding David Duke I disavow. __HTTP__ _E_ #LawandOrder #ImWithYouTranscript: __HTTP__ _E_ One of the hardest jobs in politics must be cleaning up after @JoeBiden gaffes. I feel sorry for his spokespeople. _E_ Had a great time going over renovations for Trump National Doral this past weekend. It is going to be amazing. __HTTP__ _E_ My interview from yesterday with @seanhannity __HTTP__ _E_ HAPPY BIRTHDAY to our @FLOTUS Melania! __HTTP__ __HTTP__ _E_ Very good news—the new Quinnipiac poll just came out—I am #1 in Iowa. 
_E_ By raiding the defense budget to pay for his failed social programs @BarackObama continues to weaken our (cont) __HTTP__ _E_ My @gretawire interview discussing the economy unemployment numbers China Charles BarkleyFrance and the election __HTTP__ _E_ As I predicted Obama already caught lying on Ocare enrollment # by CBO who's sticking w/ "6 million enrollments" __HTTP__ _E_ RT @billoreilly: A free press is vital to protecting all Americans. A corrupt press damages the Republic. _E_ RT @EricTrump: #Wisconsin: To find your voting location visit __HTTP__ #MakeAmericaGreatAgain #TrumpTrain __HTTP__ _E_ .@VinceMcMahon @MikeTyson @HomerJSimpson I think I'm going to accept the #IceBucketChallenge stay tuned to my Twitter tomorrow.... _E_ Invincibility lies in the defence the possibility of victory in the attack. Sun Tzu _E_ The State Of The Union speech was one of the most boring rambling and non substantive I have heard in a long time. New leadership fast! _E_ Priorities. While Obama wastes billions on a broken website he is going to cut military pay __HTTP__ No surprise. _E_ Will be heading over to the debate soon. Can you believe @CNN is milking it for almost 3 hours? Too long too many people on stage! _E_ I am speaking today at the National Press Club totally sold out and will then be inspecting The Old Post Office on Pennsylvania Avenue! _E_ My @SquawkCNBC interview discussing @MittRomney's pick of @PaulRyanVP how to frame Medicare debate & @RNC convention __HTTP__ _E_ I am very proud of Ivanka! _E_ Entrepreneurs: Business is a creative endeavor. Being innovative = being open to new ideas. Keep an open mind! _E_ We're all thinking of you @SteveScalise! #TeamScalise __HTTP__ _E_ Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in Davenport this past winter. #MAGA __HTTP__ _E_ I will be interviewed on @GMA Good Morning America at 7:00 A.M. @ABC will be announcing new poll numbers. MAKE AMERICA GREAT AGAIN! 
_E_ Today I introduced my Contract with the American Voter our economy will be STRONG & our people will be SAFE.... __HTTP__ _E_ "Success in golf depends less on strength of body than upon strength of mind and character." Arnold Palmer _E_ Check out Ivanka's new FaceBook page and keep up with what's happening from The Celebrity Apprentice to jewlery to free tickets and more.. _E_ RT @SarahPalinUSA: Trading in the beautiful snow of Iowa for the red dirt of Oklahoma as planned despite what the media is try's no... __HTTP__ _E_ ...healthcare plan is on its way. Will have much lower premiums & deductibles while at the same time taking care of pre existing conditions! _E_ The @BarackObama campaign took in $39M in May but spent $44.6M. Sound familiar! _E_ When will Obama next go on vacation if he wins the election? The day after. _E_ It's a shame to hear that the @dcexaminer is failing. No one wants the paper even if it is being handed out for free. _E_ President Obama we need to protect our closest ally Israel. The situation in the Middle East is at a tipping point. _E_ The owner of California Gold just made a jerk (fool) out of himself. Just smile and congrat the winner. His wife was visibly embarrassed! _E_ Via @SaintPetersblog by @MitchEPerry: "Shock poll: Donald Trump leads Jeb Bush 26 20% ... in Florida" __HTTP__ _E_ My acquisition of the Doral in Maimi will be a major success for the Trump Organization. The re building is on schedule. _E_ First Minister @AlexSalmond will be destroying the beauty of Scotland with his insane desire for bird killing wind turbines. _E_ Contractors can blame Obama admin all day for their $600M failure but both parties are at fault pay taxpayers back. _E_ 'Donald Trump leads Hillary Clinton by 19 points among military veteran voters: poll' #AmericaFirst #MAGA __HTTP__ _E_ The U.S. accidentally air dropped a large shipment of military weapons and supplies right into the middle of ISIS as enemy laughs! Very sad! 
_E_ We traveled the world to strengthen long standing alliances and to form new partnerships. See more at:... __HTTP__ _E_ Thank you Hershey Pennsylvania. Get out & VOTE on November 8th & we will #MAGA! #RallyForRiley #ICYMI watch here... __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Overlooking Central Park @TrumpNewYork brings both glamor and prestige to your Five Diamond hotel stay __HTTP__ _E_ I will be interviewed tonight on @seanhannity Enjoy! 10:00 P.M. _E_ When written in Chinese the word 'crisis' is composed of two characters. One represents danger and the other represents opportunity. JFK _E_ My interview with @ThisWeekABC w/@GStephanopoulos destroyed all Sunday competition w/ 2.52M total viewers...that's why they want me on! _E_ ... while Tom Brady is guilty because he REPLACED his LEGAL cellphone? _E_ RT @JoeNBC: Trump +15 on Cruz in 2 weeks. Cruz may look back and ask why he ever attacked Trump. DT has killed him ever since. __HTTP__ _E_ I think @TheRevAl should take this challenge. Axelrod was too scared. RT: @RonKaufmanIntrn: Kaufstache vs. Sharpstache. _E_ Waste @BarackObama's Dep. of Energy was warned in advance by Treasury that it wasn't loaning $ out in good deals __HTTP__ _E_ I hope Oprah gives Lance Armstrong 100 million dollars because that's what that ridiculous interview will cost him! _E_ Winners never quit and quitters never win. Vince Lombardi _E_ .@mcuban Shark Tank was shoved to Friday evening Friday is considered "dead television." Besides you are not the star (& never will be). _E_ Gold just set another record high on price with the largest physical gold sales on record __HTTP__ Inflation is coming... _E_ .@JustinRose99 Great playing we are proud of you! _E_ Obama keeps saying that he will do something but why hasn't he done it? It's all talk. _E_ Success breeds success. The best way to impress people is through results. Think Like a Billionaire _E_ Just watched Senator John Barrasso on @FoxNews He was great! 
Thank you John. _E_ Congratulations to @Boston_Police @FBIBoston & all emergency first responders & doctors for their excellent work under fire yesterday _E_ The Republicans can absolutely win if they stick together but they are NOT sticking together. Sen. McCain just said we can't win .Very bad! _E_ Goofy Elizabeth Warren Hillary Clinton's flunky has a career that is totally based on a lie. She is not Native American. _E_ __HTTP__ Countdown to @AmericaNowRadio as my former _E_ _E_ Senate concludes "Benghazi could have been prevented" __HTTP__ _E_ The Democrats in the Super Committee want to raise taxes first in deficit talks. Huge mistake. Cut wasteful spending first. _E_ I commend @DrZuhdiJasser for defending the NYPD and Commissioner Kelly. The NYPD has done outstanding work in defending NYC from attacks. _E_ Chicago murder rate is record setting 4331 shooting victims with 762 murders in 2016. If Mayor can't do it he must ask for Federal help! _E_ ... at Madison Square Garden followed by a ceremony with 80000 people at MetLife Stadium Wrestlemania. _E_ Watch Celebrity Apprentice on Sunday at 9 pm on NBC we're winding up for a terrific finale. What a season! __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ "Know from inside out that you have the power to succeed and you will. That's taking control." – Think Like a Champion _E_ #TrumpVine on ObamaCare website __HTTP__ _E_ My @foxandfriends int. on IRS targeting Tea Party the Benghazi death scandal & @TraceAdkins v. @pennjillette finale __HTTP__ _E_ The people that gave you global warming are the same people that gave you ObamaCare! _E_ Together we're going to restore safety to our streets and peace to our communities and we're going to destroy the vile criminal cartel #MS13 and many other gangs...'Hundreds arrested in MS 13 crackdown' __HTTP__ __HTTP__ _E_ Iran toys with U.S. days before we pay them ridiculously billions of dollars. Don't release money. We want our hostages back NOW! 
_E_ My @foxandfriends interview discussing the Make America Great Again Texas filing and the Iowa caucus __HTTP__ _E_ .@greta: Look forward to watching Greta's interview tonight at 7.00 p.m. with Marine Andrew Tahmooressi. #Marinefreed _E_ Thank you Greta. __HTTP__ _E_ Gain and use information to your advantage see every day as an opportunity to learn. _E_ URGENT: we've just announced a $2 million fundraising goal tonight. Please stand with us! __HTTP__ __HTTP__ _E_ Big defeat last night in Nevada for Ted Cruz and Marco Rubio. @KarlRove on @FoxNews is working hard to belittle my victory. Rove is sick! _E_ .@AnnCoulter has been amazing. We will win and establish strong borders we will build a WALL and Mexico will pay. We will be great again! _E_ The US Air Force won the war in Libya to clear the way for Islamic Extremist control of Libya. _E_ Needed: Leaders who negotiate smart trade deals.Only one knows The Art of The Deal. Let's Make America Great Again! __HTTP__ _E_ Hope and Change? Job numbers down. Time for @MittRomney _E_ I'll be doing Fox and Friends this morning at seven. _E_ "Sometimes you have to take a half step back to take two forward." @VinceMcMahon _E_ Many people talking with much agreement on my Iran speech today. Participants in the deal are making lots of money on trade with Iran! _E_ Via @CraveOnline: Donald Trump is NOT A Rod fan __HTTP__ _E_ Never bet against Bob Kraft Bill Belichick or Tom Brady! @Patriots _E_ RT @RSBNetwork: We are ALREADY LIVE in Everett WA for the Trump Rally. Come join us our cameras tonight! #TrumpinEverett __HTTP__ _E_ ALWAYS TRY TO LEARN FROM OTHER PEOPLES MISTAKES NOT YOUR OWN IT IS MUCH CHEAPER THAT WAY! _E_ If I'm the third most envied man in America the small group of haters and losers must be nauseas. _E_ Via the New York Times __HTTP__ _E_ On 9/11 we pray for the victims and their families of the attack and give thanks to all who have sacrificed for justice & our freedom. 
_E_ I will be on @foxandfriends at 7:00 A.M. ENJOY! _E_ Other networks are begging me to do a show I can't because I'm doing the Apprentice! _E_ The Chinese talk of climate change and carbon footprint but don't clean up their factories but they sell us the equipment to clean up ours! _E_ Rosie O'Donnell just said she felt shame at being fat not politically correct! She killed Star Jones for weight loss surgery just had it! _E_ The tournament at Trump National Doral was much more exciting than what is going on now! _E_ The polls have shown that DEAD PEOPLE voted for President Obama overwhelmingly and without hesitation he must be doing something right! _E_ Shocking two of @BarackObama's largest campaign bundlers are directly linked to Solyndra __HTTP__ What a coincidence! _E_ I will be on @seanhannity tonight at 10 PM @FoxNews. #Hannity _E_ Terrible CBO forecast for 2013 1.4% GDP growth and 7.5%+ unemployment (really 17%+) __HTTP__ You get what you vote for! _E_ Consumer spending is continuing to fall with weak June numbers. @BarackObama's policies have created a climate (cont) __HTTP__ _E_ After being forced to apologize for its bad and inaccurate coverage of me after winning the election the FAKE NEWS @nytimes is still lost! _E_ RT @RepKristiNoem: A lot of tough decisions got us to this point but we're closer than we've been in 30+ years to a fairer tax code that k... _E_ #TrumpVine A message for my hotel guest @MileyCyrus __HTTP__ _E_ I just realized that if you listen to Carly Fiorina for more than ten minutes straight you develop a massive headache. She has zero chance! _E_ .@FoxNews has been treating me very unfairly & I have therefore decided that I won't be doing any more Fox shows for the foreseeable future. _E_ Shock even more @BarackObama solar corruption. @VPBiden's chief of staff's firm got biggest DOE loan. __HTTP__ _E_ In November I think the people of Ohio will remember that the Republicans picked Cleveland instead of going to another state. Jobs! 
_E_ By the way where is @Oprah? Good question. 4 years ago she strongly supported Obama now she is silent. Anyway who cares I adore Oprah. _E_ Tom Brady played great today. He is a total champ and a really nice guy a rare combination! _E_ .@nbcnightlynews (Brian Williams anyone?) says women warriors are every bit as tough as the guys. Just think about that statement! _E_ I will be making my Supreme Court pick on Thursday of next week.Thank you! _E_ I have a lawsuit in Mexico's corrupt court system that I won but so far can't collect. Don't do business with Mexico! _E_ He @BarackObama said it would be 'unprecedented' if the USC rules that ObamaCare is unconstitutional. It was (cont) __HTTP__ _E_ My @NewsRadio610 int. w/@JackHeathRadio discussing Nickey S. Loeb 1st Amendment Awards Dinner & @SenScottBrown __HTTP__ _E_ RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_ What about all of the contact with the Clinton campaign and the Russians? Also is it true that the DNC would not let the FBI in to look? _E_ .@scottienhughes Keep up the great work Scottie. Polls are best ever! _E_ Congratulations to all the Trump 2012 #MissUniverse contestants who came from across the world. You did great and made us all proud! _E_ Stop the flights! __HTTP__ _E_ Thank you Green Bay Wisconsin! Governor @Mike_Pence and I will be back soon. #TrumpPence16 #MAGA __HTTP__ _E_ Congratulations to @AnnDRomney on delivering a knock out speech last night. America can't wait to call her our First Lady. _E_ Everyone should calm down. @BenAffleck is going to do a great job as Batman. _E_ Congratulations to @TrumpPanama for being named one of the "Best of +VIP Access" hotels for 2014 by @Expedia! __HTTP__ _E_ $25 Million+ raised online in just one week! RECORD WEEK. #DrainTheSwamp Today we set a bigger record. Contribute > __HTTP__ _E_ It is now clear that the Embassy attack in Libya was a coordinated Al Qaeda operation and not based on some video. 
_E_ So many people are agreeing with me on not creating a highway for Ebola to the U.S. Started in small area of Africa and now spreading fast _E_ "Our Constitution was made only for a moral & religious people. It is wholly inadequate to the government of any other." John Adams _E_ Be sure to check out the new projects @fundanything __HTTP__ Giving away money! _E_ After these spirited primaries are over @GOP must be fully united for November. If we take the Senate we stop Obama's agenda. _E_ .@WayneNewtonMrLV Wayne such a pleasant surprise so nice. Thank you very much. _E_ After raising w/ no obligation almost $6M for Vets I couldn't believe protesters formed @ Trump Tower. JUST OUT SENT BY CROOKED HILLARY! _E_ As gas prices keep rising @BarackObama won't approve Keystone. Instead he is pushing algae yes algae as an (cont) __HTTP__ _E_ It's Tuesday. @AGSchneiderman is wearing Revlon eyeliner today. Governor Cuomo alerted all to this. _E_ Excited to speak at tomorrow night's @ocrp Lincoln Day dinner in Michigan "All time sales record over 2000." __HTTP__ _E_ Great book just out A Place Called Heaven by Dr. Robert Jeffress A wonderful man! _E_ Glad to hear @EWErickson has moved over to @FoxNews. Erick is a sharp political analyst. _E_ I never asked Comey to stop investigating Flynn. Just more Fake News covering another Comey lie! _E_ Great Army Navy Game. Army wins 14 to 13 and brings home the COMMANDER IN CHIEF'S TROPHY (last time was 1996). Wow! Congratulations! _E_ Congratulations John! __HTTP__ _E_ The big golf course project on the water by the Whitestone Bridge in NYC that has been under construction for many years now complete GREAT! _E_ Obama wants taxes to go up so he can take credit for lowering them next year. _E_ RT @narendramodi: Had a wonderful meeting with @IvankaTrump advisor to @POTUS and leader of the US delegation at the @GES2017. __HTTP__ _E_ Getting ready for my big foreign trip. 
Will be strongly protecting American interests that's what I like to do! _E_ I will be interviewed on the @TODAYshow at 7:00 A.M. this morning. Enjoy! _E_ Spitzer failed as A.G. failed as Governor in disgrace and was fired on all T.V. shows (boring and zero ratings) and he's at it again! _E_ ...you can enhance location through promotion and work. _E_ The @ABC poll sample is heavy on Democrats. Very dishonest why would they do that? Other polls good! _E_ RT @DanScavino: Back to Cincinnati Ohio this Thursday (12/1/16) at 7pm for #PEOTUS @realDonaldTrump's #ThankYouTour2016! Join us! __HTTP__ _E_ TO ALL AMERICANS #HappyNewYear & many blessings to you all! Looking forward to a wonderful & prosperous 2017 as we... __HTTP__ _E_ Does anyone notice how the Montana Congressional race was such a big deal to Dems & Fake News until the Republican won? V was poorly covered _E_ Be careful – sexting pervert Anthony Weiner is upping his campaigning. When will new pictures be released? _E_ Why are people giving money to Karl Rove when he just wasted $400M without any victories? Use your head. _E_ Via The Hill No Tickets Left for Trump's Dallas Rally __HTTP__ _E_ Will be interviewed on @GMA this morning at 7:00. Thanks for the GREAT poll results! _E_ .@BarackObama's college application would be very very very very interesting! _E_ . @chrislhayes replaced @edshow on @msnbc to increase ratings. It's a shame Chris' are even worse. Sad to see. _E_ Champion @bretmichaels triumphantly returns to 13th season of All Star @CelebApprentice. Spoiler – Bret is back to his winning ways. _E_ #LasVegasStrong #USA __HTTP__ _E_ For the great people of Iowa find your #IACaucus location at __HTTP__ so important to vote! 
#MakeAmericaGreatAgain _E_ Looking forward to a press conference today about @adamcarolla on @fundanything movie project #roadhard __HTTP__ _E_ The @BarackObama administration is pressuring contractors to fix job loss estimates from environmental regulations __HTTP__ _E_ .@StephenBaldwin7 shines in the record 13th season of 'All Star' @CelebApprentice. The Baldwin clan will be proud of Stephen. _E_ Aberdeen tourism is booming because of my great Scottish golf club. _E_ Is PM Cameron a dummy? With monumental cuts in UK spending how come he continues to spend billions of pounds ... _E_ GET READY!! The #TrumpFerryPoint tee sheet opens TODAY @ 10am EST on our website for April 1st 30th! @TrumpFerryPoint _E_ Via @fitsnews: Donald Trump Knows How To (Tea) Party THE DONALD PLANS SPLASHY LANDING IN MYRTLE BEACH S.C. __HTTP__ _E_ As ISIS and Ebola spread like wildfire the Obama administration just submitted a paper on how to stop climate change (aka global warming). _E_ Just spoke w/ Governors Rick Scott of Florida Kenneth Mapp of the U.S. Virgin Islands & Ricardo Rosselló of Puerto Rico. WE ARE W/ YOU ALL! __HTTP__ _E_ George Steinbrenner was a great friend and a true legend. There will never be anyone like him in New York. We've lost a truly great man. _E_ Lightweight Senator Marco Rubio is polling very poorly in Florida. The people can't stand him for missing so many votes poor work ethic! _E_ Congratulations to @MikeTyson on the success of his new book Undisputed Truth & @HBO special and thanks for the nice words Mike. _E_ Winner of the 5 Star Diamond Award @TrumpGolfDC's two courses grace over 600 acres on the Potomac River __HTTP__ _E_ Should be interesting but too bad the three guys at <1% will be taking up so much time but who knows maybe a star will be born (unlikely) _E_ Follow @TrumpNH for all the updates on my New Hampshire political activities. Looking forward to returning to the Granite State on May 14!
_E_ .@SenScottBrown is the most competitive GOP option against Obama's amnesty loving @SenatorSheehan. He can win! _E_ .@TraceAdkins presents the NJ Coast Red Cross a $40000 check for Sandy Relief. You can tell he's very pleased about that & rightly so. _E_ We should be concerned about the American worker & invest here. Not grant amnesty to illegals or waste $7B in Africa. _E_ Via @BreitbartNews by Steve Bannon: "'TIME TO GET TOUGH': TRUMP'S BLOCKBUSTER POLICY MANIFESTO" __HTTP__ _E_ Entrepreneurs: Believe in yourself. If you don't no one else will either. _E_ Shouldn't George Will have to give a disclaimer every time he is on Fox that his wife works for Scott Walker? _E_ .@BillMaher needs to cut back on the pot and maybe he will stop making offers he can't afford. _E_ Thank you. __HTTP__ _E_ Club for Growth letter trying to extort $1000000.00 from me. Remember I said NO! __HTTP__ _E_ My list of potential U.S. Supreme Court Justices was very well recieved. During the next number of weeks I may be adding to the list! _E_ Interesting to watch Senator Richard Blumenthal of Connecticut talking about hoax Russian collusion when he was a phony Vietnam con artist! _E_ I believe in the America that never gives up never stops striving never ceases believing in itself. @MittRomney 11.2.12 _E_ Lightweight @JebBush said tonight he didn't know his family used private eminent domain in Texas Lie! #GOPDebate _E_ ... collusion which doesn't exist. The Dems are using this terrible (and bad for our country) Witch Hunt for evil politics but the R's... _E_ Today will be a Super Tuesday for @MittRomney he will win over 220 delegates from states across every region. He will be the nominee. _E_ Wow @CNN got caught fixing their focus group in order to make Crooked Hillary look better. Really pathetic and totally dishonest! _E_ Join me live in Springfield Ohio! __HTTP__ _E_ I hate what has happened to the once great @CNN. 
_E_ Nelson Mandela and myself had a wonderful relationship he was a special man and will be missed. __HTTP__ _E_ I'm urging my friends in Brooklyn to vote for Bob Turner tomorrow send @barackobama a message. _E_ Success seems to be connected w/ action. Successful people keep moving. They make mistakes but they don't quit. Conrad Hilton _E_ RT @EricTrump: What a scary statistic! Americans are working harder and making less! We need competent leadership! __HTTP__ _E_ Without passion you don't have energy without energy you have nothing. Nothing great in the world has been accomplished without passion! _E_ Everything comes to him who hustles while he waits. Thomas Edison _E_ #IACaucus 2/1/2016 6:30pm#MakeAmericaGreatAgain!Iowa caucus finder: __HTTP__ #GOPDebate __HTTP__ _E_ Great list of spring travel ideas from our @TrumpCollection properties: __HTTP__ _E_ Why doesn't President Obama just get the people from Google to fix the failed website. In fact why didn't he use them in the first place! _E_ A beautiful funeral today for a real NYC hero Detective Steven McDonald. Our law enforcement community has my complete and total support. _E_ Glad to see 9 more Iraq and Afghan war veterans joining the next Congress __HTTP__ They deserve to be there! _E_ Watch @IvankaTrump show you how easy it is to #CaucusForTrump in Iowa! #IACaucus Video: __HTTP__ __HTTP__ _E_ I hope @boyscouts of America handle their problems a lot better than the board at Penn State did. You can't do any worse! _E_ Join me in Houston Texas tomorrow night at 7pm! Tickets: __HTTP__ __HTTP__ _E_ Looking forward to being the guest of honor at @ralphreed's @FaithandFreedom Patriot's Award Gala Dinner in Washington DC _E_ We're going to cut taxes BIG LEAGUE for the middle class. She's raising your taxes and I'm lowering your taxes! __HTTP__ _E_ Why do the losers & haters always say I wear a "wig" when they know I don't. Like it or not it's all mine—just ask Barbara Walters. 
_E_ Mitt Romney didn't show his tax return until SEPTEMBER 21 2012 and then only after being humiliated by Harry R! A bad messenger for estab! _E_ Thank you Georgia!#SuperTuesday #Trump2016 _E_ Congratulations to our new Miss USA the beautiful Rima Fakih. Rima will represent us well at Miss Universe and be a wonderful Miss USA . _E_ Hillary when you complain about a penchant for sexism who are you referring to. I have great respect for women. BE CAREFUL! _E_ .@CNN poll just hit 49% for Trump. Interesting how my numbers have gone so far up since lightweight Marco Rubio has turned nasty. Love it! _E_ So with all of the Obama tough talk on Russia and the Ukraine they have already taken Crimea and continue to push. That's what I said! _E_ Thank you Arizona! #Trump2016 #WesternTuesday #TrumpTrain __HTTP__ _E_ "Donald Trump 2016: 7 Political Stances of GOP Presidential Hopeful" __HTTP__ via @Newsmax_Media _E_ 2nd segment of my @seanhannity @FoxNews interview discussing @billmaher's insult of parents and sending him $5M bill __HTTP__ _E_ Thank you for your strong testimony when welcoming me to Liberty University yesterday @JerryJrFalwell. __HTTP__ _E_ One year ago I started calling President Obama INCOMPETENT and people thought it was too tough. Tonight everyone is using that word! _E_ Wow did the @nytimes fall into the Bush trap where his people convinced them how happy he was that I was hurting other candidates & not him _E_ This is all about American weakness and an incompetent President. _E_ Just met with courageous family of Sarah Root in Nebraska. Sarah was horribly killed by illegal immigrant but leaves behind amazing legacy. _E_ A great event in Las Vegas Nevada! __HTTP__ _E_ Remember when failed candidate @JebBush said that illegals came across the border as AN ACT OF LOVE? He's spent $59 million and is at 3%. _E_ Goodnight everyone sleep tight! _E_ Entrepreneurs: Believe in yourself! If you don't no one else will either. _E_ Be passionate. 
If you love what you're doing success will follow. _E_ The @MittRomney fundraiser last night was a tremendous success. _E_ Via @BreitbartNews by @pamkeyNEN: "Trump: ObamaCare Not Working for Business Going to Collapse" __HTTP__ _E_ The reason Ed Schultz said nice things about me is that I'm the only Repub who won't cut Social Security etc. I'll make America rich again! _E_ I just won a big Court decision (N.Y. Post) against some character who claimed I owed him licensing fees on success of my shirts and ties. _E_ "Live Free or Die." – motto of New Hampshire _E_ Dems warn not to underestimate Trump's potential win __HTTP__ _E_ 30000 illegal immigrants with CRIMINAL RECORDS were released last year by our wonderful though highly incompetent government. So stupid! _E_ What would you choose Vampires or Cavemen? #CelebApprentice _E_ "Circumstances are beyond human control but our conduct is in our own power." Benjamin Disraeli _E_ Many on the team and staff of Bernie Sanders have been treated badly by the Hillary Clinton campaign and they like Trump on trade a lot! _E_ Dopey @Lord_Sugar People are calling in saying you are being beaten badly w/ the tweets... _E_ If you don't treat yourself like royalty no one else will. @TrumpWaikiki is Honolulu's most luxurious hotel __HTTP__ _E_ After seven years of talking Repeal & Replace the people of our great country are still being forced to live with imploding ObamaCare! _E_ Via @newsobserver by @RaleighReporter: In Raleigh Donald Trump all but announces presidential bid __HTTP__ _E_ #CelebApprentice With three wonderful but fired contestants __HTTP__ _E_ The Supreme Art of war is to subdue the enemy without fighting. Sun Tzu _E_ I don't believe the Democrats really want to see a deal on DACA. They are all talk and no action. This is the time but day by day they are blowing the one great opportunity they have. Too bad! _E_ Never confuse a single defeat with a final defeat. ― F. Scott Fitzgerald _E_ .@HollySandersGC. 
Remember it was Martin K who sank the big ten footer to win the Ryder Cup. He can handle the pressure! _E_ Strange but I see wacko Bernie Sanders allies coming over to me because I'm lowering taxes while he will double & triple them a disaster! _E_ Lance Armstrong's liability & lawsuits against him have just increased tenfold—his lawyers will be very happy—lots of fees! _E_ There is only one fix for ObamaCare REPEAL & REPLACE with a free market oriented alternative! _E_ I will be making a major announcement tomorrow (Thursday February 2) at 12:30 pm at Trump International Hotel & (cont) __HTTP__ _E_ Jerry Falwell of Liberty University was fantastic on @foxandfriends. The Fake News should listen to what he had to say. Thanks Jerry! _E_ Thinking small when you could think big limits you in all aspects of your life. _E_ In the end you're measured not by how much you undertake but by what you finally accomplish. _E_ A great guy (with great ratings)! __HTTP__ _E_ Today I filed my Statement of Candidacy with the FEC. Let's #MakeAmericaGreatAgain __HTTP__ _E_ The boardroom and @WrestleMania I'm watching great entertainment tonight! #CelebApprentice _E_ Roadway steel on beautiful Verrazano Narrows Bridge is rusting and rotting away. Scrape and paint before too late. _E_ Thank you John Nolte for wonderful analysis & reporting. _E_ Via @haaretzcom: "Donald Trump calls Obama Israel's greatest enemy" __HTTP__ _E_ I am so happy that people are boycotting Macy's __HTTP__ _E_ It's 4.35 a.m. and I am working on a very exciting (and hopefully very good) deal a major resort. THE HARDER I WORK THE LUCKIER I GET! _E_ Entrepreneurs: See each day as an opportunity to show what you can do at the highest level. Take responsibility for yourself! _E_ I am landing shortly. Can't wait to be with our GREAT MILITARY. See you soon! __HTTP__ _E_ Even though I am not mandated by law to do so I will be leaving my busineses before January 20th so that I can focus full time on the...... 
_E_ Iran is closing the Strait of Hormuz for a military exercise. Imagine what they will do with nukes?! _E_ From Donald Trump: Andrea Bocelli @ Mar a Lago Many say best night of entertainment in long history of Palm Beach __HTTP__ _E_ The immigration crisis is a horrible mess made worse by an incompetent president who doesn't have a clue. We need new leadership FAST! _E_ Because of the hurricane I am extending my 5 million dollar offer for President Obama's favorite charity until 12PM on Thursday. _E_ When times are difficult you must be even more focused. That's when you will find profitable opportunities. _E_ No surprise. @BarackObama is letting the Muslim Brotherhood in Egypt default on their US loans __HTTP__ Big mistake! _E_ Via @BreitbartNews @biggovt by @mboyle1: "EXCLUSIVE: NEVER AIRED 'APPRENTICE' PARODY OF TRUMP FIRING OBAMA" __HTTP__ _E_ A great night in Fayetteville North Carolina. Thank you! #ICYMI watch here: __HTTP__ __HTTP__ _E_ ...that it was hard not to end up rooting for Trump... _E_ I don't know why the @yankees keep paying A Rod—they have a perfect out. _E_ Additionally two executives @VattenfallGroup are under major investigation & they are unable to get the many permits necessary. _E_ Lightweight choker Marco Rubio looks like a little boy on stage. Not presidential material! _E_ Wow Putin is really taking advantage of President Obama. It is important that Obama responds with strength and determination be smart cool! _E_ Off shore windfarms being abandoned in droves throughout world—too expensive to build & operate—don't work. (cont) __HTTP__ _E_ Keep an eye on Anthony Weiner. Weasels are hard to get rid of. _E_ This is such a special time to be in New York City. No better city in the world to celebrate Christmas! _E_ Obamacare continues to fail. Humana to pull out in 2018. Will repeal replace & save healthcare for ALL Americans. __HTTP__ _E_ With the debt limit approaching @GOP has even more leverage. 
If they stay united and on message they can win. _E_ Thank you Greensboro North Carolina! Will be back soon! #AmericaFirst __HTTP__ _E_ The difference between a successful person and others is not a lack of strength not a lack of knowledge but (cont) __HTTP__ _E_ In interview I told @AP that my taxes are under routine audit and I would release my tax returns when audit is complete not after election! _E_ Pat Caddell on Neil Cavuto tonight: I've watched Donald Trump take on the issues of energy and he ties it to (cont) __HTTP__ _E_ Watch my appearance on @Letterman from last night __HTTP__ _E_ Via @DMRegister: "@ShawnJohnson returns to reality TV with Donald Trump" __HTTP__ _E_ John McCain had a really hard time with his town hall meeting on immigration. They really went after him! _E_ We must restore the entrepreneurial spirit of our country. A small business boom. Let's Make America Great Again! __HTTP__ _E_ So now tha Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the "unsolved mystery" that took place in Florida years ago? Investigate! _E_ Today we remember the courage and bravery of our troops that stormed the beaches of Normandy 73 years ago. #DDay... __HTTP__ _E_ RT @foxandfriends: Never give up....that's the worst thing you could do. There's always a chance. Kyle Coddington's message to those als... _E_ We have to combat the welfare mentality that says individuals are entitled to live off taxpayers. (cont) __HTTP__ _E_ #MakeAmericaGreatAgain#Trump2016  __HTTP__ _E_ Trump Int'l Hotel & Tower Chicago is one of very few hotels in No. America w/ a 5 Star 5 Diamond Hotel & a 5 Star 5 Diamond Restaurant... _E_ Addressing record crowd @ Madison County Iowa GOP Dinner. We can bring common sense to DC & Make America Great Again! __HTTP__ _E_ My job as President is to do everything within my power to give America a level playing field. 
#AmericaFirst... __HTTP__ _E_ Polls are starting to look really bad for Obama. Looks like he'll have to start a war or major conflict to win. Don't put it past him! _E_ Entrepreneurs: Believe in yourself. If you don't no one else will either. Realize that becoming an entrepreneur is not a group effort. _E_ Weekly jobless claims are now at an astronomical 365000. Manufacturing sector is suffering badly. We must do better. __HTTP__ _E_ Has the media picked up the new Zogby poll that was just put out? I doubt it! __HTTP__ _E_ Who's the flip flopper? @MittRomney has never flip flopped on gay marriage. _E_ Amazing five days developments in Aberdeen Turnberry (Scotland) and Ireland are fantastic the best anywhere in the WORLD. A lot of fun! _E_ .@TigerWoods has made a truly great comeback he is number one again! Give him credit comebacks are tough to do. Way to go Tiger. _E_ The Maryland Democrat Party attacked me with a racist flyer. @Hogan4Governor won 2nd GOP governor elected in 40 years. _E_ Big crowd expected today in Pensacola Florida for a Make America Great Again speech. We have done so much in so short a period of time...and yet are planning to do so much more! See you there! _E_ So lets get this right. Steve Jobs dies and leaves his wife everything billions of dollars. Now his wife has a boyfriend (lover). Oh Steve! _E_ GRETA IN A FEW MINUTES on Fox. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ The new @BarackObama motto You Own Nothing Not Even Your Own Success ... _E_ The great people of New Hampshire who I love are not properly served by the dying Union Leader newspaper. _E_ CRIPPLED AMERICA is perfect gift for friends & family. Order signed copy & join me tonite live streaming 7:30 __HTTP__ _E_ .@dallasmavs is 1 12 against the Western Conference's top four seeds after Sunday's loss & @okcthunder swept the season series. _E_ Worst graphics and stage backdrop ever at the Oscars. Show is terrible really BORING! 
_E_ My @WMUR9 Commitment 2016 Conversation with @JoshMcElveen discussing leadership China healthcare & veterans __HTTP__ _E_ A Rod is just not making it. We want to give him a chance but it was only drugs that made him great. _E_ ...addresses of any mentioned person who is still living. I am doing this for reasons of full disclosure transparency and... _E_ I am convinced that sleepy eyes Chuck Todd was only a placeholder for someone else at Meet the Press. He bombed franchise in ruins! @nbc _E_ Excited to announce that @GiulianaRancic & @BravoAndy will be hosting the 2012 Miss Universe Pageant. Great ratings for Miss Universe. _E_ A week after Biden says that the Taliban is not our enemy the Taliban demand that we pay Iraq for a 9 year occupation. __HTTP__ _E_ .@tracegallagher and @FredTecce discussing my case on @FoxNews __HTTP__ _E_ .@MittRomney Op Ed Culture Does Matter : __HTTP__ _E_ Dow hit a new intraday all time high! I wonder whether or not the Fake News Media will so report? _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ So far the hurricane is being handled very well in NY not nearly as bad as stated on news. Let's see what happens later. _E_ The Democrats made up and pushed the Russian story as an excuse for running a terrible campaign. Big advantage in Electoral College & lost! _E_ .@krauthammer pretends to be a smart guy but if you look at his record he isn't. A dummy who is on too many Fox shows. An overrated clown! _E_ ...to terrorism and airline flight safety. Humanitarian reasons plus I want Russia to greatly step up their fight against ISIS & terrorism. _E_ Networks are all wanting me to do shows—like it or not a "ratings machine"! –but time I run a really big company! _E_ Dopey @Rosie I never went bankrupt ABC already apologized to me for your stupid statement in the past they didn't want a lawsuit. _E_ Congratulations to @rushlimbaugh on his recent 26th year anniversary. Rush has revolutionized talk radio! Sorry haters and losers! 
_E_ Who would be stupid enough to invest in @VattenfallGroup's ill conceived windfarm when it will lose £25M yearly? _E_ Despite so many false statements and lies total and complete vindication...and WOW Comey is a leaker! _E_ .@alexsalmond RT @NOBLE74 I live in Aberdeenshire & I'm with you you have made a big difference to that bit (cont) __HTTP__ _E_ Thank you America! __HTTP__ _E_ What?! LaToya is saying Omarosa is one of the nicest people she's met? _E_ Entrepreneurs: Do your best to your utmost ability every day. Make that your standard. _E_ Glad to see that @PeteRose_14 has been hired by @FOXSports as an analyst. Pete should be around baseball and in the Hall of Fame! _E_ Excited to be keynoting @bobvanderplaats' @theFAMiLYLEADER Leadership Summit in Iowa this Saturday __HTTP__ _E_ Shocker: study reveals that @msnbc is completely biased while @FoxNews is factual __HTTP__ What a surprise! _E_ We need a strong leader and fast! __HTTP__ _E_ The House's failure to pass the Balance Budget Amendment was another unforced error by the GOP. Very disappointing. _E_ Great job tonight on @FoxNews Tony. I am with you all the way! Make America Great Again @tperkins _E_ ....earth shattering. He and his brother could Drain The Swamp which would be yet another campaign promise fulfilled. Fake News weak! _E_ The Benghazi terrorist is getting speedier care than our Vets at the VA. Obama has his priorities. _E_ To the African American community: The Democrats have failed you for fifty years high crime poor schools no jobs. I will fix it VOTE T _E_ Must read article via @fitsnews: DONALD TRUMP VERSUS MEXICO __HTTP__ _E_ Trump Int'l Golf Links & Hotel Ireland fronts the Atlantic Ocean & is host to the 2014 Great Irish Links Challenge __HTTP__ _E_ Another great shot from the beginning of construction at @DoralResort. __HTTP__ _E_ In moments like thiswe are all just Americans. I join with the President religious and civic leaders and encourage all to pray today. 
_E_ If victorious Republicans will be having a big press conference at the beautiful Rose Garden of the White House immediately after vote! _E_ Unbelievable evening in New Hampshire THANK YOU! Flying to Grand Rapids Michigan now. Watch NH rally here:... __HTTP__ _E_ ...in order to put any and all conspiracy theories to rest. _E_ Thank you Newt! __HTTP__ _E_ Is @billmaher the dumbest man on television?—I think so. _E_ I will be on the @todayshow tomorrow morning to make a major announcement about a television show. Stay tuned! _E_ My H 1B reform plan will transform program so it delivers for country not lobbyists & will have bipartisan support: __HTTP__ _E_ I am sure the Chinese are getting anxious. They watch the polls. @MittRomney won't let them cheat us anymore. _E_ I am very proud of @IvankaTrump for her work with @Cookies4kids. @Cookies4kids is a great cause helping children __HTTP__ _E_ RT @GeraldoRivera: #NewYork tromps #Jonas. Day after storm of the century the big city is up and running unlike others in the northeast. Mu... _E_ My interview with Greta last night on Fox News Nation Has Become All Talk No Action' __HTTP__ _E_ Now there is talk of A Rod being shipped to @Marlins. If A Rod is not a @yankee next year the fans will be happy. _E_ LETS MAKE AMERICA GREAT AGAIN!Schedule & tickets: __HTTP__ __HTTP__ _E_ I was never a fan of Bush in fact he was so bad he gave us Obama! But Obama is truly a pathetic excuse of a president can't get any worse _E_ Crowd was amazing tonight at Trump National Doral in Miami. Love and excitement in the ballroom. Tomorrow at noon in Jacksonville! _E_ Good luck @RoccoMediate and nice hat! __HTTP__ _E_ .@CharlesHurt You were great on @seanhannity last night. Thanks for the nice words. MAKE AMERICA GREAT AGAIN! _E_ Our politically correct country will read the ISIS terrorists who beheaded the reporter their Miranda Rights prior to good food & care! _E_ ....came to the campaign. 
Few people knew the young low level volunteer named George who has already proven to be a liar. Check the DEMS! _E_ .@CNN Kayleigh McEnany was great on you network today. You should have her on more often! Thank you Kayleigh for your nice words. _E_ Everyone is starting to feel the new tax hikes. You get what you vote for! _E_ President Obama is finally getting hammered even by his most loyal supporters and the press I guess they can only take so much! _E_ .@VanityFair looks like a dying magazine! Really really boring really really thin! _E_ The new President of OPEC is Mahmoud Ahmadinejad's confidant Rostam Ghasemi a commander of the Revolutionar... (cont) __HTTP__ _E_ "Courage is contagious. When a brave man takes a stand the spines of others are often stiffened." – Rev. @BillyGraham _E_ Republicans remember—debt ceiling debt ceiling debt ceiling—be smart and you will win! _E_ Just be tough be strong be willing to learn – and you will learn. Don't be afraid of mistakes or setbacks. Think Like a Champion _E_ .@BreitbartNews continues to do great work in exposing the left wing financing behind amnesty __HTTP__ _E_ The best deals are good for everyone which creates a win win situation. Negotiation is persuasion more than power. _E_ #VoteTrumpID! #Trump2016 __HTTP__ _E_ One of the dumbest political pundits on television is Chris Stirewalt of @FoxNews. Wrong facts check Fox debate rankings Trump #1. Dope! _E_ Thank you Michigan!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Unemployment has been over 8% for a record 40 straight months. @MittRomney's election will end the @BarackObama downturn. _E_ The party of the year in Palm Beach was the New Year's Eve celebration at the Mar a Lago Club it was amazing. 
__HTTP__ _E_ "@ApprenticeNBC: Donald Trump Talks Joan Rivers" __HTTP__ via @TVGrapevine by @TVG_Sammi _E_ Celebrate Thanksgiving @TrumpNewYork with exclusive viewing access to the 88th Annual @Macys Thanksgiving Parade® __HTTP__ _E_ South African Tourism North America will unveil its new ad campaign "What's Your BIG 5?" on All Star @ApprenticeNBC this Sunday. _E_ Mitt Romneywho totally blew an election that should have been won and whose tax returns made him look like a fool is now playing tough guy _E_ Yesterday I explained to @wolfblitzercnn on @CNNSitRoom why @BarackObama doesn't deserve credit for killing Bin Laden __HTTP__ _E_ Democrat Dianne Feinstein should never have released secret committee testimony to the public without authorization. Very disrespectful to committee members and possibly illegal. She blamed her poor decision on the fact she had a cold a first! _E_ Sanctions were not discussed at my meeting with President Putin. Nothing will be done until the Ukrainian & Syrian problems are solved! _E_ China has announced it is "fully prepared" for a currency war __HTTP__ Outrageous they have no fear of our leaders. _E_ Many people have asked recently when do you sleep? The answer is not much. _E_ Via @examinercom: The Miss Universe contestants glow with elegance during the Trump Holiday Party __HTTP__ _E_ Excellent preliminary meeting in Oval with @SenSchumer working on solutions for Security and our great Military together with @SenateMajLdr McConnell and @SpeakerRyan. Making progress four week extension would be best! _E_ ICYMI Via @nypost by Post Editorial Board: "@TrumpFerryPoint: New Gem in the Bronx" __HTTP__ _E_ The people of Tennessee yesterday were amazing. Thank you! _E_ The five fingers represent the five key factors every entrepreneur dreaming of success must master. (cont) __HTTP__ _E_ Great progress on healthcare. Improvements being made Republicans coming together! 
_E_ If you never want to be criticized for goodness' sake don't do anything new. Jeff Bezos _E_ Are people really afraid of @OMAROSA Would you be? #CelebApprentice _E_ We're stuck with the worst mayor in the United States. Too bad but New York City will survive! _E_ Join me in Denver Colorado tomorrow at 9:30pm!Tickets: __HTTP__ _E_ Via@politicalwire: Trump Offers to Fund White House Tours __HTTP__ _E_ Former winner @bretmichaels returns to All Star @ApprenticeNBC March 3rd on @NBC. Bret shows once again why he is a champion! _E_ I will be interviewed on This Week with George S this morning. Enjoy! _E_ It was an honor to stop by a #SchoolChoice event hosted by @VP Pence and @usedgov Secretary @BetsyDeVosED at the... __HTTP__ _E_ Gas prices are about to hit a record high during the Labor Day weekend. @BarackObama could have stopped this. _E_ Congratulations @TrumpSoHo for making @CNTraveler's #GoldList2015! __HTTP__ _E_ Your civil liberties mean nothing if you're dead. That's why the single most important function of the federal (cont) __HTTP__ _E_ .@ChuckTodd just informed us that my interview last week on @MeetthePress was their highest rated show in 4 years. Congrats! _E_ I'll be honored at the Family Business Dynasties Gala in NYC on December 5th. It will be a great event for a great cause. _E_ "I believe in spending what you have to. But I also believe in not spending more than you should." The Art of the Deal _E_ Thank you Illinois! #Trump2016 __HTTP__ _E_ "The idea flow from the human spirit is absolutely unlimited. All you have to do is tap into that well." @jack_welch _E_ The very dishonest @NBCNews refuses to accept the fact that I have forgiven my $50 million loan to my campaign. Done deal! _E_ Jeb's big ad buy against me paid for by lobbyists shows my face but doesn't have me answering Jeb's statements. He is really pathetic! _E_ Wind farms are now being paid to shut down __HTTP__ A complete waste. 
_E_ Mark your calenders for August 23rd: __HTTP__ _E_ THANK YOU ALABAMA! 32000 supporters tonight. Get out & VOTE on Tuesday! WE WILL MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Argyll grandmother takes UK and EU to the United Nations over plans to turn Scotland into windfarm 'hedgehog' __HTTP__ _E_ We just passed 1.9M followers & gained over 250000 followers in the last month.Thank you let's have fun and do business. _E_ RT @IvankaTrump: Ivanka penned an Op Ed that ran in the @WSJ this afternoon read it here. __HTTP__ @realDonaldTrump __HTTP__ _E_ V.P.....really! __HTTP__ _E_ Crooked Hillary said her husband is going to be in charge of the economy.If so he should runnot her.Will he bring the energizer to D.C.? _E_ According to @BarackObama the War on Terror is over __HTTP__ global warming is a national (cont) __HTTP__ _E_ It was announced this morning that unemployment rose this can't be good for Obama. _E_ Success is having to worry about every damn thing in the world except money. Johnny Cash _E_ In all my years in business and participating in politics I've never seen the country as divided as it is right (cont) __HTTP__ _E_ Happy belated birthday to Chris Wallace! Chris does a great job every week on @FoxNewsSunday. Like father like son. _E_ RT @foxandfriends: Anthem announces it will withdraw from ObamaCare Exchange in Nevada __HTTP__ _E_ "I also plan to keep making deals big deals and right around the clock." – The Art of The Deal _E_ Tickets for future debates should be put out to the general public instead of being given to the lobbyists & special interests the bosses! _E_ I would like to thank a great writer and person @JPappasPR. of REAL ESTATE WEEKLY for the wonderful story on me. Very much appreciated! _E_ Can't watch Crazy Megyn anymore. Talks about me at 43% but never mentions that there are four people in race. With two people big & over! _E_ Congratulations to my friend @TheSlyStallone on winning a #GoldenGlobe. 
A wonderful guy who has created something special well deserved! _E_ Keep hearing about tiny amount of money spent on Facebook ads. What about the billions of dollars of Fake News on CNN ABC NBC & CBS? _E_ Major story that the Dems are making up phony polls in order to suppress the the Trump . We are going to WIN! _E_ We cannot let the failing REPUBLICAN ESTABLISHMENT who could not stop Obama (twice) ruin the MOVEMENT with millions of $'s in false ads! _E_ Little @MacMiller I want the money not the plaque you gave me! _E_ 46 stories above downtown New York @TrumpSoHo features loft inspired interiors designed by Fendi Casa __HTTP__ _E_ General John Kelly is doing a fantastic job as Chief of Staff. There is tremendous spirit and talent in the W.H. Don't believe the Fake News _E_ Again for all of the haters and losers I have NOTHING to do with Atlantic City got out a long time ago! _E_ Obama once said he "would be ignoring the law" by granting amnesty through executive action. Now he's about to do it. What will Congress do? _E_ Amazing rally in Reno Nevada thank you. Make sure you get out on 11/8 & VOTE #TrumpPence16. Together we will put... __HTTP__ _E_ LA Times USC Dornsife Sunday Poll: Donald Trump Retains 2 Point Lead Over Hillary: __HTTP__ _E_ Funny that the Democrats would have their convention in Pennsylvania where her husband and her killed so many jobs. I will bring jobs back! _E_ Wrong Policy: @BarackObama wants to cut defense spending by $487B while China is building their navy in the Pacific. __HTTP__ _E_ Lightweight NYS Attorney General Eric Schneiderman is trying to extort me with a civil law suit. See website __HTTP__ _E_ "Donald Trump—The Disrupter" will air on @FoxNews Saturday night and Sunday night at 8 PM ET. Anchored by @BretBaier. @johnrobertsFox _E_ Via @CNNPolitics by @mj_lee: Father of murder victim to introduce Trump in Phoenix __HTTP__ _E_ I am thinking about changing the name #FakeNews CNN to #FraudNewsCNN! 
_E_ Just to show you how dishonest certain reporters are here is my @foxandfriends interview __HTTP__ _E_ .@Mitt Romney strongly stated in one of the debates with Pres. OBAMA that Russia is the big problem. Obama scoffed. Mitt was 100% correct! _E_ Wind turbines are totally destroying the areas in which they are located—all for unreliable bad & expensive energy! _E_ Via @Newsmax_Media: Trump Says He'll Foot Bill for White House Tours __HTTP__ _E_ I am very excited about hosting @MittRomney today for a fundraiser. Looking forward to seeing @newtgingrich and many other friends. _E_ For a country like China being able to steal our military designs represents hundreds of billions in savings (cont) __HTTP__ _E_ Being true to yourself equals being true to your brand. That's the solid foundation that stands the test of time. Midas Touch _E_ Obama told the UN that "the world is more stable than it was 5 years ago." Is he delusional? _E_ RT @atensnut: How many times must it be said? Actions speak louder than words. DT said bad things!HRC threatened me after BC raped me. _E_ .@BlairKamin Sorry sucker as usual you lose again. You couldn't work for me for 10 seconds. Bad critic great sign. __HTTP__ _E_ I am increasingly concerned with the UN's ploy against @Israel this coming week and will monitor all events closely from Australia. _E_ Criminal deportations in the U.S. are the lowest number in many years. We are letting criminals knowingly stay in our country. MUST CHANGE! _E_ See June 2007 speech is Obama a total racist? _E_ WEEKLY ADDRESS __HTTP__ _E_ All Star Celebrity @ApprenticeNBC is down to the five final contestants. Getting fired now is when it really hurts! _E_ I will be interviewed on @Morning_Joe at 6:15 A.M. Enjoy! _E_ Self determination is the sacred right of all free people's and the people of the UK have exercised that right for all the world to see. 
_E_ Please don't pay attention to all of those phony tweets that mention my twitter handle relative to "diet" it is a total scam. _E_ Why is this reporter touching me as I leave news conference? What is in her hand?? __HTTP__ _E_ As it has turned out James Comey lied and leaked and totally protected Hillary Clinton. He was the best thing that ever happened to her! _E_ Obama's coal regulations will destroy the coal industry put Americans out of work raise electricity prices & lead to blackouts. _E_ .@DannyZuker Danny You're a total loser! _E_ Getting ready to leave for Washington D.C. The journey begins and I will be working and fighting very hard to make it a great journey for.. _E_ .@mcuban Letterman @Late_Show had his best ratings with me and you bombed. People don't care about Mark Cuban. _E_ A great honor to host and welcome leaders from around America to the @WhiteHouse Infrastructure Summit.... __HTTP__ _E_ RT @TravelGov: Continue to notify us of US citizens overseas impacted by #HurricaneIrma & #HurricaneJose. __HTTP__ __HTTP__ _E_ Terrorists are engaged in a war against civilization it is up to all who value life to confront & defeat this evil __HTTP__ _E_ Having a truly great imagination is often far more important than having even massive knowledge but still never underestimate knowledge! _E_ "Our country is the greatest force for freedom the world has ever known. We have big hearts big brains and (cont) __HTTP__ _E_ Bottom line I don't think President changed people's minds must hope for a lifeline from Putin a very dangerous lifeline at that! _E_ History lesson: There's a big difference between Hillary Clinton and Abraham Lincoln. For one his nickname is Hone... __HTTP__ _E_ .@MittRomney did a great job last night. Watch the clip! __HTTP__ _E_ Congratulations to @DLoesch on the release of her great new book #HandsOffMyGun! Check out @TheBlazeBooks excerpt __HTTP__ _E_ Anthony Hopkins is a truly great actor I love everything he does! 
_E_ Watch my latest video blog.... __HTTP__ _E_ Why doesn't @FoxNews quote the new Iowa @CNN Poll where I have a 33% to 20% lead over Ted Cruz and all others. Think about it! _E_ LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_ The more you learn about the debt deal the worse it gets. _E_ Dummies @Deadspin had their big payday taken from them by others in the media. _E_ I told you in speeches months ago that Jeb and Marco do not like each other. Marco is too ambitious and very disloyal to Jeb as his mentor! _E_ Once the tragic mistake of going into Iraq was made we should have at least taken the oil (or at least some of it). Now Iran & China get it _E_ Remember what I said about @BarackObama attacking Iran before the election I hope the Iranians are not so (cont) __HTTP__ _E_ Nation's infrastructure is collapsing MAKE AMERICA GREAT AGAIN! _E_ RT @TeamTrump: #RattledHillary wants to talk about her 30 years in service. How about her 30 years of FLOPSFLOPS?! #BigLeagueTruth #Debat... _E_ Join me live in Reno Nevada! __HTTP__ __HTTP__ _E_ Thank you Warwick Rhode Island!#RIPrimary #VoteTrump __HTTP__ __HTTP__ _E_ Want to know why China is growing? They can build the world's tallest building in 90 days __HTTP__ Red tape would kill it here. _E_ Thank you Nevada! #AmericaFirst#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ If Cory Booker is the future of the Democratic Party they have no future! I know more about Cory than he knows about himself. _E_ of jobs and companies lost. If Mexico is unwilling to pay for the badly needed wall then it would be better to cancel the upcoming meeting. _E_ If Sheena Monnin apologized for her mistake as she should have I would have treated her very nicely. _E_ Egypt's Muslim Brotherhood President is visiting us next month. @BarackObama is so excited. 
_E_ RT @Scavino45: .@POTUS @realDonaldTrump @IvankaTrump Jared Kushner & Dina Powell in the Oval Office today w/ Aya & her brother Basel.#W... _E_ Politicians are ALL TALK NO ACTION! just look at our country. _E_ President Obama should bring Secretaty Sebelius into his office look right into her beautiful blue eyes and saywith emotion YOU'RE FIRED! _E_ As promised our campaign against the MS 13 gang continues. @ICEgov Busts 39 MS 13 Members in New York Operation __HTTP__ _E_ Thank you Georgia! I had a great afternoon with all of you! I will be back soon. #MakeAmerciaGreatAgain __HTTP__ _E_ RT @Heritage: We had a special visitor yesterday. @IvankaTrump thank you for meeting with @KayColesJames spending time with our team and... _E_ How can any Senator vote for Hagel as Sec. of Defense after that horrific hearing? He is not up for the job but will probably get it. _E_ .@TrumpGolfLA is proud to be hosting @PGAGrandSlam where all 4 Major Champions will square off. October 2015. __HTTP__ _E_ Had a great time yesterday on @theviewtv with @WhoopiGoldberg @JennyMcCarthy @SherriEShepherd & guest host @MrJerryOC! _E_ Iran is playing with fire they don't appreciate how kind President Obama was to them. Not me! _E_ A good question for would be entrepreneurs to ask themselves: What am I pretending not to see? There are a lot of opportunities out there. _E_ I hope all of the many thousands of people who are asking me to give up so much and RUN FOR PRESIDENT will fight hard for victory if I do! _E_ Republican Senate must get rid of 60 vote NOW! It is killing the R Party allows 8 Dems to control country. 200 Bills sit in Senate. A JOKE! _E_ Wow looks like James Comey exonerated Hillary Clinton long before the investigation was over...and so much more. A rigged system! _E_ The entrepreneur builds an enterprise the technician builds a job. Michael Gerber _E_ Many people still out of power in Staten Island. Absolutely ridiculous. Why can't they get service? 
_E_ I'll be on Greta Van Susteren's show tonight at 10 PM on FoxNews. Tune in. _E_ CORRUPTION CONFIRMED: FBI confirms State Dept. offered 'quid pro quo' to cover up classified emails __HTTP__ _E_ Thank you Tallahassee Florida! A beautiful evening with the MOVEMENT! Get out & VOTE!#ICYMI watch here: __HTTP__ _E_ When you're in a fight with a bully always throw the first punch—and don't telegraph it—hit hard & hit fast! _E_ ...subject to the fact that if we do not reach a fair deal for all we will then terminate NAFTA. Relationships are good deal very possible! _E_ The Islamists have won. Just as I predicted the Muslim Brotherhood has taken over Egypt. @BarackObama never should have abandoned Mubarek. _E_ If Syria was forced to use Obamacare they would self destruct without a shot being fired. Obama should sell them that idea! _E_ A segment from last night's @piersmorgan interview discussing @CoryBooker and fighting fire with fire in a campaign __HTTP__ _E_ #TrumpVlog Free our Marine! __HTTP__ _E_ Via Breitbart Riding High in Polls Donald Trump Storms the American South to Overflow Crowds in Georgia __HTTP__ _E_ The economy is expected to slow down once again at the end of the year __HTTP__ The price of gas has to be lowered. _E_ A state legislator w/ a true record of accomplishments military vet @joniernst will make a tremendous US Senator. Iowa send Joni to DC! _E_ Nobody but Donald Trump will save Israel. You are wasting your time with these politicians and political clowns. Best! #SheldonAdelson _E_ Wow—Family Feud said I am the third most envied man in America. I respectfully disagree—I am very modest. _E_ While millions are being spent against me in attack ads they are paid for by the "bosses" and "owners" of candidates. I am self funding. _E_ "Donald Trump was proven right on another one of his top issues Thursday: 'gun free zones' at military bases." 
__HTTP__ _E_ More than a century after conquering flight the #WrightBrothers continue to motivate & inspire Americans who never tire of exploration & innovation. This GREAT AMERICAN SPIRIT can be found in the design of every new supersonic jet and next generation: __HTTP__ __HTTP__ _E_ The reason you don't generally hit runways is that they are easy and inexpensive to quickly fix (fill in and top)! _E_ Here I am with @RodStewart at Mar a Lago. __HTTP__ _E_ I look forward to working w/ D's + R's in Congress to address immigration reform in a way that puts hardworking citizens of our country 1st. _E_ Ben Carson was speaking in general terms as to what he would do if confronted with a gunman and was not criticizing the victims. Not fair! _E_ China has a backdoor into the Trans Pacific Partnership. This deal does not address currency manipulation. China is laughing at us. _E_ We are one nation. When one hurts we all hurt. We must all work together to lift each other up.#StandWithLouisiana __HTTP__ _E_ Check out the #trumpvlog to see the answers to your questions... __HTTP__ _E_ I will be going to Asheville North Carolina tonight for the 95th birthday party of the GREAT Billy Graham such a wonderful man! _E_ .And to think that just last week he was lecturing anyone who would listen about sexual harassment and respect for women. Lesley Stahl tape? _E_ Looks like I was right about NATO. I had no doubt. __HTTP__ _E_ President Obama just fired the ObamaCare website builder. My question is why were they hired in the first place? Sue them for damages! _E_ .@somelikeitlar hope you enjoyed the premiere of All Star Celebrity @ApprenticeNBC. Make sure @marklevinshow watches! _E_ Just started building one of the great hotels of the World in Washington D.C. the site of the Old Post Office. Will be amazing JOBS! _E_ If Obama goes after Mitt's private sector experience in the next debate then Mitt should ask for Obama's college records all of them. 
_E_ Via @postandcourier by @skropf47: "Donald Trump: Don't politicize Walter Scott shooting" __HTTP__ _E_ .@KarlRove still thinks Romney won! He doesn't have a clue! @FoxNews _E_ What an evening in Las Vegas Nevada! THANK YOU for your continued support. #Trump2016 __HTTP__ __HTTP__ _E_ Big vote tomorrow in the House. Tax cuts are getting close! _E_ RT @MELANIATRUMP "@ApprenticeNBC: Her beauty lives 5000 miles past Heaven. __HTTP__ " Thank u @THEGaryBusey! _E_ Everybody wants me to talk about Robert Pattinson and not Brian Williams—I guess people just don't care about Brian! _E_ I will be doing @SquawkCNBC at 7:30. _E_ Work often becomes problem solving. Problems come with the territory and they should never surprise you. Think Like a Champion _E_ Crazy @megynkelly supposedly had lyin' Ted Cruz on her show last night. Ted is desperate and his lying is getting worse. Ted can't win! _E_ Good night everyone sleep well and tomorrow have many victories! _E_ That was some episode last week we've got a great cast! _E_ Via @LINKSMagazine: "Only The Donald __HTTP__ _E_ Standing strong for his people @GovWalker is ignoring the Feds and keeping all Wisconsin parks open. Great! _E_ Amazing Race winning an Emmy again is a total joke. The Emmys have no credibility no wonder the ratings are at record lows. _E_ #AmericasMerkel __HTTP__ _E_ Dangerous. While Obama is cutting down our military China has announced plans to build more aircraft carriers __HTTP__ _E_ Congratulations to @MittRomney for an impressive win in Florida. He performed well under pressure. _E_ My supporters are the best! $18 million from hard working people who KNOW what we can be again! Shatter the record: __HTTP__ _E_ I have been very consistent and always said that Iraq would fall as soon as the U.S. left. What a terrible waste of lives and money! _E_ Make no mistake Fast and Furious goes ALL the way to the White House. 
_E_ The main stream media wants to surrender constitutional rights I believe #ISIS needs to surrender! _E_ The Senate Democrats have only confirmed 48 of 197 Presidential Nominees. They can't win so all they do is slow things down & obstruct! _E_ If Obama worked as hard on straightening out our country as he has trying to protect and elect Hillary we would all be much better off! _E_ Direct foreign investments continue to flow into China at over $100B a year __HTTP__ That's money that could be spent here. _E_ With Irma and Harvey devastation Tax Cuts and Tax Reform is needed more than ever before. Go Congress go! _E_ Goofy Elizabeth Warren didn't have the guts to run for POTUS. Her phony Native American heritage stops that and VP cold. _E_ It just shows everyone how broken and unfair our Court System is when the opposing side in a case (such as DACA) always runs to the 9th Circuit and almost always wins before being reversed by higher courts. _E_ Looks like the line has started be sure to join me for book signing @TImeToGetTough starting at 11am to 2 pm here in Trump Tower. _E_ LYIN' TED __HTTP__ _E_ I rarely agree with President Obama however he is 100% correct about Crooked Hillary Clinton. Great ad! __HTTP__ _E_ 'Donald Trump: A President for All Americans' __HTTP__ _E_ My friend @eminofficial was fantastic on the @TODAYshow this morning—a star! _E_ .@LilJon once again made it to the Final Four. A true talent and great friend to #CelebApprentice @ApprenticeNBC. Great job! _E_ If Obama was willing to lie about ObamaCare then what else has he lied to us about... _E_ .@CNN is so negative it is impossible to watch. Terrible panel angry haters. Bill O @oreillyfactor said such an amazing thing about me! _E_ Hillary could lose to Trump in Democratic New York #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Re: CPAC "The crowd in the main room filled to capacity by the end of Trump's address something his operative said he planned to do... 
_E_ Join me live in Cincinnati Ohio!#TrumpRally #MAGA __HTTP__ _E_ Republicans Senators are working hard to pass the biggest Tax Cuts in the history of our Country. The Bill is getting better and better. This is a once in a generation chance. Obstructionist Dems trying to block because they think it is too good and will not be given the credit! _E_ Have your own vision & stick with it. Don't be afraid to be unique.Every day is an opportunity to show what you can do at the highest level. _E_ #TBT With Steven Spielberg in the old days a great guy! __HTTP__ _E_ It's Thursday. How much did OPEC steal from all of us today? _E_ The Fake News refuses to talk about how Big and how Strong our BASE is. They show Fake Polls just like they report Fake News. Despite only negative reporting we are doing well nobody is going to beat us. MAKE AMERICA GREAT AGAIN! _E_ To the @BarackObama administration saving money isn't the point expanding government and spending more (cont) __HTTP__ _E_ Via @ArabianBusiness: "Trump eyes PGA tour for Dubai golf course" __HTTP__ _E_ A great book for your reading enjoyment: REASONS TO VOTE FOR DEMOCRATS by Michael J. Knowles. _E_ Fake @NBCNews made up a story that I wanted a tenfold increase in our U.S. nuclear arsenal. Pure fiction made up to demean. NBC = CNN! _E_ Someone should ask @BarackObama in today's press conference how he accumulated more debt in 3 years than the first 42 presidents combined. _E_ Just landed in New York a one night stay in Scotland. Turnberry came out magnificently. My son Eric did a great job under budget! _E_ Congratulations to @TrumpDoral's #BlueMonster course for being named one of the 10 Toughest courses on Tour This Year __HTTP__ _E_ Even though Bernie Sanders has lost his energy and his strength I don't believe that his supporters will let Crooked Hillary off the hook! _E_ Great solidarity for our National Anthem and for our Country. Standing with locked arms is good kneeling is not acceptable. Bad ratings! 
_E_ Good news: Toyota and Mazda announce giant new Huntsville Alabama plant which will produce over 300000 cars and SUV's a year and employ 4000 people. Companies are coming back to the U.S. in a very big way. Congratulations Alabama! _E_ ...We negotiated a ceasefire in parts of Syria which will save lives. Now it is time to move forward in working constructively with Russia! _E_ RT @FLOTUS: Thank u to Queen Fabiola University Hospital! Enjoyed creating paper flowers with amazing patients & getting a tour. #Brussels... _E_ As the days and weeks go by we see what a total mess our country (and world) is in Crooked Hillary Clinton led Obama into bad decisions! _E_ Thank you! #Trump2016 __HTTP__ _E_ Obama is under a great of pressure to perform well in the next debate. Let's see how he reacts under pressure. _E_ When people treat me badly or unfairly or try to take advantage of me my general attitude all my life has (cont) __HTTP__ _E_ RT @foxandfriends: SEN. CRUZ: It's crazy to go an August recess without having Obamacare repealed. We should work every day until it is don... _E_ Syria has been given so much time that much of the things we were going to bomb have been moved into civilian areas! A polititian's war. _E_ .@RinglingBros is retiring their elephants the circus will never be the same. _E_ 'WikiLeaks Drip Drop Releases Prove One Thing: There's No Nov. 8 Deadline on Clinton's Dishonesty and Scandals' __HTTP__ _E_ Saudi Arabia should be paying the United States many billions of dollars for our defense of them. Without us gone! @AlWaleedbinT _E_ My wife @MELANIATRUMP will be joining @andersoncooper @AC360 tonight at 8pmE on @CNN. Enjoy! __HTTP__ _E_ People are so jealous of Tom Brady and the Patriots. No court could convict based on the evidence.They can't beat him on the field so this! _E_ "Here's something about Donald Trump he's got a top rated show on TV and everything he says becomes a headline." @DLoesch All true! 
_E_ Obama's speech in Las Vegas yesterday cost the taxpayer $520 per word and over $1.6M __HTTP__ More money borrowed from China. _E_ A $1.5B website that can only handle 50K users at a time is sad but no surprise! _E_ Ed Gillespie will turn the really bad Virginia economy #'s around and fast. Strong on crime he might even save our great statues/heritage! _E_ The booing at the NFL football game last night when the entire Dallas team dropped to its knees was loudest I have ever heard. Great anger _E_ Just out wonderful poll in North Carolina. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ ...people are now starting to recognize the amazing work that has been done by FEMA and our great Military. All buildings now inspected..... _E_ Via @myrbeachonline by @TSN_MPrabhu: Donald Trump states case for becoming POTUS at SC Tea Party convention __HTTP__ _E_ What separates the winners from the losers is how a person reacts to each new twist of fate. _E_ .@jessebwatters You did a great job hosting @oreillyfactor. Everybody loved it! Thank you for the nice words. _E_ Republicans in the Senate will NEVER win if they don't go to a 51 vote majority NOW. They look like fools and are just wasting time...... _E_ "Most of the time you will need to work hard and stay focused to get to the top – and then work even harder to stay there." Think Big _E_ Too many people on stage for debate. @RandPaul at 11th with 2% in @RealClearNews shouldn't be allowed to participate. _E_ So now it is reported that the Democrats who have excoriated Carter Page about Russia don't want him to testify. He blows away their.... _E_ "History does not long entrust the care of freedom to the weak or the timid." Dwight D. Eisenhower _E_ More than 70M people watched the Presidential Debate. A new record. See what happens when I am so prominently mentioned (just kidding)! _E_ Don't forget Benghazi. _E_ #CelebApprentice We always make sure to have great NYC locations for the task delivery. 
_E_ Last night in Phoenix I read the things from my statements on Charlottesville that the Fake News Media didn't cover fairly. People got it! _E_ The Federal government has $2.7T in assets & $17.5T in total liabilities plus another $4.7T in intergovernmental debt. Have a nice day. _E_ Who can figure out the true meaning of covfefe ??? Enjoy! _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Donald Trump praises @LilJon and welcomes him back to All Star @CelebApprentice __HTTP__ via @HipHopNews24x7 _E_ Hopefully all supporters and those who want to MAKE AMERICA GREAT AGAIN will go to D.C. on January 20th. It will be a GREAT SHOW! _E_ I will issue a lifetime ban against senior executive branch officials lobbying on behalf of a FOREIGN GOVERNMENT!... __HTTP__ _E_ Excited to keynote of the sold out Pottawattamie County Republican Party Lincoln/Reagan Dinner tonight! __HTTP__ Leaving now! _E_ Lance Armstrong made a really big mistake by opening up to Oprah. I'll bet he wishes he had the chance to do it over again. _E_ To have a government we can afford we need to eliminate the tremendous waste clogging the system. Almost every (cont) __HTTP__ _E_ I will be at @Macys Herald Square April 18 to sign my new fragrance #Success by Trump. First 100 customers receive a copy of my new book. _E_ Hillary just gave a disastrous news conference on the tarmac to make up for poor performance last night. She's being decimated by the media! _E_ Wind turbines threaten the migration of birds __HTTP__ Where's the outcry? _E_ Via @ScotlandNow: "Donald Trump starts £250million overhaul on @TrumpTurnberry golf resort" __HTTP__ _E_ This is Amateur Night who the hell is in charge of this production? #Oscars _E_ Great photo with @IvankaTrump and @Joan_Rivers from this week's @ApprenticeNBC __HTTP__ _E_ Via @Newsmax_Media by @OwenTew: "Trump: 'Maybe Something Miraculous Happens' and Obama Will Succeed" __HTTP__ _E_ Thank you Louisville Kentucky on my way! 
#MAGA __HTTP__ _E_ The Donald Goes to CPAC: TV star and hotel magnate gives his thoughts on the state of America __HTTP__ by @Kredo0 _E_ Sugar is nowhere near being a billionaire and I know he works for me! _E_ On the shores of Lake Norman @Trump_Charlotte presents the true luxury lifestyle and an elite golf course __HTTP__ _E_ RT @Scavino45: Today @POTUS @realDonaldTrump and @FLOTUS Melania visit the @USCG at the Lake Worth Inlet Station in Riviera Beach Florid... _E_ With our border not being secure Obama is giving a pathway to terrorists to enter our country. An attack is on him. _E_ Entrepreneurs who develop their Midas Touch do not work for money. They work to create or acquire assets. Focus on assets. _E_ Still waiting for an explanation about why @GiulianaRancic & @BillRancic did not name their son Donald. Unbelievable. _E_ Remarks at the United States Holocaust Memorial Museum's National Days of Remembrance. Full remarks:... __HTTP__ _E_ My @FoxNews interview from yesterday discussing my recent meetings in Trump Tower and also @GovChristie __HTTP__ _E_ Heading now for Reno Nevada for a big rally. Good poll numberd all over! _E_ It is time to bring competence to Washington. It is time get results. Let's Make America Great Again! __HTTP__ _E_ Keystone XL should be approved but more importantly we should be drilling & fracking our own resources. Would be an economic windfall. _E_ Ask China if their rapidly expanding (with our money) Navy or Armed Forces are going green They would laugh in your face! _E_ With all of the illegal acts that took place in the Clinton campaign & Obama Administration there was never a special councel appointed! _E_ I am especially grateful for the tremendous support I have received from the Evangelicals in the just out Iowa CNN poll. Thank you! _E_ Leading by 13 over Landrieu in a @FoxNews poll @BillCassidy will beat her in November. _E_ Just leaving Salt Lake City Utah fantastic crowd with no interruptions. Love Utah will be back! 
_E_ Man we had a great day today at Trump Tower lots of money was given to many people who really needed it good feelings and happiness! _E_ Back in D.C. big week for Tax Cuts and many other things of great importance to our Country. Senate Republicans will hopefully come through for all of us. The Tax Cut Bill is getting better and better. The end result will be great for ALL! _E_ Awarded 5 Stars by @ForbesInspector @TrumpChicago's @SixteenChicago offers Executive Chef @cheflents's new menu __HTTP__ _E_ Do your homework. Wasting other people's time due to poor planning will only leave a bad impression. Think Like a Billionaire _E_ When you can't say it or see it you can't fix it. We will MAKE AMERICA SAFE AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_ Iran continues to delay the nuclear deal while doing many bad things behind our backs. Time to WALK and double the sanctions. Stop payments! _E_ #TheRemembranceProject __HTTP__ __HTTP__ _E_ Our many loyal viewers should expect a major announcement very soon on next season's @CelebApprentice. Our fans will be pleased. _E_ "A failure or setback is not a defeat. Defeat is a state of mind. You are defeated only when you accept defeat." – Think Big _E_ If you live in a state with early voting you should be voting as soon as possible. Bring your friends and family with you. _E_ I give Secretary of State John Kerry credit for working and trying hard but he has zero negotiating ability! _E_ Just watched @NBCNightlyNews So biased inaccurate and bad point after point. Just can't get much worse although @CNN is right up there! _E_ My wife Melania will be on @Morning_Joe tomorrow morning at 8:00. Interviewed by @morningmika Enjoy! _E_ Consumer Confidence Hits Highest Level Since December 2000 Read more: __HTTP__ __HTTP__ _E_ Obama looks exhausted and beaten. He was never made or prepared for the job. 
Like it or not he doesn't have it _E_ We're all very happy to hear of Bret Michael's progress and send our best wishes for his full recovery. _E_ "I like thinking big. To me it's very simple: if you're going to be thinking anyway you might as well think big." – The Art of The Deal _E_ 'America must decide between failed policies or fresh perspective a corrupt system or an outsider' __HTTP__ _E_ Hey @Rosie how is your recovery going? I hope you are doing well so we can start fighting again soon! _E_ The Oscar broadcast is really boring where is the glamour and beauty? _E_ #ICYMI: Weekly Address __HTTP__ __HTTP__ _E_ The misery of Obama's economic policies. US households with unemployed parent was at record high in 2011 __HTTP__ _E_ My interview yesterday with @TeamCavuto where I discuss Dick Cheney and China __HTTP__ _E_ The ratings for The View are really low. Nicole Wallace and Molly Sims are a disaster. Get new cast or just put it to sleep. Dead T.V. _E_ "Trump: 'No way' Bush Romney would win in 2016" __HTTP__ via @FoxNews by Barnini Chakraborty _E_ #FlashbackFriday With Mickey Rooney @Regis and @itstonybennett __HTTP__ _E_ Still a buyer's market but somewhat fragile. Be sure to calculate the risk of rising rates coming sooner than you think! _E_ The State of Iowa should disqualify Ted Cruz from the most recent election on the basis that he cheated a total fraud! _E_ RT @Reince: Flying to Dallas now with @realDonaldTrump...Reports of discord are pure fiction. Great events lined up all over Texas. Rs wil... _E_ Tonight in his SOTU @BarackObama won't talk about Keystone. He will continue to dissemble about his record and play class warfare. _E_ Entrepreneurs: Is the problem a blip or a catastrophe? Keep things in perspective. Learn to expect problems and keep moving forward. _E_ Obama's deal vs. 
Trump's deals __HTTP__ _E_ They should seriously look into the moron George Zimmerman who shot and killed the 17 year old kid Trayvon (cont) __HTTP__ _E_ Just landed in New Hampshire. Will be at the venue shortly. #FITN _E_ The Fake News Media works hard at disparaging & demeaning my use of social media because they don't want America to hear the real story! _E_ The voters the Republican Party of Virginia are excluding will doom any chance of victory. The Dems LOVE IT! Be smart and win for a change! _E_ Excited to announce Trump Rio de Janeiro our first South American @TrumpCollection hotel set to open in 2016 __HTTP__ _E_ Just cancelled my subscription to @USATODAY. Boring newspaper with no mojo must be losing a fortune. Founder (cont) __HTTP__ _E_ This Ebola patient Thomas Duncan who fraudulently entered the U.S. by signing false papers is causing havoc. If he lives prosecute! _E_ Scary now China's Development Bank is looking to buy U.S. homes and developments __HTTP__ They will own our country soon. _E_ I can't stress strongly enough that we are currently in a buyer's residential market.Try to buy directly from a bank. _E_ Thank you America! Together we will all #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_ Via @businessinsider by @BKcolin: "Donald Trump called the White House and offered to help fix the BP oil spill" __HTTP__ _E_ Resilience is part of the survival of the fittest formula make sure you remain adaptable. _E_ Will be calling the President of Egypt in a short while to discuss the tragic terrorist attack with so much loss of life. We have to get TOUGHER AND SMARTER than ever before and we will. Need the WALL need the BAN! God bless the people of Egypt. _E_ .@DanaPerino Have you released a copy of the beautiful thank you card you sent me? Would you like to see it? @ericbolling @kimguilfoyle _E_ Michigan has made great progress under Snyder Calley. @MIGOP is out early energizing the grassroots. Keep it up! 
#JoinMITeam _E_ The PGA tour just extended my Trump Doral contract for WGC for ten years. _E_ Watch me on the @oreillyfactor tonight at 8PM. _E_ Thank you Roanoke Virginia be back soon! #TrumpPence16 __HTTP__ __HTTP__ _E_ The $200M renovations at Trump @DoralResort are right on target. When completed the course will be as good as it gets. _E_ Like the worthless @NYDailyNews looks like @politico will be going out of business. Bad reporting no money no cred! _E_ Who would really believe I would say such a thing about a guy I truly liked James Gandolfini. Sadly very sick people use my name. _E_ Does the Fake News Media remember when Crooked Hillary Clinton as Secretary of State was begging Russia to be our friend with the misspelled reset button? Obama tried also but he had zero chemistry with Putin. _E_ I will be on @foxandfriends tomorrow morning at 7:00. Enjoy! _E_ Wisconsin has suffered a great loss of jobs and trade but if I win all of the bad things happening in the U.S. will be rapidly reversed! _E_ Getting to the point is appreciated by everyone. Here's some advice for public speaking: Be sincere be brief be seated. F.D. Roosevelt _E_ Donald Trump: Yahoo Marissa Mayer Are Right Employees Should Not Work From Home __HTTP__ via @HuffPostSmBiz _E_ I will be interviewed on @foxandfriends at 7:00 A.M. _E_ Governor Kasich whose failed campaign & debating skills have brought him way down in the polls is going to spend $2.5 million against me. _E_ I am very impressed by @dennisrodman. His return to this season's @ApprenticeNBC showed who Dennis really is which is very good. _E_ Same failing @nytimes reporter who wrote discredited women's story last week wrote another terrible story on me today will never learn! _E_ Donate Today To Help Make America Great Again! You Can Help Stop Crooked Hillary Clinton! __HTTP__ __HTTP__ _E_ The polls & momentum are trending towards @MittRomney. Don't let the hurricane change your thinking! 
_E_ Eliot had a terrible debate performance this morning against Scott Stringer. He can't spin his failing and contemptible public record. _E_ Meeting Former Speaker Newt Gingrich next week. On the Agenda defeating @BarackObama. _E_ RT @FoxBusiness: #StockAlert: U.S. markets since the election __HTTP__ _E_ Obama's foreign policy is a complete and total disaster the worst President we have ever had. _E_ If only @Obama was as focused on balancing the budget as he is on weakening Israel's borders then America would be on the path to solvency. _E_ After 5 SB victories since 2002 it was my honor to give Bob Kraft Coach Belichick and the players their first to... __HTTP__ _E_ #CrookedHillary's plan will add $1.15 TRILLION in new taxes. We cannot afford her! #DrainTheSwamp #Debate __HTTP__ _E_ Arizona had a 116% increase in ObamaCare premiums last year with deductibles very high. Chuck Schumer sold John McCain a bill of goods. Sad _E_ We fight to free Libya and they kill our Ambassador and other Americans. Obama's foreign policy is a joke. _E_ I told you so @politico just lost it's top person. Poor results and no money to pay him. If they were legit they would be doing far better! _E_ You pick it! #1. Anybody that says anything derogatory about @BarackObama is labeled stupid insane or (cont) __HTTP__ _E_ The Schumer Rounds Collins immigration bill would be a total catastrophe. @DHSgov says it would be "the end of immigration enforcement in America." It creates a giant amnesty (including for dangerous criminals) doesn't build the wall expands chain migration keeps the visa... _E_ Our deepest sympathies and most heartfelt prayers are with the victims of the train derailment in Washington State. We are closely monitoring the situation and coordinating with local authorities... __HTTP__ _E_ Third rate reporters Amy Chozick and Maggie Haberman of the failing @nytimes are totally in the Hillary circle of bias. Think about Bill! _E_ The commodity market is extremely fragile. 
Be wary of investing right now. The futures are way too dependent on the Fed. _E_ Radio interview w/ @seanhannity discussing @PhilMickels0n_ why NY must start fracking & staying in @GOP primary __HTTP__ _E_ Great comeback by Tom Brady New England! _E_ ...our Great American Flag (or Country) and should stand for the National Anthem. If not YOU'RE FIRED. Find something else to do! _E_ Poll data shows that @marcorubio does by far the best in holding onto his Senate seat in Florida. Important to keep the MAJORITY. Run Marco! _E_ Our country has been unsuccessfully dealing with North Korea for 25 years giving billions of dollars & getting nothing. Policy didn't work! _E_ Tonight's episode of @ApprenticeNBC is not only the best episode ever it has a great lesson in life. Don't miss it! _E_ I am the only one that knows how to build cities pols are all talk and no action. Our cities need help and fast. They are crumbling! _E_ The election was a major setback for economy. All young entrepreneurs should be sure to calculate Obama's policies into their investments. _E_ Good article: What Happened to American Men from @Newsmax by Michael Cohen __HTTP__ _E_ In the end Andy Pettitte did not rat out his friend Roger Clemens. I like him again a lot. _E_ I will make this right for our great Vets! _E_ Join me tomorrow! #MAGA 10am Baton Rouge LA. Tickets: __HTTP__ Grand Rapids MI.Tickets: __HTTP__ _E_ Via @latimes' @LATshowtracker:"Monday's TV Highlights:@ApprenticeNBCon @nbc" __HTTP__ _E_ Wow Jeb Bush whose campaign is a total disaster had to bring in mommy to take a slap at me. Not nice! _E_ __HTTP__ _E_ _E_ RT @TeamTrump: WATCH: @realDonaldTrump on the stakes in this election #Debates2016 __HTTP__ _E_ Via @thestate by @AP: "Donald Trump: Giving 'serious thought' to presidential run" __HTTP__ _E_ Tweet me more of your questions to answer in the next video.... 
_E_ Via @theblaze: Donald Trump on how Rubio should have drank his water __HTTP__ _E_ Read what Donald Trump has to say about daughter Ivanka's upcoming new book The Trump Card: __HTTP__ _E_ Earlier today I spoke with @GovMattBevin of Kentucky regarding yesterday's shooting at Marshall County High School. My thoughts and prayers are with Bailey Holt Preston Cope their families and all of the wounded victims who are in recovery. We are with you! _E_ Donate Today To Help Make America Great Again! You Can Help Stop Crooked Hillary Clinton! __HTTP__ __HTTP__ _E_ Pervert alert–serial sexter @RepWeiner is making another step towards a comeback __HTTP__ All girls under 18 should block him. _E_ The Fake News refuses to report the success of the first 6 months: S.C. surging economy & jobsborder & military securityISIS & MS 13 etc. _E_ With the ridiculous Filibuster Rule in the Senate Republicans need 60 votes to pass legislation rather than 51. Can't get votes END NOW! _E_ As I said on @foxandfriends this a.m. you have to give Obama credit—he won! ... _E_ The debates especially the second and third plus speeches and intensity of the large rallies plus OUR GREAT SUPPORTERS gave us the win! _E_ I will end illegal immigration and protect our borders! We need to MAKE AMERICA SAFE & GREAT AGAIN! #Trump2016 __HTTP__ _E_ #3. Cover your bases. Know everything you can about what you're doing. _E_ .@LindseyGrahamSC and Lyin' Ted Cruz are two politicians who are very much alike ALL TALK AND NO ACTION! Both talk about ISIS do nothing! _E_ Give a lot of credit to Carlos Beltran for developing into a terrific baseball player and total winner for the Cardinals great going Carlos! _E_ The UK has run out of money and can't afford to borrow. __HTTP__ Neither can we but that doesn't stop @BarackObama. _E_ Thank you @AmSpec __HTTP__ _E_ RT @Inspire_Us: No color no religion no nationality should come between us we are all children of God. 
Mother Teresa _E_ Now he has made his Busey ism into a song. #CelebApprentice _E_ RT @IvankaTrump: Working families need #TaxReform & the time is now. This Administration is committed to ensuring all Americans can thrive... _E_ Another sign that @jack_welch is right. New government labor report casts even more doubt on the September jobs data __HTTP__ _E_ It wasn't only that Obama saluted a Marine with a cup of coffee in his hand but why the hell does he have to exit a heli holding coffee? _E_ 5 year old Trey has terminal cancer. I'm helping him go to Disney won't you? __HTTP__ _E_ MAKE AMERICA SAFE AND GREAT AGAIN! #RNCinCLE __HTTP__ _E_ "Dem candidates are all folks who vote with me." – Barack Obama describing ALL Democrat Senate candidates _E_ RT @Scavino45: POTUS' @realDonaldTrump on Hurricane Response Efforts in #PuertoRico on Instagram part of 9/29/17 Weekly Address. __HTTP__ _E_ Great seeing @MarianoRivera w/@realDonaldTrump at @TrumpTowerNY for @EricTrumpFdn! __HTTP__ __HTTP__ _E_ Bill Clinton stated that I called him after the election. Wrong he called me (with a very nice congratulations). He doesn't know much ... _E_ An 'extremely credible source' has called my office and told me that @BarackObama's birth certificate is a fraud. _E_ Disappointed in GOP and Dems Giving Obama power to raise the debt limit next year is a mistake. _E_ RT @TwitterData: These are the 10 most Tweeted about world leaders during the first day of #UNGA General Debate __HTTP__ _E_ In just out book Secret Service Agent Gary Byrne doesn't believe that Crooked Hillary has the temperament or integrity to be the president! _E_ "A big key to winning is knowing where the other side is coming from." – Think Like a Champion _E_ It would be nice if our commander in chief was as concerned for our Veterans health as he is for illegal immigrants becoming citizens. _E_ Obama's disastrous judgment gave us ISIS rise of Iran and the worst economic numbers since the Great Depression! 
_E_ Ranked nationally in @GolfMagazine's top 100 Trump Int'l Golf Club in Palm Beach is a 27 hole masterpiece __HTTP__ _E_ In my speech on protecting America I spoke about a temporary ban which includes suspending immigration from nations tied to Islamic terror. _E_ Think of yourself like a one man army. You're not only the commander in chief you're the soldier as well. – Think Like A Billionaire _E_ ...and a Great Leader. John has also done a spectacular job at Homeland Security. He has been a true star of my Administration _E_ The big and highly respected Cooley LLP is handling the @billmaher case for me. _E_ After strict consultation with General Kelly the CIA and other Agencies I will be releasing ALL #JFKFiles other than the names and... _E_ RT @LouDobbs: The stock market has gained an incredible 7.8 Trillion dollars in market value since @POTUS was elected! Looks like 4% econom... _E_ Now @BarackObama is telling @MittRomney how to control his own assets. __HTTP__ Obama is consumed by class warfare. _E_ Lyin' Ted Cruz will never be able to beat Hillary. Despite a rigged delegate system I am hundreds of delegates ahead of him. _E_ I cannot imagine that these very fine Republican Senators would allow the American people to suffer a broken ObamaCare any longer! _E_ If the press would cover me accurately & honorably I would have far less reason to tweet. Sadly I don't know if that will ever happen! _E_ The Mayor of San Juan who was very complimentary only a few days ago has now been told by the Democrats that you must be nasty to Trump. _E_ For political purposes only Obama is planning to hit Libya for the Benghazi embassy attack right before the election? _E_ Yesterday's failing @NYTimes fraudulently shows an empty room prior to my speech when in fact it was packed! __HTTP__ _E_ Obama is community organizing from the Oval Office on Ferguson today. More riots sure to follow. _E_ President Obama thinks the nation is not as divided as people think. 
He is living in a world of the make believe! _E_ Good news House just passed #KatesLaw. Hopefully Senate will follow. _E_ U.S. Senator Bob Corker (R Tenn.) issued the following statement today regarding the 2016 presidential election: __HTTP__ _E_ Many journalists are honest and great but some are knowingly dishonest and basic scum. They should.be weeded out! _E_ I am going to save Social Security without any cuts. I know where to get the money from. Nobody else does. my @SRQRepublicans speech _E_ Great evening in Canton Ohio thank you! We are going to MAKE AMERICA GREAT AGAIN! Join us: __HTTP__ __HTTP__ _E_ Mitt's proposed tax cuts for the middle class will spur record economic growth. _E_ Is this Hope & Change? A record 46.7M Americans were on food stamps this past June. We must do better. _E_ Be sure to visit the world renowned Trump Tower Atrium to see our holiday decorations. __HTTP__ _E_ The greatest threat to our security is our debt. It is already past 100% GDP. We need to make real budget cuts. _E_ The real war on women. Under @BarackObama 766000 more women are unemployed from when he took office __HTTP__ _E_ President Obama wants to change the name of the White House because it is highly discriminating and not at all politically correct! _E_ M.M. is a good choice also nice guy! #Oscars _E_ ObamaCare will continue to stop entrepreneurship slow growth and halt research & development. Defund Repeal & Replace! _E_ Via @politico: Donald Trump claims Barack Obama bombshell __HTTP__ _E_ Join us! #CaucusForTrump11am WATERLOO: __HTTP__ CEDER RAPIDS: __HTTP__ __HTTP__ _E_ The deal with Iran will go down as one of the most incompetent ever made. The U.S. lost on virtually every point. We just don't win anymore! _E_ The road to success is always under construction. Arnold Palmer _E_ Making a big speech in Alabama today. So many people we had to move to a football stadium! Come and join us! _E_ Just signed Bill. Our Military will now be stronger than ever before. 
We love and need our Military and gave them everything — and more. First time this has happened in a long time. Also means JOBS JOBS JOBS! _E_ Getting ready to meet President al Sisi of Egypt. On behalf of the United States I look forward to a long and wonderful relationship. _E_ Don't assume you have to accept the hand you were dealt. – Think Like A Billionaire _E_ Trump National Golf Club Los Angeles is situated on the Palos Verdes Peninsula overlooking the Pacific Ocean... __HTTP__ _E_ Me by a lot! _E_ Spoke to U.K. Prime Minister Theresa May today to offer condolences on the terrorist attack in London. She is strong and doing very well. _E_ Located in South Ayrshire Scotland @TrumpTurnberry offers diverse dining options suitable for any occasion __HTTP__ _E_ Just announced that as many as 5000 ISIS fighters have infiltrated Europe. Also many in U.S. I TOLD YOU SO! I alone can fix this problem! _E_ My @FoxNews interview with @TeamCavuto discussing the national housing market unemployment numbers and the FL (cont) __HTTP__ _E_ The same people that said I wouldn't run or that I wouldn't lead or do well (1st place and leading by 21%) now say I won't beat Hillary. _E_ Via @PoliticalTicker: "TRENDING: Trump a right leaning tower at CPAC" __HTTP__ by @KilloughCNN _E_ I will be addressing a fantastic Ames crowd at tomorrow's @bobvanderplaats' @theFAMiLYLEADER Leadership Summit __HTTP__ _E_ Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_ RT @JaydaBF: VIDEO: Muslim Destroys a Statue of Virgin Mary! __HTTP__ _E_ RT @GOP: In @timkaine's own words #Debates2016 __HTTP__ _E_ RT @greta: Prob w/ all pundits saying last fall @realDonaldTrump had no chance is that shows media so out of touch w/ Americans _E_ Scottish government having huge backlash on wind turbines. @AlexSalmond is becoming very unpopular. 
_E_ 'Trump won the third debate' __HTTP__ _E_ China is our enemy they want to destroy us Redstate Interview _E_ "Trust your instincts especially if they are well honed." – Midas Touch _E_ On 59th & Park Avenue Trump Park Avenue transformed the legendary Hotel Delmonico into 120 luxury residences __HTTP__ _E_ Very organized process taking place as I decide on Cabinet and many other positions. I am the only one who knows who the finalists are! _E_ Can you believe that the Chinese would not give Obama the proper stairway to get off his plane fight on tarmac! __HTTP__ _E_ Lightweight shakedown artist AG Eric Schneiderman was exposed in today's New York Post editorial __HTTP__ _E_ RT @EricTrump: Wow! I am speechless! Thank you to my sidekick @LynnePatton who keeps me & the @EricTrumpFdn in line! __HTTP__ _E_ Little Barry Diller who lost a fortune on Newsweek and Daily Beast only writes badly about me. He is a sad and pathetic figure. Lives lie! _E_ An extended interview from the Super Bowl with @oreillyfactor airs tonight at 8:00 P.M. Enjoy! __HTTP__ _E_ Susan Rice the former National Security Advisor to President Obama is refusing to testify before a Senate Subcommittee next week on..... _E_ Do you believe that @FoxNews is still playing up the old Iowa poll numbers and no mention of the ABCWashington Post or just out CBS results? _E_ I've got news for President @BarackObama: America is not what's wrong with the world. I don't believe we need (cont) __HTTP__ _E_ Costs on non military lines will never come down if we do not elect more Republicans in the 2018 Election and beyond. This Bill is a BIG VICTORY for our Military but much waste in order to get Dem votes. Fortunately DACA not included in this Bill negotiations to start now! _E_ Don't forget to watch @ApprenticeNBC tonight—you will love it! 8 PM on NBC. #CelebApprentice _E_ "Be up front and direct with people and they will return the favor." 
– Think Like a Billionaire _E_ Republicans must start the Tax Reform/Tax Cut legislation ASAP. Don't wait until the end of September. Needed now more than ever. Hurry! _E_ Entrepreneurs: Look at the solution not the problem. Learn to focus on what will give results. _E_ .@THEGaryBusey returns to @CelebApprentice All Stars this season. His streak of chaos and havoc continues! _E_ ... A great person inspires others to see for themselves. – Harvey Mackay _E_ Why has Barack Obama repeatedly told inconsistent stories about his religious background? __HTTP__ Who is he? _E_ Celebrated for its room views by @LuxTravelExpert @TrumpChicago soars a luxurious 92 stories over the Windy City __HTTP__ _E_ Heading to Colorado for a big rally. Massive crowd great people! Will be there soon the polls are looking good. _E_ .@newsbusters Thank you for a great and very accurate story well done! _E_ Will be in Nashville Tennessee tomorrow (Saturday) at 2:30 P.M. So much to talk about see you there! _E_ Why is Senator John McCain in Syria visiting with the rebels MAKE AMERICA GREAT AGAIN! _E_ Despite the long delays by the Democrats in finally approving Dr. Tom Price the repeal and replacement of ObamaCare is moving fast! _E_ .@EmilyMiller's book Emily Gets Her Gun exposes the attack on our Second Amendment __HTTP__ A must read! _E_ So the highly overrated anchor @megynkelly is allowed to constantly say bad things about me on her show but I can't fight back? Wrong! _E_ Via @DailyMail by @chriskitching: "Luxury penthouse at @TrumpChicago skyscraper sells for record $17M" __HTTP__ _E_ Do you believe Barack Hussein Obama (aka Barry Soetoro) looked like a president last night? I don't! _E_ Just leaving D.C. Had great meetings with Republicans in the House and Senate. Very interesting day! These are people who love our country! _E_ Americans Elect on track to put an Indy Presidential candidate on the ballot in all 50 states. 
_E_ Stephanie Cutter Attended WH Meetings With IRS Chief __HTTP__ Great investigative work by Jim Hoft @gatewaypundit _E_ MERRY CHRISTMAS!! __HTTP__ _E_ "Trump: I created tens of thousands of jobs" __HTTP__ via @thehill by @SmiloTweets _E_ I'm not sure about @teresa_giudice as Project Manager. @lisalampanelli can be formidable. But let's see what happens #sweepstweet _E_ Many of @TigerWoods' 'friends' were quick to abandon him in his time of crisis. Now Tiger knows who he can count on. _E_ One of the most obvious lessons on @ApprenticeNBC is for the candidates to learn to think quickly. Think Like a Champion _E_ A 7242 yd. masterpiece @TrumpGolfLA's $250 million course features 18 challenging holes with incredible views __HTTP__ _E_ I have fun I love what I do. You should too. Find out how at the National Achievers Conference this October. __HTTP__ _E_ RT @fundanything In case you missed it check out @washingtonpost story about @realDonaldTrump & @fundanything __HTTP__ _E_ The Chinese are planning on going to the Moon...I hope they stop and take a look at our flag that was put there 43 years ago. @MittRomney _E_ Via @MoscowTimes Donald Trump Planning Skyscraper in Moscow __HTTP__ _E_ Once again the Bush appointed Supreme Court Justice John Roberts has let us down. Jeb pushed him hard! Remember! _E_ Military solutions are now fully in placelocked and loadedshould North Korea act unwisely. Hopefully Kim Jong Un will find another path! _E_ Re Megyn Kelly quote: you could see there was blood coming out of her eyes blood coming out of her wherever (NOSE). Just got on w/thought _E_ If everyone is thinking alike then somebody isn't thinking. George S. Patton _E_ A must read for any country or community considering wind turbines. __HTTP__ _E_ I think the @yankees will win today. Unlike A Rod CC is good under pressure. I hope A Rod plays however. _E_ Congratulations to @ehasselbeck on her successful first day as co host of @foxandfriends! 
Great to be in studio today for Elisabeth. _E_ Bringing hundreds of billions of dollars back to the U.S.A. from the Middle East which will mean JOBS JOBS JOBS! _E_ Theresa @theresamay don't focus on me focus on the destructive Radical Islamic Terrorism that is taking place within the United Kingdom. We are doing just fine! _E_ Thank you Waterbury Connecticut!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Obama is now standing in a puddle acting like a President give me a break. _E_ When James Clapper himself and virtually everyone else with knowledge of the witch hunt says there is no collusion when does it end? _E_ .@sternshow My interview with Howard Stern this morning! __HTTP__ __HTTP__ _E_ #TBT With my family growing up I'm on the left. __HTTP__ _E_ Wow pres. candidate Ben Carson who is very weak on illegal Immigration just said he likes amnesty and a pathway to citizenship. _E_ Nick Adams Retaking America Best things of this presidency aren't reported about. Convinced this will be perhaps best presidency ever. _E_ Sanders said only black lives matter wow! Hillary did not answer question! _E_ We have all been following the Wisconsin recall election. @ScottKWalker's victory tonight will be well earned. A Governor who gets results. _E_ Yesterday China VP Xi stressed the benefits of trade with China to Congress __HTTP__ We need FAIR TRADE with China! _E_ ObamaCare is a disaster and Snowden is a spy who should be executed but if it and he could reveal Obama's recordsI might become a major fan _E_ The rally in Cincinnati is ON. Media put out false reports that it was cancelled. Will be great love you Ohio! _E_ I'm proud to accept the 2010 HollyRod Foundation Humanitarian Award from Holly Robinson Peete who raised $700000 on Celebrity Apprentice _E_ Via @FoxNews: "Trump: Politicians are all talk no action I'm the opposite" __HTTP__ _E_ R.P.Virginia has lost statewide 7 times in a row. Will now not allow desperately needed new voters. Suicidal mistake. 
RNC MUST ACT NOW! _E_ On Holocaust Remembrance Day we mourn and grieve the murder of 6 million innocent Jewish men women and children and the millions of others who perished in the evil Nazi Genocide. We pledge with all of our might and resolve: Never Again! __HTTP__ __HTTP__ _E_ Despite the establishment and the media's best efforts the people are speaking loudly and clearly. Thank you to my amazing supporters! _E_ Despite what you hear in the press healthcare is coming along great. We are talking to many groups and it will end in a beautiful picture! _E_ Former Prosecutor: The Clintons Are So Corrupt Everything 'They Touch Turns To Molten Lead' __HTTP__ _E_ Thank you Ohio! Just landed in Canton for a rally at the Civic Center. Join me at 7pm: __HTTP__ __HTTP__ _E_ RT @FoxNews: .@KellyannePolls on Harvey recovery: We hope when it comes to basic Hurricane Harvey funding that we can rely upon a nonpartis... _E_ Be sure to enjoy the '50th Anniversary Chicago International Film Festival' at @TrumpChicago the Windy City's top hotel! _E_ "When your brand begins to build you too will be faced with opportunities for greater recognition." – Midas Touch _E_ Lots of response to my Pattinson/Kristen Stewart reunion. She will cheat again 100 certain am I ever wrong? _E_ .@WSJ Editorial Board should review my debate statement re China and T.P.P. and apologize. China not part but will get their way in later. _E_ Do these very stupid politicians who got us involved in Iraq look bad or what? Everybody wants their oil only made possible by U.S.! _E_ .@lisarinna did amazing on #CelebApprentice @ApprenticeNBC. Raising over $505K for @StJude she made it to the Final Four. Congrats Lisa. _E_ Join us in Sparks Nevada today! #NevadaCaucus #VoteTrumpNV __HTTP__ _E_ The Republicans never should have agreed to this past summer's debt deal. Military cuts will now come along with tax increases. _E_ Wow the MSM is really going after me. 12000 in Sarasota a love fest hardly a mention. 
Only one negativity they only want negatives! _E_ Join me for a 3pm rally tomorrow at the Mid America Center in Council Bluffs Iowa! Tickets:... __HTTP__ _E_ RT @AbeShinzo: トランプ大統領による、初の、歴史的な日本訪問は、間違いなく、日米同盟の揺るぎない絆を世界に示すことができました。本当にありがとう、ドナルド。そして、アジア歴訪の大成功をお祈りしています。@realDonaldTrump __HTTP__ _E_ Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy! _E_ .@FrankLuntz I won every poll of the debate tonight by massive margins @DRUDGE_REPORT & @TIME so where did you find that dumb panel. _E_ Obama admin. called @netanyahu chickenshit. Ironic since Bibi was an IDF Special Forces commando while Obama was a community organizer. _E_ "@jacknicklaus elated at official grand opening of @TrumpFerryPoint" __HTTP__ via @nypost by @NYPost_Willis _E_ 'U.S. Small Business Optimism Index Surges by Most Since 1980' __HTTP__ _E_ Thanks everybody for the Happy Birthday greetings but it's actually the 10th birthday of The Apprentice. My birthday is June 14th.... _E_ Today marks the one year anniversary of @AndrewBreitbart's passing. Andrew's mission & legacy still lives on. @BreitbartNews _E_ Great @UnionLeader piece by @jdistaso on my visit to @saintanselm for @NECouncil & @nhiop Politics & Eggs __HTTP__ _E_ I'm convinced that about half of what separates successful entrepreneurs from the non successful ones is pure perseverance. Steve Jobs _E_ "TRUMP BATTLES THE NEW TOTALITARIANS: GOP elites join with leftists at Media Matters in targeting threat to both" __HTTP__ _E_ Via @FoxNewsInsider as seen on @foxandfriends: "Trump: Iran Nuke Talks Should Have Taken One Day" __HTTP__ _E_ Every strike brings me closer to the next home run. – Babe Ruth _E_ I will be on @foxandfriends at 7:00 in 10 minutes. HAVE A GREAT DAY ALL! _E_ "Concentration and mental toughness are the margins of victory." Bill Russell _E_ The freezing cold weather across the country is brutal. Must be all that global warming. 
_E_ RT @DonaldJTrumpJr: Last chance #Wisconsin: Find your polling location for today's primary & go vote! Visit __HTTP__ #T... _E_ A third rate architecture critic who I thought got fired—for the failing @chicagotribune likes the building but doesn't like the Trump sign _E_ The sequester is less than 2% of total 2013 budget. Why can't the WH re allocate funds and keep the tours open for children? #OpenOurWH _E_ The Democrats without a leader have become the party of obstruction.They are only interested in themselves and not in what's best for U.S. _E_ Isn't it sad that Weiner's first press conference with wife Huma was yesterday admitting to a sext he made post resignation & apology! _E_ #MakeAmericaGreatAgain #6Days __HTTP__ _E_ We must stop the crime and killing machine that is illegal immigration. Rampant problems will only get worse. Take back our country! _E_ The NY SAFE Act is an unconstitutional attack on 2nd Amendment rights. Will also increase crime. _E_ .@guardian_sport by @mrewanmurray:"Donald Trump's transformation will make @TrumpTurnberry Open worth the wait" __HTTP__ _E_ Great jobs numbers and finally after many years rising wages and nobody even talks about them. Only Russia Russia Russia despite the fact that after a year of looking there is No Collusion! _E_ Keep the big picture in mind. There are always opportunities and thinking too small can negate a lot of them. _E_ President Obama is under pressure from Democrats to undo his lie on ObamaCare. His problem is that such a move would end ObamaCare. _E_ Thanks for all the nice words on my keeping the Trump Tower atrium accessible to stranded victims of #Sandy. My honor. _E_ . #LaskerRink. We do not do the maintenance on Lasker Rink that is done by NEW YORK CITY. _E_ Facebook billionaire gives up his U.S. citizenship in order to save taxes. I guess 3.8 billion isn't enough for (cont) __HTTP__ _E_ So @JLin7 had another game winning shot last night. 
Looks like the Knicks have not only found a new point guard (cont) __HTTP__ _E_ EXCLUSIVE — Video Interview: Bill Clinton Accuser Juanita Broaddrick Relives Brutal Rapes: __HTTP__ _E_ Just 30 minutes from Manhattan @TrumpNationalNY is Westchester's most elite club offering a 7291 yard course __HTTP__ _E_ People are proud to be saying Merry Christmas again. I am proud to have led the charge against the assault of our cherished and beautiful phrase. MERRY CHRISTMAS!!!!! _E_ The media is pathetic. Our embassies are savaged by radicals while Obama does nothing and all they can do is criticize @MittRomney. _E_ Getting ready to land in Charlottesville Virginia at Trump Vineyards another job producing development that I bought and made AMAZING! _E_ I will be at the @USGA #USWomensOpen in Bedminster NJ tomorrow. Big crowds expected & the women are playing great should be very exciting! _E_ .@evaemery Thanks you sound great! _E_ 69 Democrats voted in favor of the Keystone pipeline in the House this week __HTTP__ A major defeat for @BarackObama _E_ Great sign: We built this business without government help. Obama can kiss our a ! __HTTP__ Commonly heard now across America! _E_ The full video of my @LibertyU speech __HTTP__ Liberty's largest ever Convocation crowd. _E_ Merry Christmas to all have a fantastic day year and life! The World with great leadership will become a much more beautiful place! _E_ So happy about my daughter @IvankaTrump's announcement that she will be having a baby this spring. Congratulations! _E_ .@ArsenioOFFICIAL Thx for the good wishes you are going to have a really big year! _E_ #Obamacare premiums are about to SKYROCKET again. Crooked H will only make it worse. We will repeal & replace! __HTTP__ _E_ So many politically correct fools in our country. We have to all get back to work and stop wasting time and energy on nonsense! _E_ The FAKE MSM is working so hard trying to get me not to use Social Media. 
They hate that I can get the honest and unfiltered message out. _E_ What we are watching on our TV screens is the unraveling of the Obama foreign policy. @PaulRyanVP _E_ For China of all nations to search the massive Indian Ocean and pick up the ping from the black box of flight 370 sounds a bit far fetched _E_ The race for DNC Chairman was of course totally rigged. Bernie's guy like Bernie himself never had a chance. Clinton demanded Perez! _E_ No surprise that @BBC is in a major scandal for shoddy journalism. Any network that air's @antbaxter's garbage has zero credibility. _E_ Why isn't Obama protecting us from ridiculous gas prices? _E_ We all know that chess is a game of strategy. So is business. Think about that and develop a strategy starting today. _E_ Iran is desperate to develop nukes. Congress must increase sanctions against Iran. _E_ "Out of clutter find Simplicity. From discord find Harmony. In the middle of difficulty lies Opportunity."–Albert Einstein _E_ Lightweight @AGSchneiderman just got his ass kicked by Trump! _E_ RT @DRUDGE_REPORT: Obama Refers to Himself 119 Times During Hillary Nominating Speech... __HTTP__ _E_ My @foxandfriends int.on @FoxNewsInsider:"'We Have No Leadership': Trump Slams Obama for Skipping Paris Unity Rally" __HTTP__ _E_ Back by popular demand the record 13th season of 'All Star' @CelebApprentice features the return of @bretmichaels. Our fans will be happy. _E_ He @BarackObama received an early endorsement from the Soviet newspaper Pravda over @MittRomney (cont) __HTTP__ _E_ Just found out that at a charity auction of celebrity portraits in E. Hampton my portrait by artist William Quigley topped list at $60K _E_ I will be interviewed on @oreillyfactor tonight at 11pmE @FoxNews. Enjoy! _E_ But maybe my biggest beef with Obama is his view that there's nothing special or exceptional about America. #TimeToGetTough _E_ I don't watch or do @Morning_Joe anymore. Small audience low ratings! I hear Mika has gone wild with hate. 
Joe is Joe. They lost their way! _E_ The Boston killer will soon be asking for a Presidential pardon—don't give it to him Mr. President—hang tough! _E_ When will we see stories from CNN on Clinton Foundation corruption and Hillary's pay for play at State Department? _E_ Pathetic! Since @GovWalker is going to win the recall @BarackObama is trying to disown the endorsement of Tom Barrett __HTTP__ _E_ The new Libyan Government should turn over the Lockerbie bomber now. _E_ I will be interviewed on Fox News Sunday With Chris Wallace at 9:00 A.M. or 10:00 A.M. (depending on location). Will be tough but good! _E_ Join me tomorrow! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_ A Veteran & true Conservative @leezeldin will make a real difference in Washington. NY 1 GOP GOTV for Lee tomorrow! _E_ expensive mistake! THE UNITED STATES IS OPEN FOR BUSINESS _E_ JOIN ME TOMORROW IN FLORIDA!MIAMI 12pm __HTTP__ __HTTP__ __HTTP__ _E_ I told you so. Our country totally lost control of illegal immigration even with criminals. __HTTP__ _E_ Thank you to NC for last evenings great reception. The speech was a great success. Heading now to Louisiana & another speech tonight in MI. _E_ How can the NY Times show an empty room hours before my speech even started when they knew it was going to be packed? So totally dishonest! _E_ Our great VETERANS are being treated very badly because of corruption and incompetence at the V.A. That will stop I will fix this quickly! _E_ .@TrumpLasVegas was just rated "Best Room Service" in LV by The Daily Meal. Congrats to my Las Vegas staff! __HTTP__ _E_ Join me Tuesday in Everett Washington at the Xfinity Arena! Tickets: __HTTP__ __HTTP__ _E_ RT @TeamTrump: .@realDonaldTrump calling out @HillaryClinton's support for NAFTA = most searched moment during tonight's debate. #Debates20... _E_ The five prisoners our government so stupidly released for one pathetic traitor are now fighting and killing for ISIS BAD DEAL! 
Courtmarshal _E_ Congratulations to @RickSantorum for coming out of Iowa a winner! _E_ With @shawnjohnson and @lorenzolamas from @apprenticenbc two great people! __HTTP__ _E_ BIG NIGHT ON TWITTER TONIGHT. I WILL BE LIVE TWEETING PRESIDENT OBAMA'S SPEECH AT 7:50 P.M. ( EASTERN). MUST TALK RADICAL ISLAMIC TERRORISM! _E_ Money may not grow from trees but it does grow from talent hard work and brains. Think Like a Billionaire _E_ Thanks for the tremendous support for my shirts ties and suits at Macy's. They do great because of really high quality at a low price. _E_ President Obama just had a news conference but he doesn't have a clue. Our country is a divided crime scene and it will only get worse! _E_ Ron Paul is right when he says we are wasting lives and money in Iraq and Afghanistan. _E_ It's amazing that some of the dumbest people on television work for the Wall Street Journal in particular a real dope named Charles Lane! _E_ I can't believe no one has been fired over the ObamaCare website fiasco! _E_ My @foxandfriends interview from this morning __HTTP__ _E_ Legendary Illusionist v. Country Music Star. This Sunday's LIVE Finale of @ApprenticeNBC is a historic matchup. MUST SEE TV! _E_ Isn't it amazing that @CNN paid a fortune for an Iowa Poll which shows me in first place over Cruz by 13% 33% to 20% then doesn't use it _E_ Where's the accountability for the $635M website fiasco in the Obama administration? Heads should roll and officials should be fired _E_ Great new poll thank you! __HTTP__ _E_ Join me in Naples Florida this evening at 6:00pm! Tickets: __HTTP__ __HTTP__ _E_ Isn't it terrible that @megynkelly used a poll not used before (I.B.D.) when I was down but refuses to use it now when I am up? _E_ If the Dems (Crooked Hillary) got elected your stocks would be down 50% from values on Election Day. Now they have a great future and just beginning! __HTTP__ _E_ I wonder how much money dumb @BuzzFeed and even dumber Ben Smith loooose each year? 
They have zero credibility totally irrelevant and sad! _E_ A level will be reached where ObamaCare will be so out of control expensive and unwieldy that the biggest supporters will abandon ship. _E_ Barack Obama's delivery on Saturday night was excellent cute mention of Trump and I am flattered to be mentioned. @BarackObama _E_ I am watching @CNN very little lately because they are so biased against me. Shows are predictable garbage! CNN and MSM is one big lie! _E_ The long anticipated release of the #JFKFiles will take place tomorrow. So interesting! _E_ Remember @foxandfriends at 7:00 A.M. and Celebrity Apprentice at 8:00 P.M. Enjoy! _E_ .@mdamelincourt Thanks M you are doing a great job at Trump Toronto! _E_ Congratulations to @netanyahu on his electoral victory. He will now be the longest serving @IsraeliPM. A great leader. _E_ THE WEST WILL NEVER BE BROKEN. Our values will PREVAIL. Our people will THRIVE and our civilization will TRIUMPH! __HTTP__ _E_ Thank you Charlotte North Carolina. Great afternoon! #ICYMI I delivered a speech on urban renewal. Full speech:... __HTTP__ _E_ New reality. Yuan just passed the Euro as 2nd most traded finance currency __HTTP__ Our leaders better get smart fast. _E_ Great meeting with @SenateMajLdr Mitch McConnell and Republican leaders in D.C. #Trump2016 __HTTP__ _E_ I will not be able to attend the Miss USA pageant tomorrow night because I am campaigning in Phoenix. Wishing all well! _E_ Taliban targeted innocent Afghans brave police in Kabul today. Our thoughts and prayers go to the victims and first responders. We will not allow the Taliban to win! _E_ My interview with Don Imus on @77WABCradio discussing my @RNC convention surprise & @MittRomney's China policy __HTTP__ _E_ A great new book has been written about Crooked Hillary. Read it & you will never be able to vote for her. @Ed_Klein __HTTP__ _E_ Yesterday I was in Washington D.C. visiting the #TRUMP Old Post Office renovation. It will be magnificent. 
_E_ As I have long stated we are so tied in with China and Asia that their markets are now taking the U.S. market down. Get smart U.S.A. _E_ The new Rasmussen Poll one of the most accurate in the 2016 Election just out with a Trump 50% Approval Rating.That's higher than O's #'s! _E_ After today Crooked Hillary can officially be called Lyin' Crooked Hillary. _E_ #TBT A picture of my fantastic father and myself. Best teacher in the world! A great Father's Day... __HTTP__ _E_ Judge Gorsuch will be sworn in at the Rose Garden of the White House on Monday at 11:00 A.M. He will be a great Justice. Very proud of him! _E_ Have a great weekend everyone and for those of you that are young entrepreneurs have fun but never stop thinking of the task ahead victory _E_ I will be on Fox & Friends (@foxandfriends) at 7.00. Fighting Ebola will be a topic! _E_ The Republican House members are working hard (and late) toward the Massive Tax Cuts that they know you deserve. These will be biggest ever! _E_ .@THEGaryBusey is making no attempt to help. Is he in BuseyLand? Their team is short on help already... #CelebApprentice _E_ Our president could not make a proper website with $5B. The website still does not work. How can we feel safe about Ebola?! _E_ Incredible crowd in Richmond Virginia tonight! So much spirit and energy! #makeamericagreatagain __HTTP__ _E_ states instead of the 15 states that I visited. I would have won even more easily and convincingly (but smaller states are forgotten)! _E_ .@EricTrump unbelievable job on #FoxNews with @greta. That was better than I could do! #Trump2016 _E_ Tomorrow in DC: 1 PM West Front Lawn of the Capitol. Not even believable that we would do this deal with Iran. _E_ All raising taxes on businesses does is force business owners to lay off employees they can no longer afford. 
(cont) __HTTP__ _E_ My @gretawire interview discussing @BarackObama's misleading political ad @MittRomney's response and @Cher & @Rosie __HTTP__ _E_ Angelina and Sidney had a really strange vibe going! #Oscars _E_ We need a tax system that is fair and smart one that encourages growth savings and investment. #TimeToGetTough _E_ The Crooked Hillary V.P. choice is VERY disrespectful to Bernie Sanders and all of his supporters. Just another case of BAD JUDGEMENT by H! _E_ Healthcare listening session w/ @VP & @SecPriceMD. Watch: __HTTP__ #ReadTheBill:... __HTTP__ _E_ Back to work for the President to try and keep some dignity for the office and himself. The so called rebels must be thoroughly confused! _E_ Coincidence? Obama and Ahmadinejad each describe @Israel's warning over the Iranian nuclear program as just 'noise' __HTTP__ _E_ I'm self funding and I am going to take care of the people – not the special interests and insurance companies like the other candidates. _E_ My @nbc @todayshow interview discussing my @RNC video & why @MittRomney should not apologize __HTTP__ _E_ Sorry folks but Bernie Sanders is exhausted just can't go on any longer. He is trying to dismiss the new e mails and DNC disrespect. SAD! _E_ .@DavidLetterman @Late_Show fully apologized last night for calling me a racist. Thank you David we are again friends. _E_ Trump International Golf Links and Hotel Ireland is located on the Atlantic Ocean in County Clare. Spectacular! __HTTP__ _E_ Golf bookings for next season on Scottish course are already double our projections for April opening—great news... __HTTP__ _E_ #TrumpAdvice __HTTP__ _E_ This 'deal' @RNC voted for has $41 in tax increases for every $1 in spending cuts. It is pathetic. Obama is laughing at them. _E_ Will be back in Virginia tonight for a 6pm rally at the Berglund Center in Roanoke. Join me! Tickets:... __HTTP__ _E_ The recession was made worse by @BarackObama. A $900Billion deficit is not getting better. 
_E_ Success tip: See yourself as victorious. This will focus you in the right direction. Apply your skills and talent and be tenacious. _E_ Amazing that Ted Cruz can't even get a Senator like @BenSasse who is easy to endorse him. Not one Senator is endorsing Canada Ted! _E_ Watch – Obama in 2006: "I've stolen ideas from Jonathan Gruber" __HTTP__ And now Obama claims he is 'just some adviser.' _E_ Read this about @Lawrence.... __HTTP__ _E_ One who fears failure limits his activities. Failure is only the opportunity to more intelligently begin again. Henry Ford _E_ The Mayor of Baltimore said she wanted to give the rioters space to destroy another real genius! _E_ Exclusive–Donald Trump: Obama 'Totally Out Negotiated' by Iran Taliban 'Virtually Every Country in the World' __HTTP__ _E_ Lance Armstrong is having a breakdown. What is he doing—his life is now officially over! _E_ My @gretawire interview __HTTP__ _E_ Can't believe these totally phoney stories 100% made up by women (many already proven false) and pushed big time by press have impact! _E_ If the election were based on total popular vote I would have campaigned in N.Y. Florida and California and won even bigger and more easily _E_ Bob Turner great guy great businessman will be a great Congressman. Was happy to help him win. _E_ Is this boring or is it just me? #Oscars _E_ The reason for the plan negotiated between the Republicans and Democrats is that we need 60 votes in the Senate which are not there! We.... _E_ Via The Political Insider: "Donald Trump Just Received The Best News Possible!" __HTTP__ _E_ Entrepreneurs: Stay focused and be tenacious. Pay attention to people who know what they're talking about. Stay fixed on your goals! _E_ Housing prices will be going up big league a great time to buy good luck! _E_ I will be on Meet the Press with Chuck Todd on NBC this morning. Enjoy! __HTTP__ _E_ My new book tells some harsh truths and lays out some bold plans. Time for America to be #1 again. 
#TimeToGetTough _E_ Thank you Anaheim California!#Trump2016 __HTTP__ _E_ We agree @POTUS SHE'LL (Hillary Clinton) SAY ANYTHING & CHANGE NOTHING. IT'S TIME TO TURN THE PAGE President Obama _E_ Why has @BarackObama allowed the Muslim Brotherhood to visit the @whitehouse? What Hope & Change! _E_ The real unemployment rate according to the CBO is 15% __HTTP__ @BarackObama's economic recovery is all Hope _E_ Obama called August's job report progress. Overall 96K new jobs & over 173K new people on food stamps __HTTP__ _E_ I am leaving China for #APEC2017 in Vietnam. @FLOTUS Melania is staying behind to see the zoo and of course the Great WALL of China before going to Alaska to greet our AMAZING troops. _E_ I'm a conservative but the weakness of conservatives is that they destroy each other whereas liberals unite to win. _E_ RT @joshrogin: Pence is right. Clinton & Obama tried to negotiate an Iraq troop extension but failed. Bush admin always anticipated such an... _E_ LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_ Tonight's official count 7943. An all time record for the Anderson Civic Center in SC! Thanks! #Trump2016 __HTTP__ _E_ Thanks to everyone for your kind birthday wishes very nice! _E_ The Fed's reckless monetary policy is going to create record inflation. _E_ Republicans better start listening to and respecting the Tea Party! _E_ .@Kstupples Thanks for the nice comments on Trump National Doral. I've long been your fan—now am an even bigger fan! @TrumpDoral _E_ I encourage EVERYONE in the path of #HurricaneIrma to heed the advice and orders of local & state officials! __HTTP__ _E_ People are anxiously awaiting my decision as to who the next head of the Fed will be.... __HTTP__ _E_ Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau _E_ The irony is that the Freedom Caucus which is very pro life and against Planned Parenthood allows P.P. 
to continue if they stop this plan! _E_ 3 Republicans and 48 Democrats let the American people down. As I said from the beginning let ObamaCare implode then deal. Watch! _E_ A message from @IvankaTrump! #SCPrimary #VoteTrumpSC #MakeAmericaGreatAgain Video: __HTTP__ __HTTP__ _E_ Congratulations to @TrumpIntRealty for the two top rentals in 2013! __HTTP__ #TIRNYC _E_ Will the Keystone XL pipeline finally be approved? Will create over 100000 jobs and make us more energy independent. _E_ "Sometimes people spend too much time focusing on problems instead of focusing on opportunities." – Think Like A Champion _E_ I don't consider writing books a small venture...writing books is essentially a sharing experience. @MidasTouch @theRealKiyosaki _E_ My @SquawkCNBC #TrumpTuesday interview discussing QE3 @MittRomney's leaked comments Middle East & US oil capability __HTTP__ _E_ We will all have fun and hopefully learn something tonight. I will shoot straight and call it as I see it both the good and the bad. Enjoy! _E_ "Any political leader who won't face the future head on is putting the American Dream at risk." – The America We Deserve _E_ This is the best deal the Republicans could get? _E_ With 15% US real unemployment and a 16T debt @Michelle Obama's luxurious Aspen vacation her 16th cost us over $1M __HTTP__ _E_ "Do what you can with what you have where you are." Theodore Roosevelt _E_ Take action every day and stay focused for the long haul." Think Big _E_ Stop the assault on American values. Stand w/ Trump to #MakeAmericaGreatAgain!#VotersSpeak: __HTTP__ __HTTP__ _E_ Defund it or own it. If you fund it you're for it. @SenMikeLee _E_ You can have the best product in the world but if people don't know about it it's not going to be worth much. The Art of the Deal _E_ Crime and killings in Chicago have reached such epidemic proportions that I am sending in Federal help. 1714 shootings in Chicago this year! _E_ Via @LasVegasSun by Eugene R. 
Dunn: "Impeach Obama and elect Trump" __HTTP__ _E_ CORRUPT with the national security leaks and Fast & Furious there are clearly at least two cover ups in @BarackObama's White House. _E_ Ted Cruz is mathematically out of winning the race. Now all he can do is be a spoiler never a nice thing to do. I will beat Hillary! _E_ The Fake News is now complaining about my different types of back to back speeches. Well their was Afghanistan (somber) the big Rally..... _E_ Join me in Dallas Texas on Thursday!#AmericaFirst #Trump2016 __HTTP__ __HTTP__ _E_ I will be live tweeting! _E_ Obama just bought the Afghan Police $288M in ammo __HTTP__ Make no mistake some of these will be shot at our troops. _E_ #SecondAmendment #2A#Debates __HTTP__ _E_ Yesterday was another big day for jobs and the Stock Market. Chrysler coming back to U.S. (Michigan) from Mexico and many more companies paying out Tax Cut money to employees. If Dems won in November Market would have TANKED! It was headed for disaster. _E_ Totally made up facts by sleazebag political operatives both Democrats and Republicans FAKE NEWS! Russia says nothing exists. Probably... _E_ 'Huma Abedin told Clinton her secret email account caused problems' __HTTP__ _E_ RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_ Only by enlisting the full potential of women in our society will we be truly able to #MakeAmericaGreatAgain... __HTTP__ _E_ .@EdWGillespie will totally turn around the high crime and poor economic performance of VA. MS 13 and crime will be gone. Vote today ASAP! _E_ Today it was my great honor to proclaim January 15 2018 as Martin Luther King Jr. Federal Holiday. I encourage all Americans to observe this day with appropriate civic community and service activities in honor of Dr. King's life and legacy. __HTTP__ _E_ Whether you like Obama or not Bob Gates turned out to be one disloyal dude! Personally I hate rats. 
_E_ Looking forward to being @TrumpSoHo this evening for Corporate Meeting Planners reception for Trump National Doral @TrumpDoral _E_ .@antbaxter Thanks for helping promote & make Trump International Golf Links Scotland so successful you stupid fool! _E_ The ISIS thug who murdered American journalist James Foley may have been Gitmo detainee __HTTP__ If so why was he released? _E_ "You miss 100% of the shots you don't take." Wayne Gretzky _E_ Joe Girardi did a great job of managing the Yankees this series. _E_ BarackObama set a record deficit last February $229 billion while borrowing 42 cents of every dollar it spent. @BarackObama is reckless. _E_ Kasich has already spent $6 million on ads in New Hampshire and his numbers have gone down. People from NH are smart! _E_ Great article on so called climate change formerly known as global warming. __HTTP__ _E_ ISIS just claimed the Degenerate Animal who killed and so badly wounded the wonderful people on the West Side was their soldier. ..... _E_ Thank you Evansville Indiana! #MakeAmericaGreatAgain __HTTP__ _E_ Crooked Hillary Clinton is 100% owned by her donors. #ImWithYou #MAGA __HTTP__ _E_ Putin says Russia can't allow a weakening of its nuclear deterrent—U.S. wants to reduce—are we crazy? _E_ We have to combat the welfare mentality that says individuals are entitled to live off taxpayers. #TimeToGetTough _E_ Remember how @ObamaCare did not have any tort reform? Now the trial lawyers are getting ready for even more lawsuits __HTTP__ _E_ Gee @meetthepress with @chucktodd was getting terrible ratings then with me he set records I saved his job but Chuck still not nice! _E_ Will be on @bloombergtv tomorrow with @sruhle. Enjoy! _E_ .@VattenfallGroup couldn't sell its money losing Aberdeen windfarm—so @AlexSalmond forced phony extension. @AberdeenCC @Aberdeenshire _E_ Work is fun deals are fun life is fun but love of a great family makes it all come together. Go out there and make your family proud. 
_E_ __HTTP__ _E_ Looks like a very good World Series game! _E_ RT @TeamTrump: We need STRONG BROAD SHOULDERED leadership like @mike_pence & @realDonaldTrump in the White House! #VPDebate #BigLeagueTrut... _E_ #2. Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_ At this point the legacy of the Obama Administration will be sadly that of THE GANG THAT COULDN'T SHOOT STRAIGHT what a pathetic mess! _E_ .@StephenBaldwin7 thinks @TheRealMarilu is ping ponging all over the place. Do you agree? #CelebApprentice _E_ Make no mistake Obamacare is the first step towards changing our health system into single payer. Just a disaster. _E_ A lot of complaints from people saying my name is not on the ballot in various places in Florida? Hope this is false. _E_ If Jeb Bush were more competent he could not have lost the skirmish with Marco in the debate. BAD facts for Marco if properly delivered! _E_ Via @BBCScotland: "Donald Trump's name 'will boost @TrumpTurnberry '" __HTTP__ _E_ With the whacko pervert Weiner about to be embarrassed all women need to be on the lookout. Sexting begins 9.11 @ 12:01 AM _E_ Obama's spending and borrowing is burying America and destroying our children's future. Does he even care? _E_ #TrumpVlog Make our country great again! __HTTP__ _E_ Amazing how fast all of Joe Paterno's friends abandoned him. They ran for the hills. _E_ What did we get for fighting in Libya besides a dead Ambassador. Demand their oil. _E_ In Hudson Valley @TrumpNationalNY's course has pristine fairways tour caliber greens & 64 strategic sand bunkers __HTTP__ _E_ House GOP wants to cut Medicare Obama took $500 billion from Medicare for Obamacare. Both Wrong! _E_ Via Union Leader: Trump leads tribute for slain journalist James Foley | New Hampshire First Amendment Awards __HTTP__ _E_ Funny how the failing @nytimes is pushing Dems narrative that Russia is working for me because Putin said Trump is a genius. America 1st! _E_ Is everything ok over there @Salon? 
I actually got some good press from them today. _E_ Iran has never had a better friend than Obama. _E_ Remember the golden rule of negotiating: He who has the gold makes the rules. _E_ The Fed must be reined in. In 2011 the Fed bought 61% percent of US debt even more than 2008. Unsustainable! __HTTP__ _E_ Ben Carson wants to abolish Medicare I want to save it and Social Security. _E_ NIELSEN RATINGS: 1.@ThisWeekABC 2.52 viewers 6 SHR1.91RTG .55 25 54 2.@meetthepress 2.24 total viewers 5 SHR1.61RTG .47 25 54 _E_ Barney Frank looked disgusting nipples protruding in his blue shirt before Congress. Very very disrespectful. _E_ Wonderful Frank Gifford has just passed away at age 84. He was my friend and a truly great guy! Warmest condolences to family. _E_ Only 10 more days until the premiere of All Star @ApprenticeNBC. On March 3rd at 9PM EST @NBC the fireworks return to the Board Room! _E_ To Tom Brady @patriots and Gisele Best wishes on the birth of your daughter. Tom is a great player and great friend. _E_ "The aesthetic the quality has to be carried all the way through." Steve Jobs _E_ Miss USA Tara Conner will not be fired I've always been a believer in second chances. says Donald Trump _E_ Little Andy Lassner who lives his life through Ellen and has nothing else going for himself is having a really bad night! #Oscars _E_ Please tell me what is going on with the Republicans? _E_ The U.S. has been talking to North Korea and paying them extortion money for 25 years. Talking is not the answer! _E_ .@FoxNews Objectified tonight at 10:00 P.M. Enjoy! _E_ Serious doubt in Illinois as to whether or not Cruz can run for President. First of many challenges. __HTTP__ _E_ RT @HeyTammyBruce: Coming up at 720a ET on @foxandfriends! See you there! #maga _E_ The reporting at the failing @nytimes gets worse and worse by the day. Fortunately it is a dying newspaper. 
_E_ By popular demand I will be tweeting during tomorrow's record 14th season premiere of @ApprenticeNBC on @nbc at 9/8c __HTTP__ _E_ .@jimmyfallon regularly features @ApprenticeNBC contestants on his show. We love his support & he's a terrific host.Tonight: Omarosa. _E_ What a shock – higher taxes are slowing retail spending __HTTP__ Wait until 2014 when Obama Care is fully implemented. _E_ I'm surprised that Gabriel Aubry has settled so quickly and easily with Halle—in the long run it was a wise decision. _E_ Now is no time to cut military spending. We must remain strong. Our enemies are looking for weakness. I'm i... (cont) __HTTP__ _E_ Can you imagine we spend billions of dollars protecting Saudi Arabia and now the King refuses to even meet with Obama. Great leadership! _E_ Heading to D.C. to see and hear ROLLING THUNDER. Amazing people that LOVE OUR COUNTRY. Great spirit! _E_ "Do whatever it takes to improve your public speaking skills. You'll absolutely need them." – Midas Touch _E_ If you're going through hell keep going. Winston Churchill _E_ Obama opposes sanctions on Iran __HTTP__ They are laughing at Kerry & Obama! _E_ I had a very respectful conversation with the widow of Sgt. La David Johnson and spoke his name from beginning without hesitation! _E_ Will be doing a sit down interview with @JakeTapper @CNN on Sunday morning at 9:00. Tough questions and hopefully very good answers! _E_ Bernie Sanders started off strong but with the selection of Kaine for V.P. is ending really weak. So much for a movement! TOTAL DISRESPECT _E_ If China had a tenth of the natural resources we do then they would already be energy independent. Instead we continue to buy oil from OPEC. _E_ Washington (D.C.) is such a mess nothing works! I will MAKE AMERICA GREAT AGAIN! It's not going to happen with anyone else. _E_ Looks like Anthony Weiner Is through most recent poll has him deeply in last place. GOOD NEWS _E_ The trip by @VP Pence was long planned. 
He is receiving great praise for leaving game after the players showed such disrespect for country! _E_ I am officially running for President of the United States. #MakeAmericaGreatAgain __HTTP__ _E_ Looking forward to live tweeting during the rest of the debates. Will be a lot of fun. _E_ Jeb Bush never uses his last name on advertising signage materials etc. Is he ashamed of the name BUSH? A pretty sad situation. Go Jeb! _E_ "Trump Rally: Stocks put 2017 in the record books" __HTTP__ _E_ With the strategy that I announced today we are declaring that AMERICA is in the game and AMERICA is DETERMINED to WIN!OUR FOUR PILLARS OF NATIONAL SECURITY STRATEGY: __HTTP__ _E_ I look very much forward to meeting Prime Minister Theresa May in Washington in the Spring. Britain a longtime U.S. ally is very special! _E_ "Do you want to know who you are? Don't ask. Act! Action will delineate and define you." Thomas Jefferson _E_ In the 10:30 PM ET lead in to local news @ApprenticeNBC delivered a 31 percent margin of victory... _E_ A lot of call ins about vote flipping at the voting booths in Texas. People are not happy. BIG lines. What is going on? _E_ Now @BarackObama is telling donors he will need to 'revisit' healthcare in his 2nd term __HTTP__ _E_ FAKE NEWS media which makes up stories and sources is far more effective than the discredited Democrats but they are fading fast! _E_ I love show Law and Order but the @MRbelzer casting is the worst ever. No talent unwatchable! _E_ I can't believe Mitch McConnell isn't way up in the Kentucky polls. Massive seniority brings so much power and status to State. Brings K.$'s _E_ In war the elememt of surprise is sooooo important.What the hell is Obama doing. _E_ Great decision by Donald Graham @Newsweek to sell. I'll now have to take my newsweek covers off the wall. _E_ Great – we are sending even more F 16's to the Muslim Brotherhood in Egypt __HTTP__ This is a total disaster. _E_ VOTER REGISTRATION DEADLINES TODAY. 
You can register now at: __HTTP__ and get out to... __HTTP__ _E_ When Mitt Romney asked me for my endorsement last time around he was so awkward and goofy that we all should have known he could not win! _E_ Just out Nevada poll shows Jeb Bush at 1% he should take his dumb mouthpiece @LindseyGrahamSC and just go home. _E_ Who's your pick @bretmichaels or @hollyrpeete ? Vote now on Ivanka's new Facebook page! __HTTP__ _E_ Why didn't Hillary Clinton announce that she was inappropriately given the debate questions she secretly used them! Crooked Hillary. _E_ Once the ISIS thug who beheaded Foley is identified 100% he should be bunker busted to hell. _E_ Look at the editorial I was just sent from the NY Post on 9/14/01 3 days after collapse of WTC. Any apologies? __HTTP__ _E_ You get what you vote for. US credit rating is about to be downgraded once again __HTTP__ _E_ In the spirit of transparency Obama should immediately release the 9.11 tape of Tyrone Woods pleading for military support in Benghazi. _E_ Negotiation: Think about what the other side wants. Know where they're coming from. Don't underestimate them. Create a win/win situation. _E_ We must immediately stop all air traffic coming from the Ebola infected areas of Africa—before it is too late. _E_ China Russia and Iran are laughing at us. We have weak leaders who are threatening our national security. Dangerous times. _E_ .@robbreport Best 2013 Golf Courses: Trump Int'l Golf Links Scotland. Great honor great magazine—thanks! __HTTP__ _E_ Wow did you see how badly @CNN (Clinton News Network) is doing in the ratings. With people like @donlemon who could expect any more? _E_ Will be on #Hannity @ 10pE @FoxNews discussing various subjects including immigration if elected we will #BuildTheWall & enforce our laws! _E_ Join me Monday in Columbus Ohio & Harrisburg Pennsylvania! #MAGA3pm in OH: __HTTP__ in PA: __HTTP__ _E_ The Great Irish Links Challenge @Trump_Ireland & Lahinch Golf Club is coming this June. 
Don't miss it. __HTTP__ #Doonbeg _E_ I have helped many friends and colleagues in their business ventures. They always thank me after they succeed. #MIDASTOUCH _E_ Thank you Indiana! #Trump2016 __HTTP__ _E_ May God be w/ the people of Sutherland Springs Texas. The FBI & law enforcement are on the scene. I am monitoring the situation from Japan. _E_ ...get things done at a record clip. Many big decisions to be made over the coming days and weeks. AMERICA FIRST! _E_ I will have set the all time record in primary votes in the Republican party despite having to compete against 17 other people! _E_ Being successful requires nothing less than 100% of your concentrated effort. Be totally focused. _E_ The United States cannot continue to make such bad one sided trade deals. There are only so many jobs we can give up. No more! _E_ Crooked Hillary is spending big Wall Street money on ads saying I don't have foreign policy experience yet look what her policies have done _E_ All predictions re: my 12 o'clock release are totally incorrect. Stay tuned! _E_ ....is making. Working very hard on TAX CUTS for the middle class companies and jobs! _E_ Via @RadioIowa by @okayhenderson: "Trump touts business career but not TV show during Iowa speech" __HTTP__ _E_ RT @GOPChairwoman: The Trump Inaugural Committee is donating $3 million in surplus funds to victims of the latest hurricanes. __HTTP__ _E_ Via @NYDailyNews by @klnynews: "Donald Trump wins lawsuit against Joint Commission on Public Ethics" __HTTP__ _E_ Offering true luxury @Trump_Charlotte has spectacular restaurants Olympic pools & six professional tennis courts __HTTP__ _E_ Jeb Bush gave five different answers in four days on whether or not we should have invaded Iraq.He is so confused.Not presidential material! _E_ The Blue Monster @TrumpDoral was a sensation over the weekend. Really tough but players & critics alike loved it. _E_ Happy #CincoDeMayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics! 
__HTTP__ __HTTP__ _E_ Jared Kushner did very well yesterday in proving he did not collude with the Russians. Witch Hunt. Next up 11 year old Barron Trump! _E_ Towering over trendy Bay Street @TrumpTO offers 118 stunning condominiums w/ multi angle views & elite amenities __HTTP__ _E_ Why do shows have @ananavarro—Ntl Hispanic Chair for the losing McCain '08 & Huntsman '12. She's a loser who doesn't deliver votes. _E_ Republicans don't extend the debt ceiling—make the great deal now! _E_ How much BAD JUDGEMENT was on display by the people in DNC in writing those really dumb e mails using even religion against Bernie! _E_ #ImWithYou __HTTP__ _E_ Thanks for all of the great support but I just don't see myself wanting to run for Governor of New York I have something else in mind! _E_ Something very important and indeed society changing may come out of the Ebola epidemic that will be a very good thing: NO SHAKING HANDS! _E_ The Fake News is going crazy with wacky Congresswoman Wilson(D) who was SECRETLY on a very personal call and gave a total lie on content! _E_ I am in Iowa today great STATE fantastic PEOPLE! Many speeches big crowds all sold out! MAKE AMERICA GREAT AGAIN! _E_ Honest Omarosa: she won't backstab she'll come at you from the front. _E_ Congresswoman Jennifer Gonzalez Colon of Puerto Rico has been wonderful to deal with and a great representative of the people. Thank you! _E_ Just arrived at Trump National Doral saying hello to all the great players. This place is amazing.Come Thursday & see for yourselves! _E_ Today I officially declared my candidacy for President of the United States. Watch the video of my full speech __HTTP__ _E_ Maybe some of the dead voters who helped get President Obama elected can be brought back to life after signing up for ObamaCare. _E_ In my opinion one of the worst utility companies in the country is Florida Power and Light. 
_E_ Katie Couric the third rate reporter who has been largely forgotten should be ashamed of herself for the fraudulent editing of her doc. _E_ .@GlennBeck got fired like a dog by #Fox. The Blaze is failing and he wanted to have me on his show. I said no because he is irrelevant. _E_ My people caught the person who committed forgery of the James Gandolfini Obama Care phoney quote attributed to me fraud. Arrest coming? _E_ South Carolina voters have the future of our country in their hands. Vote now (today) and MAKE AMERICA GREAT AGAIN! _E_ Just arrived in West Virginia for a MAKE AMERICA GREAT AGAIN rally in Huntington at 7:00pmE. Massive crowd expected tune in! #MAGA _E_ Now Assad is demanding that Obama stop supporting the rebels before he turns over his chemical weapons. What a mess! _E_ Thank you. __HTTP__ _E_ Congratulations to Alyssa Campanella Miss California our new MIss USA! __HTTP__ _E_ My @todayshow int. with @MLauer announcing the January 4th premiere & cast of the 14th season of @ApprenticeNBC __HTTP__ _E_ Thank you Piers for the wonderful article and also great writing. @piersmorgan __HTTP__ _E_ Red line statement was a disaster for President Obama. _E_ ...come down hard tax the hell out of their imports and reduce our deficit fast. _E_ Via @AP: Donald Trump returns to the 'Apprentice' boardroom __HTTP__ _E_ The Democrats only want to increase taxes and obstruct. That's all they are good at! _E_ One of the world's tallest buildings @TrumpChicago is not only a 5 star hotel but has 5 star dining options __HTTP__ _E_ Agreed! __HTTP__ _E_ Mexico will pay for the wall! _E_ like the 116% hike in Arizona. Also deductibles are so high that it is practically useless. Don't let the Schumer clowns out of this web... _E_ And finally Cruz strongly told thousands of caucusgoers (voters) that Trump was strongly in favor of ObamaCare and choice a total lie! _E_ The issue of kneeling has nothing to do with race. 
It is about respect for our Country Flag and National Anthem. NFL must respect this! _E_ Hillary Clinton Dominates the Pack in Fake Twitter Followers __HTTP__ _E_ Another Obama disaster __HTTP__ _E_ Republicans should not be giving Obama fast track authority on trade. The Trans Pacific Partnership will squeeze our manufacturing sector _E_ Edddie24 Mr. Trump is a real American patriot. You have my vote if you ever ran. 👍 Thank you. _E_ Hillary Clinton reaches new low. #TrumpVlog __HTTP__ _E_ ...Overall the Academy Awards were very average at best. _E_ With all that Congress has to work on do they really have to make the weakening of the Independent Ethics Watchdog as unfair as it _E_ I want to thank all my friends in Macon for the special evening and great reception. What a crowd of incredible people! _E_ My friend Derek is a special athlete and special person there is nobody like him. @Yankees _E_ The @BarackObama administration now claims to have done everything to reduce gas prices __HTTP__ What about Keystone? _E_ Things are looking great for Karen H! _E_ Crooked Hillary launched her political career by letting terrorists off the hook. #DrainTheSwamp... __HTTP__ _E_ The Mar a Lago club in Palm Beach is one of the most successful places on earth in raising money for charity a great feeling! _E_ I will be interviewed by @jdickerson on @FaceTheNation tomorrow morning. Enjoy! #Trump2016 _E_ Great sportscaster Al Michaels a friend of mine played golf with me on Saturday morning at Trump National LA. He was in perfect shape! _E_ .@Mark_Sanchez shouldn't be too upset over @EvaLongoria. He will always do great! _E_ TRUMP TUESDAY @SquawkCNBC tomorrow at 7:30 am Tune in! _E_ Congratulations to our new Attorney General @SenatorSessions! __HTTP__ _E_ The Miami Heat is getting it's ass kicked they better start playing or it will be a long Summer for them. 
_E_ Is Fake News Washington Post being used as a lobbyist weapon against Congress to keep Politicians from looking into Amazon no tax monopoly? _E_ Home values have sunk a record 15% under Obama. _E_ For America to be great again we must have a President who has been successful and Americans can learn from on how to succeed. _E_ Donald Trump Will Be on Pennsylvania Avenue in 2016 & There's Nothing You Can Do About It __HTTP__ by @lilsarg _E_ The rolling average of jobless claims is the highest in 5 months __HTTP__ ObamaCare continues to slow growth and cost jobs. _E_ The cast of the new season of apprenticenbc. Premieres January 4th on NBC. __HTTP__ _E_ Join me in Washington today!Spokane tickets: __HTTP__ tickets: __HTTP__ __HTTP__ _E_ Congratulations to @AllenWest on winning last night's primary! _E_ WH counsel met with IRS lawyer 3x in 2012 once in September __HTTP__ But Obama just learned through news reports? _E_ The Democrats ObamaCare is imploding. Massive subsidy payments to their pet insurance companies has stopped. Dems should call me to fix! _E_ Does anyone really believe that Chuck Hagel is sorry for any of his past comments or supports Israel? _E_ One of the tallest office buildings in downtown NYC 40 Wall Street is a classic Art Deco building __HTTP__ _E_ A reader just sent me the following: I wanted to share with you something rather startling. On page 103 of (cont) __HTTP__ _E_ Breitbart gets it! Vote now @BarackObama should release his college application records and grades. He says he (cont) __HTTP__ _E_ The phony story in the failing @nytimes is a TOTAL FABRICATION. Written by same people as last discredited story on women. WATCH! _E_ Glad to hear North Carolina is solid for @MittRomney. It started trending for Mitt solidly after my speech at the @NCGOP convention. _E_ What lies behind us and what lies before us are tiny matters compared to what lies within us. Ralph Waldo Emerson _E_ Big excitement last night in the Great State of Pennsylvania! 
Fantastic crowd and people. MAKE AMERICA GREAT AGAIN! _E_ I now see John Kasich from Ohio who is desperate to run is using my line "Make America Great Again". Typical pol no imagination! _E_ Thank you for the nice words @ktmcfarland. The debate was interesting and fun. Keep up the great work! _E_ Starting tomorrow it's going to be #AmericaFirst! Thank you for a great morning Sarasota Florida!Watch here:... __HTTP__ _E_ Excellent Jobs Numbers just released and I have only just begun. Many job stifling regulations continue to fall. Movement back to USA! _E_ Four more years of weakness with a Crooked Hillary Administration is not acceptable. Look what has happened to the world with O & Hillary! _E_ Sorry losers and haters but I LOVED the great energy in Madison Square Garden during my speech. The WWE thought it was incredible it was! _E_ Bernie's exhausted he just wants to shut down and go home to bed! _E_ I am honored to be chosen by Gray Line for their NY Ride of Fame Campaign. Today we had the ribbon cutting ceremony in front of Trump Tower. _E_ True. __HTTP__ _E_ I was proud to be one of Ronald Reagan's earliest supporters. Like Reagan it's time to Make America Great Again! __HTTP__ _E_ Via @JNSworldnews by @JacobKamarasJNS: Donald Trump says he is no apprentice when it comes to Israel __HTTP__ _E_ #HasJustineLandedYet Justine what the hell are you doing are you crazy? Not nice or fair! I will support @AidForAfrica. Justine is FIRED! _E_ CAMPAIGN STATEMENT: __HTTP__ _E_ My @foxandfriends int. re: Tiger's victory at Trump @DoralResort 's @CadillacChamp my WH tour offer and CPAC __HTTP__ _E_ America is proud to stand shoulder to shoulder with Poland in the fight to eradicate the evils of terrorism and extremism. #POTUSinPoland __HTTP__ _E_ Thank you to teachers across America! When I become POTUS we will make education a far more important component of our life than it is now. 
_E_ Despite Mexico's interest in again hosting the Miss Universe Pageant it will be because of Rodolfo Rosas Moya that it will never happen. _E_ For too long we've been pushed around used by other countries and ill served by politicians in Washington who (cont) __HTTP__ _E_ MAKE AMERICA GREAT AGAIN!#AmericaFirst #ImWithYou __HTTP__ _E_ Pres. Obama is meeting with China's Pres. this week __HTTP__ He will get zero deliverables. China laughs at us. _E_ The Republicans must use the debt ceiling as leverage to make a great deal! _E_ E mails show that the AmazonWashingtonPost and the FailingNewYorkTimes were reluctant to cover the Clinton/Lynch secret meeting in plane. _E_ Just returned from Trump Doral in Miami. Massive construction job. When completed will be the best resort in U.S. Blue Monster is amazing! _E_ Why would the great people of Florida vote for a guy who as a Senator never even shows up to vote worst record. Marco Rubio is a joke! _E_ They now say using the word thug is like so many other words not politically correct (even though Obama uses it). It is racist. BULL! _E_ "US tycoon Donald Trump in talks with Ryanair to bring more flights back to Prestwick Airport" __HTTP__ via @Daily_Record _E_ Join our next Vice President @Mike_Pence in Wisconsin tonight & Michigan Thursday!MI: __HTTP__ __HTTP__ _E_ Republicans have the right approach to ObamaCare – let it fail. Free market solutions will be embraced by Americans in 2016. _E_ Don't let Obama play the Iran card in order to start a war in order to get elected be careful Republicans! _E_ .@ICEgov HSI agents and ERO officers on behalf of an entire Nation THANK YOU for what you are doing 24/7/365 to keep fellow American's SAFE. Everyone is so grateful!#LawEnforcementAppreciationDayPresident @realDonaldTrump __HTTP__ _E_ 'Small business says Trump is their pick for president' __HTTP__ _E_ America needs a President who can negotiate better deals for the American People. 
_E_ My interview on 9/13/01 with a German reporter after visiting Ground Zero __HTTP__ _E_ Trump Tuesday on @SquawkCNBC 7:30 AM is getting very good ratings as is @Foxandfriends on Mondays 7:30 AM. _E_ Via @Newsmax_Media: Robb Report: Trump Scotland Best Golf Course in the World __HTTP__ _E_ Again I have nothing to do with the Atlantic City closing I have not even been there in many years. Some press was accurate some not! _E_ Via @BloombergNews by Peter Millard: Trump Helps Rio Builders After Olympics: Corporate Brazil __HTTP__ _E_ Our country is now in serious and unprecedented trouble...like never before. _E_ If history teaches us anything it's that strong nations require strong leaders with clearly defined national (cont) __HTTP__ _E_ Great meeting w/ NATO Sec. Gen. We agreed on the importance of getting countries to pay their fair share & focus on... __HTTP__ _E_ Mitt Romney is right about the Chinese rip off of America. _E_ So Obama can host the Muslim Brotherhood Pres. Morsi in the White House __HTTP__ but doesn't have time for @netanyahu? _E_ Texas & Florida are doing great but Puerto Rico which was already suffering from broken infrastructure & massive debt is in deep trouble.. _E_ Great work being done by @FEMA @DHSgov w/state & local leaders to prepare for hurricane season. Preparedness is an investment in our future! __HTTP__ _E_ Whoever wins today remember that tomorrow we still have a country struggling. Our work is not done until America is strong again. _E_ "Experience knowledge & prescience are a formidable combination of powers. Do not underestimate any of them." Think Like a Champion _E_ Obama has missed 58% of his intelligence briefings. But our president does make 100% of his fundraisers. _E_ Top brand impact is what television is all about from the commercial standpoint—a big deal for @CelebApprentice. _E_ The charities I have designated for @billmaher's donations are: Police Athletic League New York March of Dimes Hurricane Sandy victims.... 
_E_ Must see morning clip: Donald Trump addresses Lil Wayne tweet and 'Celebrity Apprentice' __HTTP__ via @Salon _E_ My son Don will be giving the Keynote Address at The Investment Show in Sandton South Africa on Dec. 1. He's an (cont) __HTTP__ _E_ Muslim Brotherhood head of Egypt Morsi is already making demands on Obama before the WH visit. Obama's foreign policy is a complete failure. _E_ How come there are no protests in favor of the two young police officers gunned down in Mississippi by two deranged animals. DEATH PENALTY! _E_ WikiLeaks reveals Clinton camp's work with 'VERY friendly and malleable reporters' #DrainTheSwamp #CrookedHillary __HTTP__ _E_ Chicago is a shooting disaster they should immediately go to STOP AND FRISK. They have no choice hundreds of lives would be saved! _E_ Vast numbers of manufacturing jobs in Pennsylvania have moved to Mexico and other countries. That will end when I win! _E_ Remember when I said when Saddam Hussein fell the new leader of Iraq will be meaner and tougher and hate the U.S. even more. Welcome ISIS! _E_ Thanks. __HTTP__ _E_ Keystone: @johnboehner MUST pass Keystone by linking it to another bill. __HTTP__ _E_ Top suspect in Paris massacre Salah Abdeslam who also knew of the Brussels attack is no longer talking. Weak leaders ridiculous laws! _E_ Just out report: United Kingdom crime rises 13% annually amid spread of Radical Islamic terror. Not good we must keep America safe! _E_ As the nuclear crisis with Iran shows America needs to import oil from a reliable region. Keystone XL Pipeline (cont) __HTTP__ _E_ Word is that Sleepy Eyes Chuck Todd who has failed so badly with Meet the Press will be taking over for now irrelevant Brian Williams! _E_ Via @feminamissindia: "@MannyPacquiao among @MissUniverse 2015 judges" __HTTP__ _E_ The Club for Growth is a very dishonest group. They represent conservative values terribly & are bad for America. __HTTP__ _E_ Big G7 meetings today. Lots of very important matters under discussion. 
First on the list of course is terrorism. #G7Taormina _E_ ...Why did Democratic National Committee turn down the DHS offer to protect against hacks (long prior to election). It's all a big Dem HOAX! _E_ As always & due to popular demand@TrumpRink will be open Christmas eve & day as well as New Year's eve & day __HTTP__ _E_ Be sure to watch the Larry King Show tomorrow night on CNN 9 p.m. I'll be the host Larry the guest. __HTTP__ _E_ RT @WhiteHouse: Do not allow anyone to tell you that it cannot be done. No challenge can match the HEART and FIGHT and SPIRIT of America. ... _E_ IN AMERICA WE DON'T WORSHIP GOVERNMENT WE WORSHIP GOD! __HTTP__ _E_ Will be interviewed on @foxandfriends tomorrow morning Monday at 8:00. Much to talk about! _E_ NO GAMES! HOUSE @GOP MUST DEFUND OBAMACARE! IF THEY DON'T THEN THEY OWN IT! _E_ What a coincidence Michelle Obama called Kenya @BarackObama's homeland in 2008 __HTTP__ _E_ Great news @TPPatriots are starting their own Super PAC to fight @KarlRove __HTTP__ (via @thehill) Go get em! _E_ Pennsylvania is in play @MittRomney. All undecideds in Philly suburbs should ask themselves who do you trust most on @Israel? _E_ Watch listen and learn. You can't know it all yourself. Anyone who thinks they do is destined for mediocrity. Donald Trump _E_ Will be interviewed on the @TODAYshow this morning at 7:00. Talking about politics polls and whatever. Enjoy! _E_ After the litigation is disposed of and the case won I have instructed my execs to open Trump U(?) so much interest in it! I will be pres. _E_ Re negotiation: Know exactly what you want and keep it to yourself. Think about what the other side wants and where they're coming from. _E_ Success breeds success. The best way to impress people is through results. Think Like a Billionaire. _E_ .@MittRomney can only speak negatively about my presidential chances because I have been openly hard on his terrible choke loss to Obama! _E_ .@Borisep was great on @JudgeJeanine tonight. 
Very smart commentary that will prove to be correct! _E_ protesters and the tears of Senator Schumer. Secretary Kelly said that all is going well with very few problems. MAKE AMERICA SAFE AGAIN! _E_ Everything comes to him who hustles while he waits. Thomas A. Edison _E_ Success is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_ Glad to see that Sacha Baron Cohen's new movie is not only a dud but not too good at the box office. He is talentless. @Sacha_B_Cohen _E_ Going to CPAC! _E_ Obama can kill Americans at will with drones but waterboarding is not allowed—only in America! _E_ Who do you think is going home? #CelebApprentice _E_ Stay confident even when something bad happens. It is just a bump in the road. It will pass. Think Big _E_ "@BarackObama may have been a good 'community organizer' but the man is a lousy international dealmaker." #TimeToGetTough _E_ We need a tax system that is FAIR to working families & that encourages companies to STAY in America GROW in America and HIRE in America __HTTP__ _E_ A penny saved is a penny earned. Benjamin Franklin _E_ I will be doing Fox & Friends at 7.00 will be discussing the the Donald Sterling (Clippers) MESS! _E_ Maybe Obama should donate my $5M to the families of the 17 who have lost loved ones during the storm? _E_ Even Barbara Bush agrees with me __HTTP__ _E_ Check out my interview on @GMA __HTTP__ _E_ ObamaCare has 21 tax hikes __HTTP__ There's now only one solution defeat @BarackObama this November! #GOMITT _E_ .@FoxNews You shouldn't have @KarlRove on the air—he's a clown with zero credibility—a Bushy! _E_ Happy birthday to the great @TheLeeGreenwood. You and your beautiful song have made such a difference. MAKE AMERICA GREAT AGAIN! _E_ Re: hiring contractors remember the cheapest isn't always the best. Their work may have to be redone & they may not be reliable. 
_E_ Check out the last webisode www.youtube.com/user/mattressserta in our 3 part series featuring me with Serta. Which one was your favorite? _E_ Where are the other candidates now that this tragic murder has taken place b/c of our unsafe border __HTTP__ We need a wall! _E_ If these guys have any integrity they'd say no to MSNBC a network that few watch and is very negative. @AndrewBreitbart re debate. _E_ I made my decision to allow Jenna Talackova to participate in Miss Universe Canada two days before Gloria Allred (cont) __HTTP__ _E_ I would rather run against Crooked Hillary Clinton than Bernie Sanders and that will happen because the books are cooked against Bernie! _E_ Thank you Orlando Florida! We are just six days away from delivering justice for every forgotten man woman and ch... __HTTP__ _E_ By the way Hillary & the MSM forgot to mention that Hillary is in the Al Shabaab terror video. __HTTP__ _E_ A clip of my @LibertyU speech talking about the importance of the election & our country's potential __HTTP__ via@washingtonpost _E_ .@LouDobbs just stated that President Trump's successes are unmatched in recent presidential history Thank you Lou! _E_ The failing @nytimes is greatly embarrassed by the totally dishonest story they did on my relationship with women. _E_ Let me put this as plainly as I know how: Iran's nuclear program must be stopped by any and all means necessary. Period. #TimeToGetTough _E_ I will be on @oreillyfactor tonight at 8:00. Enjoy! _E_ Getting closer and closer on the Tax Cut Bill. Shaping up even better than projected. House and Senate working very hard and smart. End result will be not only important but SPECIAL! _E_ Act as if what you do makes a difference. It does. William James _E_ Thank you. 
__HTTP__ _E_ Watch @ApprenticeNBC episode 2 online again via @nbc: "Nobody Out Thinks Donald Trump __HTTP__ _E_ If @RepMarkMeadows @Jim_Jordan and Raul_Labrador would get on board we would have both great healthcare and massive tax cuts & reform. _E_ Designed by @IvankaTrump @TrumpDoral's Deluxe Guestrooms feature impeccable furnishing and details __HTTP__ _E_ "Get to the essence immediately. Learn to economize. People appreciate brevity in today's world." – Think Like a Champion _E_ 'President elect Donald J. Trump today announced his intent to nominate Steven Mnuchin Wilbur Ross & Todd Ricketts... __HTTP__ _E_ Just like its website ObamaCare is a disaster.Maybe all those who are fighting it are wasting their time it will fail on its own! _E_ October 2015 thanks Chris Wallace @FoxNewsSunday! __HTTP__ _E_ I said that Eliot Spitzer was going to lose when he was way up in the polls. I fought him when others retreated out of fear. NEVER GIVE UP! _E_ Via @NRO: Palin Trump Get Longer Speaking Slots at CPAC by @KatrinoTrinko __HTTP__ _E_ The rallies in Utah and Arizona were great! Tremendous crowds and spirit. Just returned but will be going back soon. _E_ I wonder if I run for PRESIDENT will the haters and losers vote for me knowing that I will MAKE AMERICA GREAT AGAIN? I say they will! _E_ Trending story on Miss Utah is very unfair. She simply lost her train of thought—could happen to anyone! @MissUSA @MissUniverse _E_ Opening in 2016 Trump Hotel Rio de Janeiro will be a 13 story 171 guestroom masterpiece with a beachside view __HTTP__ _E_ Obama and Clinton told the same lie to sell #ObamaCare. #Debates2016 __HTTP__ _E_ Thank you to all of our amazing military families service members and veterans. #ImWithYou __HTTP__ _E_ You wouldn't believe how tall and beautiful @_KatherineWebb is 6'5 in heels. She is also a total winner in... __HTTP__ _E_ The crackdown on illegal criminals is merely the keeping of my campaign promise. 
Gang members drug dealers & others are being removed! _E_ MAKE AMERICA GREAT AGAIN! _E_ Remember Cruz and Bush gave us Roberts who upheld #ObamaCare twice! I am the only one who will #MAKEAMERICAGREATAGAIN! _E_ Deepest condolences to the families & fellow officers of the VA State Police who died today. You're all among the best this nation produces. _E_ Thank you to our great Police Chiefs & Sheriffs for your leadership & service. You have a true friend in the... __HTTP__ _E_ RT @DineshDSouza: Finally as if by accident the @washingtonpost breaks down & admits the truth about where the violence is coming from ht... _E_ So many great polls like Reuters big leads everywhere. New Hampshire really special! We will win big and MAKE AMERICA GREAT AGAIN! _E_ Article: More illegals enter than people born in state each week. __HTTP__ _E_ RT @DeptofDefense: #HappyThanksgiving from @USArmy and @USNationalGuard #soldiers serving with Task Force Marauder in #Afghanistan. 🦃 __HTTP__ _E_ Can't wait for @DylanByers' follow up @politico piece discussing my large Sunday news shows ratings win because of my interview! _E_ Looking for an excuse not to cook for Thanksgiving? Many NYC outlets will delivery a full meal including @TrumpSoHo __HTTP__ _E_ ...New Donna B book says she paid for and stole the Dem Primary. What about the deleted E mails Uranium Podesta the Server plus plus... _E_ Just finished two major speeches in South Carolina. Big crowds great people. Going for a third now! _E_ My thoughts on Dick Cheney and his new book... __HTTP__ #trumpvlog _E_ Join me in Carmel Indiana tomorrow at 4pm! #INPrimary __HTTP__ __HTTP__ _E_ Just leaving Nashville Tennessee. Had a great time with a fabulous crowd of people! Love Nashville back soon! __HTTP__ _E_ Under the leadership of Obama & Clinton Americans have experienced more attacks at home than victories abroad. Time to change the playbook! _E_ Dick Clark was a friend of mine he lived in one of my buildings on East 61st Street. 
Everybody loved him. He will be missed. _E_ General Kelly is doing a great job at the border. Numbers are way down. Many are not even trying to come in anymore. _E_ Sadly when it comes to using the energy industry to create American jobs Obama has been a total disaster. And (cont) __HTTP__ _E_ .@dixierhilton #asktrump __HTTP__ _E_ I have never liked the media term 'mass deportation' but we must enforce the laws of the land! _E_ My interview with @parademagazine from the Olympics 100 Day Countdown in Times Square __HTTP__ _E_ .#IranDeal will go down as one of the dumbest & most dangerous misjudgments ever entered into in history of our country—incompetent leader! _E_ .@CNN is all negative when it comes to me. I don't watch it anymore. _E_ With allies like Egypt and Libya who needs enemies?! _E_ RT @DRUDGE_REPORT: TRUMP STUMPS... __HTTP__ _E_ Make your life as groundbreaking as possible while also minding the tides and riptides around you. Think Like a Champion _E_ Nice guy @pennjillette needs your help to make his bad guy movie Directors Cut > __HTTP__ @fundanything _E_ Thank you @morningmika and @JoeNBC for all of your nice words and comments on the debate! _E_ RT @paulsperry_: Wray needs to clean house. Now we know the politicization even worse than McCabe's ties to McAuliffe/Clinton. It also infe... _E_ I love watching the dishonest writers @NYMag suffer the magazine's failure. _E_ We will never have great national security in the age of computers too many brilliant nerds can break codes (the old days were better). _E_ WATCH – WH official says that ObamaCare/RomneyCare architect Gruber was 'an important figure' in crafting the law __HTTP__ _E_ Weekly Address __HTTP__ __HTTP__ _E_ I hear @JoeNBC of rapidly fading @Morning_Joe is pushing hard for a third party candidate to run. This will guarantee a Crooked Hillary win. 
_E_ "@AP Interview: @MissUniverse Gabriela Isler reflects as her reign winds down" __HTTP__ via @YahooNews _E_ The US GDP in 2010 was 4.1% down to 2% in 2011 & now 1.5%. I guess @BarackObama's plan is not working! _E_ #CrookedHillary #ThrowbackThursday __HTTP__ _E_ 11000 inside venue tonight in Tampa! Broke record set by Elton John in 1988 w/out musical instruments! Another 5000 outside. Will be back! _E_ For beauty and flight I'll take the @Boeing 757 over the @Boeing 787 any day! _E_ Wow just heard really bad stuff about the failing @politico. How much longer will they be around? Some very untalented reporters. _E_ If Cuba is unwilling to make a better deal for the Cuban people the Cuban/American people and the U.S. as a whole I will terminate deal. _E_ For the disciples of global warming in 150 summers (years) there have been 20 heat waves as bad or worse than current this has happened b4! _E_ I will be interviewed on @GMA at 7:00 A.M. and @foxandfriends at 7:50. Talking about my new book out today Crippled America. _E_ Loved the debate last night and almost everyone said I won but the RNC did a terrible job of ticket distrbution. All donors & special ints _E_ RT @DonaldJTrumpJr: Nevada: Here is a quick video @IvankaTrump created on How to Caucus very quick and simple! __HTTP__ ... _E_ Looking forward to attending the GREAT Rev. @BillyGraham's birthday party tonight there's nobody like him! _E_ My interview with @IngrahamAngle discussing @THEHermanCain @BarackObama's mistreatment of Israel and GOP 2012. __HTTP__ _E_ The entire cast will be back for the live finale of @ApprenticeNBC Monday night at 8 PM _E_ .@JebBush is slashing campaign salaries people making millions. If he can't manage his campaign how can he manage our countries finances? _E_ Donald Trump appearing today on CNN International's 'Connect the World' as 'Connector of the Day'. Submit questions: __HTTP__ _E_ Masa said he would never do this had we (Trump) not won the election! 
_E_ I wonder when we will be able to see @BarackObama's college and law school applications and transcripts. Why the long wait? _E_ I'll be co hosting @extratv tonight. Be sure to tune in! _E_ I watched Russell Brand @rustyrockets on the @jimmyfallon show the other night—what the hell do people see in Russell—a major loser! _E_ Almost every major dealmaker has used the bankruptcy laws as a business tool... _E_ That trip would be to the Trump International Hotel Las Vegas... __HTTP__ _E_ Donald J. Trump's History Of Empowering Women #BigLeagueTruth __HTTP__ _E_ A fine man Dr. Paul F. Crouch has just passed away. All Christians are grateful for his wonderful life and work. @TBN _E_ Trump National Golf Club Charlotte is the premiere club in North Carolina. __HTTP__ Will visit tomorrow. _E_ Lyin' Ted Cruz denied that he had anything to do with the G.Q. model photo post of Melania. That's why we call him Lyin' Ted! _E_ .@billmaher has continually degraded Catholic Church on the joke he calls a show __HTTP__ Catholics should boycott HBO. _E_ My daughter Ivanka will be on @foxandfriends tomorrow morning. Enjoy! _E_ Thanks Piers. Greatly appreciated. @piersmorgan __HTTP__ _E_ .@LilJon's take on @piersmorgan seems to be a classic love hate combo. Piers can be tough and everyone knows it. #CelebApprentice _E_ What ever happened to the good old days of The Academy Awards. This show is an insult to the past just plain bad! _E_ .@TrumpDoral will be featured on @GolfChannel this morning (now). _E_ Why isn't Hillary 50 points ahead? Maybe it's the email scandal policies that spread ISIS or calling millions of... __HTTP__ _E_ Remember that I predicted a long time ago that President Obama will attack Iran because of his inability to negotiate properly not skilled! _E_ I am running against the Washington insiders just like I did in the Republican Primaries. These are the people that have made U.S. a mess! 
_E_ Boston's Mayor Walsh wasted a lot of time and money on going for the Olympics and then he gave up. I don't want him negotiating for me! _E_ 10 yrs ago today the Iraq war began. 4485 of our nation's finest have not returned home alive. Iran will soon control Iraq & its oil. _E_ Obama was guest at VP debate moderator Martha Raddatz's wedding __HTTP__ Do people think this is fair? _E_ Via @trscoop: WHOA: Trump changing venues for Saturday rally in Arizona due to OVERWHELMING RESPONSE __HTTP__ _E_ Hillary is the most corrupt person to ever run for the presidency of the United States. #DrainTheSwamp __HTTP__ _E_ They just arrested pol Shelly Silver in New York. Why aren't they arresting a far bigger crook @AGSchneiderman? _E_ Obama killed over 100k jobs by not approving Keystone XL pipeline and Canada is now selling the oil to China very dumb! _E_ Big news Budget just passed! _E_ .@ABCPolitics #GOPDebate#MakeAmericaGreatAgain #FITN __HTTP__ _E_ THANK YOU NEW YORK!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ The last time I visited China I couldn't believe all the construction. You can go up with a project in a week no red tape. _E_ Glad to hear Derek Jeter just removed his boot and is practicing on the field for @yankees. Derek is a true champion. _E_ Why doesn't @MittRomney just endorse @marcorubio already.Should have done it before NH or Nevada where he had a little sway. Too latenow! _E_ Rather than putting pressure on the businesspeople of the Manufacturing Council & Strategy & Policy Forum I am ending both. Thank you all! _E_ New Fox News PollThank you Iowa! #Trump2016 #IACaucus __HTTP__ _E_ Our new American Energy Policy will unlock MILLIONS of jobs & TRILLIONS in wealth. We are on the cusp of a true ene... __HTTP__ _E_ After spending $89 million @JebBush is at the bottom of the barrel in polls. He is ashamed to use the name Bush in ads. Low energy guy! 
_E_ There is no world problem which cannot be solved if people of good will & intelligence want it to be. _E_ Thank you Delaware! #Trump2016 #MakeAmericaGreatAgain #TrumpTrain __HTTP__ __HTTP__ _E_ .@bretmichaels and George Ross are back as advisors. Good to see them! #CelebApprentice _E_ The illegal immigrant crime problem is far more serious and threatening than most people understand. Along our (cont) __HTTP__ _E_ Capitalism requires capital. When government robs capital from investors through high taxes it takes away the (cont) __HTTP__ _E_ Our great team at @FEMA is prepared for #HurricaneNate. Everyone in LA MS AL and FL please listen to your local authorities & be safe! _E_ I left Atlantic City years ago good timing. Now I may buy back in at much lower price to save Plaza & Taj. They were run badly by funds! _E_ Business is a creative endeavor. Cultivate a sense of discovery and start thinking big. _E_ The Audacity of Ineptitude – ObamaCare website will cost over $1B __HTTP__ When will someone finally be held accountable? _E_ Offshore Wind in Europe: Lessons for the U.S. __HTTP__ via @HuffPostGreen The lesson should be that it's a lousy idea!!! _E_ .@oreillyfactor @KarlRove as per the show an even more serious Cruz charge is the fraudulent voter violation certificate sent to everyone. _E_ My @FoxNews interview from last night with @gretawire discussing yesterday's meeting with @MittRomney __HTTP__ _E_ In Bangladesh hostages were immediately killed by ISIS terrorists if they were unable to cite a verse from the Koran. 20 were killed! _E_ I will be interviewed by @SeanHannity tonight at 10pm on FOX! Enjoy! _E_ I had to fire General Flynn because he lied to the Vice President and the FBI. He has pled guilty to those lies. It is a shame because his actions during the transition were lawful. There was nothing to hide! _E_ See I told you so __HTTP__ _E_ Wow! I hear that thousands of people are cutting up their @Macys credit card. That's great. 
#MakeAmericaGreatAgain! _E_ Great basketball game going on right now! _E_ .@MonicaCrowley you were great with @SeanHannity on @FoxNews tonight. Thank you for your kind words. We will keep Americans safe. _E_ .@TrumpDoral's Red Course redesign is underway. Will be completed in September. Follow all the developments __HTTP__ _E_ Finally in the new ABC News/Washington Post Poll Hillary Clinton is down 11 points with WOMEN VOTERS and the election is close at 47 43! _E_ Price gouging at many gas stations $10 a gallon welcome to the new world. _E_ Housing prices are up in Feb over last Feb 9.3 per cent remember I told everyone two years ago to buy (but they will be going much higher) _E_ "Life is difficult no matter what but hard work and perseverance make it a lot easier." – Think Like a Billionaire _E_ A.G. Lynch made law enforcement decisions for political purposes...gave Hillary Clinton a free pass and protection. Totally illegal! _E_ RT @EricTrump: Congratulations @SeanHannity! Looking forward to being on the show tonight at 9pmET Hannity beats Maddow POLITICO __HTTP__ _E_ All I can say is that if I were President Snowden would have already been returned to the U.S. (by their fastest jet) and with an apology! _E_ Great works are performed not by strength but by perseverance. Samuel Johnson _E_ Just left Istanbul Turkey yesterday where #TrumpTowers was just opened magnificent! _E_ For the great people of Iowa find your #IACaucus location at __HTTP__ So important to vote! #MakeAmericaGreatAgain _E_ Make sure you get on the Trump line and are not mislead by the Cruz people. They are bad! BE CAREFUL. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ "If you want to be successful at anything in life you have to be able to handle pressure." – Think Big _E_ Oscar Pistorious will likely only serve 10 months for the cold blooded murder of his girlfriend. Another O.J. travesty.The judge is a moron! 
_E_ They asked me to dress as Santa Claus to open Miss Universe tonight—I'm thinking about it! _E_ I don't cheat at golf but @SamuelLJackson cheats—with his game he has no choice—and stop doing commercials! _E_ I very much look forward to tomorrow's debate in New Hampshire—so many things to say so much at stake. It will be an incredible evening! _E_ Cyberattack on White House what's next? __HTTP__ _E_ Study what General Pershing of the United States did to terrorists when caught. There was no more Radical Islamic Terror for 35 years! _E_ Boycott @Macys and @Univision. MAKE AMERICA GREAT AGAIN! _E_ The stock market is having a horrendous day bad employment numbers. _E_ Thank you Iowa Get out & #VoteTrumpPence16! __HTTP__ __HTTP__ _E_ It was my great HONOR to present our nation's highest award for a public safety officer THE MEDAL OF VALOR to FIVE AMERICAN HEROES! __HTTP__ _E_ Moderator: "Respectfully you won't answer the pay to play question." #Debate #BigLeagueTruth _E_ Via @PostSports @barrysvrluga Donald Trump has major aspirations for his Trump National Golf Club in Virginia __HTTP__ _E_ Trade with China has killed over 29% of US manufacturing jobs in the US __HTTP__ China is robbing us blind! _E_ Moderator: Hillary paid $225000 by a Brazilian bank for a speech that called for "open borders." That's a quote! #Debate #BigLeagueTruth _E_ My just filed lawsuit against Univision. Always fight back when right. #MakeAmericaGreatAgain __HTTP__ _E_ I just started construction of The Old Post Office on Pennsylvania Avenue in D.C. Many jobs. Will be finest hotel in U.S. Watch it happen! _E_ .@LaurenScruggs who was badly injured by an airplane was great on The Today Show! _E_ Thank you @DonaldJTrumpJr. Proud of you! #RNCinCLE #TrumpPence2016 __HTTP__ _E_ All Star @ApprenticeNBC premiering March 3rd on @NBC features terrific TV stars competing in the toughest tasks yet. Will be great. _E_ The dirty poll done by @ABC @washingtonpost is a disgrace. 
Even they admit that many more Democrats were polled. Other polls were good. _E_ China's currency manipulation is one of our nation's greatest sovereign threats. The yuan has appreciated 40% against our dollar since 2005. _E_ Thanks Piers. __HTTP__ _E_ Despite all of China's cheating they are not doing that well we can beat them our country has great potential! _E_ Someone incorrectly stated that the phrase DRAIN THE SWAMP was no longer being used by me. Actually we will always be trying to DTS. _E_ Negotiations 101: The best deals you can make are the ones you walk away from...and then get them with better terms. _E_ My interview on @ThisWeekABC with @GStephanopoulos had a 40%+ ratings increase over same Sunday last year. 20% over last week. _E_ Many many people are thanking me for what I said about @autism & vaccinations. Something must be done immediately. _E_ The military generals are fuming at Obama. He has boxed them in against ISIS with a strategy that is destined to fail. Sad! _E_ In beautiful Pine Hill Trump Nat'l Philadelphia's award winning course provides amazing views of Philly skyline __HTTP__ _E_ If it were up to goofy Elizabeth Warren we'd have no jobs in America—she doesn't have a clue. _E_ President Obama has just reached an ALL TIME low approval rating! Is anybody surprised? The happiest person is former President Jimmy Carter _E_ Rumor has it that @politico is going out of business. Losing too much money. Great news! Likewise dopey Mort Zuckerman's @NYDailyNews _E_ We need a PRESIDENT with strength stamina heart and incredible deal making skill if our country is ever going to be able to prosper again! _E_ Why would Ohio listen to Bruce Springsteen reading his lines? Be careful or I will go to Ohio and @MittRomney will win it! _E_ .@FreeJesseJames Just read your complete statement. You are an amazing guy & I really appreciate your words & support. I will see you soon! 
_E_ This whole Super PAC scam is very unfair to a person like me who has disavowed all PAC's & is self funding. _E_ Again illegal immigrant is charged with the fatal bludgeoning of a wonderful and loved 64 year old woman. Get them out and build a WALL! _E_ RT @FiIibuster: @realDonaldTrump We have a President that is putting the security and prosperity of America first. Thank you President Tru... _E_ Carter Banned Iranians From Coming To U.S. During Hostage Crisis __HTTP__ _E_ Remember tonight (Monday) the second and third episodes of The Apprentice are on at 8:00 & 9:00. Great ratings last night 18 49. FUN! _E_ I look forward to being in South Carolina tomorrow a total sellout crowd! _E_ .@dbongino You were fantastic in defending both the Second Amendment and me last night on @CNN. Don Lemon is a lightweight dumb as a rock _E_ .@jasondhorowitz I am very proud of my sister your story was terrific. Thank you so much. _E_ Via @411mania: Donald Trump Comments on a Return to Wrestling __HTTP__ _E_ From Donald Trump: Wishing everyone a wonderful holiday & a happy healthy prosperous New Year. Let's think like champions in 2010! _E_ Always fun to read the @NewYorkObserver investigative piece re @AGSchneiderman his mascara and more! __HTTP__ _E_ Why would anybody listen to @MittRomney? He lost an election that should have easily been won against Obama. By the wayso did John McCain! _E_ Professional anarchists thugs and paid protesters are proving the point of the millions of people who voted to MAKE AMERICA GREAT AGAIN! _E_ Oil is starting to rise again despite the horrible times. OPEC continues to rip us off. Not worth $30. New leadership needed. 
_E_ Via @LinkedInPulse by @nicholas_wyman: "What All Hiring Managers Can Learn from Donald Trump" __HTTP__ _E_ Treat yourself to the pinnacle of luxury public golf at @TrumpGolfLA's white sand $250M premiere course __HTTP__ _E_ An excerpt of my @TheBrodyFile interview at the Sarasota 'Statesman of the Year' dinner discussing the Tea Party __HTTP__ _E_ RT @AmericaFirstPol: MAJOR IMPACT: @POTUS Trump is 50 Days in and moving swiftly to get America back on the right track. #MAGA __HTTP__ _E_ He makes a mistake every hour every day admits @BarackObama. __HTTP__ The problem is that we are paying for them. _E_ I am going to repeal and replace ObamaCare. We will have MUCH less expensive and MUCH better healthcare. With Hillary costs will triple! _E_ RT @FoxNewsSunday: Sunday our exclusive interview with President elect @realDonaldTrump Watch on @FoxNews at 2p/10p ET Check your local... _E_ 70 stories over Panama Bay @TrumpPanama's deluxe rooms feature private balconies to enjoy the ocean views __HTTP__ _E_ Why aren't we getting any oil from Iraq before we leave? We are leaving the country wide open for Iran. Big mistake. _E_ I am in Scotland checking on my developments in Aberdeen and Turnberry. Just left Ireland property will be great. ALWAYS CHECKING! _E_ Go to @Macys now to see the incredible new selection of Trump Signature Collection ties shirts and suits. _E_ Sometimes we do things to build up experience and stamina to prepare but it's to prepare us for something bigger. _E_ Why would @BarackObama be spending millions of dollars to hide his records if there was nothing to hide? _E_ "Effective leadership is putting first things first. Effective management is discipline carrying it out." Stephen Covey _E_ The perfect Hawaiian getaway @TrumpWaikiki's 462 luxury guest rooms and suites each have spectacular views __HTTP__ _E_ My @TheBrodyFile int. 
discussing the persecution of Christians in the Middle East & Religious Liberty & Freedom __HTTP__ _E_ With our $250M in renovations @TrumpDoral offers a wide array of courses restored to perfection __HTTP__ _E_ Trump Victorious in Fort Lauderdale Litigation __HTTP__ _E_ "To keep your goals alive you must take action every day. No one should care about your money and success more than you do." Think Big _E_ #1. Be passionate you have to love what you're doing to be successful at it. _E_ We've just set a new goal: raise $4 million from our grassroots supporters by MIDNIGHT! __HTTP__ __HTTP__ _E_ Ronald Kessler's new book The Secrets of the FBI is a great book that should be read by everyone. _E_ (2/2) David brilliantly tells it like it is the real deal! Read it! __HTTP__ _E_ The @nyjets are going to have a terrific season. @Mark_Sanchez & @TimTebow will do great things on the field. _E_ I'd bet a good lawyer could make a great case out of the fact that President Obama was tapping my phones in October just prior to Election! _E_ Focus on your goals not your problems. Don't tread water. Get out there and go for it. _E_ An idealist is a person who helps other people to be prosperous. Henry Ford _E_ When will the Democrats give us our Attorney General and rest of Cabinet! They should be ashamed of themselves! No wonder D.C. doesn't work! _E_ Great parent teacher listening session this morning with @VP Pence & @usedgov Secretary @BetsyDeVos. Watch:... __HTTP__ _E_ I will be doing @oreillyfactor tonight at 8:00pmE from Mesa Arizona will be talking about the #GOPDebate & more. __HTTP__ _E_ My @SquawkCNBC interview discussing why I don't own Facebook stock and running a tough campaign against @BarackObama __HTTP__ _E_ I am soooo proud of my children Don Eric and Tiffany their speeches under enormous pressure were incredible. Ivanka intros me tonight! _E_ Basically nothing Hillary has said about her secret server has been true. 
#CrookedHillary _E_ .@megynkelly the @FoxNews poll said very plainly I came in second in the debate. All others Time Drudge Slate etc. said I came in 1st. _E_ Greta in a few minutes will be interesting! _E_ The ratings at @FoxNews blow away the ratings of @CNN not even close. That's because CNN is the Clinton News Network and people don't like _E_ I am very worried that if @BarackObama is re elected then Medicare will be destroyed. We must take care of our seniors. _E_ I was on CNBC this morning talking about the market and America's financial future __HTTP__ _E_ If you can't focus with unyielding resolve then you will never be successful. Believe in yourself and you can accomplish your goals. _E_ I have great confidence in King Salman and the Crown Prince of Saudi Arabia they know exactly what they are doing.... _E_ Bernie Sanders must really dislike Crooked Hillary after the way she played him. Many of his supporters because of trade will come to me. _E_ Stop congratulating Obama for killing Bin Laden. The Navy Seals killed Bin Laden. #debate _E_ The crowning moment – Conneticut's Erin Brady winning @MissUSA 2013 __HTTP__ _E_ CPAC attendees & fellow patriots lines for my @CPACnews start at 7:00AM outside the Potomac Ballroom. Make sure to get there early! _E_ Henry McMaster Lt. Governor of South Carolina who endorsed me beat failed @CNN announcer Bakari Sellers so badly. Funny! _E_ Lightweight @AGSchneiderman is pushing for the Moreland Commission to be disbanded immediately because he is being looked at! _E_ I was standing with @SHAQ when a young high school star Kevin Garnett @Celtics said to a crowd Forget Shaq I want to meet Donald Trump. _E_ Glad that @MittRomney is hitting @BarackObama on ending work requirements for welfare. Obama attacks the American work ethic. _E_ Thank you to General John Kelly who is doing a fantastic job and all of the Staff and others in the White House for a job well done. 
Long hours and Fake reporting makes your job more difficult but it is always great to WIN and few have won more than us! _E_ As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_ Sec of State Kerry said we would not go back to Iraq. We shouldn't but he should not have said that. So stupid! _E_ A nurse in Dallas who treated Ebola patient Thomas Duncan was allowed to fly to Cleveland.She should never have been so allowed! The real JV _E_ Dress your best! The Trump Signature Collection exclusively available @Macys offers the tops style in menswear __HTTP__ _E_ Scary while @BarackObama has been POTUS for 1.6% of America's history he has amassed 33.3% of the total debt. _E_ Obama betrays Israel yet again our strongest ally in the Middle East. He will recognize Hamas breaking long standing US policy. _E_ Looking forward to next week's unveiling of the Red Tiger @TrumpDoral. An 18 hole masterpiece w/two island greens __HTTP__ _E_ I am proud of the Tea Party. These great patriots have accomplished so much in strengthening our country in only 3 short years. _E_ President Donald J. Trump Proclaims October 9 2017 as #ColumbusDay __HTTP__ _E_ The best thing you can do is deal from strength and leverage is the biggest strength you have." – THE ART OF THE DEAL _E_ Thank you Newt! __HTTP__ _E_ Vince McMahon @WWE and I hold the all time ratings & pay per view record in the history of wrestling. _E_ Trust in God and be true to yourself. Mary MacLeod Trump Know everything you can about what you're doing. Fred C. Trump _E_ In order to be successful especially to be very successful you must have the ability to be able to handle pressure! _E_ Celebrity Apprentice will be rebroadcast tonight at 9 on CNBC. _E_ Flashbk – "Trump: 'I would build a border fence like you have never seen before'" __HTTP__ via @BreitbartNews by @rwildewrites _E_ The golden rule for every businessman is this: 'Put yourself in your customer's place.' 
Orison Swett Marden _E_ Romney was the architect of ObamaCare. Bush's Chief Justice legalized the monstrosity. Notice a trend? _E_ Who thinks that President Obama is totally incompetent? _E_ Our $17T national debt and $1T yearly budget deficits are a national security risk of the highest order. _E_ Just left the set of The Apprentice the live show tonight will be fantastic and something very big and very different is going to happen _E_ An appeaser is one who feeds a crocodile hoping it will eat him last. Winston Churchill _E_ How do third rate talents with no smarts like @ron_fournier get so much time on television news. Boring guy really bad for ratings! _E_ .@RogerJStoneJr was great on @TheKudlowReport last night. Roger and Larry are good friends! _E_ I'll be on @foxandfriends on Monday at 7:30 a.m. Always a great time. _E_ Thanks Matthew! _E_ Obama will go down as the worst President in history on many topics but especially foreign policy. _E_ Via @WashTimes by @EmilyMiller: Donald Trump says 'This country is going to hell in a handbasket' __HTTP__ _E_ .@IamStevenT stopped by my office to say hello a great guy! __HTTP__ _E_ I got to know @johnboehner very well—he is a great guy who will do the right thing for the country! _E_ ...to Mar a Lago 3 nights in a row around New Year's Eve and insisted on joining me. She was bleeding badly from a face lift. I said no! _E_ .@brithume thinks that when Republicans drop out of the race someone will pick up ALL of that vote. The fact is I will get much of it! _E_ Via @UnionLeader by @tuohy: "Trump inches closer to a decision" __HTTP__ _E_ Anticipate change and embrace it. Recognize new developments that you can capitalize on and use to open new doors. _E_ Via @NewHampJournal by @jdistaso: "In NH 'The Donald' hammers Mitt Jeb as he again weighs a run for President" __HTTP__ _E_ I will be speaking about our great journey to the Republican nomination at 9:00 P.M. 
The movement toward a country that WINS again continues _E_ With oil below $50 the blighted views by windfarms of historic @CulzeanCastle will be very sad. #SaveCulzean __HTTP__ _E_ Happy Thanksgiving I hope everyone can get together to MAKE AMERICA GREAT AGAIN! It won't be easy nothing is but it can be done. _E_ .@BretBaier Thank you for the very fair and highly professional segment on me tonight. Many people watched and commented. _E_ Obama is not working. US Manufacturing orders fell a record 13.9% in August. Where's the recovery? __HTTP__ _E_ Aspirin gets the best press of almost anything I can think of fact or great PR? _E_ Very sad that a person who has made so many mistakes Crooked Hillary Clinton can put out such false and vicious ads with her phony money! _E_ Check out today's From The Desk Of Donald Trump at __HTTP__ I'm willing to answer your questions tweet me.... _E_ Working hard from New Jersey while White House goes through long planned renovation. Going to New York next week for more meetings. _E_ Will be leaving Trump Turnberry tomorrow place & Women's British Open are great. Will be back hitting hard tomorrow. @Turnberrybuzz _E_ Via @BreitbartNews: DONALD TRUMP: EXEC AMNESTY WILL MAKE ILLEGAL IMMIGRATION 'WORSE THAN IT'S EVER BEEN __HTTP__ _E_ Don't worry West Coast etc. we are not going to tweet who was fired or give any indication there of until after it airs. #CelebApprentice _E_ Today we are not merely transferring power from one Administration to another or from one party to another – but we are transferring... _E_ New York City hosted over 52 million visitors in 2012. __HTTP__ Record amount visited Trump Tower. _E_ A doctor on NBC Nightly News agreed with me we should not bring Ebola into our country through two patients but should bring docs to them. _E_ When the military informed Obama that they had Bin Laden is there anyone with a brain that would not have said Ok go get him ? 
_E_ Entrepreneurs: Put everything you've got into what you're doing. Know exactly what you want and go for it. Nothing should be haphazard. _E_ Boy did Pharrell & Robin Thicke get screwed. The Marvin Gaye song sounds nothing like theirs. Get new lawyers fast! _E_ I will be interviewed by @SeanHannity tonight at 10pm EST on @FoxNews! Enjoy! _E_ RT @Scavino45: LIVE Joint Statement by President Trump and Prime Minister Shinzo Abe: __HTTP__ _E_ Congrats to @cheflents of TrumpCollection's #TrumpChicago on being a James Beard semifinalist: __HTTP__ via @CrainsChicago _E_ Happy Birthday to the great @BillyGraham. He's done so many wonderful things not the least of which is his fantastic family. I love Billy! _E_ Join me in Cincinnati Ohio tomorrow evening at 7:00pm. I am grateful for all of your support. THANK YOU!Tickets:... __HTTP__ _E_ Hillary Advisers Wanted Her To Avoid Supporting Israel When Talking To Democrats: __HTTP__ _E_ Our campaign store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_ As I have been saying. Only the beginning: ISIS Suspects Arrested in Turkey 150 European Passports Seized. __HTTP__ _E_ Great work Ivanka! __HTTP__ _E_ A clip from guest hosting @extratv yesterday on @nbc discussing Halle Angus and Gen. Petraeus __HTTP__ _E_ .@gerardtbaker Gerard—wonderful job last night as moderator of the debate. I told many "really smart and elegant." _E_ I am getting worried about Chris @hardball_chris Matthews. Is he drinking again? _E_ Join me live from Fort Myer in Arlington Virginia. __HTTP__ _E_ I win an election easily a great movement is verified and crooked opponents try to belittle our victory with FAKE NEWS. A sorry state! _E_ I will sign the first bill to repeal #Obamacare and give Americans many choices and much lower rates! _E_ With the very dangerous carjacking epidemic going on especially in New York and New Jersey you would be lucky to have a gun for protection _E_ Pocahontas wanted V.P. 
slot so badly but wasn't chosen because she has done nothing in the Senate. Also Crooked Hillary hates her! _E_ Merry Christmas and a very very very very Happy New Year to everyone! _E_ Via @BreitbartNews __HTTP__ _E_ Via WSOC_TV: Donald Trump's son says family thinking about expanding in uptown Charlotte __HTTP__ Great job @EricTrump _E_ Carly Fiorina is terrible at business the last thing our country needs! __HTTP__ _E_ .@BarackObama's assault on coal and gas and oil will send energy and manufacturing jobs to China. @MittRomney _E_ Thank you. __HTTP__ _E_ .@ConradMBlack what an honor to read your piece. As one of the truly great intellects & my friend I won't forget! __HTTP__ _E_ Today it was my privilege to welcome survivors of the #USSArizona to the @WhiteHouse. #HonorThemRemarks: __HTTP__ __HTTP__ _E_ .@FoxNews should be ashamed for allowing experts to explain how to make a nuclear attack! _E_ ...way up. Regulations way down. 600000+ new jobs added. Unemployment down to 4.3%. Business and economic enthusiasm way up record levels! _E_ You talk tough Mr. President but have done nothing about China killing our jobs and economy. _E_ Watch Celebrity Apprentice on NOW! _E_ RT @LouDobbs: Making America Great Again @Kellyannepolls: After #Irma @POTUS is focused on saving lives not swamp shenanigans. #Dobbs #MA... _E_ I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ __HTTP__ _E_ Congratulations to Thomas Perez who has just been named Chairman of the DNC. I could not be happier for him or for the Republican Party! _E_ ...Corker dropped out of the race in Tennesse when I refused to endorse him and now is only negative on anything Trump. Look at his record! _E_ PAY TO PLAY POLITICS. #CrookedHillary __HTTP__ _E_ The Democrats are pushing for Universal HealthCare while thousands of people are marching in the UK because their U system is going broke and not working. 
Dems want to greatly raise taxes for really bad and non personal medical care. No thanks! _E_ Bird killing windfarm that I oppose in Aberdeen just got delayed by at least two years.@AlexSalmond forced the failing developers to delay! _E_ ...You have little persona but The Apprentice concept is great and lucky for you! _E_ Happy to have just passed 1.3M Twitter followers. Love communicating with everyone daily. _E_ This shows what a complete & total liar Ted Cruz is he said he wouldn't have nominated John Roberts. Really? __HTTP__ _E_ "Americans are hungry to feel once again a sense of mission and greatness." – Pres. Ronald Reagan _E_ Jeb Bush will never secure our border or negotiate great trade deals for American workers. Jeb doesn't see & can't solve the problems. _E_ Rima Fakih our beautiful Miss USA rode with me on the Gray Line Ride of Fame yesterday... __HTTP__ _E_ Remember if you don't promote yourself then no one else will! Likewise believe in yourself or no one else will either. _E_ ALWAYS BORROW MONEY FROM A PESSIMIST BECAUSE HE WILL NEVER EXPECT IT TO BE PAID BACK! _E_ Great news. We are only just beginning. Together we are going to #MAGA! __HTTP__ __HTTP__ _E_ Will be on @CNN at 7:00 A.M. _E_ Wowthe Fake News media did everything in its power to make the Republican Healthcare victory look as bad as possible.Far better than Ocare! _E_ "Shutting down the government is a very serious thing. People die accidents happen. I don't know how I would vote right now on a CR OK?"Sen. Dianne Feinstein (D Calif) __HTTP__ _E_ The Afghan Security Forces who we are training have killed 52 U.S. soldiers __HTTP__ Time to get out of there! _E_ With China beating us like a punching bag daily OPEC vacuuming our wallets clean and jobs nowhere in sight (cont) __HTTP__ _E_ Both @BarackObama and China have embraced OWS. All want the decline of America. Time for the protesters to go home. 
_E_ Melania and I are honored to light up the @WhiteHouse this evening for #WorldAutismAwarenessDay. Join us & #LIUB.... __HTTP__ _E_ Obama's rollout of his ISIS war plan is another unmitigated disaster. The Generals must be furious. _E_ I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting! _E_ For a president who likes to showcase how hip and tech savvy he is Obama also appears surprisingly clueless (cont) __HTTP__ _E_ Will be in New Hampshire and then on @CNN Special at 9 PM tonight. _E_ Under a Trump administration it's called #AmericaFirst! #ImWithYou __HTTP__ _E_ Amazing crowd outside @FallonTonight. Tune in tonight at 11:30. __HTTP__ _E_ Thank you @TeamTrump Florida. Keep me updated and lets get those 100000 registered voters!#MakeAmericaGreatAgain __HTTP__ _E_ Former Weather Underground radical Kathy Boudin spent 22 yrs in prison for armored car robbery that killed 2 cops & a Brinks guard... _E_ So I raised/gave $5600000 for the veterans and the media makes me look bad! They do anything to belittle totally biased. _E_ "Going with your instincts requires tuning in to everything around your decision." – Think Big _E_ Just put out a very important policy statement on the extraordinary influx of hatred & danger coming into our country. We must be vigilant! _E_ The last thing we need is another Bush in the White House. Would be the same old thing (remember read my lips no more taxes ). GREATNESS! _E_ Thank you for a great afternoon Birmingham Alabama! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Obama said in his speech that Muslims are our sports heroes. What sport is he talking about and who? Is Obama profiling? _E_ Ridiculous that they gave the 14 year old golfer from China a one stroke penalty for slow play at The Masters(see I can stick up for China) _E_ The real story on Collusion is in Donna B's new book. Crooked Hillary bought the DNC & then stole the Democratic Primary from Crazy Bernie! 
_E_ Stock Market hits an ALL TIME high! Unemployment lowest in 16 years! Business and manufacturing enthusiasm at highest level in decades! _E_ I hear they are very unhappy w/ Arianna and @huffingtonpost at @AOL. I'll bet she won't be there for long! _E_ Remember oftentimes the best deal you make is the deal you don't make! _E_ RT @KellyannePolls: Love and prayers for friends Adrienne & Eric Bolling. May Eric Chase know eternal peace. __HTTP__ _E_ Pennsylvania: Cast your vote for Trump for POTUS & ALSO vote for the TRUMP DELEGATES in your congressional district! __HTTP__ _E_ Why is no one talking about the horrible murder of Ana Charle by ex con thug West Spruill. Gunned down on street naked. Why no riots here? _E_ ....instead of giving to a wonderful charitable cause. _E_ Via @BreitbartNews by @AWRHawkins: TRUMP PREACHES PEACE THROUGH STRENGTH IN PHOENIX __HTTP__ _E_ The Federal government spent over $3.7 trillion last year. This is unsustainable and a true danger. The American dream is being destroyed. _E_ It has been a pleasure to make so many friends and meet so many great people on the trail this past cycle. We will fight on! _E_ Once again Obama fails to classify China as a currency manipulator. He just helped China steal even more jobs and money from us. _E_ The pressure on the debt ceiling is on @BarackObama.... __HTTP__ #trumpvlog _E_ Thank you for all of the nice statements on the Press Conference yesterday. Rush Limbaugh said one of greatest ever. Fake media not happy! _E_ If you don't have a competitive advantage don't compete. Jack Welch _E_ Pres. Bill Clinton 5.31.12: @MittRomney had a sterling business career. _E_ Will be having meetings and working the phones from the Winter White House in Florida (Mar a Lago). Stock Market hit new Record High yesterday $5.5 trillion gain since E. Many companies coming back to the U.S. Military building up and getting very strong. _E_ Wow new Reuters Poll just out. Big lead if you want to MAKE AMERICA GREAT AGAIN! 
TRUMP 37 CRUZ 11 This is at the top of Drudge! _E_ Liberals can hardly belileve it they can't understand how health care costs could have risen so much when (cont) __HTTP__ _E_ Crooked H destroyed phones w/ hammer 'bleached' emails & had husband meet w/AG days before she was cleared & they talk about obstruction? _E_ "Learn know and show. It's a proven formula. Put it to use starting today." – Think Like a Champion _E_ #Imwithyou __HTTP__ __HTTP__ _E_ Today will be a big day @Team_Mitch for you in many ways. The country is lucky. _E_ Great read: "How New York's Veterans Day Parade Became 'America's Parade'" __HTTP__ _E_ Congrats to @JoeTorre @TonyLaRussa & Bobby Cox on all being unanimously elected to @MLB's @BaseballHall! Great leaders & managers. _E_ At the National Achievers Congress in London this October I'm going to talk about success and how to avoid failure __HTTP__ _E_ I have great respect for the people that represent China. What I don't respect is the way that we negotiate and (cont) __HTTP__ _E_ If Justice Roberts had done the right thing and voted against ObamaCare our country would be in a lot better shape right now! TOTAL TURMOIL _E_ "NBC FIRES TRUMP KEEPS SHARPTON: The bigots of the NBC executive suite look the other way" __HTTP__ via @AmSpec by @JeffJlpa1 _E_ Wow Rowanne Brewer the most prominently depicted woman in the failing @nytimes story yesterday was on @foxandfriends saying Times lied _E_ Featuring private living spaces oversized bathrooms & stunning views @TrumpSoHo = downtown NYC's premiere hotel __HTTP__ _E_ RT @DRUDGE_REPORT: MEXICO 2ND DEADLIEST COUNTRY TOPS AFGHAN IRAQ... __HTTP__ _E_ Today I signed an Executive Order on Improving Accountability and Whistleblower Protection at the @DeptVetAffairs:... __HTTP__ _E_ I'll bet Obama now uses the amendment for the debt ceiling. _E_ .@BillClinton was very nice to me as I am to him on the Piers Morgan Show (CNN). He is loyal to his friends. 
@piersmorgan _E_ I will be interviewed on @greta at 7:00 P.M. Enjoy! @FoxNews _E_ He @johnedwards is bad but @andrewyoung is worse not only is he a rat but it turns out he stole much of the money for himself. _E_ Putin has shown the world what happens when America has weak leaders. Peace Through Strength! _E_ My thoughts and prayers are with the great people of Tennessee during these terrible wildfires. Stay safe! _E_ The thousands of people that showed up for me in Phoenix were amazing Americans. @SenJohnMcCain called them crazies must apologize! _E_ 1988 with Oprah discussing why I would never rule out a run for #POTUS.#Trump2016 #VoteTrumpNY #PrimaryDay __HTTP__ _E_ I don't know how much longer I can take this bullshit so terrible! #Oscars _E_ I feel so badly for Mark Cuban the Dallas Mavericks were just eliminated from the playoffs and his partners are pissed. Very sad! _E_ A great book by a great guy highly recommended! __HTTP__ _E_ No surprise Obama's Deputy Campaign Manager tweeted link from Chinese propaganda outlet __HTTP__ Did she also write it? _E_ For those of you that have conveniently forgotten dummy Jon Stewart is a bad filmmaker. His last effort was a real bomb (in all ways)! _E_ CHAIN MIGRATION must end now! Some people come in and they bring their whole family with them who can be truly evil. NOT ACCEPTABLE! __HTTP__ _E_ Lightweight A.G. Eric Schneiderman meets with President Obama (who he told me sucks as a president) and quickly files a suit against me! _E_ RT @DanScavino: Doesn't fit the MSM narrative so they wont share what @realDonaldTrump did for Jesse Jackson in 1999 so I will! __HTTP__ _E_ There are huge opportunities for profits if you can think big & create big solutions for the human needs brought by trends. Think Big _E_ Looking forward to my @theFAMiLYLEADER summit visit and speech. _E_ The April jobs report is terrible. 
If the labor forces didn't shrink under @BarackObama then real unemployment (cont) __HTTP__ _E_ I'll bet Jimmy Fallon gets great ratings tonight! _E_ Great interview on @foxandfriends with the parents of Otto Warmbier: 1994 2017. Otto was tortured beyond belief by North Korea. _E_ Yesterday was a big day for the stock market. Jobs are coming back to America. Chrysler is coming back to the USA from Mexico and many others will follow. Tax cut money to employees is pouring into our economy with many more companies announcing. American business is hot again! _E_ As an addition Apple must go to a larger screen now asap! They're losing their standing in the market! _E_ #NYCStrong #USA __HTTP__ _E_ Remember Anthony Wiener continued sending sick pics. long after his resignation from Congress and his apology zero control over himself! _E_ Great job on Fox this morning @KatiePavlich. I am sending out for your book immediately. Thank you very much! _E_ .@GovChristie is going to do a fantastic job tonight explaining why @MittRomney should be elected and @BarackObama has to go. _E_ Honored to be named as one of business's "Top Leaders Icons and Rebels" by @CNBC __HTTP__ Vote Trump! _E_ Getting ready to deliver a VERY IMPORTANT DECISION! 8:00 P.M. _E_ Destroying the world's finest health care system so that @BarackObama can have his socialized medicine program (cont) __HTTP__ _E_ Despite what you have heard from the FAKE NEWS I had a GREAT meeting with German Chancellor Angela Merkel. Nevertheless Germany owes..... _E_ The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved there will be a News Conference at The White House at approximately 1:00 P.M. _E_ While Obama is obsessed with green collar jobs blue collar workers aren't buying it. (cont) __HTTP__ _E_ We should never have gone into Iraq but once in should have gotten out a lot faster. 
MAKE AMERICA GREAT AGAIN! _E_ China will never go to war with us because if they won they would only take over property they already own! _E_ Now is the time to buy a house if you can DIRECTLY from a bank. They want to get rid of all their foreclosures. _E_ Watch the clip from my #C21 Super Bowl spot on @AccessHollywood tonight. _E_ ...yet not one meeting with an ally (or an enemy!) Where's the media? _E_ Thank you West Virginia! All across the country Americans of every kind are coming together w/one simple goal: to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Entrepreneurs: Pay attention to your negotiation skills. It's all about persuasion and persuasion is power. _E_ I will be speaking at 9:00 A.M. today to Police Chiefs and Sheriffs and will be discussing the horrible dangerous and wrong decision....... _E_ Only reason the hacking of the poorly defended DNC is discussed is that the loss by the Dems was so big that they are totally embarrassed! _E_ They're going to riot in Ferguson no matter what. _E_ THANK YOU Phoenix Arizona! Time for new POWERFUL leadership. Just imagine what WE can accomplish in our first 100... __HTTP__ _E_ Somerset County New Jersey SWAT Team really fantastic people! __HTTP__ _E_ A lot of undecided and independent voters have had enough with Obama's lack of transparency. I don't blame them. _E_ 'Clinton Charity Got Up To $56 Million From Nations That Are Anti Women Gays' #CrookedHillary __HTTP__ _E_ As expected the media is very much against me. Their dishonesty is amazing but just like our big wins in the primaries we will win! _E_ .@TrumpGolfLA is ranked the top course in the West __HTTP__ If you're in the area book a round today. _E_ While @BarackObama continues to defend ObamaCare in the courts he is also granting companies waivers. Eve... (cont) __HTTP__ _E_ via __HTTP__ Only one man up for the job of president __HTTP__ _E_ Can you conquer the Blue Monster? 
Book a tee time @TrumpDoral right here __HTTP__ _E_ Via @AP March2013: Jeb said "he was open to...pathway for citizenship for illegal immigrants" __HTTP__ Lying on campaign trail! _E_ Congrats to @rushlimbaugh on the release of his new book "Rush Revere and the Brave Pilgrims." #1 on @amazon and @bnbooks. Must read! _E_ Obama just stated he didn't take school seriously made bad choices and GOT HIGH then how the hell did he get into Columbia & Harvard? _E_ Obama's complaints about Republicans stopping his agenda are BS since he had full control for two years. He can never take responsibility. _E_ A top rated NY course by @GolfDigestMag @TrumpNationalNY provides award winning services and exceptional facilities __HTTP__ _E_ ...What is wrong with this story? Isn't this just ridiculous? Terrible! #KathyBoudin _E_ Leaving the White House for the Great State of North Carolina. Big progress being made on many fronts! _E_ Mexico has taken advantage of the U.S. for long enough. Massive trade deficits & little help on the very weak border must change NOW! _E_ Remember Sunday is National Prayer Day (by Presidential Proclamation)! _E_ #TBT For all who have been asking my mother was a great beauty and a wonderful person. Here we are with my father __HTTP__ _E_ At the Univision forum Obama continued to make excuses for Fast and Furious __HTTP__ His operation killed innocent Americans. _E_ Thank you New Hampshire! Together we will Make America Great Again! __HTTP__ _E_ Wouldn't it be great to Repeal the very unfair and unpopular Individual Mandate in ObamaCare and use those savings for further Tax Cuts..... _E_ In Iran deal we get 4 prisoners. They get $150 billion 7 most wanted and many off watch list. This will create great incentive for others! _E_ .@BarackObama is begging the Eurozone to keep Greece in until after 11.6.12. He thinks the world revolves around his re election. _E_ Thank you! 
#Trump2016 __HTTP__ __HTTP__ _E_ Look forward to going to Indiana tomorrow in order to be with the great workers of Carrier. They will sell many air conditioners! _E_ RT @JoeNBC: Explosive Trump attack on HRC Bill Monica Cosby and Weiner. Trump camp just upped the ante on women's rights __HTTP__ _E_ FLASHBACK: "Hiding evidence of global cooling" __HTTP__ @washtimes "Scientific data" is cooked! _E_ This is what @BarackObama thinks: that America would be better off if we acted more like European socialist (cont) __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016UNIFYING THE NATIONVideo: __HTTP__ __HTTP__ _E_ Am I morally obligated to defend the president every time somebody says something bad or controversial about him? I don't think so! _E_ Remember Bill Maher praised the animals who took down the World Trade Center and was fired by ABC. DROP@HBO until dopey Bill is canned! _E_ We have the Final Six—and @LilJon is the last remaining member of Team Power. He's done a great job. #CelebApprentice _E_ Just received the new Fox poll.Thank you America! #Trump2016 __HTTP__ _E_ Cadillac has made amazing strides in the beauty and quality of their cars. Great management team congratulations! @Cadillac _E_ Visit @Fund_Anything at __HTTP__ to see my picks! #FundAnything _E_ One of my many Twitter followers suggested Obama should take my offer & give $1250000 to each family of the four... __HTTP__ _E_ U.S. COAL PRODUCTIONUp📈7.8% past year. Down📉31.5% last 10 years. #EndingWarOnCoal __HTTP__ _E_ Great meeting with military spouses in Virginia joined by @IvankaTrump @LaraLeaTrump @GenFlynn & @MayorRGiuliani. __HTTP__ _E_ Very exciting week for @TrumpDoral. I will be in Miami opening what will soon be best resort in U.S. World Golf Championship this week! _E_ America's relationship with China is at a crossroads. 
We only have a short window of time to make the tough (cont) __HTTP__ _E_ We will push onward to victory w/hope in our hearts courage in our souls & everlasting pride in each & every one of you. God Bless America. __HTTP__ _E_ It was a great honor to be on @MikeAndMike on @espn. Wow the response was amazing! _E_ Trump Was Right: 'Obama's America' Tops 2012 Documentaries __HTTP__ via @Newsmax_Media _E_ Looking forward to being guest of honor at @ralphreed's @FFCoalition Patriot Gala Dinner on June 14th in DC. Flag day and my birthday. _E_ The U.S. should not be giving away our strategy & tactics to the enemy so they can prepare. Just go and do what you have to do! _E_ FOX debate advertising rates falling like a rock! Tune into my special event for the Veterans at 9pm EST! _E_ Secret Service members on break from Obama's $4M vacation are more than welcomed to relax at Hawaii's top hotel @TrumpWaikiki. _E_ Prime Minister @Netanyahu and @PresidentRuvi on behalf of @FLOTUS Melania and myself thank you for the invitation... __HTTP__ _E_ Central American presidents are blaming us for the influx of illegal immigration __HTTP__ Obama will soon apologize. _E_ The world is most peaceful and most prosperous when America is strongest. __HTTP__ _E_ Historic Change! Obama has spent over $44M of our money on travel expenses the most for any president __HTTP__ _E_ Cruz going down fast in recent polls dropping like a rock. Lies never work! _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Can you believe it—the model who mysteriously disappeared from the ObamaCare website is not a US citizen—she's from Colombia. _E_ I use Social Media not because I like to but because it is the only way to fight a VERY dishonest and unfair "press" now often referred to as Fake News Media. Phony and non existent "sources" are being used more often than ever. Many stories & reports a pure fiction! _E_ Obama's new excuse for his failures is that you can't change Washington from the inside. 
Not what he said in '09. __HTTP__ _E_ U.S. tuitions are completely out of control. In the last 4 years the average price has gone up by 15%. __HTTP__ Unsustainable! _E_ I believe in free markets but allowing a merger of US Air & American Airlines is totally ridiculous! Will control most of US market. _E_ HILLARY'S HEALTH CARE POLICIES#DrainTheSwamp #Debate __HTTP__ _E_ Tomorrow is #TrumpTuesday on @squawkCNBC 7:30 AM EST. Always interesting. _E_ In even the darkest moments the light of our people has shown through their goodness their courage and their love. #USA __HTTP__ _E_ The @GOP primary voter spoke last night in VA 7 & @DaveBratVA7th won going away. Now the party MUST stand behind him! Unity Unity Unity! _E_ Department of Homeland Security has spent $3.5 billion dollars building their new headquarters and is years late and billions over budget! _E_ Tea Party takes down Eric Cantor REALLY BIG WIN! _E_ Bashar Assad is stronger today than he was before Obama threatened military action. Obama really bungled this. _E_ Phoenix crowd last night was amazing a packed house. I love the Great State of Arizona. Not a fan of Jeff Flake weak on crime & border! _E_ RT @realDonaldTrump: National Pearl Harbor Remembrance Day "A day that will live in infamy!" December 7 1941 _E_ RT @townhallcom: ABC NBC And CBS Pretty Much Bury IT Scandal Engulfing Debbie Wasserman Schultz's Office __HTTP__ _E_ I will be on @60Minutes tonight at 7:00 P.M. with Mike Pence talking about LAW AND ORDER and many other subjects! Bad times for divided USA! _E_ Congratulations to my friend David Wright of the @mets who is now their all time hitting leader. _E_ The just out USA Today National Poll where I lead by big numbers shows that in a head to head matchup I beat both Hillary and Bernie. _E_ Poor @JohnKasich doesn't have what it takes __HTTP__ _E_ Thank you @SarahPalinUSA for your amazing help and support. Big win leaving now for Atlanta and Nevada.The people of South Carolina got it! 
_E_ Tomorrow is #TrumpTuesday on @squawkboxCNBC 7:30 AM don't miss it! _E_ I will be going to Mississippi tomorrow night hear the crowds are going to be massive! Look forward to it. _E_ Entrepreneurs: Don't expect anyone to be on your side. Sometimes we're all in this alone. So believing in yourself is mandatory. _E_ Sirius National News at 7:30 A.M. Steve Bannon. @BreitbartNews _E_ Obama has destroyed the middle class. In '09 median household income was $55198. Now it is $50678. Four more years? _E_ SHOCK! While attacking @MittRomney's private equity experience @BarackObama raises $2M from private equity bankers __HTTP__ _E_ When and how are the dummies at the @WSJ going to apologize to me for their totally incorrect Editorial on me. I want smart trade deals. _E_ I will be on @Morning_Joe live from New Hampshire tomorrow at 7am. #Trump2016 #MakeAmericaGreatAgain _E_ Washington must come together on a deal to avoid a fiscal cliff. If taxes are raised they must come with real hard cuts. _E_ I should have easily won the Trump University case on summary judgement but have a judge Gonzalo Curiel who is totally biased against me. _E_ "Much as it pays to emphasize the positive there are times when the only choice is confrontation." – The Art of the Deal _E_ Thanks @greggutfeld. Really nice! I'm glad I did your show. @GregGutfeldShow _E_ Election is being rigged by the media in a coordinated effort with the Clinton campaign by putting stories that never happened into news! _E_ .@ShawnJohnson Congratulations on your engagement he is a lucky guy. You are  a true winner and will be an amazing couple. _E_ Once you consent to some concession you can never cancel it and put things back the way they are. Howard Hughes _E_ Did President Obama have a rough day yesterday or what? He has got to start telling the truth NO MORE LIES OR DECEPTION! _E_ Brian I hope @NBCNightlyNews isn't paying you too much look at what's happening to nightly news. 
_E_ Do the people of Ohio know that John Kasich is STRONGLY in favor of Common Core! In other words education of your children from D.C. No way _E_ Unlike U.S. China taxes things made in the U.S. and sold in China. China demands plants we don't. Stupid! _E_ .@HillaryClinton's 2008 Campaign And Supporters Trafficked In Rumors About Obama's Heritage #DebateNight __HTTP__ _E_ Keystone must be approved through Congress. @BarackObama is costing America over 20000 jobs and driving the price of gas high. _E_ House of Representatives needs to pass Government Funding Bill tonight. So important for our country our Military needs it! _E_ .@ErraticSLK Shout out = work hard! _E_ Everybody should contribute & fight in the long haul battle against autism. @autismspeaks _E_ GOPers eye Donald Trump for governor run __HTTP__ via @nypost by @fud31 _E_ If people knew how hard I worked to get my mastery it wouldn't seem so wonderful at all. Michelangelo _E_ The lady in Chicago that I'm fighting owes me $500 000 and is sophisticated & vicious. She made up a story & plays the age card bad! _E_ Congrats to people of Scotland on the Judge's ruling concerning bird killing land destroying environmentally disastrous windmills. _E_ Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_ O'Malley as former Mayor of Baltimore has very little chance. _E_ I'm eagerly awaiting the next polls. The debate performance could be devastating to the Obama team. Let's see what happens. _E_ Admitted:@BarackObama's Treasury Secretary admitted that their 2013 budget does nothing to address America's (cont) __HTTP__ _E_ In this book our second together we share what gives us the Midas Touch the ability to turn things we touch (cont) __HTTP__ _E_ Russians are playing @CNN and @NBCNews for such fools funny to watch they don't have a clue! @FoxNews totally gets it! _E_ My friend @TheSlyStallone lost his wonderful son Sage this weekend. 
We all send Sly our love and warmest wishes. (cont) __HTTP__ _E_ RT @CLewandowski_: Trump winning over Latino Republicans poll says | New York Post __HTTP__ _E_ A guy named @BobBeckel on FOX their resident liberal was not born with much of a brain. _E_ For the Republicans to have any success these next two years they must have a long game plan... _E_ It's important to listen to what people say. "Horrible" and "disgusting" are the words I used in response to Sterling's comments. _E_ The most important truth our FOUNDERS understood was: FREEDOM is NOT a gift from Govt. FREEDOM is a GIFT from GOD. __HTTP__ __HTTP__ _E_ Randy Moss should not be bragging about himself—I'm the only one who is allowed to do that! _E_ It is time for DC to protect the American worker not grant amnesty to illegals. Let's Make America Great Again! __HTTP__ _E_ I had amazing time in Charlotte. Great people & many new friends. I look forward to coming back very soon. Congrats to Gavin & Staff. _E_ Do you think that very dumb reporter(blogger) McKay Coppins has apologized to his wife for his very inappropriate behavior while in Florida? _E_ 'Americans overwhelmingly oppose sanctuary cities' __HTTP__ _E_ Interesting how the U.S. sells Taiwan billions of dollars of military equipment but I should not accept a congratulatory call. _E_ Via @Newsmax_Media by @dpatten32: "Trump's Brand Gives Him 2016 Mojo" __HTTP__ _E_ Did a shoot in front of the Metropolitian Museum on 5th Ave for the 13th season of the Apprentice... _E_ Exxon donated $250g to Obama's inaugural __HTTP__ I guess the Democrats have no problem accepting money from 'big oil.' _E_ Thanks & I won't let you down. __HTTP__ _E_ There are many ways of going forward but only one way of standing still. Pres. Franklin D. Roosevelt _E_ Via @successmagazine by @MikeSeemuth: "Trump Power" __HTTP__ _E_ Today I announced a new Executive Order with re: to North Korea. We must all do our part to ensure the complete denuclearization of #NoKo. 
__HTTP__ _E_ "If you like your plan you keep it." = "Gruber is just some adviser." Two of Obama's greatest lies told to the American public. _E_ Experience is not what happens to you it's what you do with what happens to you. Aldous Huxley _E_ RT @IvankaTrump: 3/4: This Administration is deeply committed to those who serve & their families who make it possible through their love a... _E_ Join me live in Louisiana! Tomorrow we need you to go to the polls & send John Kennedy to the U.S. Senate. __HTTP__ _E_ .@BarbaraJWalters @theviewtv Barbara unfortunately you've missed the entire point of my announcement you just don't get it! _E_ "@OMAROSA is a bit toxic" per @BrandenRoderick. Being a bit PC? #CelebApprentice _E_ The Fed should not do QE3. Neither the economy nor the dollar can withstand another round of artificial liquidity. _E_ "Donald Trump: Karl Rove Has Done Ashley Judd A Favor" __HTTP__ via @SheKnows _E_ Will be interviewed by @GStephanopoulos on @ABC at 10:00 A.M. _E_ $716 Billion from Medicare by @BarackObama. When will it end? _E_ Karen Handle's opponent in #GA06 can't even vote in the district he wants to represent.... _E_ Obama still refuses to stop the flights. Is he stubborn or just plain incompetent I say both! _E_ Robert Pattinson should not take back Kristen Stewart. She cheated on him like a dog & will do it again just watch. He can do much better! _E_ Thank you @MikeOzanian for the nice comments on @FoxNews today. Great job! _E_ RT @foxandfriends: U.S. spy satellites detect North Korea moving anti ship cruise missiles to patrol boat __HTTP__ _E_ Today it was a tremendous honor for me to sign the #VAaccountability Act into law delivering my campaign promise... __HTTP__ _E_ It is the same Fake News Media that said there is no path to victory for Trump that is now pushing the phony Russia story. A total scam! _E_ Still time to #VoteTrump! 
#iVoted #ElectionNight __HTTP__ _E_ How can @JebBush beat Hillary Clinton if he can't beat anyone else on the #GOPDebate stage with $150M? I am the only one who can! _E_ A good head and a good heart are always a formidable combination. Nelson Mandela _E_ If you voted for Obama in 2008 to prove you were not a racist then vote for Romney in 2012 to prove you are not stupid. Thanks Walter D! _E_ Where's the global warming? 2013 was one of the least extreme years in weather on record __HTTP__ _E_ Just got back from the Iowa State Fair. Record crowds phenomenal people. Thank you IOWA I will never let you down! _E_ Re negotiation: Trust your instincts even after you've honed your skills. They're there for a reason. _E_ Rand Paul is a friend of mine but he is such a negative force when it comes to fixing healthcare. Graham Cassidy Bill is GREAT! Ends Ocare! _E_ OPEC has just raised oil to over $102/Barrel. And @BarackObama still won't approve the Keystone Pipeline. Does he want high gas prices? _E_ Join me in Council Bluffs Iowa today at 3pm! #MakeAmericaGreatAgain Tickets: __HTTP__ _E_ .@AlexSalmond If a country wants to rapidly destroy its economy I have an idea just put up subsidized wind (cont) __HTTP__ _E_ Via @ACLJ: Pastor Saeed's Wife Expresses Gratitude to Donald Trump for Raising Her Husband's Plight __HTTP__ _E_ I'll be speaking tomorrow at the San Jose Convention Center (CA) for the first ever National Achievers Congress __HTTP__ _E_ AngieApon I think you should try wearing your hair combed back. It looked good when you slicked it back Mr. Trump ) #ALS May happen thx _E_ Donald Trump's Speech Is a Game Changer. #Trump2016 __HTTP__ __HTTP__ _E_ Build your reputation on intelligence responsibility and results. That's building the right way. Think Like a Champion _E_ I am going to save Medicare and Medicaid Carson wants to abolish and failing candidate Gov. John Kasich doesn't have a clue weak! 
_E_ I can't believe that in New York we can't watch the PGA Championsip on CBS. How .much discount is Time Warner giving its customers? _E_ Happy 8th Anniversary to @MELANIATRUMP. __HTTP__ _E_ RT @IngrahamAngle: Trump Int'l Golf Club West Palm Beach is spectacular. Almost makes me wish I had time to play/learn/like golf. _E_ I have to admit @AlexSalmond is a tough smart guy. He is formidable by any standard! _E_ Why is Washington ready to spend billions on care for illegals while our VA is still in shambles? Vets should be the priority. _E_ THANK YOU to all of the great men and women at the U.S. Customs and Border Protection facility in Yuma Arizona & around the United States! __HTTP__ _E_ Look forward to being in DC tomorrow—big crowd expected for our protest against the truly stupid nuclear deal we are making with Iran. _E_ When is the media going to talk about Hillary's policies that have gotten people killed like Libya open borders and maybe her emails? _E_ Welcome to @BarackObama's America 8.74 million workers on 'Federal Disability __HTTP__ Where are the jobs?! _E_ I own Turnberry in Scotland one of great resorts in world. Women's British Open there this week. I'll go for two days & back on trail. _E_ A simplified tax code will help promote growth in the private sector. _E_ My @SquawkCNBC #TrumpTuesday interview with Ken Langone & Dick Grasso discussing the Chicago teachers' strike @ 2012 __HTTP__ _E_ Why would Texans vote for liar Ted Cruz when he was born in Canada lived there for 4 years and remained a Canadian citizen until recently _E_ Why can't @Politico get better reporters than Ben Schreckenger? Guy is a major lightweight with no credibility. So dishonest! _E_ Entrepreneurs: Ignorance is not bliss it's fatal. It's costly. Pay attention or get crushed. Watch listen and learn. 
_E_ I am self funding my campaign and only work for YOU the American people!#Trump2016 Video: __HTTP__ __HTTP__ _E_ "Patriotism is supporting your country all the time and your government when it deserves it." – Mark Twain _E_ If the Republican Convention had blown up with e mails resignation of boss and the beat down of a big player. (Bernie) media would go wild _E_ #BigLeagueTruth __HTTP__ _E_ This week's All Star Celebrity @ApprenticeNBC features another memorable Board Room rumble between @piersmorgan & @OMAROSA. _E_ Saudi Arabia and many of the countries that gave vast amounts of money to the Clinton Foundation cont'd: __HTTP__ _E_ ...American Cancer Society and the Dana Farber Cancer Center. _E_ Thank you @loudobbsnews I will be trying very hard to prove you right great show! _E_ I hope A Rod has a great night for the Yankees he owes it to them especially with Derek hurt. _E_ The President's speech tonight will largely focus on class warfare. The Republicans don't know how to handle that—I do. _E_ Failing host @glennbeck a mental basketcase loves SUPERPACS in other words he wants your politicians totally controlled by lobbyists! _E_ From day one I said that I was going to build a great wall on the SOUTHERN BORDER and much more. Stop illegal immigration. Watch Wednesday! _E_ #CrookedHillary Job Application __HTTP__ _E_ Via @cnsnews by @CraigBBannister: "Poll: Hispanics Blacks Call for Tighter Borders Access to Illegals' Jobs" __HTTP__ _E_ Good move by Bernie S. _E_ The statement put out yesterday by @FoxNews was a disgrace to good broadcasting and journalism. Who would ever say something so nasty & dumb _E_ Snow and ice freezing weather in Texas Arizona and Oklahoma what the hell is going on with GLOBAL WARMING? _E_ Boycott Mexico until they release our Marine. With all the money they get from the U.S. this should be an easy one. NO RESPECT! _E_ They will soon be calling me MR. BREXIT! _E_ BIG NIGHT on Celebrity Apprentice tonight. IMPORTANT starts at 10 P.M. 
as scheduled but NBC just increased all future episodes to 2 hours! _E_ ....Dopey @krauthammer should be fired. @FoxNews _E_ Everyone knows I am right that Robert Pattinson should dump Kristen Stewart. In a couple of years he will thank me. Be smart Robert. _E_ ...vast sums of money to NATO & the United States must be paid more for the powerful and very expensive defense it provides to Germany! _E_ Where was all the outrage from Democrats and the opposition party (the media) when our jobs were fleeing our country? _E_ My recent statement re: @macys We must have strong borders & stop illegal immigration now!... __HTTP__ _E_ I'm self funding my campaign but lobbyists & special interests for Jeb & others are starting to do big ads—desperate! Don't believe them. _E_ Via @bluegreentweet: Scottish wind farm opposed by Donald Trump delayed __HTTP__ _E_ We will bring America together as ONE country again – united as Americans in common purpose and common dreams. #MAGA _E_ Thank you Senator David Perdue! __HTTP__ __HTTP__ _E_ With Mexico being one of the highest crime Nations in the world we must have THE WALL. Mexico will pay for it through reimbursement/other. _E_ Pastor #Nadarkhani must be released by Iran immediately. I applaud the @WhiteHouse & @StateDept for issuing (cont) __HTTP__ _E_ .@loudobbsnews did a fantastic interview with syndicated columnist Michelle Malkin. Congrats to both! _E_ I am more concerned about Biden in the debate than I am about Obama. Be careful on Thursday night! _E_ RT @USHCC: USHCC was delighted to host @IvankaTrump for a roundtable discussion w/ Hispanic women biz owners today in Washington #USHCCLegi... _E_ One of my first acts as President will be to deport the drug lords and then secure the border. #Debate #MAGA _E_ .@rushlimbaugh played 3 separate audio bites (the most of anyone) of my CPAC speech. Hour 3 in Friday's show. _E_ Crooked Hillary Clinton wants completely open borders. 
Millions of Democrats will run from her over this and support me. _E_ The Tonight Show @nbc will be amazing 11:30 P.M. ENJOY! _E_ Rush Limbaugh: Trump Has Changed the Entire Debate on Immigration __HTTP__ _E_ Sgt. Bowe Bergdahl should face the death penalty for desertion five brave soldiers died trying to bring him back. U.S. has to get tough! _E_ China's business interests reach far and wide even domestically within our borders. We need to reassess our relationship. _E_ Congratulations to @Mets @RADickey43 on becoming the first knuckleball pitcher to ever win the CY Young award! _E_ #TBT Do you believe once upon a time Jon Stewart really liked me? From 2004. __HTTP__ _E_ Via @AP: Miss USA Olivia Culpo is crowned Miss Universe Ratings increase 15% over last year. __HTTP__ _E_ My offer to Obama is about transparency. In 2008 American people were sold on hope and change. This our last chance to get the full record. _E_ A great night in Macon Georgia! Thank you for all of the support. Together we will #MakeAmericaGreatAgain! __HTTP__ _E_ You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_ You have to feel bad for the Democrat Senators. They don't want Hagel either. Just following Obama's orders. _E_ So far the Super Bowl is very boring not nearly as exciting as politics MAKE AMERICA GREAT AGAIN! _E_ I am in New Hampshire having a great time! Loved the #GOPDebate last night! Everybody enjoy the Super Bowl. #SuperBowlSunday #SB50 _E_ Well Iran has done it again. Taken two of our people and asking for a fortune for their release. This doesn't happen if I'm president! _E_ Very low ratings radio host Hugh Hewitt asked me about Suleiman Abu Bake al Baghdad Hassan Nasrallah and more typical gotcha questions _E_ Hypocrite. Watch Senator Obama defend democratic debate' of Senate filibuster rules in 2005 __HTTP__ _E_ North Carolina is a fantastic state with wonderful people. 
I enjoy my time there when I visit Trump National Charlotte. _E_ I'm having a real hard time watching the Academy Awards (so far). The last song was terrible! Kim should sue her plastic surgeon! #Oscars _E_ ...But while Dallas dropped to it knees as a team they ALL stood up for our National Anthem. Big progress being made we love our country! _E_ Rated "#1 Resort in Europe" by @CNTraveler @Trump_Ireland offers breathtaking golf & the 5 Star Lodge at Doonbeg __HTTP__ _E_ A 60% increase in Texas Blue Cross/Blue Shield through ObamaCare. I told you so there is panic and anger as healthcare costs explode! _E_ 13 states have voter registration deadlines TODAY: FL OH PA MI GA TX NM IN LA TN AR KY SC.Register: __HTTP__ _E_ Can't fool Americans. 57% of uninsured hate ObamaCare __HTTP__ Reality is less will be insured b/c of this monstrosity. _E_ I'm impressed both teams have produced very entertaining silent films. #CelebApprentice _E_ Bad. @gallupnews survey shows 30% of businesses not hiring they are worried they won't be around in a year. __HTTP__ _E_ My @FoxNews interview with @gretawire discussing why I endorsed @MittRomney and why he will make a great President __HTTP__ _E_ Intelligence stated very strongly there was absolutely no evidence that hacking affected the election results. Voting machines not touched! _E_ If US Air and American Airlines are allowed to merge we are back to the days of "monopoly." _E_ The townhall question segment of my @WMUR9 Commitment 2016 Conversation @JoshMcElveen __HTTP__ Great questions/people #FITN _E_ James Clapper who famously got caught lying to Congress is now an authority on Donald Trump. Will he show you his beautiful letter to me? _E_ Just left Trump Golf Links at Ferry Point. Ribbon cutting w/@MayorBloomberg & @jacknicklaus was spectacular. Lots of people & jobs! _E_ Ever see @bluemangroup in performance? They're fantastic. And so are Penn & Teller. Don't miss them. 
#CelebApprentice _E_ The MAKE AMERICA GREAT AGAIN agenda is doing very well despite the distraction of the Witch Hunt. Many new jobs high business enthusiasm.. _E_ I hope @billmaher comes through with his $5 million offer which I fully accepted or I will be forced to sue him. All goes to charity! _E_ This Sunday's All Star @ApprenticeNBC features some of the biggest fireworks of the entire season. Get ready. _E_ Via @washingtonpost 9/18/01. I want an apology! Many people have tweeted that I am right! __HTTP__ __HTTP__ _E_ Happy 70th Birthday @USAirForce! __HTTP__ _E_ Success is good. Success with significance is even better. Work on what you will be proud to be associated with make your work count. _E_ Good news @RickSantorum did the right thing. I congratulate him on running a very good race. Now it's onto @BarackObama go get him Mitt! _E_ I unfairly get audited by the I.R.S. almost every single year. I have rich friends who never get audited. I wonder why? _E_ Getting ready to leave for the Great State of Indiana and meet the hard working and wonderful people of Carrier A.C. _E_ Vince McMahon shows the crowd one of the greatest moments in WWE History. #WWEHOF __HTTP__ _E_ The time has come. THEGaryBusey will be project mgr on this Sunday's All Star Celebrity @ApprenticeNBC. MUST SEE TV!!! Back to 2 hrs. _E_ See ungrateful Little @MacMiller's statement to me a year ago— __HTTP__ he was kissing my ass! _E_ ....on ruining Scotland's beauty with ugly & costly wind turbines? _E_ The U 6 Unemployment Rate is over 14.9%. ObamaCare is stopping businesses from both hiring and expanding. _E_ Ratings challenged @CNN reports so seriously that I call President Obama (and Clinton) the founder of ISIS & MVP. THEY DON'T GET SARCASM? _E_ Will be in Terre Haute Indiana in a short while big rally! See you soon! 
_E_ Colin Montgomerie @montgomeriefdn You are not only a great golfer you are doing a great job of commentary @GolfChannel _E_ MAKE AMERICA SAFE AGAIN!#NoSanctuaryForCriminalsAct #KatesLaw #SaveAmericanLives __HTTP__ _E_ My Doral Country Club purchase was made just before Miami real estate market went through the roof—good timing! _E_ .@megynkelly must have had a terrible vacation she is really off her game. Was afraid to confront Dr. Cornel West. No clue on immigration! _E_ My @FoxNews int with @TeamCavuto on the state of world affairs economy the Bushes etc. __HTTP__ _E_ Interview with @oreillyfactor on Fox Network 4:00 P.M. (prior to Super Bowl). Enjoy! _E_ On the sands of Playa Brava waves will reflect on walls & circular architecture of Trump Tower Punta del Este __HTTP__ _E_ Thank you @SenOrrinHatch. Let's continue MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ It's not that I'm so smart it's just that I stay with problems longer. Some good words from Albert Einstein. It pays to be tenacious. _E_ Crooked Hillary who embarrassed herself and the country with her e mail lies has been a DISASTER on foreign policy. Look what's happening! _E_ Be sure to watch The Celebrity Apprentice on Sunday at 9 pm on NBC. It's an episode you'll want to see and one you won't forget! _E_ It's time for Mountain State to have a Senator who will stop Obama's war on coal. This November send DC a message vote for @CapitoforWV! _E_ Ohio had the biggest budget increase in the U.S. If it were not for striking oil they would be bust! Governor Kasich in favor of TPP fraud! _E_ The Clinton Campaign at Obama Justice #DrainTheSwamp __HTTP__ _E_ Ratings for NFL football are way down except before game starts when people tune in to see whether or not our country will be disrespected! _E_ Our Founding fathers got it. 
They understood that nothing good in life religious freedom economic freedom (cont) __HTTP__ _E_ As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_ I am signing copies of my book CRIPPLED AMERICA. Makes a great holiday gift. Order yours now! __HTTP__ ... ... _E_ My @foxandfriends interview discussing @newsday's endorsement of @MittRomney tomorrow's election and Sandy's victims __HTTP__ _E_ A clip from my interview with @jimmyfallon discussing the cast of @ApprenticeNBC Season 5 __HTTP__ _E_ I play golf to relax. My company is in great shape. @BarackObama plays golf to escape work while America goes down the drain. _E_ Entrepreneurs always remember that every business relationship can lead to greater deals in the future. Be sure to cultivate relationships _E_ I would triple the sanctions on Iran if the American pastor is not released. my @SRQRepublicans speech _E_ $ave your $. Don't invest in @KarlRove. He doesn't have a clue. __HTTP__ _E_ Congratulations to @FLGovScott on getting an A grade from @CatoInstitute on his fiscal policy. Rick is a fantastic governor. _E_ Thank you @LuisRiveraMarin! __HTTP__ _E_ The Yankees are absolutely terrible what happened to this team? _E_ Did you ever think our country would become an economic basket case? So much for Hope & Change. _E_ The endorsement of me by the 16500 Border Patrol Agents was the first time that they ever endorsed a presidential candidate. Nice! _E_ .@TrumpDoral's record $200M renovations are on schedule. The hotel remains open for guests events and conferences. __HTTP__ _E_ RT @greta: interesting poll results so far (and go vote on __HTTP__ __HTTP__ _E_ I'm on Bill @oreillyfactor tonight at 8 PM. It will be another lively interview about how to #MakeAmericaGreatAgain! _E_ Tremendous backlash against the NFL and its players for disrespect of our Country.#StandForOurAnthem _E_ ObamaCare is an attack on our country's identity. 
The latest victim is the Catholic church. It must be full repealed. @BarackObama _E_ The signature restaurant of @TrumpNewYork @jeangeorges is both a Forbes Five Star & AAA Five Diamond restaurant __HTTP__ _E_ Obama is totally "tweaking" the Republicans because he doesn't respect them—they've got to change their ways. _E_ I hope when Rand Paul gets out of the race—he is at 1% his supporters come over to me. I will do a much better job for them. _E_ I was #1 on Twitter and so positive. Thank you! __HTTP__ _E_ "TEA TALK: Highlights from Monday convention speech from Donald Trump" __HTTP__ via @myrbeachonline by @TSN_MPrabhu _E_ Check out ShouldTrumpRun.... __HTTP__ _E_ Work begins on the Old Post Office in Washington D.C. in 3 months. It will soon become one of the great hotels of the world. _E_ At least ObamaCare/RomneyCare architect Gruber admitted albeit privately that we were lied to by Obama. Gang of Liars. _E_ Join me in Colorado Springs at 2pm or in Denver tonight at 7pm!Colorado Springs: __HTTP__ __HTTP__ _E_ Our nation has a duty to care for our vets & their families. It's time to do it! Let's Make America Great Again! __HTTP__ _E_ I am really happy that Hillary made her speech right under Trump World Tower! _E_ My @WMUR9 'Close Up' int. with @JoshMcElveen discussing the midterms the new Congress travelling to NH & 2016 __HTTP__ _E_ My new book Time To Get Tough comes out on December 5th. Pre order on Amazon.com. It's the best book I've ever written. _E_ RT @Scavino45: The Iran deal was one of the worst & most one sided transactions the United States has EVER entered into. @POTUS @realDona... _E_ Time Warner Cable went out on 5th Avenue for 2 plus days. They are a disaster. I think I'm going to switch. _E_ RT @DanScavino: Congratulations to the 2017 @PinstripeBowl (Yankee Stadium) Champions Iowa @HawkeyeFootball! __HTTP__ _E_ Entrepreneurs: Learn to be succinct. Can you tell someone your idea in three minutes or less? Be clear and concise. 
_E_ To show you how politicians act Bobby Jindal spent $1000 to register in New Hampshire & dropped out the next day. Such a waste! _E_ Great quote from the late Steve Jobs: Innovation distinguishes between a leader and a follower. _E_ Congratulations to @BarackObama for being reckless. In his first 38 months in office the debt has grown at a rate that is unthinkable. _E_ Whether you think you can or think you can't you are right. Henry Ford _E_ It is crucial for Republicans to remain united during this shutdown _E_ Senator Luther Strange has done a great job representing the people of the Great State of Alabama. He has my complete and total endorsement! _E_ Great piece on Extra tonight re. Celebrity Apprentice! _E_ "Your money should be at work at all times. Even in the worst economy there is no excuse. Think Like a Billionaire _E_ The average family has spent $4155 this year filling up the car on $3.50/gallon average. Both record highs. (cont) __HTTP__ _E_ "When mistakes are made and they will be the entrepreneur's true character emerges and further growth takes place." – The Midas Touch. _E_ I can't get over after all of the buildup what a terrible game that was the worst Super Bowl in history. The advertisers must be furious! _E_ Employees of @NYMag should have their resumes updated. It is very boring & will die in the near future. How much are they losing now? _E_ Just left a great event in Pella. Going to church tomorrow in Muscatine Iowa. _E_ Every sports fan is treated to an All Star game. The loyal and growing fan base of @CelebApprentice will be getting a much bigger treat! _E_ Join me live at the 2018 World Economic Forum in Davos Switzerland! #WEF18 __HTTP__ __HTTP__ _E_ .@club4growth should release the letter they sent me asking for $1000000. When I said no they came out against me. A scam operation? _E_ The most luxurious hotel in downtown Manhattan @TrumpSoHo is a top destination __HTTP__ _E_ Many Republicans support TPP. They are stupid. 
We have stupid Republicans too. We need to keep jobs here! my @SRQRepublicans speech _E_ Remember that in 2006 then Senator Obama voted NOT TO INCREASE THE DEBT CEILING. Now he acts in disbelief as others plan to do the same! _E_ Thank you Florida! #Trump2016 __HTTP__ _E_ I want to thank the people of Iowa for an unbelievable day. The crowds were amazing. Will be back Tuesday! _E_ All of the phony T.V. commercials against me are bought and payed for by SPECIAL INTEREST GROUPS the bandits that tell your pols what to do _E_ Thank you @Forbes for showing the @WSJ was wrong. So dishonest! __HTTP__ _E_ Thank you California! See you soon!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ No taxes in Boehner or Reid Plan important victory for America. _E_ From the great author of Rich Dad Poor Dad Robert Kiyosaki here is a very nice article. __HTTP__ _E_ Sometimes understanding other people's problems is the key to finding opportunities. Midas Touch w/@atheRealKiyosaki _E_ The Republican Party must spend its money wisely and do incredible television commercials. They must be tough and smart. _E_ The people of Colorado had their vote taken away from them by the phony politicians. Biggest story in politics. This will not be allowed! _E_ Have you seen the new #Trump fall collection exclusively available @Macys? Top selling brand nationwide.Ties shirts fragrance great gifts. _E_ WASTE HUD is spending $70M to teach grant recipients how to spend the money from their grants __HTTP__ Does it get any dumber? _E_ I am the best builder but if that were my building with the crane mishap I would have been lambasted from coast to coast. _E_ I love Twitter.... it's like owning your own newspaper without the losses. _E_ The WH yesterday defended Biden's comments that the Taliban aren't our enemy. When did the American people decide this? __HTTP__ _E_ RT @IvankaTrump: Thank you to the amazing men and women working tirelessly to bring relief to those in need. 
#PuertoRico #HurricaneMaria ht... _E_ Thank you Carl Higbie (former Navy Seal) for you support of my plan to straighten out the Veterans Administration a mess!Great job @kilmeade _E_ Luther Strange of the Great State of Alabama has my endorsement. He is strong on Border & Wall the military tax cuts & law enforcement. _E_ Wow @CNN is really working hard to make me look as bad as possible. Very unprofessional. Hurting in ratings bad television! _E_ Our military is building and is rapidly becoming stronger than ever before. Frankly we have no choice! _E_ My interview with @IngrahamAngle discussing @MittRomney's Super Tuesday and why @BarackObama must be defeated. __HTTP__ _E_ When will people realize that @billmaher is not an intellectual but actually a rather dumb guy—just look at his past. _E_ Our new @MissUniverse Olivia Culpo is not only beautiful but intelligent and accomplished. She is a wonderful role model. _E_ It was truly an honor to introduce my wife Melania. Her speech and demeanor were absolutely incredible. Very proud! #GOPConvention _E_ It was an honor to welcome the Teachers of the Year to the WH last month. Today we honor and thank all teachers!... __HTTP__ _E_ No member of Congress should be eligible for re election if our country's budget is not balanced deficits not allowed! _E_ Sarasota was an unbelievable success. We expected 5000 a record but 12000 showed up! Great love in the air! __HTTP__ _E_ We better get tough with RADICAL ISLAMIC TERRORISTS and get tough now or the life and safety of our wonderful country will be in jeopardy! _E_ I hope Arnold S. does well with the Apprentice because he is a nice guy and also because I get a big percentage of the profits! _E_ I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_ On the way to the #GOPDebate with my wonderful wife @MelaniaTrump. __HTTP__ __HTTP__ _E_ Entrepreneurs: View any conflict as an opportunity. 
Being positive could lead you into a fortunate situation. _E_ Congratulations to Billy Payne and @AugustaNational on doing the right thing. _E_ Reporter should resign __HTTP__ _E_ The road to success is always under construction. Arnold Palmer _E_ Do we still want a President who bows to the Saudis and lets OPEC rip us off? Make America strong vote for @MittRomney. _E_ Why doesn't @CNN use the #CNN Iowa poll? @andersoncooper @andydean2014 _E_ ... & all Obama is concerned about stopping them doing is buying wind farms __HTTP__ _E_ I don't know why but I feel so sorry for dummy reporter John Heilemann when I watch him on television. _E_ I look forward to the debate on Thursday night & it is certainly my intention to be very nice & highly respectful of the other candidates. _E_ It was my honor. THANK YOU! __HTTP__ _E_ My interview with @Jay_Severin on behalf of @MittRomney discussing why the GOP must nominate @MittRomney __HTTP__ _E_ Congratulations to our new National Security Advisor General H.R. McMaster. Video: __HTTP__ __HTTP__ _E_ My induction last night at Madison Square Garden into the WWE Hall of Fame was amazing I met some great people including Bruno. _E_ Why hasn't Obama created jobs? _E_ US froze $8B in Iranian assets during '79 Hostage Crisis. Now Obama is giving it back to Iran while Christian Pastor is jailed. Don't do it! _E_ Wind farms are ugly not cost effective and don't produce worthwhile returns or energy. No wonder governments are giving up on them. _E_ Kim Jong Un of North Korea made a very wise and well reasoned decision. The alternative would have been both catastrophic and unacceptable! _E_ A total refutation of the disgraceful David Brooks column in the failing @NYTimes by the @WashingtonPost: __HTTP__ _E_ Another four years not good for the country but we'll have to live with it! 
_E_ Trump is already delivering the jobs he promised America __HTTP__ _E_ Obama hasn't released a budget in over 2 years & for the 1st time House & Senate delivered budgets before him __HTTP__ _E_ Trump Turnberry news conference tomorrow at noon Scotland time. The place is amazing! _E_ "The vast majority felt she should be prosecuted... even senior FBI officials thought Crooked was guilty. __HTTP__ _E_ It is truly amateur hour at the White House and this is why we should not be doing the war thing right now! _E_ May God Forever Bless the United States of America. #NeverForget911 __HTTP__ _E_ We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_ Keep Wednesday morning free. You will want to see this! _E_ My condolences to Dwyane Wade and his family on the loss of Nykea Aldridge. They are in my thoughts and prayers. _E_ America's debt officially became 100% of our GDP on @BarackObama's 50th birthday coincidence? _E_ Obama Care is already having a devastating impact on our economy. _E_ Will do thanks. __HTTP__ _E_ Let me say it as clearly as possible that the attack on my Catholic brothers and sisters is an attack on me. @GovMike Huckabee _E_ Scary thought is the sexual pervert Anthony Weiner now in Charlotte? Did he bring his phone with him? _E_ ....and has been horrible on Virginia economy. Vote @EdWGillespie today! _E_ We still have not learned the full truth on Benghazi. Four Americans were killed. Congress must act! _E_ Eli Manning staged a great comeback in 4th quarter an elite quarterback. _E_ .@CarlyFiorina had to inject herself into my factual statements concerning Ben Carson in order to breathe life into her failing campaign! 
_E_ "Donald Trump's Miss USA Pageant Scores $5 Million Legal Victory Following Rigged Claims" __HTTP__ via @eonline _E_ Via @WashTimes by @SethMcLaughlin1: "Donald Trump: I want to run for president 'so badly'" __HTTP__ _E_ RT @joegooding: What's happening in our country isn't just an assault on our @POTUS @realDonaldTrump it's an assault on the American people... _E_ .@AP continues to do extremely dishonest reporting. Always looking for a hit to bring them back into relevancy—ain't working! _E_ Thanks Lou. __HTTP__ _E_ I called Chuck Schumer yesterday to see if the Dems want to do a great HealthCare Bill. ObamaCare is badly broken big premiums. Who knows! _E_ RT @FoxNews: U.S. Markets since election. __HTTP__ _E_ Great. Just reported on @FoxNews that many people who supported @JebBush are now supporting me. I knew that would happen pundits didn't! _E_ "The most important political office is that of the private citizen." Justice Louis D. Brandeis _E_ Glad to hear patriotic Americans are organizing a movement this August to boycott Chinese products __HTTP__ People get it! _E_ Report: "ANTI TRUMP FBI AGENT LED CLINTON EMAIL PROBE" Now it all starts to make sense! _E_ .@CNBC continues to report fictious poll numbers. Number one based on every statistic is Trump (by a wide margin). They just can't say it! _E_ Thank you to the 2500+ in North Augusta South Carolina. Lines down the block! Don't forget to VOTE on Saturday! __HTTP__ _E_ .@MarketMavensInc #asktrump __HTTP__ _E_ Environmental regulations stop Border Patrol from protecting 40% of the border __HTTP__ A coup for the migrant Democrats. _E_ Received a beautiful letter from Joe Paterno's son Jay. He really loved and respected his father. _E_ Wow did you just hear Bill Clinton's statement on how bad ObamaCare is. Hillary not happy. As I have been saying REPEAL AND REPLACE! _E_ Without focus it's just impossible to be successful at anything. Midas Touch _E_ Afghanistan leaders want the U.S. 
to keep 20 000 troops there for many more years fully paid for by the U.S. but first they want apology. _E_ ESPN is paying a really big price for its politics (and bad programming). People are dumping it in RECORD numbers. Apologize for untruth! _E_ Learning never exhausts the mind. Leonardo da Vinci _E_ Gross negligence by the Democratic National Committee allowed hacking to take place.The Republican National Committee had strong defense! _E_ The Art of the Deal = #1 business book. Over 3 million copies sold. Forbes Article from Oct. 20 2014. __HTTP__ _E_ After Super Tuesday every GOP candidate should take a long hard look at their prospects and drop out if they can't get the nomination. _E_ ..Ryan died on a winning mission ( according to General Mattis) not a failure. Time for the U.S. to get smart and start winning again! _E_ Every sport evolves. Every sport gets bigger and more athletic and you have to keep up. Tiger Woods _E_ RT @Scavino45: .@POTUS @realDonaldTrump and @UN Secretary General @AntonioGuterres pose for📸prior to their expanded bilateral meeting. #USA... _E_ Vanity Fair is failing. Newstand sales are down 20 percent 2nd most for major magazines and the magazine has (cont) __HTTP__ _E_ Which National Costume do you think should win? __HTTP__ _E_ One of the reasons Hillary hid her emails was so the public wouldn't see how she got rich selling out America. __HTTP__ _E_ Really sad news: The great Arnold Palmer the King has died. There was no one like him a true champion! He will be truly missed. _E_ Yom Kippur blessings to all of my friends in Israel and around the world. #YomKippur _E_ Just met with David Perdue @Perduesenate. He's a fantastic guy who will fight hard against ObamaCare. He will win! _E_ I will write a $2 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_ "The cheapest natural gas in the world is in the United States." @boonepickens _E_ Great job on @CNN tonight @heytana. We are all proud of you! 
Also congrats on a great son he is going places. _E_ Wow! Such nice words from Robert Redford on my running for President. Thank you Robert. __HTTP__ _E_ .@DennisRodman must be thinking of North Korea. #CelebApprentice _E_ This is a terrific day for downtown New York. Trump SoHo is unlike anything else. Be sure to visit this fantastic hotel soon! _E_ .@davidaxelrod I hope your book is better than the Obama second book but it is inaccurate as it pertains to me but no big deal boring! _E_ I think Senator Blumenthal should take a nice long vacation in Vietnam where he lied about his service so he can at least say he was there _E_ Thank you Wilmington North Carolina. We are 3 days away from the CHANGE you've been waiting for your entire life!... __HTTP__ _E_ "It's a good idea to take your own pulse once in a while instead of focusing on what the masses are doing." – Think Like a Champion _E_ I want to see @BarackObama's college records to see how he listed his place of birth in the application. _E_ Great news here comes the Tea Party! @MittRomney has received 42k donations online & raised over $4.2 million since the ObamaCare decision. _E_ Just landed in Bedminster New Jersey. #MAGA __HTTP__ _E_ QE3 is going to further sink the dollar into oblivion. Creates artificial numbers for short term market gains. (cont) __HTTP__ _E_ Yesterday our national debt topped a record $18T. Over 44% has accrued under Obama. A real mess. _E_ Reigning @ApprenticeNBC Champion @TraceAdkins does great work with @wwpinc. Donate to an Injured Warrior today __HTTP__ _E_ The Iran deal poses a direct national security threat. It must be stopped in Congress. Stand up Republicans! _E_ Great! __HTTP__ _E_ Unbelievable support in Florida last night thank you! #MAGA __HTTP__ _E_ Obama is making the Ebola problem much worse than it needs to be in the U.S. by not halting flights from West Africa. 
Airport testing a joke _E_ Ranked a top course @GolfMagazine & 6 Star Diamond Award Trump Int'l Palm Beach has been expanded to 27 holes __HTTP__ _E_ Do not view any failure as the end. Learn your lessons quickly then move on. Do not dwell on failure. Start thinking big again. _E_ Each and every new event space at @TrumpDoral looks stunning. See the transformation for yourself: __HTTP__ _E_ Yesterday was Matt Drudge's birthday Happy Birthday @DRUDGE and great job! _E_ RT @EricTrump: Mathematically it is statistically impossible for Kasich to get to 1237 he would need 112% of the remaining delegates to b... _E_ My @SquawkCNBC interview discussing the Republic of Georgia taxes the fledgling economy and Facebook __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Being true to yourself equals being true to your brand.That's the solid foundation that will keep your brand flourishing. Midas Touch _E_ The Super Committee is finding ways to raise all our taxes without admitting it. The Republicans made a big mistake agreeing to this deal. _E_ Trump arrives for SC Tea Party Convention in Myrtle Beach __HTTP__ via @WCBD _E_ Crooked Hillary Clinton just can't close the deal with Bernie. I had to knock out 16 very good and smart candidates. Hillary doesn't have it _E_ This is a storm of enormous destructive power and I ask everyone in the storm's path to heed ALL instructions from government officials. __HTTP__ _E_ I am sure the @NCGOP will do a great job bracketing the @DNC convention. They are a tremendous statewide organization. _E_ Notice how @BarackObama failed to mention ObamaCare last night in his SOTU. Even he knows it is terrible. _E_ A look at the Trump hotel planned for the Old Post Office pavilion __HTTP__ via @washingtonpost _E_ Conservative? Jeb Bush doubled Florida State debt! __HTTP__ _E_ Ignorance is inexcusable it's the surest way to fail. No acceptable reason exists for not being well informed. 
_E_ My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person always pushing me to do the right thing! Terrible! _E_ With panoramic views of Central Park & the Manhattan skyline 5 Star @TrumpNewYork offers 176 newly renovated rooms __HTTP__ _E_ .@WhoopiGoldberg Don't let @Rosie speak badly of you or try to bring you down. She is rude crude & not smart. She is not in your league. _E_ Unbelievable crowd of supporters in Virginia Beach Virginia. Thank you! Next stop Cleveland Ohio.... __HTTP__ _E_ Thank you California Connecticut Maryland and Pennsylvania!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ #3. Look at the solution not the problem. Learn to focus on what will give results. _E_ I'm leaving now for Burlington Vermont. It will be wild! _E_ I will be interviewed on the @oreillyfactor tonight from Florida now. Enjoy! _E_ Druggies drug dealers rapists and killers are coming across the southern border. When will the U.S. get smart and stop this travesty? _E_ Our President is a great embarrassment to the U.S. How could anybody be so dumb or know so little as to make the very stupid 5 for 1 swap? _E_ Thanks Eric. __HTTP__ _E_ China does not negotiate from a position of strength we simply negotiate against ourselves. We have all the advantages but don't execute. _E_ New poll states that a record number of Americans have lost all faith in President Obama duh! _E_ Anyway I'm all about jobs & the economy & making America great again. We're falling fast! _E_ Sorry to hear of the passing of Neil Armstrong over the weekend. He was an American hero. _E_ Thanks to Donald Trump __HTTP__ via @AmSpec. My pleasure Jeffrey! _E_ Big cancer risk from new environmental light bulbs a big price to pay! _E_ My official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_ I am on @FoxNews with @greta doing a town hall from Wisconsin now! 
Enjoy!#MakeAmericaGreatAgain #Trump2016 _E_ Millions Could Get Surprise Tax Bills Under 'Obamacare' If They Don't Accurately Project Their Income __HTTP__ _E_ Stock Market could hit all time high (again) 22000 today. Was 18000 only 6 months ago on Election Day. Mainstream media seldom mentions! _E_ Browse Donald Trump's Summer Reading List for Business Success at the Trump University Blog: __HTTP__ _E_ Will be interviewed tonight at 7 by @greta re Sony & Bush _E_ "If you want to be the best you'd better be the best – in all aspects of business." Think Like a Billionaire _E_ A signed copy of CRIPPLED AMERICA is the ultimate gift. Order now & join my live streaming book signing on 12/3 __HTTP__ _E_ The Blue Monster is being torn up at Trump @DoralResort. On April 1 I go out & play it one more time until the new course opens. _E_ My @SquawkCNBC interview discussing housing prices the GDP numbers China spreading its wealth and my stock picks. __HTTP__ _E_ Roger Goodell must stop apologizing to everyone who will listen and toughen up. His street smart players are laughing at him and the NFL! _E_ It was an honor to meet with Republic of Rwanda President Paul Kagame this morning in Davos Switzerland. Many great discussions! #WEF18 __HTTP__ _E_ I will be developing the two tallest towers in the Republic of Georgia. __HTTP__ _E_ .@washingtonpost @BretBaier Please thank Charles Lane for his new found confidence. He has made a very good bet! _E_ The Boston terrorist thugs' mother is also a radical. I am sure she will be granted citizenship shortly. _E_ I'll be on @foxandfriends on Monday at 7:30 AM. Tune in! _E_ Wow great post debate poll: Trump Increases Lead via Breitbart __HTTP__ _E_ While in the Philippines I was forced to watch @CNN which I have not done in months and again realized how bad and FAKE it is. Loser! _E_ Thank you!Mitchell FOX2 Michigan Poll finds Trump holds 3 1 lead over closest GOP opponents. 
Trump 47% Clinton 43% __HTTP__ _E_ .@penn_state leadership has permanently scarred & perhaps destroyed a great university. They should have (cont) __HTTP__ _E_ I am pleased to announce that I had the Union Leader removed from the upcoming debate. __HTTP__ _E_ I am leaving for Sioux City Iowa great event (rally). _E_ "Donald Trump pledges to make Prestwick Airport 'really successful'" __HTTP__ via @STVNews _E_ Thank you North Carolina! #Trump2016 #SuperTuesday  #MakeAmericaGreatAgain __HTTP__ _E_ If the U.S. attacks Syria and hits the wrong targets killing civilians there will be worldwide hell to pay. Stay away and fix broken U.S. _E_ Broadcom's move to America=$20 BILLION of annual rev into U.S.A. $3+ BILLION/yr. in research/engineering & $6 BILLION/yr. in manufacturing. __HTTP__ _E_ The cast has been largely selected for next year's Celebrity Apprentice. Wait 'till you hear the names AMAZING! Season 14 many nights at #1 _E_ Won $5000000 against Miss Pennsylvania Sheena Monnin for her terrible and untrue statements about Miss USA Pageant. Not a nice person! _E_ Wow @SharylAttkisson just wrote the definitive piece on what I said about John McCain __HTTP__ _E_ "Donald Trump To Be In Mason City June 4th" __HTTP__ via @KCHA _E_ .@megynkelly I am in Nevada. Sorry to inform you Kellyanne is in the audience. Better luck next time. _E_ American league wins! _E_ ... The NY Daily Snooze totally lied and never even called my kids! _E_ Expecting a great crowd of amazing people. Questions will be live! #TrumpToday _E_ "Never give up on yourself." – Think Big _E_ Oh no just reported that Ted Cruz didn't report another loan this one from Citi. Wow no wonder banks do so well in the U.S. Senate. _E_ Congratulations to Boys and Girls Nation. It was my great honor to welcome you to the WH today! Full Remarks: __HTTP__ __HTTP__ _E_ So much dishonest reporting (or non reporting) in political media—an amazing experience for me. 
@BretBaier _E_ Having a great time hosting Prime Minister Shinzo Abe in the United States! __HTTP__ __HTTP__ _E_ The @CNN panels are so one sided almost all against Trump. @FoxNews is so much better and the ratings are much higher. Don't watch CNN! _E_ #VoteTrumpHI! #Trump2016 __HTTP__ _E_ RT @DanScavino: #TrumpTrain🚂💨 __HTTP__ _E_ I will not let the families of The Remembrance Project down! #MakeAmericaSafeAgain __HTTP__ __HTTP__ _E_ An interesting cartoon that is circulating. __HTTP__ _E_ All the online polls have me winning the debate. I really enjoyed the evening. Not easy but good. __HTTP__ _E_ Re Negotiation: Persistence can go a long way. Being stubborn can be good. The key is to know when to loosen up. _E_ Dopey @ariannahuff should force her reporters to be accurate—if she has that power. _E_ TEXAS: We are with you today we are with you tomorrow and we will be with you EVERY SINGLE DAY AFTER to restore recover and REBUILD! __HTTP__ _E_ Just landed in Iowa to attend a great event in honor of wonderful Senator @JoniErnst. Look forward to being with all of my friends. _E_ "Donald Trump Takes on Apple @CPACnews" __HTTP__ via @kmbznews _E_ Go get the new book on Andrew Jackson by Brian Kilmeade...Really good. @foxandfriends _E_ Be sure to watch highlights from the record setting 14th season of @ApprenticeNBC here __HTTP__ _E_ Thank you to the greatest heroes __HTTP__ #DDay70 #WWII _E_ Hypocrites! @JamesOKeefeIII's new video shows Journal News reporters refusing to designate their homes as 'gun free' __HTTP__ _E_ My @todayshow interview where I reveal the new cast of Celebrity Apprentice and discuss the GOP primary field __HTTP__ _E_ Entrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. Get out there & go for it! _E_ With our national debt passing $16T during the @DNC convention @BarackObama has amassed more debt than the first 42 presidents. Scary. 
_E_ Hillary Clinton should not be given national security briefings in that she is a lose cannon with extraordinarily bad judgement & insticts. _E_ .@realbobmassi who does a show called Bob Massi Is The Property Man on @FoxNews really knows his stuff a total pro! _E_ So funny Jeb Bush called me a highly gifted politician and a great entertainer I assume that is a compliment! _E_ I hope everybody reads the @AmSpec article "Shakedown Schneiderman" – the AG of New York @AGSchneiderman __HTTP__ _E_ RT @KellyannePolls: #Polls showing @realDonaldTrump surging @hillaryclinton #slipping have HER camp on defense/lowering expectations goi... _E_ RT @DanScavino: OHIO GENERAL ELECTIONDonald Trump vs. Hillary Clinton#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ General John Kelly totally agrees w/ my stance on NFL players and the fact that they should not be disrespecting our FLAG or GREAT COUNTRY! _E_ Instinct has a lot to do with timing. You have to be patient & wait for your instincts to tell you the best time to make your move. _E_ Wacky @glennbeck who always seems to be crying (worse than Boehner) speaks badly of me only because I refuse to do his show a real nut job! _E_ ... and opened a full month ahead of schedule. Case is taught in Wharton. _E_ Wow I have always liked the @nypost but they have really lied when they covered me in Iowa. Packed house standing O best speech! Sad. _E_ You may want to watch David Letterman tonight I am on! _E_ Isn't it ironic that a lot of the wealthy environmentalists use private jets and fight wind farms being placed near their property? _E_ True! __HTTP__ _E_ 'Trump Helps Lift Small Business Confidence to 12 Yr. High' __HTTP__ __HTTP__ _E_ We must have Security at our VERY DANGEROUS SOUTHERN BORDER and we must have a great WALL to help protect us and to help stop the massive inflow of drugs pouring into our country! 
_E_ Our @TrumpNewYork is really starting the summer on the right foot with their #wellness program as seen in @TandCmag: __HTTP__ _E_ The storied success of Bain in private entrepreneurship and equity is one reason @MittRomney will be a great POTUS. _E_ Does anybody like Lyin' Ted? __HTTP__ _E_ I hereby demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_ Thank you America! @FoxNews post debate poll with +/ from previous poll. #VoteTrump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Thank you for your wonderful endorsement today @TGowdySC. It means a great deal to me. We will not disappoint! #Trump2016 _E_ There are now 119000 fewer Americans employed than there were in July. The economy is still terrible. _E_ Kasich just announced that he wants the people of Indiana to vote for him. Typical politician can't make a deal work. _E_ With all that is happening with Ebola including the doctor who so easily came back to New York Obama still refuses to stop the flights! _E_ Before you are a leader success is all about growing yourself. When you become a leader success is all about growing others. Jack Welch _E_ Via @DMRegister by @JenniferJJacobs: Trump adds events to his Iowa trip next month __HTTP__ _E_ Wow my poll numbers have just been announced and have gone through the roof! _E_ Does anyone else have two golf pros—John Nieporte & Jim Herman—who qualified for the U.S. Open? Could this be an all time record? _E_ ...or mentally troubled (or a con). _E_ Offering two championship courses @TrumpGolfDC has been awarded the honor of hosting the 2017 @seniorpgachamp __HTTP__ _E_ I will be live tweeting during tonight's #CelebrityApprentice 9 PM ET @NBC _E_ 26000 sexual assaults or rapes reported in military last year and that is just the number that is reported (many do not want to report). _E_ Which National Costume do you think should win? 
__HTTP__ _E_ Watching John Kasich being interviewed acting so innocent and like such a nice guy. Remember him in second debate until I put him down. _E_ Wow 25 degrees below zero record cold and snow spell. Global warming anyone? _E_ Total misnomer to call ObamaCare 'The Affordable Care Act.' Affordable for whom besides big businesses & Congress w/their exemptions? _E_ The meeting with the @nytimes is back on at 12:30 today. Look forward to it! _E_ The more you know the more you realize how much you don't know. How can you possibly discover anything if you already know everything? _E_ Great meeting with Ford CEO Mark Fields and General Motors CEO Mary Barra at the @WhiteHouse today. __HTTP__ _E_ .@unicef Caryl M. Stern CEO is driving around in a Rolls Royce... _E_ I am glad America is starting to get to know @MittRomney the way I know him. A wonderful & decent family man (cont) __HTTP__ _E_ They made up a phony collusion with the Russians story found zero proof so now they go for obstruction of justice on the phony story. Nice _E_ The New Hampshire drug epidemic must stop. If elected POTUS I will create borders & the drugs will stop pouring in. __HTTP__ _E_ After my tour of Asia all Countries dealing with us on TRADE know that the rules have changed. The United States has to be treated fairly and in a reciprocal fashion. The massive TRADE deficits must go down quickly! _E_ #VoteTrump2016 & together we will #MakeAmericaGreatAgain! THANK YOU for your support! __HTTP__ _E_ Always bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_ Via @AP by @splaisance: @realmissnvusa NIA SANCHEZ CROWNED AS 63RD @MissUSA __HTTP__ _E_ .@History's wonderful The Men Who Built America with me on tonight at 9 bad timing I'll be live tweeting the debate _E_ In your planning know how much risk you can take. Evaluate whether the returns will be worth the risk. 
_E_ The Fake News Networks are working overtime in Puerto Rico doing their best to take the spirit away from our soldiers and first R's. Shame! _E_ After @BarackObama's speech tonight which should be well delivered reality will hit Friday morning when the new jobs report is released. _E_ .@BarackObama blocked Keystone. Now China is preparing a massive $1.5B oil deal with Canada. __HTTP__ A terrible deal for US! _E_ I endorsed Luther Strange in the Alabama Primary. He shot way up in the polls but it wasn't enough. Can't let Schumer/Pelosi win this race. Liberal Jones would be BAD! _E_ Thank you @Samsung! We would love to have you! __HTTP__ _E_ .@Theresa_May don't focus on me focus on the destructive Radical Islamic Terrorism that is taking place within the United Kingdom. We are doing just fine! _E_ Ivanka on @foxandfriends now! _E_ Via @PRNewswire "Streetsense Brings The National a Geoffrey Zakarian Restaurant to DC's New Trump Intl Hotel" __HTTP__ _E_ .@MELANIATRUMP just finished being on @theviewtv by any standard she was great! _E_ Thanks to all for your thoughtful birthday wishes – Donald Trump _E_ We should remember that during this entire Petraeus episodeover 50 of our nation's bravest have died in Afghanistan... _E_ I will be going to church in Iowa this morning with my wife Melania. After church I will be making two speeches and touring the State! _E_ Alex Rodriguez has played under 140 games in each of the last five seasons. He will miss half of next season. Really bad deal for @yankees. _E_ For last minute shopping my new book #TimeToGetTough is a great choice... __HTTP__ _E_ "You have to learn the rules of the game. And then you have to play better than anyone else." – Albert Einstein _E_ #TrumpVlog Will our brave soldiers catch Ebola? __HTTP__ _E_ Lance Armstrong is now going to admit guilt—can that be possible after many years of denying? Just go away Lance. _E_ Entrepreneurs: Don't sell yourself short. 
Don't ever think you've done it all already or that you've done your best. _E_ Obama's carbon tax plan will finance more windmills in America. More real estate depreciated wildlife killed incl. bald eagles _E_ RT @USArmy333: @804StreetMedia @realDonaldTrump He's done more in 9 months then obama did in8 yrs _E_ THANK YOU NEVADA! WE WILL MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ __HTTP__ _E_ #TrumpAdvice __HTTP__ _E_ RT @AmericaFirstPol: .@POTUS Trump led a historic journey to the White House. 50 days in that historic journey continues. Take a look 👉 ht... _E_ Thank you to all for the wonderful reviews of my foreign policy speech. I will soon be speaking in great detail on numerous other topics! _E_ .@RICKYMONEY I don't know a lot about failures. And as you know I never went bankrupt. _E_ Now A Rod is claiming that @MLB and @yankees are out to get him' __HTTP__ He should just get the hell out of NYC already! _E_ #1 for success: Find out what you love to do. Trust yourself enough to find out what is best for you and what you're best at doing... _E_ I will be live tweeting President Obama's prime time speech tonight starting at 7:50 P.M. (Eastern).Will he finally state the real problems? _E_ Headline reads Rubio passes Bush in Florida poll Unfair because Trump destroys them both! Trump 31.5% Rubio 19.2% Bush 11.3% _E_ Thank you Tennessee! #Trump2016#SuperTuesday _E_ Good news @MittRomney has pulled ahead in Wisconsin __HTTP__ WIth @PaulRyanVP on the ticket Wisconsin is in play. _E_ Thank you West Chester Pennsylvania!#PAPrimary #VoteTrump __HTTP__ __HTTP__ _E_ We must keep evil out of our country! _E_ THANK YOU LAS VEGAS NEVADA!#NevadaCaucus #VoteTrumpNV __HTTP__ __HTTP__ _E_ JOIN ME IN OHIO TOMORROW!Springfield 1pm: __HTTP__ 4pm: __HTTP__ 7pm... __HTTP__ _E_ Why isn't President Obama working instead of campaigning for Hillary Clinton? 
_E_ Why did Vince and the WWE give my speech and segment the most time last night on USA Network because that's what people want to see! _E_ Trump: 'Terrible traitor' Snowden embarrassing US __HTTP__ via @thehill by @JTSTheHill _E_ Attention Arnold Palmer: Happy Birthday Arnold. There is no one like you The King! @KingdomMag __HTTP__ _E_ Losers such as George Will and @Rosie use me to get publicity for themselves. They are strictly third rate. _E_ THANK YOU! _E_ Join me in Florida tomorrow!MIAMI 12pm __HTTP__ __HTTP__ __HTTP__ _E_ The Debt is our nation's greatest threat. @BarackObama is out of touch. _E_ COURT FINDS IN FAVOR OF TRUMP UNIVERSITY __HTTP__ _E_ Well the Special Elections are over and those that want to MAKE AMERICA GREAT AGAIN are 5 and O! All the Fake News all the money spent = 0 _E_ Winners I am convinced imagine their dreams first. They want it with all their heart and expect it to come true. Joe Montana _E_ Dummy Graydon Carter doesn't like me too much...great news. He is a real loser! @VanityFair _E_ Remember negotiations are fluid. Remain calm and don't settle easily. If you have the goods you will ultimately win. _E_ I heard because his show is unwatchable that @Lawrence has made many false statements last night about me. Maybe I should sue him? _E_ "You can't wear a blindfold in business. A regular part of your day should be devoted to expanding your horizons." – Trump: How to Get Rich _E_ Also The Donald J. Trump Signature mattress from SERTA is doing record business call Serta and see why! _E_ Remember the Republicans are 5 0 in Congressional races this year. In Senate I said Roy M would lose in Alabama and supported Big Luther Strange and Roy lost. Virginia candidate was not a "Trumper" and he lost. Good Republican candidates will win BIG! 
_E_ Via @politicalwire: "Trump Not Happy with Republicans" __HTTP__ _E_ I have chosen one of the truly great business leaders of the world Rex Tillerson Chairman and CEO of ExxonMobil to be Secretary of State. _E_ Dishonest media says Mexico won't be paying for the wall if they pay a little later so the wall can be built more quickly. Media is fake! _E_ The White House has just admitted Al Qaeda was involved in Benghazi __HTTP__ What about the video tape? _E_ Trump Finalizes Agreement For Trump International Hotel The Old Post Office Building Washington D.C. __HTTP__ _E_ Little Marco Rubio gave amnesty to criminal aliens guilty of sex offenses. DISGRACE! __HTTP__ _E_ Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_ The brand new hotel at Trump National Doral has the most beautiful rooms and suites in Miami. Enjoy! _E_ .@ZachJohnsonPGA You're one of the truly great competitiors. I've said it for years. Great going winning @OpenChampionship Not surprised! _E_ All polls have me winning debate big Drudge TIME etc. Dopey Charles Krauthammer still nasty. He has zero cred totally dishonest! _E_ Russia has more warheads than ever N Korea is testing nukes and Iran got a sweetheart deal to keep theirs. Thanks @HillaryClinton. _E_ President Obama be cool be smart be sharp and FOCUS (no more March Madness) and you can beat Putin at his own game. IT CAN BE DONE! _E_ Sorry but this is years ago before Paul Manafort was part of the Trump campaign. But why aren't Crooked Hillary & the Dems the focus????? _E_ Via @JTAnews and Jason Greenblatt Donald Trump is a Visionary With Talents Our Country Needs @JasonDovEsq __HTTP__ _E_ Never said anything derogatory about Haitians other than Haiti is obviously a very poor and troubled country. Never said "take them out." Made up by Dems. I have a wonderful relationship with Haitians. Probably should record future meetings unfortunately no trust! 
_E_ Glad to hear that @RobinRoberts is doing well. She is a terrific person. _E_ Via @itp_ab by @ctrenwith: 'Trump effect' will see Dubai properties rise 50%" __HTTP__ _E_ Obama has called Libya attack a bump in the road and not optimal. Just come clean already tell Americans the truth! _E_ Democrats purposely misstated Medicaid under new Senate bill actually goes up. __HTTP__ _E_ I said simply that the Mexican leaders and negotiators are smarter than ours and that the Mexican gov't is pushing their hard core to U.S. _E_ .@RGIII & @DangeRussWilson & Luck are very special players will be great playoff games. _E_ I am somewhat surprised that Bernie Sanders was not true to himself and his supporters. They are not happy that he is selling out! _E_ Thank you @krauthammer for your nice comments on @oreillyfactor. A lot of progress is being made! _E_ As the @BarackObama's took their 16th vacation this month unemployment is back to 9% and underemployment at (cont) __HTTP__ _E_ Cruz caught cold in lie after denial of push polls like lies w/ @RealBenCarson. How can he preach Christian values? __HTTP__ _E_ Got to know Senator @JohnKerry in Aspen Colorado years ago—a very solid and stand up guy. _E_ I can't believe David Letterman has announced his retirement he is a great guy! @Letterman _E_ Here's the deal: when your secretary of defense tells you that your proposed cuts will erode America's military (cont) __HTTP__ _E_ Great mtg w/ @Cabinet today. Tomorrow I will be announcing the new head of the Fed. I think you will be extremely impressed by this person! __HTTP__ _E_ Rated @GolfMagazine as 1 of the top courses in the country Trump Int'l Palm Beach has been expanded to 27 holes __HTTP__ _E_ Always pretend that you're working for yourself. You'll do a wonderful job. It's simple but it works. _E_ If you think big you will encounter big setbacks from time to time. What really matters is how you respond to them. 
Think Big _E_ Such a nice article in the New York Times about a wonderful developer Arthur Zeckendorf __HTTP__ _E_ Highest Stock Market EVER best economic numbers in years unemployment lowest in 17 years wages raising border secure S.C.: No WH chaos! _E_ Wishing everyone a wonderful Independence Day holiday weekend a great celebration for a great country. _E_ Miss Pennsylvania is just looking for free publicity at the expense of the real winner of Miss USA Olivia Culpo. _E_ Via CBSWashDC: "114 Year Old DC Building a Step Closer to Becoming Trump's Latest Hotel" __HTTP__ _E_ Departing The Pentagon after meetings with @VP Pence Secretary James Mattis and our great teams. #MAGA __HTTP__ _E_ General Petreus and his family are paying a big price! _E_ Great numbers on Stocks and the Economy. If we get Tax Cuts and Reform we'll really see some great results! _E_ There's no love lost between @latoyajackson & @OMAROSA Disrespectful? Who is being disrespectful? #CelebApprentice _E_ I had a great time answering as many questions as possible in sixty seconds at @facebook NY today __HTTP__ _E_ Press conference at The Old Post Office in D.C. __HTTP__ _E_ Thank you Massachusetts! #Trump2016 #SuperTuesday _E_ Get our Marine out of Mexico. __HTTP__ _E_ Pennsylvania poll just released. Two rallies there on Mon join me!Ambridge: __HTTP__ Barre:... __HTTP__ _E_ "How to travel like a billionaire! Inside Donald Trump's £63m private jet" __HTTP__ via @travelmail by @AndreaMagrath _E_ I don't know Dennis Kozlowski who made Tyco into a great company & then went to prison but he's up for parole—let him go! _E_ Senator Marco amnesty Rubio who has worst voting record in Senate just hit me on national security but I said don't go into Iraq. VISION _E_ Entrepreneurs: Keep your momentum going! It's a big factor in sustaining your success. Keep moving forward! _E_ Sleepy eyes @chucktodd—one of the dumbest voices in politics is angry that I'm doing @ThisWeekABC. 
_E_ Mika Brzezinski: Dem Criticism of Comey Reinforcing Idea 'There's Something There' __HTTP__ __HTTP__ _E_ A year ago today a diplomat and 3 security operatives were abandoned by our government while they were under attack. Never forget! _E_ Let's all take a moment to remember all of the heroes from a very tragic day that we cannot let happen again! _E_ President Obama NOW bring our 4000 innocent and ill trained soldiers home from West Africa before it is too late AND STOP THE FLIGHTS! _E_ Lightweight Marco Rubio was working hard last night. The problem is he is a choker and once a choker always a choker! Mr. Meltdown. _E_ Entrepreneurs: Keep the big picture in mind. There are always opportunities and thinking too small can negate a lot of them. _E_ The @washingtonpost loses money (a deduction) and gives owner @JeffBezos power to screw public on low taxation of @Amazon! Big tax shelter _E_ We finally agree on something Rosie. __HTTP__ _E_ A Call for Unity by Jason Greenblatt @JasonDovEsq __HTTP__ _E_ #CrookedHillary __HTTP__ _E_ For all of my millions of followers and at your request I will be tweeting tonight during President Obama's speech! 9pm ET _E_ .@michellemalkin & @BuzzFeedAndrew: "Vaccine court awards millions to two autistic children damaged by vaccine" __HTTP__ _E_ A special message to the staff of @TrumpWaikiki in celebration of the 2nd anniversary.... __HTTP__ _E_ Personally I'm glad the NYPD is monitoring the actions of certain extremists. New York's finest! I support them. _E_ What is your favorite @THEGaryBusey film? Tonight's short film? Point Break? Lethal Weapon? #CelebApprentice _E_ Serious voter fraud in Virginia New Hampshire and California so why isn't the media reporting on this? Serious bias big problem! _E_ Sanctions Relief From Clinton Obama Iran Nuclear Deal Likely Go to Terrorists: __HTTP__ #BigLeagueTruth #VPDebate _E_ Via @CBSmiami by @LisaPetrillo: "Trump Unveils Renovated @TrumpDoral Red Tiger Golf Course" __HTTP__ _E_ Sen. 
Lindsey Graham embarrassed himself with his failed run for President and now further embarrasses himself with endorsement of Bush. _E_ Heading back to Washington D.C. Much will be accomplished this week on trade the military and security! _E_ Crooked Hillary has ZERO leadership ability. As Bernie Sanders says she has bad judgement. Constantly playing the women's card it is sad! _E_ Via @Newsmax_Media: Trump at CPAC: What Really Happened __HTTP__ _E_ We spent over a billion on Libya and lead the way why is Europe getting the oil? _E_ thilan_GolfSwag @realDonaldTrump Played Doral for the first time. absolutely great course! Fantastic job! Thanks. _E_ See Schneiderman admit he spoke with Obama about "ongoing investigations. __HTTP__ _E_ Honored to welcome Republican and Democrat members of the House Ways and Means Committee to the White House today! #USA __HTTP__ _E_ Just said at #NCGOPcon that politicians are all talk and no action and we are all tired of it! We need action and results to move forward! _E_ See what I have to say about Iran and Iraq in today's #trumpvlog... __HTTP__ _E_ .@Neilyoung one of my favorite musicians in my office. __HTTP__ _E_ RT @VP: .@POTUS is committed to the health & well being of the US people & we are confident Dr. Jerome Adams will succeed as our new surgeo... _E_ Because Gov. Kasich cannot run in the state of Pennsylvania he cannot win the nomination & should not be allowed to compete in Ohio on Tue. _E_ This is good news: @MittRomney is now leading in Michigan by 6 points according to @RasmussenPoll __HTTP__ _E_ Get ready to turn to NBC for CELEBRITY APPRENTICE TONIGHT'S SHOW IS GREAT! _E_ Support Coach Kennedy and his right together with his young players to pray on the football field. Liberty Institute just suspended him! _E_ Yankees can win today. Kuroda is a highly underrated pitcher. _E_ The protesters in California were thugs and criminals. Many are professionals. They should be dealt with strongly by law enforcement! 
_E_ "The longer you play the better chance the better player has of winning." @jacknicklaus _E_ There won't be any new gun legislation. No surprise. Americans support the 2nd amendment. _E_ Scary thought what is the pervert Anthony Weiner doing with all the free time he has. Does he collect unemployment? _E_ Al Shabbab not ISIS just made a video on me they all will as front runner & if I speak out against them which I must. Hillary lied! _E_ What the hell is Obama doing in allowing all of these potentially very sick people to continue entering the U.S.! Is he stupid or arrogant? _E_ Via @limbaugh: "See Trump Told You So" __HTTP__ _E_ My video response to President Obama's lack of transparency. __HTTP__ _E_ State Department has not revoked a single passport of ISIS Americans __HTTP__ We should send them to Gitmo for some R&R. _E_ Iran is moving troops into Iraq under the guise that it is helping out. Actually they will take over Iraq and all of their oil. Stupid U.S. _E_ Don't worry getting rid of state lines which will promote competition will be in phase 2 & 3 of healthcare rollout. @foxandfriends _E_ .@TrumpSoHo features a striking glass walled building w/ loft inspired interiors __HTTP__ NYC's trendiest luxury hotel _E_ Great article by @jameshohmann @politico explaining why @KarlRove was biggest loser @CPACnews __HTTP__ James is sharp. _E_ I will be interviewed on the @TODAYshow at 7:30. Enjoy! _E_ Via @MiamiHerald: Donald Trump aims to bring luxury to Doral Golf Resort & Spa __HTTP__ @DoralResort _E_ President Reagan had it right: Social Security is here to stay. We must root out the fraud and make it more (cont) __HTTP__ _E_ People the lawyers and the courts can call it whatever they want but I am calling it what we need and what it is a TRAVEL BAN! _E_ "Most people think small because most people are afraid of success afraid of making decisions afraid of winning" The Art of the Deal _E_ True thanks. 
__HTTP__ _E_ I'll be on @foxandfriends Monday morning at 7:30 AM. Tune in! _E_ The failing @nytimes has disgraced the media world. Gotten me wrong for two solid years. Change libel laws? __HTTP__ _E_ Obama/Reid/Nunn's failed economic policies are not working. @PerdueSenate will bring fresh perspective to solving problems. #GASen _E_ Spent the weekend in LA checking out Trump National Golf Club on the Pacific Ocean. An amazing place! __HTTP__ _E_ RT @CBSNews: WATCH NOW: The @realDonaldTrump supporters you'd never expect __HTTP__ __HTTP__ _E_ Doing an interview with @SteveDeaceShow. Discussing the ObamaCare web disaster. Be sure to listen __HTTP__ _E_ Snowden is doing great damage to our relations with other countries and U.S.prestige. China is laughing at us as he continues illegal action _E_ Final poll results from NBC on last nights Commander in Chief Forum. Thank you! #ImWithYou #MAGA __HTTP__ _E_ The failing @NYDailyNews destroyed by little Morty Zuckerman is preparing to close and save face by going online. It's dead! _E_ .@ApprenticeNBC Season 13 still #1 at 10PM in all key demos despite having to serve as our own lead in from 9 10. 11PM News loves Trump! _E_ Do you believe that The State Department on NEW YEAR'S EVE just released more of Hillary's e mails. They just want it all to end. BAD! _E_ Millions without electricity across NY & NJ. The media has covered for Obama's massive failure. Can you imagine if this was another Pres? _E_ Just left Oklahoma the most amazing crowd and people! What a night! _E_ My @Newsmax_Media interview from Friday where I predicted that @newtgingrich in South Carolina would change the race. __HTTP__ _E_ Highly respected author Christopher Bedford just came out with book The Art of the Donald Lessons from America's.... Really good book! _E_ The best vision is insight. Malcolm Forbes _E_ Follow @MELANIATRUMP's jewelry line on @QVC site __HTTP__ _E_ The #CelebrityApprentice Sunday night on NBC at 9 PM. 
Another exciting episode is ready to go. __HTTP__ _E_ How is Bernie Sanders going to defend our country if he can't even defend his own microphone? Very sad! _E_ Republicans must unite to defund Obamacare it will drive our country into oblivion and by the way the healthcare is no good anyway! _E_ Will be doing a big interview tonight with Bret Baier at 6:00 P.M. on Fox. Don't miss it! _E_ Thank you Waukesha Wisconsin! Full transcript of my speech #FollowTheMoney: __HTTP__ __HTTP__ _E_ It was an honor to welcome the Prime Minister of Denmark Lars Løkke Rasmussen {@larsloekke} to the @WhiteHouse yes... __HTTP__ _E_ Wishing @FLOTUS Melania and all of the great mothers out there a wonderful day ahead with family and friends! Happy #MothersDay _E_ USMC Sgt. Tahmooressi has now been held in Mexican jail for over 150 days. When will Obama call for his release? #FreeOurMarine _E_ The oil reserve is a strageic asset for a time of war and an embargo. @BarackObama should open more land for drilling not tap the reserve. _E_ Briarcliff Manor Mayor Vescio is doing a terrible job. Taxes way too high roads in terrible condition—repave Pine Road. @BriarcliffManor _E_ Journal News readership is already down 50 percent over the years. _E_ Right now we have a president and a Treasury secretary who shrug while China tears away hundreds of thousands (cont) __HTTP__ _E_ "Happiness is not something ready made. It comes from your own actions." @DalaiLama _E_ Best of luck to @chucktodd on his @meetthepress debut this Sunday. _E_ .@McIlroyRory What a year it has been for you and this weekend topped it off. Fantastic job see u at Doral. _E_ I'll be tweeting live tonight starting at 9PM ET re:@ApprenticeNBC. Don't worry other time zones I will give nothing away! _E_ Great reception in D.C. At the Values Voter Summit. Now checking on my job at the Old Post Office... _E_ The problem w/ the concept of global warming is that the U.S. 
is spending a fortune on fixing it while China & others do nothing! _E_ Via @foxnewslatino: "Donald Trump Plans Huge Towers In Rio For Post Olympic Building Boom" __HTTP__ _E_ This is dangerous: @BarackObama is seeking to shrink Israeli military funding but gives $1.3Billion to Muslim (cont) __HTTP__ _E_ Looks like the Bernie people will fight. If not their BLOOD SWEAT AND TEARS was a total waste of time. Kaine stands for opposite! _E_ Obama and Kerry are bungling Syria by the hour. They have set America's deterrence & stature back by years. Amateurs! _E_ According to many ISIS was given so much time and so many signals as to when we would start bombing that they were able to prepare and hide _E_ .@VattenfallGroup lead investor in Aberdeen windfarm fiasco has dropped out—project not economically viable & protestors hate it. _E_ Eli Wallach was a great actor and a great guy. My opinion his performance in The Good the Bad and the Ugly was his all time best! _E_ ...Get along & make deals for the good of the country! _E_ Due diligence includes increasing your financial IQ daily. _E_ Remember no one ever said success was easy.Good luck doesn't come overnight.But if u work hard & love it u will find success & luck. _E_ Obama is an easy target on foreign policy.@MittRomney has many openings to attack especially when Obama starts bragging about Bin Laden. _E_ I answered some of your questions in today's video... __HTTP__ _E_ According to many and while nominated I would have won the Emmy many times except for my politics. @PrimetimeEmmys _E_ It's not climate changeit's global warming.Don't let the dollar sucking wiseguys change names midstream because the first name didn't work _E_ Getting ready to leave for my GREAT resort Turnberry in Scotland. Hosting The Women's British Open (biggest tournament). Will be back Sat. _E_ ...and safe. Questions were asked about why the CIA & FBI had to ask the DNC 13 times for their SERVER and were rejected still don't.... 
_E_ Worthless @NYDailyNews which dopey Mort Zuckerman is desperately trying to sell has no buyer! Liabilities are massive! _E_ A record high 6.7% of Americans are living in extreme poverty. This is tragic. We can do better. _E_ Governor Rick Perry said Donald Trump is one of the most talented people running for the Presidency I've ever seen. Thank you Rick! _E_ Re Life: Life is very fragile and success doesn't change that. If anything success makes it more fragile. _E_ The mother of the Boston killers (not suspects) says her boys are totally innocent and were set up I can see the 14 year long defense now! _E_ RT @DonaldJTrumpJr: Donald Trump Jr. On The Record: Why Trump International Hotels And Residences Are Still Winning via @forbes __HTTP__ _E_ #CNNDebate Winning the @drudge_report poll __HTTP__ _E_ I'm leaving for Iowa now will be great! _E_ Fan favorite @LilJon once again shines in the record 13th season of 'All Star' @CelebApprentice. He is an amazing & wonderful guy! _E_ Watching Hurricane closely. My team which has done and is doing such a good job in Texas is already in Florida. No rest for the weary! _E_ LIVE on #Periscope: Join me for a few minutes in Pennsylvania. Get out & VOTE tomorrow. LETS #MAGA!! __HTTP__ _E_ Whether you think you can or think you can't you're right. Henry Ford _E_ RT @piersmorgan: BOOM! Thank you Mr President. Trophy hunting is repellent. __HTTP__ _E_ .@WSJ and dopey Karl Rove made a mistake and purposely mischaracterized my statement on the terrible TPP deal. __HTTP__ _E_ Dark Knight Rises is projected to gross over $180 million this weekend. Remember to watch for Trump Tower! _E_ .@HillaryClinton's tax hikes will CRUSH our economy. I will cut taxes BIG LEAGUE. __HTTP__ __HTTP__ _E_ Be ready for problems. You'll have them every day so keep things in perspective. Ask yourself: Is this a blip or is it a catastrophe? _E_ Think of this: After we spent $2 trillion on Iraq Baghdad is about to be taken over by ISIS. _E_ ... 
debut her first 2013 "Melania® Timepieces & Fashion Jewelry" collection! _E_ I made a lot of money in Atlantic City and left 7 years ago great timing (as all know). Pols made big mistakes now many bankruptcies. _E_ BTW The Miss USA pageant was the highest rated non sports telecast on the Big 4 networks. Congrats to our newly crowned @Nia_Sanchez_! _E_ I am following the Trayvon Martin case carefully. It's a terrible situation that should never have happened. (cont) __HTTP__ _E_ Bill Cosby is foolish stupid or getting bad advice in remaining silent if he is innocent. Probably guilty! Not a fan. _E_ We are suffering through the worst long term unemployment in the last 70 years. I want change Crooked Hillary Clinton does not. _E_ Thank you Delaware! #Trump2016 __HTTP__ _E_ Happy Birthday @DonaldJTrumpJr! __HTTP__ _E_ Great crowd in Johnstown Pennsylvania thank you. Get out & VOTE on 11/8! Watch the MOVEMENT in PA. this afternoon... __HTTP__ _E_ A great deal of good things happening for our country. Jobs and Stock Market at all time highs and I believe will be getting even better! _E_ #USAatUNGA#UNGA __HTTP__ _E_ We will now be helping Syria and Iran by attacking ISIS ironic isn't it! _E_ Via @WWE: Donald Trump announced for WWE Hall of Fame __HTTP__ _E_ Situated in the heart of downtown Toronto the 65 story @TrumpTO offers an elegant and wonderful lifestyle __HTTP__ _E_ The @WSJ Editorial Board is so wrong so often. They got info from an incorrect story in another pub. Why not watch and listen to debate. _E_ Remain open to new ideas. That's where innovation begins. _E_ A record 46.68M Americans are now on food stamps __HTTP__ Four more years? _E_ Ask Sally Yates under oath if she knows how classified information got into the newspapers soon after she explained it to W.H. Council. _E_ I am self funding my campaign putting up my own money not controlled. Cruz is spending $millions on ads paid for by his N.Y. bosses. 
_E_ China is sending an Envoy and Delegation to North Korea A big move we'll see what happens! _E_ In memory of Joan Rivers watch when she became my Celebrity Apprentice which meant so much to her! __HTTP__ _E_ .@TrumpChicago's award winning dining options also offer the best views of the city __HTTP__ _E_ My @CNNS interview with @wolfblitzercnn discussing my endorsement of @MittRomney and why he can beat @BarackObama __HTTP__ _E_ The opinion of this so called judge which essentially takes law enforcement away from our country is ridiculous and will be overturned! _E_ Entrepreneurs: Brainpower is the ultimate leverage. _E_ If amnesty is so popular according to the DC ruling class then why is Obama delaying his executive action until after the election? _E_ 'Top Hillary Adviser Mocked Plotted Attacks on Pro Sanders Civil Rights Leader' #DrainTheSwamp __HTTP__ _E_ Nobody beats me on National Security. __HTTP__ _E_ As a stockholder in Apple they should get on with a larger screen iPhone as a supplement—immediately. _E_ The Coca Cola company is not happy with me that's okay I'll still keep drinking that garbage. _E_ Entrepreneurs: Being stubborn is a big part of being a winner. Never give up! _E_ Many of the great jobs that the people of our country want are long gone shipped to other countries. We now are part time sad! I WILL FIX! _E_ The most stringent gun laws in the U.S. happen to be in Chicago and look what is happening there! _E_ To all my fans sorry I couldn't do The Apprentice any longer—but equal time (presidential run) prohibits me from doing so. Love! _E_ American sanctions alone cannot stop Iran's nuclear drive and @BarackObama cannot get China and Russia to agree on new Iranian sanctions. 
_E_ Great article by @RichLowry on @POLITICOMag : "Sorry Donald Trump Has A Point" __HTTP__ _E_ My speech to @PressClubDC yesterday at the #NPCLunch on the topic of building a business brand via @cspan __HTTP__ _E_ China's Communist Party has now publicly praised Obama's reelection. They have never had it so good. Will own America soon. _E_ #LaborDay #AmericaFirstVideo: __HTTP__ __HTTP__ _E_ Congrats to winners from around the world who entered the Think Like A Champion signed book/keychain contest! __HTTP__ _E_ See yourself as an organization. Pay attention to every facet of your life. What's strong? What's weak? What's missing? _E_ I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_ Heading over to @Kelly and Michael re. Apprentice! _E_ Consumer Comfort Reaches 16 Year High on U.S. Economic Optimism via Bloomberg __HTTP__ _E_ National Security Presidential Memorandum on Strengthening the Policy of the United States Toward Cuba Memorandum... __HTTP__ _E_ If someone says "I'll bet you ten dollars" and loses the bet it's pay up time. _E_ LIVE on #Periscope: Live with the Donald __HTTP__ _E_ Both Aberdeen and Turnberry in Scotland and the soon to open Doonbeg in Ireland blow Bandon Dunes away. Bandon is a toy by comparison! _E_ 59% of the United States by area is now covered in snow highest % in many years. The global warming name isn't working anymore SORRY! _E_ Word is that they have far more evidence on A Rod than they have on Ryan Braun! Alex is over. _E_ Ted Cruz didn't win Iowa he stole it. That is why all of the polls were so wrong and why he got far more votes than anticipated. Bad! _E_ #CelebrityApprentice Boardrooms—can anything be more intense? #sweepstweet _E_ Senator Sessions will serve as the Chairman of my National Security Advisory Committee. __HTTP__ __HTTP__ _E_ I don't mind that @BarackObama plays a lot of golf. I just wish he used it productively to make deals with Congress! 
_E_ Congrats to @bubbawatson on winning the Masters. He did it without heavy reliance on coaches and the other hanger ons he just played golf. _E_ ICYMI "Raw video: Donald Trump speaks at Rep. Steve Stepanek's Amherst reception" __HTTP__ via @wmur9 _E_ Via @pressjournal by Ann Marie Parry: Plans revealed for course named after Trump's mother __HTTP__ _E_ Big protests in Iran. The people are finally getting wise as to how their money and wealth is being stolen and squandered on terrorism. Looks like they will not take it any longer. The USA is watching very closely for human rights violations! _E_ RT @BrazoriaCounty: __HTTP__ _E_ If I would have offered Obama a billion dollars to show his records he would have refused. _E_ Every Poll has me winning BIG.If you listen to dopey Karl Rove a Trump hater on @oreillyfactor you would think I'm doing poorly. @FoxNews _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Ben Smith (is that really his last name?) of @BuzzFeed is a total mess who probably got his minion Coppins to do what he didn't want to do? _E_ So much Fake News is being reported. They don't even try to get it right or correct it when they are wrong. They promote the Fake Book of a mentally deranged author who knowingly writes false information. The Mainstream Media is crazed that WE won the election! _E_ How do you like Seth and Oscars so far? _E_ Senator Lindsey Graham called me yesterday very much to my surprise and we had a very interesting talk about national security and more! _E_ Wind Power is proving to be very costly and unsightly. _E_ Via @BostonDotCom by @lilsarg: "Donald Trump on Snow Salt Vaccines and the Oval Office" __HTTP__ _E_ Scary thought @JoeBiden is a heartbeat away from the Presidency. _E_ 2013 is the worst year ever for Hollywood. Garbage released after garbage. What is going on in these studios?! 
_E_ Bob Beckel a commentator for FOX is bad for the @FoxNews brand: @BobBeckel is close to incompetent. _E_ The Midas Touch hand is the ideal metaphor to represent the attributes critical to entrepreneurial success. (cont) __HTTP__ _E_ Great Governor @Mike_Pence is in Indiana to help lead the relief efforts after tornadoes struck. True leadership. _E_ "Winning is the most important thing in my life after breathing. Breathing first winning next." George Steinbrenner _E_ Thank you Senator @TedCruz!#Debates2016 #MAGA __HTTP__ _E_ I watched POTUS speech from Europe same old tax and spend won't create jobs. _E_ Congratulations to Patrick Reed for winning at Trump National Doral. He told me The Blue Monster is the best course I've ever played _E_ How about President Obama fixing the gasoline situation instead of taking photo ops in the destruction. _E_ Watch me on @SeanHannity's show at 10PM tonight on @FoxNews _E_ Entrepreneurs: There are no guarantees but being ready sure beats being taken by surprise. Know everything you can about what you're doing. _E_ Thank you Rand! __HTTP__ _E_ W/ views of NYC's skyline Trump Stamford is Connecticut's most luxurious high rise featuring Trump amenities __HTTP__ _E_ 'Uniforms 4 Everyone' campaign @fundanything has a $3000 goal to buy underprivileged kids school uniforms __HTTP__ _E_ Great debate poll numbers I will be on @foxandfriends at 7:00 to discuss. Enjoy! _E_ The Hillary Clinton staged event yesterday was pathetic. Be careful Hillary as you play the war on women or women being degraded card. _E_ Congratulations to @SpeakerRyan @GOPLeader @SteveScalise and to the Republican Party on Budget passage yesterday. Now for biggest Tax Cuts _E_ Gabriel Sherman's book on Roger Ailes is filled with falsehoods and inaccuracies. Publisher should be ashamed (and sued). _E_ Don't worry when our country starts hurting bad enough from all of the mistakes that are being made we will start doing the right things. 
_E_ "Don't expect to build up the weak by pulling down the strong." Calvin Coolidge _E_ The road to success is always under construction. Arnold Palmer _E_ Thank you Jason Greenblatt @JasonDovEsq For Our Children: Let's Elect Donald Trump __HTTP__ _E_ Receiving the @RobbReport trophy for best new golf course in the world Trump International Golf Links Scotland. __HTTP__ _E_ .@KathieLGifford Melania and I send our deepest condolences. Frank was a special and amazing person. He will be missed by all! _E_ .@FranksFight Keep fighting Frank! Never give up! _E_ "Always be prepared to start." Joe Montana _E_ Via @nypost by @StarrMSS: "Trump: @ApprenticeNBC contestants 'the meanest by far'" __HTTP__ _E_ With these record high gas prices what does it say about Obama that he was trying to brag about his energy policy in the debate? _E_ Looking forward to a full day of meetings with President Xi and our delegations tomorrow. THANK YOU for the beautiful welcome China! @FLOTUS Melania and I will never forget it! __HTTP__ _E_ I am doing Greta tonight on Fox talking about Obama Care and pervert Anthony Wiener! 10 P.M. _E_ Congratulations to Dubai on winning the rights to host Expo 2020! A great place winning a major global event.@damacofficial @dubaiexpo2020 _E_ Many Super Pacs funded by groups that want total control over their candidate are being formed to "attack" Trump. Remember when u see them _E_ The Unaffordable Care Act sometimes referred to as ObamaCare is not working. Millions of people are losing their plans and doctors fraud! _E_ Every time I speak of the haters and losers I do so with great love and affection. They cannot help the fact that they were born fucked up! _E_ China has control over North Korea! 
_E_ Congratulations to Barack Obama for having 2012's debt already surpass 2011 __HTTP__ _E_ My @WOR710 interview on The John Gambling Show discussing the 2012 election Trump real estate projects & our airports __HTTP__ _E_ Wow "FBI lawyer James Baker reassigned" according to @FoxNews. _E_ The Federal deficit crossed $15Trillion 100% of our GDP. Yet the Super Committee can't find $1.2Trillion i... (cont) __HTTP__ _E_ So funny Crooked Hillary called BREXIT so incorrectly and now she says that she is the one to deal with the U.K. All talk no action! _E_ Join me in Tampa Florida tomorrow at 1pmE! Tickets: __HTTP__ __HTTP__ _E_ We could make America great again by spreading ObamaCare throughout the World while at the same time dropping it from U.S.! _E_ We believe that every American should stand for the National Anthem and we proudly pledge allegiance to one NATION UNDER GOD! __HTTP__ _E_ Be sure to watch the Celebrity Apprentice on Sunday night 9 pm on NBC. __HTTP__ _E_ .@megynkelly is very bad at math. She was totally unable to figure out the difference between me and Cruz in the new Monmouth Poll 41to14. _E_ KAREN HANDEL FOR CONGRESS. She will fight for lower taxes great healthcare strong security a hard worker who will never give up! VOTE TODAY _E_ So nice being with Republican Senators today. Multiple standing ovations! Most are great people who want big Tax Cuts and success for U.S. _E_ .@GovMikeHuckabee Great job on @FoxNews tonight. Thanks for your nice words about my children. Class! _E_ Enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_ thought it would be hypocritical to attend Bush's swearing in....he doesn't believe Bush is the true elected president. Sound familiar! WP _E_ Speaking at the City Club of Chicago. Sold out in minutes with thousands on the wait list!... __HTTP__ _E_ Stock market hits another high with spirit and enthusiasm so positive. Jobs outlook looking very good! #MAGA __HTTP__ _E_ .@AC360 Anderson so amazing. 
Your mother is and always has been an incredible woman! _E_ Over 90% of American workers could lose their healthcare by 2020 thanks to ObamaCare. Repeal before it is too late! _E_ How the hell does the Libyan government get off telling our embassy security they can't have loaded guns for protection?! _E_ Jane Fonda and Michael Douglas look great! _E_ Afghanistan is a total disaster. We don't know what we are doing. They are in addition to everything else robbing us blind. _E_ My FoxBusiness interview with Don Imus discussing #TimeToGetTough the GOP primary and the Newsmax @iontv debate __HTTP__ _E_ As I anticipated Justice Roberts made the cover of Time Magazine etc. The liberal media now loves him he should be ashamed. _E_ Thank you @JoeTrippi for the nice and true words on #Media Buzz with terrific Howie Kurtz. Leading New Hampshire 30 to 12. @FoxNews _E_ Hillary has bad judgment! __HTTP__ _E_ Response to the Pope: __HTTP__ _E_ Trump to Liberty U Students: 'The World is Laughing at Us' __HTTP__ Via @Newsmax_Media _E_ "Perception about India has changed says Donald Trump" __HTTP__ via @EconomicTimes by Kailash Babar _E_ Why are the Republicans giving Obama fast track authority for TPP and the Iran agreement?! Obama gets more from the GOP than his own party. _E_ Sorry for all of the millions of people who long to hear my brilliant words of wisdom on Fox & Friends on Monday A.M. no go in Dubai. _E_ #ObamacareFail __HTTP__ _E_ When will President Obama issue the words RADICAL ISLAMIC TERRORISM? He can't say it and unless he will the problem will not be solved! _E_ #NeverForget __HTTP__ _E_ Entrepreneurs: Ask yourself: What am I pretending not to see? There may be some great opportunities right around you. _E_ Here's what I told @Gretawire on @FOX when it comes to singer @Cher's inappropriate attacks on @MittRomney __HTTP__ _E_ So terrible that Crooked didn't report she got the debate questions from Donna Brazile if that were me it would have been front page news! 
_E_ The real J.P.Morgan is spinning in his grave at the ridiculous settlements the bank is making to settle disputes. A settler is a soft target _E_ Obama: "I will control Ebola." = Obama: "If you like your health care plan you can keep your healthcare plan." _E_ Marco Rubio is a total lightweight who I wouldn't hire to run one of my smaller companies a highly overrated politician! _E_ My persona will never be that of a wallflower I'd rather build walls than cling to them Donald J. Trump _E_ "The most important thing in communication is hearing what isn't said." Peter Drucker _E_ I will beat Hillary easily but Lindsey Graham says I won't and yet he got zero against me no cred! Why does FOX put him on? _E_ The U.S. is spending fortunes at airports checking people coming in from West Africa with uncertain results. STOP THE FLIGHTS YOU DUMB B's! _E_ THANK YOU NEW YORK! #Trump2016 __HTTP__ _E_ Scotland is beautiful and Trump Internatonal Golf Links Scotland is progressing beautifully as well. __HTTP__ _E_ John Roberts arrived in Malta yesterday. Maybe we will get lucky and he will stay there. _E_ In 2011 I said that Mubarak never should have been ousted because whoever replaces him will be worse. Obama made a mistake. _E_ JOBS JOBS JOBS! __HTTP__ __HTTP__ _E_ Don't believe @BarackObama's whining Pro Romney SuperPAC spending is on par with Pro Obama SuperPAC __HTTP__ _E_ Fast and Furious gun running goes all the way to the White House. We need answers now! _E_ Just toured Baton Rouge Louisiana GREAT PEOPLE fantastic place doing really well. Miss USA Pageant totally sold out.Tomorrow night NBC _E_ I will be interviewed by @seanhannity tonight at 10:00 on @FoxNews . Much much much to talk about! _E_ .@Ynberg: Long term goal > > > to be the black @realDonaldTrump 4real .Great Dean and you will make it! _E_ Which National Costume do you think should win? __HTTP__ _E_ He @RickSantorum wants to decide what books people can read what movies they can see. 
#freespeech It doesn't work that way! _E_ Big announcement in Ames Iowa on Tuesday! You will not want to miss this rally! #Trump2016 __HTTP__ __HTTP__ _E_ We are asking law enforcement to check for dishonest early voting in Florida on behalf of little Marco Rubio. No way to run a country! _E_ What is our President doing? __HTTP__ _E_ I was so looking forward to being in Virginia Beach Virginia today. The demand for tickets was amazing. Good luck with storm back soon! _E_ these companies are able to move between all 50 states with no tax or tariff being charged. Please be forewarned prior to making a very ... _E_ Belated congratulations to @serenawilliams on winning the French Open. A great player & person! _E_ Ted Cruz poll numbers are down big. Because he was born in Canada and was until recently a Canadian citizen many believe he cannot run! _E_ Because of our terrible leaders it is now open season on every American throughout the world. Terrorists are thrilled. _E_ Success tip: Keep the big picture in mind. There are always opportunities & possibilities & thinking too small can negate a lot of them. _E_ The Obstructionist Democrats make Security for our country very difficult. They use the courts and associated delay at all times. Must stop! _E_ I aim very high and then just keep pushing and pushing to get what I'm after. The Art of the Deal _E_ Watching @TigerWoods on NBC playing great golf. Tiger won The WGC Cadillac Championship at Trump National Doral this year. I love Tiger! _E_ You can't build a reputation on what you're doing to do. Great quote by Henry Ford. _E_ Time to #DrainTheSwamp in Washington D.C. and VOTE #TrumpPence16 on 11/8/2016. Together we will MAKE AMERICA SAFE... __HTTP__ _E_ The Formula of Knowledge: The best way to learn is through studying the history of success and failures in your industry. _E_ WOW SO NICE AND SO TRUE. THANK YOU! 
@not_that_actor: @realDonaldTrump #TRUMP2016 TIME TO RETHINK THE CHOICES __HTTP__ _E_ I was always a big fan of Kim Novak and still am—a wonderful actress. _E_ Via @NJcomsomerset BY @wobriensomerset: @TigerWoods brings charity golf playoffs toTrump Nat'l/Bedminster __HTTP__ _E_ Why did @oreillyfactor give @davidaxelrod so much time to sell his third rate book. Bill should have hit stammering David MUCH harder! Waste _E_ Sadly when it comes to using the energy industry to create American jobs Obama has been a total disaster. #TimeToGetTough _E_ Just got back from Iowa great people! _E_ We are in the NAFTA (worst trade deal ever made) renegotiation process with Mexico & Canada.Both being very difficultmay have to terminate? _E_ More than anything else I think deal making is an ability you're born with. It's in the genes. #TheArtofTheDeal _E_ RT @IvankaTrump: "The Trump economy is booming." One thing @realDonaldTrump "has done that has received little attention despite arguably d... _E_ .@realDonaldTrump is PRO LIFE PRO FAMILY #BigLeagueTruth #Debates2016 __HTTP__ _E_ I had fun appearing in the video for Carly Rae Jepsen's #CallMeMaybe for #MissUSA 2012 __HTTP__ _E_ That Saturday Night Live is able to joke about the Germanwings air tragedy is disgusting. They should apologize to all of those suffering! _E_ Will be on @foxandfriends at 7:00 this morning enjoy! _E_ Congratulations to @joniernst on her impressive @IowaGOP primary win last night. Now all should unite & defeat Bruce Braley this November _E_ Host of the @PGATOUR & @CadillacChamp @TrumpDoral is home to 4 unique courses including the famous Blue Monster __HTTP__ _E_ I will be on Greta @gretawire tonight at 10 PM on Fox News. _E_ Autism Speaks head up by Bob & Suzanne Wright does a fantastic job—if only we had more people like them! To help: __HTTP__ _E_ Congratulations to Obama on building a strong economy. There are 49500000 people on food stamps. A historic record! 
_E_ Animals representing Hillary Clinton and Dems in North Carolina just firebombed our office in Orange County because we are winning @NCGOP _E_ Via @newsbusters: "Donald Trump Issues Statement Regarding $5 Million Lawsuit Against Bill Maher" __HTTP__ _E_ The Republican Senators must step up to the plate and after 7 years vote to Repeal and Replace. Next Tax Reform and Infrastructure. WIN! _E_ Join me in Wichita Kansas tomorrow morning! Looking forward to it!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Pollster Trend National GOP Average223 national polls & 33 pollsters.#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ RT @EricTrump: Honored to speak at the RNC Summer Meeting in Nashville Tennessee this evening! @GOP #MAGA @GOPChairwoman __HTTP__ _E_ Via @BreitbartNews: 'MAJOR COUP': DONALD TRUMP PICKS UP TOP IOWA GRASSROOTS OPERATIVE FOR POTENTIAL 2016 CAMPAIGN __HTTP__ _E_ Great to be in Riyadh Saudi Arabia. Looking forward to the afternoon and evening ahead. #POTUSAbroad __HTTP__ _E_ Sexual pervert Anthony Weiner has zero business holding public office. _E_ Immigration reform is all risk for the @GOP. Their base doesn't want it and the 12M illegals will all vote Democrat. _E_ Gas is $6 already in California. Don' worry @BarackObama's Algae energy policy is going to pay major (cont) __HTTP__ _E_ The job plan by @BarackObama is nothing more than a second stimulus. The first failed and so will this one. _E_ In the latest poll Danger Weiner's numbers have sunk. I wonder how Carlos handled the stress? He is one whacko sicko sexter. _E_ .@johnhawkinsrwn Great speaking to you today we will speak again soon. _E_ I am working on a new system where there will be competition in the Drug Industry. Pricing for the American people will come way down! _E_ Kay Hagan profited off of the stimulus.She just skipped a debate. Kay supports amnesty weak border & __HTTP__ @ThomTillis! _E_ The @timestribune @EricTrump: Eyes are on Northeast Pa. 
with gas development __HTTP__ _E_ Stock Market hit another all time high yesterday despite the Russian hoax story! Also jobs numbers are starting to look very good! _E_ Hillary defrauded America as Secy of State. She used it as a personal hedge fund to get herself rich! Corrupt dangerous dishonest. _E_ My @foxandfriends interview from Monday discussing Obama's tone going over the curb and Republican debt ceiling card __HTTP__ _E_ Thank you @ATFD17! #ImWithYouVideo: __HTTP__ _E_ President Obama was terrible on @60Minutes tonight. He said CLIMATE CHANGE is the most important thing not all of the current disasters! _E_ Democrats used to support border security — now they want illegals to pour through our borders. _E_ "Confidence is contagious. So is lack of confidence." Vince Lombardi _E_ If the Boston killer applies for Obama Care the paperwork will be too complicated for him to understand! _E_ Congratulations to Chuck Hagel on one of the shortest tenures as Sec. of Defense. Another terrible appointee by Obama. _E_ I got to know @ScottWalker well—he's a very nice person and has a great future. _E_ I read @willweatherford's comments that "the lights are dimming on gambling in Florida"—nothing could be worse for the state. _E_ Congrats to @EricTrump and @LaraLeaYunaska on a great five years! _E_ These are facts: In 2001 the US opened its markets to China & since then more than 2 million Americans can't (cont) __HTTP__ _E_ If Obama attacks Syria and innocent civilians are hurt and killed he and the U.S. will look very bad! _E_ Meeting with Generals at Mar a Lago in Florida. Very interesting! _E_ .@AGSchneiderman should remove his eyeliner as pointed out by Cuomo when he does his commercials! _E_ Celebrity Apprentice returns to NBC Sunday 3/14 9 11PM ET/PT. Outstanding list of celebrities & season should be the best one yet! 
_E_ Via @Newsmax_Media by @OwenTew: "Donald Trump: Kerry Has to Walk If Iran Doesn't Make Deal" __HTTP__ _E_ Dummy political pundit @krauthammer constantly pressed the crazy war in Iraq. Many lives and trillions of dollars wasted. U.S. got NOTHING! _E_ Does anyone remember the fight @mcuban had w/ the referee—he was weak & pathetic—a non athlete trying to live life thru his players. _E_ .@NFL: Too much talk not enough action. Stand for the National Anthem. _E_ Working on major Trade Deal with the United Kingdom. Could be very big & exciting. JOBS! The E.U. is very protectionist with the U.S. STOP! _E_ Our foreign policy decisions are dumbest in U.S. history _E_ Ellen was so awkward and insecure last night. The pizza skit was terrible. She should dump Andy Lassner a guy with no absolutely no talent! _E_ "Partnerships also require negotiation. It should be a win win setup. Otherwise it's not a partnership." – 'Midas Touch' _E_ Join me in Wichita Kansas tomorrow morning! Looking forward to it!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ It is only the people that were never asked to be VP that tell the press that they will not take the position. _E_ Highly untalented Wash Post blogger Jennifer Rubin a real dummy never writes fairly about me. Why does Wash Post have low IQ people? _E_ Check out my interview from MSNBC at __HTTP__ _E_ Obama's ideas don't move us 'Forward' they take us 'Backwards.' These are ideas people come to America to get away from. @marcorubio _E_ Saw Michael Jordan and Ray Allen today playing golf at Trump National Doral the Blue Monster. Great guys! _E_ Capitalism requires capital. When government robs capital from investors it takes away the money that creates (cont) __HTTP__ _E_ Thanks everyone they all said I won the debate. Even won the @CNBC Poll! _E_ Success requires 100% of your focus and 100% of your effort. Don't sell yourself short. _E_ My 757 is incredible I think the teams agree on that. _E_ Economy growing! 
Excluding hurricane effects CEA estimates that real GDP growth would have been 3.9% in Q3.Stock market at a new high unemployment at a low. We are winning and TAX CUTS will shift our economy into high gear! __HTTP__ _E_ If everything seems under control you're not going fast enough. Mario Andretti _E_ We must stop releasing hard core criminals all over the United States. Our country must be strong again! _E_ Despite the fact that I have had great success with the words YOU'RE FIRED I do not like firing people. But ZERO on ObamaCare mess no way! _E_ "If you plan for the worst – if you can live with the worst – the good will always take care of itself." – The Art of the Deal _E_ Leaving Puerto Rico now for D.C. Will be in Las Vegas early tomorrow to pay my respects. Everyone is in my thoughts and prayers. __HTTP__ _E_ Late Night host are dealing with the Democrats for their very unfunny & repetitive material always anti Trump! Should we get Equal Time? _E_ #MakeAmericaGreatAgain __HTTP__ _E_ My @foxandfriends interview discussing the Super Bowl the real unemployment numbers Iran and @MittRomney's (cont) __HTTP__ _E_ I wonder what the work atmosphere is like @VanityFair. It must be hard working at a dying institution. _E_ Best ratings for the Dateline show were for six months not two months! _E_ Senate passed the VA Accountability Act. The House should get this bill to my desk ASAP! We can't tolerate substandard care for our vets. _E_ Post Debate via @OANN. Thank you!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ "Trump Dana Farber waiting on Bill Maher" __HTTP__ via @BostonGlobe _E_ Why doesn't President Obama call upon the NSA to fix the badly broken website then they could spy on all of the many cheaters & arrest them! _E_ We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_ If US Air & American Airlines are allowed to merge ticket prices will skyrocket—there will be no competition. 
_E_ Via @trscoop: "Mark Levin DEFENDS Trump: Hillary Clinton is a CROOK and a FRAUD and she's not treated this way!" __HTTP__ _E_ RT @RightlyNews: @realDonaldTrump @LouDobbs It is NOT a coincidence that the economy boomed immediately after the 2016 election. _E_ This is my pledge to the American people: __HTTP__ _E_ The reason I am staying in Bedminster N. J. a beautiful community is that staying in NYC is much more expensive and disruptive. Meetings! _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ It's going to get hotter in Las Vegas tonight! Watch the Miss Universe Pageant tonight on NBC at 9 p.m. I'm looking forward to being there! _E_ Brian Williams was never a smart guy but always passes himself off as such. People will learn the truth! @NBCNightlyNews _E_ .@bretbaier has a wonderful new book #specialheart and it's proving to be a great success already. Bret is a winner! _E_ I never thought I'd say it in my lifetime but President Barack Hussein Obama aka Barry Sotoro is a far worse president than Jimmy Carter! _E_ China has just overtaken us as the world's largest economy. We are busy wasting $'s while China builds airports & skyscrapers. _E_ Inauguration Day is turning out to be even bigger than expected. January 20th Washington D.C. Have fun! _E_ Awarded 5 Stars by @VisitScotland @TrumpScotland's MacLeod House & Lodge boutique hotel is an historic masterpiece __HTTP__ _E_ "What the mind can conceive and believe and the heart desire you can achieve." Norman Vincent Peale _E_ ....and don't forget that Foxconn will be spending up to 10 billion dollars on a top of the line plant/plants in Wisconsin. _E_ If everything seems under control you're not going fast enough. Mario Andretti _E_ People are really liking my new book Crippled America. Check it out! _E_ It's about time for all Americans (Republicans & Democrats) to force our elected officials to start acting fiscally responsible! _E_ .@TheBrodyFile great job on @AC360. 
Thank you for the very smart and kind words! _E_ Chris Cuomo in his interview with Sen. Blumenthal never asked him about his long term lie about his brave service in Vietnam. FAKE NEWS! _E_ Congratulations to @TrumpPanama for winning the 2015 Traveler's Choice Award from @TripAdvisor __HTTP__ _E_ I don't care what people say I like Tom Cruise. He works his ass off and never ever quits. He's one of the few true movie stars. _E_ We will NEVER FORGET the victims who lost their lives one year ago today in the horrific #PulseNightClub shooting.... __HTTP__ _E_ Under Mayor @MikeBloomberg and Police Commissioner @Ray Kelly all violent crime in NYC is down dramatically. That's leadership. _E_ Expect the best from people. They will rise to the challenge and it's important to inspire confidence. _E_ Great @ANHQDC segment with @CharlesHurt: Breaking Down the Trump Factor __HTTP__ Let's Make America Great Again! _E_ The brand new Blue Monster just opened at Trump National Doral Miami. Also great new driving range which is open 'till midnight. GO SEE! _E_ Success tip: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_ In Hillary Clinton's America things get worse. #TrumpPence16 __HTTP__ _E_ "Get it straight: Pakistan is not our friend. We've given them billions and billions of dollars and what (cont) __HTTP__ _E_ Message to Obama re: Iran: "The worst thing you can possibly do in a deal is seem desperate to make it." – The Art of the Deal _E_ I own @DannyZuker but he has his friends & haters & losers tweeting that he beat me. He can't beat me at anything! _E_ Still waiting for a response from @billmaher. Does he even have $5 million? _E_ Stupid George Will gave @MittRomney no chance 3 months ago. Take off his little spectacles and he's just another dummy. _E_ Just watched Jeb's ad where he desperately needed mommy to help him. Jeb mom can't help you with ISIS the Chinese or with Putin. _E_ .@TheBrodyFile was fantastic tonight on @CNN. 
Thank you we will MAKE AMERICA GREAT AGAIN! _E_ #WeeklyAddress __HTTP__ _E_ Instead of driving jobs and wealth away AMERICA will become the WORLD'S great magnet for innovation & job creation! __HTTP__ _E_ Success consists of going from failure to failure without loss of enthusiasm. Winston Churchill _E_ "Donald Trump turns over 11.5 ac.in Rancho Palos Verdes for recreational open space" __HTTP__ @DailyBreezeNews by @meg_barnes _E_ RT @Franklin_Graham: Join me in praying for @POTUS. He reminded the world "If the righteous many do not confront the wicked few then evil... _E_ CNBC Titans: Donald Trump will be shown Friday Nov 19th at 9 pm and 1 am Sunday 11/21 at 9 pm and 11/24 at 7 pm __HTTP__ _E_ So many stories about me in the @washingtonpost are Fake News. They are as bad as ratings challenged @CNN. Lobbyist for Amazon and taxes? _E_ The Democrats should be ashamed. This is a disgrace!#DrainTheSwamp __HTTP__ _E_ Feels good to be home after seven months but the White House is very special there is no place like it... and the U.S. is really my home! _E_ I never fall for scams. I am the only person who immediately walked out of my 'Ali G' interview _E_ Gas prices are going up big league—I told you so—payback to OPEC! _E_ .@KatrinaPierson you did a fantastic job tonight on @FoxNews. Thank you for your very tough and very smart representation! _E_ .@BilldeBlasio should focus on running #NYC & all of the problems that he has caused with his ineptitude & not be so focused on me! _E_ With Miami's top #NewYearsEve vacation package @TrumpDoral is the perfect option to celebrate the start of 2015. __HTTP__ _E_ "Obamacare Data Mismatch Could Leave Thousands Uninsured" __HTTP__ ObamaCare is not working and has missed all targets. _E_ I hope Bill Clinton starts talking about women's issues so that voters can see what a hypocrite he is and how Hillary abused those women! 
_E_ RT @foxandfriends: Senators learn the hard way about the fallout from turning on Trump __HTTP__ _E_ Common Core is a federal takeover of school curriculum. Department of Education should be disbanded not expanded. Focus on local education. _E_ I was asked about healthcare by Anderson Cooper & have been consistent I will repeal all of #ObamaCare including the mandate period. _E_ Food stamps up 45%. Federal handouts up 45%. Is @BarackObama happy? __HTTP__ _E_ Ultimately Trump Tower became much more than just another good deal. I work in it I live in it and I have a (cont) __HTTP__ _E_ With over 260 5 Star guest rooms & suites @TrumpTO is 65 stories of pure luxury in the center of downtown Toronto __HTTP__ _E_ The new joke in town is that Russia leaked the disastrous DNC e mails which should never have been written (stupid) because Putin likes me _E_ Finally an accurate story from the Washington Post! __HTTP__ _E_ Mayweather is getting absolutely killed! _E_ Speaking of our very stupid war with Iraq it is totally disintegrating and Iran (with Russia) will walk in and take it over (lots of oil)! _E_ Big dinner with Governors tonight at White House. Much to be discussed including healthcare. _E_ I'm not hearing much from Obama or his administration about my $5M offer to charity or to which charity the money will go. _E_ Stock Market just hit another record high! Jobs looking very good. _E_ NYC should hold a parade for returning Iraq and Afghanistan veterans. _E_ RT @DonaldJTrumpJr: If you live in Louisiana Maine Kentucky or Kansas remember to vote today! Together let's #MakeAmericaGreatAgain __HTTP__ _E_ Congratulations to @EmilyMiller @mboyle1 & @NolteNC on making @FishbowlDC's list of 10 Journos You Don't Want to Fight on Twitter. _E_ .@MarkHalperin works so hard but just doesn't have a natural instinct for politics. Others do and those are the people you want to follow! 
_E_ Isn't it crazy that people of little or no talent or success can be so critical of those whose accomplishments are great with no retribution _E_ Jodi if you're listening MAKE A DEAL! _E_ Today I'm in Aberdeen Scotland preparing for the July 10th opening of perhaps the world's greatest golf course __HTTP__ _E_ .@Larry_Kudlow 'Donald Trump Is the middle class growth candidate' __HTTP__ _E_ I am watching @FoxNews and how fairly they are treating me and my words and @CNN and the total distortion of my words and what I am saying _E_ The right leadership can help economy while creating security around the world. Let's make America great again! __HTTP__ _E_ Thanks to @johnrich for putting on such a great concert fot @Stjude. John was a winner on Celebrity Apprentice and is a fantastic guy. _E_ Thank you Michigan. This is a MOVEMENT. We are going to MAKE AMERICA SAFE AND GREAT AGAIN! #TrumpPence16 __HTTP__ _E_ "Watch what people are cynical about and one can often discover what they lack." General George S. Patton _E_ It's been stated that dopey NY @AGSchneiderman used cocaine while he was a state senator. __HTTP__ _E_ Today @BarackObama is in Ohio on a bus tour. Tomorrow Pennsylvania. How about actually running the country? _E_ SHOCK! ObamaCare will cost double what @BarackObama promised over $1.76 __HTTP__ and result (cont) __HTTP__ _E_ "Donald trump files statement of candidacy" __HTTP__ via @CBSNews _E_ Vancouver's most anticipated hotel & residences @TrumpVancouver will unveil Canada's first Mar a Lago Spa __HTTP__ _E_ Just left a great rally in Florida now heading to Ohio for two more. Will be there soon. _E_ QE3 a political favor for Obama will cause record inflation on food and fuel. This hits low income families the hardest. Big mistake. _E_ .@ABFAlecBaldwin P.S. Your brother @StephenBaldwin7 is doing very well on @ApprenticeNBC and he stated he adores you. _E_ Congrats to @MiamiHEAT on winning @NBA championship. 
@MickyArison is a tremendous owner & has done wonders for (cont) __HTTP__ _E_ Enviro friendly? AP IMPACT: Obama administration allows wind farms to kill eagles birds despite federal laws __HTTP__ _E_ Sen. Corker is the incompetent head of the Foreign Relations Committee & look how poorly the U.S. has done. He doesn't have a clue as..... _E_ Don't be easily pleased with yourself or with anything else. Be tough & fight to keep your standards high. Think Like a Champion _E_ In light the Benghazi emails released last night it is apparent that Obama has no problem lying to the American public... _E_ Don Butler and executives are doing a great job at @Cadillac the cars are fantastic. _E_ Huma should dump the sicko Weiner. He is a calamity that is bringing her down with him. _E_ He @BarackObama promised to close Gitmo in his first year. It is still open 3 years later and about to get a (cont) __HTTP__ _E_ Ashley Judd Targeted by @karlrove's Super PAC in Ad (Video) __HTTP__ _E_ Via @FurnitureToday by Cindy W. Hodnett: "Dorya to introduce Trump Home high end furniture" __HTTP__ _E_ People love @LilJon! __HTTP__ #CelebApprentice _E_ China just hacked our federal government & stole gov. workers' information. Why do our leaders let China get away with this?! No respect. _E_ I truly believe that our country has the worst and dumbest negotiators of virtually any country in the world. _E_ #TrumpVine @arod sucks! __HTTP__ _E_ Our inner cities have been left behind. We will never have the resources to support our people if we have an open border. _E_ Judy Garland was much better to put it mildly! #Oscars _E_ U.S. small businesses are truly worried about rising healthcare costs and taxes __HTTP__ I told you so! _E_ True courage is being afraid and going ahead and doing your job anyhow that's what courage is. Gen. Norman Schwarzkopf (1934 2012) _E_ Thank you Virginia! 15000 amazing supporters! Everyone get out and #VoteTrump tomorrow! 
__HTTP__ _E_ Great afternoon in Little Havana with Hispanic community leaders. Thank you for your support! #ImWithYou __HTTP__ _E_ Obama never consulted with Congress about a prisoner exchange. HE BROKE THE LAW AND SHOULD BE TRIED. OUR PRESIDENT IS A TOTAL DISASTER! _E_ Zegarelli and Vescio: Pine Road looks like hell. Must be re paved now—very bad for town. @BriarcliffManor _E_ Thank you for the wonderful welcome @WEF! #Davos2018 __HTTP__ _E_ If America was under the threat of imminent attack would Obama use torture or a kiss? _E_ .@NRO Really important to save National Review from going out of business. We need a true conservative voice! _E_ With only a very small majority the Republicans in the House & Senate need more victories next year since Dems totally obstruct no votes! _E_ Logic will get you from A to B. Imagination will take you everywhere. Albert Einstein _E_ "Develop success from failures. Discouragement and failure are two of the surest stepping stones to success." – Dale Carnegie _E_ The NYC casting call for The Apprentice is thisThursday April 1 at Trump Tower. For all the information you need go to NBC.com/casting. _E_ Pathetic attempt by @foxnews to try and build up ratings for the #GOPDebate. Without me they'd have no ratings! __HTTP__ _E_ We had a wonderful visit to Vietnam thank you President Tran Dai Quang! Heading to the #ASEANSummit 50th Anniv Gala in the Philippines now. __HTTP__ _E_ Excited that @OurCountryPAC's Amy Kremer has endorsed the Newsmax iontv debate. The Tea Party Express is a great group. _E_ Yesterday Barack Obama said he wants wind turbines manufactured here in China __HTTP__ I don't think this was a gaffe. _E_ George Ross could be right—@THEGaryBusey would be better in the adventure task than the romance task. #CelebApprentice _E_ Thank you @GolfMagazine for your fantastic review of The Blue Monster at Trump National Doral BEST U.S. 
RESORT RENOVATION & ALL TIME _E_ On my way to San Diego to raise money for the Republican Party. I am spending a lot myself and also helping others. _E_ I'll be doing @piersmorgan show tonight on CNN at 9 PM. Will be very interesting. (I hope!) _E_ Focus on your goals not your problems. Problems are a mind exercise learn to play beyond your comfort zone. _E_ Thank you our great honor! __HTTP__ _E_ GREAT EVENING last night in Pensacola Florida. Arena was packed to the rafters the crowd was loud loving and really smart. They definitely get what's going on. Thank you Pensacola! _E_ Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_ Just finished speaking in Sydney Australia in front of 20000 people and today I'm off to Melbourne for anot... (cont) __HTTP__ _E_ Glad to hear that @FLGovScott will be speaking at the @RNC Convention. He is a true conservative and fantastic governor! _E_ Entrepreneurs: Follow your own path—it will bring you to the places you were meant to be. _E_ I bought the great Turnberry Resort today considered by many to have the greatest golf course in the World. I will take good care of it! _E_ Plain & Simple: We should only admit into this country those who share our VALUES and RESPECT our people. __HTTP__ _E_ .@StephenBaldwin7's mother thinks I'm very handsome. Now I see where Stephen and Alec get their smarts. #CelebApprentice _E_ One positive from last week for Lance was that everyone was focused on Manti Te'o! Why did Lance do that interview? _E_ #TeamTrump is thinking of Captain Andrew Maitner. A true American hero. #MaitnerStrong __HTTP__ __HTTP__ _E_ Our country has to come together. We have to start working with and really liking each other. The whole world is watching Baltimore. _E_ An analysis showed that Bernie Sanders would have won the Democratic nomination if it were not for the Super Delegates. 
_E_ Watching biased Charles @krauthammer a @FoxNews flunky who didn't know that I won every debate in particular the last one. Check polls! _E_ For you newcomers George Ross was one of my first advisors on the original Apprentice. #CelebApprentice _E_ So many great things happening new poll numbers looking good! News conference at 11:00 A.M. today Trump Tower! _E_ RT @DonaldJTrumpJr: Nice piece and video today in the Wall Street Journal: Trump's three eldest children jump into campaign __HTTP__ _E_ The State of Florida is so embarrassed by the antics of Crooked Hillary Clinton and Debbie Wasserman Schultz that they will vote for CHANGE! _E_ I had a great day in D.C. even though the subject was an unpleasant one the horrible Iran Nuke deal. Amazing crowd and enthusiasm! _E_ The New York Times should never have moved out of their magnificent original home... _E_ The era of strategic patience with the North Korea regime has failed. That patience is over. We are working closely... __HTTP__ _E_ I do what I do out of pure enjoyment. Hopefully nobody does it better. Theres a beauty to making a great deal. It's my canvas. _E_ Hey @KimKardashian I hear you are undecided in the election. I can explain why you should vote for @MittRomney. _E_ China is about to acquire 82800 net acres of a Texas shale oil and gas field __HTTP__ What are we doing! _E_ Keep difficulties in perspective. Ask yourself is this a blip or is it a catastrophe? _E_ Just arrived in Italy after having a very successful NATO meeting in Brussels. Told other nations they must pay more not fair to U.S. _E_ Thank you @chucktodd for your commentary last night on @NBCNightlyNews. Very fair we are making progress together! _E_ .@alexsalmond @pressjournal @BBCNews RT ‏@DanScavino the photos that they don't show the public... __HTTP__ _E_ The purpose of China's massive military buildup on the Nork's border is to intimidate us. China attacked us during the Korean War. 
_E_ The people of South Carolina are embarrassed by Nikki Haley! _E_ Be sure to tune in to another amazing episode of #CelebApprentice this Sunday on @nbc at 9PM EST! This Sunday's (cont) __HTTP__ _E_ You will love Celebrity Apprentice tonight 9 PM on NBC. Must watch from beginning two early firings! _E_ Hope & Change the number of 26 year olds living with parents has jumped 46% under Obama __HTTP__ Four more years? _E_ Thank you CBS & Breitbart total vindication! Will the mainstream media apologize? Many many witnesses. #Trump2016 __HTTP__ _E_ Remember: Obama turned down $5M to charity which I said I would increase by 10X to $50M just to show simple records. He's hiding lots! _E_ Thank you Grand Rapids Michigan! #ICYMI watch: __HTTP__ __HTTP__ _E_ Via @CBNNews: Exclusive: Backstage Interview w/ Donald Trump at CPAC __HTTP__ by @TheBrodyFile Great seeing you David! _E_ I'm getting The Commandant's Leadership Award from the U.S.Marines tonight at The Waldorf Astoria a great honor! @BretBaier _E_ I told @megynkelly that @oreillyfactor and I had identical views on a certain issue and she cut it out of the taped interview. Why? Too bad! _E_ My thoughts on @andyroddick in today's #trumpvlog.... __HTTP__ _E_ .@FrankLuntz your so called focus groups are a total joke. Don't come to my office looking for business again. You are a clown! _E_ The only way to do great work is to love what you do. – Steve Jobs _E_ .@Rosie—No offense and good luck on the new show but remember you started it! __HTTP__ _E_ I love New Hampshire will be an exciting evening! _E_ I would do same thing if I were China. They want Obama. __HTTP__ _E_ Just won The Club Championship at Trump International Golf Club in Palm Beach lots of very good golfers never easy to win a C.C. _E_ RT @ReutersPolitics: Trump to give $5 million to charity if Obama releases records __HTTP__ _E_ In a little reported event China has just overtaken the United States as the NUMBER ONE World economic power! 
Great going Washington! _E_ A Rod's appeal will go nowhere. He will get a long suspension. Good for the @Yankees. And sends strong message to @MLB players. _E_ Thank you @elvisduran for dedicating your birthday today to the @EricTrumpFdn for @StJude! Click here to donate: __HTTP__ _E_ People get what is going on! __HTTP__ _E_ #ICYMI: Will Media Apologize to Trump? __HTTP__ _E_ I don't get @billmaher and his terrible show he is dumb as a rock but tries so hard to pass himself off as a great intellect. Check past! _E_ He is a professional and true gentleman: @GeorgeTakei is one of my favorite contestants from #CelebApprentice. _E_ Tomorrow is the 10 year anniversary of the Apprentice one of the biggest hits in television history. How time flies! _E_ I hear that @SenTedCruz's $$ man Robert Mercer a good man is very angry because Cruz lied to him about liquidating his (Ted's) holdings.? _E_ DON'T LET HER FOOL US AGAIN. __HTTP__ _E_ Looking forward to speaking at tonight's gala for @MittRomney supporters at the Intrepid. Mitt's doing well. _E_ Donald Trump song is up to almost 60 million hits crazy! _E_ NobamaCare won't work never will work and can't work it is a total waste of time and energy except that it is hurting people (& economy!) _E_ RT @RealBenCarson: Please read my full endorsement of @realDonaldTrump for President of the United States: __HTTP__ _E_ The press has very inaccurately covered this event see for yourself! __HTTP__ _E_ Will be doing @greta interview tomorrow. So much to talk about! _E_ RT @IvankaTrump: Since @realDonaldTrump inauguration over 1 million net new jobs have been created in the American economy! #MAGA _E_ ...Why did the DNC REFUSE to turn over its Server to the FBI and still hasn't? It's all a big Dem scam and excuse for losing the election! _E_ RT @usairforce: "#AirForce relief efforts in #PuertoRico & #VirginIslands" __HTTP__ _E_ Sugar @Lord_Sugar Unlike yours my financials are phenomenal. 
People don't know your real numbers & would not be impressed. _E_ A Clinton already defeated a Bush. The definition of insanity is doing the same thing twice & expecting a different result. _E_ We don't have a country if we don't have borders. #VoteTrump Video: __HTTP__ __HTTP__ _E_ North Korea has just launched another missile. Does this guy have anything better to do with his life? Hard to believe that South Korea..... _E_ Donald Trump Plans To Continue GOPLegacy Of Leading On Women's Civil Rights Against Racist Sexist Democrats __HTTP__ _E_ The media is so in the tank for Obama that it is amazing—the funny thing is he can't stand them! _E_ What would All Star @ApprenticeNBC be w/out a Baldwin? @StephenBaldwin7 is at the top of his game this season. Our fans will be happy. _E_ Dopey @BillKristol who has lost all credibility with so many dumb statements and picks said last week on @Morning_Joe that Biden was in. _E_ Can you imagine not taking Snowden's passport away before he jetted happily away to foreign lands (where he gave away many U.S. secrets). _E_ Catch the second part of my interview with Bill O'Reilly tonight at 8pm on Fox News.... _E_ #CrookedHillary is not qualified! __HTTP__ _E_ N.A.T.O. is obsolete and must be changed to additionally focus on terrorism as well as some of the things it is currently focused on! _E_ Thank you. __HTTP__ _E_ The approval process for the biggest Tax Cut & Tax Reform package in the history of our country will soon begin. Move fast Congress! _E_ Unsustainable @BarackObama has increased total federal budget outlays by over 24% during his term __HTTP__ He loves debt. _E_ Rosie O'Donnell has failed again. Her ratings were abysmal and Oprah cancelled her on Friday night. When will (cont) __HTTP__ _E_ Get rid of all of these commercials. #DemDebate _E_ I hearby demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_ Here is my statement. __HTTP__ _E_ The U.S. 
has 69 treaties with other countries where we would have to defend them and their borders. How nice but what do we get? NOT ENOUGH _E_ .@PrimeMinisterSX has no clue what's going on in St. Maarten. Mullet Bay is a third world slum. _E_ Heading to the Great State of Wisconsin to talk about JOBS JOBS JOBS! Big progress being made as the Real News is reporting. _E_ New GOP platform now includes language that supports the border wall. We will build the wall and MAKE AMERICA SAFE AGAIN! _E_ Have a fantastic beautiful and happy Easter everyone and then when Easter is over have great wins and triumphs in life. Never give up! _E_ Totally dishonest Donna Brazile chokes on the truth. Highly illegal! Watch: __HTTP__ __HTTP__ _E_ The failing @nytimes just announced that complaints about them are at a 15 year high. I can fully understand that but why announce? _E_ Doral Tournament was great best 18th hole in golf and a wonderful winner in @JustinRose _E_ RT @realDonaldTrump: Senator Dicky Durbin totally misrepresented what was said at the DACA meeting. Deals can't get made when there is no t... _E_ .@HillaryClinton lists litany of ways she plans to restrict gun rights. 2A will not survive a Hillary presidency. #Debate #BigLeagueTruth _E_ It is Clinton and Sanders people who disrupted my rally in Chicago and then they say I must talk to my people. Phony politicians! _E_ 45000 construction & manufacturing jobs in the U.S. Gulf Coast region. $20 billion investment. We are already winning again America! _E_ Thank you Columbus Ohio! __HTTP__ _E_ America is at a great disadvantage. Putin is ex KGB Obama is a community organizer. Unfair. _E_ If you think you can do a thing or think you can't do a thing you're right. Henry Ford _E_ Besides an award winning golf course @TrumpGolfLA features exquisite estates on top the Palos Verdes Peninsula __HTTP__ _E_ The failing @nytimes has become a newspaper of fiction. Their stories about me always quote non existent unnamed sources. 
Very dishonest! _E_ Ebola has been confirmed in N.Y.C. with officials frantically trying to find all of the people and things he had contact with.Obama's fault _E_ Remember I am the only one who is self funding my campaign. All of the other candidates are bought and paid for by special interests! _E_ It does not cost anything to dream. Spend your time enjoying your big dreams. Think Big _E_ People are really unhappy with the endless security checks at the new World Trade Center. Durst is a terrible manager. Tenants furious! _E_ Congrats to Senator McConnell and @TheTeaParty_net's Kellen Guida on yesterday's successful Tea Party Caucus __HTTP__ _E_ "Trump Brand Expands To South America: The Donald Lends His Name To Luxury Tower In Uruguay" __HTTP__ via @Forbes _E_ Blue Ribbon Commission to find and agree to future spending cuts? Bad idea. _E_ Thank you Georgia! 15000 amazing supporters tonight! Everyone get out & #VoteTrump tomorrow! #SuperTuesday __HTTP__ _E_ .@GovernorPerry failed on the border. He should be forced to take an IQ test before being allowed to enter the GOP debate. _E_ Friends of mine who are driving Cadillacs it is becoming a very hot car are raving about what a great job @Cadillac has done. _E_ It means so much to me receiving an endorsement from Phyllis Schlafly. A truly great woman & conservative. __HTTP__ _E_ .@VanityFair magazine is doing so poorly that they make even @NYMag look good. Graydon Carter should've been fired a long time ago. _E_ I just gave lots of money away at Trump Tower to people who needed it...they were very happy and appreciative! _E_ Have a good chance to win Texas on Tuesday. Cruz is a nasty guy not one Senate endorsement and despite talk gets nothing done. Loser! _E_ If the government doesn't start working together the media is right & we will hit a fiscal cliff. We need to avoid this. _E_ The lobbyists & special interests have just put out an ad for Jeb which hits me just a little but is very false! 
_E_ Into our first week of filming @ApprenticeNBC the Celebrities are already turning up the heat. Major fireworks! _E_ Everyone is talking about the incredible event we had in Dallas last night. Spectacular crowd & arena! Thank you @mcuban. _E_ Via @CNNPolitics by @teddyschleifer: Trump: San Francisco killing shows perils of illegal immigration __HTTP__ _E_ .@Lexi Great job in winning your first of many majors . We are proud of you at Trump International. Work hard be an all time great! _E_ Fantastic job on @CNN tonight. @kayleighmcenany is a winner! @donlemon _E_ .@MELANIATRUMP and I are looking forward to watching @AnnDRomney's speech tonight. She is an amazing woman who will be a great First Lady! _E_ OPEC is setting crude at $94/barrel on 'signs US economy is improving.' OPEC uses any excuse to rip us off and our leaders just watch. _E_ A ship is only as good as the people who serve on it — and the AMERICAN SAILOR is the BEST in the world. @USNavy #USSGeraldRFord __HTTP__ _E_ Obama can sign an illegal executive action anytime for ObamaCare but he can't fix the illegal loophole. _E_ Love seeing union & non union members alike are defecting to Trump. I will create jobs like no one else. Their #Dem leaders can't compete! _E_ Join us tomorrow night in Charleston South Carolina! #SCPrimary #Trump2016 __HTTP__ _E_ They should close down Rolling Stone Magazine after the phony rape charge story. University of Virginia should sue them for big bucks! _E_ It was so great being in Nebraska last week. Today is the big day get out and vote! _E_ The Republican establishment out of self preservation is concerned w/ my high poll #'s. More concerned are Dems—I beat Hillary heads up! _E_ Meeting with biggest business leaders this morning. Good jobs are coming back to U.S. health care and tax bills are being crafted NOW! _E_ Andy Williams has died. He was a friend of mine and a great guy. _E_ That's Adrian in the elevator— he works at @TrumpTowerNY & he's got a lot of stories. 
#CelebApprentice _E_ So many people have told me that I should host Meet the Press and replace the moron who is on now. Just too busy especially next 10 years! _E_ Today the Democrats lose big. But tomorrow the Republicans must communicate a positive pro growth agenda. _E_ The scum that gets high on badly hurting old ladies and others through knockout assaults wouldn't feel that way with a gun at their head! _E_ It was a great honor to welcome President Petro Poroshenko of Ukraine to the @WhiteHouse today with @VP Pence.... __HTTP__ _E_ #TimeToGetTough presents bold solutions on taxes national security the debt dealing with OPEC and China and defeating @BarackObama. _E_ My interview with @HowardKurtz on #MediaBuzz will air tomorrow on @Fox at 11am and 5pm. Great job Howie very insightful. _E_ The Obama Economy workers added to disability and individuals added to food stamps more than doubles net jobs created __HTTP__ _E_ I met Prince on numerous occasions. He was an amazing talent and wonderful guy. He will be greatly missed! _E_ Great new numbers. Thank you! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Be sure to watch my #CPAC2015 speech with intro by @DLoesch and a Q&A with @seanhannity __HTTP__ _E_ #ElectionDay __HTTP__ __HTTP__ _E_ Unions who secure the border oppose the amnesty bill __HTTP__ Their expert opinions should at least be listened to. _E_ So Obama and Congress can waste billions in Iraq & Afghanistan building roads & schools but can't get money to the NJ & NY Sandy victims? _E_ Remember tonight's 8 o'clock episode of Celebrity Apprentice is the best ever—you will see nothing like it on tv. @ApprenticeNBC _E_ Mexico will pay for the wall 100%!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ Our economy has had worst recovery under Obama since the Depression. Results of his policies speak for themselves. No new taxes! _E_ Russian leaders are publicly celebrating Obama's reelection. They can't wait to see how flexible Obama will be now. 
_E_ Bob Corker gave us the Iran Deal & that's about it. We need HealthCare we need Tax Cuts/Reform we need people that can get the job done! _E_ .@Franklin_Graham so many people have tweeted about your amazing words to me thank you! Heading to big crowd in South Carolina! _E_ Watch the Miss Universe competition LIVE from the Bahamas Sunday 8/23 @ 9pm (ET) on NBC: __HTTP__ _E_ Greece should get out of the euro & go back to their own currency they are just wasting time. _E_ Wow Twitter Google and Facebook are burying the FBI criminal investigation of Clinton. Very dishonest media! _E_ The Governor of Puerto Rico Ricardo Rossello is a great guy and leader who is really working hard. Thank you Ricky! _E_ I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House. _E_ In the UK taxpayers are wasting £24 million on wind farms that don't even operate. __HTTP__ They (cont) __HTTP__ _E_ Thanks. __HTTP__ _E_ My thoughts on the situation in Norway and Amanda Knox... __HTTP__ #trumpvlog _E_ RT @Scavino45: .@POTUS @realDonaldTrump in the Oval Office w/senior U.S. military leaders prior to dinner hosted by the President & First L... _E_ President Obama's weakness and indecision may have saved us from doing a horrible and very costly (in more ways than money) attack on Syria! _E_ INTELLIGENCE INSIDERS NOW CLAIM THE TRUMP DOSSIER IS A COMPLETE FRAUD! @OANN _E_ ... but @billmaher is allowed to say that about me. _E_ The reporter who pulled back from his 14 year old never retracted story is having fun. I don't know what he looks like and don't know him! _E_ Conde Nast made a big mistake going into the World Trade Center. The place is a total disaster and I feel this is only the beginning! _E_ RT @T_Lineberger: Thanks @IvankaTrump for coming to help win Michigan! More people here than a Hillary rally with less than 24 hours notice... 
_E_ For accurate reporting of my @CPACnews speech read @PoliticalTicker @Newsmax_Media @politico @HuffPostPol.... _E_ Not good news for Jeb Bush __HTTP__ _E_ My shirts ties and suits are selling great @Macy's because they are the best and most stylish at a really reasonable price thanks! _E_ It is a great victory for NYC that A Rod will never wear pinstripes again. _E_ Consumer spending fell in September __HTTP__ Another indicator the 7.8% unemployment number is cooked. _E_ The @nytimes was very nice in reporting that @CelebApprentice was #1 on all television for "top brand impact 2012." Thank you! _E_ Great poll Florida thank you! #ImWithYou #AmericaFirst __HTTP__ _E_ Even the SEALS who killed Bin Laden don't like @JoeBiden __HTTP__ _E_ Happy Birthday @EricTrump! __HTTP__ _E_ Thank you Alex! __HTTP__ _E_ Who is winning the debate so far (just last name)? #DemDebate _E_ Of course the Australians have better healthcare than we do everybody does. ObamaCare is dead! But our healthcare will soon be great. _E_ Has Pres. Obama or the White House told the public what happened in Algeria yet? Where's the media? _E_ It is amazing how @LindseyGrahamSC gets on so many T.V. shows talking negatively about me when I beat him so badly (ZERO) in his pres run! _E_ "The future is always beginning now." Mark Strand former Poet Laureate _E_ "Be sure you put your feet in the right place then stand firm." Abraham Lincoln _E_ Unless you catch hackers in the act it is very hard to determine who was doing the hacking. Why wasn't this brought up before election? _E_ Remember that things are cyclical so be resilient be patient be creative and remain positive. Think Like a Champion _E_ Trump Golf Links at Ferry Point an 18 hole public golf course in the Bronx New York is opening soon! __HTTP__ _E_ Join me tomorrow in Des Moines Iowa with Vice President Elect @mike_pence at 7:00pm!#ThankYouTour2016 #MAGA... 
__HTTP__ _E_ How can FBI Deputy Director Andrew McCabe the man in charge along with leakin' James Comey of the Phony Hillary Clinton investigation (including her 33000 illegally deleted emails) be given $700000 for wife's campaign by Clinton Puppets during investigation? _E_ In war there is no substitute for victory. Douglas MacArthur _E_ Mogul Donald Trump has many powerful friends. And it turns out one of them is Anna Wintour." __HTTP__ via @FoxNews _E_ Thank you for all of your support! Let's #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_ The Old Post Office building in Washington (D.C.) will soon be transformed into one of the great hotels anywhere in the world lots of jobs! _E_ I want to do negative ads on John Kasich but he is so irrelevant to the race that I don't want to waste my money. _E_ Thank you Sacramento California! #MakeAmericaGreatAgain __HTTP__ _E_ "Ability is nothing without opportunity." Napoleon Bonaparte _E_ President Obama was able to fool the Americans by getting elected but not able to fool Vladimir Putin. Too bad for us! _E_ Will be on @SeanHannity tonight at 10pmE delivering an important speech live from Wisconsin. #MakeAmericaGreatAgain _E_ Why has all time hits leader Pete Rose paid a 20 year price whrn A Rod gets 200 game penalty. It's time to let Pete into The Hall of Fame! _E_ Good Morning America is thrilled @Rosie is working for the @todayshow that means almost guaranteed success for @GMA _E_ Our great VPE @mike_pence is in Louisiana campaigning for John Kennedy for US Senate. John will be a tremendous help to us in Washington. _E_ Had a great time on @IngrahamAngle this morning. _E_ Great new poll numbers! Thank you for your support! #Trump2016 __HTTP__ _E_ .@MattGinellaGC Thx for the nice story @TrumpDoral. Look forward to showing you Trump Int'l in Aberdeen in the spring & Turnberry plans. _E_ #sweepstweet @teresa_giudice definitely fell under @lisalampanelli's negotiation skills—an important business tool. 
_E_ Via @Newsmax_Media: Trump @oreillyfactor Make Up After Digs at Each Other __HTTP__ _E_ I will be on @oreillyfactor tonight on @FoxNews at 8 PM and 11 PM. _E_ I build beautiful websites with very smart and imaginative people for almost NOTHING. OUR GOVERNMENT SPENT ALMOST $535 000 000 for NOTHING _E_ I am the only candidate (in many years) who is self funding his campaign. Lobbyists and $ interests totally control all other candidates! _E_ ISIS is on the run & will soon be wiped out of Syria & Iraq illegal border crossings are way down (75%) & MS 13 gangs are being removed. _E_ . @BBCNews' child molestation sex scandal is the latest in continued downward spiral of BBC.I know personally they do not check for accuracy _E_ As a candidate I promised we would pass a massive tax cut for the everyday working Americans. If you make your voices heard this moment will be forever remembered as a great new beginning – the dawn of a brilliant American future shining with PATRIOTISM PROSPERITY AND PRIDE! __HTTP__ _E_ Via @TIME by @lullintheaction: #REALTIME: Donald Trump Weighs a 2016 Run At #CPAC2015 __HTTP__ _E_ .@stuartpstevens did a horrible job for Mitt—is a refund in order? Sadly Stuart is a disaster! _E_ The Republicans look so weak and foolish—what the hell are they doing? _E_ Entrepreneurs: Set the bar high. Do the best you possibly can. Apply your skills and talent but above all be tenacious. _E_ The Trump Doctrine: Peace Through Strength. #Trump2016 __HTTP__ __HTTP__ _E_ Great win last night by Peyton Manning & @Denver_Broncos in San Diego coming from 24 points behind on the road. Very impressive. _E_ What a great time we just had in the atrium of Tump Tower for __HTTP__ The place was happy and packed! _E_ Admiral McRaven had full operational control of the Bin Laden mission __HTTP__ @BarackObama gave vague directions. _E_ Thank you for the massive turnout tonight Cleveland Ohio! Get out & VOTE #TrumpPence16 on 11/8.Watch rally here:... 
__HTTP__ _E_ .@DottieandBogey Thanks for nice comments over weekend re Turnberry. You and your husband have fantastic taste! Also great commentary. _E_ Congratulations to @BretBaier on his five year anniversary as the anchor @SpecialReport. Brett is great! _E_ When the stupid people start feeling sorry for the Boston killer and want to release him and give him medals remember the killings maimings _E_ Virginia's highest rated wine by @WineEnthusiast @trumpwinery is inspired by the regions of Bordeaux & Champagne __HTTP__ _E_ The United States needs to fix its own problems of which there are many first! _E_ The hardest thing Clinton has to do is defend her bad decision making including Iraq vote e mails etc. _E_ Great pick by Buffalo Sammy Watkins will be GREAT! _E_ General John Allen who I never met but spoke against me last night failed badly in his fight against ISIS. His record = BAD #NeverHillary _E_ Only 36 days until the election. @MittRomney needs to stay on offense. Make Obama's terrible record the issue. #TimeToGetTough _E_ .@IamStevenT visited me at @TrumpTowerNY what a great guy! __HTTP__ _E_ Take a look at what happened w/ Bill Clinton. The system is totally rigged. Does anybody really believe that meeting was just a coincidence? _E_ Trump's Menie golf resort enjoys bumper first year __HTTP__ via @TheScotsman _E_ I look forward to meeting @joniernst today in New Jersey. She has done a great job as Senator of Iowa! _E_ ... By releasing his records he can come clean with the American people and have $5 million go to a charity. _E_ WikiLeaks emails reveal Podesta urging Clinton camp to 'dump' emails. Time to #DrainTheSwamp! __HTTP__ _E_ Our Southern border is totally out of control. This is an absolutely disgraceful. situation. __HTTP__ We need border security! _E_ Saw @mcuban try to hit a ball in Lake Tahoe while I played in tournament he's got no talent or strength!!!! 
@TMZ _E_ .@MattGinellaGC @GCMorningDrive Matt will be talking about Trump National Doral tomorrow A.M. Terrific guy looking forward to it! _E_ .@antbaxter I predict somebody is going to sue you! _E_ Many people are now saying I won South Carolina because of the last debate. I showed anger and the people of our country are very angry! _E_ Why does @KarlRove lie about his Reagan credentials? __HTTP__ He's a Bushie through and through. _E_ One good aspect of the Obama depression is that it will separate the winners from the losers. If you can make it now you deserve it! _E_ "Strong men have sound ideas and the force to make these ideas effective." Andrew Mellon _E_ .@danabrams editor of @mediaite explained on radio this morning that I am so widely covered because I draw high interest. True! _E_ My interview from yesterday on Fox and Friends GOP Crazy If They Don't Get Everything They Want __HTTP__ _E_ Thanks! __HTTP__ _E_ "Don't be afraid of mistakes. They can be learning tools on the way to building something great for yourself." Think Like a Champion _E_ Thank you ARIZONA! This is a MOVEMENT like nobody has ever seen before. Together we are going to MAKE AMERICA SAFE... __HTTP__ _E_ Lyin' Ted I have already beaten you in all debates and am way ahead of you in votes and delegates. You should focus on jobs & illegal imm! _E_ Government needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. #TimeToGetTough _E_ Thank you @GeraldoRivera @FoxandFriends. Agree! __HTTP__ _E_ Sleepy eyed @chucktodd thinks Las Vegas is a state see @todayshow this morning. _E_ I opposed going into Iraq. Hillary voted for it. As with everything else she's supported it was a DISASTER. __HTTP__ _E_ The Keystone pipeline will create 20000 jobs and make us less energy dependent from the Middle East. @BarackObama says No! _E_ Golf Odyssey just named Trump Scotland "Golf course of the year." 
__HTTP__ _E_ Marco Rubio was a complete disaster today in an interview with Chris Wallace @FoxNews concerning our invading Iraq.He was as clueless as Jeb _E_ CLINTON'S CLOSE TIES TO PUTIN DESERVE SCRUTINY: __HTTP__ #VPDebate _E_ Congratulations to @STEPHENATHOME I will see you on the show! _E_ The very foul mouthed Sen. John McCain begged for my support during his primary (I gave he won) then dropped me over locker room remarks! _E_ Shark Tank is a dead Friday night filler compared to the Apprentice which has been number one show for week in the T. V. ratings! _E_ Today as we Remember Pearl Harbor it was an incredible honor to be joined with surviving Veterans of the attack on 12/7/1941. They are HEROES and they are living witnesses to American History. All American hearts are filled with gratitude for their service and their sacrifice. __HTTP__ _E_ Apprentice will be amazing tomorrow night! _E_ Via @HuffPostPol: "Donald Trump: 'Republicans May Be The Worst Negotiators In History'" __HTTP__ _E_ Hillary Clinton is weak and ineffective no strength no stamina. _E_ Only those who will risk going too far can possibly find out how far one can go. T. S. Eliot _E_ The trade deficit rose to a 7yr high thanks to horrible trade policies Clinton supports. I will fix it fast JOBS! __HTTP__ _E_ I will be interviewed on @foxandfriends with the legendary Coach Bobby Knight tomorrow morning. Enjoy! #INDPrimary __HTTP__ _E_ The road to success is always under construction. Arnold Palmer _E_ President Xi thank you for such an incredible welcome ceremony. It was a truly memorable and impressive display! 📸 __HTTP__ __HTTP__ _E_ Great new ad from @CmteForIsrael: 'Next Year...President @MittRomney in Jerusalem the Capital of Israel' __HTTP__ _E_ Drugs are pouring into this country. If we have no border we have no country. That's why ICE endorsed me. #Debate #BigLeagueTruth _E_ CBO now estimates that over 2.5M will lose jobs directly because of ObamaCare. REPEAL now before it is too late. 
_E_ Thank you @thefix for your very honest commentary. One thing we do have great teams in IA NH SC and beyond. __HTTP__ _E_ .@KarlRove wasted $400 million + and didn't win one race—a total loser. @FoxNews _E_ ‏.@richardroeper Perhaps one of the worst replacements in showbiz once you went on it was over! Your taste sucks! _E_ Do you believe that @UnionLeader in NH was demanding ads? Look at enclosed letter from them just received: __HTTP__ _E_ The lights went out in New Orleans...the Country's lights went out also. We are not the same place! _E_ Thank you for today's endorsement New York Veteran Police Association! #NewYorkValues __HTTP__ __HTTP__ _E_ New Gravis Poll in NH just out: Trump 32% Carson 13% __HTTP__ _E_ The silent majority is silent no more! Remember the importance of VOTING!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Hear Donald Trump discuss big gov spending banks & taxes on Your World w/Neil Cavuto: __HTTP__ _E_ Weekly Address from @WhiteHouse: __HTTP__ __HTTP__ _E_ Despite firing @StephenBaldwin7 in last night's All Star Celebrity @ApprenticeNBC Stephen had strong overall performance this season _E_ We will defend our country protect our communities and put the safety of the AMERICAN PEOPLE FIRST! Replay: __HTTP__ __HTTP__ _E_ We are going to make our country so strong again so great again. No more ripping off the United States. We will MAKE AMERICA GREAT AGAIN! _E_ Reporters say it's the Trump Bump I tell CNBC I am buying stocks and the market goes up. _E_ Congratulations to @piersmorgan on his new position as Editor at Large for the United States of @MailOnline! My Apprentice champ! _E_ Turn on @oreillyfactor now and enjoy true brilliance! _E_ After a great evening and packed auditorium in Iowa I am now in Colorado looking forward to what I am sure will be a very unfair debate! _E_ ... where he raised 2 million dollars for the wonderful kids. Eric has a great heart! _E_ Putin just sent a Russian nuclear sub to the Gulf of Mexico. 
@BarackObama can't be bothered he is too concerned with @MittRomney's taxes. _E_ The Trump Organization Finalizes Purchase of Legendary Turnberry Resort in Scotland. __HTTP__ _E_ "A vampire with a day pass?" We are in @THEGaryBusey land. #CelebApprentice _E_ Watched Saturday Night Live hit job on me.Time to retire the boring and unfunny show. Alec Baldwin portrayal stinks. Media rigging election! _E_ The @Yankees should break A Rod's contract immediately—he misrepresented. _E_ #Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_ Very sad what happened last night at the Miss Universe Pageant. I sold it 6 months ago for a record price. This would never have happened! _E_ I'll be on @gretawire tonight at 10 PM Fox News _E_ Now that the election is over watch Chrysler ship @Jeep production to China my prediction. _E_ Two policemen just shot in San Diego one dead. It is only getting worse. People want LAW AND ORDER! _E_ The #1 trend on Twitter right now is #TrumpWon thank you! _E_ You must admit that Bryant Gumbel is one of the dumbest racists around an arrogant dope with no talent. Failed at CBS etc why still on TV? _E_ "In order to build your wealth and improve your business smarts you need to know about real estate." Think Like a Billionaire _E_ I am going to Iowa today sold out crowds. People don't want our country ripped off anymore. Must stop now! _E_ As we told the @nydailynews I was asked to speak at the RNC but said no because I will be doing something much bigger just watch! _E_ RT @FLOTUS: Preparations are underway to celebrate the holidays at the @WhiteHouse! __HTTP__ _E_ Amazing my tweets are covered across every spectrum from @espn to @politico to @WSJ. _E_ "Offshore wind is a dead duck in Scotland and it's time Alex Salmond manned up stopped blaming Westminster (cont) __HTTP__ _E_ "You just can't beat the person who never gives up." – Babe Ruth _E_ .@McIlroyRory Thanks for your nice note they love you at Trump National Doral. 
You are looking good will have a GREAT year! _E_ You are always there for us – THE MEN AND WOMEN IN BLUE.Thank you to our police thank you to our sheriffs and thank you to our law enforcement families. God Bless you all and GOD BLESS AMERICA! #LESM __HTTP__ _E_ The best social program by far is a JOB! Our jobs are being taken away from us by China and many other countries incompetent leader. _E_ On behalf of @FLOTUS Melania and myself thank you for a wonderful dinner and evening President Sergio Mattarella.... __HTTP__ _E_ A person who never made a mistake never tried anything new. Albert Einstein _E_ Carl Cameron @FoxNews is the only reporter I know who consistently fumbles & misrepresents poll results. He has been so wrong & he hates it! _E_ RT @TeamTrump: Law enforcement officers bring communities together & keep us safe. @mike_pence & @realDonaldTrump RESPECT & stand by them!... _E_ Worired that the USC will strike down ObamaCare @BarackObama is trying to implement his debacle in public schools __HTTP__ _E_ Trump Int'l Hotel & Tower Toronto. #1 in all of Canada. __HTTP__ _E_ .@BarackObama is promoting ugly inefficient unreliable bird killing noisy neighborhood destroying wind turbines. Big mistake. _E_ Is Anthony Weiner also delusional? Add him to NY Sex Offender list instead! _E_ January 20th 2017 will be remembered as the day the people became the rulers of this nation again. _E_ I said that Crooked Hillary Clinton is not qualified to be president because she has very bad judgement Bernie said the same thing! _E_ Iranian Pastor #Nadarkhani has just been sentenced to death by the Mullahs because he is a Christian (cont) __HTTP__ _E_ THE CHOICE IS CLEAR!#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_ After tearing W Bush down for 12 years now the media loves him. Why not? He gave them Obama. _E_ In my office with Banana Joe who just won the @WKCDOGS at @MSGnyc. 
__HTTP__ _E_ Congratulations to @seanhannity on his great ratings and ratings increase as reported by the @AP today. Amazing job! _E_ Alert...The president knew that the ambassador was being attacked in Benghazi. He did nothing...he is no leader. _E_ Jamiel Shaw was incredible on @foxandfriends this morning. His son who was viciously killed by an illegal immigrant is so proud of pop! _E_ To EVERYONE including all haters and losers HAPPY NEW YEAR. Work hard be smart and always remember WINNING TAKES CARE OF EVERYTHING! _E_ My @gretawire interview where I discuss the #ObamaCare USC argument gas prices & @IvankaTrump's new clothing line __HTTP__ _E_ Via @TMZ_Sports: "Donald Trump: Don't Mess Up @terrellowens' Name. 'I've Seen Him Go Crazy At People'" __HTTP__ _E_ Pres. Obama's steady support of @Israel throughout this crisis helped stop the war. He did a good job. _E_ Miami Dade Mayor drops sanctuary policy. Right decision. Strong! __HTTP__ _E_ Macy's was very disloyal to me bc of my strong stance on illegal immigration. Their stock has crashed! #BoycottMacys __HTTP__ _E_ Amy Pascal of Sony was totally used by Rev. Al Sharpton. She should be fired for stupidity. _E_ Melania our great and very hard working First Lady who truly loves what she is doing always thought that "if you run you will win." She would tell everyone that "no doubt he will win." I also felt I would win (or I would not have run) and Country is doing great! _E_ I hope we never find life on another planet because if we do there's no doubt that the United States will start sending them money! _E_ It is a MOVEMENT not a campaign. Leaving the past behind changing our future. Together we will MAKE AMERICA SAF... __HTTP__ _E_ After @TrumpScotland I will visit @TrumpDoonbeg in Ireland the magnificent resort fronting on the Atlantic Ocean. _E_ "Tomorrow is the first blank page of a 365 page book. Write a good one." 
— @BradPaisley _E_ #MakeAmericaSafeAgain!#GOPConvention #RNCinCLE __HTTP__ __HTTP__ _E_ National GOP Presidential Poll via @OANN @realDonaldTrump 35.6% #Trump2016 __HTTP__ _E_ Intelligence agencies should never have allowed this fake news to leak into the public. One last shot at me.Are we living in Nazi Germany? _E_ I was the first & only potential GOP candidate to state there will be no cuts to Social Security Medicare & Medicaid. Huckabee copied me. _E_ Rumor has it Apple is going to release iPhones with bigger screens. That's good news. _E_ Washington needs common sense conservative solutions. Let's make America great again! __HTTP__ _E_ My new book Time To Get Tough will be out Dec 5th. Solutions you won't hear from the politicians. The bes... (cont) __HTTP__ _E_ Thank you to @IvankaTrump for her wonderful acknowledgement this morning on @foxandfriends... _E_ There are many Jonathan Gruber types selling the global warming stuff and they really do believe the American public is stupid. _E_ __HTTP__ _E_ Great day in Virginia. Crowd was fantastic! _E_ Excited for tomorrow's Politics & Eggs @saintanselm co hosted by @NECouncil & @nhiop. Live stream here __HTTP__ _E_ Looking forward to meeting the great folks of Sarasota GOP party when I am honored as 'Statesman of the Year.' Should be a wonderful time. _E_ This assignment has stretched not just the imaginations but the patience quotas of @lisarinna and @pennjillette. #CelebApprentice _E_ I started to get very worried about Mitt's chances when I heard that A Rod donated to his campaign. Everything A Rod touches turns bad. _E_ Sorry I won't be able to do @foxandfriends at 7 AM on Monday—will be in India. _E_ Via @thehill by @HenschOnTheHill: "Trump says US roads are 'falling apart'" __HTTP__ _E_ I will be on @foxandfriends at 7:00 there is much to talk about (sadly)! Enjoy! _E_ Thank you @megynkelly for the nice things you said about Melania. You will like her great heart and smart always wanting to help people! 
_E_ Surprise In a post election delayed release food stamp rolls surged to biggest monthly increase and an all time high __HTTP__ _E_ Named best golf course in the world by @RobbReport Trump Int'l Golf Links Scotland is a 7400 yd par 72 __HTTP__ _E_ I really enjoyed last night's Tele Town Hall with @ralphreed's Faith and Freedom Coalition. Thanks to the thousands who joined. _E_ Congratulations to Bernie Marcus & Herman Cain @JobCreatorsUSA on the #TruthTour2012 All employers need to check this out! _E_ Why are we sending thousands of ill trained soldiers into Ebola infested areas of Africa! Bring the plague back to U.S.? Obama is so stupid. _E_ I'll be on @Foxandfriends Monday at 7:30 AM. _E_ We're not talking about religion we're talking about security. #GOPDebate __HTTP__ _E_ Looks like Obama will not stop the very potentially dangerous flights to and from West Africa. What the hell is wrong with this guy? _E_ THANK YOU to everyone in Little Rock Arkansas tonight! A record crowd of 12K. #Trump2016 __HTTP__ __HTTP__ _E_ On the luxurious Palos Verdes Peninsula @TrumpGolfLA features @GolfWorldUS' top public course & elite restaurants __HTTP__ _E_ Via @kmovnewsfeed: Photos: Tour Donald Trump's NC golf club __HTTP__ _E_ 32º in New York it's freezing! Where the hell is global warming when you need it? _E_ I am the only Republican who will get large numbers of Dems and Indies (crossover). I will also get states that no other Republican can get. _E_ .@IvankaTrump is right—Plan B has descended into a state of total chaos. #CelebApprentice _E_ "George has a real twinkle about him" says @TheRealMarilu. Really? The shark should be scared. #CelebApprentice _E_ Just landed in New Hampshire a very exciting morning planned! _E_ #AmericaFirst #ImWithYou __HTTP__ _E_ Who do you like hate so far? _E_ released by Intelligence even knowing there is no proof and never will be. My people will have a full report on hacking within 90 days! 
_E_ Thank you to Time Magazine and Financial Times for naming me Person of the Year a great honor! _E_ Romney's failed advisors like campaign mgr Stuart Stevens are all over TV telling people how to win. But they lost don't know how to win! _E_ If we let Crooked run the govt history will remember 2017 as the year America lost its independence. #DrainTheSwamp __HTTP__ _E_ Via @DMRegister by @SharynJackson: "Trump: @SteveKingIA has 'the right views' __HTTP__ _E_ The 'brunt' of ObamaCare will be shouldered by folks making under $120K __HTTP__ _E_ I would like to thank @GolfMagazine for the really nice review of Trump National Doral Best Renovation of the Year (and maybe all time). _E_ My motto is: 'Never give up.' I follow this very strictly. I do not let problems and challenges stop me they are normal. _E_ Wow @Politico is in total disarray with almost everybody quitting. Goodnews bad dishonest journalists! __HTTP__ _E_ A great American Kurt Cochran was killed in the London terror attack. My prayers and condolences are with his family and friends. _E_ New York Fashion Week is really bad and used to be so glamorous and exciting! No stars no fun just boring. They need serious help. #NYFW _E_ Glad to hear @BrentBozell @marklevinshow @EWErickson & @TPPatriots are standing up to @KarlRove's attack on the Tea Party. _E_ Thank you America! #Trump2016Via @DRUDGE_REPORT __HTTP__ _E_ Our VISA system is broken like so much else in our country. We better get it fixed really fast. MAKE AMERICA GREAT AGAIN! _E_ Our wonderful future V.P. Mike Pence was harassed last night at the theater by the cast of Hamilton cameras blazing.This should not happen! _E_ To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment. Ralph Waldo Emerson _E_ 'Trump administration seen as more truthful than news media' __HTTP__ _E_ Yes I won the right to have my name taken off Trump Plaza in A.C. 
because it was not operated up to a very high standard and NO involvement _E_ I am in Las Vegas at the best hotel (by far) Trump International. I will be working with my wonderful teams and volunteers to WIN Nevada! _E_ Via @Newsmax_Media by @wandacarruthers: "Trump: Baghdad Likely to Fall to ISIS" __HTTP__ _E_ VOTE #TrumpPence16 on 11/8/16! __HTTP__ _E_ The media must immediately stop calling ISIS leaders MASTERMINDS. Call them instead thugs and losers. Young people must not go into ISIS! _E_ ...want everything to be done for them when it should be a community effort. 10000 Federal workers now on Island doing a fantastic job. _E_ Obama has no understanding of how to create jobs or opportunity. He believes in Government. _E_ It was great seeing @MissUniverse and @MissTeenUSA yesterday __HTTP__ _E_ .@BillMaher's show is great for helping me get to sleep better than Sominex. _E_ In light of the horrible attack in Nice France I have postponed tomorrow's news conference concerning my Vice Presidential announcement. _E_ "It's sad—truly sad and disgraceful—the way Obama has allowed America to be abused and kicked around (cont) __HTTP__ _E_ The Tea Party is filled with great Americans. Despite being mistreated by everyone including @GOP they will continue to fight on _E_ A very big thank you to Bill Donohue head of The Catholic League for the wonderful interview on @CNN and article in Newsmax! Great insight _E_ I told you! Premiums are soaring! #RepealObamacare #Trump2016 __HTTP__ _E_ RT @glamourizes: @realDonaldTrump Only true Americans can see that president Trump is making America great. He's the only person who can! H... _E_ When somebody challenges you fight back be tough! _E_ Crooked Hillary Clinton knew that her husband wanted to meet with the U.S.A.G. to work out a deal. The system is totally rigged & corrupt! _E_ Will be in Novi Michigan this Friday at 5:00pm. Join the MOVEMENT! 
Tickets available at: __HTTP__ __HTTP__ _E_ I hope corrupt Hillary Clinton chooses goofy Elizabeth Warren as her running mate. I will defeat them both. _E_ I will be in Huntsville Alabama on Saturday night to support Luther Strange for Senate. Big Luther is a great guy who gets things done! _E_ I hope Washington makes a good deal to avert the fiscal cliff. Both sides need to work together. _E_ People ask about @AmandaTMiller. She is actually a VP of Marketing at the Trump Organization. #CelebApprentice _E_ Will be leaving Palm Beach for the 11 A.M. ceremony opening the magnificent GARY PLAYER VILLA at Trump Nationak Doral Miami. GARY IS GREAT! _E_ Really enjoyed my interview with @marklevinshow. He is terrific! _E_ The phony lawsuit against Trump U could have been easily settled by me but I want to go to court. 98% approval rating by students. Easy win _E_ Great making keynote speech at 2014 Lincoln Day Dinner hosted by Dan Isaacs & NY Republican County Committee. Wonderful people! _E_ .@katyperry is no bargain but I don't like John Mayer he dates and tells be careful Katy (just watch!). _E_ Crooked Hillary Clinton is soft on crime supports open borders and wants massive tax hikes. A formula for disaster! _E_ Join me in Reno Nevada tomorrow at 3:30pm! #AmericaFirst #MAGATickets: __HTTP__ _E_ The media and establishment want me out of the race so badly I WILL NEVER DROP OUT OF THE RACE WILL NEVER LET MY SUPPORTERS DOWN! #MAGA _E_ FLORIDA: Do not miss this opportunity to #MakeAmericaGreatAgain! Thank you @IvankaTrump: __HTTP__ __HTTP__ _E_ Oregon is voting today. Keep the big numbers going VOTE TRUMP! MAKE AMERICA GREAT AGAIN! _E_ An iconic building and top tourist attraction @TrumpTowerNY sets New York City's luxury standard __HTTP__ & great food! _E_ Laura Massive crowd had to move to Phoenix Convention Center. __HTTP__ _E_ Wow sleepy eyes @chucktodd is at it again. He is do totally biased. The things I am saying are correct. 
far better vision than the others _E_ Now that Iran ripped us off by making one of the best deals of any kind in history they have just moved to block any imports from the U.S. _E_ In today's #trumpvlog I answer your questions about what you should be doing in this uncertain economy... __HTTP__ _E_ Thank you to the BRAVE servicemen & women who have served and continue to serve the United States our true HEROES... __HTTP__ _E_ Follow Trump @DoralResort's WGC @CadillacChamp leadership board here at @nbc's @GolfChannel __HTTP__ _E_ Leadership: Whatever happens you're responsible. If it doesn't happen you're responsible. _E_ Amazing! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ According to @pewresearch illegal immigrants favor Dems 8:1 __HTTP__ @GOP pushing amnesty. Do they have death wish _E_ Putin's letter is a masterpiece for Russia and a disaster for the U.S. He is lecturing to our President.Never has our Country looked to weak _E_ ...now it's the "greatest pageant on earth" broadcast in 190 countries to 1 billion people—"hot!" _E_ either elect more Republican Senators in 2018 or change the rules now to 51%. Our country needs a good shutdown in September to fix mess! _E_ Article from The Street The Donald's Trump Card: Himself __HTTP__ _E_ Republicans have the cards because of the debt ceiling—but it doesn't seem that way! _E_ Even the Left realizes that @BarackObama's policies have led to more jobs being outsourced out of this country. __HTTP__ _E_ Hillary says things can't change. I say they have to change. It's a choice between Americanism and her corrupt globalism. #Imwithyou _E_ Dennis Rodman was either drunk or on drugs (delusional) when he said I wanted to go to North Korea with him. Glad I fired him on Apprentice! _E_ Watch Kasich squirm if he is not truthful in his negative ads I will sue him just for fun! _E_ The best investors are visionaries—they look beyond the present. 
_E_ Young entrepreneurs – remember quality and results are the key metrics to success. _E_ Next she says she's being set up by Omarosa to fail....is somebody confused? _E_ Thank you Congressman Steven Palazzo! __HTTP__ __HTTP__ _E_ It was wonderful to have President Petro Poroshenko of Ukraine with us in New York City today. #UNGA __HTTP__ __HTTP__ _E_ Senator @LindseyGrahamSC made horrible statements about @SenTedCruz – and then he endorsed him. No wonder nobody trusts politicians! _E_ My @SquawkCNBC #TrumpTuesday interview discussing the 2012 election OPEC ripping us off & @MittRomney's job policy __HTTP__ _E_ Toyota & Mazda to build a new $1.6B plant here in the U.S.A. and create 4K new American jobs. A great investment in American manufacturing! _E_ No wonder @NYMag is doing so poorly with an idiot Sr. Editor like @DanAmira it will only get worse! _E_ None of Romney's leaked comments change the fact that Obama is a complete disaster. 20% real unemployment and $6T in deficit spending. _E_ Hillary is too weak to lead on border security no solutions no ideas no credibility.She supported NAFTA worst deal in US history. #Debate _E_ Wow! Honored to be chosen by the highly respected + accurate Washington & Lee Mock Convention. I hope you are right I will make you proud! _E_ Pres @BarackObama expects @MittRomney to play nice like @SenJohnMcCain it's not going to happen & the result is going to be much different. _E_ WEEKLY ADDRESS __HTTP__ _E_ No money wasted like bad ads—the Republicans spent more & got nothing for it. _E_ Watch this tour by @TrumpIntRealty's @M_Griffith1 of this luxurious penthouse in Trump Park Avenue __HTTP__ _E_ Another attack this time in Germany. Many killed. God bless the people of Munich. _E_ Crooked Hillary wants to take your 2nd Amendment rights away. Will guns be taken from her heavily armed Secret Service detail? Maybe not! _E_ RT @newtgingrich: Seems out of touch w/ reality to announce a VP nominee before securing 1237 delegates. 
__HTTP__ __HTTP__ _E_ #VoteTrumpNH #NHPrimary #FITN __HTTP__ _E_ I will be interviewed on @greta tonight at 7pm. Enjoy! __HTTP__ _E_ .@MittRomney shouldn't give additional tax returns until @BarackObama gives his passport records college records & applications... _E_ Every American needs to say 2 simple words to every Vet they meet: THANK YOU! John Wayne Walding __HTTP__ _E_ Job openings are at a 4 year high but businesses aren't hiring __HTTP__ Why? ObamaCare US debt & @BarackObama's tax plan. _E_ Seems like the teams are surprised when @THEGaryBusey comes back. #CelebApprentice _E_ "There can be no liberty unless there is economic liberty." – The Iron Lady Margaret Thatcher _E_ My appearance on The View... __HTTP__ and __HTTP__ _E_ It is almost time. I will be making a major announcement from @TrumpTowerNY at 11AM. Follow on social media! #MakeAmericaGreatAgain _E_ The federal gov. has handled Sandy worse than Katrina. There is no excuse why people don't have electricity or fuel yet. _E_ Stock Market hits new Record High. Confidence and enthusiasm abound. More great numbers coming out! _E_ Sometimes people spend too much time focusing on problems instead of focusing on opportunities Think Like a Champion _E_ Really big crowd expected tomorrow morning at # CPAC2013. I look forward to it! _E_ #MakeAmericaGreatAgain #Trump2016 Story: __HTTP__ __HTTP__ _E_ If only speeches could create jobs then @BarackObama wouldn't have such a dismal economic record. _E_ .@BradSteinle Great talking to you and your parents—fantastic people. Keep your sister's very important memory alive—big impact! _E_ "A savvy investor is a sponge for information. You have to read the newspapers... _E_ Priorities: @BarackObama wants to slash a Trillion dollars from military spending while raising the salaries of (cont) __HTTP__ _E_ I give the President's speech a 7 on the scale of 0 to 10! Not bad but room for improvement! _E_ Why was the Hanukah celebration held in the White House two weeks early? 
@BarackObama wants to vacation in Hawaii in late December. Sad. _E_ Iran must immediately allow Christian #PastorSaeed out of prison or we should put back sanctions (which should never have been lifted) _E_ I was never a fan of Colin Powell after his weak understanding of weapons of mass destruction in Iraq = disaster. We can do much better! _E_ I am in New Hampshire. Just received great news from Reuters poll. Thank you for your support! __HTTP__ _E_ The @FBIPressOffice police & others are doing an amazing job. How genius was it putting together that tape? _E_ Does anybody really believe that Bill Clinton and the U.S.A.G. talked only about grandkids and golf for 37 minutes in plane on tarmac? _E_ Miss Universe 2012 Pageant will be airing live on @nbc & @Telemundo december 19th. Open invite stands for Robert Pattinson. _E_ I will replace it with private plans health savings accounts & allow purchasing across state lines. Maximum choice & freedom for consumer. _E_ Word is that @NBCNews is firing sleepy eyes Chuck Todd in that his ratings on Meet the Press are setting record lows. He's a real loser! _E_ Now that China's own economy is slowing __HTTP__ watch how they start doing even bigger numbers in (cont) __HTTP__ _E_ Insurgents in Iraq show they can still mount horrifying attacks US wastes trillions. _E_ My meetings with President Xi Jinping were very productive on both trade and the subject of North Korea. He is a highly respected and powerful representative of his people. It was great being with him and Madame Peng Liyuan! _E_ Florida has been very good to me. I am really esxcited to give back at the Sarasota GOP event and @RNC convention. Will be fun! _E_ All weights are on crane's wrong side very precarious below move out! _E_ We should leave Afghanistan immediately. No more wasted lives. If we have to go back in we go in hard & quick. Rebuild the US first. _E_ Government can be efficient with the right leadership. 
Let's Make America Great Again __HTTP__ _E_ Via @GolfweekMag by @BKleinGolfweek: "Donald Trump reopens Doral's Blue Monster" __HTTP__ _E_ Entrepreneurs: Don't tread water. Get out there and go for it. _E_ Hillary said I really deplore the tone and inflammatory rhetoric of his campaign. I deplore the death and destruction she caused stupidity _E_ RT @foxandfriends: HAPPENING TODAY: House to vote on immigration bills including 'Kate's Law' and 'No Sanctuary for Criminals Act' __HTTP__ _E_ If you want to know about Hillary Clinton's honesty & judgment ask the family of Ambassador Stevens. _E_ Thank you Rhode Island! #Trump2016 __HTTP__ _E_ #ICYMI On Saturday I signed two EO's to help keep jobs & wealth in our country.EO1: __HTTP__ EO2:... __HTTP__ _E_ Kentucky has a chance to have the Senate Majority Leader Mitch McConnell representing it in Washington. Big power for State. Don't blow it _E_ massive increases of ObamaCare will take place this year and Dems are to blame for the mess. It will fall of its own weight be careful! _E_ What truly matters is not which party controls our government but whether our government is controlled by the people. _E_ Will be on @foxandfriends at 8:00. Enjoy! _E_ Will be in Orlando Florida this afternoon. 25000 people expected. This is a movement like our GREAT COUNTRY has never seen before! _E_ My @Shalom_TV interview discussing my video endorsement of @IsraeliPM @netanyahu and past visits to @Israel __HTTP__ _E_ ...money to Bill the Hillary Russian reset praise of Russia by Hillary or Podesta Russian Company. Trump Russia story is a hoax. #MAGA! _E_ Watched low rated @Morning_Joe for first time in long time. FAKE NEWS. He called me to stop a National Enquirer article. I said no! Bad show _E_ On my way to Iowa just received new national poll numbers. Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Very productive bilateral meeting with Prime Minister Benjamin @Netanyahu of Israel in Davos Switzerland! 
#WEF18 __HTTP__ _E_ My thoughts on the Emmys in today's #trumpvlog.... __HTTP__ _E_ Why are we still giving billions of dollars we don't have in foreign aid to the Muslim Brotherhood in Egypt? _E_ Politicians are trying to chip away at the 2nd Amendment. I won't let them take away our guns! #Trump2016Watch: __HTTP__ _E_ "Do not give in to anger. It destroys your focus on goals and ruins your concentration." – Think Big _E_ The upcoming All Star @CelebApprentice puts the celebrities under the hardest tasks we have ever given. We really pushed the envelope _E_ I hope Mark Zuckerberg signs a prenup with his current girlfriend perhaps soon to be wife. Otherwise she can walk away with 9 billion. _E_ My @bostonherald interview on Tom Brady Hillary Clinton the Granite State & Making America Great Again! __HTTP__ _E_ Via @washingtonpost by @costareports: "Trump says he is serious about 2016 bid is hiring staff and delaying TV gig" __HTTP__ _E_ Only 88000 jobs were added this past March. Prediction was 190000. Businesses can't expand with Obama Care & high taxes on horizon. _E_ Just left Columbus rally of 14000 people a far bigger crowd than even I expected! Unbelievable evening incredible spirit in the arena! _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Thank you. __HTTP__ _E_ Consumer Confidence is at an All Time High along with a Record High Stock Market. Unemployment is at a 17 year low. MAKE AMERICA GREAT AGAIN! Working to pass MASSIVE TAX CUTS (looking good). _E_ Joan Rivers on The Apprentice tonight at 8:00. I will be live tweeting. JOAN WAS GREAT! _E_ Are you allowed to impeach a president for gross incompetence? _E_ Sleep eyes @ChuckTodd is killing Meet The Press. Isn't he pathetic? Love watching him fail! _E_ Via @thehill by @HugginsRachel: "Trump looking 'very seriously' at 2016 run" __HTTP__ _E_ "Representing your own brand yourself is the best way to go. If you can't sell it who will?" 
– Midas Touch _E_ For eight years Russia ran over President Obama got stronger and stronger picked off Crimea and added missiles. Weak! @foxandfriends _E_ The GOP needs to learn how to get tough and outnegotiate @BarackObama and his big spending allies in (cont) __HTTP__ _E_ Good luck to Bob Kraft Tom Brady and Coach Bill Belichick tonight. _E_ Internationally recognized as an iconic landmark @TrumpTowerNY beams over Fifth Avenue __HTTP__ _E_ A great day at the White House! _E_ It is fatal to enter any war without the will to win it. Douglas MacArthur _E_ Via @Mediaite by @evanmcmurry: "Trump Calls @AGSchneiderman a Cokehead" __HTTP__ Schneiderman is by his own admission! _E_ North Korea has conducted a major Nuclear Test. Their words and actions continue to be very hostile and dangerous to the United States..... _E_ Massive crowd in VT tonight. Venue not big enough. Officials say NO to outside event and sound system. Arrive early! _E_ Wow @CNN is so negative. Their panel is a joke biased and very dumb. I'm turning to @FoxNews where we get a fair shake! Mike will do great _E_ Via @MailOnline Trump still in the lead by a whopping 14 points after fluke survey had put Carson on top __HTTP__ _E_ RT @realDonaldTrump: On #PurpleHeartDay💜I thank all the brave men and women who have sacrificed in battle for this GREAT NATION! #USA __HTTP__ _E_ RT @realDonaldTrump: HAPPY 241st BIRTHDAY to the @USArmy! THANK YOU! __HTTP__ _E_ Good luck! Enjoy. __HTTP__ _E_ With all of the bad economic numbers and horrendous foreign policy Obama should be down by 12 points and he's not. _E_ WEEKLY ADDRESS __HTTP__ _E_ A great ad from @MittRomney showing A Few of the 23 Million unemployed who need economic change __HTTP__ Take it to him Mitt! _E_ His @BarackObama's budget: interest payments to China will exceed US defense spending by 2019 __HTTP__ @BarackObama's America! _E_ Angela Merkel is doing a fantastic job as the Chancellor of Germany. 
Youth unemployment is at a record low & she has a budget surplus. _E_ Now with the Danger Weiner campaign dead time to focus on crazy Eliot Spitzer. A man who has never earned 10 cents in his life. _E_ When ISIS caught the soldiers do you think they read them their legal rights prior to executing them? _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Watch Obama's favorability numbers drop even further if he doesn't accept my charitable offer. No one approves (cont) __HTTP__ _E_ Hurricane Irene and Libya in today's #trumpvlog.... __HTTP__ _E_ Wow more than 90% of Fake News Media coverage of me is negative with numerous forced retractions of untrue stories. Hence my use of Social Media the only way to get the truth out. Much of Mainstream Meadia has become a joke! @foxandfriends _E_ If you stop by Trump Tower (Fifth Avenue between 56th and 57th Streets) you can buy a pre signed copy of #TimeToGetTough. _E_ Thank you to the amazing law enforcement officers today in Daytona Beach Florida! #LESM #MAGA __HTTP__ _E_ By rejecting my ad on ugly windmills & @AlexSalmond's faulty thinking on the "Lockerbie bomber" the ad is now on worldwide newscasts. _E_ RT @GOPChairwoman: .@realDonaldTrump is the Paycheck President. Learn how the tax bill will put more money in your pocket & how to contact... _E_ Will be going to Pennsylvania today in order to give my total support to RICK SACCONE running for Congress in a Special Election (March 13). Rick is a great guy. We need more Republicans to continue our already successful agenda! _E_ Back by popular demand @TraceAdkins delivers in the upcoming @CelebApprentice All Stars season. Yes he sings. _E_ John Kasich should focus his special interest money on building up his failed image not negative ads on me. _E_ This Sunday's All Star Celebrity @ApprenticeNBC features the return of @Joan_Rivers. Sunday at 9 PM on @NBC full 2 hours. _E_ Wisconsin we will MAKE AMERICA GREAT AGAIN! 
_E_ Join me live from Bedminster New Jersey: __HTTP__ _E_ Getting ready to leave for Melbourne Florida. See you all soon! _E_ Via @11AliveNews by @JenniferJJacobs: "Trump heads to Iowa as '16 speculation rises" __HTTP__ _E_ Anytime you see a story about me or my campaign saying sources said DO NOT believe it. There are no sources they are just made up lies! _E_ Just heard that the great Golf Week Magazine named my Trump International Golf Course Scotland The Best Modern Day Golf Course In The World! _E_ Why do the Republicans keep apologizing on the so called birther issue? No more apologies take the offensive! _E_ Omarosa always promises and delivers high drama... _E_ Debate polls look great thank you!#MAGA #AmericaFirst __HTTP__ _E_ Now China is trying to take over a U.S. airbase __HTTP__ This is only the beginning. They only understand toughness! _E_ A massive tax increase will be necessary to fund Crooked Hillary Clinton's agenda. What a terrible (and boring) rollout that was yesterday! _E_ RT @Fuctupmind: @realDonaldTrump Donald Trump's amazing golf swing #CrookedHillary __HTTP__ _E_ The @rydercup is currently going on and is one of the truly great sporting events. _E_ Why is crude oil priced at $86/Barrel? OPEC is ripping us off. Not worth $30/Barrel. America needs new leaders. _E_ I will be announcing my decision on the Paris Accord over the next few days. MAKE AMERICA GREAT AGAIN! _E_ Leadership: the art of getting someone else to do something you want done because he wants to do it. Dwight D. Eisenhower _E_ What took investigators so long to interview the pilots of Asiana San Fran crash? WHY NO DRUG TESTS FOR PILOTS they were really off . _E_ .@Apprenticenbc cast will be announced tomorrow at 7:30am ET on the @todayshow with @MLauer _E_ Gary Johnson is asking people to waste their vote on him. Make it count vote for @MittRomney. _E_ THANK YOU to everyone who joined me at the @WhiteHouse yesterday. Together we are MAKING AMERICA GREAT AGAIN! 
__HTTP__ __HTTP__ _E_ .@TheRealMarilu was very impressive and is a great person. The All Star Celebrity @ApprenticeNBC viewers loved her. _E_ .@HillaryClinton has been part of the rigged DC system for 30 years? Why would we take policy advice from her? #Debates2016 _E_ .@pennjillette and @dennisrodman as PM's I'm proud of Dennis and his performance this season. #CelebApprentice _E_ Good luck @MittRomney tonight have no doubt you will be great. _E_ Whitey Bulger's prosecution starts today. Will be one of the most interesting and intriguing trials. _E_ "Donald Trump unveils vision for @TrumpTurnberry" __HTTP__ via @BunkeredOnline by @MMcEwanBunkered _E_ There should be no further releases from Gitmo. These are extremely dangerous people and should not be allowed back onto the battlefield. _E_ "You should always feel comfortable bargaining for goods and services. I do it all the time." – Think Like a Billionaire _E_ Reason I canceled my trip to London is that I am not a big fan of the Obama Administration having sold perhaps the best located and finest embassy in London for "peanuts" only to build a new one in an off location for 1.2 billion dollars. Bad deal. Wanted me to cut ribbon NO! _E_ Great Kevin McCarthy drops out of SPEAKER race. We need a really smart and really tough person to take over this very important job! _E_ After Crooked @HillaryClinton allowed ISIS to rise she now claims she'll defeat them? LAUGHABLE! Here's my plan: __HTTP__ _E_ Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion! _E_ Mexican leaders and negotiators are much tougher and smarter than those of the U.S. Mexico is killing us on jobs and trade. WAKE UP! _E_ Fort Hood shooting should be declared a terror attack. Respect the wounded and dead. _E_ Wisdom comes as a result of both experience and knowledge. It's something you can't teach someone else you have to achieve it on your own. 
_E_ My @SquawkCNBC interview discussing Jamie Dimon banking regulations and Mark Zuckerberg's prenuptial __HTTP__ _E_ Trump Tees Up Another 'Hole in One' in Scotland __HTTP__ _E_ In Vegas? Enjoy Thanksgiving in @TrumpLasVegas' DJT lounge where the @nfl games will be playing all day __HTTP__ _E_ Would be really bad if columnist Mike Lupica left the @NYDailyNews. A wonderful and talented guy! _E_ Tune in & join me live in Albany New York! 7pmE start time! I love you New York! #Trump2016 #TrumpTrain __HTTP__ _E_ #ICYMI: I joined #OnTheRecord with @kimguilfoyle on @FoxNews this evening. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Why is someone like George Pataki who did a terrible job as Governor of N.Y. and registers ZERO in the polls allowed on the debate stage? _E_ The West Coast's most luxurious public course @TrumpGolfLA features spectacular panoramic Pacific Ocean views __HTTP__ _E_ Trump Int'l Hotel & Golf Links Ireland (formerly The Lodge at Doonbeg) is a 5 star resort fronting the Atlantic Ocean __HTTP__ _E_ .@BarackObama should be careful questioning @MittRomney on diplomacy how many times has Obama apologized for our country on foreign soil?! _E_ Since the Obama Administration was told way before the 2016 Election that the Russians were meddling why no action? Focus on them not T! _E_ Benghazi is just another Hillary Clinton failure. It justnever seems to work the way it's supposed to with Clinton. _E_ .@LisaLampanelli You are terrific (always). Great job on the Apprentice. _E_ The opening of #TrumpScotland an exciting day on perhaps the world's best golf course watch the video __HTTP__ _E_ NBC terminates The Chris Matthews Show __HTTP__ _E_ "Get some face time in The Spa at @TrumpLasVegas" __HTTP__ via @Vegascom by Renée Libutti _E_ Apprentice = big hit. Miss Universe = Big hit. I always get big ratings. If I hosted Meet the Press instead of Sleepy Eyesa smash! @NBCNews _E_ Record low temperatures and massive amounts of snow. 
Where the hell is GLOBAL WARMING? _E_ I have traveled the world. America is the most beautiful country on Earth. _E_ Our country is being run by total amateurs. Let's just call it "amateur hour." _E_ Media silent when @BarackObama called @MittRomney a murderer & felon. Mitt mentions 'birth certificate' and they go nuts. Double standard! _E_ My @extratv interview before Hurricane Sandy explaining that I would be staying in Trump Tower during the storm __HTTP__ _E_ RT @JacobAWohl: @realDonaldTrump President Trump alone has succeeded in bringing the Stock Market Small Business Index and Consumer Comfor... _E_ The real story here is why are there so many illegal leaks coming out of Washington? Will these leaks be happening as I deal on N.Korea etc? _E_ #USA #Japan __HTTP__ _E_ I will be signing copies of my new book Time To Get Tough: Making America #1 Again in Trump Tower on Frida... (cont) __HTTP__ _E_ Listen – my Citizens United Political Victory Fund robo call for @leezeldin __HTTP__ #zeldinforcongress _E_ Third rate @politico took every negative tweet or response they could find & put it out when in fact the response is incredibly positive. _E_ Thank you @WayneAllynRoot.Very nice! #Trump2016 __HTTP__ _E_ Happy #MedalOfHonorDay to our heroes! __HTTP__ __HTTP__ _E_ Really sad that Republicans would allow themselves to be used in a Clinton ad. Lindsey Graham Romney Flake Sass. SUPREME COURT REMEMBER! _E_ Read my tweets you dopes of course he should get a trial but fast (not a 12 year disaster). _E_ All Presidential candidates should immediately disavow their Super PAC's. They're not only breaking the spirit of the law but the law itself _E_ I am on @oreillyfactor tonight a big special. @FoxNews at 8:00 P.M. ENJOY! _E_ .@VanityFair's terrible piece on Mitt's faith is a new low even for them. _E_ I'll be on Piers Morgan Tonight this evening 9 pm on CNN. Be sure to tune in. 
@PiersTonight _E_ Today is the 53rd anniversary of the March on Washington today we honor the enduring fight for justice equality and opportunity. _E_ Can the relationship between the mayor of New York City and the police force ever be fixed? Tune in to @foxandfriends at 7:15. _E_ Many people advised me not to buy the Miss Universe pageant. They were all wrong. The deal worked out to be a great one! _E_ Photos from the @ApprenticeNBC press conference __HTTP__ Premieres January 4th on @NBC. _E_ Be sure to listen to my interview today w/@SteveMTalk on @Newsmax_Media __HTTP__ Congratulations to Steve on his new show! _E_ If Democrats were not such obstructionists and understood the power of lower taxes we would be able to get many of their ideas into Bill! _E_ The people of Scotland love Trump International Golf Links. _E_ The failure of the Super Committee shows Washington has truly incompetent leaders. #TimeToGetTough _E_ Practice positive thinking—this will keep you focused while weeding out anything that is unnecessary negative or detrimental... _E_ Without passion you don't have energy and without energy you don't have anything! _E_ "Worry destroys focus." – Think Big _E_ Dateline NBC featuring yours truly just set a season high in households in the ratings—no wonder NBC likes me so much! @nbc _E_ A wonderful place. __HTTP__ _E_ .@MarkBurnettTV and his incredible wife @RealRomaDowney did a fabulous movie @SonofGodMovie see it! _E_ People (pundits) gave me no chance in South Carolina. Now it looks like a possible win. I would be happy with a one vote victory! (HOPE) _E_ Wow just won Missouri! _E_ Come join us at the Verizon Wireless Center Manchester New Hampshire on 2/8! Register now: __HTTP__ __HTTP__ _E_ #CrookedHillary __HTTP__ _E_ It seems that Justice Scalia originally wrote the majority on ObamaCare and Roberts then switched his position. 
__HTTP__ _E_ Interesting how President Obama so haltingly said I would never be president This from perhaps the worst president in U.S. history! _E_ Do not settle for remaining in your comfort zone. Being complacent is a good way to get nowhere. Get your momentum going and keep it going. _E_ ...The ads made her look great and now she probably will run. _E_ Your most popular tweet answered why I'm holding off on a Presidential bid... __HTTP__ #trumpvlog _E_ Big crowd expected tomorrow night in Iowa. It will be interesting and fun great people! _E_ Wow the ratings for @60Minutes last night were their biggest in a year very nice! _E_ .@lancearmstrong teammate is angry and jealous he is no Lance. _E_ Another one of me on stage. #WWEHOF __HTTP__ _E_ Watch my appearances on Good Day NY... __HTTP__ and @FoxandFriends... __HTTP__ _E_ I'll be signing copies of my new book Time To Get Tough today at Trump Tower 11 am to 2 pm. Hope to see you there. #TimeToGetTough _E_ It's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_ Does anyone agree with Marilu that Gary while 'adorable' is a distraction? _E_ Based on the fraud committed by Senator Ted Cruz during the Iowa Caucus either a new election should take place or Cruz results nullified. _E_ RT @realDonaldTrump: Happy to announce we are awarding $1M to Las Vegas in order to help local law enforcement working OT to respond to l... _E_ Hamas has warned Pres. Obama not to visit the Temple Mount during his trip to Israel __HTTP__ _E_ Today The Blue Monster is torn up. The Trump National @DoralResort is being revolutionized with $200M of renovations. 
_E_ RT @Newsmax_Media: Donald Trump: Mean Spirited GOP Won't Win Elections @REALDonaldTrump __HTTP__ via @Newsmax_Media _E_ Fallout from Iowa: Trump Speech Drew Greatest Response __HTTP__ via @Newsmax_Media by Jim Meyers __HTTP__ _E_ Fiscal cliff negotiations have officially begun between the President and Congress Washington must come together and make a deal. _E_ Congratulations to the @thenyrangers on taking a 2 1 lead over the @washcaps. Great game last night! _E_ The prestigious 800 acre @TrumpDoral boast luxurious event spaces and 5 Star restaurants __HTTP__ _E_ Realize that persistence can go a long way. Being stubborn is often an attribute. _E_ China is threatening Washington over the currency bill. We should pass it immediately. _E_ I am giving away money. Check the crowdfunding site @fundanything __HTTP__ Raise money for anything! _E_ Edward Snowden is absolutely killing the the U.S. with other countries! _E_ Hillary will never reform Wall Street. She is owned by Wall Street! _E_ Our country is on the precipice. Washington is broken. Where is the leadership? _E_ Thank you @IngrahamAngle for your strength & wonderful words last night on @FoxNews but @KarlRove is easy to beat! _E_ I know it has been many years since our country made great deals but isn't it about time we start right now. MAKE AMERICA GREAT AGAIN! _E_ RT @foxandfriends: Jeb is a weak guy. @EricTrump __HTTP__ _E_ .@Morning_Joe @mikebarnicle on @realDonaldTrump: He finished 2nd but he made the turn successfully like a pro _E_ Sadly it took a hit & run auto accident to make us aware of who our Secretary of Commerce is and such an important position! _E_ My @gretawire interview discussing why @BarackObama is not a nice guy and who will win the 2012 election __HTTP__ _E_ "Do not pray for easy lives. Pray to be stronger men." – Pres. John F. Kennedy _E_ Undecideds in OHPA and WI will make the difference. All should ask themselves if they want $6/gallon gas because it will come under Obama. 
_E_ The meeting next week with China will be a very difficult one in that we can no longer have massive trade deficits... _E_ Shows how weak and desperate Lyin' Ted is when he has to team up with a guy who openly can't stand him and is only 1 win and 38 losses. _E_ Jobless claims have dropped to a 45 year low! _E_ He @MittRomney would do a great job on Saturday Night Live. @nbcsnl _E_ Our economy is better than it has been in many decades. Businesses are coming back to America like never before. Chrysler as an example is leaving Mexico and coming back to the USA. Unemployment is nearing record lows. We are on the right track! _E_ RT @GOP: .@POTUS: I want to work with Congress Republicans and Democrats on a plan that is pro growth pro jobs pro worker and pro Amer... _E_ My thoughts and prayers are with everyone involved in the train accident in DuPont Washington. Thank you to all of our wonderful First Responders who are on the scene. We are currently monitoring here at the White House. _E_ ........may be their number one act and priority. Focus on tax reform healthcare and so many other things of far greater importance! #DTS _E_ Mark They could use you. __HTTP__ _E_ Here we go Enjoy! _E_ "@PGAChampionship @seniorpgachamp both headed to Trump courses" __HTTP__ via @FoxNews _E_ Want to take a quiz with me? Download the @millonseconds app and watch @RyanSeacrest on Monday at 8/7c on @NBC _E_ Doing the @todayshow with @MLauer was great I really like Matt. _E_ It wasn't the White House it wasn't the State Department it wasn't father LaVar's so called people on the ground in China that got his son out of a long term prison sentence IT WAS ME. Too bad! LaVar is just a poor man's version of Don King but without the hair. Just think.. _E_ Thank you for all of the positive response on my Chicago lawsuit victory yesterday. Most of you saw through the phony age card ploy. _E_ Over 150000 more of our fellow Americans dropped out of the workforce in July. 
@BarackObama is a disaster! _E_ RT @EricTrump: Nevada we are on our way! #VoteTrumpNV #Trump2016Caucus locator: __HTTP__ __HTTP__ _E_ .@politico which is not read or respected by many may be the most dishonest of the media outlets and that is saying something. _E_ Bernie Sanders who has lost most of his leverage has totally sold out to Crooked Hillary Clinton. He will endorse her today fans angry! _E_ Bombings all over Iraq today.That country is falling apart such a horrible waste of lives and 1.5 trillion dollars (and I told you so!). _E_ DON'T LET HILLARY CLINTON DO IT AGAIN!#TrumpPence16 __HTTP__ _E_ Excited to have @SarahPalinUSA's endorsement of the Newsmax @iontv debate. Sarah is terrific. _E_ Thanks. __HTTP__ _E_ This has to stop! @BarackObama loves accruing American debt he missed his budget deficit goal by over $500 billion. __HTTP__ _E_ I will be interviewed on @oreillyfactor tonight at 8:00 P.M. (Eastern). Enjoy! _E_ RT @TeamTrump: Obama Clinton FAILED foreign policy: Bad nuclear deal Ransom payment to leading state sponsor of terror Sharing classifie... _E_ We are way over the fiscal cliff. And with Obama Care being fully implemented in less than 14 months it may be too late. _E_ Just leaving Las Vegas. Unbelievable crowd! Many Hispanics who love me and I love them! __HTTP__ _E_ "Revenge is sweet and not fattening." Alfred Hitchcock _E_ ObamaCare will destroy small business the backbone of America's economy. _E_ .@jimmykimmel is terrific but for Obama to fly on Air Force One ($'s) to do the show in these bad times is ridiculous. _E_ Thank you. __HTTP__ _E_ Via @CNNPolitics: Trump will have 'memorable' role at GOP convention __HTTP__ It's true just wait and see... _E_ Thank you Cedar Rapids Iowa!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ "You cannot push anyone up the ladder unless he is willing to climb." Andrew Carnegie _E_ Can you believe that President Karzai of Afghanistan is holding out for more more more and refuses to sign deal. 
Tell him to go to hell! _E_ "Iowans Drawn to Donald Trump Praise His Antiestablishment Bent" __HTTP__ via @WSJ by @heatherhaddon & @reidepstein _E_ What America needs: @MittRomney follows in steps of Kemp and Reagan with pro growth tax cut. _E_ "Do not underestimate yourself and know you are able to handle what comes your way." – Think Like a Champion _E_ RT @realDonaldTrump: The travel ban into the United States should be far larger tougher and more specific but stupidly that would not be... _E_ Via @ChristianPost @NaghmehAbedini to Testify at New Congressional Hearing on Persecution of Pastor Saeed Abedini __HTTP__ _E_ Congrats to @GovernorCorbett he's right to be suing @NCAA over the ridiculous deal made by the trustees of Penn State __HTTP__ _E_ The Iranians have just threatened to send warships to our coasts. They laugh at us. We can't allow them to develop nuclear weapons. _E_ The U.S. rocket that blew up and crashed yesterday is emblematic of the United States under Obama. Nothing works be it a rocket or website. _E_ If you include people who have left the work force unemployment rate is 15%. Labor participation rate is lowest in 70 yrs. _E_ #IndianaJones and #Ghostbusters what's wrong??? __HTTP__ _E_ I along with almost everyone else have so little confidence in President Obama. He has a horrible attitude a man who is resigned to defeat _E_ Thank you Maryland what a great way to conclude the day! Will be back soon. #Trump2016 __HTTP__ __HTTP__ _E_ Santorum calls Trump debate skippers hypocrites __HTTP__ @RickSantorum _E_ Via @Citizens_United: "Donald Trump To Speak At The Iowa Freedom Summit in Des Moines on January 24th" __HTTP__ _E_ I have tremendous respect for women and the many roles they serve that are vital to the fabric of our society and our economy. _E_ RT @JacobAWohl: @realDonaldTrump When Obama was President the #MSM LOVED talking about stock market rallies! Now they barely mention new a... 
_E_ ....Transgender individuals to serve in any capacity in the U.S. Military. Our military must be focused on decisive and overwhelming..... _E_ Thanks. __HTTP__ _E_ Watch this video of my wonderful golf club @TrumpNationalCN in beautiful Colts NeckNJ __HTTP__ _E_ After one of the great chokes in the history of sports it will be hard for the Spurs to beat the Heat but who knows. Good game on now! _E_ I like doing this once a month for the haters & losers (and as they know) I don't wear a wig . Some may not like my hairstyle but all mine _E_ The talk in Albany is that JCOPE & Moreland Commissions are taking my complaint against lightweight (cont) __HTTP__ _E_ The @SuperCommittee will fail. The Republicans never should have agreed to the debt deal. _E_ China's the leading exporter of Iraqi oil yet they won't lift a finger against ISIS. Why should we do the heavy lifting for China's gain? _E_ To each member of the graduating class from the National Academy at Quantico CONGRATULATIONS! __HTTP__ _E_ RT @PaulaReidCBS: .@CBSNews confirms FBI found emails on #AnthonyWeiner computer related to Hillary Clinton server that are new & not p... _E_ "Change can't be measured in speeches. It is measured in achievements." @MittRomney yesterday in Fairfax VA. _E_ Paying attention is a cost effective way of protecting yourself and your interests. _E_ This will be a very interesting day for HealthCare.The Dems are obstructionists but the Republicans can have a great victory for the people! _E_ Why can't the leaders of the Republican Party see that I am bringing in new voters by the millions we are creating a larger stronger party! _E_ #ICYMI on Monday I had the great honor of welcoming India's Prime Minister @narendramodi to the WH. Full Remarks:... __HTTP__ _E_ The debate last night proved that Hillary is running against the "B" team. She won't be so lucky when it comes to me! 
_E_ RT @GOP: On National #VoterRegistrationDay make sure you're registered to vote so we can #MakeAmericaGreatAgain __HTTP__ ht... _E_ Why is @BarackObama constantly issuing executive orders that are major power grabs of authority? This is the latest __HTTP__ _E_ Isn't it intetesting that anybody who attacks President Obama is considered a racist by the real racists out there! _E_ If ObamaCare is not repealed then we can expect stagnant growth long term unemployment and record high premiums. _E_ .@FoxNews owes me an apology for allowing clueless pundit @RichLowry to use such foul language on TV. Unheard of! _E_ Lightweight Senator Kirsten Gillibrand a total flunky for Chuck Schumer and someone who would come to my office "begging" for campaign contributions not so long ago (and would do anything for them) is now in the ring fighting against Trump. Very disloyal to Bill & Crooked USED! _E_ In New York March was the coldest month in recorded history we could use some GLOBAL WARMING! _E_ China will now pass our economy this year way ahead of projections. Pres. Obama – China's greatest asset! _E_ Via @RealClearNews by @rebeccagberg: "Is the White House Big Enough for Donald Trump?" __HTTP__ _E_ It's Jan. 2. President Obama should end his vacation early & get back to Washington to straighten out the ObamaCare catastrophe or end it. _E_ Wow so far everyone running for office who I did a ROBOCALL for has taken the lead in the polls the smart pols know this. GREAT! _E_ Challenges present opportunities. Always keep your focus and stay calm. _E_ Via @Newsmax_Media by @spiccoli: "Donald Trump Taking 'Serious Look' at 2016 Presidential Run" __HTTP__ _E_ In beautiful Miami inspecting the progress of @TrumpDoral's $250 million conversion into the country's #1 resort. _E_ Pathetic: @BarackObama did not want to veto Keystone himself so he lobbied the Democrats in the Senate to defeat it. 
__HTTP__ _E_ Thoughts & prayers are w/ our @USNavy sailors aboard the #USSJohnSMcCain where search & rescue efforts are underway. __HTTP__ _E_ Great news in Georgia! The just out Landmark poll shows me in first with 43%! Wow. __HTTP__ __HTTP__ _E_ Tune in Sunday June 3 to NBC at 9pm ET for the 2012 Miss USA competition coming from Planet Hollywood Resort & Casino in Las Vegas _E_ Thank you Georgia! I appreciate all of your support. #Trump2016 __HTTP__ _E_ Congratulations to Evan Lysacek for being nominated SI sportsman of the year. He's a great guy and he has my vote! #EvanForSI _E_ Mexico was just ranked the second deadliest country in the world after only Syria. Drug trade is largely the cause. We will BUILD THE WALL! _E_ Chris Wallace @fox at 10:00 A.M. _E_ "@limbaugh: 'Trump Has Changed the Entire Debate on Immigration'" __HTTP__ via @Newsmax_Media by Jason Devaney _E_ #TBT Trump and Gekko __HTTP__ _E_ Looking forward to IA & WI with Gov. Pence tomorrow. Join us! #MAGA __HTTP__ __HTTP__ __HTTP__ _E_ Phyllis Schlafly's Eagle Forum: 'National Review Will Be Defunct In The Next Year' __HTTP__ _E_ Priorities while fundraising and campaigning on our dime Obama has skipped over 50% of his intel briefings __HTTP__ _E_ Via @foxnewslatino by @GeraldoRivera: "@ApprenticeNBC Diary: And Now There Are Two" __HTTP__ _E_ Leaving Miami Trump National Doral will be GREAT! _E_ The leader and negotiators representing Mexico are far smarter and more cunning than the leader and negotiators representing the U.S.! _E_ My son @EricTrump has just done another great event and raised a lot of money for @StJude. He is a really good boy who loves helping kids. _E_ Watch the first #TrumpVine re: Anthony Weiner __HTTP__ _E_ Praying for the families of the two Iowa police who were ambushed this morning. An attack on those who keep us safe is an attack on us all. _E_ Huff post gets it wrong re: Ferry Point...the only leakage of gas is from Arianna Huffington. 
_E_ Today it was my privilege to welcome survivors of the #USSArizona to the WH. Remarks: __HTTP__ __HTTP__ _E_ The biggest problem with A Rod is he is bad for the chemistry of the Yankees he must go. _E_ Only in America can a Jihadi thug who murdered women and children be nursed back to health & then get a @RollingStone cover. _E_ Failed show @DannyZuker season 1 of @apprenticenbc had 28 million viewers and 41.5 million watching..... _E_ Obama believes Benghazi is a "phony scandal." Nothing phony about Americans being killed by Islamists. _E_ The global warming we should be worried about is the global warming caused by NUCLEAR WEAPONS in the hands of crazy or incompetent leaders! _E_ Thank you Connecticut! #Trump2016 __HTTP__ _E_ Numerous states are refusing to give information to the very distinguished VOTER FRAUD PANEL. What are they trying to hide? _E_ .@carlosbeltran15 is playing great for St. Louis Cardinals. They made a wise decision. _E_ Crooked Hillary Clinton is bought and paid for by Wall Street lobbyists and special interests. She will sell our country down the tubes! _E_ Uncomfortable looking NBC reporter Willie Geist calls me to ask for favors and then mockingly smiles when he is told of my high poll numbers _E_ "Face reality as it is not as it was or as you wish it to be." @jack_welch _E_ Good news disloyal @Macys stock is in a total free fall. Don't shop there for Christmas! __HTTP__ __HTTP__ _E_ AMERICA'S FUTURE __HTTP__ _E_ James Comey will be replaced by someone who will do a far better job bringing back the spirit and prestige of the FBI. _E_ #GodBlessTheUSA __HTTP__ _E_ ...Hopefully we will never have to use this power but there will never be a time that we are not the most powerful nation in the world! _E_ Rep.Tom Marino has informed me that he is withdrawing his name from consideration as drug czar. Tom is a fine man and a great Congressman! _E_ .@Boeing stock went way down because of 787 so I just bought stock in @Boeing great company! 
_E_ It was my great honor to have lunch with our INCREDIBLE U.S. and ROK troops at Camp Humphreys in South Korea. 🇰 __HTTP__ __HTTP__ _E_ Anybody (especially Fake News media) who thinks that Repeal & Replace of ObamaCare is dead does not know the love and strength in R Party! _E_ I have captured the smell of success. Meet me and the new Success @Macys Herald Square April 18 5:30pm first (cont) __HTTP__ _E_ I was on a tele townhall with @TeamBachmann and hosted her 4 times in Trump Tower yet she declined the Newsmax @iontv debate. No loyalty. _E_ At least 12 dead and 50 wounded in Colorado bring back fast trials & death penalty for mass murderers & terrorists. _E_ Lyin' Ted Cruz and 1 for 38 Kasich are unable to beat me on their own so they have to team up (collusion) in a two on one. Shows weakness! _E_ 1.5M have already lost their health care plans thanks to ObamaCare __HTTP__ Defund now and Repeal later! _E_ The innocent bystanders of American poverty are kids. Yet two thirds of childhood poverty in America is (cont) __HTTP__ _E_ I look forward to reading the @CommerceGov 232 analysis of steel and aluminum to be released in June. Will take major action if necessary. _E_ No I wasn't at the @Yankees game yesterday can't go today either. When I go they win. _E_ In order to #DrainTheSwamp & create a new GOVERNMENT of by & for the PEOPLE I need your VOTE! Go to __HTTP__ LET'S #MAGA! _E_ Congratulations to the Republicans in Congress. You are the only people Obama can out negotiate. #TimeToGetTough _E_ RT @PacificCommand: #USAF B 1B Lancer #bombers on Guam stand ready to fulfill USFK's #FightTonight mission if called upon to do so __HTTP__ _E_ Obama Administration official said they choked when it came to acting on Russian meddling of election. They didn't want to hurt Hillary? _E_ Let us never negotiate out of fear but let us never fear to negotiate. John F. Kennedy Inaugural Address January 1961 _E_ The roads and sidewalks airports and bridges are perfect in Dubai. 
Everything looks clean & strong. In U.S. everything is falling apart! _E_ Me voting it really is my hair! __HTTP__ _E_ Meet me at @TrumpTowerNY and get your copy of my new book CRIPPLED AMERICA signed on 11/3 at 12pm! __HTTP__ _E_ Great move to take A Rod out of game. Now terminate his contract based on misrepresentation (drugs). _E_ I am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ ... ... _E_ Never met but never liked dopey Robert Gates. Look at the mess the U.S. is in. Always speaks badly of his many bosses including Obama. _E_ Can you believe that the disrespect for our Country our Flag our Anthem continues without penalty to the players. The Commissioner has lost control of the hemorrhaging league. Players are the boss! __HTTP__ _E_ "The team with the best players wins." @jack_welch _E_ It takes guts to be a brand. You cannot be all things to all people if you want to be a brand. Midas Touch _E_ The secret of getting ahead is getting started. Mark Twain _E_ Thank you Arizona I love you! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ RT @mike_pence: .@EdWGillespie is fighting to grow the economy & cut taxes! He's fighting for a safer VA. And he's is fighting for affordab... _E_ No wonder @BBC is in such big trouble & boss was just fired they are lost. _E_ Just took a look at Time Magazine looks really flimsy like a free handout at a parking lot! The sad end is coming just like Newsweek! _E_ Why is the United States Post Office which is losing many billions of dollars a year while charging Amazon and others so little to deliver their packages making Amazon richer and the Post Office dumber and poorer? Should be charging MUCH MORE! _E_ 'President Trump Congratulates Exxon Mobil for Job Creating Investment Program' __HTTP__ _E_ The rally in Cincinnati is ON. Media put out false reports that it was cancelled! 
#MakeAmericaGreatAgain #Trump2016 _E_ Just met with the incoming Speaker of the Florida House @SteveCrisafulli – a fantastic guy! He will be a truly great leader. _E_ I endorsed a book on ObamaCare & it just went to #2 on the New York Times bestseller list! _E_ Druggie @AROD is now scheming to sue the @Yankees. He will go down as the biggest sports embarrassment of all time. _E_ We are now sending thousands of additional troops to Iraq to teach them how to fight they will run billions wasted! WHAT DOES U.S. GET? _E_ My @WendyWilliams appearance re Sony Atlantic City @ApprenticeNBC & 2016 __HTTP__ Always love going on Wendy's show! _E_ Thank you to our law enforcement officers! #LESM #Trump2016 __HTTP__ _E_ Heading back to Washington after working hard and watching some of the worst and most dishonest Fake News reporting I have ever seen! _E_ ... but like many other great business people have used the laws to corporate advantage. _E_ Oh the wonders of the Arab Spring. Our new 'ally' the Muslim Brotherhood hosted Ahmadinejad yesterday __HTTP__ No more aid. _E_ Via @MiamiNewTimes by @Munzenrieder : "Doral Mayor Declares Emergency to Give Donald Trump Key to the City" __HTTP__ _E_ Dummy Bill Maher did an advertisement for the failing New York Times where the picture of him is very sad he looks pathetic bloated & gone! _E_ "I'm not afraid of failing. I don't like to fail. I hate to fail. But I'm not afraid of it." @VinceMcMahon _E_ Rosie O'Donnell's show is dead can't keep going for long with such poor ratings. @Rosie is a stone cold (cont) __HTTP__ _E_ There is great unity in my campaign perhaps greater than ever before. I want to thank everyone for your tremendous support. Beat Crooked H! _E_ Mexico is killing the United States economically because their leaders and negotiators are FAR smarter than ours. But nobody beats Trump! _E_ Negotiation tip: Be patient be persistent be stubborn. Know exactly what you want and keep it to yourself. 
_E_ Django Unchained is the most racist movie I have ever seen it sucked! _E_ Thank you America great #CommanderInChiefForum polls! __HTTP__ _E_ .@MarcoRubio is weak on illegal immigration and will allow anyone into the country..... _E_ Thank you Florida we are going to MAKE AMERICA GREAT AGAIN! Join us: __HTTP__ #AmericaFirst __HTTP__ _E_ Problem is that the acting head of the FBI & the person in charge of the Hillary investigation Andrew McCabe got $700000 from H for wife! _E_ ...conquests how brave he was and it was all a lie. He cried like a baby and begged for forgiveness like a child. Now he judges collusion? _E_ Via @rcpvideo: "Donald Trump on Who He Likes For President: Donald Trump" __HTTP__ _E_ U.S. jobless claims are at a 2 month high. __HTTP__ @BarackObama's gas policy and ObamaCare are directly killing jobs. _E_ Great parade in The Villages I love you all. We will #MAGA. Thank you for the incredible support I will not forget! __HTTP__ _E_ It is so sad to see what has happened to Atlantic City. So many bad decisions by the pols over the years airport convention center etc. _E_ Terrible economic numbers released today. US GDP only grew 0.4% during Oct Dec 2012 quarter __HTTP__ Great news for China. _E_ Congrats @Jean_GeorgesNYC for being named the 6th best hotel restaurant in the world! __HTTP__ _E_ A great article by @NolteNCspelling out the truth on Mexico trade the border & illegals. Thank you @BreitbartNews __HTTP__ _E_ Big crowds standing ovations in South Carolina MAKE AMERICA GREAT AGAIN! _E_ 'Donald Trump is already helping the working class' __HTTP__ _E_ RT @mike_pence: Join me in Colorado today! Look forward to seeing you!Denver 2pm __HTTP__ Springs 6pm __HTTP__ _E_ He should be ignored: @RonPaul's foreign policy is a dream come true for our enemies. He has zero chance to beat @BarackObama. _E_ It amazes me that other networks seem to treat me so much better than @FoxNews. I brought them the biggest ratings in history & I get zip! 
_E_ Here we go again with another Clinton scandal and e mails yet (can you believe). Crooked Hillary knew the fix was in B never had a chance! _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ My @seanhannity interview where I discuss @BarackObama's Job Council @RealSheriffJoe's investigation & 2012 election __HTTP__ _E_ Obama is our unlucky President. Everything he touches turns into a mess. Some people just don't have it! _E_ HAPPY NEW YEAR! We are MAKING AMERICA GREAT AGAIN and much faster than anyone thought possible! _E_ You need to overcome the tug of people against you as you reach for high goals. General George S. Patton _E_ .@Morning_Joe can you believe Kasie Hunt's poor and purposely inaccurate reporting on my great night and crowd in Iowa. @politico is a scam! _E_ .@timkaine is the ANTI DEFENSE SENATOR. #VPDebate #BigLeagueTruth __HTTP__ _E_ .@EWErickson ran @RedState into the ground. A change was necessary. Congratulations to @RedState and good luck in the future! _E_ Dwight Howard just signed with Houston. _E_ Doesn't dummy @chucktodd realize that when I considered running for president I filed financial papers showing unbelievable numbers. _E_ Just left West Palm Beach Fire & Rescue #2. Met with great men and women as representatives of those who do so much for all of us. Firefighters paramedics first responders what amazing people they are! _E_ .@TrumpTurnberry's 149 award winning guest rooms offer a perfect blend of Edwardian tradition and timeless design __HTTP__ _E_ Just another desperate move by the man who should have easily beaten Barrack Obama. (2/2) _E_ Marble mouth @tombrokaw asks why do we think to have a successful eveving you have to have Donald Trump as your guest of honor? BORING TOM _E_ Don't believe the main stream (fake news) media.The White House is running VERY WELL. I inherited a MESS and am in the process of fixing it. _E_ Gas prices are soaring. $4.12 in CA. OPEC is laughing at how stupid we are. 
_E_ We just finished shooting a new season of Celebrity Apprentice and happily for all Joan plays my advisor in two episodes. She was great! _E_ Jeb's new slogan Jeb can fix it . I never thought of Jeb as a crook! Stupid message the word fix is not a good one to use in politics! _E_ I am a very calm person but love tweeting about both scum and positive subjects. Whenever I tweet some call it a tirade..totally dishonest! _E_ Thank you Carmel Indiana! Get out & #VoteTrump tomorrow! #INPrimary #MakeAmericaGreatAgain __HTTP__ _E_ Thank you to Brandon Judd of the National Border Patrol Council for his strong statement on @foxandfriends that we very badly NEED THE WALL. Must also end loophole of "catch & release" and clean up the legal and other procedures at the border NOW for Safety & Security reasons. _E_ Great bilateral meeting with President @Alain_Berset of the Swiss Confederation as we continue to strengthen our great friendship. Such an honor to be in Switzerland! #WEF18 __HTTP__ _E_ Gloria Allred is always talking about me. She needs publicity. She is by far a better PR agent than lawyer. _E_ Reporting that Orlando killer shouted Allah hu Akbar! as he slaughtered clubgoers. 2nd man arrested in LA with rifles near Gay parade. _E_ I will be interviewed on @foxandfriends this morning at 7:30. So much to talk about! _E_ .@AP has one of the worst reporters in the business @JeffHorwitz wouldn't know the truth if it hit him in the face. _E_ slaughter you. This is a purely religious threat which turned into reality. Such hatred! When will the U.S. and all countries fight back? _E_ At some point Sgt. Bergdahl will have to explain his capture. In 2009 he simply wandered off his base without a weapon. Many questions! _E_ .@latoyajackson informs @ArsenioHall that @Omarosa is a "conniving witch"—is he surprised? Are we surprised? #CelebApprentice _E_ Via @amspec by Jeffrey Lord: Is Eric Schneiderman a Crook? What a great writer & researcher amazing story. 
__HTTP__ _E_ The Republicans once again hold all the cards with the debt ceiling. They can get everything they want. Focus! _E_ Great advice from my mother: "Trust in God and be true to yourself." – Mary MacLeod Trump _E_ An architectural landmark @TrumpTowerNY offers sweeping panoramic views of Fifth Avenue __HTTP__ _E_ Trump: I Love the Tea Party They Love Me __HTTP__ via @Newsmax_Media (cross posted on @foxnation __HTTP__ _E_ It snowed over 4 inches this past weekend in New York City. It is still October. So much for Global Warming. _E_ I am not available to be in @adamcarolla's new movie #RoadHard.bit.ly/roadhardmovie _E_ Offering top amenities along w/ award winning architectural design @TrumpChicago's condominiums are world class __HTTP__ _E_ Something really bad happened to the @Yankees psyche much like our President! _E_ Thank you @SenJohnMcCain for your kind remarks on the important issue of PTSD and the dishonest media. Great to be in Arizona yesterday! _E_ Thank you Jacob! __HTTP__ _E_ Dummy @Clare_OC @Forbes: Tiny fragrance deal with Parlux means nothing. Still sold at Trump Tower... _E_ #StandForOurAnthem _E_ So many veterans groups are beyond happy with all of the money I raised/gave! It was my great honor they do an amazing job. _E_ Now is the time for the @GOP to be united with the mission of electing @MittRomney this November. Stop with the public divisions. _E_ I really enjoyed the debate last night.Crooked Hillary says she is going to do so many things.Why hasn't she done them in her last 30 years? _E_ Prediction: The disaster known as ObamaCare will only get worse and Republicans will gain far greater power than they have had in years! _E_ FMR PRES of Mexico Vicente Fox horribly used the F word when discussing the wall. He must apologize! If I did that there would be a uproar! _E_ Unbelievable crowd in Dallas! 
__HTTP__ _E_ Bruce Willis wearing my hat on @FallonTonight last Friday __HTTP__ _E_ Via @ BreitbartNews by @BobPriceBBTX: "DONALD TRUMP HEADING TO TEXAS BORDER" __HTTP__ _E_ Being good in business is the most fascinating kind of art. Making money is art & working is art & good business is the best artAndy Warhol _E_ .@oreillyfactor bad and very deceptive journalism. Show must be heading in wrong direction too bad! @SarahPalinUSA _E_ Congratulations! 'First New Coal Mine of Trump Era Opens in Pennsylvania' __HTTP__ _E_ You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_ If you have built castles in the air your work need not be lost that is where they should be. Now put the foundations under them. Thoreau _E_ "Sixteen" @TrumpChicago is winning accolades and is a destination point restaurant—don't miss it! _E_ My beautiful daughter Ivanka just had a healthy baby boy. Jared and Ivanka are very proud! _E_ Wow NATO's top commander just announced that he agrees with me that alliance members must PAY THEIR BILLS. This is a general I will like! _E_ .@WSJ reports that @GOP getting ready to treat me unfairly—big spending planned against me. That wasn't the deal! _E_ "The four page memo released Friday reports the disturbing fact about how the FBI and FISA appear to have been used to influence the 2016 election and its aftermath....The FBI failed to inform the FISA court that the Clinton campaign had funded the dossier....the FBI became.... _E_ In all fairness to Anthony Scaramucci he wanted to endorse me 1st before the Republican Primaries started but didn't think I was running! _E_ AMERICA USED TO BE THE LEADER OF THE WORLD. THANKS TO OBAMA AMERICA ISN'T EVEN LEADING FROM BEHIND. _E_ .@SanDiegoPD Fantastic job on handling the thugs who tried to disrupt our very peaceful and well attended rally. Greatly appreciated! _E_ There has been a systematic targeting of the Tea Party by the Obama administration. 
Now Schneiderman goes after me. No coincidence. _E_ Our country needs a president with great leadership skills and vision not someone like Hillary or Barack neither of which has a clue! _E_ "Think of yourself as a one man army. You're not only the commander in chief you're the soldier as well." – Think Like a Billionaire _E_ I still love Derek he is a winner! _E_ It's true—@dennisrodman gets the comeback of the year award. I didn't like having to fire him. #CelebApprentice _E_ ... Apprentice was #1 among ABC CBS and NBC from 10:30 11 p.m. in all key demos (adults men and women 18 34 18 49 and 25 54) Nielsen. _E_ Join me at 11:00am:Watch here: __HTTP__ __HTTP__ _E_ Thank you for your support & friendship Governor @ChrisChristie!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Obama will quarantine all soldiers returning from Africa for 21 days. But he still allows all who contract Ebola into country? Hypocrite. _E_ 'Over 250000 to Lose Health Insurance in Battleground North Carolina Due to #Obamacare' __HTTP__ _E_ No one has done more for people with disabilities than me. I have spent many millions of dollars to help out and am happy to have done so! _E_ ...but interestingly the same people seem to be lucky. _E_ .@RobinRoberts everyone adores you including me get well fast! _E_ It's Thursday. @billmaher is still a very dumb guy just look at his past. _E_ Congratulations to Doug Jones on a hard fought victory. The write in votes played a very big factor but a win is a win. The people of Alabama are great and the Republicans will have another shot at this seat in a very short period of time. It never ends! _E_ He @BarackObama will lose a delegate in Oklahoma he only got 57% of the vote in the Democrat primary __HTTP__ _E_ Join me in Fayetteville North Carolina tomorrow evening at 6pm. 
Tickets now available at: __HTTP__ _E_ If Graydon Carter's very dumb bosses would fire him for his terrible circulation numbers at failing Vanity Fair his bad food restaurants die _E_ Will be going to North Dakota today to discuss tax reform and tax cuts. We are the highest taxed nation in the world that will change. _E_ John Podesta says nominee will be Cruz b/c last person Hillary wants to face is Trump! Use your head folks! 46 41! __HTTP__ _E_ Don't forget to tune in to the Celebrity Apprentice this Sunday night 9 pm on NBC. The fireworks continue.... __HTTP__ _E_ So many major problems for the U.S. and no answers by our leaders. When will it all change? Many of our difficulties are so easy to solve! _E_ Now that I started my war on illegal immigration and securing the border most other candidates are finally speaking up. Just politicians! _E_ Just returned from Europe. Trip was a great success for America. Hard work but big results! _E_ Iran just test fired a Ballistic Missile capable of reaching Israel.They are also working with North Korea.Not much of an agreement we have! _E_ Congratulations @Jean_GeorgesNYC for 10 years of 3 #MichelinStars! Visit the restaurant in @TrumpNewYork for a meal you'll never forget. _E_ I will be using Facebook and Twitter to expose dishonest lightweight Senator Marco Rubio. A record no show in Senate he is scamming Florida _E_ Who are your favorites on Team Power? Team Plan B? #CelebApprentice _E_ Mr. Pesident @BarackObama you cannot attack free enterprise and expect to have a healthy economy! _E_ When will the Democrats and Hillary in particular say we must build a wall a great wall and Mexico is going to pay for it? Never! _E_ Join me Tuesday Nov. 3rd at 12pm in Trump Tower NYC. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_ Big time in U.S. today MAKE AMERICA GREAT AGAIN! Politicians are all talk and no action they can never bring us back. 
_E_ Just spent two days in Ireland at Trump International Golf Links & Hotel absolutely magnificent. __HTTP__ _E_ Both being optimistic and remembering the big picture have served me well throughout my life. You need to stay positive. _E_ RT @Harlan: Watching MSM you would have no idea @realDonaldTrump clearly unambiguously & repeatedly condemned the bigotry & violence in Ch... _E_ .@ErinBurnett's @OutFrontCNN ratings are so pathetic she even loses to @hardball_chris at 7PM which is replay of 5PM __HTTP__ _E_ Four more days until the Miss Universe Pageant. Be sure to tune in on Monday night at 9 p.m. on NBC it will be an amazing show. _E_ To aspiring entrepreneurs always remember that if your enemies aren't talking about you then you aren't doing well...and must work harder! _E_ Michele Bachmann will finish dead last tonight in Iowa because she is disloyal and a terrible boss. Sadly it is over for Michele. _E_ Once again @LilJon has competed at a very high level on Celebrity All Star @ApprenticeNBC. He is a great competitor. _E_ While I am in OH & PA you can also join @Mike_Pence in Nevada on Mon!Carson City: __HTTP__ __HTTP__ _E_ Great! __HTTP__ _E_ RT @IBDeditorials: Was Barack Obama A Foreign Exchange Student? __HTTP__ _E_ NATO commander agrees members should pay up via @dcexaminer: __HTTP__ _E_ Doesn't the US have better things to do than to destroy an American hero for the world to see? Now other (cont) __HTTP__ _E_ Join me live at the @WhiteHouse. __HTTP__ __HTTP__ _E_ Via @TVGrapevine: "@ApprenticeNBC: Premieres Sunday January 4 2015" __HTTP__ _E_ With Obama and Bernanke destroying the value of the dollar gold and real estate should continue to rise in value. _E_ China's best friend @BarackObama wants to cut the US fleet down to 230 ships the lowest level since WWI. __HTTP__ _E_ They let Crooked & the Gang off the hook for the crime but it looks like the cover up is just as bad. Unbelievable! 
__HTTP__ _E_ The Federal Government is teaching citizens 'Financial Literacy' while it is running $16T in debt __HTTP__ Only in America! _E_ Entrepreneurs: Paying attention can be a cost effective way of protecting yourself. _E_ RT @RealBenCarson: I endorse @realDonaldTrump. It's time to unite behind the candidate who will beat Hillary Clinton and return government ... _E_ Great to hear that our loyal @CelebApprentice fans are happy with today's announcement of the new cast. This will be something special! _E_ We can't even stop the Norks from blasting a missile. China is laughing at us. It is really sad. _E_ Thanks for all of the accolades on my speech today it's all about the truth! _E_ Change has to come from outside our very broken system. #MAGA __HTTP__ _E_ Thank you to our amazing law enforcement officers! #AmericaFirst __HTTP__ _E_ Join me in North Carolina tomorrow at 7:30pm! #ImWithYou Tickets: __HTTP__ _E_ The truth is that we could have much better healthcare in our country at a much more affordable price everyone in U.S. would benefit! _E_ Uh oh... @OMAROSA & @piersmorgan once again reunite in the Board Room in next week's 'All Star' @ApprenticeNBC. Fireworks! _E_ 6 @TrumpCollection hotels made @CNTraveler reader's choice! @TrumpNewYork @TrumpSoHo @TrumpChicago @TrumpToronto @TrumpPanama @Trump_Ireland _E_ .@alexsalmond RT @King_Pepp Driving through Indiana and seeing tons of ugly windmills. Now I know what @realDonaldTrump is talking about _E_ My @gretawire interview discussing my $5M charitable offer to Obama his lack of transparency & my tremendous support __HTTP__ _E_ EARLY VOTING: MN & IA already underway more states coming up in the next week: OH ME AZ IN — check w/local officials for details & VOTE! _E_ So many signs that the Florida shooter was mentally disturbed even expelled from school for bad and erratic behavior. Neighbors and classmates knew he was a big problem. Must always report such instances to authorities again and again! 
_E_ My comment last March "Anthony Weiner is a sick pervert you think he will change? He will never change." __HTTP__ _E_ Again the story that there was collusion between the Russians & Trump campaign was fabricated by Dems as an excuse for losing the election. _E_ Via @examinercom by @Mellyora13: "Trump: Was Benghazi the result of incompetence or something more sinister?" __HTTP__ _E_ With the economy still on a downward trajectory the best investment young people can make now is buying property... _E_ Action speaks louder than words but not nearly as often. Mark Twain _E_ ...contributions. The RNC is taking in far more $'s than the Dems and much of it by my wonderful small donors. I am working hard for them! _E_ Now that the three basketball players are out of China and saved from years in jail LaVar Ball the father of LiAngelo is unaccepting of what I did for his son and that shoplifting is no big deal. I should have left them in jail! _E_ Getting ready to make a major speech to the National Assembly here in South Korea then will be headed to China where I very much look forward to meeting with President Xi who is just off his great political victory. _E_ My @BreitbartNews' @biggovt editorial: "'A COUNTRY THAT CANNOT PROTECT ITS BORDERS WILL NOT LAST" __HTTP__ _E_ Snowing in Texas and Louisiana record setting freezing temperatures throughout the country and beyond. Global warming is an expensive hoax! _E_ An old picture with Nancy and Ronald Reagan. __HTTP__ _E_ Dopey Sugar.@Lord_Sugar ...Your net worth doesn't even qualify you to host the Apprentice. Keep making me money. _E_ Obama and the Democrats have no respect for WWII vets trying to get into the memorial. _E_ Last night was the first time Obama said we instead of I in respect to Bin Laden's killing. _E_ 'Hillary's Two Official Favors To Morocco Resulted In $28 Million For Clinton Foundation' #DrainTheSwamp __HTTP__ _E_ .@Andre_Reed83 Thanks for your nice words. You are a real champion. I'm pushing! 
_E_ Why does @CNN & @andersoncooper waste airtime by putting failed campaign strategist Stuart Stevens who lost BIG for Romney on the show? _E_ This will be one of the biggest and most beautiful Miss Universe events ever. _E_ If Senate Republicans don't get rid of the Filibuster Rule and go to a 51% majority few bills will be passed. 8 Dems control the Senate! _E_ The Penn State Board should resign based on the grossly incompetent way they handled the NCAA. They gave away (cont) __HTTP__ _E_ #TedCruz eligibility to be President not settled law says Cruz' Constitutional Law Professor #LaurenceTribe __HTTP__ _E_ I've just done a major Dateline for NBC March 3rd just ahead of Apprentice. _E_ Why Franklin Graham says Donald Trump is right about stopping Muslim immigration __HTTP__ _E_ Thank you @ScottWalker! #AmericaFirst #RNCinCLE __HTTP__ _E_ I had a lot of fun answering your questions in the latest round of #AskTheDonald. See if your question made it __HTTP__ _E_ .@marklevinshow has been saying very nice things about me on his show recently. He has a fantastic radio show that I always enjoy! _E_ I look so forward to debating Crooked Hillary Clinton! Democrat Primaries are rigged e mail investigation is rigged so time to get it on! _E_ John McCain never had any intention of voting for this Bill which his Governor loves. He campaigned on Repeal & Replace. Let Arizona down! _E_ Thank you St. Louis Missouri! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ The Senate must go to a 51 vote majority instead of current 60 votes. Even parts of full Repeal need 60. 8 Dems control Senate. Crazy! _E_ Eli Manning. Great Athlete. Great Guy. @NYGiants great teamwork! _E_ No @DannyZuker just the opposite lots of money can go to charity if you have the guts to play the game (deal)! _E_ Anybody whose mind SHORT CIRCUITS is not fit to be our president! Look up the word BRAINWASHED. 
_E_ Via @TheTodaysGolfer "Trump @TurnberryBuzz transformation on course" __HTTP__ _E_ Thank you New Jersey! #Trump2016 __HTTP__ _E_ Derek Jeter had a great career until 3 days ago when he sold his apartment at Trump World Tower I told him not to sell karma? _E_ 3 Chief of Staffs in less than 3 years of being President: Part of the reason why @BarackObama can't manage to pass his agenda. _E_ RT @FoxBusiness: .@charliekirk11: What this president has done is truly historic and if a Democrat president achieved 1/10th of what @POT... _E_ When renovations are completed Trump National Doral will be the finest resort in the U.S. _E_ It wasn't Donald Trump that divided this country this country has been divided for a long time! Stated today by Reverend Franklin Graham. _E_ I agree with Pres. Obama on Afghanistan. We should have a speedy withdrawal. Why should we keep wasting our money rebuild the U.S.! _E_ Who would you rather have negotiating with Iran President Obama or Toronto Mayor Ford? My money is on Ford. _E_ People are really liking the new ties and shirts @Macy's they are amazing and selling great! _E_ We need a president who knows how to get things done who can keep America strong safe and free and who can (cont) __HTTP__ _E_ .@DannyZuker You're starting up again because people have forgotten you. You wouldn't take my bet but it's (cont) __HTTP__ _E_ Many people are now saying that this is the worst storm/hurricane they have ever seen. Good news is that we have great talent on the ground. _E_ Just left Florida for D.C. The people and spirit in THAT GREAT STATE is unbelievable. Damage horrific but will be better than ever! _E_ "I also protect myself by being flexible. I never get too attached to one deal or one approach." – The Art of The Deal _E_ The interview with Oprah will cause Lance Armstrong huge legal and financial problems sometimes it is better to go into a corner and hide. _E_ Great news as a result of our TAX CUTS & JOBS ACT! 
__HTTP__ _E_ It's time for @PeteRose_14 to enter @MLB's @BaseballHall. All time hits leader has paid the price. _E_ My son @EricTrump and @LaraLeaYunaska just announced their engagement. Great news! A wonderful couple! _E_ Donald Trump Returns For 'All Star Celebrity Apprentice' __HTTP__ via @HuffPostTV _E_ New poll by ABC News/Washington Post TRUMP 32 CARSON 22 RUBIO 10 BUSH 7 Wow how will the media put a negative spin on this one? _E_ Where is the main stream media reporting on Univision's new expose of Fast and Furious? Too busy looking at Mitt's taxes? _E_ .@yankees are privately ecstatic over A Rod's latest doping bust. The evidence is damning __HTTP__ @yankees don't want him. _E_ Everyone is telling me that @EliotSpitzer is going to run against lightweight @AGSchneiderman Spitzer would win! _E_ THANK YOU ILLINOIS! Let's not forget to get family & friends out to VOTE IN 2016! __HTTP__ __HTTP__ _E_ So to all Americans in every city near and far small and large from mountain to mountain... __HTTP__ _E_ Barack Obama used to mock Bush's 300K monthly job reports __HTTP__ Now Obama wishes he could have a month half as good. _E_ When an employee leaves me and begs to come back I never let them. Loyalty is very important. _E_ Great win in Kansas last night for Ron Estes easily winning the Congressional race against the Dems who spent heavily & predicted victory! _E_ RT @WhiteHouse: The current tax code is a burden on American taxpayers and harmful to American job creators. Learn more: __HTTP__ _E_ I can confirm the reports @BillRancic my first season winner will be returning to this All Star season of @CelebApprentice. _E_ Iran's attack on Israeli diplomats is an attack on the West _E_ RT @SpoxDHS: Schumer Rounds Collins destroys the ability of @DHSgov to enforce immigration laws creating a mass amnesty for over 10 millio... 
_E_ Rising over Bay Street @TrumpTO brings opulent luxury along with our famous world class amenities to the Queen City __HTTP__ _E_ The Irish government is too smart to destroy their beautiful coastline w/ bird killing ugly wind turbines. @AlexSalmond @AberdeenCC _E_ Thank you so much. Earnest must have been a great person. __HTTP__ __HTTP__ _E_ The city of Buffalo is struggling. Moving the @buffalobills would be catastrophic. The Bills belong in Buffalo! _E_ Immigration reform really changes the voting scales for the Republicans—for the worse! _E_ Texas is heeling fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow! _E_ I am in Trump International Hotel Las Vegas getting ready and waiting for the debate tonight. Look forward hope I get treated fairly! _E_ So great that John McCain is coming back to vote. Brave American hero! Thank you John. _E_ President Obama just told President Putin how important the Russian air strikes against ISIS have been. I TOLD YOU SO! _E_ We don't want to have a recount in any of the battleground states. Obama will steal it. Make sure all your friends and family vote. _E_ Class of 2013. #WWEHOF __HTTP__ _E_ I would like to congratulate @SenateMajLdr on having done a fantastic job both strategically & politically on the passing in the Senate of the MASSIVE TAX CUT & Reform Bill. I could have not asked for a better or more talented partner. Our team will go onto many more VICTORIES! _E_ The global warming scientists don't want to be airlifted off the ship they are having too much fun and that is too simple a solution FAME! _E_ "The unemployment rate remains at a 17 year low of 4.1%. The unemployment rate in manufacturing dropped to 2.6% the lowest ever recorded. The unemployment rate among Hispanics dropped to 4.7% the lowest ever recorded..."@SecretaryAcosta @USDOL __HTTP__ _E_ Mr. Trump removing the broken teleprompter in North Carolina in front of a massive crowd. 
He goes on&delivers the b... __HTTP__ _E_ Next week the Senate is going to vote on legislation to save Americans from the ObamaCare DISASTER. #WeeklyAddress __HTTP__ _E_ The off shore Aberdeen wind farm site is "experimental" & has no track record delivering energy. __HTTP__ @guardian _E_ Via @Golfmagic: Golden Bear and American business tycoon finish their unlikely masterpiece __HTTP__ _E_ The greatest influence over our election was the Fake News Media screaming for Crooked Hillary Clinton. Next she was a bad candidate! _E_ Remember that Carson Bush and Rubio are VERY weak on illegal immigration. They will do NOTHING to stop it. Our country will be overrun! _E_ Hillary's refusal to mention Radical Islam as she pushes a 550% increase in refugees is more proof that she is unfit to lead the country. _E_ Obama promised 5.2% unemployment by October 2012. His promises are worthless! _E_ ...case against him & now wants to clear his name by showing the false or misleading testimony by James Comey John Brennan... Witch Hunt! _E_ How did the NCAA which is weak and becoming irrelevant extract such a big & reputation shattering settlement from Penn State. Others zero! _E_ Via @CBSLA: Donald Trump Fights To Keep Large American Flag Flying At Southland Golf Course __HTTP__ _E_ Why does the failing @WSJ write a false editorial about me and let dummy @KarlRove make the same mistake in the same edition of the paper? _E_ Watching Gates on @seanhannity looks like he got hit by a truck! Why didn't Obama get him and othersto sign a confidentiality agreement? _E_ Congrats @adamcarolla on #RoadHard raising $1M on @fundanything a record. _E_ Great poll out of Illinois! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ What a series the @nyrangers @NHLDevils is turning out to be! Tonight's game should be another close one. _E_ RT @EricTrump: Join @TeamTrump on Saturday for National Day of Action as we work to #MakeAmericaGreatAgain! 
__HTTP__ __HTTP__ _E_ Ted Cruz does not have the right temperment to be President. Look at the way he totally panicked in firing his director of comm. BAD! _E_ Join me on Tuesday in Greensboro North Carolina! #Trump2016 #AmericaFirst __HTTP__ _E_ Certain Internet sites are like a bad epidemic that won't go away others are terrific _E_ Make sure you get out and vote...most important election of our generation...go Romney! _E_ Via @MiamiHerald by Bill Van Smith: @jacknicklaus reminisces amid honor at @TrumpDoral __HTTP__ _E_ The economy will come back but it will not be the same economy. The old economy of the Industrial Age is (cont) __HTTP__ _E_ Be sure to look for my beautiful wife Melania Trump tonight on QVC at 9 pm ET where she will be debuting her fantastic jewelry collection. _E_ I would have millions of votes more than Hillary except for the fact that I had 17 opponents and she just had a socialist named Bernie! _E_ Robert Pattinson is putting on a good face for the release of Twilight. He took my advice on Kristen Stewart...I hope! _E_ .@Joan_Rivers —I know you're watching what did you think of your impersonator? _E_ You can watch all the highlights of last night's record 14th season premiere of @ApprenticeNBC __HTTP__ _E_ "One thing I've learned about the press is they're always hungry for a good story the more sensational the better." Art of the Deal _E_ JOIN ME! #MAGATODAY:Springfield OH Toledo OH Geneva OH FRIDAY:Manchester NH Lisbon ME Cedar Rapids IA __HTTP__ _E_ Last Thursday Obama said investing in infrastructure would improve our economy for the long term The next day he again stopped Keystone _E_ Look at the solution not the problem. Learn to focus on what will give results. _E_ We left Iraq and it is quickly falling apart what a waste of lives and money and so obvious. _E_ . #JoeTheismann was great as a political analyst on @FoxNews. He knows far more than football. Thanks for the nice words Joe! _E_ Join me live at 9:00 P.M. 
#JointAddress __HTTP__ __HTTP__ _E_ Via @AmSpec by Jeffrey Lord: "Donald Trump Takes Ice Bucket Challenge – Dares Obama" __HTTP__ _E_ "Success is having to worry about every damn thing in the world except money." Johnny Cash _E_ You can change your vote in six states. So now that you see that Hillary was a big mistake change your vote to MAKE AMERICA GREAT AGAIN! _E_ My @FoxBusiness interview on @Varneyco discussing @BarackObama's dirty tactics & how @MittRomney should respond __HTTP__ _E_ Via @NYDailyNews by Eugene Dunn: "Trump the Nation's Great Hope" __HTTP__ _E_ The military threat from China is gigantic and it's no surprise that the Communist Chinese government lies (cont) __HTTP__ _E_ We have all the cards. Now is the time to make a great deal with Iran. _E_ White House Press Sec. had a hard time explaining why @BarackObama supported tax breaks for oil companies in (cont) __HTTP__ _E_ .@JohnLegere T Mobile service is terrible! Why can't you do something to improve it for your customers. I don't want it in my buildings. _E_ Go to @greta show will be talking about OPO and plenty else ENJOY! _E_ .@THEGaryBusey feels he's been abandoned by his team. Do you think so? #CelebApprentice _E_ With the signature services of Trump Attaché @TrumpWaikiki brings premiere luxury to the white sands of Waikiki __HTTP__ _E_ We as a country either have borders or we don't. IF WE DON'T HAVE BORDERS WE DON'T HAVE A COUNTRY! _E_ Hitting at home. Democrat Sen. Joe Donnelly's son had his healthcare plan dropped __HTTP__ _E_ The ObamaCare websites have cost over $5B & many still do not work __HTTP__ One of the greatest fiascos in modern history! _E_ #CelebApprentice @apprenticenbc returns tonight at 9/8c on NBC __HTTP__ _E_ We have to get tough with China before they destroy us. _E_ "The great question is not whether you have failed but whether you are content with failure." Laurence J. Peter _E_ Believe you can and you're halfway there. 
Theodore Roosevelt _E_ Lyin' Ted Cruz lost all five races on Tuesday and he was just given the jinx a Lindsey Graham endorsement. Also backed Jeb. Lindsey got 0! _E_ Is this what we want for a President? __HTTP__ _E_ WHY CAN'T THE MEDIA TELL THE TRUTH WE WOULD ALL BE SO MUCH BETTER OFF! _E_ Ted Cruz is lying again. Polls are showing that I do beat Hillary Clinton head to head. Check out __HTTP__ Poll snd Q Poll. _E_ Just been informed by @nbc they want to extend the run of the @ApprenticeNBC by two shows because it is doing so well. Two hours live. _E_ One of the most effective press conferences I've ever seen! says Rush Limbaugh. Many agree.Yet FAKE MEDIA calls it differently! Dishonest _E_ China 'scorns' US cyber espionage charges China does not respect us __HTTP__ and feels Obama is a dummy _E_ Jobs are returning illegal immigration is plummeting law order and justice are being restored. We are truly making America great again! _E_ Kathy Griffin should be ashamed of herself. My children especially my 11 year old son Barron are having a hard time with this. Sick! _E_ Thank you @Heritage! This is our once in a generation opportunity to revitalize our economy revive our industry & renew the AMERICAN DREAM! __HTTP__ _E_ Don't forget to watch me tonight on Late Night with Jimmy Fallon 12:35 a.m. on NBC. I'll be making a big announcement! _E_ A coincidence that the NSA leaker is living openly in Hong Kong?! At the same time the Chinese Pres. met with Obama in CA. _E_ The dying @NRO National Review has totally given up the fight against Barrack Obama. They have been losing for years. I will beat Hillary! _E_ Steven Tyler got more publicity on his song request than he's gotten in ten years. Good for him! _E_ Via @politico by "Poll: Trump has twice the support of Bush in New Hampshire" __HTTP__ _E_ The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non competitive. 
_E_ Today I will be rallying with with 15000 patriots in Arizona for border security! Let's Make America Great Again! __HTTP__ _E_ Achievers move forward at all times. Achievement is not a plateau it's a beginning. Don't waste time treading water. _E_ Iranian officials say that the WH is misleading public about the details of an interim nuclear agreement __HTTP__ _E_ I wonder what @JoeBiden was thinking last night as @PaulRyanVP delivered that knockout speech. Joe should call in sick for the VP debate. _E_ A dishonest slob of a reporter who doesn't understand my sarcasm when talking about him or his wife wrote a foolish & boring Trump hit _E_ Congratulations Stephen Miller on representing me this morning on the various Sunday morning shows. Great job! _E_ Corporations have NEVER made as much money as they are making now. Thank you Stuart Varney @foxandfriends Jobs are starting to roarwatch! _E_ Via @ShinySheet by @soapbox1: "Show jumping grand prix returns to Mar a Lago Sunday __HTTP__ _E_ Join us tomorrow in Scranton Pennsylvania at 3pm!#TrumpPence16 #MAGA Tickets: __HTTP__ __HTTP__ _E_ The Palestinian terror attack today reminds the world of the grievous perils facing Israeli citizens....continued: __HTTP__ _E_ I told you that the Giants starting Hudson was a mistake. Just got knocked out of the game. I love being right! _E_ Just read about my friend @HulkHogan he was set up too bad he has to use the court system instead of his muscles. _E_ Will miss @RealBenCarson tonight at the #GOPDebate. I hope all of Ben's followers will join the #TrumpTrain. We will never forget. _E_ During @BarackObama's presidency median family income has fallen 4.8% __HTTP__ Terrible for the middle class. _E_ Wishing all of those celebrating #Hanukkah around the world a happy and healthy eight nights in the company of those they love. __HTTP__ __HTTP__ _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ Wow USA Today did todays cover story on my record in lawsuits. Verdict: 450 wins 38 losses. 
Isn't that what you want for your president? _E_ Michael Morell the lightweight former Acting Director of C.I.A. and a man who has made serious bad calls is a total Clinton flunky! _E_ The @washingtonpost report on potential VP candidates is wrong. Marco Rubio and most others mentioned are NOT under consideration. _E_ Cruz just lied again I am and have been totally against #ObamaCare repeal and replace! _E_ Trump National Golf Club Washington D.C. is on 600 beautiful acres fronting the Potomac River. A fantastic setting! __HTTP__ _E_ Great boardroom. #CelebApprentice _E_ Phil Mickelson's final 66 round in @The_Open was amazing. Congrats on his well deserved win. Amazing competitor & a great guy! _E_ In more and more places throughout this region citizens of SOVEREIGN and INDEPENDENT nations have taken greater control of their destinies and unlocked the potential of their people. #APEC2017 __HTTP__ _E_ This election is being rigged by the media pushing false and unsubstantiated charges and outright lies in order to elect Crooked Hillary! _E_ Donald Trump helped expose the silliness of the move by offering to pay for the White House tours. __HTTP__ _E_ My @gretawire interview discussing why the sequestration cuts are necessary our $17T national debt & 2016 election __HTTP__ _E_ A market is never saturated with a good product but it is very quickly saturated with a bad one. Henry Ford _E_ I have just lost my beautiful & elegant long time exec. assistant Norma Foerderer. She passed away yesterday – a truly magnificent woman. _E_ via WSJ. Wake up @AlexSalmond before you destroy Scotland. @David_Cameron @AberdeenCC @pressjournal __HTTP__ _E_ Set high standards and meet them. The proof is in the doing: learn by doing and taking risks. _E_ In Nov. '11 Al Qaeda's flag flew over the 'birthplace' of Libya's revolution __HTTP__ In Sept. '12 it flew over our Embassy. _E_ Hillary Clinton does not have the STRENGTH or STAMINA to be President. 
We need strong and super smart for our next leader or trouble! _E_ Based on the incredibly inaccurate coverage and reporting of the record setting Trump campaign we are hereby: __HTTP__ _E_ Obama's '07 speech which @DailyCaller just released not only shows that Obama is a racist but also how the press always covers for him. _E_ Never allow your attitude to be a liability. Be positive and strong. Set your mind on winning and keep it there. _E_ Thank you Iowa see you soon!#Trump2016 #ImWithYou __HTTP__ __HTTP__ _E_ I loved beating these two terrible human beings. I would never recommend that anyone use her lawyer he is a total loser! _E_ Will be meeting at 9:00 with top automobile executives concerning jobs in America. I want new plants to be built here for cars sold here! _E_ Why are people upset w/ me over Pres Obama's birth certificate?I got him to release it or whatever it was when nobody else could! _E_ .@FloydMayweather Good luck tonight Floyd. _E_ Fifth Avenue's most iconic building @TrumpTowerNY features Trump Grill nestled in the corner of the Atrium __HTTP__ _E_ The G 20 Summit was a great success for the U.S. Explained that the U.S. must fix the many bad trade deals it has made. Will get done! _E_ China's Olympic training program is abusive __HTTP__ It is modern day slavery & shameful. Their (cont) __HTTP__ _E_ We need a president who is smart and tough enough to recognize the national security threat China poses in the (cont) __HTTP__ _E_ Less than two weeks until @WWE's @WrestleMania XXIX. @TheRock v. @JohnCena willbe epic! Excited to be inducted into the Hall of Fame. _E_ .@Newsmax by @melaniebatley: Donald Trump Tells Why He's Eyeing the White House.I'll Tell You Why He Could Win. __HTTP__ _E_ Don't forget to enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_ "If winning isn't everything why do they keep score?" 
Vince Lombardi _E_ .@weeklystandard I know your business is failing but you should try to get writers far better than @stephenfhayes. _E_ It's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_ Senator Luther Strange has gone up a lot in the polls since I endorsed him a month ago. Now a close runoff. He will be great in D.C. _E_ I still can't believe we didn't t take the oil from Iraq. _E_ For reasons only they can explain the @USChamber wants to continue our bad trade deals rather than renegotiating and making them better. _E_ "Remember to keep going: if you stop your momentum will stop." – Think Big _E_ Could be a fight over red heads with @lisalampanelli—this could be good. #sweepstweet _E_ "Statement from President Donald J. Trump on #GivingTuesday" __HTTP__ _E_ Even if @BarackObama stays in DC taxpayers will pay millions for his Hawaii vacation when Americans are struggling __HTTP__ _E_ He who demands little gets it. Ellen Glasgow _E_ Via @MiamiHerald by Hannah Sampson: "BLT Prime coming to Trump's Doral resort" __HTTP__ _E_ .@TIME Magazine should definitely pick David Pecker to run things over there he'd make it exciting and win awards! _E_ The only thing more boring than @bwilliams newscast is his show Rock Center which is totally dying in the ratings—a disaster! _E_ The secret of success in life is for a man to be ready for his opportunity when it comes. Benjamin Disraeli _E_ I will be interviewed by @ericbolling tonight at 8pm on the @oreillyfactor. Enjoy! _E_ Time magazine should name David Pecker of American Media to be its top guy...but they are not smart enough to do that! _E_ "Remember that fear can be conquered. Go full throttle and the odds will be on your side." – Trump Never Give Up _E_ A special message for Martin Bashir __HTTP__ _E_ Jeff Zucker failed @NBC and he is now failing @CNN. _E_ Obama is in Texas but will not be visiting the border. He is too busy fundraising! 
_E_ This new Russian strategy guarantees victory for the Syrian government and makes Obama and U.S. look hopelessly bad. President in trouble! _E_ .@CarlyFiorina I only said I was on @60Minutes four weeks ago with Putin—never said I was in Green Room. Separate pieces—great ratings! _E_ Thank you Tennessee! #MAGA __HTTP__ _E_ Mattis Says Trump's Warning Stopped Chemical Weapons Attack In Syria __HTTP__ _E_ Your work will never be in vain if you work for a cause that is greater than yourself. _E_ Wow Corey Lewandowski my campaign manager and a very decent man was just charged with assaulting a reporter. Look at tapes nothing there! _E_ Last week to enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_ Great @FOXSports art. by @jillpainter on Doc River's annual golf charity event @TrumpGolfLA. Doc is a great friend! __HTTP__ _E_ __HTTP__ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ He who knows when he can fight and when he cannot will be victorious. Sun Tzu _E_ The reason great dealmakers do not OPENLY celebrate a deal especially one that is not complete is that it shows weakness to the other side _E_ The Ground Zero Mosque should not go up where planned. It is wrong. My offer still stands to buy the property. Good deal for everyone. _E_ #TBT Saturday Night Live __HTTP__ _E_ The United Nations Security Council just voted 15 0 to sanction North Korea. China and Russia voted with us. Very big financial impact! _E_ Initial reports say 2nd debate viewership dropped. See what happens when I am not mentioned. _E_ W/ spectacular panoramic Pacific Ocean views @TrumpGolfLA is the top luxury public golf course in the country __HTTP__ _E_ "Get in. Get it done. Get it done right. Get out." – Fred C. Trump (My father!) _E_ Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_ When I renovated Wollman Rink in Central Park it came in $750000 under budget.. 
_E_ I win a state in votes and then get non representative delegates because they are offered all sorts of goodies by Cruz campaign. Bad system! _E_ An ad hoc interview I filmed with a German journalist at Ground Zero hours after the attack __HTTP__ _E_ Hillary Clinton's open borders are tearing American families apart. I am going to make our country Safe Again for all Americans. #Imwithyou _E_ Just landed in Baton Rouge Louisiana. Reports are out that lines are three quarters of a mile to get in. Wow! #MakeAmericaGreatAgain _E_ .@DannyZuker I'm in front of the camera and behind the camera just looked at your picture you'll never be in front of the camera! _E_ I always enjoy being interivewed on @WOR710 by John Gambling. My father Fred used to listen to his father's show. _E_ Director Clapper reiterated what everybody including the fake media already knows there is no evidence of collusion w/ Russia and Trump. _E_ Things happen that make you question whether you should keep going. As long as you are enjoying what you are doing keep going. _E_ My interview last night with Greta on the GOP going El Foldo __HTTP__ _E_ Thanks Larry. Best wishes. __HTTP__ _E_ After witnessing first hand the horror & devastation caused by Hurricane Harveymy heart goes out even more so to the great people of Texas! _E_ Yesterday the Christmas tree arrived at Rockefeller Plaza. An iconic event for New York! _E_ The President of Taiwan CALLED ME today to wish me congratulations on winning the Presidency. Thank you! _E_ .@FoxNews is devastated that lightweight Senator Marco Rubio got trounced tonight and is the big loser. I won the two big states great! _E_ My @SquawkCNBC interview re: Europe's financial mess investing in Spain Germany's economy and the future of the Euro __HTTP__ _E_ Sorry the best and most beautiful ties and shirts made anywhere and at a really reasonable cost. Also fragrance is amazing. GO TO MACY'S. 
_E_ .@BarackObama is now taking credit for changing party platform language but he reviewed it prior to the convention __HTTP__ _E_ I should release the sad and totally apologetic letter that Penn @pennjillette hand delivered to me. Minds would be changed very fast! _E_ This is the Cruz voter violation certificate sent to everyone a misdemeanor at minimum. __HTTP__ _E_ Just had a very open and successful presidential election. Now professional protesters incited by the media are protesting. Very unfair! _E_ What would you do if a large group of Muslims had a very public meeting drawing horrible and mocking cartoons of Jesus? Oh really be cool! _E_ Congratulations to @woodyjohnson4 and @nyjets on yesterday's very exciting game. _E_ Wow the ridiculous deal made between Lyin'Ted Cruz and 1 for 42 John Kasich has just blown up. What a dumb deal dead on arrival! _E_ The ROLL CALL is beginning at the Republican National Convention. Very exciting! _E_ I thought that @CNN would get better after they failed so badly in their support of Hillary Clinton however since election they are worse! _E_ Get ALL the info then quick trial then death penalty for the Boston killer of innocent children and people! Do not be kind. _E_ Dummy @mcuban is at it again trying to use me to get publicity for himself! _E_ Iran was planning to attack the Israeli and Saudi DC embassies. We should respond accordingly. The diplomatic window is closed. _E_ Bringing true luxury to the Windy City @TrumpChicago soars 92 levels over the Chicago River __HTTP__ _E_ Would very much appreciate Saudi Arabia doing their IPO of Aramco with the New York Stock Exchange. Important to the United States! _E_ The episode of the Apprentice that everyone has been waiting for....Joan Rivers stars and she is and does GREAT! Next Monday night at 8:00 _E_ Congrats @MittRomney on a huge NV victory. 
Let's make @BarackObama a one term president __HTTP__ #OneTermFund _E_ Honest reporters stated that the Prayer Breakfast was going on during my CPAC speech and security was very slow to let people in long lines! _E_ It's amazing my weekly scheduled interviews on @foxnews and @CNBC draw the highest ratings. And they get bigger week by week thanks folks! _E_ "Donald Trump on 'cliff': 'Other countries are eating our lunch'" __HTTP__ via @BIZPACReview _E_ A great Christmas movie & perfect #TBT! #MakeAmericaGreatAgain Story: __HTTP__ __HTTP__ _E_ Entrepreneurs: Paying attention is a cost effective way of protecting yourself. _E_ I wonder if @BarackObama ever had an Indonesian passport. Did he become an Indonesian citizen when he lived there? _E_ Now Michelle Nunn will not admit she voted for Obama. Of course she did. Nunn supports ObamaCare & is anti Second Amendment. _E_ A total lightweight: @JonHuntsman continues to give the worst responses on China in the debates. I can see why (cont) __HTTP__ _E_ NYPD Officer Larry DePrimo has made the entire city proud with a his generous act of kindness __HTTP__ NYC loves the NYPD. _E_ .@BrookslawBrooks Thank you so much for your nice words. I will make you look very smart! _E_ Set the bar high do the best you possibly can and believe in yourself—because if you don't no one else will either. _E_ People will be very surprised by our ground game on Nov. 8. We have an army of volunteers and people with GREAT SPIRIT! They want to #MAGA! _E_ We're getting down to the wire on The Apprentice tune in tonight for some great action! 10 p.m. on NBC. _E_ .@FrankLuntz works really hard but is a guy who just doesn't have it a total loser! _E_ Check out this photo shoot video of @IvankaTrump's Spring 2012 collection.... __HTTP__ _E_ Is Supreme Court Justice Ruth Bader Ginsburg going to apologize to me for her misconduct? Big mistake by an incompetent judge! _E_ Let's together Make America Great Again! 
Vote Trump at __HTTP__ _E_ The new winter menu @SixteenChicago @TrumpChicago explores the evolution of fine dining @RobbReport __HTTP__ _E_ It now turns out that the phony allegations against me were put together by my political opponents and a failed spy afraid of being sued.... _E_ Entrepreneurs: Put everything you've got into what you're doing. Be totally focused nothing should be haphazard. _E_ Disgusting @BarackObama's supporters are launching an anti Mormon whisper campaign __HTTP__ Shameful but no surprise. _E_ Because of the tornado tragedy I will not be doing @piersmorgan tonight. I wish everyone well! _E_ My @FoxNews interview with @TeamCavuto discussing why I will not be moderating the Newsmax @iontv debate __HTTP__ _E_ Our country is blowing up and @BarackObama is out campaigning. _E_ The Republicans must be patient and smart ObamaCare could sweep them into office in far greater numbers than anyone ever thought possible! _E_ Convention Center officials in Phoenix don't want to admit that they broke the fire code by allowing 12 15000 people in 4000 code room. _E_ Everyone loves TV's darling @TheRealMarilu. But wait until you see her tough & competitive side in the upcoming @CelebApprentice! _E_ Will be working with contractors at Trump National Doral in Miami today. _E_ Thanks. __HTTP__ _E_ My @FoxNews interview with @gretawire explaining that I am keeping all my options available for 2012 __HTTP__ _E_ Druggie A Rod @MLB's biggest fraud is lucky George Steinbrenner is no longer with us. @Yankees would have voided his contract. _E_ RT @realDonaldTrump: "Arrests of MS 13 Members Associates Up 83% Under Trump" __HTTP__ _E_ Last weeks Dateline which I hosted was the highest rated Dateline since January! 
_E_ A great Father's Day gift—a stay at my 5 star hotel @TrumpNewYork along with items from my signature collection __HTTP__ _E_ Statement on Relationship with NBC __HTTP__ _E_ I am going to Trump National Doral in Miami today to check out the brand new and just opened BLUE MONSTER and the spectacular driving range. _E_ I hear by demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_ RT @FoxNews: .@EricTrump: People have seen a year that's incredible that's been filled with nothing but the best for our country America... _E_ The polls are close so Crooked Hillary is getting out of bed and will campaign tomorrow.Why did she hammer 13 devices and acid wash e mails? _E_ CBS's FACE THE NATION Posts Largest Audience Since 2001#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ With Democrats Spitzer Danger Weiner & Filner which party really has the war on women? _E_ The latest book on Hillary—Wow a really tough one! __HTTP__ @RogerJStoneJr _E_ It's time for government to stop picking winners & losers. Let's make sure everyone can achieve the American dream! __HTTP__ _E_ "Set the example and you'll be a magnet for the right people. That's the best way to work with people you like." – Think Like a Champion _E_ Maybe if Obama knew too much about the spying it would be worse than knowing nothing but either way it is just another disaster! _E_ China has done very well under Obama. Now they just released their first aircraft carrier. _E_ .@JustinRose99 The display you put on this weekend was unprecedented. Even the best putters couldn't believe it. You're amazing. See u soon. _E_ We are stupidly paying Iran billions of dollars that we should not be paying. Why isn't this part of the nuclear negotiations? Really dumb! _E_ This past Sunday's All Star Celebrity @ApprenticeNBC continued to win the key demographic of adults 25 54. An amazing run! _E_ This summer is very tough for the nation's worst AG @AGSchneiderman. 
Moreland Commission is his disaster. _E_ RT @seanhannity: @ericbolling To my dear friend please know we all love you will be here for you and your family. _E_ People are just now starting to find out how dishonest and disgusting (FakeNews) @NBCNews is. Viewers beware. May be worse than even @CNN! _E_ Congress must stop Obama's reckless deal with Iran. The framework is a pathway for Iran to develop nukes. _E_ Congrats to my friend @Schwarzenegger who is doing next season's Celebrity Apprentice. He'll be great & will raise lots of $ for charity. _E_ The Justice Dept. should ask for an expedited hearing of the watered down Travel Ban before the Supreme Court & seek much tougher version! _E_ Having a good relationship with Russia is a good thing not a bad thing. Only stupid people or fools would think that it is bad! We..... _E_ Congratulations to Miss Rhode Island on winning the Miss USA contest. She did an amazing job. _E_ Listen to my interview with @gretawire tonight at 10PM ET on @FoxNews. _E_ The Paley Center for Media is a great place to visit when you're in NYC. #CelebApprentice _E_ ...use subsidies to buy health plans. In other words Ocare is dead. Good things will happen however either with Republicans or Dems. _E_ An insightful article on @BarackObama __HTTP__ _E_ I make no apologies for this country my pride in it or my desire to see us become strong and rich again. (cont) __HTTP__ _E_ Donald Trump's Guns by @EmilyMiller @washtimes __HTTP__ _E_ #CelebApprentice Who will hear those two famous words? @Apprenticenbc premieres tomorrow at 9/8c on NBC. __HTTP__ _E_ When I said that if within the Orlando club you had some people with guns I was obviously talking about additional guards or employees _E_ I was thrilled to be back @LibertyU. Congratulations to the Class of 2017! This is your day and you've earned it.... __HTTP__ _E_ Just got back from New Hampshire. Amazing people we all had a great time together! 
_E_ Via @zpolitics: "Donald Trump Sends Message to @GaRepublicans" __HTTP__ _E_ ....countries which are doing badly. I want a merit based system of immigration and people who will help take our country to the next level. I want safety and security for our people. I want to stop the massive inflow of drugs. I want to fund our military not do a Dem defund.... _E_ The Dow just broke 24000 for the first time (another all time Record). If the Dems had won the Presidential Election the Market would be down 50% from these levels and Consumer Confidence which is also at an all time high would be "low and glum!" _E_ Jeb Bush had a tough night at the debate. Now he'll probably take some of his special interest money he is their puppet and buy ad's. _E_ Don't forget my book signing tonight at Costco on 1250 Old Country Road in Westbury NY from 6 8 pm. Hope to see you there. _E_ RT @BarackObama: RT if you agree: We need a President who is fighting for all Americans not one who writes off nearly half the country. _E_ Great! __HTTP__ _E_ What Bernie Sanders really thinks of Crooked Hillary Clinton. __HTTP__ _E_ .@JerryLawler was terrific. #WWEHOF __HTTP__ _E_ Watch this clip from earlier this year. Time & time again I have been right about terrorism. It's time to get tough! __HTTP__ _E_ Over the years I've discovered that for a brand to build the people surrounding it have to work exceptionally well together. _E_ There is no substitute for private sector experience. _E_ Jeb Bush "I am a conservative" = Barack Obama "If you like your healthcare plan you can keep your plan." _E_ #CongratsPeggy! __HTTP__ _E_ Via @CBSNewYork: "@TrumpFerryPoint Opens In The Bronx" __HTTP__ _E_ Thank you to the people of New Hampshire I love you! Now off to South Carolina. _E_ .@cher should spend more time focusing on her family and dying career! _E_ Can it just be new age that Manti Te'o fell in love with a girl he never met or is it a hoax? _E_ Congratulations to the White House. 
For every 1 ObamaCare enrollment there are 44 cancellation notices. Very unfair! _E_ 'As Senator Clinton promised 200000 jobs in Upstate New York her efforts fell flat.' __HTTP__ __HTTP__ _E_ Thanks for all of the nice tweets re Sgt. Tahmooressi. Especially nice that the money will be sent today #VeteransDay. _E_ .@maddow Standing in front of wind turbines is sad. Rachel windmills are terrible for the environment— _E_ Flashback: Donald Trump: $200M plan for Doral __HTTP__ via @ESPNGolf. Trump Doral's @cadillacchamp is one week away! _E_ Today is a day that I've been looking very much forward to ALL YEAR LONG. It is one that you have heard me speak about many times before. Now as President of the United States it is my tremendous honor to finally wish America and the world a very MERRY CHRISTMAS! __HTTP__ _E_ Reports by @CNN that I will be working on The Apprentice during my Presidency even part time are ridiculous & untrue FAKE NEWS! _E_ #SuccessByTrump exclusively available @Macy's has set sale records for fastest selling cologne. Makes a great gift __HTTP__ _E_ Manufacturers' record high optimism reported in the 1st qtr has carried into the 2nd qtr of 2017 via @ShopFloorNAM: __HTTP__ __HTTP__ _E_ Trump Signature mattress is from Serta the best there is! Thanks _E_ The Muslim Brotherhood dictator in Egypt is bad news. He will never be our true ally! _E_ 🚨BREAKING🚨: State Department's Kennedy pressured FBI to unclassify Clinton emails: FBI documents __HTTP__ _E_ A world famous testament to architectural excellence @TrumpTowerNY features a 60 ft waterfall __HTTP__ _E_ Don't be afraid of being unique it's like being afraid of your best self. Donald J. Trump __HTTP__ _E_ The Trump Organization is going revolutionize Rio de Janeiro's downtown port area with Trump Towers. Construction begins soon! _E_ On the cover of @TIME Magazine—a great honor! __HTTP__ _E_ .@VP Mike Pence will be speaking at today's #MarchForLife You have our full support! 
__HTTP__ _E_ Via @Newsmax_Media by Cathy Burke: "Donald Trump on 2016 Bid: On Scale of 1 10 I'm 'Much More Than Five'" __HTTP__ _E_ have enough problems around the world without yet another one. When I am President Russia will respect us far more than they do now and.... _E_ Nasty tactics being used by @BarackObama campaign against @MittRomney. Must stop saying Obama is a nice man he is not! _E_ ..and now holds an adjunct professorship at Columbia University. Boudin also received an academic laurel from NYU Law School... _E_ Via HT Politics __HTTP__ _E_ #DrainTheSwamp __HTTP__ _E_ Snowden if you're such a hero then come back home and face justice. In reality you are just another wiseguy traitor. _E_ So sad to hear of the terrorist attack in Egypt. U.S. strongly condemns. I have great... _E_ The $9B that @BarackObama spent in 'Stimulus' for Solar Wind Projects created 910 total jobs costing $9.8M each. __HTTP__ _E_ Lawyers have sent @billmaher demand notice and necessary documentation. _E_ On #PurpleHeartDay💜I thank all the brave men and women who have sacrificed in battle for this GREAT NATION! #USA __HTTP__ _E_ Horrific incident in FL. Praying for all the victims & their families. When will this stop? When will we get tough smart & vigilant? _E_ Comic @sethmeyers21 bombed at University of Texas at Arlington—crowd was dismal as was his performance—I told you so! _E_ Looking forward to @THEGaryBusey's book of Buseyisms ! _E_ Wow. @nfl ratings are down big league. Glad I didn't get the Bills. Rather be lucky than good. _E_ What you get by achieving your goals is not as important as what you become by achieving your goals. Goethe _E_ We pay a disproportionate share of the cost of N.A.T.O. Why? It is time to renegotiate and the time is now! _E_ "The Trumps pay tribute to the late @Joan_Rivers" __HTTP__ via @azcentral _E_ For too many years our inner cities have been left behind. I am going to deliver jobs safety and protection for those in need. 
_E_ Going on Letterman now let me know what you think how did I do? Here we go! _E_ Matt Harvey @Mets Don't let the @NYDailyNews get you down nobody reads it. Play well. _E_ New home sales reach a 10 year high. Stock Market has more record gains. Hopefully Republican Senators will give us the much needed Tax Cuts to keep it all going! Democrats want big Tax Increases. _E_ Wisconsin's economy is doing poorly and like everywhere else in U.S. jobs are leaving. I will make our economy strong again bring in jobs _E_ Today it was my tremendous honor to visit Marine Helicopter Squadron One (HMX 1) at the Marine Corps Air Facility in Quantico Virginia. I am honored to serve as your Commander in Chief. On behalf of an entire Nation THANK YOU for your sacrifice and service. We love you! __HTTP__ _E_ Looking forward to @VinceMcMahon inducting me into @WWE Hall of Fame this Saturday in @TheGarden. #WWEHOF #WrestleMania _E_ .@AlexSalmond See attached article. Very frightening to people living around these monstrosities __HTTP__ _E_ .@TrumpNewYork's 176 rooms have floor to ceiling windows providing unparalleled views of Central Park & NYC __HTTP__ _E_ Join me in Denver Colorado tonight at 9:30pm: __HTTP__ Scranton Pennsylvania Monday @ 5:30pm: __HTTP__ _E_ See my picks at @Fund_Anything at __HTTP__ and giving away money!!! #FundAnything _E_ Thank you to Governor @ScottWalker for such warm support. Great speech! _E_ Great! __HTTP__ _E_ I still hold the all time attendance and pay per view record at @WWE. _E_ Alabama was great last night amazing people. 30000 folks was largest crowd of political season. Nice! _E_ The Democrats have become nothing but OBSTRUCTIONISTS they have no policies or ideas. All they do is delay and complain.They own ObamaCare! _E_ Donald Trump: If Bill Maher Does Not Pay Off His $5 Million Bet – 'Then I'll Sue Him' __HTTP__ via @gatewaypundit _E_ People often ask me the secret to my success and the answer is simple: passion focus and hard work. 
Momentum keeps it all going. _E_ Why does @ThisWeekABC w/ @GStephanopoulos allow a hater & racist like @tavissmiley to waste good airtime? @ABC can do much better than him! _E_ The Trump Signature Collection is the best menswear design for young entrepreneurs. Great style & design exclusively available @Macys. _E_ Jennifer is a terrific person. __HTTP__ _E_ Still a great time to buy residential property. The courts are holding up foreclosures. Buy directly from the banks. _E_ Don't forget to tune in tonight at 10 p.m. on NBC for another action packed episode of The Apprentice. __HTTP__ _E_ How will the client react? They've got both Elle Magazine and Chi to please. #sweepstweet _E_ In less than 30 minutes watch the season premiere of @ApprenticeNBC on NBC. _E_ Make sure to tune in to All Star Celebrity @ApprenticeNBC this Sunday at 9PM EST for another round of fireworks and surprises! _E_ I only go on shows that get ratings that's why I do @oreillyfactor @hannityshow and @gretawire. Your sho... (cont) __HTTP__ _E_ Back from Miami where my Cuban/American friends are very happy with what I signed today. Another campaign promise that I did not forget! _E_ In Britain more Muslims join ISIS than join the British army. __HTTP__ _E_ President @EmmanuelMacronThank you for inviting Melania and myself to such a historic celebration in France. #BastilleDay #14juillet __HTTP__ _E_ We enjoy hosting tourists in @TrumpTowerNY. They come from all over the world to see the Atrium a NYC landmark. __HTTP__ _E_ Wow did great in the debate polls (except for @CNN which I don't watch). Thank you! _E_ Obama is angry frustrated and desperate. He said "voting is the best revenge" __HTTP__ He is divisive. _E_ Hence legal documents are being crafted which take me completely out of business operations. The Presidency is a far more important task! _E_ .@FoxNews is so biased it is disgusting. They do not want Trump to win. All negative! 
_E_ Message to Edward Snowden you're banned from @MissUniverse. Unless you want me to take you back home to face justice! _E_ This was the Republicans election to win but they just blew it reasons why to follow. _E_ Dopey @Lawrence O'Donnell whose unwatchable show is dying in the ratings said that my Apprentice $ numbers were wrong. He is a fool! _E_ Am now in L.A. Will be going to the U.S.S. IOWA at 5:30 P.M. to speak to our great VETERANS and other friends! _E_ It's Tuesday. How many more 'The View' Execs will leak that they want @rosie gone? Show is failing. _E_ American homeownership rate in Q2 2016 was 62.9% lowest rate in 51yrs. WE will bring back the 'American Dream!' __HTTP__ _E_ Look where the world is today a total mess and ISIS is still running around wild. I can fix it fast Hillary has no chance! _E_ I started this campaign to Make America Great Again. That's what I'm going to do. #MAGA #debate _E_ In Tampa Florida thank you to all of our outstanding volunteers who want to #MakeAmericaGreatAgain! __HTTP__ _E_ .@HallieJackson Why didn't you report Hillary lying about the ISIS video. Bad reporting. Perhaps @NBC will do better next year but doubt it! _E_ Big thanks to @David_Bossie @Citizens_United & @AFPhq for hosting me at #NHFreedomSummit. Will be back to the Granite State soon! _E_ Tremendous support (except for some Republican leadership ). Thank you. _E_ Congratulations to @FoxNews for being number one in inauguration ratings. They were many times higher than FAKE NEWS @CNN public is smart! _E_ Great meeting with active & retired law enforcement officers at the Fraternal Order of Police lodge in Akron Ohio. __HTTP__ _E_ Gov. Scott Walker just left my office we had a really wonderful talk. Very interesting! @GovWalker _E_ Sources inside @AGSchneiderman's office are saying that they are very concerned with the allegations against their lightweight boss. 
_E_ 51% of @JonHuntsman's NH voters are satisfied with @BarackObama as president __HTTP__ So is @JonHuntsman! _E_ Excited by my acquisition of Doral Hotel & Country Club in Miami already world class but will soon be The Best. _E_ Crooked Hillary Clinton spent hundreds of millions of dollars more on Presidential Election than I did. Facebook was on her side not mine! _E_ Entrepreneurs: Being stubborn is a big part of being a winner. Don't give in and don't give up! _E_ Jon Stewart @TheDailyShow is a total phony –he should cherish his past—not run from it. _E_ Obama can attend a fundraiser every day but can't be bothered to get briefed on national security. Commander in Chief?! _E_ Just left Florida amazing how well State is doing jobs way up taxes down. Congrats to @FLGovScott _E_ 'How Trump Would Stimulate the U.S. Economy' __HTTP__ _E_ New Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: __HTTP__ _E_ It probably was not drugs that caused the San Fran crash but why aren't they testing who knows? _E_ The joke around town is that I freed El Chapo from the Mexican prison because the timing was so good w/ my statements on border security. _E_ I will be interviewed by @IngrahamAngle on @FoxNews at 10:00. Enjoy! _E_ Crazy Election officials saying that there is nothing stopping illegal immigrants from voting. This is very bad (unfair) for Republicans! _E_ Thank you Alabama! #Trump2016#SuperTuesday _E_ Just out @ApprenticeNBC was in first place in all demos during the 10PM hour in the ratings. _E_ Well back to the drawing board! _E_ I will be doing a Town Hall tonight at 10:00 P.M. on @seanhannity @FoxNews _E_ Looking forward to a speedy recovery for George and Barbara Bush both hospitalized. Thank you for your wonderful letter! _E_ I really like Jay Z but there is trouble in paradise. When his wife's sister starts whacking him not good! No help from B leads to a mess. 
_E_ I truly hope President Obama doesn't do something irrational and dangerous for our country in order to save face. He must sit back and chill _E_ The North Coast of Scotland is spectacular the sea the sand dunes the rolling bluffs we walked the course and it is fantastic. _E_ MUST READ ARTICLE: "Immigration reform could be bonanza for Democrats" __HTTP__ Are the @RNC & @GOP suicidal? _E_ Congratulations to @TrumpDoral for being named one of @LINKSMagazine's Great Destinations: __HTTP__ _E_ For all of those who have been asking about online sales the Donald J. Trump Signature Collection ties & shirts are sold @Macys.com _E_ Small Business Poll has highest approval numbers in the polls history. All business is just at the beginning of something really special! _E_ .@williebosshog such an honor to get your endorsement. You are a fantastic guy! It will not be forgotten. Don and Eric say hello! _E_ America's men & women in uniform is the story of FREEDOM overcoming OPPRESSION the STRONG protecting the WEAK & G... __HTTP__ _E_ Very different styles but each totally effective in his own way at the debate. _E_ Choose your own path: It doesn't have to be the path less traveled...What matters is that it's the right one for you. Vince Lombardi _E_ President Obama's approval rating at 38% is at an all time low. Gee I wonder why? _E_ My best wishes to everyone for a Happy Thanksgiving! _E_ The United States better address China's exchange rate before they steal our country and it is too late! China is laughing at us. _E_ Where the hell is global warming when you need it? _E_ So a woman in Chicago who never had a job has 9 kids with 7 different men (she is one of many). These kids will never work. Trouble! _E_ .@MacMiller's 'Donald Trump' song is at 64.5M views on YouTube __HTTP__ You're welcome Mac! _E_ Rev. Graham made a critical point. 
@BarackObama has turned a blind eye to the Christians being persecuted in (cont) __HTTP__ _E_ Welcome to Obama's America record high poverty and an 8% drop in median household family income __HTTP__ Four more years? _E_ Businesses have already started massive layoffs and reducing employees' hours due to Obama Care. Reality is setting in. _E_ A Great 4th of July! America a great country who's brightest days with wise leadership lie ahead. _E_ RT @JasonMillerinDC: Is @realDonaldTrump debating Crooked @HillaryClinton or the moderators @AC360 and @MarthaRaddatz? #rattledhillary _E_ In light of Boston immigration legislation will be much harder to get. _E_ You would think a paper like the Washington Post would be fair and objective. For the record almost all polls showed I won all debates. _E_ Hillary Clinton's Campaign Continues To Make False Claims About Foundation Disclosure: __HTTP__ _E_ .@AlexSalmond of Scotland may be the dumbest leader of the free world. I can't imagine that anyone wants him in office. _E_ The dying @UnionLeader newspaper in NH is in turmoil over my comments about them like a bully that got knocked out! _E_ Jeanne Shaheen was the deciding vote for ObamaCare. Premiums have skyrocketed 90% for New Hampshire. Send @SenScottBrown to the Senate! _E_ Kevin Garnett's response to Ray Allen last night was that of a great competitor nothing wrong in fact it was terrific. A champion! _E_ Why is @BarackObama always campaigning or on vacation? _E_ The Trump Doral's @cadillacchamp is Florida's premiere golf tournament. I'll be there! Tickets available here: __HTTP__ _E_ .@hardball_chris must have the lowest IQ on television—now telling people that domestic terrorists are from the right. _E_ RT @DRUDGE_REPORT: 43 39 __HTTP__ _E_ That the Obama administration didn't know the facts about who Bergdahl was before making the stupid 5 killers for one trade is pathetic! _E_ .@MittRomney is 100% right. 
The US Supreme Court should do the right thing & overturn ObamaCare or the country (cont) __HTTP__ _E_ For years even as a civilian I listened as Republicans pushed the Repeal and Replace of ObamaCare. Now they finally have their chance! _E_ .@pgaofamerica A really great tournament congrats to Monty Pete B and Ted Bishop. FANTASTIC JOB! _E_ Congratulations to Tom Brady on yet another great victory Tom is my friend and a total winner! _E_ In today's #trumpvlog @RepWeiner the Secret Service and Dick Clark..... __HTTP__ _E_ NEVER forget our HEROES held prisoner or who have gone missing in action while serving their country.Proclamation: __HTTP__ __HTTP__ _E_ Save Medicare. Vote for @MittRomney. He will repeal Obamacare on day one. _E_ Deja vu I can remember a time when our embassies were stormed under another failed President. Obama=Carter. _E_ I will be interviewed on @foxandfriends by @ainsleyearhardt starting at 6:00 A.M. Enjoy! _E_ Joining @oreillyfactor from Waukesha Wisconsin now live! Enjoy! _E_ Success requires 100% effort and 100% focus. Nothing less. _E_ What is your thought as to why Obama refused millions for charity and did not show his records and applications? _E_ THANK YOU Clemson South Carolina! #MakeAmericaGreatAgain #SCPrimary __HTTP__ _E_ People rarely succeed unless they have fun in what they are doing. Andrew Carnegie _E_ Important meetings and calls scheduled for today. Military and economy are getting stronger by the day and our enemies know it. #MAGA _E_ Can you believe the worst Mayor in the U.S. & probably the worst Mayor in the history of #NYC @BilldeBlasio just called me a blow hard! _E_ Based on the fact that Ted Cruz was born in Canada and is therefore a natural born Canadian did he borrow unreported loans from C banks? _E_ Just left Family Leadership Summit in Iowa got a standing ovation from many wonderful people. I will be back soon. 
_E_ I don't believe I have been given any credit by the voters for self funding my campaign the only one. I will keep doing but not worth it! _E_ Bird killing windfarm that I oppose in Aberdeen got delayed by at least two years.@AlexSalmond forced the failing developers to delay! _E_ This Man Is the Most Dangerous Political Operative in America via Bloomberg Politics __HTTP__ _E_ Hillary Clinton is not a change agent just the same old status quo! She is spending a fortune I am spending very little. Close in polls! _E_ Heading to Sioux County Iowa where the crowd is amazing. Dr. Robert Jeffress will make the introduction. Make America Great Again! _E_ Thank you Pennsylvania!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ "The only way to do great work is to love what you do. If you haven't found it yet keep looking. Don't settle." – Steve Jobs _E_ #TBT With the wonderful actor Jack Nicholson __HTTP__ _E_ Join the MOVEMENT! __HTTP__ __HTTP__ _E_ Via @fitsnews: The Donald Trump Show Is Returning To SC: BILLIONAIRE MOGUL HEADS BACK TO PALMETTO STATE __HTTP__ _E_ Now Syria is bombing Iraq and Secy. Kerry after we blew the hell out of the place says please don't do that. Syria is a front for Iran. _E_ Is it possible for @megynkelly to cover anyone but Donald Trump on her terrible show. She totally misrepresents my words and positions! BAD. _E_ Entrepreneurs: Do not view any failure as the final say for your efforts. Learn your lessons quickly then move on. _E_ .@HillaryClinton's Nuclear Agreement Paved The Way For The $400 Million Ransom Payment #DebateNight __HTTP__ _E_ Job numbers today terrible! So what else is new? _E_ I have always done well with properties fronting on oceans lakes and rivers. If something works stay with it. _E_ Sean Spicer is a wonderful person who took tremendous abuse from the Fake News Media but his future is bright! 
_E_ Via @TheBrodyFile: Iowa Evangelical Leader Says Donald Trump Is Bold And Transparent __HTTP__ _E_ The only reason I bid on @buffalobills was to make sure they stayed in Buffalo where they belong. Mission accomplished. _E_ Don't forget! Sunday night at 9 pm EST on @nbc Celebrity Apprentice is back! Tune in for a great show. @ApprenticeNBC _E_ 70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow! _E_ He knows he won't have to spend much: @JonHuntsman has offered to match any donation dollar for dollar. _E_ Alaska had a 200% plus increase in premiums under ObamaCare worst in the country. Deductibles high people angry! Lisa M comes through. _E_ Both Obama administration and House leadership staffs are exempt from ObamaCare. Why not the American people? #MakeDCListen _E_ .@WWE: He's answered the call! @realDonaldTrump responds to @VinceMcMahon's #ALSIceBucketChallenge! __HTTP__ #SmackDownALS _E_ I was recently asked if Crooked Hillary Clinton is going to run in 2020? My answer was I hope so! _E_ Thank you Abingdon Virginia! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Knockout assaults are the new rage by sick and depraved youth. We better start getting tough in this country and they want to take our guns! _E_ "I always follow my own instincts but I am not going to kid you: it's also nice to get good reviews." The Art of the Deal _E_ Coming soon to Pennsylvania Avenue __HTTP__ _E_ Results of recovery efforts will speak much louder than complaints by San Juan Mayor. Doing everything we can to help great people of PR! _E_ Wow just heard that that next Tuesday's @saintanselm Politics & Eggs is the largest crowd ever. Looking forward to making new friends. _E_ Joining @SeanHannity tonight at 9pmE on @FoxNews. Enjoy! __HTTP__ _E_ Doing Fox & Friends at 7.00 A.M. ENJOY! _E_ .@BillBratton was a great choice for NYC Police Commissioner. He will make us proud and safe! 
_E_ Obama's own gun study proves gun control is ineffective __HTTP__ @BIZPACReview _E_ It's 46º (really cold) and snowing in New York on Memorial Day tell the so called scientists that we want global warming right now! _E_ What a rotten deal we made with Iran. We get nothing (except laughter at our stupidity). They get everything including delay and big cash! _E_ The people of Alabama will do the right thing. Doug Jones is Pro Abortion weak on Crime Military and Illegal Immigration Bad for Gun Owners and Veterans and against the WALL. Jones is a Pelosi/Schumer Puppet. Roy Moore will always vote with us. VOTE ROY MOORE! _E_ Congratulations to Justice Neil Gorsuch on his elevation to the United States Supreme Court. A great day for Americ... __HTTP__ _E_ It's Tuesday. How many fundraisers travelling on the taxpayer dime will Obama hold today? _E_ ...They should realize that these relationships are a good thing not a bad thing. The U.S. is being respected again. Watch Trade! _E_ .@TMobile You service is absolutely terrible get on the ball! @JohnLegere _E_ .@BarackObama has completely failed the American people. U.S. annual incomes have fallen over 5% during his term __HTTP__ _E_ Waste. With 22 new taxes & $1.8T in added debt @BarackObama's disgraceful 'ObamaCare' will still leave 30M uninsured __HTTP__ _E_ I spell out some of the differences between Ben Carson and myself at 9:00 A.M. on @CNN @jaketapper. Ben is very weak on illegal immigration. _E_ Spent a beautiful weekend golfing at Trump National Golf Club Westchester and Trump National Golf Club Bedminster. _E_ RT @foxandfriends: France vehicle attack leaves at least six soldiers injured __HTTP__ _E_ ... to OPEC countries that hate our guts. It's stupid policy." Time To Get Tough _E_ When @crowleyCNN defended Obama on Benghazi in the presidential debate she was defending a complete lie __HTTP__ _E_ I'm at Trump National DC @TrumpGolfDC watching the #2013JuniorPGA championship fantastic young players! 
@ThePGAofAmerica. _E_ "The way to get started is to quit talking and begin doing." – Walt Disney _E_ With our brand new Tennis Performance Center @TrumpGolfDC offers countless activities along with top courses __HTTP__ _E_ First Titantic sunk on its maiden voyage.Next the Hindenburg explodes on its first flight to America.Now we suffer the ObamaCare rollout! _E_ "President Donald J. Trump Proclaims January 16 2018 as Religious Freedom Day" __HTTP__ _E_ I recorded robo calls for @Perduesenate @leezeldin & @SteveKingIA. All had record wins. #MidasTouch _E_ "Mastering others is strength. Mastering yourself is true power." – Lao Tzu _E_ No surprise with the talk of amnesty in DC illegal immigration is picking up in Arizona __HTTP__ _E_ As I told everyone once before Wiener is a sick puppy who will never change 100% of perverts go back to their ways. Sadly there is no cure _E_ Certain Republicans who have lost to me would rather save face by fighting me than see the U.S.Supreme Court get proper appointments. Sad! _E_ Crooked Hillary's bad judgement forced her to announce that she would go to Charlotte on Saturday to grandstand. Dem pols said no way dumb! _E_ My @foxandfriends interview discussing the 9/11 Trials at Gitmo @MittRomney the job numbers and @CelebApprentice __HTTP__ _E_ Few if any Administrations have done more in just 7 months than the Trump A. Bills passed regulations killed border military ISIS SC! _E_ Great meeting all of you. This group knocked on 50K doors & counting here in Maine thank you! @MaineGOP __HTTP__ _E_ Congratulations to @ABC News for suspending Brian Ross for his horrendously inaccurate and dishonest report on the Russia Russia Russia Witch Hunt. More Networks and "papers" should do the same with their Fake News! _E_ Not only are wind farms disgusting looking but even worse they are bad for people's health __HTTP__ (cont) __HTTP__ _E_ .@Oprah was great amazing that she got Lance Armstrong to totally destroy his life. 
Why did he ever do that interview? _E_ #ThrowbackThursday #Trump2016 __HTTP__ _E_ Thank you New Hampshire! #FITN#Trump2016 #NHPolitics __HTTP__ _E_ Hey @SnoopDogg @ItstheSituation @SethMacFarlane: Oh I'm real scared. #TrumpRoast airs tonight at 10:30/9:30 on @Comedy Central. _E_ #TBT With @DonaldJTrumpJr almost 35 years ago __HTTP__ _E_ Not only giving out money but Obama will be seen today standing in water and rain like he is a real President don't fall for it. _E_ Great day in Kentucky with Wayne LaPierre Chris Cox & the @NRA! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ A classic China just signs massive oil and gas deal with Russia giving Russia plenty of ammo to continue laughing in U.S. face. _E_ America's top public course @TrumpGolfLA's greens on Palos Verdes Peninsula have been celebrated by @GolfMagazine __HTTP__ _E_ beepee2004 Thank you very much Donald. Here is another. __HTTP__ Thanks both do justice to a fantastic place. _E_ RT @TeamTrump: Mrs. Saucier's son is in prison for having classified info on an unsecured device. @HillaryClinton did FAR WORSE & is runnin... _E_ What message does it send when @BarackObama's campaign has to spin whether America is better off than it was 4 years ago? _E_ and here's another.... __HTTP__ _E_ Can you believe this fool Dr. Thomas Frieden of CDC just stated anyone with fever should be asked if they have been in West Africa DOPE _E_ Bain did not list @MittRomney as an Executive on its website in 2000 __HTTP__ @BarackObama's Saul Alinsky tactics won't work! _E_ Our great country is respected again in Asia. You will see the fruits of our long but successful trip for many years to come! _E_ China's corporate espionage is a continued threat to the American economy. With the right leadership it can be stopped. _E_ I am looking forward to being in New Hampshire tomorrow. The silent majority is taking our country back. We will MAKE AMERICA GREAT AGAIN! _E_ We will no longer be silent. We can take our country back! 
Let's Make America Great Again! __HTTP__ _E_ Such a great honor. Final debate polls are in and the MOVEMENT wins!#AmericaFirst #MAGA #ImWithYou... __HTTP__ _E_ Unbelievable. __HTTP__ _E_ I refuse to call Megyn Kelly a bimbo because that would not be politically correct. Instead I will only call her a lightweight reporter! _E_ Your higher self is in direct opposition to your comfort zone. Donald J. Trump __HTTP__ _E_ I watched parts of @nbcsnl Saturday Night Live last night. It is a totally one sided biased show nothing funny at all. Equal time for us? _E_ .@LukeDonald You are so good and so talented that I have no doubt you will conquer the 18th hole at the New Blue Monster @DoralResort _E_ The polls have been consistently great. The silent majority is speaking. Politicians are failing. #MakeAmericaGreatAgain! _E_ Dianne Gallagher @DianneG is a great reporter for News Channel 36 in Charlotte NC. Fantastic interview thanks! _E_ I am very proud of my friend @OMAROSA. Despite her recent lossshe gracefully performs in the upcoming All Star @ApprenticeNBC _E_ We must build a great wall between Mexico and the United States! __HTTP__ _E_ Didn't the Boston killer even run over his own brother with a car in order to get away? We are not dealing with an innocent baby here DEATH _E_ David Pecker would be a brilliant choice as CEO of TIME Magazine nobody could bring it back like David! __HTTP__ _E_ Boasting @AAAFiveDiamond & @ForbesInspector 5 Star ratings @TrumpNewYork's @jeangeorges features a superb menu __HTTP__ _E_ The @whitehouse has 'clarified' that the unemployment is actually 8.254% not 8.3% __HTTP__ A little sensitive are we? _E_ Great investor John Paulson just sought bankruptcy protection for a unit of his hedge fund very smart but he didn't go bankrupt you morons! _E_ With a stupid guy like Jonah Goldberg who uses "tweeting like a 14 year old girl" to hit me no wonder the NRO is doing so poorly. 
@JonahNRO _E_ .@DaveWeigel @WashingtonPost put out a phony photo of an empty arena hours before I arrived @ the venue w/ thousands of people outside on their way in. Real photos now shown as I spoke. Packed house many people unable to get in. Demand apology & retraction from FAKE NEWS WaPo! __HTTP__ _E_ Wow the failing @nytimes has not reported properly on Crooked's FBI release. They are at the back of the pack no longer a credible source _E_ Just as I predicted Iraq is deteriorating into utter chaos __HTTP__ The war was a waste. China is taking all the oil. _E_ Via @bpolitics by @BetBrod "Trump sets an aggressive tone as he insisted he's serious about running for POTUS. __HTTP__ _E_ Looking like a really big night for Republicans a tremendous refutation of President Obama and his failed policies! _E_ .@genesimmons Amazing! Thank you.   __HTTP__ _E_ .@BarbaraJWalters @theviewtv Why did you choose me as one of the 10 Most Fascinating People of the Year last season (and more than once?) _E_ I discuss South Korea in today's all new #TrumpVlog __HTTP__ _E_ It was my great honor to welcome the 2016 World Series Champion Chicago @Cubs to the @WhiteHouse this afternoon.... __HTTP__ _E_ After North Korea missile launch it's more important than ever to fund our gov't & military! Dems shouldn't hold troop funding hostage for amnesty & illegal immigration. I ran on stopping illegal immigration and won big. They can't now threaten a shutdown to get their demands. _E_ "The object of golf is not just to win. It is to play like a gentleman and win." Phil Mickelson @MickelsonHat _E_ The ObamaCare website was hacked. $5B dollars later and the site can't even secure your personal information. _E_ Instead of attacking me Ashish J. Thakkar should worry about the culture of corruption plaguing Uganda __HTTP__ _E_ .@MittRomney and his campaign manager should not be critical of candidates after they blew an election that should never have been lost! 
_E_ Te'o's imaginary girlfriend is one of the great cons of all time—or he's very stupid. _E_ In the just out @FoxNews Poll I easily beat Hillary Clinton and I havn't even focused on her yet. On our way: MAKE AMERICA GREAT AGAIN! _E_ If you've looked over the yearsI've been right on virtually every issue from Iraq (not going in but if so taking the oil) to jobs to China _E_ Wow Ted Cruz got booed off the stage didn't honor the pledge! I saw his speech two hours early but let him speak anyway. No big deal! _E_ Wind turbines are not only killing millions of birds they are killing the finances & environment of many countries & communities. _E_ As President I wanted to share with Russia (at an openly scheduled W.H. meeting) which I have the absolute right to do facts pertaining.... _E_ "Build confidence starting with small successes that lead to greater and greater successes there is nothing like winning. Think Big _E_ And Trump SoHo New York is one of the hottest new hotels anywhere.... __HTTP__ _E_ I've got news for President Obama: America is not what's wrong with the world. #TimeToGetTough __HTTP__ __HTTP__ _E_ Can you imagine if @BarackObama had passed Cap and Trade?! Energy costs would be double from already record highs. _E_ Entrepreneurs: Pay attention to details. If you don't know everything about what you're doing you'll be in for some big surprises. _E_ No one wants the government to shut down but if ObamaCare is fully implemented then our country will eventually shutdown anyway! _E_ Haters stop saying I went bankrupt it is not so. I never went bankrupt... _E_ Typical @BarackObama's Press Secretary deflects any criticism of Obama's constant celebrity visits by attacking me. My great honor. _E_ A Rod's lawsuit trying to overturn a binding arbitration agreement is going nowhere. He should be banned from spring training. _E_ My book Midas Touch with Robert Kiyosaki (Rich Dad Poor Dad) will be in bookstores tomorrow it's a grea... 
(cont) __HTTP__ _E_ Heading to New Hampshire. #MakeAmericaGreatAgain __HTTP__ _E_ Bernie Sanders has been treated terribly by the Democrats—both with delegates & otherwise. He should show them & run as an Independent. _E_ "All our dreams can come true if we have the courage to pursue them." – Walt Disney _E_ Congratulations to the winners of the Commander in Chief's Trophy the great Air Force Falcons! Watch:... __HTTP__ _E_ I have great confidence that China will properly deal with North Korea. If they are unable to do so the U.S. with its allies will! U.S.A. _E_ There's never been anyone more abusive to women in politics than Bill Clinton.My words were unfortunate the Clintons' actions were far worse _E_ Answer to your questions I will be voting at 10:30 AM at Lighthouse International 110 East 60th Street Manhattan _E_ #CelebApprentice contestants @DeeSnider and @DebbieGibson joined me for interviews today __HTTP__ _E_ Congrats to @nbc on the success of the new smash show @NBCBlacklist. Fantastic suspense. Great acting. Must see TV! _E_ .@foxandfriends int. on how the Boston thug deserves death penalty @FBI's great work & firing Brande Roderick __HTTP__ _E_ Okay I think I'm going to do it—I'll open the Miss Universe Pageant as Santa tonight at 8 pm on @NBC _E_ Don't forget to watch The Tonight Show with the wonderful @jimmyfallon at 11:30 P.M. You will not be disappointed! @NBC _E_ The Eric Trump Foundation has raised over $1000000 towards St. Jude Children's Research Hospital. __HTTP__ _E_ Today is #VeteransDay. Let us be thankful for our nation's finest who fight at all corners of the earth to protect our freedoms. _E_ Watch my latest appearance on Squawk Box .... __HTTP__ _E_ EXCLUSIVE: Newt Gingrich: 'The Country Is in Rebellion' Trump Can 'Kick Down the Doors' __HTTP__ _E_ Biden's sarcastic smiling may or may not be effective depending on who is watching. #VPDebate _E_ .@Natalie_Gulbis Thank you for the nice piece in @SInow / @Golf_com.Keep up the great work! 
__HTTP__ _E_ Libya is being taken over by Islamic radicals with @BarackObama's open support. _E_ Heading to Richmond Virginia now. Join me tonight! #Trump2016Tickets: __HTTP__ _E_ Beautiful evening in Kinston North Carolina thank you! Get out and VOTE!! You can watch tonight's rally here:... __HTTP__ _E_ The world economy is under deep stress with growth slowing everywhere. Yet crude is over $87/barrel. Should be $25 at the most. _E_ #CelebrityApprentice Listening to the advice from @johnrich and @marleematlin adds another insight into the Final 4. #sweepstweet _E_ Go to Macy's today and buy Trump ties shirts suits and cufflinks as a Christmas or holiday present.Great style great price! ONLY THE BEST _E_ The Pledge #MakeAmericaGreatAgain __HTTP__ _E_ ....Because of the Democrats not being interested in life and safety DACA has now taken a big step backwards. The Dems will threaten "shutdown" but what they are really doing is shutting down our military at a time we need it most. Get smart MAKE AMERICA GREAT AGAIN! _E_ Interesting case from UK re @stellacreasy and abusive troll __HTTP__ _E_ Lyin' Crooked Hillary's email stories all have one thing in common. __HTTP__ _E_ I am in Ireland inspecting my great and very beautiful Atlantic Ocean property. It is one of the most spectacular hotels anywhere! DOONBEG _E_ Whether you like it or not the Russians did a great job in hosting the Olympics! Remember when Obama went to Europe to get Olympics fourth. _E_ We will remain fully engaged w/ open lines of communication as #HurricaneHarvey makes landfall. America is w/ you! @GovAbbott @FEMA @DHSgov __HTTP__ _E_ The worst negotiators in history (otherwise known as Republicans) have just offered to suspend debt ceiling for four months. Pathetic! _E_ Via @thehill by @HenschOnTheHill: "Trump: 'I'm disappointed' in many Republicans" __HTTP__ _E_ Everybody is raving about the Trump Home Mattress by @SertaMattresses. 
If you are looking for a mattress go buy (cont) __HTTP__ _E_ During primetime of the Iowa Caucus Cruz put out a release that @RealBenCarson was quitting the race and to caucus (or vote) for Cruz. _E_ The Eric Trump Foundation Golf Invitational benefiting St. Jude Children's Research Hospital is today and i... (cont) __HTTP__ _E_ Eventually but at a later date so we can get started early Mexico will be paying in some form for the badly needed border wall. _E_ Will be delivering a major speech tonight live on @oreillyfactor at 8:10pm from Pensacola Florida. _E_ For all of those fools that want to attack Syria the U.S.has lost the vital element of surprise so stupid could be a disaster! _E_ My @Yahoo 'Power Players' interview with @jonkarl Inside Donald Trump's new digs on Pennsylvania Avenue" __HTTP__ _E_ Looking forward to being at the convention tonight to watch all of the wonderful speakers including my wife Melania. Place looks beautiful! _E_ Next time you are waiting in an emergency room remember the Boston killer was rushed to intensive care within minutes of capture. _E_ I will be interviewed by @kimguilfoyle at 7pm on @FoxNews. #Enjoy! _E_ My @showbiztonight interview on @KhloeKardashian @ApprenticeNBC & my surprising TV career __HTTP__ _E_ ICYMI via @DMRegister by @JenniferJJacobs: "Donald Trump to give Iowa speech on education" __HTTP__ _E_ I had a great time in Des Moines Iowa tonight! Thank you for all of the support. #Trump2016 __HTTP__ __HTTP__ _E_ The Muslim brotherhood is sending tanks into the Sinai & saying it doesn't violate Camp David accord. _E_ Praying for everyone in Florida. Hoping the hurricane dissipates but in any event please be careful. _E_ Today is the first day of the rest of your life make the most of it! _E_ The Emmys are sooooo boring! Terrible show. I'm going to watch football! I already know the winners. Good night. 
_E_ I gave out the Male Athlete of the Year Award last night to my friend @MichaelPhelps—22 Olympic medals—a record that will never be broken. _E_ Debbie Wasserman Schultz is hard to watch or listen to no wonder our country is going to hell! _E_ Let the Arab countries take care of Egypt they have more to gain and plenty of money..It's time for the U.S. to stop being stupid.NO DOLLARS _E_ Today in history WrestleMania 23: I shave @VinceMcMahon's hair highest rated show in WWE history @WrestleFact __HTTP__ _E_ Has Barack Obama been caught red handed laundering money into his campaign from illegal online foreign donations? Media? _E_ My interview with Andy Dean on @americanowradio I told him what I really thought about the @FoxNews debate. __HTTP__ _E_ Obamacare is a disaster. Rates going through the sky ready to explode. I will fix it. Hillary can't!#ObamacareFailed _E_ Late last Friday @BarackObama announced his 2011 budget deficit was $1.299 trillion the second largest in US history. _E_ .@FoxNews should not put @KarlRove on—he has no credibility a bush plant who called all races wrong. _E_ Thank you Ohio! Together we made history – and now the real work begins. America will start winning again!... __HTTP__ _E_ Someone must be fired at @AOL for that stupid deal they made buying Huffington Post. _E_ The failing @NRO National Review Magazine has just been informed by the Republican National Committee that they cannot participate in debate _E_ Hopefully there won't be any problems in Baltimore tonight. Be calm be cool do not let anybody get hurt.There is just too much to live for! _E_ .@TrumpDoral. Thanks for the many nice statements and to the media and golf critics for the great reviews of the brand new BLUE MONSTER! _E_ A nation WITHOUT BORDERS is not a nation at all. We must have a wall. The rule of law matters. Jeb just doesn't get it. _E_ .@RichLowry is truly one of the dumbest of the talking heads he doesn't have a clue! 
_E_ Those who refuse to draw red line to Iran don't have the moral right to put a red line to @Israel. @IsraeliPM @netanyahu _E_ .@HillaryClinton is on the front page of the @nytimes waving to 200 people in New Hampshire. My crowd next door was 5000 people – no pic! _E_ ......@DailyCaller @BreitbartNews @DRUDGE_REPORT & @gatewaypundit. _E_ Congratulations to @thomtillis on winning @NCGOP Senate primary. Time for the party to unite and defeat ObamaCare advocate Kay Hagan! _E_ .@KeithUrban is excellent on American Idol—great touch solid guy! _E_ Just spoke to Governor Rick Scott. We are working closely with law enforcement on the terrible Florida school shooting. _E_ .@TheView T.V. show which is failing so badly that it will soon be taken off thr air is constantly asking me to go on. I TELL THEM NO _E_ All the hotels currently open in the Trump Hotel Collection have been nominated for Travel & Leisure's World's Best Awards 2011 ..... _E_ My interview this morning on Good Morning America with George Stephanopoulos __HTTP__ _E_ Canadians kicked out the firm that the U.S. paid all that money to for the failed website. How stupid are our leaders ? This is a scandal! _E_ North Korea is reliant on China. China could solve this problem easily if they wanted to but they have no respect for our leaders. _E_ While I won't be running for Governor of New York State a race I would have won I have much bigger plans in mind stay tuned will happen! _E_ Teams are making a big mistake not taking Johnny Manziel he is going to be really good (and exciting to watch). _E_ Thank you Sarah Let's have pizza in New York soon with you & your great family __HTTP__ _E_ It was a GREAT day for the United States of America! This is a great plan that is a repeal & replace of ObamaCare.... __HTTP__ _E_ He would be crazy to play in L.A. really bad coach who can't adjust to his players! _E_ Obama planted that @nytimes story on Iran so it will be discussed in tonight's debate. 
He wants Libya and China off the table. _E_ Great poll numbers just coming out of New Hampshire. BIG lead for Trump according to @CNN! _E_ Michele Bachmann got less than 1200 more votes in the Caucus than she did in the Ames Straw Poll. Very sad for her a nice woman! _E_ Checking out the course at TNGC Westchester and it is fantastic. Should be a great season. __HTTP__ _E_ RT @foxandfriends: Sen. John McCain making his return to the Senate ahead of health care vote __HTTP__ _E_ Congrats to @BreitbartNews' @mboyle1 on being awarded the prestigious 'Eagle Award for Amnesty Reporting' __HTTP__ _E_ I was not scheduled to be on the @oreillyfactor. Pure fiction! _E_ Via @AmSpec BY Jeffrey Lord: "Donald Trump was right on Ebola" __HTTP__ _E_ We can't destroy the competitiveness of our factories in order to prepare for nonexistent global warming. China is thrilled with us! _E_ Dopey Sugar @Lord_Sugar The wind turbines are ruining the beauty & majesty of Scotland... _E_ Thank you American Legion Post 610 for hosting @Mike_Pence & I for a roundtable with labor leaders. #LaborDay #MAGA __HTTP__ _E_ GOP now viewed more favorably than Dems in Trump era (per NBC/WSJ poll) via @HotlineJosh: __HTTP__ _E_ I have founded and run one of the largest real estate empires in the world. I employ thousands of people. Why am I the enemy? _E_ If the UN unilaterally grants the Palestinians statehood then the US should cut off all its funding. Actions have consequences. _E_ Be a cautious optimist. Call it positive thinking with a lot of reality checks. _E_ Republicans gave Obama a free pass to the White House they just don't get it. _E_ .@FLGovScott: Amazing race tremendous courage you deserved this win for a very old fashioned reason you have been a great governor! _E_ More thoughts on the debt ceiling in today's #trumpvlog... __HTTP__ _E_ Bill Clinton did a great job last night the Democrats are lucky to have him. Do you really believe he likes @BarackObama? 
_E_ RT @KatiePavlich: Your boss pardoned a traitor who gave U.S. enemies state secrets he also pardoned a terrorist who killed Americans. Spar... _E_ Even the once great Caesars is bankrupt in A.C. Others to follow. Ask the Democrat City Council what happened to Atlantic City. _E_ I will be on @SeanHannity @FoxNews tonight at 10pmE w/ @MELANIATRUMP from Wisconsin. Enjoy! #WIPrimary #Trump2016 __HTTP__ _E_ .@eagles should sit Michael Vick. He is a great athlete but less than average quarterback. _E_ "No government ever voluntarily reduces itself in size. So governments' programs once launched never disappear." – Ronald Reagan _E_ We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_ Crooked's stop in Johnstown Pennsylvania where jobs have been absolutely decimated by dumb politicians drew less than 200 with Bill VP _E_ I will be doing the A.L.S. Ice Bucket Challenge this morning on twitter. It is not something I look forward to doing but is for a good cause _E_ ISIS is starting its own currency. May be stronger than the dollar if ObamaCare is fully implemented. _E_ RT @transition2017: President elect Trump announces selections for Attorney General National Security Advisor CIA Director. More here: ht... _E_ I picked seven Super Bowl winners in a row & would have been right last night had the refs thrown the flag. _E_ Congrats to @Team_Mitch on winning a spirited primary. Great job Mitch. _E_ Vera Coking saved me "mucho" money by turning down my offer—thanks Vera! _E_ Together we can save American JOBS American LIVES and AMERICAN FUTURES! #Debates __HTTP__ _E_ . #RepMikeKelly Great job on @foxandfriends this morning. Thank you for the nice words! _E_ RT @DRUDGE_REPORT: REUTERS POLL: CLINTON TRUMP ALL TIED UP... __HTTP__ _E_ We are going to have a wild time in Alabama tonight! Finally the silent majority is back! __HTTP__ _E_ .@Team_Mitch Congratulations Mitch! 
_E_ I'll be playing golf tomorrow in Palm Beach at the number one rated golf course in the State of Florida Trump International Golf Club. _E_ .@ESPN's apology(Brent Musburger) was a disgrace to broadcasting stop being so politically correct! _E_ Don't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great and getting major things done! _E_ It was great being in Michigan. Remember I am the only presidential candidate who will bring jobs back to the U.S.and protect car industry! _E_ Crooked Hillary Clinton has not held a news conference in more than 7 months. Her record is so bad she is unable to answer tough questions! _E_ .@AP is doing very badly. I can say from experience their reporting is terrible & highly inaccurate. Sadly they are now irrelevant! _E_ Congratulations @TrumpNewYork for being named in @CNTraveler's Top 10 US Hotels for Business Travelers! __HTTP__ _E_ RT @foxandfriends: Report accuses material James Comey leaked to a friend contained top secret information __HTTP__ _E_ Video game violence & glorification must be stopped—it is creating monsters! _E_ Great even in SC tonight! Fire Marshall would not let everyone in 5000 turned away. Thank you for coming! _E_ In today's #trumpvlog I talk about how well Will Smith handled the situation with the reporter __HTTP__ _E_ Big wins in West Virginia and Nebraska. Get ready for November Crooked Hillary who is looking very bad against Crazy Bernie will lose! _E_ .@MittRomney's entire life and career have built prosperity and growth. _E_ Explain how the women on The View which is a total disaster since the great Barbara Walters left ever got their jobs. @abc is wasting time _E_ Thank you @ASavageNation and keep up the great work! _E_ Based on John Sweeney's lousy reputation we are airing large parts of the interview that were not shown enjoy! __HTTP__ _E_ Why would @greta use @KarlRove as an election analyst when he has made so many mistakes. 
He still thinks Romney won. An establishment dope! _E_ #CelebApprentice We had lots of fun last night with the live tweeting so I will do it again tonight from 8 10pm. _E_ My great honor to join our incredible men and women of the @USCG at the Lake Worth Inlet Station in Riviera Beach Florida today!#HappyThanksgiving __HTTP__ _E_ Big speech tonight in South Carolina 7:00 P.M. Tremendous crowd! _E_ Republicans must stop listening to dopes like @KarlRove who still insists Mitt Romney won the last election. Think big & think strong! _E_ Obama did much better than he did last time but still lost decisively. _E_ Why is @BarackObama spending millions to try and hide his records? He is the least transparent President ever and he ran on transparency. _E_ Entrepreneurs: Realize that becoming an entrepreneur is not a group effort. You're in charge. Everything starts with you. _E_ Lying Cruz put out a statement "Trump & Rubio are w/Obama on gay marriage. Cruz is the worst liar crazy or very dishonest. Perhaps all 3? _E_ Wacky Congresswoman Wilson is the gift that keeps on giving for the Republican Party a disaster for Dems. You watch her in action & vote R! _E_ Getting ready to leave @TrumpDoral and the brand new Blue Monster course it's unbelievable! _E_ #CrookedHillary is outspending me by a combined 31 to 1 in Florida Ohio & Pennsylvania. I haven't started yet! __HTTP__ _E_ Before Star Jones begged me to put her on The Apprentice she was "professionally dead." I saved her tiny... __HTTP__ _E_ "Yesterday's home runs don't win today's games." – Babe Ruth _E_ Via @businessinsider by @hunterw: "TRUMP UNLOADS: Hillary Clinton was 'the worst' and is 'extremely bad'" __HTTP__ _E_ A day after Greece burned @BarackObama released a $3.8 Trillion budget for 2013 with a $900 Billion deficit.He will turn America into Greece _E_ I've had enough of this good night! _E_ Biggest story today between Clapper & Yates is on surveillance. Why doesn't the media report on this? #FakeNews! 
_E_ Who ever heard of a legal conviction statement "more probable than not" against Tom Brady? Sue them Tom and make lots of $. @nfl _E_ .@ShawnJohnson have a great Easter you are a real champion! _E_ Glad to see my interview with Ronald Kessler @Newsmax_Media. Hopefully the @GOP can get the message. _E_ This is the single greatest witch hunt of a politician in American history! _E_ RT @DanScavino: WE LOVE OUR DEPLORABLES!!!#TrumpTrain #Debates2016 __HTTP__ _E_ State Treasurer John Kennedy is my choice for US Senator from Louisiana. Early voting today election next Saturday. _E_ Here is another CNN lie. The Clinton News Network is losing all credibility. I'm not watching it much anymore. __HTTP__ _E_ We will MAKE AMERICA SAFE & GREAT AGAIN! #Trump2016 #VoteTrumpSC __HTTP__ __HTTP__ _E_ Watch Miss USA 2013 Sunday night at 9 PM ET. Live from Planet Hollywood Las Vegas. __HTTP__ _E_ Why did Mitt Romney BEG me for my endorsement four years ago? _E_ Happy #Hanukkah __HTTP__ _E_ Via @nbc6: "@MissUniverse Pageant Coming to @TrumpDoral in 2015" __HTTP__ _E_ How is @VanityFair editor Graydon Carter allowed to run bad food restaurant Beatrice Inn? Fire Graydon! _E_ .@drmoore Russell Moore is truly a terrible representative of Evangelicals and all of the good they stand for. A nasty guy with no heart! _E_ My @CENTURY21 Super Bowl commercial __HTTP__ which aired during the third quarter. _E_ Entrepreneurs: Money is not always the bottom line: it can be a score card not the final score. _E_ .@TrumpLasVegas is Sin City's most elite destination. Treat yourself to Vegas' most luxurious hotel rooms __HTTP__ _E_ What a surprise! Newly released audit proves that the IRS only targeted Tea Party groups __HTTP__ _E_ Clinton commented in Ohio today that @MittRomney is right the economy has not been fixed under Obama.I always said Bill was an honest man. 
_E_ "DONALD TRUMP TO BILL MAHER: PAY UP" __HTTP__ via @BreitbartNews _E_ If @megynkelly stopped covering me on her show her ratings would drop like a rock! My h to h interview with @AC360 beat her by millions! _E_ Republicans must stop relying on losers like @KarlRove if they want to start winning presidential elections. Be tough and get smart! _E_ I have a lot of @Apple stock and I miss Steve Jobs. Tim Cook must immediately increase the size of the screen... __HTTP__ _E_ Best of luck to my good friend Derek Jeter on his first game today back at shortstop. @Yankees Captain is a warrior & winner. _E_ ALso coming up: The Celebrity Apprentice returns. Sunday night March 6 at 9 pm EST __HTTP__ apprentice/ _E_ Just finished the wonderful event on the U.S.S. Iowa. VETERANS FOR A STRONG AMERICA endorsed me. Such a great honor thank you! _E_ I will be doing The Howard Stern Show at 7 a.m. (10 minutes). Always fun and interesting talking to Howard! _E_ Remember when @ariannahuff ran for Governor of California. She got 3 votes. _E_ Via Huffington Post Congrats America! Donald Trump Is Now A 2016 Presidential Front runner __HTTP__ by Igor Bobic _E_ The object of golf is not just to win. It is to play like a gentleman and win. Phil Mickelson _E_ Had a special visitor in my office yesterday for @TIME photo shoot. __HTTP__ _E_ .@GOP's election loss and failed negotiations will serve as a case study in how third parties come about. _E_ New York Magazine just named the most influential tweeters in N.Y. and one Donald Trump was #2 after ESPN. Actually I'm easily #1! _E_ The worst employee in today's #trumpvlog... __HTTP__ _E_ Why would anyone in Florida vote for lightweight Senator Marco Rubio. Check out his credit card scam his house sale & his no show voting! _E_ Myrtle Beach South Carolina #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Why are armed drones being released over our homeland by the Government? __HTTP__ Seems excessive. 
_E_ Ring in 2015 in downtown New York's most elite 5 Star hotel. @TrumpSoHo offers 46 luxurious stories of excellence __HTTP__ _E_ It makes me feel so good to hit sleazebags back much better than seeing a psychiatrist (which I never have!) _E_ Our online campaign store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_ .@janinegibson __HTTP__ _E_ It is snowing in Jerusalem and across Lebanon. Global warming! _E_ Congratulations to Jeb Hensarling & Republicans on successful House vote to repeal major parts of the 2010 Dodd Frank financial law. GROWTH! _E_ Every day St. Maarten loses vital tourism dollars due to the incompetence of PM Sarah Westcot Williams. @PrimeMinisterSX _E_ RT @PressSec: The Trump effect: "The U.S. economy is running at its full potential for the first time in a decade" WSJ __HTTP__ _E_ Congratulations to @jdickerson of Face the Nation on his highest ratings in 15 years. 4.6 million people watched my interview! Thank you! _E_ Looking forward to keynoting @ChesterfieldGOP Lincoln Reagan Gala this Friday at The Country Club at The Highlands. Sold out record crowd! _E_ Just released that international gangs are all over our cities. This will end when I am President! _E_ Realize that being an entrepreneur is not a group effort. You're in charge. Everything starts with you. _E_ Champion @bretmichaels is back competing in the upcoming All Star @CelebApprentice. Premiere is March 3rd on @NBC at 9 p.m. EST. _E_ .@BBC should never have played that piece of garbage documentary & yet the phones are ringing off the hook to play the course. _E_ Trump National Golf Club Washington D.C. is located on 600 acres and fronts the Potomac River. Spectacular! __HTTP__ _E_ I can't believe that @CNN would waste time and money with @smerconish he has got nothing going. Jeff Zucker must be losing his touch! 
_E_ More Anti Catholic Emails From Team Clinton: __HTTP__ __HTTP__ _E_ .@hardball_chris' very small audience is shrinking rapidly because people finally understand that he is very very dumb! _E_ Thank you for the endorsement Coach Bobby Knight! I will never forget it! __HTTP__ __HTTP__ _E_ My Administration has identified three major priorities for creating a safe modern and lawful immigration system: fully securing the border ending chain migration and canceling the visa lottery. Congress must secure the immigration system and protect Americans. __HTTP__ _E_ Don't believe the millions of dollars of phony television ads by lightweight Rubio and the R establishment. Dishonest people! _E_ Be a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_ Obama has exempted businesses his staff and all of Congress from ObamaCare. Why is he still forcing the monstrosity on the U.S.? _E_ Thank you Buffalo! #NYPrimary __HTTP__ __HTTP__ _E_ .@GovMikeHuckabee was great the other night. People love him. _E_ People ask me what I do in my free time. The answer I don't have any. _E_ I love you Arizona! Thank you!#Trump2016 #AmericaFirst __HTTP__ _E_ You are doing a great job the world is watching! Be safe. __HTTP__ _E_ It's amazing how badly the Knicks and Nets are playing. Everybody predicted they would be top teams with all of the money spent. Too bad! _E_ It all begins today WE WILL FINALLY TAKE OUR COUNTRY BACK AND MAKE AMERICA GREAT AGAIN! _E_ .@TrumpLasVegas' 7th floor provides the most urbane feel in Las Vegas w/private air conditioned cabanas & a massive 110 ft. heated pool. _E_ #CelebApprentice stay tuned for the 2nd half we have one more firing tonight! _E_ Now I know that Yahoo is in good hands. It took great courage for @marissamayer to take away the right of employees to work at home. _E_ .@KatrinaCampins You were absolutely great on @CNN! Thank you. 
_E_ 3rd rate writer Vicky Ward who begged me for help see her letters to me. __HTTP__ _E_ Watch @ FoxNews' @ShannonBream @LisWiehl & former prosecutor Doug Burns destroy ridiculous lawsuit __HTTP__ _E_ You can watch 360 video live from the podium! __HTTP__ #RNCinCLE #TrumpIsWithYou #MakeAmericaGreatAgain _E_ Great to see @RedSox win big yesterday. Good for Boston and the country. Yesterday we were all @RedSox fans. _E_ If my offer is refused every undecided OH voter will be fully aware that Obama denied $5M to charity all because he is hiding something! _E_ Many political pundits are using the term Art of the Deal .... they should thank me. That is my term and book title. _E_ The sad truth is some Republicans in Congress are clueless when it comes to negotiation. #TimeToGetTough _E_ Good morning America! Thank you for all of your support in the latest Drudge poll! __HTTP__ __HTTP__ _E_ Will be interviewed by @SarahPalinUSA tonight at 10:00 on OAN Network. Enjoy! _E_ Wow the ratings are in and Arnold Schwarzenegger got swamped (or destroyed) by comparison to the ratings machine DJT. So much for.... _E_ I have nothing to do with Atlantic City sold years ago (great timing). For losers and haters I NEVER went bankrupt. Plus $10 billion sorry _E_ Don't forget to tune in tonight for another exciting episode of The Apprentice 10 p.m. on NBC. _E_ .@davidaxelrod I'm sending you a check to help find a cure. @IvankaTrump says hi. _E_ Like Al Sharpton @DonnyDeutsch apologized to me for calling me a racist on @todayshow apology accepted! _E_ Let this be the day you go for your dream. Focus don't give up and only accept total and complete victory. You can do it! _E_ Can you imagine how embarrassing it would have been for the country if the candidates actually did get into a fist fight? _E_ Texas Georgia & many more VOTE EARLY! This is a movement!#Trump2016 VOTE VIDEO: __HTTP__ __HTTP__ _E_ Donald Trump Sends @FallonTonight to Highest Friday Rating in 18 Months. 
@JimmyFallon that is #HUGE! __HTTP__ _E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ I'm with YOU! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Hopefully the Republican National Committee can straighten out the total mess that is taking place in Virginia's Republican Party. FAST! _E_ Watch Eric at 9 am (EST) today on Fox 5 w/ @rosannascotto and David Price to discuss Eric Trump Foundation's $20 million donation to St Jude _E_ Bill Hemmer of @FoxNews was very nice in explaining the excitement and energy in the arena. More than in past years. _E_ Sorry folks got to go to work now but I'll be baaaaack ! _E_ I would like to thank Reince Priebus for his service and dedication to his country. We accomplished a lot together and I am proud of him! _E_ "Trump the orator outlines the greatness of America to Democrats' disgust" __HTTP__ _E_ RT @VP: Our President is choosing to put American jobs American consumers American energy and American industry first. __HTTP__ _E_ Congrats to @msnbc for firing Martin Bashir—don't feel badly he didn't get ratings anyway. @SarahPalinUSA _E_ While the Pres. of Iran tweets sweet nothings to Obama he forbids the Iranians to use twitter. Very revealing. _E_ "Don't fight the problem decide it." – General George C. Marshall _E_ Too busy playing golf? @BarackObama sends form letters with an electronic signature to the parents of fallen SEALs __HTTP__ _E_ Congratulations to @Graeme_McDowell and @kristinstape. Your baby has seriously good genes will be a champ! _E_ Nobody should be allowed to burn the American flag if they do there must be consequences perhaps loss of citizenship or year in jail! _E_ How does @michellemalkin get a conservative platform? She is a dummy just look at her past. _E_ of position. Then separately she stated He said something truly horrifying ... 
he refused to say that he would respect the results of _E_ Go to __HTTP__ to help my friend Scott Brown take back our Senate. _E_ Such an honor to have my good friend Israel PM @Netanyahu join us w/ his delegation in NYC this afternoon. #UNGA __HTTP__ __HTTP__ _E_ RT @RightlyNews: @realDonaldTrump @LouDobbs Trust in the media is at the lowest level in all of U.S. history. The American people see throu... _E_ Thank you America! #MAGA __HTTP__ __HTTP__ _E_ Isn't it interesting that the tragedy in Paris took place in one of the toughest gun control countries in the world? _E_ My thoughts condolences and prayers to the victims and families of the New York City terrorist attack. God and your country are with you! _E_ Obama wanted to meet with the Iranian president yet the Iranians denied the request. So much for Hope & Change. _E_ Dear @MaraLiasson I greatly appreciate your fairness. My history shows I never disappoint. Looking forward to meeting you soon. _E_ The Republicans owe an apology for blowing the 2012 election. How could they lose to Obama?! _E_ He @BarackObama is caught on tape making election promises to @MedvedevRussiaE on missile defense and national security __HTTP__ _E_ 'Clinton Ally Aided Campaign of FBI Official's Wife' __HTTP__ _E_ We must build a wall to secure our border. It will save lives and help Make America Great Again! __HTTP__ _E_ Via @globegazette by John Skipper: North Iowan says Trump serious about POTUS run but he'll have to prove it __HTTP__ _E_ The NFL has just barred ball carriers from using helmet as contact. What is happening to the sport? The beginning of the end. _E_ We've all wondered how Hillary avoided prosecution for her email scheme. Wikileaks may have found the answer. Obama! __HTTP__ _E_ .@TrumpChicago's Spa at Trump® offers 12 treatment rooms & 53 spa guestrooms overlooking the Chicago skyline __HTTP__ _E_ Congrats to R. Emmett Tyrrell Jr of @AmSpec for the fantastic piece on Benghazi. 
_E_ ObamaCare's tax credit is underperforming by over 95% creating an even bigger cost to the debt __HTTP__ It must be repealed! _E_ The truth is a beautiful weapon. __HTTP__ _E_ New book by @ericbolling is absolutely terrific and a must read! #WakeUpAmerica _E_ The Republicans are funding ObamaCare and Amnesty. Obama beats them. __HTTP__ _E_ Providing backstage commentary at the Miss USA Pageant will be comedic mother daughter duo Joan and Melissa Rivers. A fantastic lineup! _E_ .@CNN & @CNNPolitics Please thank Alisyn Camerota David Chalian and John King for the very professional reporting of the new CNN Poll. _E_ People are always asking me about the very special word CONFIDENCE. The fact is there is (almost) nothing like it. Is derived from winning! _E_ .@FLOTUS Melania and I were honored to welcome Argentina President @MauricioMacri and First Lady Juliana Awada to t... __HTTP__ _E_ I hope Republican Senators will vote for Graham Cassidy and fulfill their promise to Repeal & Replace ObamaCare. Money direct to States! _E_ Purchase your copy of CRIPPLED AMERICA now & be on potential call list for my live streaming signing event tonite. __HTTP__ _E_ Wishing you and yours a very Happy and Bountiful Thanksgiving! _E_ A nation that cannot control its borders is not a nation. President Ronald Reagan _E_ Any negative polls are fake news just like the CNN ABC NBC polls in the election. Sorry people want border security and extreme vetting. _E_ Great crowd in Fletcher North Carolina thank you! Heading to Johnstown Pennsylvania now! Get out on November 8th... __HTTP__ _E_ Little Michael Bloomberg who never had the guts to run for president knows nothing about me. His last term as Mayor was a disaster! _E_ The class warfare being played by @BarackObama is the only way he can get reelected. He can't have America focus on his horrendous record. _E_ The liberal clown @ariannahuff told her minions at the money losing @HuffingtonPost to cover me as enterainment. 
I am #1 in Huff Post Poll. _E_ Great poll! Thank you North Carolina! #VoteTrumpNC on 3/15!Trump 36%Cruz 18%Rubio 18%Carson 10%Kasich 7%Via @SurveyUSA _E_ Wow Crooked Hillary was duped and used by my worst Miss U. Hillary floated her as an angel without checking her past which is terrible! _E_ Have you been to the @TrumpGrill in the Trump Tower Atrium? Best meatloaf in the City my mother's famous recipe. 212.836.3249 _E_ Enjoyed watching @ericbolling $ @SarahPalinUSA's @FoxNews special #PainatthePump over the weekend. (cont) __HTTP__ _E_ Those who believe in tight border security stopping illegal immigration & SMART trade deals w/other countries should boycott @Macys. _E_ My @morning_joe int. w/@morningmika @JoeNBC & @ThomasARoberts f/@trumpdoral on why Romney shouldn't be @GOP nominee __HTTP__ _E_ Persistence is a key for success. Don't give up. Continue to Think Big and you will be able to close deals. _E_ A very interesting take from @KatiePavlich: __HTTP__ _E_ Top Clinton Aides Bemoan Campaign 'All Tactics' No Vision: __HTTP__ _E_ The Washington Establishment will never rein in government spending waste fraud and abuse. A great thinker and outsider is needed. _E_ Thank you @JebBush you finally get it! __HTTP__ _E_ I will be on @SpecialReport with @BretBaier tonight at 6PM. __HTTP__ _E_ While @BarackObama is obsessed with 'green collar jobs' blue collar workers aren't buying it. (cont) __HTTP__ _E_ Spoke to President Xi of China to congratulate him on his extraordinary elevation. Also discussed NoKo & trade two very important subjects! _E_ From rags to riches and back to rags! __HTTP__ _E_ .@DianneG @WCNC To the "news bigs" elevate Dianne Gallagher immediately—she is terrific! _E_ The Intelligence briefing on so called Russian hacking was delayed until Friday perhaps more time needed to build a case. Very strange! _E_ Americans may no longer have access to their family doctors because of Obamacare. 
__HTTP__ via @Newsmax_Media _E_ Amazon is doing great damage to tax paying retailers. Towns cities and states throughout the U.S. are being hurt many jobs being lost! _E_ Thanks to @BarackObama rejecting the Keystone XL pipeline China has become Canada's biggest oil consumer. China is laughing at us! _E_ .@andersoncooper Anderson—Thank you for being so fair with your reporting & story last night. Greatly appreciated! _E_ .@nbc has increased @ApprenticeNBC to 2 hours until the end of the season full 2 hour episodes starting at 9 PM EST _E_ .@danabrams Dan of course stories on me do well. Glad you have found a medium you can actual do well on. TV was not your forte. _E_ Keep the big picture in mind. There are always opportunites & possibilities & thinking too small can negate a lot of them. _E_ I will be LIVE tweeting tomorrow (MONDAY) nights TWO shows starting at 8:00 P.M. They are both great. _E_ RT @seanhannity: Watch: Donald Trump OWNS A Heckler Who Said Illegal Immigrants Are The Backbone Of America __HTTP__ _E_ Many people are equating BREXIT and what is going on in Great Britain with what is happening in the U.S. People want their country back! _E_ Entrepreneurs: Problems are a mind exercise. Enjoy the challenge. _E_ I will be on @marklevinshow at 8PM tonight. Tune in! _E_ The #WomenWhoWork campaign from @IvankaTrump __HTTP__ ... _E_ Reading @nytdavidbrooks of the NY Times is a total waste of time he is a clown with no awareness of the world around him dummy! _E_ I dream for a living. Steven Spielberg _E_ Many people have been asking me to answer questions. You can ask me questions at any time. #TrumpQandA _E_ If @HillaryClinton is president she'll be all talk and nothing will get done. #Debate #BigLeagueTruth _E_ Straighten out The Republican Party of Virginia before it is too late. Stupid! RNC _E_ The new edition of The Apprentice will be on Thursdays this fall at 10 pm ET I'm putting people back to work! 
_E_ .@megynkelly Sorry there was only one breakout star this weekend in New Hampshire. Just check out the local New Hampshire media! _E_ Everybody that loves the people of New York and all they have been thru should get hypocrites like Ted Cruz out of politics! _E_ Loved being with my many friends in Tennessee. The crowd and enthusiasm was fantastic. I won the straw poll big! _E_ Republican Senators are working very hard to get there with no help from the Democrats. Not easy! Perhaps just let OCare crash & burn! _E_ I am going to expand the definition of LOBBYIST so we close all the LOOPHOLES! #DrainTheSwamp __HTTP__ _E_ Wow the Supreme Court passed @ObamaCare. I guess @JusticeRoberts wanted to be a part of Georgetown society more than anyone knew. _E_ Wow I am ahead of the field with Evangelicals (am so proud of this) and virtually every other group and Ben Carson just took a swipe at me _E_ Looking forward to touring the @sigsauerinc world headquarters tomorrow! One of the top gun manufacturers in the US! #GunRights #TCOT _E_ Oppressive regimes cannot endure forever and the day will come when the Iranian people will face a choice. The world is watching! __HTTP__ _E_ With President Obama it's all talk and no action. Our country is in desperate need of smart and decisive leadership before it is too late! _E_ Many people in our Country are asking what the "Justice" Department is going to do about the fact that totally Crooked Hillary AFTER receiving a subpoena from the United States Congress deleted and "acid washed" 33000 Emails? No justice! _E_ To the three UCLA basketball players I say: You're welcome go out and give a big Thank You to President Xi Jinping of China who made..... _E_ While on FAKE NEWS @CNN Bernie Sanders was cut off for using the term fake news to describe the network. They said technical difficulties! _E_ Rick Perry is right when he says we must stand by Israel in the UN. 
_E_ If China didn't play games with its currency and we played on a level economic playing field we could easily (cont) __HTTP__ _E_ Part of Obama's new found confidence is that the Republicans aren't using their power of ideas properly or effectively. _E_ My interview last night with Greta on Fox News __HTTP__ _E_ Thank you Arizona! #VoteTrump __HTTP__ _E_ The con artists changed the name from GLOBAL WARMING to CLIMATE CHANGE when GLOBAL WARMING was no longer working and credibility was lost! _E_ I predict that dying @UnionLeader newspaper which has been run into the ground by publisher Stinky Joe McQuaid will be dead in 2 years! _E_ Hillary Clinton: 'Architect of failure'#DrainTheSwamp #CrookedHillary __HTTP__ _E_ Congress must pass a budget and hold Obama to it. No more continuing resolutions and no more excuses. Republicans soon hold both houses. _E_ I am in Kansas. Will be an exciting day. Big speech this morning in Wichita and then go to caucus. Sorry CPAC (the format was fine!). _E_ The Time Magazine list of the 100 Most Influential People is a joke and stunt of a magazine that will like Newsweeksoon be dead. Bad list! _E_ Via @BBCNews Trump begins renewables mission in Scotland __HTTP__ _E_ What will be the response on Wednesday? If Obama doesn't take the 5 million dollars for charity. _E_ Join me in Pittsburgh tonight at 7pmE! #Trump2016 #TrumpTrainTickets: __HTTP__ _E_ Look for good ideas outside of your own areas of expertise. Find innovations approaches and practices that you could adapt in your field. _E_ Terrible jobs report just reported. Only 38000 jobs added. Bombshell! _E_ The military threat from China is gigantic and it's no surprise that the Communist Chinese government lies (cont) __HTTP__ _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ #MakeAmericaGreatAgain __HTTP__ _E_ Captured or not all our soldiers are heroes! _E_ I employ many people in the State of Virginia JOBS JOBS JOBS! 
Crooked Hillary will sell us out just like her husband did with NAFTA. _E_ Despite previous tweet Dennis Rodman would do a better job than the current (cont) __HTTP__ _E_ Will the Benghazi terrorist use the videotape as a defense? If so will Obama apologize to him? _E_ In some ways it is sad. We all wanted @BarackObama to succeed. It's not worked out that way. _E_ My two sons Eric & Don have long been expert hunters & marksmen @NRA. They go on safaris & give animals to the poor & starving villagers! _E_ It is time to get out of Afghanistan. We are building roads and schools for people that hate us. It is not in our national interests. _E_ ...What about all of the Clinton ties to Russia including Podesta Company Uranium deal Russian Reset big dollar speeches etc. _E_ After Friday's Twilight release I hope Robert Pattinson will not be seen in public with Kristen she will cheat on him again! _E_ I hear @glennbeck is in big trouble. Unlike me his viewers & ratings are way down & he has become irrelevant—glad I didn't do his show. _E_ .@BillGates and @JimBrownNFL32 in my Trump Tower office yesterday two great guys! __HTTP__ _E_ My interview w/ @nbc6 re: @CadillacChamp & my $200M of future renovations invested in Trump @DoralResort __HTTP__ _E_ SCARY $6T in debt and $1T annual budget deficits later @BarackObama is asking for more time to fix the economy __HTTP__ _E_ Weakness of attitude becomes weakness of character. Albert Einstein _E_ Q/A @saychowder I receive a great many requests for interviews nationally and internationally. _E_ Thank you @CrainsChicago for featuring @TrumpChicago in your list of Best Private Dining Rooms in Chicago. __HTTP__ _E_ ...Those stupid people bought @mcuban's company (of which he owned a piece). _E_ The travel ban into the United States should be far larger tougher and more specific but stupidly that would not be politically correct! 
_E_ I consider my health stamina and strength one of my greatest assets.The world has watched me for many years and can so testify great genes! _E_ Comey lost the confidence of almost everyone in Washington Republican and Democrat alike. When things calm down they will be thanking me! _E_ Why would Obama ever nominate someone for Sec. of Defense who opposes sanctions against Iran when Obama claims to support them? _E_ The @MittRomney healthcare plan post ObamaCare relies on consumer choices with more options __HTTP__ The perfect remedy! _E_ The sexual abuse that is so rampant has according to generals greatly weakened our military. They have failed to stop it. _E_ What do you think of @DennisRodman's Donald Trump head? The hair's not quite right for one thing. #CelebApprentice _E_ This is an incredible MOVEMENT WE are going to take our country BACK! #November8th #BigLeagueTruth #Debate __HTTP__ _E_ Dopey @Lord_Sugar—Look in the mirror and thank the real Lord that Donald Trump exists. You are nothing! _E_ Getting ready to open the magnificent Turnberry in Scotland. What a great day especially when added to the brave & brilliant vote. _E_ Mitt's subsequent rise in the polls post debate shows that the American public can still spot a real winner. _E_ Obama met with Chinese Premier Wen yesterday __HTTP__ and talked trade. The Chinese are robbing us blind be tough! _E_ Will @JebBush in his phony advertising campaign show himself asking me to apologize to his wife in the debate? _E_ RT @TeamTrump: We are going to be THRIVING again. @realDonaldTrump #BigLeagueTruth #Debates2016 __HTTP__ _E_ Speaking to a record crowd of over 20000 people in Charlotte Arena this Saturday morning—look forward to it! _E_ 122 vicious prisoners released by the Obama Administration from Gitmo have returned to the battlefield. Just another terrible decision! _E_ RT @DonnaWR8: .@POTUS #TRUMP & @FLOTUS🌺When ALL seemed HOPELESS...YOU brought HOPE!You INSPIRE us ALL!#MAGA #Harvey @Scavino45 #USA... 
_E_ IMO Manti Te'o was involved in a hoax for sympathy to get the Heisman Trophy. _E_ NYC's sole hammam The Spa at @TrumpSoHo offers classic treatments inspired by wellness rituals f/around the world __HTTP__ _E_ Thank you @LtStevenLRogers. We will respond to terrorism with strength in 2017! __HTTP__ _E_ Our economy cannot stay competitive with policies like these: @BarackObama is proposing over $90 Billion in new regulations. _E_ The massive Blue Monster @TrumpDoral is getting rave reviews. I built it in one year—no easy feat! _E_ Watch out. Champion @Joan_Rivers returns to the Boardroom as a judge in this week's All Star Celebrity @ApprenticeNBC. Don't cross her! _E_ .@deneenborelli Thank you for your nice words greatly appreciated. _E_ China court: Apple pays $60M to settle iPad case. China is getting away with murder. __HTTP__ _E_ Donald Trump: Jeb Bush's Support of Common Core 'a Disaster' __HTTP__ via @BreitbartNews by Dr. Susan Berry _E_ I'm going to D.C. today to check on the hotel I'm building on Penn. AVE. and then being honored by the Wharton School of Finance the BEST! _E_ Networks other than low ratings @CNN have been very fair and exciting! _E_ Hypocrite: @HillaryClinton is the single biggest beneficiary of Citizens United in history by far. #debate #bigleaguetruth _E_ Such a total miscarriage of Justice in San Francisco! __HTTP__ _E_ Obama's offer to Iran will not stop Iran's breakout capability. It is a bad desperate deal negotiated from weakness. Pass sanctions! _E_ Great interview tonight @donlemon very professionally done. @CNN _E_ My latest Celebrity Apprentice video blog... __HTTP__ _E_ Sorry banks when we accused lightweight AG Eric Schneiderman of not going after banks he started going after banks—but years too late! _E_ ObamaCare could eat up your raise __HTTP__ Why isn't Congress defunding it? They're obsessed with amnesty. 
_E_ Mark Levin's @marklevinshow 'The Liberty Amendments: Restoring the American Republic is a truly great & important book. _E_ The addition of the iconic Doral Resort to the Trump portfolio is one of the most exciting transactions __HTTP__ _E_ The fact that we are taking the Ebola patients while others from the area are fleeing to the United States is absolutely CRAZY Stupid pols _E_ Thank you @Morning_Joe for explaining to @CNN and @andersoncooper and so many others that I am leading in almost all national & state polls. _E_ Liberal press won't look into why Obama ignored security warnings for embassies but is obsessed with Romney's private comments. _E_ Looking forward to seeing Joe McQuaid Curtis Barry and my many friends in the Granite State! _E_ Why does @BarackObama have such a fascination with my plane? He is more than welcomed to come for a ride. _E_ Looking forward to press conference on taxes at 11AM at @TrumpTowerNY. _E_ Lets go America! Get out & #VoteTrump! #Trump2016#MakeAmericaGreatAgain!#SuperTuesday __HTTP__ __HTTP__ _E_ Obama will let Ebola fly into US & drugrunners cross our border daily. But he won't pressure Mexico on Sgt. Tahmooressi. #FreeOurMarine _E_ Chuck Hagel: Wrong For Defense __HTTP__ via @NewYorkObserver _E_ Oscar Pistorius only gets five years in prison for killing his girlfriend. Ridiculous decision! Judge couldn't even read her own writings. _E_ When the New York Times sold their beautiful long time building for peanuts & the buyer flipped it for a massive profit—they lost me! _E_ ...accountability say the Governor. Electric and all infrastructure was disaster before hurricanes. Congress to decide how much to spend.... _E_ .@TrumpNewYork is NYC's only @ForbesInspector 5 Star & @AAAnews 5 Diamond hotel w/a 5 Star & 5 Diamond restaurant __HTTP__ _E_ RT @FieldofFight: We Can Do Better We Must Do Better We Will Do Better By LTG (R) Keith Kellogg and LTG (R) Michael Flynn @GenFlynn __HTTP__ _E_ "No one remembers who came in second." 
Walter Hagen _E_ ObamaCare is a failure. Costs are rising much faster under Obama than other Presidents. _E_ Welcome to the @WhiteHouse Prime Minister @JustinTrudeau! __HTTP__ _E_ Looks like a lawsuit against GoAngelo won't work—my ties & shirts doing too well at Macy's he's actually helping. I have no damages! _E_ Glad to hear that @taylorswift13 will be co hosting the Grammy nominations special on 12.5. Taylor is terrific! _E_ "Mistakes are always forgivable if one has the courage to admit them." Bruce Lee _E_ Wow the Failing @nytimes said about @foxandfriends ....the most powerful T.V. show in America. _E_ Via @UnionLeader by @tuohy: "Trump: You're Hired" __HTTP__ _E_ Get out and vote West Virginia we will MAKE AMERICA GREAT AGAIN! _E_ I look forward to @MittRomney hitting Obama hard tonight for lying about Benghazi. CIA told Obama it was a terrorist attack after 24 hrs. _E_ #AMERICA FIRST! _E_ Saturday Night Live has some incredible things in store tonight. The great thing about playing myself is that it will be authentic! Enjoy _E_ Will be interviewed by @chucktodd on @meetthepress at 10:30 A.M. _E_ Time is on your side things do not continue downward forever. Think Big _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ There is no longer a Bernie Sanders political revolution. He is turning out to be a weak and somewhat pathetic figurewants it all to end! _E_ Can you imagine what Putin and all of our friends and enemies throughout the world are saying about the U.S. as they watch the Ferguson riot _E_ #TrumpAdvice __HTTP__ _E_ RT @foxandfriends: NYT editor apologizes for misleading tweet about New England Patriots' visit to the White House (via @FoxFriendsFirst) h... _E_ ...and borrow cheap! You will thank me someday. _E_ Word is that Ford Motor because of my constant badgering at packed events is going to cancel their deal to go to Mexico and stay in U.S. 
_E_ Isn't it crazy I'm worth billions of dollars employ thousands of people and get libeled by moron bloggers who can't afford a suit! WILD. _E_ I watched @todayshow this AM re: @MarthaStewart & dating. She looks terrific better than ever any guy would be lucky to be with her. _E_ I hear that sleepy eyes @chucktodd will be fired like a dog from ratings starved Meet The Press? I can't imagine what is taking so long! _E_ Board Room finale of this week's All Star @ApprenticeNBC will leave viewers wondering where the rest of the season goes...It's great! _E_ The terrorist came into our country through what is called the Diversity Visa Lottery Program a Chuck Schumer beauty. I want merit based. _E_ I don't think the voters will forget the rigged system that allowed Crooked Hillary to get away with murder. Come November 8 she's out! _E_ Trace delivers check to hospital in NYC: American Red Cross must be grateful to Trace and his team for their tremendous work. _E_ My response to the failing Des Moines Register the ultra liberal paper that has no power in Iowa __HTTP__ _E_ Leaving West Palm Beach Florida now heading to St. Augustine for a 3pm rally. Will be in Tampa at 7pm join me:... __HTTP__ _E_ Thank you @SahilKapur for the wonderful story. __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Just got great national poll numbers double digit lead! Thank you we will all MAKE AMERICA GREAT AGAIN! _E_ Trump at Tea Party __HTTP__ via @myrbeachonline _E_ Donald Trump donates land to conservation group in Palos Verdes __HTTP__ via @MyNewsLA _E_ Nobody would fight harder for free speech than me but why taunt over and over again in order to provoke possible death to audience. DUMB! _E_ To aspiring entrepreneurs: Be ready for problems. You'll have them every day. So remember to look at the solution not the problem. _E_ I will be interviewed on @MariaBartiromo @FoxBusiness at 7:30 _E_ Remember if you do not promote yourself no one else will. 
When you have success let people know about it. _E_ Unemployment for Black Americans is the lowest ever recorded. Trump approval ratings with Black Americans has doubled. Thank you and it will get even (much) better! @FoxNews _E_ George Will was a big Iraq fool. $2 trillion thousands of lives lost & we got nothing! Dummy. _E_ "Successful leaders see the opportunities in every difficulty rather than the difficulty in every opportunity." Reed Markham _E_ Rupert Murdoch Defends Trump: 'Complete Refugee Pause' Makes Sense' __HTTP__ _E_ Via @CNNMoney by @jtotoole: "U.S. taps Donald Trump to convert DC's Old Post Office into luxury hotel" __HTTP__ _E_ The original Apprentice returns with a two hour premiere on Thursday September 16th. Looking forward to a fantastic season! _E_ Crooked Hillary said that I couldn't handle the rough and tumble of a political campaign. ReallyI just beat 16 people and am beating her! _E_ Going to Scotland Ireland & other places in Europe to close up deals. Getting ready for the June 16th announcement @TrumpTowerNY! _E_ RT @IvankaTrump: Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote! #Election2... _E_ Alternatives are important but first Repubs must repeal ObamaCare. It's an unsustainable monstrosity that's destroying our healthcare. _E_ In the heart of midtown New York @TrumpTowerNY is a landmark which hosts tourists from the around the world daily __HTTP__ _E_ Big day at the United Nations many good things and some tricky ones happening. We have a great team. Big speech at 10:00 A.M. _E_ RT @IvankaTrump: My next project is pretty amazing...!xx Ivanka __HTTP__ __HTTP__ _E_ Kern County CA has secured $1.2B for windfarms __HTTP__ They also just secured more eagle deaths & low property values. _E_ People like lawyer Elizabeth Beck and failed writer Harry Hurt & others talk about me but know nothing about me—crazy! 
_E_ #TBT With the cast of GoodFellas __HTTP__ _E_ .@SarahPalinUSA did a great job @CPACnews. Much of what she said was plain old common sense. _E_ My @NewsRadio967 interview re Jeb Bush's absurd immigration comment & @Citizens_United @AFPhq Freedom Summit. __HTTP__ _E_ The horrible shooting that took place in San Bernardino was an absolute act of terror that many people knew about. Why didn't they report? _E_ Thank you Rep. @CynthiaLummis! __HTTP__ __HTTP__ _E_ A Rod hit ball hard first at bat. Time for him to step up and leave. _E_ Sometimes the best thing you can do is just let things ride let time go by. Donald J. Trump _E_ Dummy goAngelo keep letting people know how great my shirts ties and cufflinks (also Success) are at Macy's.The BEST now everyone's aware! _E_ I would have had many millions of votes more than Crooked Hillary Clinton except for the fact that I had 16 opponents she had one! _E_ Thank you for your nice words @MikeNeedham @Heritage for the nice words on @FoxNewsSunday with Chris Wallace. #FNS #Trump2016 _E_ Remember to watch the series finale of The Men Who Built America this Sunday at 8/7c on @History _E_ Go out and vote this will be the most important election of our time! _E_ I hope the Mexican judge is more honest than the Mexican businessmen who used the court system to avoid paying me the money they owe me. _E_ Do you believe this one Secretary of State John Kerry just stated that the most dangerous weapon of all today is climate change. Laughable _E_ Imitation is the sincerest form of flattery Huntsman goes Donald Trump __HTTP__ _E_ Polls close in 3 hours! Everyone get out and VOTE!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Thank you @SenatorFischer! #TrumpPence16 __HTTP__ _E_ Via @advisorsource: Donald Trump speaks in Novi drawing largest crowd in Oakland County Republican Party's history __HTTP__ _E_ Trump Nat'l Westchester is among the most highly regarded clubs in New York. A great place. 
__HTTP__ _E_ Thank you @oreillyfactor for your wonderful editorial as to why I should have been @TIME Magazine's Person of the Year. You should run Time! _E_ Newsweek ending print edition sad. Now my Newsweek covers mean nothing they lost all credibility. TIME to follow? _E_ Everyone's wondering what's wrong with A Rod. Not one sports writer blames it on his not being able to use drugs anymore the real reason. _E_ .@HillaryClinton has been doing this for THIRTY YEARS....where has she been? #BigLeagueTruth _E_ Large Block Grants to States is a good thing to do. Better control & management. Great for Arizona. McCain let his best friend L.G. down! _E_ Let Pete Rose into the Baseball Hall of Fame. It's time he has paid a big and very long price! _E_ I am at the @USGA #USWomensOpen. An amateur player is co leading for the first time in many decades very exciting! _E_ Something must be done with dopey @KarlRove he is pushing Republicans down the same old path of defeat. Don't fall for it Karl is a loser _E_ .@Megynkelly spent a big part of her show talking about other shows spending so much time on me. Really weird she's being driven crazy! _E_ Thank you Speaker @PRyan!#AmericaFirst #Trump2016 __HTTP__ _E_ .@hardball_chris became a super liberal Obama fan only because he must need the money and on @MSNBC that's the way it is. _E_ "Expand your life every day." –Donald J. Trump __HTTP__ _E_ Tonight @FLOTUS Melania and I were thrilled to welcome so many wonderful friends to the @WhiteHouse – and wish them all a very #HappyHanukkah __HTTP__ __HTTP__ _E_ "Borrowing and spending is not the way to prosperity." @PaulRyanVP _E_ .@AlexSalmond See photo __HTTP__ _E_ Wishing everyone a Happy Memorial Day Weekend with a special thought for all the veterans who have done so much for our freedom. _E_ Re Omarosa: Nasty tough or smart...or all? 
_E_ From: @Newsmax_Media: @realDonaldTrump: Public not Worried About @MittRomney's Tax Returns __HTTP__ _E_ Sitting at the foot of the Whitestone Bridge @TrumpFerryPoint is an 18 hole @jacknicklaus signature course __HTTP__ _E_ The new @DarKnightRises trailer is fantastic __HTTP__ Trump Tower stood in for Wayne Enterprises during filming. _E_ Hillary Clinton is being badly criticized for her poor performance in answering questions. Let us all see what happens! _E_ .....Ahead of schedule and under budget! Will be in Oklahoma tonight! _E_ Lots of pressure on Obama tonight even more than A Rod. If he doesn't perform well it could be over. _E_ Beauty arrives to Moscow's Crocus City Hall this 11.9.! On @nbc the world will watch @MissUniverse 2013 crowned __HTTP__ _E_ Success is not the key to happiness. Happiness is the key to success. If you love what you are doing you'll be a success. A. Schweitzer _E_ It's a plain fact: free trade requires having fair rules that apply to everyone. (cont) __HTTP__ _E_ The totally unexpected loss of Supreme Court Justice Antonin Scalia is a massive setback for the Conservative movement and our COUNTRY! _E_ Today on #NationalAgDay we honor our great American farmers & ranchers. Their hard work & dedication are ingrained... __HTTP__ _E_ Do you believe the way Karzai talks down to the United States zero respect! _E_ When Americans are free to thrive innovate & prosper there is no challenge too great no task too large & no goal beyond our reach. We are a nation of explorers pioneers innovators & inventors. We are nation of people who work hard dream big & who never ever give up... __HTTP__ _E_ RT @EricTrump: 2016 was such an incredible year for our entire family! My beautiful wife @LaraLeaTrump made it even better! 
__HTTP__ _E_ Many countries including allies already see China as world superpower __HTTP__ We have greatest military yet no respect _E_ Watch the @nbc video where @realmissnvusa is crowned as the 63rd @MissUSA __HTTP__ The Crowning Moment! _E_ In the 1950's our climate was far more unstable than it has been over the last 5 years. _E_ I loved Walter Cronkite one of the all time greats. He couldn't stand Dan Rather I agree with Walter. @DanRatherReport _E_ .@AlexSalmond the Scottish politician who released the terrorist who blew up Pan Am flight 103 over Lockerbie... _E_ Iran will soon take all of the oil in Iraq...and Iraq itself Keep the oil. _E_ With all of the recently reported electronic surveillance intercepts unmasking and illegal leaking of information I have no idea... _E_ Go Republican Senators Go! Get there after waiting for 7 years. Give America great healthcare! _E_ With @IvankaTrump and crew at the start of a new @DoralResort. __HTTP__ _E_ The US Navy wants to go green. Our Navy should use the best & most powerful fuel & not play games. Give me a break! _E_ Rosie is crude rude obnoxious and dumb other than that I like her very much! _E_ We could only get a small fraction of this 25k crowd in. The movement to Make America Great Again is unbelievable! __HTTP__ _E_ Looking forward to being honored at @citadelgop's Patriot Dinner with @SenatorTimScott in Charleston SC this Sunday __HTTP__ _E_ Big news—WOW—U.S. economy shrinks! _E_ Investors are visionaries in some respects they look beyond the present. _E_ Obama's statement that illegals "can't stay" = Obama's promise "if you like your healthcare plan you can keep it." _E_ Today I signed an Executive Order on Enforcing Statutory Prohibitions on Federal Control of Education. EO:... __HTTP__ _E_ Entrepreneurs: Absorb assess and then act. Don't negate your own power. Whatever you've been dealt know you can deal with it. 
_E_ My @foxandfriends interview discussing Obama's failed and dangerous foreign policy and the real unemployment numbers __HTTP__ _E_ I've known @hardball_chris for a long time & sadly he gets dumber each & every year & started from a very low base. _E_ RT @mitchellvii: Trump always ends up being right. It's almost a little freaky. _E_ What I would do on my first day in office. #MakeAmericaGreatAgainWatch: __HTTP__ __HTTP__ _E_ Real estate is always a great asset to own but especially now. Try to take advantage if you can and buy (cont) __HTTP__ _E_ Bernie Sanders is pushing hard for a single payer healthcare plan a curse on the U.S. & its people... _E_ Per @rushlimbaugh: Why does Hillary Clinton get the benefit of the doubt (after she DESTROYS her illegal email server) ... _E_ Six days and counting until my offer to Barack Obama expires... _E_ The Apprentice will be very exciting and interesting tonight at 8:00. Joan Rivers puts on a great show! _E_ The recent Kansas election (Congress) was a really big media event until the Republicans won. Now they play the same game with Georgia BAD! _E_ Wake Up America! See article: Israeli Science: Obama Birth Certificate is a Fake __HTTP__ _E_ Tonight's episode of The Apprentice is one you won't want to miss! Be sure to tune in 10 p.m. on NBC. _E_ The devastation left by Hurricane Irma was far greater at least in certain locationsthan anyone thought but amazing people working hard! _E_ If only Obama would treat @IsraeliPM @netanyahu with the same respect he awards tyrants. Very strange & dangerous for our national security. _E_ Who are our generals that are allowing this fiasco to happen right before our eyes. Call it the PLENTY OF NOTICE WAR _E_ Would anyone in the music industry treat a Democrat like this? @RealMeatLoaf is being punished for his political views __HTTP__ _E_ Why are we building a $1Billion embassy in Iraq when the country kicked us out didn't give us any oil & is about to get taken over by Iran? 
_E_ I'll be speaking on Thursday April 12 at the first ever National Achievers Congress at the San Jose Convention (cont) __HTTP__ _E_ RT @GOPLeader: .@POTUS made the right call in leaving a deal that would have put an unnecessary burden on the United States. __HTTP__ _E_ Thank you! CNBC #DebateNight poll with over 400000 votes. Trump 61%Clinton 39%#AmericaFirst #ImWithYou... __HTTP__ _E_ Just spoke to Governor Kenneth Mapp of the U.S. Virgin Islands who stated that #FEMA and Military are doing a GREAT job! Thank you Governor! _E_ I wonder if @BarackObama has promised Iran and China that he can be more flexible after his last election? _E_ Getting ready to leave for South Korea and meetings with President Moon a fine gentleman. We will figure it all out! _E_ I am in Virginia @RegentU Presidential forum with Dr. Pat Robertson beginning now! Watch here: __HTTP__ _E_ Big Republican Dinner tonight at Mar a Lago in Palm Beach. I will be there! _E_ The biggest thrill in the world is entertaining the public there is no bigger thrill than that. Vince McMahon @WWE _E_ Just read in the failing @nytimes that I was not aware the event had to be held in Cleveland a total lie. These people are sick! _E_ Wonderful meeting with Canadian PM @JustinTrudeau and a group of leading CEO's & business women from Canada and th... __HTTP__ _E_ Everybody wants to see and talk to Dennis Rodman he will be on Celebrity Apprentice tonight at 9. _E_ Trump Int'l Hotel & Tower New York has the perfect Manhattan location & @jeangeorges is the signature restaurant. __HTTP__ _E_ Keep the big picture in mind. There are always opportunities & possibilities & thinking too small can negate a lot of them. _E_ RT @RealBenCarson: Many people fight for change in DC. @realDonaldTrump is a leader with an outsider's perspective & the vision guts & ene... _E_ We look forward to making the Old Post Office in DC one of the great hotels of the World. 
__HTTP__ _E_ Many people have been asking to see my plane The Apprentice's @AmandaTMiller will give you a tour... __HTTP__ _E_ Heading to Trump National Doral to check the progress prior to the start of the Cadillac Championship on Thursday. I'll be there all week _E_ True. __HTTP__ _E_ The interview was great for @Oprah and terrible for Lance Armstrong! _E_ China's submarines will soon be carrying nukes __HTTP__ They will be sent to patrol our coasts Obama won't do anything. _E_ I know the Governors and Jeb Bush who has gone nasty with lies is by far the weakest of the lot. His family used private eminent domain! _E_ Robust Economic growth is the answer to the Medicare Problem not cuts on the elderly. _E_ Thank you New Mexico! #Trump2016 __HTTP__ __HTTP__ _E_ Home Sales hit BEST numbers in 10 years! MAKE AMERICA GREAT AGAIN _E_ Melania and I send our thoughts and prayers to Senator McCain Cindy and their entire family. Get well soon. __HTTP__ _E_ Will be doing @oreillyfactor tonight at 8pm. Enjoy! _E_ Hillary Clinton will use American tax dollars to provide amnesty for thousands of illegals. I will put... __HTTP__ _E_ I will be interviewed on @TODAYshow and Good Morning America at 7:00 A.M. _E_ Shirts and ties are doing great @Macys thanks! _E_ I cannot believe how well certain areas are doing relative to the U.S. There is no reason for this other than poor leadership.WE SHOULD BE 1 _E_ RT @Scavino45: President Trump pays respects and delivers #MemorialDay remarks at Arlington National Cemetery. __HTTP__ _E_ If you can't adapt to new situations then you will never be successful. Every change is a new opportunity to use your talent. _E_ .@Andre_Reed83. Congratulations Andre you deserve it! _E_ Public Policy Polling (PPP) has just come out with a major poll putting me #1 with Hispanics leading all Republican candidates.Told you so _E_ Did Crooked Hillary help disgusting (check out sex tape and past) Alicia M become a U.S. citizen so she could use her in the debate? 
_E_ The Fed's reckless policies of low interest and flooding the market with dollars needs to be stopped or we will face record inflation. _E_ Liberal SD Dem candidate Rick Weiland wants to expand ObamaCare to single payer & opposes Ebola travel ban. Send @RoundsforSenate to Senate! _E_ Mainstream media never covered Hillary's massive "hacking"or coughing attack yet it is #1 trending. What's up? _E_ People are finally beginning to hit China and OPEC. They never give me credit for being the first by far but that's okay! _E_ Some good news for New York – Weiner has dropped 12 points in the polls & that is before more of the pervert's old texts are released. _E_ President Obama close down the flights from Ebola infected areas right now before it is too late! What the hell is wrong with you? _E_ Plane was carrying those terrible lithium ion batteries which are highly combustible as cargo. Fire could have started in cockpit. _E_ With the record $200M renovations on track & budget (a miracle in DC) Trump Int'l Washington DC is being built into a national marvel. _E_ .@TheHill Trump on Boehner resignation: 'It's a good thing' __HTTP__ _E_ Have a great Good Friday and a Happy Easter. _E_ In the just released SC poll I increased my lead by 4 points since last poll by same firm. Up by 14! Cruz dropped 3. __HTTP__ _E_ Speech in Dallas went really well. Big and wonderful crowd. Just arrived in L.A. Big day tomorrow! _E_ The basketball coach at Rutgers looks bad but I had a coach who made him look like a baby coaches can be tough! _E_ My @SquawkCNBC interview discussing @BarackObama's #WHCD my Scotland property & @BarackObama using Bin Laden's death __HTTP__ _E_ Thank you! #Trump2016 __HTTP__ __HTTP__ _E_ The wonderful people of Puerto Rico with their unmatched spirit know how bad things were before the H's. I will always be with them! _E_ Yesterday in Iowa was amazing two speeches in front of two great sold out crowds. They love that I am the only candidate self funding! 
_E_ 'Podesta urged Clinton team to hand over emails after use of private server emerged' __HTTP__ _E_ Stock Market has increased by 5.2 Trillion dollars since the election on November 8th a 25% increase. Lowest unemployment in 16 years and.. _E_ The next ObamaCare disaster will be doctors being dropped from plans. _E_ Breitbart gets it! Vote now Obama should release his college application records & grades. He says he loves (cont) __HTTP__ _E_ Jodi Arias has stated that she follows me on twitter so I really hate to be saying that she is guilty but sadly she is as guilty as it gets _E_ .@VanityFair could come back if Graydon Carter paid as much attention as he does to his bad food restaurants. @CondeNastCorp _E_ Congratulations to @TrumpNewYork for being named #1 Best Business Hotel in NYC in @TravlandLeisure's 2014 World's Best Business Hotels. _E_ Thank you North Carolina get out & #VoteTrump on 11/8/2016!#MakeAmericaGreatAgain __HTTP__ _E_ Via @G_Liberty_Voice by Melody Dareing: "Donald Trump Wants to Build a Wall Between U.S. And Mexico" __HTTP__ _E_ Obama Putin Moscow meeting on 9.3 4 __HTTP__ On the agenda 2013 Trump @MissUniverse Pageant in Moscow on 11.9 on @nbc! _E_ When I bought the #MissUniverse pageant 13 years ago it was on life support... _E_ If everything seems under control you're just not going fast enough. Mario Andretti _E_ Jeb Bush George W and George H.W. all called to express their best wishes on the win. Very nice! _E_ 2004 VIDEO:Pocahontas describing Crooked Hillary Clinton as a Corporate Donor Puppet. Time for change! #Trump2016 __HTTP__ _E_ The French police are afraid to go into many communities. How did France let this all happen and how did the female terrorist ever escape? 
_E_ Where serenity meets luxury: Trump Nat'l Jupiter's Spa offers treatments which help restore youthful vitality __HTTP__ _E_ Via @BET: "Donald Trump Blasts Beyoncé for Suggestive Super Bowl Show" __HTTP__ _E_ Little @MacMiller—I have more hair than you do and there's a slight age difference. _E_ Scotland is having a virtual revolt over obsolete wind turbines which are driving up energy costs and killing the bird population (and more) _E_ Does anybody really want to throw out good educated and accomplished young people who have jobs some serving in the military? Really!..... _E_ Breaking ground shortly Trump Int'l Washington DC will bring the DC Post Office far beyond its original grandeur __HTTP__ _E_ President @BarackObama's vacation is costing taxpayers millions of dollars Unbelievable! _E_ Everyone is excited for @THEGaryBusey's return to All Star @CelebApprentice. Be warned this time Gary is even more insane! _E_ The State Department's 'shadow government' #DrainTheSwamp __HTTP__ _E_ New rule for @billmaher: check the law before you make a public absolute offer. _E_ If this doctor who so recklessly flew into New York from West Africahas Ebolathen Obama should apologize to the American people & resign! _E_ See you tomorrow Wisconsin!'Trump spurs small business optimism in Milwaukee area' __HTTP__ _E_ Check out Gray Line's site for the Donald Trump Ride of Fame... __HTTP__ _E_ Shows how dumb Joe McQuaid (@deucecrew) of the dying Union Leader is to put out the letter I wrote saying why I didn't do his failed debate! _E_ NYC's sole hamman the bi level @TrumpSoHo features indoor & outdoor relaxation lounges with luxury services __HTTP__ _E_ Terrible for the economy & middle class gas has now been over $3/gallon for a record 1245 days __HTTP__ FRACK NOW & FAST! _E_ Shocking over 92% of France who just elected a socialist for its new PM want @BarackObama re elected __HTTP__ _E_ Ray Kelly is the best Police Commissioner in NYC history. Keeping NYC safe thru vigilance. 
@RayKelly _E_ Be sure to watch #CelebApprentice on Sunday night at 9 pm on NBC. Another great episode! __HTTP__ _E_ Have you been watching how Saudi Arabia has been taunting our VERY dumb political leaders to protect them from ISIS. Why aren't they paying? _E_ .@McLaughlinGroup Greatly appreciate yr wonderful comments this weekend. People of "great accomplishment" should easily quality for prez. _E_ Come on @DannyZuker take the bet show your friends and family (& your bosses on Modern Family) that you're not chicken shit _E_ President Obama please take the $5M check for charity tomorrow. It is so easy and could do so much good! _E_ Thank you Hawaii! #Trump2016 _E_ Watched chief negotiator for Iran on @charlierose last night. He is far smarter than our reps—increase sanctions and walk! _E_ This is an outrage! Bias Free Language Guide claims the word 'American' is 'problematic' WHAT?! __HTTP__ _E_ Lets fight like hell and stop this great and disgusting injustice! The world is laughing at us. _E_ Lets go Pennsylvania! #VoteTrump __HTTP__ _E_ Congrats to great golfer @Frostpga on his big win last week. Always been best putter. Frost Wins for Trump _E_ Pay attention to details. If you don't know every aspect of what you're doing you're setting yourself up for some big surprises. _E_ Why didn't movie Lincoln use Ford's Theater for big scene instead of the stage of an unrelated theater? _E_ Can't wait for tonight's debate actually delayed my trip to Europe so I can watch. This is going to be a great night. _E_ Thank you @gawker! Call me on my cellphone 917.756.8000 and listen to my campaign message. _E_ Did anyone notice that Obama failed to get a coalition of other countries to go along with us. He couldn't even get Britain! NO LEADERSHIP. _E_ Our government now imports illegal immigrants and deadly diseases. Our leaders are inept. _E_ Unemployment has risen today and some other very bad news has just been reported the stock market is way down. 
_E_ My @piersmorgan interview on Snowden the traitor national security and China hacking us __HTTP__ _E_ Watch the 2011 #MissUniverse Pageant tonight at 9PM on NBC... __HTTP__ _E_ "Results are what matter...A series of efforts will add up to experience and achievement." Think Like a Champion _E_ #MakeAmericaGreatAgain #ImWithYou __HTTP__ __HTTP__ _E_ Drain the Swamp should be changed to Drain the Sewer it's actually much worse than anyone ever thought and it begins with the Fake News! _E_ .@georgewillf is perhaps the most boring political pundit on television. Got thrown off ABC like a dog. At Mar a Lago he was a total bust! _E_ "@NBCApprentice: And the fired celebrities are..." __HTTP__ via @ew by @DaltonRoss _E_ Thank you New York I will never forget! _E_ Please help @autismspeaks with their petition to the White House for a national strategy for the autism epidemic __HTTP__ _E_ If you entered our country illegally and are then granted amnesty why would you abide by other laws? No Amnesty! _E_ Rape is a huge problem in the U.S. military. Over 19000 rapes last year. _E_ Who will be the next @TheRealTeenUSA? Find out this Saturday at 8PM ET on missteenusa.com #TeenUSA _E_ The Democrats had to come up with a story as to why they lost the election and so badly (306) so they made up a story RUSSIA. Fake news! _E_ Bad sign for Obama's campaign now publicly admitting they are focused on 4 states. Their internals must be horrendous. _E_ So many people think I will not run for President.Wow I wonder what the response will be if I do. Even the haters and losers will be happy! _E_ From @FoxNews Bombshell: In 2016 Obama dismissed idea that anyone could rig an American election. Check out his statement Witch Hunt! _E_ .@GovernorPerry just gave a pollster quote on me. He doesn't understand what the word demagoguery means. _E_ Thank you to all of the men and women who have served our country. You are our true heroes! 
#ArmedForcesDay __HTTP__ _E_ #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ It is so nice that the shackles have been taken off me and I can now fight for America the way I want to. _E_ It's clear to me that @teresa_giudice needs some lessons in negotiation #sweepstweet _E_ Maybe @THEGaryBusey should stick to words... vs. barking. He's got a definite talent when he wants to use it. #CelebApprentice _E_ Ready to get mad?! We are sending foreign aid to China our greatest threat __HTTP__ We are financing our enemy. _E_ Thank you Worcester Massachusetts!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_ Just arrived in Wisconsin to discuss JOBS JOBS JOBS! #MAGA __HTTP__ _E_ 'Jeff Sessions a Fitting Selection for Attorney General' __HTTP__ _E_ It is time for the airline pilots flight attendants and the airlines themselves to stop flights to and from West Africa. Do it right now! _E_ .@GovernorPataki couldn't be elected dog catcher if he ran again—so he didn't! _E_ The dishonest media will NEVER keep us from accomplishing our objectives on behalf of our GREAT AMERICAN PEOPLE!... __HTTP__ _E_ Via @TIMEPolitics by @zekejmiller: "Trump To Visit New Hampshire" __HTTP__ _E_ Looking forward to speaking @nranews Convention in Nashville __HTTP__ The 2nd Amendment is a right not a privilege! _E_ I will be on @foxandfriends tomorrow morning at 7:15 Hope you enjoy and agree! _E_ THANK YOU to the amazing staff and their families of the United States Embassy in the Philippines. Keep up the GREAT WORK! __HTTP__ _E_ Will be leaving the Philippines tomorrow after many days of constant mtgs & work in order to #MAGA! My promises are rapidly being fulfilled. _E_ When will CNN do a segment on Hillary's plan to increase Syrian refugees 550% and how much it will cost? _E_ We have to get tough on China. For every one American child there are four Chinese. China is out to steal our (cont) __HTTP__ _E_ There was a major diplomatic breakthrough yesterday w/the White House Iran & China. 
All celebrated Chuck Hagel being voted in as SOD. _E_ Big win in the House very exciting! But when everything comes together with the inclusion of Phase 2 we will have truly great healthcare! _E_ Cryin' Chuck Schumer stated recently I do not have confidence in him (James Comey) any longer. Then acts so indignant. #draintheswamp _E_ #USAatUNGA #UNGA __HTTP__ _E_ Terrible story on front page of NYTimes about lightweight @AGSchneiderman __HTTP__ Does Eric wear eyeliner? _E_ Being tough doesn't mean being nasty difficult or unreasonable. It means being tenacious and refusing to give in or give up. _E_ The Russia hoax continues now it's ads on Facebook. What about the totally biased and dishonest Media coverage in favor of Crooked Hillary? _E_ No amnesty. Protect the rule of law! Let's Make America Great Again __HTTP__ _E_ All NYC needs is the mentally unstable Elliot Spitzer in office again. _E_ The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_ US trade deficit hit $64B+ in April 2 yr record high __HTTP__ We must do better. China is ripping us. Bring the jobs home! _E_ .@Betsy_McCaughey Thanks so much. Really appreciate your comments. I will help the veterans like no one else. __HTTP__ _E_ Why aren't the Democrats speaking about ISIS bad trade deals broken borders police and law and order. The Republican Convention was great _E_ RT @realDonaldTrump: Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our danger... _E_ What's the primary ingredient for success? Passion. You have to love what you're doing or you won't get too far. _E_ Wow Eliot Spitzer has lost great news for New York City! _E_ .@BillMoyers is a liberal hack whose career is being laid to rest @PBS. Here Moyers coddles @JeremiahWright __HTTP__ _E_ Speaker @johnboehner seems to have gained strength in house—a good thing! 
_E_ Why do we continue to sit idly by while China steals our national security and corporate secrets? China is an enemy not a friend. _E_ Work is expected to begin today on my golf course in Scotland. It will be spectacular! __HTTP__ _E_ Defense Sec.Hagel has quit. Great news for our country. The guy didn't have a clue—grossly outmatched by our enemies. Couldn't even speak _E_ See the amazing views from @TrumpGolfLA located directly on the Pacific Ocean __HTTP__ _E_ Almost every major dealmaker has used the bankruptcy laws as a business tool. Icahn Black Zell—but nobody says they went bankrupt! _E_ "Always bear in mind that your own resolution to succeed is more important than any other." – Abraham Lincoln _E_ Many people look at successful people & don't see anything but the end result. They don't see all the work that went into getting there. _E_ #WeeklyAddress __HTTP__ _E_ South Korea is absolutely killing us on trade deals. Their surplus vs U.S. is massive and we pay for their protection. WHO NEGOTIATES? _E_ Today I am here to offer a renewed partnership with America to work together to strengthen the bonds of friendship and commerce between all of the nations of the Indo Pacific and together to promote our prosperity and security. #APEC2017 __HTTP__ _E_ Good morning Ohio! Some additional information from my daughter @IvankaTrump! #VoteTrump #SuperTuesday __HTTP__ _E_ .@MacMiller's "Donald Trump" __HTTP__ just crossed 73.5 million views on @YouTube. You're welcome Mac! _E_ Thank you Brian France Bill Elliott @chaseelliott @DavidRagan & @RyanJNewman! #NASCAR #Trump2016 #VoteTrump __HTTP__ _E_ How bad has our leader made us look on Syria. Stay out of Syria we don't have the leadership to win wars or even strategize. _E_ Hope & Change since @BarackObama has taken office the US debt has increased by an average of $64K per taxpayer. 
_E_ Via @CBNNews by @TheBrodyFile: Brody File Exclusive: Donald Trump Comes Out In Support Of 20 Week Abortion Ban __HTTP__ _E_ RT @foxandfriends: Another Dem 'queasy' over claim of Loretta Lynch meddling in Clinton case __HTTP__ _E_ Designed by @jacknicklaus Trump Golf Links at Ferry Point's 18 hole course sits by the Bronx's Whitestone Bridge __HTTP__ _E_ How incompetent are our leaders allowing these Ebola infected people to come into our country with all of the problems and danger entailed! _E_ This is why @TimTebow is a winner. He lays everything out on the field. He never quits and never gives up. That's why he is a success. _E_ Why would Kim Jong un insult me by calling me old when I would NEVER call him short and fat? Oh well I try so hard to be his friend and maybe someday that will happen! _E_ Don't miss the #MissUniverse Pageant tonight at 8/7c with performances by @NickJonas @PrinceRoyce and @GavinDeGraw __HTTP__ _E_ A quote from the late great golfer Sam Snead: Practice puts brains in your muscles! THIS IS TRUE ALSO IN LIFE. _E_ Thank you Costa Mesa California! 31000 people tonight with thousands turned away. I will be back! #Trump2016 __HTTP__ _E_ Myself with mother and father at New York Military Academy. See I can be very military. High rank!... __HTTP__ _E_ The Russians are playing a very smart game. In the meantime they are buying lots of time for Syria and making U.S. look foolish. Dangerous! _E_ I am now going to the brand new Trump International Hotel D.C. for a major statement. _E_ Thank you Graham Ledger of the Daily Ledger @OANN for your really fair coverage and your great interview with Peter Roff of U.S. NEWS & W.R. _E_ Must read article in @washtimes: @RealSheriffJoe probe could dwarf Watergate __HTTP__ _E_ ObamaCare premiums rising 13.2% in 2015 __HTTP__ Elections have consequences! _E_ What is Frank VanderSloot getting for agreeing to back Marco Rubio? Last victim was Mitt Romney see how that turned out. 
_E_ "UPDATE: Trump plans public event at @WartburgCollege" __HTTP__ via @wcfcourier: _E_ I will miss Mike Wallace. He did a major interview with me for 60 Minutes and it was totally fair and balanced. (cont) __HTTP__ _E_ My @foxandfriends int. on the Zimmerman trial & verdict courage of the jury and reactions! __HTTP__ _E_ The habitual vacationer @BarackObama spent 9 days before the critical Super Committee deadline traveling. He failed to lead again. _E_ Why do the networks continue to put dopey @BillKristol on panels when he has called every single shot about me wrong for 2 yrs? _E_ .@CNN is unwatchable. Their news on me is fiction. Theyare a disgrace to the broadcasting industry and an arm of the Clinton campaign. _E_ This Sunday's All Star Celebrity @ApprenticeNBC has the most beautiful boardroom judges ever w/ @IvankaTrump & @MELANIATRUMP together! _E_ When will @TedCruz give all the New York based campaign contributions back to the special interests that control him. _E_ Germany is going through massive attacks to its people by the migrants allowed to enter the country. New Years Eve was a disaster. THINK! _E_ Get smart on knockout assaults and crime we have to be slightly more vicious (and violent) than the assaulter and crime would end FAST! _E_ We can't let this happen. We should march on Washington and stop this travesty. Our nation is totally divided! _E_ NYC politicians better stop pandering ending stop & frisk would be a disaster. __HTTP__ _E_ I'll be in Dallas at the American Airlines Center on Sept 14th at 6 PM. Will be great to be back in Texas. __HTTP__ _E_ RT @EricTrump: Nevada: Reminder that today is the LAST day to register to vote in the February 23rd caucus! __HTTP__ __HTTP__ _E_ Get to the essence immediately. Learn to economize. People appreciate brevity in today's world. Think Like a Champion _E_ Required reading 4 success in politics & life read @kimguilfoyle's book #MakingTheCase. Brilliant Advice ! 
__HTTP__ _E_ Wow President Obama just landed in Cuba a big deal and Raul Castro wasn't even there to greet him. He greeted Pope and others. No respect _E_ Thank you Washington! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_ Congratulations to John Roberts for making Americans hate the Supreme Court because of his BS __HTTP__ _E_ Entrepreneurs: Focus on your goals not on fixed patterns. Do what's necessary and what's unnecessary will be made clear. _E_ At some point and for the good of the country I predict we will start working with the Democrats in a Bipartisan fashion. Infrastructure would be a perfect place to start. After having foolishly spent $7 trillion in the Middle East it is time to start rebuilding our country! _E_ My @MorningJoe interview with @JoeNBC & @morningmika discussing the Newsmax @iontv debate and #TimeToGetTough __HTTP__ _E_ We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_ Via @DMRegister by @JenniferJJacobs: Trump: 'I would've won the race against Obama' __HTTP__ _E_ Sadly Democrats want to stop paying our troops and government workers in order to give a sweetheart deal not a fair deal for DACA. Take care of our Military and our Country FIRST! _E_ The so called 87 year old lady was a vicious and skilled investor who was trying to rip me off with made up facts and a blowhard lawyer. _E_ Dave Letterman @Late_Show said during my interview that Obama was probably born in the US the word probably is a disaster for Obama. _E_ This morning Chris Wallace has the best political show on television but that's only because I'm on it (kidding)! Have fun. _E_ .@CBSNews Poll WOW! New Hampshire TRUMP 38% CARSON 12% BUSH 8% South Carolina TRUMP 40% CARSON 23% CRUZ 8% Iowa TRUMP 27% CARSON 27% _E_ George Will may be the dumbest(and most overrated) political commentator of all time. If the Republicans listen to him they will lose. _E_ Morning Joe's weakness is its low ratings. 
I don't watch anymore but I heard he went wild against Rudy Giuliani and #2A sad & irrelevant! _E_ Wow @CNBC ratings are really low worst in many years. I guess I'll have to start doing my Tuesday morning interviews with them again! _E_ Obama just said @MittRomney was a very successful investor big mistake for Obama to admit he has less and less credibility. _E_ What's more dangerous for the country the Iranian nuclear threat or @BarackObama as President? _E_ Amazing view of @TrumpGolfLA __HTTP__ _E_ Major article in New York Times today discusses the cost of environmental damage in China and how it is RAPIDLY GROWNG! Rest of World pays. _E_ Obama' ststement on Egypt was terrible and dumb now being used by military as a rallying cry our foreign policy is worst in U.S. history. _E_ It was an honor to be the Grand Marshall in the Salute to Israel Parade back in 2004. __HTTP__ _E_ I'd like to wish all of my friends and even my many enemies a very Merry Christmas and Happy New Year. _E_ MAKE AMERICA SAFE AGAIN! __HTTP__ __HTTP__ _E_ Will be in Chicago tomorrow for a record setting (by far) luncheon. _E_ .@Franklin_Graham @BillyNungesser @SamaritansPurse so humbled by my time w/ you. You are in our thoughts & prayers. __HTTP__ _E_ Congratulations Jim Herman! We are all proud of you @TrumpGolf! __HTTP__ _E_ Had a very good call last night with the President of China concerning the menace of North Korea. _E_ A vote for Clinton Kaine is a vote for TPP NAFTA high taxes radical regulation and massive influx of refugees. _E_ Rally last night in San Jose was great. Tremendous love and enthusiasm in the hall. Big crowd. Outside small group of thugs burned Am flag! _E_ Wow was Ted Cruz disloyal to his very capable director of communication. He used him as a scape goat fired like a dog! Ted panicked. _E_ #MakeAmericaGreatAgain __HTTP__ _E_ LAWFARE: Remarkably in the entire opinion the panel did not bother even to cite this (the) statute. A disgraceful decision! 
_E_ Really disgusting that the failing New York Times allows dishonest writers to totally fabricate stories. _E_ Ashley Judd's candidacy was created by Karl Rove's terrible ads even before she thought seriously about running... _E_ Donald Trump Tells @theblaze About His Obama Announcement: PASSPORT APPLICATIONS TELL YOU A LOT __HTTP__ by @BillyHallowell _E_ I just wrapped up a Q&A @TwitterNYC. Thanks for all your questions! #AskTrump __HTTP__ _E_ In less than a week I'll be honored by Sarasota GOP as Statesman of the Year & then give my big surprise to @RNC convention. Will be fun! _E_ The Muslim Brotherhood @BarackObama's allies in Egypt will cancel the Camp David Agreement. __HTTP__ What a disaster! _E_ I know Mark Cuban well. He backed me big time but I wasn't interested in taking all of his calls.He's not smart enough to run for president! _E_ Re Super PAC scam: What the other candidates are doing is a disgrace. _E_ FLASHBACK – "Donald Trump Answers Boy's Prayer for New Bike" __HTTP__ via @FoxNewsInsider _E_ Thank you Greeley CO! REAL change means restoring honesty to the govt. Our plan will END govt. corruption! Watch:... __HTTP__ _E_ Just watched Cookie Roberts on @ABC. Her predictions have been so wrong for so long that she has lost all credibility. Just another sad case _E_ Eric did a great job with his Eric Trump Foundation annual charity outing. I'm proud of him. __HTTP__ _E_ Great speech by my good friend @GovChristie. He did something you won't hear at @BarackObama's convention tell the truth. _E_ The economy is broken. Entrepreneurship is being suppressed. See what I do Wednesday 11 AM at Trump Tower atrium. _E_ Gary Sinise is doing tremendous work for veterans through his foundation—check it out @GarySiniseFound _E_ Congratulations to @IsraeliPM @netanyahu on forming his new unity government. A major political success for the Jewish State of Israel. 
_E_ The Blue Monster at Trump National Doral recieved rave reviews from both players and architectural critics following the Cadillac WGC.Thanks _E_ Thank you Columbus Ohio! I will be back soon. #ImWithYou #MAGA __HTTP__ _E_ Hypocrite @BarackObama has major investments in companies that are outsourcing jobs overseas __HTTP__ _E_ I am at Trump National Doral best resort in U.S. Rory and Adam Scott are doing great! Watch on NBC at 3:00 P.M. MAKE AMERICA GREAT AGAIN! _E_ I am having a really hard time watching @FoxNews. _E_ Broken borders $18T debt ObamaCare failing & over budget. Don't worry our president is still fundraising __HTTP__ Priorities _E_ I'm at @WrestleMania tonight but will be doing a few tweets. I know the episode well.... #CelebApprentice _E_ RT @foxandfriends: Trump fires new warning shot at McConnell leaves door open on whether he should step down __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ It's time for Ted Cruz to either settle his problem with the FACT that he was born in Canada and was a citizen of Canada or get out of race _E_ Via @BreitbartNews by @mboyle1: "Obama's Amnesty Will Give Illegal Aliens Public Benefits" __HTTP__ _E_ Entrepreneurs: Negotiation is an art. Treat it like one. _E_ Spoke to President of Mexico to give condolences on terrible earthquake. Unable to reach for 3 days b/c of his cell phone reception at site. _E_ A strong military makes us respected by our allies & feared by our enemies. Let's Make America Great Again! __HTTP__ _E_ You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_ Via @BBCNews: "US property tycoon Donald Trump confirms Turnberry buy" __HTTP__ _E_ Michael Vick of the Philadelphia @eagles is a great athlete but not a great quarterback. _E_ I would love to see the Republican party and everyone get together and unify.Video: __HTTP__ __HTTP__ _E_ Continuous effort not strength or intelligence is the key to unlocking our potential. 
Winston Churchill _E_ His @BarackObama's specialties? Vacations and campaigning. Jobs not so much! _E_ 'Food Groups' – Emails Show Clinton Campaign Organized Potential VPs By Race And Gender: __HTTP__ _E_ A great day in New Jersey for Trump! __HTTP__ & __HTTP__ _E_ "Donald Trump on Mark Levin: Karl Rove is one of the most overrated people in politics" __HTTP__ via @TheRightScoop _E_ In my new book #TimeToGetTough I make a full financial disclosure detailing my net worth. __HTTP__ _E_ Making his case in a nice and articulate manner. _E_ Amazing how the haters & losers keep tweeting the name "F**kface Von Clownstick" like they are so original & like no one else is doing it... _E_ Be sure to watch The Apprentice tonight 10 p.m. on NBC it's an episode you won't forget! _E_ Remember the most hated part of ObamaCare is the Individual Mandate which is being terminated under our just signed Tax Cut Bill. _E_ It is my great honor to be speaking at CPAC 2013. They are all about what's good for America. _E_ Across the battlefields oceans and harrowing skies of Europe and the Pacific throughout the war one great battle cry could be heard by America's friends and foes alike:"REMEMBER PEARL HARBOR." __HTTP__ _E_ Dummy @BillMaher forgot to say that he made an absolute offer which I accepted. Hopefully charity gets $5M dollars. _E_ Join me in Phoenix Arizona today at 4pm! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_ Must read editorial co written by @weeklystandard editor William Kristol & @NRO editor @RichLowry 'Kill the Bill' __HTTP__ _E_ Under President @BarackObama China has experienced unusually fast gains and America unusually fast losses. #TimeToGetTough _E_ The MOVEMENT in Portsmouth New Hampshire w/ 7K supporters. THANK YOU! This is the biggest election of our lifetime... __HTTP__ _E_ Remember this: Obama wants to raise taxes @MittRomney wants to lower taxes need I say more! _E_ I can't believe the great @wjcarter got canned by @nytimes. 
He was a fantastic reporter & really knew entertainment. He will be missed! _E_ FLASHBACK: "Alex Salmond pleaded with Donald Trump to back release of Lockerbie bomber" __HTTP__ @telegraphnews ... _E_ I thought people weren't celebrating? They were cheering all over even this savage from Orlando. I was right. __HTTP__ _E_ Will be interviewed by @SeanHannity on @FoxNews at 10:00pm tonight. Enjoy! _E_ Big poll just out by @TheEconomist has me in 1st. place by a lot. A great honor but we have a long way to go to MAKE AMERICA GREAT AGAIN! _E_ My @IngrahamAngle interview discussing healthcare monopolies @MittRomney oil prices and @AnnRomney's birthday __HTTP__ _E_ Our way of life is under threat by Radical Islam and Hillary Clinton cannot even bring herself to say the words. _E_ .@foxandfriends Dems are taking forever to approve my people including Ambassadors. They are nothing but OBSTRUCTIONISTS! Want approvals. _E_ Via @OceanDriveMag by @SuzMcGeeNYC: Q&A: Ivanka Trump on the Business of Golf & the Championships __HTTP__ _E_ Change is the law of life. And those who look only to the past or present are certain to miss the future. John F. Kennedy _E_ Finally held our first full @Cabinet meeting today. With this great team we can restore American prosperity and br... __HTTP__ _E_ What is he reading? #Oscars _E_ President should not be telling the Washington Redskins to change their name our country has far bigger problems! FOCUS on themnot nonsense _E_ Enjoy the ratings of President Obama. __HTTP__ _E_ I am on @foxandfriends now! Tune in! _E_ Just left the #G7Summit. Had great meetings on everything especially on trade where.... _E_ The numbers at the @nytimes are so dismal especially advertising revenue that big help will be needed fast. A once great institution SAD! _E_ Thank you for such a wonderful and unforgettable visit Prime Minister @Netanyahu and @PresidentRuvi. _E_ Obama's Amnesty Executive Order can now be stopped by Majority Leader McConnell with riders. 
That's one reason we needed the Senate. _E_ This is a once in a generation opportunity to offer historic tax relief to the American people! Join me today: __HTTP__ __HTTP__ _E_ Bob Tyrrell @AmSpec—Thank you and also for the great work you do. _E_ Let us give thanks for all that we have and let us boldly face the exciting new frontiers that lie ahead. Happy Th... __HTTP__ _E_ Susan Rice is a good woman but Pres. O should not taunt the Republicans by appointing her S of S... _E_ Are you expanding your business? Interview returning soldiers. Give them strong consideration. Their sacrifices deserve it. _E_ A list from @Heritage: Top 10 Most Expensive Obamacare Taxes and Fees __HTTP__ _E_ Justice Roberts turned on his principles with absolutely irrational reasoning in order to get loving press from (cont) __HTTP__ _E_ What a year it's been and we're just getting started. Together we are MAKING AMERICA GREAT AGAIN! Happy New Year!! __HTTP__ _E_ Welcome back @SteveScalise!#TeamScalise __HTTP__ _E_ I would feel sorry for @JebBush and how badly he is doing with his campaign other than for the fact he took millions of $'s of hit ads on me _E_ RT @EricTrump: Friends in #FL #OH #NC #IL & #MO we would be honored to have your #VOTE! #SuperTuesday #LetsDoThis #MakeAmericaGreatAgain #T... _E_ Isn't it sad that lightweight Senator Bob Corker who couldn't get re elected in the Great State of Tennessee will now fight Tax Cuts plus! _E_ Mexican gov doesn't want me talking about terrible border situation & horrible trade deals. Forcing Univision to get me to stop no way! _E_ Next year I will be changing the name of 800 acre Doral to Trump National Doral. It will be the best resort in the country—Miami is hot! _E_ Crazy Dennis Rodman is saying I wanted to go to North Korea with him. Never discussed no interest last place on Earth I want to go to. _E_ The first meeting Jeff Sessions had with the Russian Amb was set up by the Obama Administration under education program for 100 Ambs...... 
_E_ Thank you Tampa Florida!#AmericaFirst #TrumpTrain __HTTP__ _E_ I will be on Fox & Friends tomorrow morning at 7.ºº _E_ What a STUPID deal for Verizon to buy AOL for $4.4 billion. AOL has been bad luck for everyone who touched it. Worth less than $1 billion! _E_ Why are we giving away our entire strategy and tactics we will deploy against ISIS? It puts our troops at a disadvantage. _E_ RT @Bet22325450ste: @FoxBusiness @foxandfriends Come on America. Get on the Trump Train. The winners already have boarded! The losers are w... _E_ Now @BarackObama is praising China's cooperation in negotiations over Chen Guangcheng __HTTP__ This is a sad episode for us. _E_ Don't ever think you've done it all already or that you've done your best. That's a shortcut to undermining your own potential. _E_ The Jets should have let them score to get the number one draft pick who will be really good. It will just never change for them! _E_ Hillary Clinton doesn't have the strength or the stamina to MAKE AMERICA GREAT AGAIN! #AmericaFirst __HTTP__ _E_ Under President Trump unemployment rate will drop below 4%. Analysts predict economic boom for 2018! @foxandfriends and @Varneyco _E_ Congratulations to @arsenioofficial on his new late night show! He will do really well. (It pays to win #CelebrityApprentice) _E_ Today it was my great honor to meet with the Crown Prince of Bahrain at the @WhiteHouse. Bahrain and the United States are important partners.During the Crown Prince's visit he is advancing $9 BILLION in commercial deals including finalizing the purchase of F 16's... __HTTP__ _E_ Glad to hear @InsideEdition has hired @_KatherineWebb to cover @SuperBowl. She will be absolutely terrific! Miss USA pageant is proud. _E_ Snowden has given serious information to China and Russia anyone who thinks otherwise is a dope! He is a traitor who fled he knew the crime! _E_ .@GOP need to face reality – not one of the illegal immigrants granted amnesty will vote Republican. 
_E_ How much is South Korea paying the U.S. for protection against North Korea???? NOTHING! _E_ France is losing its businesses and wealth rapidly and day by day. _E_ Take a tour of this amazing penthouse in Trump Park Avenue.... __HTTP__ _E_ THANK YOU ARIZONA! 20000 amazing supporters! Get out and #VoteTrump on Tuesday. I love you!#MakeAmericaGreatAgain __HTTP__ _E_ Last time lightweight @JebBush tried to knock off @marcorubio he made a total fool of himself. If he doesn't do better this time he is out! _E_ How does a dummy like @billmaher get a television show & his ratings stink. You'd think @HBO could do a lot better. _E_ Via @TWtravelnews by Robert Silk: "Renovations make Trump's Doral a showcase once again" __HTTP__ _E_ 'ICE OFFICERS WARN HILLARY IMMIGRATION PLAN WILL UNLEASH GANGS CARTELS & DRUG VIOLENCE NATIONWIDE'... __HTTP__ _E_ The terrorists cut off the heads of Americans and laugh then want to sell us the bodies for $1000000. We fight over sleep deprivation! _E_ Via @BBCNews: "Donald Trump visits his newly purchased Turnberry golf resort" __HTTP__ _E_ Hopefully the violence & unrest in Charlotte will come to an immediate end. To those injured get well soon. We need unity & leadership. _E_ Scotland will be so lucky if this monstrosity is not built—I will tie them up in courts for years if necessary. _E_ This is going to be a special season truly great characters and cast. You will soon see! _E_ The Lincoln Day Dinner last night in Michigan was fantastic. Record attendance and tremendous enthusiasm I loved it! _E_ Do you notice the Fake News Mainstream Media never likes covering the great and record setting economic news but rather talks about anything negative or that can be turned into the negative. The Russian Collusion Hoax is dead except as it pertains to the Dems. Public gets it! _E_ Great rally in Iowa! Such wonderful people. Traveling now with @SarahPalinUSA to Tulsa massive crowd expected! 
__HTTP__ _E_ This afternoon I'll be speaking with Neil Cavuto on Your World with Neil Cavuto 4 p.m. on FOX News. _E_ The highly neurotic Debbie Wasserman Schultz is angry that after stealing and cheating her way to a Crooked Hillary victory she's out! _E_ After years of Comey with the phony and dishonest Clinton investigation (and more) running the FBI its reputation is in Tatters worst in History! But fear not we will bring it back to greatness. _E_ Many people will be surprised at what is about to be released concerning @BarackObama's background. I for one won't be. _E_ James Comey leaked CLASSIFIED INFORMATION to the media. That is so illegal! _E_ The GOP Debate Scorecard: Donald Trump and Energy by Wayne Allyn Root. __HTTP__ _E_ In order to try and deflect the horror and stupidity of the Wikileakes disaster the Dems said maybe it is Russia dealing with Trump. Crazy! _E_ I will be interviewed tonight at 7pm ET by @greta #OnTheRecord _E_ Hillary flunky who lost big. For the 100th time I never mocked a disabled reporter (would never do that) but simply showed him....... _E_ What is Obama thinking? __HTTP__ _E_ Many countries are cutting back big time on ugly industrial wind turbines. The energy is very inefficient & (cont) __HTTP__ _E_ I want to thank @RealSheriffJoe for all of his help in our historic Arizona win. Could not have done it without you Joe! _E_ RT @FoxNews: TONIGHT on Justice @JudgeJeanine talks to special guests @EricTrump and @LaraLeaTrump Tune in at 9p ET on Fox News Channe... _E_ I don't want to hit Crazy Bernie Sanders too hard yet because I love watching what he is doing to Crooked Hillary. His time will come! _E_ The dying @NYDailyNews asked me to do an Editorial on the Central Park 5 ripoff & then they pretend it was my idea. Loser newspaper! _E_ Irresponsible! In the last 6 months @BarackObama has held over 100 fundraisers and not a single meeting with his Job Council. _E_ .@DennisDMZ Thanks for the nice words. You are fantastic! 
_E_ "Be objective and strive to be your own counselor. Listen to others but know the final decision is yours." – Think Like a Champion _E_ Thank you Eau Claire Wisconsin. #VoteTrump on Tuesday April 5th!MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ You're never a loser until you quit trying. Mike Ditka _E_ Instead of trash talking @PMIsrael on the world stage @BarackObama should be defending @Israel. _E_ Wow Vanity Fair was totally shut out at the National Magazine Awards it got NOTHING. Graydon Carter is a loser with bad food restaurants! _E_ RT @Newsmax_Media: Trumps Warns of Obama Tipping Point that May Destroy America __HTTP__ via @Newsmax_media _E_ Jeb's brother George insisted on a $100000 fee and $20000 for a private jet to speak at a charity for severely wounded vets. Not nice! _E_ Welfare's purpose should be to eliminate as far as possible the need for its own existence. – Pres. Ronald Reagan _E_ What I am saying is stay out of Syria. _E_ By the US winning the Olympic medal count we proved that both the American spirit & talent is greater than a 1.4B population. USA! _E_ This just in re: FundAnything and producer Brad Wyman __HTTP__ _E_ If you think we have a problem with Social Security and Medicare now try taking in millions of new citizens all at once. _E_ ObamaCare continues to increase insurance premiums & raise record deductibles. New Congress must use every tool to defund. _E_ If you've got some problems today that's a good sign that's life. So give them some thought and make the most of the situation. _E_ The U.S. Coast Guard FEMA and all Federal and State brave people are ready. Here comes Irma. God bless everyone! _E_ RT @foxandfriends: U.S. Air Force jets take off from Guam for training ensuring they can 'fight tonight' __HTTP__ _E_ "Success breeds success. The best way to impress people is through results." – Think Like a Billionaire _E_ "Sometimes by losing a battle you find a new way to win the war. The Art of The Deal _E_ .@piersmorgan is back! 
Did I see @OMAROSA wince? #CelebApprentice _E_ Why wouldn't the @WSJ call for comment or clarification before writing an editorial which is so totally wrong. No wonder it is doing poorly! _E_ RT @AmbJohnBolton: Our country & civilians are vulnerable today because @BarackObama did not believe in national missile defense. Let's nev... _E_ They only changed the term to CLIMATE CHANGE when the words GLOBAL WARMING didn't work anymore. Come on people get smart! _E_ Word is that Crooked Hillary has very small and unenthusiastic crowds in Pennsylvania. Perhaps it is because her husband signed NAFTA? _E_ Rory Tiger Phil and Ernie will be fun to watch this weekend at Trump National Doral. _E_ I would like to wish all fathers even the haters and losers a very happy Fathers Day. _E_ Entrepreneurs: Let your actions show that you're the best. See each day as an opportunity to show you can do business at the highest level. _E_ Hillary and Sanders are not doing well but what is the failed former Mayor of Baltimore doing on that stage? O'Malley is a clown. _E_ One thing I will say about Rep. Keith Ellison in his fight to lead the DNC is that he was the one who predicted early that I would win! _E_ Despite thousands of hours wasted and many millions of dollars spent the Democrats have been unable to show any collusion with Russia so now they are moving on to the false accusations and fabricated stories of women who I don't know and/or have never met. FAKE NEWS! _E_ Thanks @PiersMorgan. You're great! _E_ If you want to know how to prevail through tough circumstances then read The Art of the Comeback. _E_ "True courage is being afraid and going ahead and doing your job anyhow!" General Norman Schwarzkopf _E_ #MakeAmericaGreatAgain #TrumpRallyAL __HTTP__ _E_ To aspiring entrepreneurs: Be focused! Know your goals. Put everything you've got into what you're doing every single day. _E_ I think everyone will like my new and very successful book Crippled America. 
Go get it and let me know what you think! _E_ Just tried watching Modern Family written by a moron really boring. Writer has the mind of a very dumb and backward child. Sorry Danny! _E_ RT @gatewaypundit: BREAKING POLL: Trump Gains 11 Points on Clinton Since March=> Now Leads Crooked Hillary 46 44 __HTTP__ vi... _E_ The forgotten men and women of our country will be forgotten no longer. From this moment on it's going to be #AmericaFirst _E_ Obama promised premiums would lower $2500/yr for family of 4. In truth healthcare will increase by $7450 __HTTP__ _E_ From Donald Trump: Ivanka and Jared's wedding was spectacular and they make a beautiful couple. I'm a very proud father. _E_ The federal gov. has handled Sandy worse than Katrina. There is no excuse why people don't have electricity or fuel yet. _E_ I'm right TPM is wrong @BarackObama did not issue a special statement for Christmas however he issued one (cont) __HTTP__ _E_ .@CNN is so embarrassed by their total (100%) support of Hillary Clinton and yet her loss in a landslide that they don't know what to do. _E_ Now @BarackObama's Vice Chief of Joint Staff is defending China while they cheat __HTTP__ Wrong course of action. _E_ Congratulations to @gretawire on the 11 year anniversary of @FoxNews 'On the Record.' Always enjoy being interviewed by Greta. She's great. _E_ Newly minted diplomat @dennisrodman is a completely different competitor in All Star @CelebApprentice. Dennis is a legend! _E_ President Donald J. Trump Proclaims 5/14/2017 through 5/20/2017 as #PoliceWeek Proclamation... __HTTP__ _E_ The Dunes here are amazing and they're how I learned about geomorphology which is the study of movement landforms. We've had a great trip _E_ I will be making a major announcement today at 12:30 pm PST at Trump International Hotel & Tower Las Vegas (cont) __HTTP__ _E_ Left New Hampshire for Turnberry in Scotland which I am renovating. This place is incredible! @TrumpTurnberry _E_ .@realDonaldTrump on ISIS&OIL FIELDS! 
Saying it for years! @AndersonCooper you should acknowledge this! #Trump2016 __HTTP__ _E_ Lightweight reporter Alex Pareene @pareene is known as a total joke in political circles. Hence he writes for Loser Salon. @Salon _E_ Why does the media with a strong push from Crooked Hillary keep pushing the false narrative that I want to raise taxes. Exactly opposite! _E_ President Reagan put it best: Welfare's purpose should be to eliminate as far as possible the need for its own existence. _E_ Ted Cruz is a cheater! He holds the Bible high and then lies and misrepresents the facts! _E_ Join me in Florida this Saturday at 5pm for a rally at the Orlando Melbourne International Airport!Tickets:... __HTTP__ _E_ Via @gazettedotcom by James Q. Lynch: "Trump to run typical caucus campaign 'but bigger'" __HTTP__ _E_ .@TraceAdkins is back—good news for Plan B. #CelebApprentice _E_ In Miami tracking @TrumpDoral's $250M renovations. Will be America's top resort. @PGATOUR just signed for 10 yr ext. __HTTP__ _E_ Whatever the United States can do to help out in London and the U. K. we will be there WE ARE WITH YOU. GOD BLESS! _E_ By popular request I will be live tweeting during Celebrity Apprentice (Sunday 9 P.M.). _E_ Listen to my interview with @KathieLGifford at @PodcastOne __HTTP__ _E_ For all of my many Jewish friends Happy Passover. _E_ Watch my video blog to see if your questions from my Facebook page were answered __HTTP__ _E_ _E_ _E_ After all is said and done more is said than done. Aesop _E_ .@ArsenioHall How quickly people forget but not me! You told me that without The Apprentice you could never have gotten your show Sad! _E_ Why can't the pundits be honest? Hopefully we are all looking for a strong and great country again. I will make it strong and great! JOBS! _E_ Sen. Kay Hagan voted for Amnesty & ObamaCare. She is a proven liberal who recklessly goes along with Obama. Vote @ThomTillis in November! _E_ Entrepreneurs: Keep an open mind. Business is a creative endeavor. 
_E_ .@ArsenioHall The only thing you don't mention in the nice Esquire piece about you is The Apprentice without which you would be nowhere! _E_ New polls out today are very good considering that much of the media is FAKE and almost always negative. Would still beat Hillary in ..... _E_ Amazing playing with an ankle injury @Yankees Captain Derek Jeter tied Willie Mays last night for #10 on (cont) __HTTP__ _E_ The Massive Tax Cuts which the Fake News Media is desperate to write badly about so as to please their Democrat bosses will soon be kicking in and will speak for themselves. Companies are already making big payments to workers. Dems want to raise taxes hate these big Cuts! _E_ Obama & Democrat leaders did a great disservice by releasing the papers on torture. The world is laughing at us—they think we are fools! _E_ If @BarackObama had such a wonderful academic record why wouldn't he want to show it? _E_ .@Macys stock just dropped. Interesting. So many people calling to say they are cutting up their @Macys credit card. Thank you! _E_ The EPA official who wants to crucify gas companies resigned __HTTP__ Good but his attitude is endemic in the EPA _E_ You can't compare anything to ObamaCare because ObamaCare is dead. Dems want billions to go to Insurance Companies to bail out donors....New _E_ While Jeb Bush is cutting staff and salaries after having paid ridiculous amounts of money why did he pay so much in the first place? _E_ "If you don't have time to do it right when will you have time to do it over?" John Wooden _E_ Each time I see one of Anthony Weiner's television ads for mayor I ask what the hell is he doing just wasting money & time go get a job! _E_ Getting the strong endorsement of the great coach Bobby Knight has been a highlight of my stay in Indiana. Big speech tomorrow with Bobby! _E_ Wow the highly respected Governor of Iowa just stated that Ted Cruz must be defeated. Big shoker! People do not like Ted. 
_E_ Getting rid of the mortgage interest deduction would be a disaster for homeowners who have suffered enough! _E_ Flashback: "NYers were grateful when Donald Trump finished ahead of schedule and under budget the Wollman Rink" __HTTP__ _E_ The weather has been so cold for so long that the global warming HOAXSTERS were forced to change the name to climate change to keep $ flow! _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ .@Merck Pharma is a leader in higher & higher drug prices while at the same time taking jobs out of the U.S. Bring jobs back & LOWER PRICES! _E_ WORKING TOGETHER we will defeat this #OpioidEpidemic & free our nation from the terrible affliction of drug abuse. __HTTP__ __HTTP__ _E_ I'm giving away money! 11AM Trump Tower. Be there or be left behind! _E_ Conservatives have to be smart in the way we speak. Using crazy language that terifies seniors accomplishes (cont) __HTTP__ _E_ Stupid Arianna @huffingtonpost hired the man who ruined the once great NYTimes Business Section... _E_ The response has been fantastic actually overwhelming! Thank you! _E_ Great job by all law enforcement officers and Boston Mayor @Marty_Walsh. _E_ Always remember SOMETIMES YOUR BEST INVESTMENTS ARE THE ONES YOU DON'T MAKE! _E_ Brande would have been fired immediately if she didn't raise $132000 a really large sum. Bret on the other hand raised very little... _E_ Entrepreneurs: Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_ Just returned from Colorado. Amazing crowd! _E_ Getting ready to make my speech at #KansasCaucus. A great honor! #MakeAmericaGreatAgain #Trump2016 _E_ The word is that Lance Armstrong will now implicate officials and others but who knows if he's telling the truth _E_ Iran will only get stronger in Iraq with the latest civil war. We should have taken the oil immediately after the invasion. _E_ Looking forward to the debate tonight and will be tweeting live with very honest assessment. 
_E_ Watch my appearance on @Morning_Joe great interview! __HTTP__ _E_ Along with a soaring bar of sky bound gold @TrumpLasVegas' pool deck overlooks the City of Lights __HTTP__ _E_ Getting China to stop playing its currency charades can begin whenever we elect a president ready to take (cont) __HTTP__ _E_ .@IanJamesPoulter Great going and almost as importantly your clothing line is selling well! _E_ Hillary Clinton colluded with the Democratic Party in order to beat Crazy Bernie Sanders. Is she allowed to so collude? Unfair to Bernie! _E_ Yet more evidence of a media rigged election: __HTTP__ _E_ Obama's Secret Service catastrophe has openly revealed a great lack of respect for our President. If they (cont) __HTTP__ _E_ Over 50 women were interviewed by the @nytimes yet they only wrote about 6. That's because there were so many positive statements. _E_ So so so important MAKE AMERICA GREAT AGAIN! _E_ "Failures are expected by losers ignored by winners." @CoachJoeGibbs _E_ Lots of response to my comment on Diet Coke  let's face it it doesn't work just makes you hungry. _E_ Playing golf with Prime Minister Abe and Hideki Matsuyama two wonderful people! __HTTP__ _E_ Many Democrats up for reelection in 2012 are skipping the DNC convention in Charlotte __HTTP__ Smart politics! _E_ Thank you to the @washingtonpost for the accurate and very discriptive story on my speech in Alabama last night. It was a great evening! _E_ First the Ninth Circuit rules against the ban & now it hits again on sanctuary cities both ridiculous rulings. See you in the Supreme Court! _E_ I don't believe in government picking winners or in the case of (@BarackObama) picking losers @MittRomney _E_ Must read column for all young people: Obama's war on young voters who elected him __HTTP__ _E_ Former Homeland Security Advisor Jeh Johnson is latest top intelligence official to state there was no grand scheme between Trump & Russia. 
_E_ Together our task is to strengthen our families to build up our communities to serve our citizens and to celebrate AMERICAN GREATNESS as a shining example to the world.... __HTTP__ _E_ Help fight autism go to __HTTP__ website for __HTTP__ donations & government activation. _E_ The ABC/Washington Post Poll even though almost 40% is not bad at this time was just about the most inaccurate poll around election time! _E_ In the heart of the city Trump International Toronto is the city's most elite property __HTTP__ True luxury at its finest. _E_ Josh Brolin a friend of mine was terrific in Men in Black. Congrats! _E_ .@JebBush was terrible on Face The Nation today. Being at 2% and falling seems to have totally affected his confidence. A basket case! _E_ There are only 22 days for @BarackObama to drop @JoeBiden. Obama is not a loyal guy. I think he is strongly considering it. _E_ The Republicans are always worried about the press they should just do what is right. _E_ ISIS gained tremendous strength during Hillary Clinton's term as Secretary of State. When will the dishonest media report the facts! _E_ Entrepreneurs: Don't ever think you've done it all already or that you've done your best. Don't sell yourself short! _E_ I wonder if Marshawn Lynch will now speak and call some coach a moron for not allowing him to run the ball three times for one yard? _E_ Playef golf today with Prime Minister Abe of Japan and @TheBig_Easy Ernie Els and had a great time. Japan is very well represented! _E_ The atrium of @TrumpTowerNY dressed up for Christmas __HTTP__ _E_ Going to Salt Lake City Utah for a big rally. Lyin' Ted Cruz should not be allowed to win there Mormons don't like LIARS! I beat Hillary _E_ I'm leading by big margins in every poll but the press keeps asking would you ever get out? They are just troublemakers I'm going to win! _E_ Rev. Wright called @BarackObama on tape a liar. Why isn't this being looked into? It would be a great commercial for the republicans. 
_E_ Really interesting President Obama was quick to shut down flights to Isreal but is totally unwilling to shut down flights from West Africa! _E_ If Saudi Arabia which has been making one billion dollars a day from oil wants our help and protection they must pay dearly! NO FREEBIES. _E_ Paul Teutul is always good on the show. #CelebApprentice _E_ Derek Jeter broke ankle one day after he sold his apartment in Trump World Tower. _E_ We have a sacred duty to care for our vets and their families. Veterans deserve universal access to care anywhere and anytime! _E_ The Fake News Media will not talk about the importance of the United Nations Security Council's 15 0 vote in favor of sanctions on N. Korea! _E_ Thank you Gettysburg Pennsylvania! #DrainTheSwamp __HTTP__ _E_ RT @dmartosko: This is the #NYTimes. Can you understand why so many reporters are cautious about working for them? __HTTP__ _E_ Love the people of South Carolina look very much forward to the debate tonight. _E_ If a new HealthCare Bill is not approved quickly BAILOUTS for Insurance Companies and BAILOUTS for Members of Congress will end very soon! _E_ The #USSJohnFinn will provide essential capabilities to keep America safe. Our sailors are the best anywhere in the world. Congratulations! __HTTP__ _E_ I hope that Crooked Hillary picks Goofy Elizabeth Warren sometimes referred to as Pocahontas as her V.P. Then we can litigate her fraud! _E_ I will unveil my first campaign ads on @Morning_Joe at 6:30am tomorrow. Enjoy! #MakeAmericaGreatAgain _E_ The new season of the Celebrity Apprentice begins Feb. 12 be prepared for the best season yet! __HTTP__ _E_ Derek Jeter is playing phenomenal baseball. He is a total winner and also a great guy. @DerekJeter _E_ I make good deals. That's what I do. I would make great deals for our country. my @SRQRepublicans speech _E_ I have fun I love what I do. You should too. 
Find out how at the National Achievers Conference this October in London __HTTP__ _E_ ALABAMA get out and vote for Luther Strange he has proven to me that he will never let you down! #MAGA _E_ Happy Thanksgiving to all. Have a great day and look forward to the future. We will MAKE AMERICA GREAT AGAIN! _E_ I will be on with @BretBaier tonight at 6PM. #Trump2016 _E_ The Trumping of Turnberry via Links Magazine @TrumpTurnberry __HTTP__ _E_ Watch Donald Trump's recent appearance on The Late Show with David Letterman: __HTTP__ _E_ I will be interviewed by @oreillyfactor tonight on @FoxNews at 11pm. Enjoy! _E_ Great @foxbusiness interview with @EricTrump on @TeamCavuto discussing the real estate economy & 2016 __HTTP__ _E_ Trump buys mansion adjacent to family winery __HTTP__ via @trdny _E_ When will we stop wasting our money on rebuilding Afghanistan? We must rebuild our country first. _E_ Join me LIVE at 5:45pmE from Harrisburg Pennsylvania! #TaxReform #USA __HTTP__ __HTTP__ _E_ We only want to admit those who love our people and support our values. #AmericaFirst _E_ Will be on @foxandfriends tomorrow morning at 7:00. _E_ Don't forget episodes 2 and 3 of @ApprenticeNBC are on tonight at 8PM and 9PM on @NBC. _E_ More on Benghazi cover up: "ATTORNEY FOR WHISTLEBLOWER: 400 U.S. MISSILES STOLEN IN BENGHAZI" __HTTP__ Really bad. _E_ Weiner is gone Spitzer is gone next will be lightweight A.G. Eric Schneiderman. Is he a crook? Wait and see worse than Spitzer or Weiner _E_ Obama wants to unilaterally put a no fly zone in Syria to protect Al Qaeda Islamists __HTTP__ Syria is NOT our problem. _E_ The Fake News is becoming more and more dishonest! Even a dinner arranged for top 20 leaders in Germany is made to look sinister! _E_ If you don't publicize your successes your competitors will be sure to belittle them. Get the word out! _E_ Obama is now warning North Korea on the Yongbyon nuclear reactor __HTTP__ After Syria our enemies are laughing! 
_E_ Go out and buy CRIPPLED AMERICA: How to Make America Great Again. Doing really well. Great Thanksgiving or Christmas present! _E_ Can't wait for @VanityFair to fold which under Graydon Carter will be sooner rather than later. _E_ Hmmm...can you imagine me speaking at the RNC Convention in Tampa? __HTTP__ That's a speech everyone would watch. _E_ RT @realDonaldTrump: More and more people are suggesting that Republicans (and me) should be given Equal Time on T.V. when you look at the... _E_ A great story in the New York Post really well written! __HTTP__ _E_ Yes it is true Carlos Slim the great businessman from Mexico called me about getting together for a meeting. We met HE IS A GREAT GUY! _E_ Today we witnessed an incredible moment in history – the presentation of Congress' highest civilian honor to our friend and true AMERICAN HERO Bob Dole. #CongressionalGoldMedal __HTTP__ _E_ .@FoxNews from multiple sources: There was electronic surveillance of Trump and people close to Trump. This is unprecedented. @FBI _E_ Canadians: My ultra luxury private plane will be featured on Sunday's episode of #MightyPlanes on @DiscoveryCanada don't miss it at 8 ET! _E_ Hillary Clinton just lost every Republican she ever had including Never Trump all farmers & sm. biz by saying she'll tax estates at 65%. _E_ Thinking big is the driving force that has forged all the great achievements in modern life. Think Big _E_ All time hit leader Pete Rose should now be in the Baseball Hall Of Fame. He has paid his penalty! _E_ MAKE AMERICA SAFE AND GREAT AGAIN! #TrumpPence16 __HTTP__ __HTTP__ _E_ Strange why didn't @BarackObama hold any special event to celebrate the 2 year anniversary of ObamaCare? __HTTP__ _E_ Thank you! #MakeAmericaGreatAgain __HTTP__ _E_ Jerry Falwell Jr. stated speech was best in University's history...my great honor. _E_ The more I get to know @MittRomney the more I like him. He has the judgment and private sector experience America needs in the White House. 
_E_ You have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_ Thank you @DonaldJTrumpJr & @EricTrump. #Trump2016 __HTTP__ _E_ The failing @nytimes finally gets it In places where no insurance company offers plans there will be no way for ObamaCare customers to.. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ AIR FORCE TRUMP: AHEAD OF 2016 THE DONALD SLAMS ROMNEY BUSH IN SOUTH CAROLINA __HTTP__ via @BreitbartNews by @mboyle1 _E_ Now that the ineffective Baltimore Police have allowed the city to be destroyed are the U.S. taxpayers expected to rebuild it (again)? _E_ Wow great ratings for @ApprenticeNBC __HTTP__ Don't forget watch 2 new episodes tonight at 8PM on @NBC. _E_ Received a standing ovation in packed house @MorningsideEdu after Sam Clovis intro! Let's Make America Great Again! __HTTP__ _E_ "Donald Trump: I'm Not Buying the @BrooklynNets" __HTTP__ via @TMZ_Sports _E_ Based on the fact that the very unfair and unpopular Individual Mandate has been terminated as part of our Tax Cut Bill which essentially Repeals (over time) ObamaCare the Democrats & Republicans will eventually come together and develop a great new HealthCare plan! _E_ Las Vegas' most elite destination @TrumpLasVegas' has 64 stories of golden glass & offers ultimate luxury __HTTP__ _E_ Achievers move forward at all times. Don't tread water. Get out there and go for it. _E_ RT @ABC: Pres. Trump: We cannot be defined by the evil that threatens us or the violence that incites such terror. __HTTP__ _E_ Doing Fox & Friends at 7 A.M. _E_ I wish the @WSJ Wall Street Journal had reported the just out @CNN Iowa Poll correctly. I lead by a wide margin13 points going up big! _E_ Read an excerpt from Think Like A Champion by Donald J. Trump: __HTTP__ _E_ Join me in Rome NY tomorrow!#Trump2016 #NYPrimaryTickets available: __HTTP__ _E_ I wish President @BarackObama the best of luck in his second term... 
_E_ Governor Alejandro García Padilla said presidential hopeful Sen. Marco Rubio "is no friend of Puerto Rico. __HTTP__ _E_ Ohio Senator @RobPortman: @MittRomney knows how to return prosperity: __HTTP__ #Mitt2012 #tcot _E_ I believe that President Obama is so overwhelmed by what is happening in the U.S. and throughout the World that he has totally given up! _E_ .@Yankees Kevin Youkilis is off to a terrific start. He's less than half the price and a much better player than a drug free A Rod. _E_ Via @BreitbartNews by @mboyle1: TRUMP: OBAMA SHOULDN'T ATTACK AMERICANS OVERSEAS HILLARY'S EMAIL WAS 'CRIMINAL __HTTP__ _E_ Don't let up keep getting out to vote this election is FAR FROM OVER! We are doing well but there is much time left. GO FLORIDA! _E_ Another Crooked Hillary Fan! __HTTP__ _E_ Even more @BarackObama crony capitalism & corruption. We are guaranteeing a $105M loan to another Obama donor __HTTP__ _E_ ...Trump however would kick his ass! _E_ ...LaVar you could have spent the next 5 to 10 years during Thanksgiving with your son in China but no NBA contract to support you. But remember LaVar shoplifting is NOT a little thing. It's a really big deal especially in China. Ungrateful fool! _E_ "We need more grown ups in Washington people who will shoot straight and level with the American people." #TimeToGetTough _E_ I'm sure the media will not report the highly respected new national poll that just came out via The Economist. 32%! __HTTP__ _E_ Getting ready to go to Iowa today. Big crowd will be a great day! _E_ Threatening phone calls from Obama supporters are being made to the Michigan GOP office __HTTP__ Don't be intimidated! _E_ #AskTrump Send me your questions to answer live from @TwitterNYC later this afternoon. _E_ We create success or failure on the course primarily by our thoughts. Gary Player _E_ Thank you! #Trump2016 __HTTP__ _E_ Today it was my great honor to sign the largest TAX CUTS and reform in the history of our country. 
Full remarks: __HTTP__ __HTTP__ _E_ Will be on Fox & Friends tomorrow morning at 7.00 hope you enjoy! _E_ Colorado Trump Delegates Scratched from Ballots at GOP Convention __HTTP__ _E_ Any American who fights with ISIS should have their passport revoked. Take them to Gitmo for interrogation. _E_ Wow Obama Care just got delayed by over a year because it is so complicated it cannot be understood the beginning of the end! _E_ JEB is a hypocrite! Used massive private Eminent Domain Just another clueless politician! __HTTP__ _E_ Penn State is doing a poor job in bringing its mess to a close.They should be ashamed for hiding Sandusky's crimes all these years... _E_ RT @opinionsamerica: @realDonaldTrump Strong administration leads to a strong response. _E_ The law requires individuals pay 15% on carried interest. Why would a potential President pay more than he or she is supposed to? _E_ .@Yankees manager Joe Girardi is a gritty leader who stands up for his players. Doing a great job! _E_ Obama is going to take away over 90M Americans' healthcare plans but he is letting Iran keep its nukes. Just think about that. _E_ Trump National Golf Club Los Angeles offers 18 holes fronting the Pacific Ocean on the Palos Verdes Peninsula. __HTTP__ _E_ The Fake News Media hates when I use what has turned out to be my very powerful Social Media over 100 million people! I can go around them _E_ See you tomorrow Dutchess County New York! #NYPrimary #TrumpTrain __HTTP__ __HTTP__ _E_ Big TAX REFORM AND TAX REDUCTION will be announced next Wednesday. _E_ Lifting off right now for U.S.S. Wisconsin in Norfolk. See ya' _E_ Next year @TomBrokaw should be the comedian at the White House Correspondents' dinner. The only problem is that... __HTTP__ _E_ Today we together won the Republican Nomination for President! 
__HTTP__ _E_ Obama has changed the Census so "it will be difficult to measure the effects" of O'Care __HTTP__ REAL data hidden _E_ The Inspector General's report on Crooked Hillary Clinton is a disaster. Such bad judgement and temperament cannot be allowed in the W.H. _E_ I will be on Fox & Friends tomorrow morning at 7.00. Ebola and ISIS will be topics. _E_ .@WhiteHouse #CEOTownHall __HTTP__ __HTTP__ _E_ Statement on Preventing Muslim Immigration: __HTTP__ __HTTP__ _E_ .@mcuban you were excellent on Howard Stern...thanks for the nice comments about my kids...yours are winners also! _E_ #trumpvlog My thoughts on gasoline prices skyrocketing...... __HTTP__ _E_ Only the Obama WH can get away with attacking Bob Woodward. _E_ To all of my twitter followers please contribute whatever you can to the campaign. We must beat Crooked Hillary. __HTTP__ _E_ .@TrumpDoral's golf courses The Red Tiger The Silver Fox & The Golden Palm are on track to open later this year __HTTP__ _E_ The attack on our Libyan consulate was the worst attack on the US since 9/11. Time for Obama to come clean. _E_ We have a MASSIVE trade deficit with Germany plus they pay FAR LESS than they should on NATO & military. Very bad for U.S. This will change _E_ I'm loyal to people who've done good work for me. #TheArtofTheDeal _E_ Is it a coincidence that the Middle East has blown up since Obama became president? _E_ Congratulations to @IvankaTrump on being named the @FoxNewsSunday Power Player of the Week __HTTP__ _E_ I started my business with very little and built it into a great company with some of the best real estate assets in the World. Amazing! _E_ Via @IBTimes: Miss Universe 2013: Contestants Stun in Gorgeous Gowns at National Gift Auction Gala __HTTP__ _E_ Via @USATODAY: "Trump endorses Wintour for ambassadorship" __HTTP__ _E_ Enjoyed watching @MonicaCrowley's analysis of my @BillOreilly interview. Great points! Thank you Monica. 
_E_ The CPAC speech went really well this morning first speaker standing ovation. I really enjoyed it. _E_ Ohio is losing jobs to Mexico now losing Ford (and many others). Kasich is weak on illegal immigration. We need strong borders now! _E_ The media can track down @PaulRyan's old girlfriend and marathon time but can't find @BarackObama's college applications or other info. _E_ Remember I am self funding my campaign the only one in either party. I'm not controlled by lobbyists or special interests only the U.S.A.! _E_ After years of long stops then starts why did dopey Eric Scheiderman tell people in The Trump Org. this case is going awaywe have no case _E_ Looking forward to visiting @SimpsonCollege on Wednesday to discuss education. Common Core is an attack on individual & local rights! _E_ The @Yankees should immediately stop paying A Rod—he signed his contract without telling them he was a druggie. _E_ Seven people shot and killed yesterday in Chicago. What is going on there totally out of control. Chicago needs help! _E_ Even Jimmy Carter just released a statement saying that Obama doesn't have a clue. That has to be a new low! _E_ RT @IngrahamAngle: "Far right"? You mean "right so far" as in @realDonaldTrump has been right so far abt how to kick the economy into high... _E_ Total fool @KarlRove is part of the Republican Establishment problem. An all talk no action dummy! __HTTP__ _E_ THANK YOU Grand Rapids Michigan! Time to end political correctness & secure our homeland! __HTTP__ __HTTP__ _E_ "Great effort springs naturally from great attitude." Pat Riley _E_ China is robbing us blind in trade deficits and stealing our jobs yet our leaders are claiming 'progress' __HTTP__ SAD! _E_ Thank you to @foxandfriends for the great review of the speech on immigration last night. Thank you also to the great people of Arizona! _E_ Hey @glennbeck see how I beat your boy Ted in your own Blaze poll? Your endorsement means nothing! 
#GOPDebate _E_ I am not angry at Russia (or China) because their leaders are far smarter than ours. We need real leadership and fastbefore it is too late _E_ RT @TeamTrump: We agree with Bill ObamaCare is "the craziest thing in the world." #BigLeagueTruth #Debates2016 __HTTP__ _E_ Thank you America! #Trump2016 __HTTP__ __HTTP__ _E_ Cryin' Chuck Schumer fully understands especially after his humiliating defeat that if there is no Wall there is no DACA. We must have safety and security together with a strong Military for our great people! _E_ Via @NewYorkObserver by @Bshapiro91: "Donald Trump @MelRivers Headline @Algemeiner Gala" __HTTP__ _E_ Obama is making speeches excoriating the Republicans and they never answer back. Why aren't they fighting? _E_ Great job on @donlemon tonight @kayleighmcenany @cherijacobus begged us for a job. We said no and she went hostile. A real dummy! @CNN _E_ I will be live tweeting the V.P. Debate. Very exciting! MAKE AMERICA GREAT AGAIN! _E_ Wow ISIS has just taken the City of Ramadi in Iraq. So many of our great soldiers died in originally going after it. Such a waste. _E_ Our spectacular ballroom under construction at the great Turnberry resort in Scotland. __HTTP__ _E_ Clinton's Top Aides Were Mired In Conflict Of Interest At The State Department: __HTTP__ #BigLeagueTruth _E_ .@yuSiddiqui @piersmorgan @rustyrockets I got much better—no contest—I got Melania! _E_ My thoughts on Joe Paterno and political analysts in today's #trumpvlog... __HTTP__ _E_ You should give the money back @HillaryClinton! #DrainTheSwamp __HTTP__ _E_ Weird why did BarackObama Sr. fail to list @BarackObama as his son in his 1961 INS application? __HTTP__ _E_ Illegal use of official Attorney General stationary by lightweight @AGSchneiderman. 
__HTTP__ _E_ Welcome to the new Egypt Muslim Brotherhood representatives who won't take questions from Israeli journalists __HTTP__ _E_ What my father really gave me is a good (great) brain motivation and the benefit of his experience unlike the haters and losers (lazy!). _E_ The tragedy in South Carolina is incomprehensible. My deepest condolences to all. _E_ I'm not hearing much from Obama or his administration about my $5M offer to charity or to which charity the money will go. _E_ What a convenient mistake: @BarackObama issued a statement for Kwanza but failed to issue one for Christmas. __HTTP__ _E_ I will be on @foxandfriends tomorrow morning at 7.00. Will be talking about sleazebag Jonathan Gruber ( Americans are stupid ) & exec order _E_ While the Fake News loves to talk about my so called low approval rating @foxandfriends just showed that my rating on Dec. 28 2017 was approximately the same as President Obama on Dec. 28 2009 which was 47%...and this despite massive negative Trump coverage & Russia hoax! _E_ Really looking forward to my address @CPACnews this Friday morning at 8:30. Will stress jobs etc. Can't wait to see my many friends. _E_ Sanders says he wants to run against me because he doesn't want to run against me. He would be so easy to beat! _E_ Congratulations to my friend @limbaugh on being named to the Hall of Famous Missourians. Rush is a great guy & a great character. _E_ Thanks Mark will be fun. __HTTP__ _E_ A great and important day at the United Nations.Met with leaders of many nations who agree with much (or all) of what I stated in my speech! _E_ Many people voted for Cruz over Carson because of this Cruz fraud. Also Cruz sent out a VOTER VIOLATION certificate to thousands of voters. _E_ My friend @ChristianJosi is making a very special LP. Follow him. Conservative leader by day likely 2015 GRAMMY winner by night. #LEGENDS _E_ Oh no another rapper doing a Trump song Young Jeezy Trump Lyrics. Why aren't these guys paying me? 
_E_ Was Susan Rice told to lie about Bergdahl? Obama and his representatives lie about virtually everything from ObamaCare to a deserter. _E_ USMC Andrew Tahmooressi should be freed immediately. He never should have been jailed in the first place. Weak leaders. #FreeOurMarine _E_ Adrian was recognized on a Disney cruise and has had many photo requests in @TrumpTowerNY. We have a new celebrity! #CelebApprentice _E_ Pageant people are really talking about Venezuela Brazil Mexico USA India Australia. _E_ What a great four days in Cleveland. So proud of the great job done by the RNC and all. The police and Secret Service were fantastic! _E_ On 800 pristine Miami acres @TrumpDoral boasts luxurious accommodations world class dining & championship golf __HTTP__ _E_ Watch @CNN at 9:00 A.M. @jaketapper. Then interviewed on @ABC @GStephanopoulos at 10:00 A.M. and then at 10:30 A.M. watch Face The Nation. _E_ Because the ban was lifted by a judge many very bad and dangerous people may be pouring into our country. A terrible decision _E_ It is time to remember that... __HTTP__ _E_ .@natalie_gulbis Thank you for your support this morning on @GolfChannel. Even more importantly play well this week! Say hi to all. _E_ Today we just passed 1.4 million twitter followers.. _E_ I will renegotiate NAFTA. If I can't make a great deal we're going to tear it up. We're going to get this economy running again. #Debate _E_ My @eonline interview discussing @_KatherineWebb's stardom and why @espn's apology was unwarranted __HTTP__ _E_ Degenerate former Congressman Anthony Weiner is trying to make a comeback. He is a sick & perverted man that New York does not want or need. _E_ For the nonbeliever here is a photo of @Neilyoung in my office and his $$ request—total hypocrite. __HTTP__ _E_ Thank you for your interest & support during last nights #GOPDebate! #IACaucus finder: __HTTP__ __HTTP__ _E_ ... ...Do your research before donating this holiday season! 
_E_ My wife Melania Trump's show was a tremendous success last night. In case you missed her you can see her again tonight on @QVC at 7 pm ET _E_ RT @foxandfriends: Trump vows U.S. 'power' will meet North Korean threat __HTTP__ _E_ I would like to express my warmest regards best wishes and condolences to all of the families and victims of the horrible bombing in NYC. _E_ A very big poll is coming out at 6 PM in New Hampshire. Will be very interested in the results. _E_ Always great to speak with Veterans our nation's heroes. We will Make America Great Again! __HTTP__ _E_ 'CNBC Time magazine online polls say Donald Trump won the first presidential debate' via @WashTimes. #MAGA __HTTP__ _E_ Wow Huffington Post just stated that I am number 1 in the polls of Republican candidates. Thank you but the work has just begun! _E_ The Apprentice was the #1 show on television last season on Sunday from 10 to 11 congratulations Donald! _E_ Scots should boycott Glenfiddich garbage for not choosing great Olympic & U.S. Open champ Andy Murray over total loser Michael Forbes. _E_ I will stand with police and protect ALL Americans! #Debates2016 #MAGA __HTTP__ _E_ Thank you Atlanta Georgia! Will be back soon! #AmericaFirst __HTTP__ _E_ A massive blow to Obama's message only 38000 new jobs for month in just issued jobs report. That's REALLY bad! _E_ True thanks. __HTTP__ _E_ AMERICA will once again be a NATION that thinks big dreams bigger and always reaches for the stars. YOU are the ones who will shape America's destiny. YOU are the ones who will restore our prosperity. And YOU are the ones who are MAKING AMERICA GREAT AGAIN! #MAGA __HTTP__ _E_ We had a GREAT year @Macys with ties shirts and suits thanks! New selections just arrived they are amazing! _E_ Young entrepreneurs – keep positive. Don't let the ObamaCare disaster stop your endeavors. There are great opportunities out there. 
_E_ Sure @BarackObama's literary agent claims the 1991 booklet was a 'mistake' __HTTP__ Pretty convenient. _E_ When will @AlexSalmond realize that he's destroying Scotland the most beautiful countryside in the world w/ his stupid wind turbines? _E_ Mike Huckebee a great guy said the President should appoint me Treasury Secretary. China and OPEC would not be happy. _E_ Forty six million Americans more than at any time ever in the history of this country now live under the poverty line. #TimeToGetTough _E_ This is no act of love as Jeb Bush said... __HTTP__ _E_ I'll be on @foxandfriends Monday at 7:30 AM. Be sure to tune in. _E_ "Trump: 'I like North Carolina we are looking at another deal'" __HTTP__ via @WSOC_TV _E_ Via @WSJPolitics by @reidepstein: "Trump Surges in Popularity in N.H." __HTTP__ _E_ .@WineEnthusiast's highest rated wine in Virginia @trumpwinery is the premier name in sophistication and quality __HTTP__ _E_ Via @Newsmax_Media by Courtney Coren: Trump: China Gets Iraq Oil US Gets Nothing __HTTP__ _E_ my presidency. Isn't this a ridiculous shame? He loves these kids has raised millions of dollars for them and now must stop. Wrong answer! _E_ ...Trump International Hotel Las Vegas and Trump International Hotel & Tower Waikiki Beach Walk. __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Looking forward to the @CadillacChamp at Trump Doral next week 3.6. 3.10. Can't wait to meet the attendees. #WGCDoral _E_ So much SPIRIT in LA! Thank you to all of our HEREOS who saved many lives. An honor to spend time w/ @NationalGuard #LEOs & the #CajunNavy! __HTTP__ _E_ Via @Newsmax_Media: Romney Said Nothing Wrong __HTTP__ _E_ Watch Melania on QVC this morning from 10 a.m. to 11 a.m. with her third line of her Melania Timepieces & Jewelry collection... _E_ If Ebola is so non contagious how come an NBC cameraman caught it so quickly while over in West Africa? U.S. is behaving very foolishly! _E_ Oscar Pistorious is guilty as hell! 
_E_ .@CarlyFiorina Carly—I did graduate from Wharton and did very well. Who is your fact checker? Will you apologize? _E_ Eric Trump on @JudgeJeanine on @FoxNews now! _E_ ...while her charity is getting less than 5 cents per donated dollar. She should be ashamed! _E_ The 2012 budget deficit is already $93 billion larger than earlier estimates __HTTP__ @BarackObama (cont) __HTTP__ _E_ Jodi should try but the Govt. should not make a deal no jury could be dumb enough to let her off (but you never know look at OJ & others) _E_ If you experience any harassment or heckling at the polling places from Obama supporters make sure you report it immediately. _E_ My @SquawkCNBC interview discussing why @MittRomney is a great nominee gas prices and why George Will is a loser. __HTTP__ _E_ You must be registered Republican by February 16th to vote TRUMP in the Florida primary. __HTTP__ _E_ Don't forget to watch Celebrity Apprentice tonight at 9 on NBC GREAT EPISODE! _E_ My Administration will continue to work around the clock with Governor @RicardoRossello & his team. Great progress being made! #PRStrong __HTTP__ _E_ Just got a call from my friend Bill Ford Chairman of Ford who advised me that he will be keeping the Lincoln plant in Kentucky no Mexico _E_ Will be on @foxandfriends at 7:00 5 minutes. _E_ The Chinese Envoy who just returned from North Korea seems to have had no impact on Little Rocket Man. Hard to believe his people and the military put up with living in such horrible conditions. Russia and China condemned the launch. _E_ #TrumpAdvice __HTTP__ _E_ Via @WashTimes by @CharlesHurt: Donald Trump declares war on lying street hustlers of Congress" __HTTP__ _E_ .@SpeakerRyan Congratulations and good luck you will do a GREAT job for our wonderful U.S.A.! _E_ .@bovanpelt. Bo I heard you were great at Trump National Westchester I am not at all surprised. Keep playing well you are a winner! _E_ Brand new selection of Trump Signature Collection shirts and ties @Macys. 
Go check them out. _E_ While the @Yankees look like they quit and are finished they won't quit for CC _E_ "Patriotism is supporting your country all the time and your government when it deserves it." Mark Twain _E_ It always seems impossible until it is done. Nelson Mandela _E_ My statement on NATO being obsolete and disproportionately too expensive (and unfair) for the U.S. are now finally receiving plaudits! _E_ "Labor disgraces no man unfortunately you occasionally find men who disgrace labor." Gen. Ulysses S. Grant _E_ Congratulations to @secupp on joining @newtgingrich on @CNN's Crossfire. Show will be excellent! _E_ Emin from Russia a very talented guy. All proceeds go to help the Philippines. @eminofficial #missuniverse __HTTP__ _E_ Which National Costume do you think should win? __HTTP__ _E_ Looking forward to joining @V4SA Tuesday 9/15 in L.A. aboard the @USSIowa The Battleship of Presidents! Join us! __HTTP__ _E_ Thanks @WWE @VinceMcMahon is an amazing guy. _E_ "Be flexible enough to adjust to changing circumstances." – Think Big _E_ We now have confirmation as to one reason Crooked H wanted to be sure that nobody saw her e mails PAY FOR PLAY. How can she run for Pres. _E_ Detroit's bankruptcy could just be the start __HTTP__ Many municipalities across US are over leveraged & losing citizens _E_ Giving away money and revolutionizing crowdfunding. Follow @fundanything to see which causes are financed daily _E_ #trumpvlog The Republicans must act now don't let @barackobama push you around.... __HTTP__ _E_ Via @theinquisitr: "Americans Agree With Donald Trump 58 Percent Want Flights Banned From Ebola Outbreak Countries" __HTTP__ _E_ __HTTP__ _E_ Obama administration is killing American industrial renaissance by stopping drilling and fracking. Terrible for economy. _E_ Republicans must get out today and VOTE in Georgia 6. Force runoff and easy win! Dem Ossoff will raise your taxes very bad on crime & 2nd A. 
_E_ It's too bad so few people showed up to @bobvanderplaats Family Leader dinner. Next year I'll try & be there and they'll have a huge crowd! _E_ Welcome to the @BarackObama recovery the labor force participation rate is at a NEW 30 year low of 64.3% __HTTP__ _E_ China is now attacking Japan's economy for leverage __HTTP__ Soon they will try the same with us. #TimeToGetTough _E_ Due to popular demand CNN will re broadcast the Larry King Live show I hosted in June in which I interview Larry. Monday July 5 9 pm CNN _E_ The late great William F. Buckley would be ashamed of what had happened to his prize the dying National Review! _E_ Be tough be focused. There are a lot of ups and downs but you can ride them out if you're prepared for them. _E_ Let's see whether or not Chuck Townsend @CondeNastCorp is smart enough to fire Graydon Carter who only cares about his bad food restaurants _E_ Looks like many anti police agitators in Boston. Police are looking tough and smart! Thank you. _E_ John Kasich despite being Governor of Ohio is losing to me in the Ohio polls. Pathetic! _E_ A former Secret Service Agent for President Clinton excoriates Crooked Hillary describing her as ERRATIC & VIOLENT. Bad temperament for pres _E_ Not his 'per se'? A Friday document dump shows @BarackObama all hands on deck as Solyndra collapsed __HTTP__ @BarackObama lies. _E_ Everybody's talking about my doing twitter during the likely very boring debate tonight. @realDonaldTrump #DemDebate _E_ It's Tuesday how much has China stolen from us today through cyber espionage? _E_ Remember to take time this weekend to relax and regroup. It will pay major dividends for the next week. _E_ For great success you need passion but make sure it's well directed. Learn everything you can about what you're doing. Be an expert. _E_ We have a sacred duty to care for our vets and their families. Our Vets are owed full access to healthcare anytime & anywhere! 
_E_ It was great to have @ApprenticeNBC veterans George Ross and @BretMichaels back in the boardroom. __HTTP__ #CelebApprentice _E_ Today we lost a great pioneer of air and space in John Glenn. He was a hero and inspired generations of future explorers. He will be missed. _E_ Via @FortuneMagazine by @mcasey1: "Donald Trump plans to build a Trump Tower in Mumbai" __HTTP__ _E_ An impromptu interview I did with German TV on 9/11 down by Ground Zero discussing the attack and WTC Towers __HTTP__ _E_ Great poll thank you America! Once we #DrainTheSwamp together we will #MAGA#Debate __HTTP__ _E_ Whose artwork was your favorite— and what team do you think will win? #CelebApprentice _E_ CHIP should be part of a long term solution not a 30 Day or short term extension! _E_ RT @TeamTrump: She calls our people deplorable and irredeemable. I will be a president for ALL of our people. @RealDonaldTrump #BigLeag... _E_ "Our runaway judiciary is badly in need of restraint by Congress." Phyllis Schlafly _E_ The Dunes of @TrumpScotland are a world treasure threading thru @GolfWorld1's Scotland top Par 72 7400 yd course __HTTP__ _E_ Yes this is a large scale version of when I built and saved the ice skating rink in Central Park (which all should go to). Great course! _E_ ... That's why so many huge deals are closed on a golf course." – TRUMP 101 _E_ New @RNC report calls for embracing "comprehensive immigration reform." __HTTP__ Does the @RNC have a death wish? _E_ .@Omarosa's meltdown—was it for real? @DennisRodman thinks she could be an Oscar winner for that performance... #CelebApprentice _E_ THANK YOU Connecticut Delaware Maryland Pennsylvania and Rhode Island! #MakeAmericaGreatAgain __HTTP__ _E_ ...people not interviewed including Clinton herself. Comey stated under oath that he didn't do this obviously a fix? Where is Justice Dept? _E_ Joan Rivers had great talent but also truly amazing stamina and drive she would never give up or quit. That is why she became a champion! 
_E_ The sex scandal at the CIA and Pentagon is rapidly unfolding getting more interesting by the minute! _E_ #AskTrump @TwitterNYC __HTTP__ _E_ A 'confidential source' has called my office and told me that @BarackObama has added over $6T to the new national debt & ruined US credit. _E_ Anna Wintour came to my office at Trump Tower to ask me to meet with the editors of Conde Nast & Steven Newhouse a friend. Will go this AM. _E_ My @foxandfriends interview where I discuss @Rosie being canceled yet again and how she just can't make it on TV __HTTP__ _E_ Wow so nice! Thank you Wayne Allyn Root. __HTTP__ _E_ FRACK NOW & FRACK FAST!!! American prosperity depends on it. Our economic renaissance is here. _E_ .@BarackObama is bankrupting this country. His budget adds another $4.4T to the debt putting us over $20T in total debt by 2016. _E_ Too bad about New York Magazine but there's a much bigger one out there currently doing a story on me to get even that I'll soon discuss! _E_ Get ready for some excitement the live finale of the Celebrity Apprentice is on this Sunday night don't miss it! __HTTP__ _E_ "On 1/20 the day Trump was inaugurated an estimated 35000 ISIS fighters held approx 17500 square miles of territory in both Iraq and Syria. As of 12/21 the U.S. military est the remaining 1000 or so fighters occupy roughly 1900 square miles.." @jamiejmcintyre @dcexaminer __HTTP__ _E_ Limited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ _E_ Obama now wants to give another $450M to the Muslim Brotherhood. Money we don't have going to people that hate us. Moronic. _E_ Beautiful morning thank you @ICLV! __HTTP__ _E_ "Remember the golden rule of negotiating: 'He who has the gold makes the rules.'" – Midas Touch _E_ 7 million Americans are going to lose their jobs due to ObamaCare. 46 million face 300% premium increases. DEFUND! #MakeDCListen _E_ Heading to Phoneix. Will be arriving soon. Tomorrow a big day. Tremendous crowds expected! 
#Trump2016 #MakeAmericaGreatAgain _E_ Entrepreneurs: See yourself as victorious: Look at the solution not the problem. _E_ Even a mistake may turn out to be the one thing necessary to a worthwhile achievement. Henry Ford _E_ .@nbcsnl So much fun last night! _E_ Now that the Mexican drug lord escaped from prison everyone is saying that most of the cocaine etc. coming into the U.S. comes over border! _E_ This cannot be the the Academy Awards #Oscars AWFUL!!!!!!!!!!!!!!! _E_ Wow just released that $67 million in negative ads was spent on me. How am I still number one by a lot? _E_ .@meetthepress and @chucktodd very dishonest in not showing the new @CNN Poll where I am at 39% 21points higher than Cruz. Be honest Chuck! _E_ Joe Biden said that the Taliban 'is not our enemy.' I wonder how our troops in Afghanistant that are under attack view Biden's statement. _E_ I want to win for the people of this great country. The only people I will owe are the voters. #Trump2016 Video: __HTTP__ _E_ Remember politicians are all talk and NO action. Our country is a laughing stock that is going to hell. The lobbyists & donors control all! _E_ Almost daily more discrepancies in @BarackObama's biography continue to arise. Who is this guy? _E_ Via @FSMtweet: "Trump is Right: Illegal Alien Crime is Staggering in Scope and Savagery" __HTTP__ _E_ What do African Americans and Hispanics have to lose by going with me. Look at the poverty crime and educational statistics. I will fix it! _E_ My @TeamCavuto int. on simplifying the tax code our incompetent leaders Iran and making America great again __HTTP__ _E_ .@NBA hall of famer @dennisrodman brings his A game in the 13th season of All Star @CelebApprentice. This time Dennis is a star! _E_ The @SuperCommittee must cut spending not raise taxes. Washington has a spending problem not a revenue problem. _E_ I am a defender of @MileyCyrus who I think is a good person (and not because she stays at my hotels) but last night's outfit must go! 
_E_ For all of my fantastic supporters and for the U.S.A. we are going to win and MAKE AMERICA GREAT AGAIN maybe greater than ever before! _E_ My @FoxNews interview last night on @gretawire On 2012: I'll Wait and See __HTTP__ _E_ Joe Girardi @Yankees must play his starters even A Rod they got you there. _E_ #BuyAmericanHireAmerican __HTTP__ _E_ "The problem is that we have a president who is more concerned with pursuing some sort of bizarre ideological (cont) __HTTP__ _E_ Thank you St. Louis Missouri!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Hillary Clinton Deleted Emails Using Program Intended To Prevent Recovery #CrookedHillary __HTTP__ _E_ Will be in Cleveland Ohio w/ @mike_pence tonight join us: __HTTP__ Florida tomorrow @ 6pm: __HTTP__ _E_ Part 1 of my @SpecialReport int. with @BretBaier discussing why I am strongly considering running for President __HTTP__ _E_ The Iraqi army has squandered the majority of the weapons & training we gave them for 10 long years. When will we learn? _E_ #TBT WrestleMania 23 __HTTP__ _E_ 'Why Trump' __HTTP__ _E_ Thank you Maria B! __HTTP__ _E_ While I own properties across the world I am very excited about my new acquisition of @Doral in Miami. (cont) __HTTP__ _E_ For those few people knocking me for tweeting at three o'clock in the morning at least you know I will be there awake to answer the call! _E_ I'm in Scotland to open what we hope to be the greatest golf course in the world it's amazing. _E_ When in doubt Obama fundraises. He has held 393 fundraisers in six years. Another record. _E_ It's hardly any wonder that our country's manufacturing dominance has evaporated. #TimeToGetTough (cont) __HTTP__ _E_ I have met & spent a lot of time with families @ The Remembrance Project. I will fight for them everyday!... __HTTP__ _E_ .@bubbawatson What a great player you have turned out to be but also what a great guy! Congratulations on another fantastic Masters win. 
_E_ Join me in Ohio & Maine!Cincinnati Ohio tonight @ 7:30pm: __HTTP__ Maine Saturday @ 3pm... __HTTP__ _E_ Personally I think Douglas Durst's brother got screwed by Douglas no wonder he's angry. _E_ "All the things I love is what my business is all about." @MarthaStewart _E_ They succeed because they think they can. Virgil _E_ Fernando thank you for the GREAT review of The Blue Monster in South Florida Golf especially top 10 in the WORLD. I love @SOFLAGOLF! _E_ Trump Tower Punta del Este's cylindrical tower redefines the essence of luxury. On the sands of Playa Brava __HTTP__ _E_ .@Deadspin guys are total losers—they had their story stolen right from under their bad complexions—other media capitalized! _E_ After my meeting with the pastors it's off to Georgia for a big rally many thousands of great people will be there a beautiful movement! _E_ For the 1st time in American history America's 16500 border patrol agents have issue a presidential primary endorsement—me! Thank you. _E_ Thank you! #VoteTrump #ImWithYou __HTTP__ _E_ When @mcuban had his own show The Benefactor it totally "bombed!" _E_ Trump has big plans for improving @DoralResort __HTTP__ via @nbc's @GolfChannel @CadillacChamp _E_ Australia is a beautiful country with terrific people who love America. _E_ We must not allow ISIS to return or enter our country after defeating them in the Middle East and elsewhere. Enough! _E_ Even the liberal CRS is now reporting Obama Care will cause 200% premium increases __HTTP__ Surprised? @Newsmax_Media _E_ RT @DonaldJTrumpJr: A message from Donald J. Trump to NEW YORK! __HTTP__ _E_ Golf Channel & Donald Trump's World of Golf host a Celebrity Match 1/25 @ TNGC LA CA Mark Wahlberg vs. Kevin Dillon __HTTP__ _E_ So biased: @TIME made 'The Protester' as the person of the year. @TIME celebrates OWS but vilified the Tea Party last year. _E_ Happy 226th Birthday to the United States Coast Guard. Thank you @USCG! 
#CoastGuardDay __HTTP__ _E_ Thank you for joining me in Mandan ND Gov. @DougBurgum Lt. Gov. @BrentSanfordND @SenJohnHoeven @RepKevinCramer & @SenatorHeitkamp. __HTTP__ _E_ Very proud of my Executive Order which will allow greatly expanded access and far lower costs for HealthCare. Millions of people benefit! _E_ The media is on a new phony kick about my management style. I spend much less money & get much better results! What we need as Prez! _E_ Melania and I are hosting Japanese Prime Minister Shinzo Abe and Mrs. Abe at Mar a Lago in Palm Beach Fla. They are a wonderful couple! _E_ Ron Paul is right that we are wasting trillions of dollars in Iraq and Afghanistan. _E_ Crooked Hillary Clinton is guilty as hell but the system is totally rigged and corrupt! Where are the 33000 missing e mails? _E_ The real story that Congress the FBI and all others should be looking into is the leaking of Classified information. Must find leaker now! _E_ RT @TeamTrump: Quite simply @HillaryClinton mistreats women. #BigLeagueTruth #Debate2016 __HTTP__ __HTTP__ _E_ Despite spending $500k a day on TV ads alone #CrookedHillary falls flat in nationwide @QuinnipiacPoll. Having ZERO impact. Sad!! _E_ Celebrating 1237! #Trump2016 __HTTP__ _E_ I feel bad for all @VanityFair employees. Every day at work they see circulation going down as Graydon runs his bad food restaurants. _E_ Are NFL games getting boring or is it just my magnificent imagination? In any event I'm just not watching them much anymore! _E_ Bill O'Reilly calls Trump and campaign brilliant. In first place by 27 points. _E_ Come celebrate Thanksgiving in the Windy City at @TrumpChicago's 5 Star 5 Diamond Sixteen restaurant __HTTP__ _E_ Entrepreneurs keep this in mind: Great spirits have always encountered violent opposition from mediocre minds. Albert Einstein _E_ RT @paulsperry_: __HTTP__ _E_ Resolve never to quit never to give up no matter what the situation. 
Jack Nicklaus _E_ I heard that @Morning_Joe was very nice on Friday but that little Donny D a big failure in TV (& someone I helped) was nasty. Irrelevant! _E_ China's media is attacking @MittRomney while endorsing @BarackObama __HTTP__ Of course. Mitt knows it's Time To Get Tough. _E_ The stock of my shirt and tie maker just hit an all time high great going great product! _E_ Remember the huge amount of money raised by @JohnRich and company... #sweepstweet _E_ If Republican Senate doesn't get rid of the Filibuster Rule & go to a simple majority which the Dems would do they are just wasting time! _E_ Your questions about my desk answered in today's #trumpvlog... __HTTP__ _E_ "Once you learn to quit it becomes a habit." Vince Lombardi _E_ Join us today! Together we will #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_ Vanity Fair circulation down 20 percent. My third rate stalker should start looking for a new job. _E_ We need to be smart vigilant and tough. We need the courts to give us back our rights. We need the Travel Ban as an extra level of safety! _E_ Entrepreneurs who develop their Midas Touch do not work for money. They work to create or acquire assets. Focus on assets. Midas Touch _E_ So good to see the Saudi Arabia visit with the King and 50 countries already paying off. They said they would take a hard line on funding... _E_ As usual the weather people got it wrong in Tampa. They just look for headlines & ratings! _E_ "Faster And Cheaper Trump Finishes NYC Ice Rink @TrumpRink" __HTTP__ Gov. can be efficient w/leadership & business acumen. _E_ My new book #TimeToGetTough out Dec 5th outlines how to make America rich again. Order now through Amazon __HTTP__ _E_ Just purchased NBC's half of The Miss Universe Organization and settled all lawsuits against them. Now own 100% stay tuned! _E_ I did what was an almost an impossible thing to do for a Republican easily won the Electoral College! Now Tax Returns are brought up again? 
_E_ Crooked Hillary just can't close the deal with Bernie. It will be the same way with ISIS and China on trade and Mexico at the border. Bad! _E_ Trump Int'l Golf Links & Hotel Ireland fronts the Atlantic Ocean in County Clare for 2.5 miles. Extraordinary! __HTTP__ _E_ When we have big disasters no one comes to our aid or even suggests helping but we are always expected to come to the aid of others! _E_ By self funding my campaign I am not controlled by my donors special interests or lobbyists. I am only working for the people of the U.S.! _E_ I heard poorly rated @Morning_Joe speaks badly of me (don't watch anymore). Then how come low I.Q. Crazy Mika along with Psycho Joe came.. _E_ Had a fantastic time at yesterday's All Star @ApprenticeNBC press conference with @StephenBaldwin7 in @TrumpTowerNY. _E_ Join me tomorrow in Michigan!Grand Rapids at 12pm: __HTTP__ at 3pm: __HTTP__ __HTTP__ _E_ Why is the UN planning to attack @Israel's sovereignty and ignore Iran's nuclear program? The US should look at future funding. _E_ RT @TODAY_Clicker: Get ready @ApprenticeNBC fans! @realDonaldTrump promises plenty of mean and nasty action.. __HTTP__ _E_ Great job First Lady Melania! __HTTP__ _E_ Inspiration exists but it must find you working. Pablo Picasso _E_ We have got to get our Marine out of that disgusting Mexican jail. Would be so easy if we had a real leader. One tough phone call & he's out _E_ Fox and Friends _E_ Story written by a @HuffingtonPost reporter that the HuffPost refused to print. Total bias but we will prevail! __HTTP__ _E_ HAPPY 70th BIRTHDAY to the @USAirForce! The American people are eternally grateful. Thank you for keeping America PROUD STRONG and FREE! __HTTP__ _E_ President Obama is the greatest hoax ever perpetrated on the American people Clint Eastwood _E_ #TrumpVlog Why are we the sad suckers? __HTTP__ _E_ .@ericbolling Great job on The Five tonight and not only because you were so nice to The Apprentice. See you soon and thanks! 
_E_ #CelebrityApprentice Paul Teutul Sr. joined me for a press event in Trump Tower last week __HTTP__ _E_ Yes I will give my @SuperBowl pick tomorrow. Watch @_KatherineWebb cover it on @InsideEdition. _E_ The failing @WSJ Wall Street Journal should fire both its pollster and its Editorial Board. Seldom has a paper been so wrong.Totally biased! _E_ Via @GolfMonthly by @jake0reilly: "Trump to build five new holes at @TurnberryBuzz" __HTTP__ _E_ I love being in South Carolina. We are leading big in all of the State polls Saturday is a BIG day. MAKE AMERICA GREAT AGAIN! _E_ LETS GO AMERICA! Time to take backour country and #MakeAmericaGreatAgainWatch video & go#VoteTrump!  __HTTP__ _E_ Don't believe the lies every budget @BarackObama has delivered to Congress raises the income tax on EVERYONE __HTTP__ _E_ Congratulations to the Philadelphia Eagles on a great Super Bowl victory! _E_ The Amazon Washington Post fabricated the facts on my ending massive dangerous and wasteful payments to Syrian rebels fighting Assad..... _E_ Congratulations to @gohermie for winning the @ShellHouOpen. We are all proud of you @TNGCBedminster & all @TrumpGolf clubs! Great going! _E_ Russia took Crimea during the so called Obama years. Who wouldn't know this and why does Obama get a free pass? _E_ The Donald J.Trump Signature Collection exclusively available @Macys offers top styles in menswear. Dress your best __HTTP__ _E_ By popular(extremely) demand I will be live tweeting the #Oscars2014 on Sunday night. Tell all your friends I will not be pulling punches! _E_ LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_ Jeb is fighting to defend a catastrophic event. I am fighting to make sure it doesn't happen again.Jeb is too soft we need tougher & sharper _E_ Remember Trump ties & shirts @Macys for Fathers Day your father will love you even more! _E_ Filming for @CelebApprentice Season 13 is now into the 2nd week. 
The 'All Star' cast is already hard at work. _E_ .@YoungDems4Trump Thank you! _E_ While I believe I will clinch before Cleveland and get more than 1237 delegates it is unfair in that there have been so many in the race! _E_ I will be in Evansville Indiana with the great Bobby Knight (who last night endorsed me) at 12:00 this afternoon. See you there! _E_ .@jorgeramosnews Please send me your new number your old one's not working. Sincerely Donald J. Trump _E_ RT @JeffTutorials: @realDonaldTrump __HTTP__ _E_ The good news is that their ratings are terrible nobody cares! __HTTP__ _E_ Government needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. It's (cont) __HTTP__ _E_ RT @foxandfriends: Israeli PM Netanyahu praises U.S. policy changes during meeting with Defense. Sec Mattis __HTTP__ _E_ Thank you for sharing Amy. __HTTP__ _E_ The real story is that President Obama did NOTHING after being informed in August about Russian meddling. With 4 months looking at Russia... _E_ A great article about how ObamaCare has even further complicated the tax code and will hurt housing market __HTTP__ _E_ Via __HTTP__ __HTTP__ _E_ I will be interviewed on @foxandfriends at 8:40. A.M. Enjoy! _E_ Enjoy Celebrity Apprentice tonight at 9 a really great episode! _E_ Margaret Thatcher was the Iron Lady of the West. She promoted freedom & democracy a great leader & ally of America. _E_ Republicans have very strong hand in their fight against Obamacare lets see if they are willing and able to play it tuff ! _E_ The Sarasota Florida rally today was amazing. 12000 people chanting their love for our country. It's going to happen this is a MOVEMENT! _E_ My @TheBrodyFile int. from Iowa on how I would build a wall to secure our Southern Border & deduct costs from Mexico __HTTP__ _E_ Be sure to buy this month's @AmSpec magazine. Read "A Trump Card" my interview with Jeffrey Lord. 
_E_ Lying #Ted Cruz just (on election day) came out with a sneak and sleazy Robocall. He holds up the Bible but in fact is a true lowlife pol! _E_ Hillary Lies to Benghazi Families#CrookedHillary __HTTP__ _E_ Republicans and @MittRomney must get tough very soon. _E_ "The minute that you're not learning I believe you're dead." – Jack Nicholson _E_ I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Why did Pres Obama remove sanctions against Iran prior to negotiating rather than completing successful negotiation & then remove sanctions? _E_ Remember the worst thing you can do in a negotiation is seem desperate to make the deal. _E_ Claims for unemployment are at a 3 month high __HTTP__ Where's the @BarackObama recovery? _E_ However beautiful the strategy you should occasionally look at the results. Winston Churchill _E_ My @todayshow discussing the @CelebApprentice discussing the cast __HTTP__ _E_ "Do not view any failure as the end. Learn your lessons quickly then move on." – Think Big _E_ I will be interviewed tonight on @FoxNews by @SeanHannity at 9pmE. Enjoy! _E_ ...and they knew exactly what I said and meant. They just wanted a story. FAKE NEWS! _E_ ..(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong). Too bad the Dems have no one who can change tones! _E_ Superbowl Sunday is a great American tradition. The Colts and Saints are already champions but may the best team win! _E_ Hard to believe that Bernie Sanders has done such a complete fold. He got NOTHING for all of the time energy and money. The V.P. a joke! _E_ Jobs report is really bad beyond the worst projections.A bad day on Wall Street! _E_ ....that has served our country is put on a waiting list and gets no care. _E_ "Remember that some things are worth waiting for. Plans can change sometimes for good reason." – Trump Never Give Up _E_ "Never confuse a single defeat with a final defeat." F. 
Scott Fitzgerald _E_ "Courage is being scared to death but saddling up anyway." John Wayne _E_ The @WashingtonPost quickly put together a hit job book on me comprised of copies of some of their inaccurate stories. Don't buy boring! _E_ New South Carolina poll from PPP. Thank you! #VoteTrumpSC __HTTP__ _E_ .@CNN is so disgusting in their bias but they are having a hard time promoting Crooked Hillary in light of the new e mail scandals. _E_ Back by popular demand this year's All Star @ApprenticeNBC sees the return of @claudiajordan! Our fans love her. _E_ Thank you Omarosa for your service! I wish you continued success. _E_ Hillary and the Dems were never going to beat the PASSION of my voters. They saw what was happening in the last two weeks before the...... _E_ The just released Public Policy Polling (PPP national result) is the best yet. MAKE AMERICA GREAT AGAIN! _E_ When will the U.S. stop sending $'s to our enemies i.e. Mexico and others. _E_ RT @dcexaminer: Emails show Washington Post New York Times reporters unenthusiastic about covering Clinton Lynch meeting __HTTP__ _E_ Thank you @HauteLivingMag for naming @TrumpDoral the #1 golf course in Miami __HTTP__ _E_ We have to make America great again! _E_ Tomorrow night's episode of The Apprentice delivers excitement at QVC along with appearances by Isaac Mizrahi and Cathie Black. 10 pm on NBC _E_ ...Terrible for the economy and a job killer. China is laughing at us! _E_ Not the world only your tiny group of viewers the world doesn't care about you. @lawrence You're too stupid to (cont) __HTTP__ _E_ I will be doing @GMA @GStephanopoulos this morning at around 7:00. Likewise I will be doing @Morning_Joe at around 7:00. Figure it out! _E_ .@rushlimbaugh is right—the Republicans lost because they weren't conservative enough—or tough enough. _E_ It was a great honor to welcome the President of Turkey Recep Tayyip Erdoğan to the @WhiteHouse today! 
__HTTP__ _E_ The rigged system may have helped Hillary Clinton escape criminal charges but... __HTTP__ __HTTP__ _E_ Let's see what happens in the boardroom... #CelebApprentice _E_ Thank you! #Trump2016 __HTTP__ _E_ I have so much admiration and respect for the 2.4 million men and women of our Armed Forces. #TimeToGetTough _E_ ....for the Middle Class. The House and Senate should consider ASAP as the process of final approval moves along. Push Biggest Tax Cuts EVER _E_ My @foxandfriends interview from yesterday discussing how @BarackObama failed to show any leadership on th... (cont) __HTTP__ _E_ I believe Putin will continue to re build the Russian Empire. He has zero respect for Obama or the U.S.! _E_ A country that cannot protect its borders is a country destined to fail. Another broken promise by our leaders in Washington. _E_ If I run and if I win our country will be great again. last line of my @SRQRepublicans speech _E_ Thank you Nebraska!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Why didn't President Obama just go inside when it started raining yesterday common sense! The two Marines looked very uncomfortable & wet. _E_ I was in San Jose CA on Saturday for a sit down interview for the ACN national meeting which was attended by over 20000 people. Huge! _E_ For all of the haters and losers out there sorry I never went Bankrupt but I did build a world class company and employ many people! _E_ The ObamaCare website is unfixable & rumor has it that they will stop checks & balances—a free for all that will cost the country trillions _E_ Obama loves wasting our money. He just made another guarantee of $197M to a solar company __HTTP__ Cronyism! _E_ #CrookedHillary __HTTP__ _E_ Thomas Kinkade died. I happen to love the beauty of his paintings. He took a lot of heat from art critics who (cont) __HTTP__ _E_ Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud! 
_E_ This week will mark the 1 year anniversary of the attack in Benghazi that left 4 Americans dead. No answers! _E_ Wow what a day. So many foolish people that refuse to acknowledge the tremendous danger and uncertainty of certain people coming into U.S. _E_ Just won a big federal lawsuit similar in certain ways to the Trump U case but the press refuses to write about it. If I lost monster story! _E_ The military and first responders despite no electric roads phones etc. have done an amazing job. Puerto Rico was totally destroyed. _E_ Our vets are the pride of our nation. The VA scandal is a disgrace.If you can get food stamps so fast our vets should get immediate care _E_ Congrats to @TrumpWaikiki for being named @Orbitz Best In Stay Elite Award Winner for Oahu for 2014! _E_ I've been saying for three months that the bridge tolls to Staten Island are far too high and unfair just got lowered but not nearly enough _E_ The system is rigged. General Petraeus got in trouble for far less. Very very unfair! As usual bad judgment. _E_ This is not a media event or about Donald J. Trump this is about the United States of America. I will be... __HTTP__ _E_ I will be going to Texas as soon as that trip can be made without causing disruption. The focus must be life and safety. _E_ Based on the tremendous cost and cost overruns of the Lockheed Martin F 35 I have asked Boeing to price out a comparable F 18 Super Hornet! _E_ Go to work today be smart think positively and WIN! _E_ Pres. Obama is about to embark on a 17 day vacation in his 'native' Hawaii putting Secret Service away from families on Christmas. Aloha! _E_ With Barry Diller & Tina Brown in charge did anyone doubt that @Newsweek would be a massive failure? _E_ Verlander pitched great but @Yankees look truly defeated. _E_ We've just set a new goal: raise $4 million from our grassroots supporters by MIDNIGHT! __HTTP__ __HTTP__ _E_ Why should he? 
He's only the POTUS and @BarackObama has no opinion on whether the Senate should pass a budget. __HTTP__ _E_ Great deal we swap 5 killer terrorists for a U.S. military deserter. That's how the U.S. negotiates nowadays. _E_ Some of the women on Celebrity Apprentice are absolutely crazy maybe the wildest thing ever on reality television. Watch tonight! _E_ My @FoxNews interview with @gretawire where I discuss my potential GOP endorsement and the NH primary __HTTP__ _E_ "Action is the foundational key to all success." Pablo Picasso _E_ Thank you Portland Maine! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ I will Make Our Government Honest Again believe me. But first I'm going to have to #DrainTheSwamp in DC. __HTTP__ _E_ Champion @Joan_Rivers loves being on the other side of the table in the Boardroom. She leaves no punches out in @CelebApprentice! _E_ Dwyane Wade's cousin was just shot and killed walking her baby in Chicago. Just what I have been saying. African Americans will VOTE TRUMP! _E_ Just returned from Pennsylvania where we will be bringing back their jobs. Amazing crowd. Will be going back tomorrow to Gettysburg! _E_ Hillary Clinton lied last week when she said ISIS made a D.T. video. The video that ISIS made was about her husband being a degenerate. _E_ The Chinese laugh at how weak and pathetic our government is in combating intellectual property theft. (cont) __HTTP__ _E_ Time flies it's @TrumpTowerNY's 30th anniversary. To celebrate we made this video highlighting its amazing history __HTTP__ _E_ My @fox8news interview discussing the passing of my longtime friend Dick Clark. __HTTP__ A true TV legend who will be missed. _E_ Great bilateral meetings at Élysée Palace w/ President @EmmanuelMacron. The friendship between our two nations and ourselves is unbreakable. __HTTP__ _E_ Tim Kaine has been praising the Trans Pacific Partnership and has been pushing hard to get it approved. Job killer! 
_E_ I could fix tv talk shows that are doing poorly—there is tremendous talent out there waiting to be tapped—and nobody sees it! _E_ Congress is back.TIME TO CUT CAP AND BALANCE.There is no revenue problem.The Debt Limit cannot be raised until Obama spending is contained. _E_ RT @IngrahamAngle: The #CruzCrew prevailed! Smart for @MarcoRubio to keep his speech short & sweet. Ditto for @realDonaldTrump who was brie... _E_ Heading to Camp David for major meeting on National Security the Border and the Military (which we are rapidly building to strongest ever). _E_ Obamacare is a disaster. We must REPEAL & REPLACE. Tired of the lies and want to #DrainTheSwamp? Get out & VOTE... __HTTP__ _E_ RT @DRUDGE_REPORT: FORMER HOSTAGE SAYS PLANE WAITED UNTIL MONEY ARRIVED... __HTTP__ _E_ The reason Flake and Corker dropped out of the Senate race is very simple they had zero chance of being elected. Now act so hurt & wounded! _E_ The ratings of The Cycle on MSNBC a sad and pathetic show are way down. If they fired racist moron @Toure a truly stupid guy they live! _E_ Jeanne Shaheen wants amnesty for illegals placed the deciding vote for ObamaCare & opposes the 2nd Amendment. Vote her out in November! _E_ Are the Republicans going to blow their chance to take the Senate? Must focus on ObamaCare and amnesty. _E_ My @SquawkCNBC interview discussing the 57th St. crane damage from the storm and extending my $5M offer to Obama __HTTP__ _E_ Hillary's been failing for 30 years in not getting the job done it will never change. _E_ If you are steadfast in your efforts critics will be harmless. Achievers move forward and achievement is not a plateau it's a beginning. 
_E_ "Romney's $2 Billion Sacrifice for America" By Chris Ruddy @Newsmax_Media __HTTP__ _E_ RT @TeamTrump: 100% TRUE > @realDonaldTrump is right @HillaryClinton did call TPP 'the gold standard' #Debates2016 __HTTP__ _E_ ...dwindling subscribers and readers.They got me wrong right from the beginning and still have not changed course and never will. DISHONEST _E_ Government Funding Bill past last night in the House of Representatives. Now Democrats are needed if it is to pass in the Senate but they want illegal immigration and weak borders. Shutdown coming? We need more Republican victories in 2018! _E_ Joe Paterno's family should sue the idiots @PennState that made that ridiculous deal and commissioned the one sided report. _E_ Website Exposing Marco 'Amnesty' Rubio Goes Live: A 'Donor Class Puppet'? Breitbart __HTTP__ _E_ For an advance preview of the Miss USA 2013 contestants as well as other show details go to __HTTP__ _E_ #BARACKTAX QUOTE: If you have health insurance you're not getting hit with a tax. _E_ See the Ashley Judd ad by @karlrove and you will definitely vote for her and love Obama. _E_ NoKo has interpreted America's past restraint as weakness. This would be a fatal miscalculation. Do not underestimate us. AND DO NOT TRY US. __HTTP__ _E_ WIshing everyone a happy healthy and prosperous New Year! _E_ Wow I have just exceeded 2 million followers and in such a short time! _E_ Unless the Republican Senators are total quitters Repeal & Replace is not dead! Demand another vote before voting on any other bill! _E_ The Comedy Central Roast of Donald Trump last week was the #1 highest rated Comedy Central Roast ever...it brought in 3.5 milion viewers _E_ You have to love what you do or you are never going to be successful no matter what you do in life. Think Big _E_ Our airports are Third World horrible. Let's rebuild them by people who know how to do it inexpensively. _E_ When you are in a war or even a battle losing is not an option! 
_E_ Thank you Jeffrey Lord for the great article discrediting third rate @BuzzFeed site & slimebag reporter McKay Coppins.@PiersMorgan @AmSpec _E_ President Obama put himself in a very bad position when he talked about Syria crossing the RED LINE. Amazingly now he denies he said that! _E_ Thank you for the nice words this morning @KellyRiddell. Well delivered and totally logical! @CNN @FoxNews _E_ RT @Jenniffer2012: Thank you @realDonaldTrump for all the help you are providing for Puerto Rico. We're are grateful and happy to welcome y... _E_ "45 year low in illegal immigration this year." @foxandfriends _E_ What do you think about the push to put women into high intensity combat situations? _E_ The Washington Times Presidential Debate Poll:TRUMP 77% (18290)CLINTON 17% (4100)#DrainTheSwamp #Debate __HTTP__ _E_ The media tries so hard to make my move to the White House as it pertains to my business so complex when actually it isn't! _E_ I love reading about all of the geniuses who were so instrumental in my election success. Problem is most don't exist. #Fake News! MAGA _E_ I hope people are looking at the disgraceful behavior of Hillary Clinton as exposed by WikiLeaks. She is unfit to run. _E_ Terrible. Wind farms are provided permits by the US government which causes the programmatic killing of bald eagles. _E_ "Pride yourself on your ability to find creative solutions to tough problems. Think Big _E_ The Electoral College is actually genius in that it brings all states including the smaller ones into play. Campaigning is much different! _E_ Weekly Address #KatesLaw#NoSanctuaryForCriminalsActStatement: __HTTP__ __HTTP__ _E_ Is everyone enjoying ObamaCare's 21 new 2014 taxes? __HTTP__ It's Obama's special gift added on to your rising premium. _E_ "Do your duty and a little more and the future will take care of itself." Andrew Carnegie _E_ Head of Air Force's anti sexual assault unit arrested for sexual assault! 
It just seems that our Country is not what it used to be. _E_ The ultimate vacation destination @TrumpPanama's sleek design evokes a majestic sail fully deployed in the wind __HTTP__ _E_ At the end of the day Obama won the battleground states by less than 500000 votes. This was a winnable race. GOP needs to do better! _E_ By popular demand I will be tweeting on the very tainted Academy Awards tonight! _E_ Happy New Year to all of my Jewish friends and supporters. Shana Tova. Hopefully it will be a great year! _E_ Glad to see that the Egyptian Army is releasing Mubarek. As we see Obama never should have abandoned him. He was an ally. _E_ Check out the Trump Fabulous World of Golf site to meet the Fazio family master golf course designers.... __HTTP__ _E_ The Euro put in place to hurt the U.S. is done! will have less negative impact than most think. _E_ Great rally last night in Massachusetts. 2000 people at a house must be a record! Unbelievable spirit to MAKE AMERICA GREAT AGAIN. _E_ Some dope tweeted my message to my friend Bill Belichick incorrectly they called him Bob. Sorry Bill! @Patriots _E_ Newsmax article: 'Trump Declines Prime Time GOP Convention Speech' __HTTP__ _E_ Fact – every successful GOP Senate candidate just elected ran on repealing ObamaCare. In January it's time to move! _E_ We all know that chess is a game of strategy. So is business. Think Like a Champion _E_ I could fix existing Tappan Zee Bridge for peanuts. Unfortunately Gov Cuomo will end up spending more than $10B on this project. $25 tolls? _E_ I agree getting Tax Cuts approved is important (we will also get HealthCare) but perhaps no Administration has done more in its first..... _E_ Via @DMRegister by @AP: "Donald Trump talks economy with Republicans in Davenport" __HTTP__ _E_ Celebrate Martin Luther King Day and all of the many wonderful things that he stood for. Honor him for being the great man that he was! 
_E_ 'Economists say Trump delivered hope' __HTTP__ _E_ Will be doing @OutFrontCNN with @ErinBurnett tonight at 7 pm re: tax reductions and various other topics. _E_ "The thing about high corporate tax rates is that in the end companies aren't the ones who foot the bill consumers do." #TimeToGetTough _E_ Your tax dollars well spent. Over 1.295M ObamaCare enrollees will also be illegal immigrants __HTTP__ Are you surprised? _E_ .@KarlRove Had my best day ever in the polls one had me at 41% Morning Consult. Boston Globe Monmouth NBC and CNN all great. More! _E_ I had a great time in Iowa yesterday record crowds fantastic people! _E_ Weakness is very dangerous: @BarackObama is going to unilaterally disarm our nuclear arsenal. America keeps the world safe! _E_ I'm not a hunter and don't approve of killing animals. I strongly disagree with my sons who are hunters but (cont) __HTTP__ _E_ I have made my decision on who I will nominate for The United States Supreme Court. It will be announced live on Tuesday at 8:00 P.M. (W.H.) _E_ After allowing North Korea to research and build Nukes while Secretary of State (Bill C also) Crooked Hillary now criticizes. _E_ China is cooking up conspiracy theories that the Olympics are rigged. __HTTP__ They don't understand why they can't cheat. _E_ I am impressed with the scam @BarackObama pulled but the truth will come out. _E_ .@piersmorgan Russell has nothing going for himself except for energy & aggression. Without that he would be dead—a first class dummy! _E_ Crooked Hillary can't close the deal with Bernie Sanders. Will be another bad day for her! _E_ .@JohnKerry claims he has never stopped working" f/Pastor Abedini's release through "back channels. Where are the results? _E_ Vanity Fair party at Tribeca Film Festival was a bust. _E_ Adam Moss editor in chief of @NYMag is quickly losing his reputation in that @NYMag has become so boring and so irrelevant. 
_E_ Lying Ted Cruz and lightweight choker Marco Rubio teamed up last night in a last ditch effort to stop our great movement. They failed! _E_ Thank you @hardball_chris for your nice words. They are very much appreciated. I fully understand that you really get it. _E_ Rep. Lou Barletta a Great Republican from Pennsylvania who was one of my very earliest supporters will make a FANTASTIC Senator. He is strong & smart loves Pennsylvania & loves our Country! Voted for Tax Cuts unlike Bob Casey who listened to Tax Hikers Pelosi and Schumer! _E_ Do you think crooked @AGSchneiderman will ever challenge the NFL tax status? No—too many friends and contributors in @nfl? _E_ How can Crooked Hillary say she cares about women when she is silent on radical Islam which horribly oppresses women? _E_ ICYMI my @foxandfriends int. criticizing the GOP on ObamaCare the new Congress & 2016 __HTTP__ _E_ Make sure to verify the voting machine does not switch your vote. If you have any problems notify the poll workers. _E_ The fact that President Putin and I discussed a Cyber Security unit doesn't mean I think it can happen. It can't but a ceasefire can& did! _E_ Get rid of gun free zones. The four great marines who were just shot never had a chance. They were highly trained but helpless without guns. _E_ Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy and create many beautiful JOBS! _E_ RT @foxandfriends: .@Suffolk_Sheriff praises President Trump for making gang eradication a priority __HTTP__ _E_ The only deal the Republicans should accept is a complete repeal of ObamaCare. You have them on the run don't fold go for it! _E_ The Ryder Cup will be amazing this week. _E_ So many people are asking why isn't the A.G. or Special Council looking at the many Hillary Clinton or Comey crimes. 33000 e mails deleted? 
_E_ The failing @nytimes which never spoke to me keeps saying that I am saying to advisers that I will change. False I am who I am never said _E_ BREAKING NEWS: Obama has just made a trade with Russia. They get Florida California & our gold supply. We get borscht & a bottle of vodka. _E_ Today I was honored and proud to address the 45th Annual @March_for_Life! You are living witnesses of this year's March for Life theme: #LoveSavesLives. __HTTP__ _E_ Much bigger win than anticipated in Arizona. Thank you I will never forget! _E_ Entrepreneurs: Be passionate. You have to love what you're doing to be successful at it. _E_ We don't always think of our presidents as jobs and business negotiators but they are. Presidents are our (cont) __HTTP__ _E_ #CongressionalBaseballGame __HTTP__ _E_ We MUST have strong borders and stop illegal immigration. Without that we do not have a country. Also Mexico is killing U.S. on trade. WIN! _E_ I will be on @wolfblitzer for a @CNNSitRoom interview today. Please join us 5PM ET. _E_ A Warren Buffett corp. is currently ensnared in a bankruptcy. Likewise Icahn Kravis Apollo and many others have played the game.Thanks! _E_ Shameful. After trading 5 senior Taliban for a deserter the White House is now attacking Bergdahl's platoon __HTTP__ _E_ SUN newspaper/Scotland reports that Tourism jump is thanks to Trump. 8000 visitors in one month from 20 countries __HTTP__ _E_ A great crowd at Trump Tower for #TimeToGetTough book signing! _E_ On behalf of the entire family we would truly be honored to have your vote! Let's #MakeAmericaGreatAgain #EarlyVote __HTTP__ _E_ Very successful fund raising for @MittRomney yesterday. Good to see my friend Woody Johnson. _E_ RT @FoxNews: Jobs created in February. __HTTP__ _E_ From 2% to 27% in Texas quite a jump into first place! _E_ New orders for manufacturing down 9/10 months __HTTP__ Time for fair trade. Stop TPP! _E_ Great going Andy Roddick! Another victory for a fabulous player. 
Brooklyn Decker is good luck. _E_ .@McIlroyRory Great job Rory you have the heart and talent of a great champion. Work hard and win many more! See you at Turnberry. _E_ Just like I have warned from the beginning Crooked Hillary Clinton will betray you on the TPP. __HTTP__ _E_ Will be interviewed on @GMA at 7:00 A.M. Big wins last night! _E_ "A very good way to pave your own way to success is simply to work hard and to be diligent" – Think Like a Champion _E_ Alabama is sooo lucky to have a candidate like Big Luther Strange. Smart tough on crime borders & trade loves Vets & Military. Tuesday! _E_ I don't think Obama will do well in the second debate he is psyched out just like A Rod. _E_ Why are we continuing to train these Afghanis who then shoot our soldiers in the back? Afghanistan is a complete waste. Time to come home! _E_ "Always get even. When you are in business you need to get even with people who screw you." – Think Big _E_ The dishonest NY Daily News reporter advised my rep in writing story is dead and then put it out anyway. A total lie and she knew it! _E_ Prosperity is coming back to our shores because we are putting America WORKERS and FAMILIES first. #AmericaFirst __HTTP__ _E_ I fulfilled my campaign promise others didn't! __HTTP__ _E_ "WATCH: @MissUniverse contestants golf with Donald Trump @TrumpDoral" __HTTP__ via @KylePorterCBS by @CBSSports _E_ Memorial service today for beautiful and incredible Heather Heyer a truly special young woman. She will be long remembered by all! _E_ It came out that Huma Abedin knows all about Hillary's private illegal emails. Huma's PR husband Anthony Weiner will tell the world. _E_ You are witnessing the single greatest WITCH HUNT in American political history led by some very bad and conflicted people! #MAGA _E_ Entrepreneurs: Knowledge requires patience action requires courage. Put patience and courage together and you'll be a winner . _E_ Don't underestimate yourself or your possibilities. 
There are always opportunities. _E_ Featuring top spa in New York AAA Five Diamond Award @TrumpSoHo is Soho's most elite hotel & destination spot __HTTP__ _E_ MONDAY 11/7/2016Scranton Pennsylvania at 5:30pm. __HTTP__ Rapids Michigan at 11pm.... __HTTP__ _E_ I promised that my policies would allow companies like Apple to bring massive amounts of money back to the United States. Great to see Apple follow through as a result of TAX CUTS. Huge win for American workers and the USA! __HTTP__ _E_ According to the @nytimes a Russian sold phony secrets on "Trump" to the U.S. Asking price was $10 million brought down to $1 million to be paid over time. I hope people are now seeing & understanding what is going on here. It is all now starting to come out DRAIN THE SWAMP! _E_ Congratulations to Tom Scocca and Timothy Burke of @Deadspin for exposing the Manti Te'o fiasco. _E_ As I made very clear today our country needs the security of the Wall on the Southern Border which must be part of any DACA approval. _E_ .@GeraldoRivera Thanks my champion Geraldo and very true. _E_ In order to stay competitive in your industry it is imperative to keep up to date on all news. A great commodity is information. _E_ .@THEGaryBusey survives another week of All Star Celebrity @ApprenticeNBC. Gary is shifty and playing to win. _E_ Also tune in to the @TodayShow at 7:00am. I will be on to discuss the campaign my new ads and #CrippledAmerica. _E_ Somebody hacked the DNC but why did they not have hacking defense like the RNC has and why have they not responded to the terrible...... _E_ A new terror warning was issued for European cties. At what point do we say we have had enough and get really tough and smart. Weak leaders! _E_ Washington is simply incapable of any moderation because @BarackObama is such an extreme leftist. He must be defeated. #TImeToGetTough. _E_ RT @mike_pence: Congrats to my running mate @realDonaldTrump on a big debate win! Proud to stand with you as we #MAGA. 
_E_ WikiLeaks proves even the Clinton campaign knew Crooked mishandled classified info but no one gets charged? RIGGED! __HTTP__ _E_ I hope everyone read the brilliant article in American Spectator about leightweight A.G. Eric Schneiderman. He should be run out of office! _E_ ObamaCare contains marriage penalty taxes. Why should married couples be penalized for having healthcare? _E_ Will be on @ABC News tonight at 6:30. Interviewed by the legendary @BarbaraJWalters! Enjoy _E_ How did NBC get an exclusive look into the top secret report he (Obama) was presented? Who gave them this report and why? Politics! _E_ #TrumpAdvice __HTTP__ _E_ An investment in life luxury & leisure a Trump Nat'l Bedminster membership offers top amenities & services __HTTP__ _E_ I don't know why our allies are so surprised Obama is tapping their phones? Nothing changes! _E_ Today Barack Obama is standing in water in NJ. Remember on election day that he has put the US underwater. _E_ Via @LasVegasSun by Eugene Dunn: "2016 is the year of Donald Trump" __HTTP__ _E_ Our country is totally split right now but someday it will come together! _E_ My wife the beautiful @MELANIATRUMP will be appearing... #CelebApprentice _E_ Wow just 1 day after my offer to fund all WH tours Obama backtracks on decision to cancel all White House tours" ... _E_ Perhaps @BarackObama's biggest shortcoming as President is he failed to unite the country. _E_ $4 gasoline – wow—OPEC is very happy! _E_ .@JonahNRO watched on @seanhannity and appreciate your statements I have been waiting for them for a long time. Thank you. _E_ We win in our lives by having a champion's view of each moment. Donald J. Trump __HTTP__ _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ US News named the Top10 best hotels in the US and Trump Int'l Hotel & Tower NYC and Trump Int'l Hotel & Tower Chicago are on the list! _E_ Word is early voting in FL is very dishonest. 
Little Marco his State Chairman & their minions are working overtime trying to rig the vote. _E_ The White House is predicting 9% unemployment throughout 2012 – and when Obama Care takes effect in 2014 expect it to go even higher. _E_ As Governor of Texas Rick Perry could have done far more to secure the border but that's O.K. I like him anyway! @GovernorPerry _E_ It's a plain fact: free trade requires having fair rules that apply to everyone. #TimeToGetTough _E_ Thank you Virginia! #ImWithYou __HTTP__ _E_ Great to see how hard Republicans are fighting for our Military and Safety at the Border. The Dems just want illegal immigrants to pour into our nation unchecked. If stalemate continues Republicans should go to 51% (Nuclear Option) and vote on real long term budget no C.R.'s! _E_ 'President elect Donald J. Trump's CIA Director Garners Praise' __HTTP__ __HTTP__ _E_ Wow I just found out that in a major poll of its readers the @NewYorkObserver voted me #1 on the power 100 list in NY...... _E_ While China screws us with every turn of its currency is the biggest commercial espionage threat we face (cont) __HTTP__ _E_ Golf is a game of respect and sportsmanship we have to respect its traditions and its rules. Jack Nicklaus _E_ .@GOP must stay focused on defunding ObamaCare and the impending budget battle. Don't let Syria rule the agenda. _E_ Weekly Address 11:00 A.M. at the @WhiteHouse! #MAGA __HTTP__ __HTTP__ _E_ When I said that Hillary Clinton got schlonged by Obama it meant got beaten badly. The media knows this. Often used word in politics! _E_ Additionally @CelebApprentice ranked as the #1 program in the 9 11 pm time period with adults in the 25 54 age group. _E_ Via @Mediaite: Donald Trump Trashes 'Tacky' 'Boring' Oscars Blasts 'Racist' Django Unchained __HTTP__ _E_ Friday is the last day to enter the Counting Sheep for Hire contest. 
Click here www.youtube.com/user/mattressserta and you could win a trip _E_ Why do the Republicans always negotiate against themselves in public? Watching them operate these fiscal negotiations is painful. _E_ Lyin' Ted Cruz even voted against Superstorm Sandy aid and September 11th help. So many New Yorkers devastated. Cruz hates New York! _E_ With Joan Rivers and ivankatrump from last night's great boardroom! __HTTP__ _E_ Just signed contract to purchase the Ritz Carlton in Jupiter Florida great land great location great future! _E_ North Korea has shown great disrespect for their neighbor China by shooting off yet another ballistic missile...but China is trying hard! _E_ Entrepreneurs: If you cannot handle the tough times you will never be successful in business. Stay positive & stay strong! _E_ Great trip to Mexico today wonderful leadership and high quality people! Look forward to our next meeting. _E_ Find out where to #VoteTrump on caucus night in Iowa on 2/1/16!#IACaucus #FITN #Trump2016 __HTTP__ __HTTP__ _E_ Even though I refused to pay a ridiculous price for the Buffalo Bills I would have produced a winner. Now that won't happen. _E_ Thank you for your service! __HTTP__ _E_ It is a disgrace that my full Cabinet is still not in place the longest such delay in the history of our country. Obstruction by Democrats! _E_ Entrepreneurs: Watching you could be the motivation for your employees.Make it an example that will best serve the success of your business. _E_ Totally biased @NBCNews went out of its way to say that the big announcement from Ford G.M. Lockheed & others that jobs are coming back... _E_ Must read opinion piece by @Gallup CEO Jim Clifton: "The Big Lie: 5.6% Unemployment" __HTTP__ Just as I have long been saying... _E_ By failing to prepare you are preparing to fail. Benjamin Franklin _E_ .@NBC just announced that all 1 hour @CelebApprentice episodes are being expanded to 2 hours—it's amazing what good ratings will do! 
_E_ Entrepreneurs: Get a momentum going. Listen apply then move forward. Do not procrastinate. See opportunity as the perk that it is. _E_ I just got back from Russia learned lots & lots. Moscow is a very interesting and amazing place! U.S. MUST BE VERY SMART AND VERY STRATEGIC. _E_ I hope when the MSM runs its "interruption counters" they consider the # of times the moderators interrupted me com... __HTTP__ _E_ Markets are crashing all caused by poor planning and allowing China and Asia to dictate the agenda. This could get very messy! Vote Trump. _E_ When we're talking about math that doesn't add up how about $5 trillion of deficits over the last four years. @MittRomney _E_ No matter how diligent you are in evaluating a business deal there is invariably one factor you have no control over luck... _E_ ...time for Republicans & Democrats to get together and come up with a healthcare plan that really works much less expensive & FAR BETTER! _E_ Thanks @LilJon for coming to my defense in Rolling Stone Magazine. As I have often said you are a terrific guy! _E_ Moody's is out to make publicity. The bank downgrades from yesterday don't make up for @Moody's giving AAA (cont) __HTTP__ _E_ Our incompetent Secretary of State Hillary Clinton was the one who started talks to give 400 million dollars in cash to Iran. Scandal! _E_ Congrats to @Yankees on finishing 1st in the AL East. Derek Jeter is great good luck in the playoffs! _E_ Thank you! __HTTP__ _E_ Lightweight @DannyZuker is too stupid to see that China (and others) is destroying the U.S. economically and our leaders are helpless! SAD. _E_ Like it or not Edward Snowden is a SPY and should be tried as a SPY! He has stolen invaluable information and damaged us with other nations _E_ I am in Iowa watching all of these phony T.V. ads by the other candidates. All bull politicians are all talk and no action it won't happen! _E_ Rubio puts out ad that my pilot was a drug dealer not true not my pilot! 
Guy owned helicopter company don't think I ever even used. _E_ This is happening all over our country—great people being disenfranchised bypoliticians. Repub party is in trouble! __HTTP__ _E_ Today the U.S. flag flies at half staff at the @WhiteHouse in honor of National Pearl Harbor Remembrance Day. __HTTP__ __HTTP__ _E_ The final two @ArsenioOFFICIAL and @ClayAiken visited yesterday __HTTP__ _E_ .@LisaRinna looks better with her reduced lips. Good move Lisa. #CelebApprentice _E_ Due to the horrific events taking place in our country I have decided to postpone my speech on economic opportunity today in Miami. _E_ I'm glad that Mark Cuban won the ridiculous case with the S.E.C. It never should have been brought in the first place! _E_ If Obama doesn't accept my offer to be fully transparent what will he say? _E_ ...and an optimist is one who makes opportunities of his difficulties. Harry S. Truman _E_ Dumbass @BillMaher has still not given me the 5 million he committed to charity we just presented him with a demand notice. _E_ Coming together is a beginning. Keeping together is progress. Working together is success. Henry Ford _E_ If we reelect @BarackObama the America we leave our kids and grandkids won't look like the America we were (cont) __HTTP__ _E_ 13 Syrian refugees were caught trying to get into the U.S. through the Southern Border. How many made it? WE NEED THE WALL! _E_ I think it would be a good idea—and fair—to include @GovChristie & @MikeHuckabeeGOP in the debate. Both solid & good guys. @FoxBusiness _E_ Great numbers on the economy. All of our work including the passage of many bills & regulation killing Executive Orders now kicking in! _E_ .@DRUDGE_REPORT's First Presidential Debate Poll:Trump: 80%Clinton: 20%Join the MOVEMENT today & lets #MAGA!... __HTTP__ _E_ Via @Investopedia by @swan_investor: The Irreplaceable Brand Of Donald Trump __HTTP__ _E_ "Talent hits a target no one else can hit. Genius hits a target no one else can see." 
– Arthur Schopenhauer _E_ "Donald Trump Wishes Kristen Stewart A Happy Birthday" __HTTP__ via @HollywoodLife _E_ 'Trump rally disrupter was once on Clinton's payroll' __HTTP__ _E_ Interesting how President Obama is flying around in a Boeing 747 on so called Earth Day! _E_ The protesters in New Mexico were thugs who were flying the Mexican flag. The rally inside was big and beautiful but outside criminals! _E_ I will be interviewed on @seanhannity tonight at 10:00. You will find it very interesting (I hope). Enjoy! _E_ Maybe I'm old fashioned but I don't like seeing women in combat. _E_ Don't reward Mitt Romney who let us all down in the last presidential race by voting for Kasich (who voted for NAFTA open borders etc.). _E_ Donna Brazile just stated the DNC RIGGED the system to illegally steal the Primary from Bernie Sanders. Bought and paid for by Crooked H.... _E_ Well the New Year begins. We will together MAKE AMERICA GREAT AGAIN! _E_ If you want to succeed keep your edge. Staying on top of all new developments in your sector = major advantage that pays dividends. _E_ Will be going to Richmond Virginia today. Big crowd! See you there. _E_ Failing comedian Bill Maher who I got an accidental glimpse of the other night is really a dumb guy just look at his past! _E_ "Create your own visual style... let it be unique for yourself and yet identifiable for others." Orson Welles _E_ I like Michael Douglas! _E_ I hope you are watching the Apprentice...tonight's show is great and Brett Michaels is back! _E_ Understand that difficulties mistakes and setbacks are an inevitable part of business and life...But always look for the opportunities. _E_ ...a tool of anti Trump political actors. This is unacceptable in a democracy and ought to alarm anyone who wants the FBI to be a nonpartisan enforcer of the law....The FBI wasn't straight with Congress as it hid most of these facts from investigators." 
Wall Street Journal _E_ Re: Decisions: Cover your bases then ask yourself this question: What am I pretending not to see? This can save a lot of time & trouble. _E_ The Chinese are better off than they were 4 years ago. They have stolen even more from us in jobs & trade during @BarackObama's term. _E_ The fact that we are here today to debate raising America's debt limit is a sign of leadership failure. Sen. Obama 3/16/06 _E_ Looking forward to speaking at the @NHGOP #FITN Republican Leadership Summit on Saturday at 12PM! Let's Make America Great Again! _E_ The Fake News media is officially out of control. They will do or say anything in order to get attention never been a time like this! _E_ It's amazing how celebrities such as @Cher can say horrible untrue things about Republican politicians and it's (cont) __HTTP__ _E_ May God be with the people of Sutherland Springs Texas. The FBI and Law Enforcement has arrived. _E_ I don't know how Al Michaels could have been drunk and arrested on Friday night if he was totally sharp on Saturday morning. _E_ We must suspend immigration from regions linked with terrorism until a proven vetting method is in place. _E_ Join me live in Toledo Ohio!#MakeAmericaGreatAgain __HTTP__ _E_ The Democrats when they incorrectly thought they were going to win asked that the election night tabulation be accepted. Not so anymore! _E_ Making money is art and working is art and good business is the best art. Andy Warhol _E_ Hillary Clinton should have been prosecuted and should be in jail. Instead she is running for president in what looks like a rigged election _E_ Looking forward to hosting @NaghmehAbedini next week @TrumpTowerNY. The White House has abandoned her husband Christian Pastor Abedini. _E_ Donald Trump's birther event is the greatest trick he's ever pulled __HTTP__ _E_ Thank you America! Together we will #MakeAmericaGreatAgain! 
__HTTP__ _E_ With one of the worst and most prolonged cold spells in history with Atlanta Texas and parts of Florida freezing Global Warming anyone? _E_ Thank you to Doug Parker and American Airlines for all of the help you have given to the U.S. with Hurricane flights. Fantastic job! _E_ If you like to work hard you will attract people with the same ethic. Think Like a Billionaire _E_ 70 stories over Panama Bay @TrumpPanama is the country's first five star development. A masterpiece __HTTP__ _E_ What a coincidence that Obama's good friends in Libya and Egypt picked 9/11 to attack our embassies. _E_ Trace and his team raised an amazing amount of $. Looks like a good season for charities. _E_ Thank you New Hampshire! #MakeAmericaGreatAgain __HTTP__ _E_ Sad case @USATODAY did article saying I don't pay bills false only don't pay when work is shoddy bad or not done! They should do same! _E_ I will be interviewed on @foxandfriends tomorrow morning at 7:00. Enjoy! _E_ On 1300 acres in Charlottesville @trumpwinery's wine has been awarded the coveted Virginia Double Gold Medal __HTTP__ _E_ Good.morning I'm going to work! _E_ RT @TeamTrump: It's US vs. them! @realDonaldTrump will fight for you! #BigLeagueTruth #Debates _E_ The @CadillacChamp returns to @TrumpDoral on March 6th __HTTP__ Watch top golfers of the world battle the Blue Monster! _E_ Rising gas prices are causing a steep rise in consumer prices and will slow any future economic growth. It is a tax on all Americans. _E_ Thank you @mcuban for your nice words. I am rapidly becoming a @dallasmavs fan! __HTTP__ _E_ Re Negotiation: Realize that persistence can go a long way. Being stubborn is often an attribute. _E_ Receiving thousands of thank you letters from @LibertyU students for my convocation speech. The honor was all mine! Great people. _E_ Irony! @BarackObama was in Florida yesterday fundraising. Gas also rose to $6/gallon for Florida drivers yesterday. 
__HTTP__ _E_ Other worthy people were taken off the @CNBC list as well. Stupid poll should be canceled—no credibility. _E_ You have to love what you do or you are never going to be successful no matter what you do in life." Think Big _E_ Stay tuned for my big Obama announcement probably on Wednesday. _E_ I will be on @MeetThePress with @ChuckTodd tomorrow morning at 10:30am ET on @NBC. Enjoy! _E_ Join me in Atlanta on Wednesday at noon! #Trump2016Tickets: __HTTP__ __HTTP__ _E_ Michael Forbes is a loser who failed to stop what was just named "the golf course of the year" and which has brought ... _E_ RT @DRUDGE_REPORT: CLINTON EMAIL LED TO EXECUTION IN IRAN? __HTTP__ _E_ .@heytana great job we are all proud of you! _E_ On this solemn day of remembrance we can all take joy in the fact that Bin Laden's last sight was a Navy SEAL pulling the trigger. _E_ Standing ovation after promising to bring the American Dream back and better than ever before! __HTTP__ _E_ Remember NBC increased Celebrity Apprentice to 2 hours starting this Sunday night at 9 P.M. through end of season great news for App lovers _E_ ObamaCare is already done. HHS Sec. Sebelius is trying to force private companies to finance implementation __HTTP__ _E_ .@TrumpPanama is Panama City's premiere hotel. 70 stories over Punta Pacifica excellence has arrived to So. America __HTTP__ _E_ The harder you work the luckier you get. Gary Player _E_ Dummy @Clare_OC from failing @Forbes magazine: NASCAR deal was 1 nite ballroom ESPN was small golf outing... _E_ I have hired renowned golf course architect Gil Hanse to rebuild The Blue Monster at Doral. He designed the 2016 (cont) __HTTP__ _E_ Deserter Bergdahl returns to active duty as parents of brave soldiers killed looking for him grieve. Obama trying to play this mistake down! _E_ I wonder what the answer is on @BarackObama's college application to the question: place of birth? 
Maybe the (cont) __HTTP__ _E_ Republicans and Democrats should get back to work immediately to work on resolving downgrade. This is not a go... (cont) __HTTP__ _E_ Hillary Clinton doesn't have the strength or stamina to be president. Jeb Bush is a low energy individual but Hillary is not much better! _E_ .@JoselynMartinez is a very brave woman who caught her father's killer __HTTP__ She visited Ivanka & me at Trump Tower today. _E_ Credibility is important to me hence must admit that both candidates did really well last night. #VPDebate _E_ Trump: If Republicans 'don't get tough they're not going to win this election' __HTTP__ via @thehill _E_ Thank you Great Faith Ministries International Bishop Wayne T. Jackson and Detroit! __HTTP__ _E_ As I've said many times before Jon Stewart @TheDailyShow is highly overrated. _E_ Will be interviewed by @seanhannity tonight for the full hour. Hope you enjoy it and more importantly hope you agree! _E_ Congress must end chain migration so that we can have a system that is SECURITY BASED! We need to make AMERICA SAFE! #USA __HTTP__ _E_ On my way to see the great people of Maine. Will be landing in Portland in 2 hours. Look forward to it! #Trump2016 _E_ A Rod's salary is more than the entire @astros. Half the players on @astros will have better seasons than him. A Rod is a joke! _E_ It's okay but why do the haters (& losers) want to follow me on twitter?? Get a life! _E_ #TrumpAdvice __HTTP__ _E_ The Hostess closing did not have to happen should have been an easy deal to make. _E_ Via @PatheosFamily by @BristolsBlog: Trump Weighs In on Saeed: Obama 'Didn't Even Ask' __HTTP__ Thanks Bristol! _E_ How many more times do we all have to watch and pay for that stupid and never ending #SmokeyBearHug commercial. How much is govt. spending? _E_ Trump's Tax Plan: A Proposal Reagan Would Approve? by Jeff Bell __HTTP__ _E_ The Donald J. Trump Signature Collection's new line is out @Macys ties shirts accessories great & going fast! 
__HTTP__ _E_ .@GolfMagazine is great thanks! _E_ Via Hardball with Chris Matthews __HTTP__ _E_ Trump Golf Links at Ferry Point will host many major championships over the years. Great thing for NYC—congratulations to all! _E_ Trump lays out big plans for Doonbeg resort: Billionaire says investment shows Ireland's economy recovering __HTTP__ _E_ RT @MittRomney: I am running for president to get us creating wealth again not to redistribute it. _E_ Thank you @FrankLuntz __HTTP__ _E_ The home of the boardroom @TrumpTowerNY __HTTP__ #CelebApprentice _E_ Just watched Full Metal Jacket can't believe R. Lee Ermey didn't win the Academy Award as the drill sergeant. Political nominations! _E_ British PM Cameron is making a fool of himself by wasting billions of pounds on unwanted & environment destroying Scottish windmills. _E_ I wonder what the next scandal will be in D.C.? Can we handle yet another? _E_ Will be in Phoenix Arizona on Wednesday. Changing venue to much larger one. Demand is unreal. Polls looking great! #ImWithYou _E_ China is our enemy. It's time we start acting like it...and if we do our job corectly China will gain a whole (cont) __HTTP__ _E_ WE ARE MAKING AMERICA GREAT AGAIN! __HTTP__ _E_ We should be focusing on beautiful clean air & not on wasteful & very expensive GLOBAL WARMING bullshit! China & others are hurting our air _E_ #TRUMP International Reality will be America's premiere real estate brokerage house __HTTP__ w/ the most distinctive services. _E_ Thank you Lexington South Carolina!#Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Great knockout on Saturday by Juan Manuel Marquez on Manny Pacquiao. A great fight! _E_ "The most terrifying words in the English language are: I'm from the government and I'm here to help." – Pres. Ronald Reagan _E_ Congratulations Eric & Lara. Very proud and happy for the two of you! __HTTP__ _E_ #MidasTouch is divided into five sections. 
The second is the index finger which represents Focus __HTTP__ _E_ Join me live in Hershey Pennsylvania! #MakeAmericaGreatAgain LIVE: __HTTP__ __HTTP__ _E_ .@HighSock_Sunday #asktrump __HTTP__ _E_ Congratulations to the Houston @Astros 2017 #WorldSeries Champions#HoustonStrong #EarnHistory __HTTP__ _E_ Foreign leaders are already requesting meetings with @MittRomney to warn that we are viewed as in decline __HTTP__ _E_ So many positive things going on for the U.S.A. and the Fake News Media just doesn't want to go there. Same negative stories over and over again! No wonder the People no longer trust the media whose approval ratings are correctly at their lowest levels in history! #MAGA _E_ Via @BreitbartNews by @rwildewrites: "TRUMP: 'I WOULD BUILD A BORDER FENCE LIKE YOU HAVE NEVER SEEN BEFORE'" __HTTP__ _E_ Anyone who doubts the strength or determination of the U.S. should look to our past....and you will doubt it no longer. __HTTP__ _E_ If I win I am going to instruct my AG to get a special prosecutor to look into your situation bc there's never been anything like your lies. _E_ Will be interviewed by @MariaBartiromo on @FoxBizAlert at 7:30 A.M. Enjoy! _E_ Thank you! #Trump2016 __HTTP__ __HTTP__ _E_ In this time of economic turmoil where millions of Americans are unemployed our tax dollars are paying @BillMoyers' big @PBS salary! _E_ You can only smile when the losers of the world try so hard to put down successful people. Just remember they all want to be YOU! _E_ Watch my interview with Greta Van Susteren on her show On the Record tonight on Fox News in the 10 p.m. hour. _E_ A disgraceful verdict in the Kate Steinle case! No wonder the people of our Country are so angry with Illegal Immigration. _E_ CNN/ORC Poll results just out for Nevada—WOW! Trump 38 Carson 22 Fiorina 8 Bush 6 Cruz 4 __HTTP__ _E_ ...Remember I told you so. _E_ Thank you Alabama! From now on it's going to be #AmericaFirst. Our goal is to bring back that wonderful phrase:... 
__HTTP__ _E_ Just got final renderings of Trump National Doral in Miami there will be nothing like it in the Country will be the best! _E_ China has hacked another US government body. __HTTP__ will we learn? _E_ The failing @nytimes wrote a story about my management style & that I don't have many people. I have 73 Hillary has 800 & I'm beating her. _E_ Ted Cruz complains about my views on eminent domain but without it we wouldn't have roads highways airports schools or even pipelines. _E_ ObamaCare is torturing the American People.The Democrats have fooled the people long enough. Repeal or Repeal & Replace! I have pen in hand. _E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Amazing both Transformers & Dark Knight Rises featured Trump properties and each grossed over $1B. Just coincidence. _E_ ....... I disagree but it's still cool. _E_ People are loving the new line of Trump ties and shirts at Macy's. Check them out! _E_ RT @IsraeliPM: PM Benjamin Netanyahu at weekly Cabinet meeting:In two weeks Israel will host @POTUS Trump on his first trip as President... _E_ It was an honor to welcome so many truckers and trucking industry leaders to the @WhiteHouse today! __HTTP__ _E_ I hear a failing New York newspaper is going to publish one of my old cell phone numbers. So original just one of many! _E_ Hopefully the violent and vicious killing by ISIS of a beloved French priest is causing people to start thinking rationally. Get tough! _E_ Looking forward to meeting the students of Urbandale High School tomorrow __HTTP__ _E_ Julian Assange said a 14 year old could have hacked Podesta why was DNC so careless? Also said Russians did not give him the info! _E_ With @TraceAdkins on top of the truck the crowd definitely buzzed. #CelebApprentice _E_ Join me on Saturday in Syracuse New York! 
#NYPrimary #Trump2016 __HTTP__ __HTTP__ _E_ My interview yesterday from Newsmax Obama Is 'Now Totally Lost' Boehner Must Not Fold __HTTP__ _E_ Congratulations to @GovMikeHuckabee on last night's tremendous speech. Mike united the party faithful and explained that we can do better. _E_ Why do so many people say I hate President Obama—I don't hate the President at all. I just disagree with his policies! _E_ If you treat people right they will treat you right...ninety percent of the time. Franklin D. Roosevelt _E_ Republicans are going for the big Budget approval today first step toward massive tax cuts. I think we have the votes but who knows? _E_ Pervert Anthony Wiener will never be able to get away from his perversion the cure rate is ZERO. _E_ Great! __HTTP__ _E_ "China presents three big threats to the United States in its outrageous currency manipulation its systematic (cont) __HTTP__ _E_ President Obama looks and sounds so ridiculous making his speech in Cuba especially in the shadows of Brussels. He is being treated badly! _E_ Negotiation tip: Know exactly what you want and focus on that. Trust your instincts even after you've honed your skills. _E_ The Jets just don't have it. Time for a quarterback change! _E_ How does Obama rationalize giving Iran $8B in sanction relief when a Christian pastor is being tortured in an Iranian prison? _E_ Via @MailOnline @dmartosko Donald Trump says it's morally unfair of Obama to send soldiers into Ebola hot zone __HTTP__ _E_ True thanks. __HTTP__ _E_ Ungrateful TRAITOR Chelsea Manning who should never have been released from prison is now calling President Obama a weak leader. Terrible! _E_ Why is @BarackObama continuing to lie? __HTTP__ has found that @MittRomney did not ship jobs overseas __HTTP__ _E_ CLINTON'S FLAILING SYRIA POLICY WAS JUDGED A FAILURE: __HTTP__ #VPDebate _E_ How many illegal foreign donations will Obama collect this final week? Another scandal ignored by the liberal media. 
__HTTP__ _E_ .@mike_pence was fantastic tonight. Will be a great V.P. _E_ I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_ Discussing #NewYorkValuesin Buffalo last night on the eve of the #NYPrimary.LETS GO NY! #VoteTrump __HTTP__ _E_ I am making a big speech the night of the @FoxNews debate but I wish everyone well. Yesterday was a big day for me with 5 wins! _E_ The media has not covered my long shot great finish in Iowa fairly. Brought in record voters and got second highest vote total in history! _E_ Just named General H.R. McMaster National Security Advisor. _E_ Think big set your vision high and go for it. You'll be shocked by what you can accomplish when you do. Midas Touch _E_ Thank you! WE WILL MAKE AMERICA GREAT AGAIN! #Trump2016 __HTTP__ _E_ Always pretend that you're working for yourself. You'll do a wonderful job. It's simple but it works. Think Like a Billionaire _E_ New national Bloomberg poll just released thank you! Join the MOVEMENT: __HTTP__ #TrumpTrain... __HTTP__ _E_ .@IvankaTrump and me at the @todayshow this morning. __HTTP__ _E_ Great news out of New Hampshire! DonaldTrump is pulling away from the pack w/ 2nd is 17% behind him! #Trump2016 __HTTP__ _E_ I have an open door policy for my employees. I'm accessible because I like to know what's going on. The Midas Touch _E_ Our country has the slowest growth since 1929. #BigLeagueTruth #debate _E_ North Carolina lost 300000 manufacturing jobs and Ohio lost 400000 since 2000. Going to Mexico etc. NO MORE IF I WIN WE WILL BRING BACK! _E_ It's Tuesday. How many more customers has Glenfiddich lost today? _E_ After thousands lost and spending two trillion dollars Iraq (I told you so) is imploding. Really dumb pols put us and kept us there so sad! _E_ I have never seen a thin person drinking Diet Coke. _E_ Rick Perry did an absolutely horrible job of securing the border. He should be ashamed of himself. Gov. 
Abbott has since been terrific. _E_ I have been watching and loving the United States for many years and have NEVER seen it look weaker or less effective! _E_ My speech from last Saturday's @Citizens_United @AFPhq #NHFreedomSummit __HTTP__ via @cspan _E_ No one will work harder. No one will move heaven and earth like Mitt Romney to make this country a better place to live! @AnnDRomney _E_ Dopey Prince @Alwaleed_Talal wants to control our U.S. politicians with daddy's money. Can't do it when I get elected. #Trump2016 _E_ Victoria's Secret reps were nasty to @KateUpton and now she is doing great. _E_ DOW S&P 500 and NASDAQ close at record highs! #MAGA __HTTP__ _E_ RT @realDonaldTrump: ...big unnecessary regulation cuts made it all possible" (among many other things). "President Trump reversed the poli... _E_ The sub station in Blackdog is very dangerous on unregulated landfill—fire hazard! @AlexSalmond @pressjournal _E_ Exclusive Video–Broaddrick Willey Jones to Bill's Defenders: 'These Are Crimes' 'Terrified' of 'Enabler' Hillary __HTTP__ _E_ The Celebrity Apprentice has a two hour premiere this Sunday March 14th at 9 p.m. on NBC. This will be the best season yet see you then! _E_ CNN: New GOP polls show Trump's favorability is up __HTTP__ _E_ Located in the beautiful countryside of Mooresville @Trump_Charlotte has a superb clubhouse & top amenities __HTTP__ _E_ THe Chinese military is already hacking our satellites __HTTP__ The Chinese government is not an American ally. _E_ The Theater must always be a safe and special place.The cast of Hamilton was very rude last night to a very good man Mike Pence. Apologize! _E_ .@MarieLeff #asktrump __HTTP__ _E_ Obama is without question the WORST EVER president. I predict he will now do something really bad and totally stupid to show manhood! _E_ Lightweight A.G. Eric Schneiderman is perhaps the most incompetent and least respected A.G. in the U.S. He is a total joke! 
_E_ Mitch get back to work and put Repeal & Replace Tax Reform & Cuts and a great Infrastructure Bill on my desk for signing. You can do it! _E_ Happy Friday the 13th __HTTP__ _E_ Tweet me back if u think we should start a petition to fire @hardball_chris for his comments on Sandy & the death & destruction it caused. _E_ China's economy is now projected to overtake the US as the world's largest economy by 2027 __HTTP__ #TimeToGetTough _E_ Trump International Hotel & Tower Vancouver will be a fantastic addition to a spectacular city. __HTTP__ _E_ Donald Trump Explains Why He Called Django Unchained 'Racist' In Tweet __HTTP__ via @accesshollywood _E_ Via @Newsmax_Media by @melaniebatley: "Trump Backed Candidate @leezeldin Wins NY GOP Primary" __HTTP__ _E_ Thank you to Jeffrey Lord @AmSpec for his incredible & insightful article this weekend on failing & irrelevant @BuzzFeed _E_ Packed venue of people who want to #MakeAmericaGreatAgain __HTTP__ _E_ My @LateNightJimmy interview with @jimmyfallon discussing the new season of All Star @CelebApprentice __HTTP__ _E_ I will be speaking the night before the RNC in Sarasota FL when I receive the Statesman of the Year award. _E_ .@MittRomney will make us energy independent by 2020 __HTTP__ @BarackObama will keep wasting money on Solyndra projects. _E_ I'll be on with Larry Kudlow of the Kudlow Report tonight on CNBC at 7 p.m. We'll be discussing current affairs and politics. Tune in. _E_ Congrats to @BarackObama he has now had over 40 months straight of over 8% unemployment while accruing over $6T (cont) __HTTP__ _E_ Via @TVbytheNumbers:"TV Ratings Sunday 'Family Guy' & 'The Simpsons' Down 'All Star Celebrity Apprentice' Up" __HTTP__ _E_ Join me in Greensboro North Carolina tomorrow at 2:00pm! #TrumpRally __HTTP__ __HTTP__ _E_ Kate is donating a #kidney to her husband __HTTP__ . You can help! I did @fundanything #donate _E_ MAKE AMERICA GREAT AGAIN! 
MAKE AMERICA SAFE AGAIN!#Trump2016 #AmericaFirst __HTTP__ _E_ I have been consistent in my opposition to Common Core. Get rid of Common Core keep education local! _E_ .@ApprenticeNBC season premiere this Sunday at 9/8c on @NBC __HTTP__ _E_ "It's always great to be in business with Donald Trump" said @Telemundo president Emilio Romano. __HTTP__ _E_ The fact that Sneaky Dianne Feinstein who has on numerous occasions stated that collusion between Trump/Russia has not been found would release testimony in such an underhanded and possibly illegal way totally without authorization is a disgrace. Must have tough Primary! _E_ I will be on @foxandfriends Monday morning at 7.00. A lot to talk about! _E_ Thank you to former campaign adviser Michael Caputo for saying so powerfully that there was no Russian collusion in our winning campaign. _E_ Remember to keep going if you stop your momentum will stop. _E_ I have an idea for A Rod buy a home at @TrumpGolfLA overlooking the Pacific will bring you better luck. _E_ A letter from an amazing woman __HTTP__ _E_ The dollar always talks in the end although our pols are killing the dollar! _E_ Remember that Bill Clinton was brought in to help Hillary against Obama in 2008. He was terrible failed badly and was called a racist! _E_ Thank you to General Motors and Walmart for starting the big jobs push back into the U.S.! _E_ No surprise. @DNC displayed Russian ships in tribute to vets __HTTP__ Did they mean to honor the Russians? _E_ "Trump's Championship #BlueMonster Course Opens To Rave Reviews" __HTTP__ via @sacbee_news _E_ .@AlexSalmond don't worry my ad will be shown across the world and it is highly accurate! _E_ CLINTON IS WEAK ON NORTH KOREA: __HTTP__ #VPDebate _E_ I have always been the same person remain true to self.The media wants me to change but it would be very dishonest to supporters to do so! _E_ I hear @NBCNews / @WSJ came out with another one of their phony polls. 
While I am leading they are totally discredited after last S.C. poll _E_ Opening in 2016 @TrumpVancouver's original twisting design will transform the skyline at 616 ft. & 63 stories __HTTP__ _E_ Do as I say not as I do.The politicians who passed ObamaCare are now exempting themselves from the monstrosity __HTTP__ _E_ The Bay Bridge in San Francisco is being built by the Chinese tremendous cost overruns. A total mess. We should build our own bridges etc _E_ Masa (SoftBank) of Japan has agreed to invest $50 billion in the U.S. toward businesses and 50000 new jobs.... _E_ mention crime infested) rather than falsely complaining about the election results. All talk talk talk no action or results. Sad! _E_ The Obstructionist Democrats have given us (or not fixed) some of the worst trade deals in World History. I am changing that fast! _E_ Cleveland just made a very wise decision congrats! _E_ Thank you General. #Trump2016 __HTTP__ _E_ Trump International Golf Club Turnberry Scotland home to four of the greatest Open Championships of all time.. __HTTP__ _E_ Depression be careful of China! __HTTP__ _E_ The language used by me at the DACA meeting was tough but this was not the language used. What was really tough was the outlandish proposal made a big setback for DACA! _E_ Thank you for your support! We will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_ Incompetent @RichLowry lost it tonight on @FoxNews. He should not be allowed on TV and the FCC should fine him! _E_ ...about then candidate Trump. Catherine Herridge @FoxNews. So why doesn't Fake News report this? Witch Hunt! Purposely phony reporting. _E_ You have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_ Syria has prepared for an attack based on all of our talk they have moved targeted ammunition and supplies to new locations.Amazing! _E_ Becoming a US citizen is not a right it's a privilege. 
_E_ Does anybody really think that President Obama didn't know about our spying on the leaders of allies around the world not possible! _E_ .@TrumpNationalHV features wide open pristine fairways tour caliber greens 64 strategically placed sand bunkers __HTTP__ _E_ Congratulations to all of the "DEPLORABLES" and the millions of people who gave us a MASSIVE (304 227) Electoral College landslide victory! __HTTP__ _E_ The best way out is always through. Robert Frost _E_ Obstacles are those frightening things that become visible when we take our eyes off our goals. Henry Ford _E_ Nevada: A quick reminder that today is your last day to register to vote! __HTTP__ __HTTP__ _E_ .@jacknicklaus has done a GREAT job as the architect of my new golf course at Ferry Point. NYC is very proud! _E_ Obama's deal raises taxes on 77% of national households. With Obama Care taxes kicking in now everyone will be paying for his 2nd term. _E_ .@BrentBozell one of the National Review lightweights came to my office begging for money like a dog. Why doesn't he say that? _E_ We don't have the leadership including the Generals (who just said the element of surprise does not matter) to attack anyone! Cool it. _E_ Now we will never know if @BarackObama would have been able to fill Bank of America Stadium. Pretty convenient. _E_ Watching Senator Richard Blumenthal speak of Comey is a joke. Richie devised one of the greatest military frauds in U.S. history. For.... _E_ Anne Hathaway is a good winner! _E_ We launched a new series of #Trump2016 videos via Facebook. A new topic everyday! Watch: __HTTP__ __HTTP__ _E_ Congratulations to the $1B ObamaCare website on enrolling FOUR in Delaware. Cost to us $4M __HTTP__ _E_ Thank you! #AmericaFirst __HTTP__ _E_ Russia is on the move in the Ukraine Iran is nuking up & Libya is run by Al Qaeda yet Obama is busy issuing 'climate change" warnings. _E_ Just letting China know in advance that the USA will win the medal count in the Olympics. 
Even with your cheating you can't beat us. _E_ Hard for Biden to justify Libya mess but doing best he can. #VPDebate _E_ Eliot Spitzer's illegal frivolous & over reaching harassment of Hank Greenberg at AIG played a major part in 2008 financial meltdown. _E_ Getting ready for some big news with my friends at @pgaofamerica _E_ Will be on @foxandfriends at 8:00 A.M. _E_ Happy Birthday to my legendary friend Aretha Franklin. _E_ Great minds have purposes others have wishes. Washington Irving _E_ Courage is being scared to death... and saddling up anyway. John Wayne _E_ The failing @nytimes writes total fiction concerning me. They have gotten it wrong for two years and now are making up stories & sources! _E_ ...Who says the death penalty is not a deterrent? _E_ I will be heading to Dubai where I am doing a GREAT project with Damac will be a massive success! _E_ RT @foxandfriends: Hannity: Russia allegations 'boomeranging back' on Democrats __HTTP__ _E_ Mexico is allowing many thousands to go thru their country & to our very stupid open door. The Mexicans are laughing at us as buses pass by. _E_ From 10 11 pm @ApprenticeNBC ranks #1 in 18 49 among ABCCBS and NBC. #CelebApprentice _E_ .@MittRomney should not give any other further information until @BarackObama releases the things that everyone wants to see _E_ I just finished a great meeting with the Republican Senators concerning HealthCare. They really want to get it right unlike OCare! _E_ The American people are sick and tired of not being able to lead normal lives and to constantly be on the lookout for terror and terrorists! _E_ Entrepreneurs are visionaries in some respects they look beyond the present. Keep that in mind when looking for opportunities. _E_ We want to make sure that we have the workforce development programs we need to ensure these jobs are.... __HTTP__ _E_ "It takes guts to win fortunately most people don't have guts! Donald J. Trump _E_ I will be on @meetthepress at 10:30. 
@nbc will be releasing their new poll numbers. Based on the debate results I should do well who knows? _E_ .@FoxNews Chris Wallace: "More evidence of Dem collusion with Russia than GOP" __HTTP__ _E_ Will be interviewed on @JudgeJeanine at 9:00 P.M. Enjoy! _E_ President Obama refuses to answer question about Iran terror funding. I won't dodge questions as your President. __HTTP__ _E_ New national poll released. Join the MOVEMENT & together we will #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_ Remember China is not a friend of the United States! _E_ Ringling Brothers is phasing out their elephants. Ifor one will never go again. They probably used the animal rights stuff to reduce costs _E_ Egypt is a total mess. We should have backed Mubarak instead of dropping him like a dog. _E_ Phony Club For Growth tried to shake me down for one million dollars & is now putting out nasty negative ads on me. They are total losers! _E_ No matter what you're managing don't assume you can glide by. You have to work to maintain your momentum. Trump: How to Get Rich _E_ Economic confidence is soaring as we unleash the power of private sector job creation and stand up for the American Workers. #AmericaFirst _E_ I am a cautious optimist. Call it positive thinking with a lot of reality checks. _E_ SEE YOU IN COURT THE SECURITY OF OUR NATION IS AT STAKE! _E_ .@donlemon on @CNN at 10:00 P.M. _E_ It's driving @ariannahuff & the money losing @HuffingtonPost post crazy that I am #1 in their poll and they only write bad stories about me! _E_ The Misery Index is at a 28 year high. _E_ My economic policy speech will be carried live at 12:15 P.M. Enjoy! _E_ Women defy media narrative love Trump at packed Michigan rally.VIDEO: __HTTP__ __HTTP__ _E_ Thank you for your support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ For information on Trump University victory call Alan Garten Esquire at 212.836.3203 or Jeff Goldman Esquire at 212.867.4466. 
_E_ _xx_justme I still can't believe Donald Trump responded to my tweet. Respect #Trump2016 He would be the best Pres for this Country. Thx _E_ Wind turbines are a scourge to communities and wildlife. They are environmental disasters. _E_ Ivanka and Joan Rivers will be working hard tonight at the Live Finale everybody must watch the OPENING at 9. _E_ A new poll indicates that 68% of my supporters would vote for me if I departed the GOP & ran as an independent. __HTTP__ _E_ Via @DMRegister by @WilliamPetroski: Trump: I can make America great again __HTTP__ _E_ If Russia or some other entity was hacking why did the White House wait so long to act? Why did they only complain after Hillary lost? _E_ Sleazebag @BashirLive has just been forced to resign from @msnbc. His pathetic apology wasn't enough to save his job. @SarahPalinUSA _E_ Join me in Charleston WV tomorrow! __HTTP__ _E_ Wow CNN just said that Donald Trump won the DEBATE connected best with audience. Also Time Drudge Newsmax N.Y.Times and more! _E_ I should host the #Oscars just to shake things up this is not good! _E_ Ford said last week that it will expand in Michigan and U.S. instead of building a BILLION dollar plant in Mexico. Thank you Ford & Fiat C! _E_ Via @WashTimes By Eugene Dunn: "Trump could lead U.S. forward" __HTTP__ _E_ Today Americans everywhere remember the brave men and women of @NASA who lost their lives in our Nation's eternal quest to expand the boundaries of human potential. __HTTP__ __HTTP__ _E_ In '08 @PaulRyanVP predicted that US headed toward bankruptcy __HTTP__ @BarackObama has added over $6T in debt since. Scary. _E_ Make sure to vote today. Vote for real change. Change that will deliver jobs and a free & strong America. Vote for @MittRomney. _E_ I'll be on@SquawkCNBC tomorrow at 7:30 am #TrumpTuesday _E_ Under our President ISIS is gaining great strength __HTTP__ _E_ It was my great honor to deliver the #CGACommencement17 at the @USCGAcademy. 
CONGRATULATIONS to the Class of 2017!... __HTTP__ _E_ "Luck does not come around often. So when it does be sure to take full advantage of it even if it means working hard. Think Big _E_ Now another Obama speech from 2002 with him talking about taking the rich's 'stuff' __HTTP__ Who is this guy? Where's the media? _E_ Join @autismspeaks and light the world blue on 4/2. #LIUB will raise awareness for millions with autism! _E_ Big day in Alabama. Vote for Luther Strange he will be great! _E_ I will be doing Fox & Friends at 7 (15 minutes). Enjoy it and your day! _E_ Just announced that in the history of @CNN last night's debate was its highest rated ever. Will they send me flowers & a thank you note? _E_ Thank you Dallas Texas! __HTTP__ __HTTP__ _E_ The elites want Common Core so they can take education out of parental control. NO! Let's Make America Great Again! __HTTP__ _E_ I look forward to all meetings today with world leaders including my meeting with Vladimir Putin. Much to discuss.#G20Summit #USA _E_ According to Bill O'Reilly 80% of all the shootings in New York City are blacks if you add Hispanics that figure goes to 98%. 1% white. _E_ That was an amazing interview on @foxandfriends I hope the rest of the media picks it up to show how totally dishonest the @nytimes is! _E_ Now that it's almost over I can't believe that unions & management couldn't save Twinkies etc & management just got a $1.75M bonus. _E_ Record setting gas prices in the U.S. we're really looking dumb. Lots of $'s being made on us. _E_ He @MittRomney wrote a great piece on China __HTTP__ @JonHuntsman criticized him (cont) __HTTP__ _E_ At the foot of Whitestone Bridge in the Bronx @TrumpFerryPoint offers fantastic views of the Manhattan skyline __HTTP__ _E_ A note from the fabulous Mark Burnett: "Donald congratulations again we are #1 in the 10:00pm hour. I am tweeting about it." 
_E_ New Reuters poll thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ So don't forget to enter the Serta Counting Sheep for Hire contest! www.youtube.com/user/mattressserta _E_ ...if Congress gives us the massive tax cuts (and reform) I am asking for those numbers will grow by leaps and bounds. #MAGA _E_ Being in Detroit today was wonderful. Quick stop in Ohio to meet with some of our great supporters. Just got back home! _E_ Ted Cruz is in trouble for not reporting his bank borrowing in his very important Financial Disclosure Form. Very low interest loans scam! _E_ "@realDonaldTrump: I would like to extend my best wishes to all even the haters and losers on this special date September 11th." _E_ Via @bpolitics by @tdopp: "In Iowa Trump Promises to 'Surprise a Lot of People'" __HTTP__ _E_ To show you how dishonest some of the press is they took my funny & (cont) __HTTP__ _E_ 2016 GOP Nomination Polls have me as #1 as seen on @SpecialReport with @BretBaier. __HTTP__ _E_ It's inconvenient and inconsiderate: @BarackObama is doing a fundraiser tonight making it almost impossibl... (cont) __HTTP__ _E_ The election result in France is very disappointing. The Europeans have to embrace austerity in order for their economy to fully recover. _E_ I promise to rebuild our military and secure our border. Democrats want to shut down the government. Politics! _E_ State Department official accused of offering 'quid pro quo' in Clinton email scandal __HTTP__ _E_ I will be on FOX with the great @JudgeJeanine tonight at 9pm EST! Enjoy! #Trump2016 _E_ Hillary said at debate ISIS is going to people showing videos in order to recruit more radical jihadistst. She made up story want apology! _E_ Crooked Hillary Clinton is a fraud who has put the public and country at risk by her illegal and very stupid use of e mails. Many missing! _E_ I say we cannot continue to let Obama fly around on Air Force 1 at a cost of millions of dollars a day for the purpose of politics & play! 
_E_ Thank you to Chris Cox and Bikers for Trump Your support has been amazing. I will never forget. MAKE AMERICA GREAT AGAIN! _E_ I am very happy to have the civilian version of The Apprentice back on the air this fall. There will be excitement as well as opportunity. _E_ Wisconsin and Pennsylvania have just certified my wins in those states. I actually picked up additional votes! _E_ Great news! #MAGA __HTTP__ _E_ Michelle Obama's weekend ski trip toAspen makes it 16 times that Obamas have gone on vacation in 3 years. (cont) __HTTP__ _E_ Yesterday I was thrilled to be with so many WONDERFUL friends in Utah's MAGNIFICENT Capitol.It was my honor to sign two Presidential Proclamations that will modify the national monuments designations of both Bears Ears and Grand Staircase Escalante... __HTTP__ __HTTP__ _E_ Thank you @TrumpSoHo @TrumpNewYork for helping me celebrate #agreatcause @MarineCorpsLEF while accepting the Commandant's Leadership award! _E_ Never get good #'s from failing Des Moines Register/Bloomberg. I think something's going on w/them. Up 13 in IA according to respected CNN. _E_ ObamaCare is in serious trouble. The Dems need big money to keep it going otherwise it dies far sooner than anyone would have thought. _E_ The United States is prepared to work with each of the leaders in this room today to achieve mutually beneficial commerce that is in the interests of both your countries and mine. That is the message I am here to deliver today. #APEC2017 __HTTP__ _E_ Do you all remember how beautiful and safe a place Brussels was. Not anymore it is from a different world! U.S. must be vigilant and smart! _E_ Must read article on Obama's illegal fundraising from abroad __HTTP__ Foreign candidate getting foreign donations. _E_ ICYMI @nypost's @LoisWeiss described my Monday @ICSC speech @javitscenter as one of my "best and most riveting" __HTTP__ _E_ Hillary and her friends! __HTTP__ _E_ .@T Mobile has so many service complaints a total joke! 
_E_ I am in Las Vegas for the @MissUSA 2012 pageant. Watch live tonight on @NBC at 9PM ET. __HTTP__ _E_ NEBRASKA #VoteTrump TODAY!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Happy 70th Birthday @CIA! __HTTP__ _E_ We are fighting hard for Merit Based immigration no more Democrat Lottery Systems. We must get MUCH tougher (and smarter). @foxandfriends _E_ Big win in Montana for Republicans! We _E_ RT @EricTrump: #Arizona: We made it easy to find your polling location for today's primary! Simply visit __HTTP__ __HTTP__ _E_ Take a look at this amazing photo of the cast from the first ever All Star @CelebApprentice __HTTP__ _E_ Out of hundreds of deals & transactions I have used the bankruptcy laws a few times to make deals better. Nothing personal just business. _E_ Now an additional 600 700 jobs in America (2000) being eliminated for move to Mexico via Hartford Courant. __HTTP__ _E_ Outright disgusting the Obama administration has continually stonewalled and lied to US Amb. Sean Smith's mother __HTTP__ _E_ Hillary won't call out radical Islam! She will be soundly defeated. _E_ Why isn't the Arab League paying for everything and sending troops? They want us to do their dirty work with no involvement by themselves! _E_ The DC press corps is obsessed with my @CPACnews speech which is scheduled tomorrow 8:45AM in the Potomac Ballroom. Can't blame them. _E_ The haters and losers that assume I was a non athlete and know nothing about coaches should look into my past unlike our President open book _E_ .@MissUniverse final 3 on now. Great people great new owner @IMG. WATCH. _E_ RT @realDonaldTrump: Loser terrorists must be dealt with in a much tougher manner.The internet is their main recruitment tool which we must... _E_ Make sure to have fun and celebrate NYE with friends and family. Happy New Year everyone! 
_E_ Donald Trump Hands Bill O'Reilly Cable TV Viewership Win @deadline __HTTP__ _E_ Via @CBSNews by @ReenaJF: Donald Trump scolds Republicans: 'Toughen up' __HTTP__ _E_ The Fed continues to flood the market with US dollars. Wrong move. _E_ May jobless numbers have been readjusted to 8.2%. @BarackObama's economy is a disaster __HTTP__ New numbers tomorrow. _E_ There is only one person who should be crossing our southern border USMC Sgt. Tahmooressi. Boycott Mexico? #FreeOurMarine _E_ Thank you for the incredible support this morning Tampa Florida! #ICYMI watch here: __HTTP__ __HTTP__ _E_ The failing @nytimes should be focused on good reporting and the papers financial survival and not with constant hits on Donald Trump! _E_ Wake Up America China is eating our lunch. _E_ Great rally in New Mexico amazing crowd! Now in L.A. Big rally in Anaheim. _E_ I am in Indiana where we just had a great rally. Fantastic people! Staying at a Holiday Inn Express new and clean not bad! _E_ It is really a shame that Barack Obama may stop $5M from being generously donated to charity all because he refuses to be transparent. _E_ Just in big news I have been declared the winner of the CNMI Rep Caucus with 72.8% of the vote! Thank you! #SuperTuesday #VoteTrump _E_ I heard that the underachieving John King of @CNN on Inside Politics was one hour of lies. Happily few people are watching dead network! _E_ Here I am with @trishstratuscom #WWEHOF __HTTP__ _E_ What will be @RickSantorum's excuse tomorrow after @MittRomney wins Wisconsin and Maryland? Time for Rick to face reality and drop out. _E_ When it comes to China @BarackObama practices pretty please diplomacy. He begs and pleads and bows and it... (cont) __HTTP__ _E_ Guess which POTUS has held more fundraisers than the previous 5 combined? __HTTP__ @BarackObama is (cont) __HTTP__ _E_ Bernie Sanders endorsing Crooked Hillary Clinton is like Occupy Wall Street endorsing Goldman Sachs. 
_E_ Don King and so many other African Americans who know me well and endorsed me would not have done so if they thought I was a racist! _E_ Monday night at 8:00 will be must see television. Our wonderful Joan Rivers plays a major role as my advisor on the Apprentice. AMAZING! _E_ Why does @Greta have a fired Bushy like dummy John Sununu on spewing false info? I will beat Hillary by a lot she wants no part of Trump. _E_ There have been 17 shutdowns since 1976 14 under Reagan and Bush with Democrat Congresses who wanted more spending. _E_ Eight Syrians were just caught on the southern border trying to get into the U.S. ISIS maybe? I told you so. WE NEED A BIG & BEAUTIFUL WALL! _E_ There are 11 more Solyndras in the @BarackObama energy program __HTTP__  He loves to waste our (cont) __HTTP__ _E_ Uranium deal to Russia with Clinton help and Obama Administration knowledge is the biggest story that Fake Media doesn't want to follow! _E_ Military reps have attacked @BarackObama over Bin Laden leaks they believe he's just using this for his benefit. Not a big surprise... _E_ I will be making my announcement on the next Secretary of State tomorrow morning. _E_ Entrepreneurs are all unique. One way to build a business and turn it into a brand is to know who you are. Midas Touch _E_ .@dennisrodman looks like he really cleaned up his act. _E_ If you're going through hell keep going. Winston Churchill _E_ Hillary Clinton raked in money from regimes that horribly oppress women and gays & refuses to speak out against Radical Islam. _E_ To be successful never give up. My secrets to success will be shared at the National Achievers Congress in London. __HTTP__ _E_ Poll numbers way up making big progress! _E_ America's trade deficit with China is one of our greatest national security threats. Time for Fair Trade. We must produce our own products. _E_ My announcement is tomorrow! _E_ Sad to watch Bernie Sanders abandon his revolution. 
We welcome all voters who want to fix our rigged system and bring back our jobs. _E_ Trump rails on Romney as possible 2016 contender __HTTP__ via @nypost by @GeoffEarle _E_ The Mar a Lago Club the crown jewel of Palm Beach is a landmark in the National Register of Historic Places __HTTP__ _E_ Via @DailyCaller by @NeilMunroDC: "Trump Wants Ebola Travel Ban" __HTTP__ _E_ HAPPY BIRTHDAY to my son @EricTrump! Very proud of you! __HTTP__ __HTTP__ _E_ Listen to an interview with Donald Trump discussing his new book Think Like A Champion: __HTTP__ _E_ I as President want people coming into our Country who are going to help us become strong and great again people coming in through a system based on MERIT. No more Lotteries! #AMERICA FIRST _E_ If there is one more Ebola case in the U.S. a full travel ban will be instituted. This common sense move should have been done long ago! _E_ AMAZING @BarackObama has actually found a government program he can cut in half the Defense Department...bad (cont) __HTTP__ _E_ Iran's quest for nuclear weapons is a major threat to our nation's national security interests. We can't allow Iran to go nuclear. _E_ The Dallas event on September 14 at 6:00 P.M. at the American Airlines Center looks like it will be a giant success. Tickets are going FAST! _E_ Jeffrey Robinson's #TrumpTower has it all. The ultra rich powerful and beautiful. It's your summer must read. __HTTP__ _E_ The upcoming season of @CelebApprentice will be terrific a great cast. _E_ Have time to waste? Go to the ObamaCare website. _E_ With all of the jobs I am bringing back into the U.S. (even before taking office) with all of the new auto plants coming back into our..... _E_ Obama Spurns Trump Offer to Foot White House Tours __HTTP__ via @Newsmax_Media _E_ Thank you Florida! #SuperTuesday #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ I'll be on @foxandfriends on Monday at 7:30 AM... _E_ I will be interviewed on Face The Nation with @jdickerson this morning. Enjoy! 
_E_ Get out and vote Nebraska we will MAKE AMERICA GREAT AGAIN! _E_ The #CNBC 25 poll is a joke. I was in 9th place and taken off. (Politics?) No wonder @CNBC ratings are going down the tubes. _E_ Via @CBNNews by @TheBrodyFile: "Donald Trump To Brody File in 2011: People Send Me Bibles" __HTTP__ _E_ Our country is stagnant. We've lost jobs and business. We don't make things anymore b/c of the bill Hillary's husband signed and she blessed _E_ The two fake news polls released yesterday ABC & NBC while containing some very positive info were totally wrong in General E. Watch! _E_ Via @NRO by @LovelaceRyanD: Trump Slams Bush: 'I Don't See Him Winning I Don't See There's Any Way' __HTTP__ _E_ Obama's proposed budget has another middle class tax hike __HTTP__ Enjoy! _E_ Going to Ohio home of one of the worst presidential candidates in history Kasich. Can't debate loves #ObamaCare dummy! _E_ Just reported by CNN that the Trump halo effect caused a record shattering Democratic Debate rating of 15.3 million viewers. So true! _E_ United Nations Resolution is the single largest economic sanctions package ever on North Korea. Over one billion dollars in cost to N.K. _E_ Trump Will Make America GREAT!!!! #ChangeTheWorldIn5Words _E_ RT @atensnut: Hillary calls Trump's remarks horrific while she lives with and protects a Rapist . Her actions are horrific. _E_ Andy Roddick...a great tennis player is a fantastic guy with a wonderful wife. _E_ Thank you Michigan. We are going to bring back your jobs & together we will MAKE AMERICA GREAT AGAIN!Watch:... __HTTP__ _E_ The media is so after me on women Wow this is a tough business. Nobody has more respect for women than Donald Trump! _E_ .@garyplayer you were great on @MikeAndMike this morning—& the Gary Player Villa at @TrumpDoral is a hot ticket. _E_ Heading to Biloxi Mississippi. Massive crowds expected. Thank you for your support! #VoteTrump2016 __HTTP__ _E_ Obama's job approval is at 37% a record low. 
@GOP & @SpeakerBoehner have the leverage & momentum. Delay ObamaCare for all Americans! _E_ #MakeAmericaGreatAgain __HTTP__ _E_ .@foxandfriends interview re: North Korea firing @dennisrodman job report @MELANIATRUMP's debut & @WrestleMania __HTTP__ _E_ FORMAL ACCEPTANCE OF THE NOMINATION! #TrumpPence16 __HTTP__ _E_ .@Omarosa's emergency has put a new spin on Team Power's presentation—but it's not "show time" yet. #CelebApprentice _E_ Will be on @foxandfriends. Enjoy! _E_ .@CharlieRymerGC Charlie call me we'll set up a match with Gary and Damon. Doral now finished and doing great! _E_ I believe that in addition to the 5 terrorist leaders President Obama gave up for Bergdahl a great deal of CASH was also given. So stupid! _E_ I was relentless because more often than you would think sheer persistence is the difference between success and failure. NEVER GIVE UP! _E_ Great article on wind turbines by Robert Bryce in today's @NYPost __HTTP__ _E_ I loved the day Paul Goldberger got fired (or left) as N.Y.Times architecture critic and has since faded into irrelevance. Kamin next! _E_ I did interview with Chris Wallace of @FoxNews in order to be fair. He then puts on Rove Lane and Will three Trump bashers to discuss. _E_ Have you heard? China just told Obama to jump. Obama asked how high. _E_ Less than 1% of Obama's $4B immigration request will go towards immediate border security. A real scam. Enforce our laws now! _E_ Obama is addicted to spending America into insolvency. His record proves it. _E_ 14 African nations have totally banned West Africans from entering their nations. Likewise many other nations. But the U.S. = COME ON IN _E_ George Will said best debate he ever saw . If you ever heard George Will speak(boring) anything is exciting. _E_ Victory press conference was over. Why is she allowed to grab me and shout questions? Can I press charges? __HTTP__ _E_ I will be on Bill O'Reilly's show tonight at 8 PM talking about Iran and politics. 
@oreillyfactor _E_ I look very much forward to meeting w/Paul Ryan & the GOP Party Leadership on Thurs in DC. Together we will beat the Dems at all levels! _E_ I love Bluffton SC what a great place what great people. _E_ Congrats to @JimmieJohnson a great guy on winning Daytona! _E_ Thank you Pennsylvania. This is a MOVEMENT like we have never seen before! #VoteTrumpPence16 on 11/8/16 together... __HTTP__ _E_ People like @KatyTurNBC report on my campaign but have zero access. They say what they want without any knowledge.True of so much of media! _E_ How does frumpy & little read @nytimes editorial writer Gail Collins keep her job? She is totally irrelevant! @nytimescollins _E_ Happy birthday to the great @leegreenwood83. You and your beautiful song have made such a difference. MAKE AMERICA GREAT AGAIN! _E_ I watched Mark Cuban on Jay Leno last night what a jerk! _E_ Michael Barbaro the author of the now discredited @nytimes hit piece on me with women has in past tweeted badly about me. He should resign _E_ ISIS made a big mistake with the beheading of the reporter. Even people against intervention want them blown into oblivion. LEADERSHIP! _E_ Twitter will soon be irrelevant if lowlifes are so easily able to hack into accounts. _E_ Wow sexual assaults in the military have gone through the roof far worse than anybody could have predicted! _E_ Bret had a target on his back from the get go... _E_ Trump Tower is located at 725 Fifth Avenue between 56th and 57th Streets... _E_ ... Time for the Republicans to find someone new—and better. _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ I demand an apology from Hillary Clinton for the disgusting story she made up about me for purposes of the debate. There never was a video. _E_ Sacrificing our nation's bravest for ungrateful Iraqis = great for China. 
China is taking majority of the oil __HTTP__ _E_ My family and I just arrived in Scotland for the grand opening of Trump International Golf Links Scotland __HTTP__ _E_ To become a champion fight one more round. James J. Corbett long ago Heavyweight Champion _E_ Thanks to the historic TAX CUTS that I signed into law your paychecks are going way UP your taxes are going way DOWN and America is once again OPEN FOR BUSINESS! __HTTP__ _E_ The media refuses to talk about the three new national polls that have me in first place. Biggest crowds ever watch what happens! _E_ At 10:30 I will be interviewed on both @meetthepress by @chucktodd and @CBSNews Face The Nation by John Dickerson. This after long evening! _E_ .@tedcruz Conflicting Stances on Birthright Citizenship [14th Amendment] Gives #TeamTrump credit. __HTTP__ _E_ If U.C. Berkeley does not allow free speech and practices violence on innocent people with a different point of view NO FEDERAL FUNDS? _E_ More and more Americans seem fed up with both Parties I agree. _E_ Discussing #SyrianRefugees with @EricBolling on @FoxNews back on 10/3/2015. #ISIS __HTTP__ _E_ Alex Rodriguez should substantially reduce his salary from the Yankees in that he misrepresented his use of (cont) __HTTP__ _E_ RT @CLewandowski_: Gov Nikki Haley just became a liability for Rubio after this was published to social media! __HTTP__ _E_ Thank you Erie Pennsylvania! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_ .@CharlesMBlow Why don't you use new polls instead of the single ancient national poll that was a tiny bit negative. Dishonest reporting! _E_ My ties & shirts at Macy's are doing great. Stupid @GoAngelo is making people aware of how good they are! _E_ I'm not saying to not give vaccines I am just saying give them small doses over a long period of time not one massive dose for a child. _E_ "You can have the most wonderful product in the world but if people don't know about it it's not worth much." 
The Art of the Deal _E_ Jamie Dimon just gave away $13B to government in settlement. Terrible move & bad precedent. Could have done much better by fighting. _E_ I love you North Carolina thank you for your amazing support! Get out and __HTTP__ tomorrow!Watch:... __HTTP__ _E_ New Bloomberg Poll: Trump Leads Big __HTTP__ _E_ It was a great honor to have spoken before the countries of the world at the United Nations.#USAatUNGA#UNGA __HTTP__ __HTTP__ _E_ Obama is giving Social Security & ObamaCare to illegals yet wants to cut military benefits __HTTP__ Disgrace! _E_ Entrepreneurs: Everything starts with you. Leadership is not a group effort if you're in charge then be in charge. _E_ US should have told Libya Rebels give us 50% of your oil for our military support. _E_ If Republicans don't Repeal and Replace the disastrous ObamaCare the repercussions will be far greater than any of them understand! _E_ China is now deploying drones across ocean routes used for trade __HTTP__ They stole the technology from us. _E_ If the people of our great country could only see how viciously and inaccurately my administration is covered by certain media! _E_ Exceptional dining matched with exceptional views @Trumpchicago offers a unique array of 5 star dining options __HTTP__ _E_ .@bobvanderplaats begged me to do an event while asking organizers for $100000 for himself—a bad guy! _E_ RT @DRUDGE_REPORT: WSJ: The Cold Clinton Reality... __HTTP__ _E_ Entrepreneurs: See yourself as victorious. This will focus you in the right direction. Put everything you've got into what you're doing. _E_ The dishonest media didn't mention that Bernie Sanders was very angry looking during Crooked's speech. He wishes he didn't make that deal! _E_ Based on @MegynKelly's conflict of interest and bias she should not be allowed to be a moderator of the next debate. _E_ It's Friday how many advertisers dropped @HuffPost today? _E_ Very excited for @LaraLeaYunaska and @EricTrump's wedding this weekend. 
_E_ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Here are my thoughts on last night's episode of The Celebrity Apprentice... __HTTP__ _E_ Could be the hurricane helps @MittRomney people are rioting in the streets over gasoline _E_ Young entrepreneurs – never back down. Take the hits and get up. That's what makes a winner. _E_ My interview w/ @BloombergTV's Peter Cook re the Old Post Office Bldg becoming Trump Int'l Hotel Washington D.C. __HTTP__ _E_ Well that is it. Well done Megyn and they all lived happily ever after! Now let us all see how THE MOVEMENT does in Oregon tonight! _E_ .@BarackObama is practically begging @MittRomney to disavow the place of birth movement he is afraid of it and (cont) __HTTP__ _E_ A lot of people strongly advised me against doing @ApprenticeNBC. Next week we start filming the record 13th seasonHence go with your gut! _E_ Sorry to hear @msnbc was dead last in the gutter in their Boston bombing coverage __HTTP__ @hardball_chris @Lawrence _E_ Hillary Clinton has announced that she is letting her husband out to campaign but HE'S DEMONSTRATED A PENCHANT FOR SEXISM so inappropriate! _E_ Via @PRNewswire: "Central Park Horse Show To Make Inaugural Debut in NYC Sept 18 21" __HTTP__ I am proud to be a sponsor! _E_ The same brilliant negotiators that gave up five Taliban leaders for one traitor are now making trade deals with China & others.No chance _E_ The debate was very interesting last night. There were numerous winners and Governor Romney did very well. _E_ "Some events will wipe out one person but will make another even more tenacious." – Think Like a Champion _E_ How come discredited reporter @mckaycoppins refused to write that the events in New Hampshire Buffalo and N.Y. were all record breakers! _E_ RT @TeamTrump: It's hard to fight terrorism when you're making cash payments to the world's LARGEST state sponsor of TERROR. Under Trump: N... _E_ Weak newscasters are asking is there a racial component to knockout attacks? 
Of course there is and weakness will only make it worse! _E_ A Rod should donate his contract to charity. He doesn't make the @yankees any money and he doesn't perform. He is a $30M/yr rip off. _E_ power from Washington D.C. and giving it back to you the American People. #InaugurationDay _E_ Great news from Ireland—Clare County Council turned down massive windfarm near my hotel & golf course in Doonbeg. __HTTP__ _E_ The only job @BarackObama cares about is his own. Everything he does is for his own reelection. _E_ We are getting ready to protect Saudi Arabia against Iran & others sending ships. How much are they going to pay us toward this protection. _E_ FLORIDA Visit __HTTP__ to find shelters road closures & evacuation routes. Helpful Twitter list: __HTTP__ __HTTP__ _E_ El Chapo and the Mexican drug cartels use the border unimpeded like it was a vacuum cleaner sucking drugs and death right into the U.S. _E_ .@lolojones given a raw deal in @nytimes story not fair. _E_ (1/2) Time Magazine has me on the cover this week. David Von Drehle has written one of the best stories I have ever had. _E_ The Wikileaks e mail release today was so bad to Sanders that it will make it impossible for him to support her unless he is a fraud! _E_ Democrats slam GOP healthcare proposal as Obamacare premiums & deductibles increase by over 100%. Remember keep your doctor keep your plan? _E_ ... Supreme Court pick economic enthusiasm deregulation & so much more have driven the Trump base even closer together. Will never change! _E_ Join me tomorrow Nov. 3rd at 12pm in #TrumpTowerNY. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_ Congrats to Obama & Democrats. CBO has just announced that ObamaCare missed its uninsured target by half & program costs extra $700B+. _E_ "Let other people talk. Any business conversation should be two sided." 
– Think Like a Billionaire _E_ "Trump could be great friend if 'Second Amendment' enthusiasm is real" __HTTP__ via @SFLuxe _E_ Apple's iPhone sales fell way short they must go to a larger screen as alternative fast (as I said long ago)! Samsung's size much better. _E_ Would be nice if @jmartNYT learned how to read the polls before writing his next story. Probably done on purpose but not good reporting! _E_ Via @GolfweekMag: "Trump reveals routing for second course in Scotland" __HTTP__ _E_ A great photo of @MittRomney and me __HTTP__ _E_ Republicans should have been much tougher on Obama. Just wait until you see what Obama does to Romney at the DNC! _E_ Will be on @Morning_Joe at 6:30 A.M. _E_ Christians need support in our country (and around the world) their religious liberty is at stake! Obama has been horrible I will be great _E_ Congratulations to @NHGOP & @AFPFNH for winning control of the State House & Executive Council while holding State Senate. Strong results! _E_ She'll say anything and change NOTHING! #MAGA #BigLeagueTruth __HTTP__ _E_ Rated Toronto's #1 hotel @TrumpTO has 261 guest rooms & suites furnished in elegant cosmopolitan style. __HTTP__ _E_ I am watching Crooked Hillary speak. Same old stuff our country needs change! _E_ Iran is toying with our president buying time and laughing at the stupidity of our leadership. Syria and now this! What's next? _E_ RT @foxandfriends: Chicago approves new plan to hide illegal immigrants from the feds plus give them access to city services __HTTP__ _E_ Located in Central Park the iconic @TrumpRink is NYC's top skating rink. VIP sessions are available for booking __HTTP__ _E_ A phony story that I am trying to buy a soccer team in Argentina is untrue. Never even heard of the team—no interest! __HTTP__ _E_ Hillary said with respect to ISIS we are finally where we need to be. Do we want 4 more years of incompetent leadership? MAGA! 
_E_ I call Jeb Bush the reluctant warrior he just doesn't want to be doing this he is not having fun! _E_ Socialists think profits are a vice I consider losses the real vice. Winston Churchill _E_ ... Will be there front & center along with the 70 greatest players in the world. WGC @Cadillac Championship _E_ Statesman of the Year in Sarasota FL on Sunday night will be terrific a total sellout. _E_ .@foxandfriends we are in record territory in all things having to do with our economy! __HTTP__ _E_ No Question' Violent Crime Will Rise If Program (Stop & Frisk) Is Stopped" @NY_POLICE Commissioner Ray Kelly _E_ Heading to Youngstown Ohio now some great polls. #AmericaFirst __HTTP__ _E_ Why didn't Obama as part of the negotiation free the Christian Pastor Saeed Abedini? __HTTP__ _E_ Congratulations to the Republic of Korea on what will be a MAGNIFICENT Winter Olympics! What the South Korean people have built is truly an inspiration! __HTTP__ _E_ The 2nd Amendment is under siege. We need SCOTUS judges who will uphold the US Constitution. #Debate #BigLeagueTruth _E_ Watch me on the @hannityshow tonight at 9pm. More thoughts on Anthony Weiner in today's #trumpvlog... __HTTP__ _E_ In the last 24 hrs. we have raised over $13M from online donations and National Call Day and we're still going! Thank you America! #MAGA _E_ Dems don't want to talk ISIS b/c Hillary's foreign interventions unleashed ISIS & her refugee plans make it easier for them to come here. _E_ China loved Obama's climate change speech yesterday. They laughed! It hastens their takeover of us as the leading world economy. _E_ RECKLESS! @BarackObama has now increased the debt more than any other POTUS and the first 42 combined. __HTTP__ _E_ OPEC is ripping us off on oil. We are ripping ourselves off by investing in unproven green energy. #Solyndra _E_ For all of those who want to #MakeAmericaGreatAgain boycott @Macys. They are weak on border security & stopping illegal immigration. 
_E_ .@SkyscraperLive: Nick all of the folks at Trump International next door are wishing you well. We will block the strong winds! _E_ #VoteTrumpMS! #Trump2016 __HTTP__ _E_ RT @LouDobbs: Trump outlines new child care policy proposals via the @FoxNews App @realDonaldTrump seems a candidate of destiny __HTTP__ _E_ Ask yourself: What am I pretending not to see? There may be some great opportunities right around you. _E_ Negotiation tip #1: The worst thing you can possibly do in a deal is seem desperate to make it. @realDonaldTrump _E_ Photo of @Gretawire and me from yesterday's interview... __HTTP__ _E_ Happy Father's Day to all! I had a wonderful and loving father. __HTTP__ _E_ Thank you to the amazing law enforcement officers in Colorado!#MakeAmericaGreatAgain #LESM __HTTP__ _E_ Why doesn't the media want to report that on the two Big Thursdays when Crooked Hillary and I made our speeches Republican's won ratings _E_ Military has announced that China has successfully hacked our advanced weapon designs. China is our enemy.Should we offset this on our debt? _E_ Live from New York November 7th! @nbcsnl __HTTP__ _E_ Going now to make a major speech before some of the world's biggest investors in Dubai! _E_ Nothing on emails. Nothing on the corrupt Clinton Foundation. And nothing on #Benghazi. #Debates2016 #debatenight _E_ Hard to believe that the Democrats who have gone so far LEFT that they are no longer recognizable are fighting so hard for Sanctuary crime _E_ .@HillaryClinton : Bill "clarified" what he meant when calling Obamacare a "disaster." Actually "disaster" is pretty clear. #Debate _E_ I have no doubt that Mitt will do really well tonight. We'll all be watching @MittRomney. _E_ In a new poll a majority of people felt the president knowingly lied about health care pledge. Who are the fools who don't think he lied? 
_E_ Our amazing golf course @TrumpScotland __HTTP__ _E_ Trump: US Must Get Tougher Because China Is 'Eating Our Lunch' __HTTP__ via Moneynews @Newsmax_Media _E_ RT @BretEastonEllis: Just back from a dinner in West Hollywood: shocked the majority of the table was voting for Trump but they would never... _E_ With $250M of renovations Trump Int'l DC's 250 expansive guest rooms will be DC's top offering of amenities & views __HTTP__ _E_ Dishonest @politico just called to say that none of the polls including Fox NBC CNN Zogby & Morning Consult matter. Serious haters. _E_ .....you keep forgetting to mention the fragrance Success ! _E_ Identify your goals. Know precisely what you want to achieve study the best people in your fieldand then plan the best route for success. _E_ Actually I was very nice to Jimmy Carter during my standing room only (& standing ovation) speech for CPAC stated better Pres. than Obama! _E_ #ThankYouTour2016 Tonight Orlando Florida Tickets: __HTTP__ Mobile AlabamaTickets:... __HTTP__ _E_ THANK YOU ARIZONA! Get out and #VoteTrump on Tuesday! #AZPrimary #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ .@washingtonpost is going out of its way to tell failing candidates how to beat Donald Trump.The Post doesn't get that I'm good at winning! _E_ In Crooked Hillary's telepromter speech yesterday she made up things that I said or believe but have no basis in fact. Not honest! _E_ He @BarackObama wants record high gas prices drilling permits on federal land are declining under his regime __HTTP__ _E_ Tainted (no very dishonest?) FBI "agent's role in Clinton probe under review." Led Clinton Email probe. @foxandfriends Clinton money going to wife of another FBI agent in charge. _E_ With all of the Crooked Hillary Clinton's foreign policy experience she has made so many mistakes and I mean real monsters! No more HRC. _E_ A lot of people are concerned about which charity my $5M will be donated to. The onus is on Obama to first release his records. 
_E_ It's go time! See you at Trump Tower. I'm giving money away! #FundAnything _E_ Join me Thursday in Florida & Ohio!West Palm Beach FL at noon: __HTTP__ OH this 7:30pm: __HTTP__ _E_ Mexico has lost a brilliant finance minister and wonderful man who I know is highly respected by President Peña Nieto. _E_ What about the undocumented immigrant with a record who killed the beautiful young women (in front of her father) in San Fran. Get smart! _E_ I will be in Milwaukee Wisconsin tomorrow at 7pmE with @MELANIATRUMP. Join us! #WIPrimary #Trump2016 __HTTP__ _E_ America's competitors love @BarackObama. @MedvedevRussiaE says @BarackObama has been the best 3 years for Russia __HTTP__ _E_ Just as I predicted while Obama lifted sanctions 18 months ago Iran cheated & increased its nuclear fuel by 20%. We must DOUBLE sanctions! _E_ So many in the African American community are doing so badly poverty and crime way up employment and jobs way down: I will fix it promise _E_ Some exciting news the newest acquisition of Trump Golf Trump National Golf Club Charlotte NC formerly The (cont) __HTTP__ _E_ Via @ABC by @jonkarl & @JordynPhelps: Donald Trump Says Jeb Bush is the 'Last Thing We Need' __HTTP__ _E_ Our economy is at a standstill. Some are even predicting a possible double dip. We need to elect @MittRomney in November. _E_ Via @ArutzSheva_En by Moshe Cohen: "Donald Trump: French Gun Control Allowed Terrorists to Succeed" __HTTP__ _E_ Do not look for approval except for the consciousness of doing your best. Andrew Carnegie _E_ I never said that China was in the bad TPP trade deal but that China would come in the back door at a later date. @CNN @FoxBusiness _E_ Going to Columbus Ohio today for a tremendous rally of thousands. The silent majority is no longer silent! _E_ .@TrumpNewYork is the only Forbes 5 Star & 5 Diamond hotel with a 5 Star & Five Diamond restaurant in NYC __HTTP__ _E_ What took so long to catch only 1 of the Benghazi terrorists? 
Especially after the killer has been taunting the US in the press f/2 yrs. _E_ Thank you Wisconsin! My Administration will be focused on three very important words: jobs jobs jobs! Watch:... __HTTP__ _E_ As everybody knows but the haters & losers refuse to acknowledge I do not wear a "wig." My hair may not be perfect but it's mine. _E_ While @BetteMidler is an extremely unattractive woman I refuse to say that because I always insist on being politically correct. _E_ Re Negotiation: Think about what the other side wants. Know where they're coming from. View any conflict as an opportunity. Be flexible. _E_ Great speech on China by @PaulRyanVP yesterday where he explains why China is treating @BarackObama like a Doormat __HTTP__ _E_ The ObamaCare disaster will increase the amount of uninsured __HTTP__ What is the point of this Trillion $ monstrosity? _E_ Excited to see that @AnnDRomney has joined twitter. Melania and I are looking forward to hosting her next week (cont) __HTTP__ _E_ Via @digitalspyus: Donald Trump to Lord Sugar: 'Drop to your knees and thank me' __HTTP__ _E_ A person who never made a mistake never tried anything new. Albert Einstein _E_ Thank you @USNavy! #USA __HTTP__ _E_ Via @NRO: Donald Trump Eyes 2016 by @woodruffbets __HTTP__ _E_ .@AlexSalmond I hope you played well at Royal Aberdeen but u must admit the windmill hovering over hole 14 is disgusting & inappropriate. _E_ It's freezing outside where the hell is global warming ?? _E_ My Administration will follow two simple rules: __HTTP__ _E_ Great new poll thank you America!#Trump2016 #ImWithYou __HTTP__ _E_ Secy John Kerry has a tough job but he looks so totally lost negotiating w/ those characters who are cleaning his clock. Sad to watch... _E_ Via @qctimes by @EdTibbetts: "Trump: U.S. getting beat up" __HTTP__ _E_ It was great to have Governor @RicardoRossello of #PuertoRico with us at the @WhiteHouse today. We are with you! 
#PRStrong __HTTP__ _E_ Got to do something about these missing chidlren grabbed by the perverts. Too many incidents fast trial death penalty. _E_ .@IvankaTrump looks like a movie star from the days of glamour and beauty. #CelebApprentice _E_ RT @foxnation: Grateful Syrians React To @realDonaldTrump Strike: 'I'll Name My Son Donald' __HTTP__ #SyrianStrikes _E_ My @CPACnews speech is scheduled Friday at 8:45AM in the Potomac Ballroom. Will also be telecast live on CSPAN & cable news networks. _E_ Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_ Via @BreitbartFeed why doesn't @BarackObama release his original book proposal which says he was born in Kenya? __HTTP__ _E_ Eric's Sept. 14th event will be held at Trump National Golf Club Westchester. __HTTP__ _E_ My @gretawire interview discussing my @MittRomney fundraiser in Trump Int'l Hotel Las Vegas and the state of the (cont) __HTTP__ _E_ .@Mediaite: Donald Trump Trashes @michellemalkin On Twitter:You're A 'Dummy' & 'Were Born Stupid' __HTTP__ @AndrewKirell _E_ Such amazing reporting on unmasking and the crooked scheme against us by @foxandfriends. Spied on before nomination. The real story. _E_ Someone just asked me who is my favorite Donald Trump impersonator? __HTTP__ _E_ .@foxandfriends We are not looking to fill all of those positions. Don't need many of them reduce size of government. @IngrahamAngle _E_ Then how come gasoline is hitting record high prices? _E_ The @Yankees must re negotiate @AROD's contract. He is not the same player without drugs. _E_ I am at Trump National Doral in Miami as the best golfers in the World start arriving for the World Golf Championship (Cadillac). A big week _E_ Just hit a million on Facebook __HTTP__ _E_ Obama Care stole more then $500M from Medicare. 
_E_ Thank you @tweetbypremier for selecting the Ocean View Suite @Trump_Ireland as one of your top 10 suites __HTTP__ _E_ Congratulations Treasury Secretary Steven Mnuchin! #ICYMI watch here: __HTTP__ __HTTP__ _E_ The @AmSpec article Shakedown Schneiderman about NY State lightweight @AGSchneiderman is amazing. __HTTP__ _E_ I hope everyone enjoyed Palm Sunday! _E_ With the exception of cheating Bernie out of the nom the Dems have always proven to be far more loyal to each other than the Republicans! _E_ The Democrats will make a deal with me on healthcare as soon as ObamaCare folds not long. Do not worry we are in very good shape! _E_ On stunning Aberdeenshire coastline @TrumpScotland features a classic Scottish link threaded through the dunes __HTTP__ _E_ Opportunity is missed by most people because it is dressed in overalls and looks like work. Thomas Edison _E_ We should not bail out any of the European countries or banks. _E_ RT @KellyannePolls: After a decent first debate @HillaryClinton is back to form: pedantic lawyerly technocratic (woefully untruthful) r... _E_ .@MittRomney's @RNC convention came in over $3M under budget. Barack's @DNC convention is over $10M in debt. What a surprise! _E_ Give yourself a chance make every day a discovery. _E_ My interview which recently aired on CNBC's Squawk Box __HTTP__ _E_ _E_ One of the country's dumbest newspapers—The Palm Beach Post should be put to sleep. It's dying. @pbpost _E_ Such amazing people in India. This trip is very enlightening! _E_ Entrepreneurs: Take responsibility for yourself. It's a very empowering attitude. _E_ Karl Rove lost GOP both Houses of Congress and the White House gave us Obama. _E_ Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau _E_ So many self righteous hypocrites. Watch their poll numbers and elections go down! _E_ Entrepreneurs: There's nothing wrong with bringing your talents to the surface. 
Having an ego and acknowledging it is a healthy choice. _E_ As I have long been saying South Africa is a total and very dangerous mess. Just watch the evening news (when not talking weather). _E_ Crooked Hillary called it totally wrong on BREXIT she went with Obama and now she is saying we need her to lead. She would be a disaster _E_ Signing my tax return.... __HTTP__ _E_ People buy deals & immediately put them into bankruptcy in order to make better deals... _E_ Join me today Nov 3rd in #TrumpTowerNYC at noon. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_ Be a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_ I won every debate so far according to all debate polls including @DRUDGE_REPORT @TIME @Slate and more. Too bad dopey @megynkelly lies! _E_ Don't blindly pursue a career that others suggest or insist is right for you. It may be worth taking a pay cut for a job you love. _E_ "Donald Trump: I've made up my mind on 2016" __HTTP__ via @msnbc by @janestreet _E_ Congratulations to America's new Secretary of @HHSGov Alex Azar! __HTTP__ _E_ Just as I predicted @BarackObama is preparing a possible attack on Iran right before November. __HTTP__ _E_ Con Ed has won its suit against the Ground Zero Mosque developers __HTTP__ The mosque is never going up. _E_ Obama's policies have led to food stamp rolls growing 75X faster than job production __HTTP__ We can't afford 4 more years. _E_ No surprise that all the foreign countries are celebrating Obama's win. They love a weak America that they can rip off. _E_ There are no short cuts to any place worth going. Beverly Sills _E_ All of my Cabinet nominee are looking good and doing a great job. I want them to be themselves and express their own thoughts not mine! _E_ My wife @MELANIATRUMP and my children will be featured on @FoxNews with @Greta 7pmE. 
Enjoy!#MeetTheTrumps #Trump2016 _E_ Obamacare premiums increasing 33% in Pennsylvania a complete disaster. It must be repealed and replaced!... __HTTP__ _E_ .@brandonhardest Love what you do and work hard. _E_ Just as I said last October census workers cooked the job numbers for Obama right before the election __HTTP__ _E_ And happy to welcome @ArsenioHall back as an advisor— he will have his own show and is doing great. #CelebApprentice _E_ Just did @OReillyFactor. Will be back on at 11pm on @FoxNews. _E_ Surprised @Eagles signed Michael Vick yesterday to be their 2013 QB. Vick is talented but brittle & probably won't last long. _E_ Just Introduced at #NCGOPcon as the country's highest paid speaker. Told the record crowd of 650 I am to be speaking here for free! _E_ On my way to Dayton Ohio. Will be there soon! _E_ Hillary's debate answer on delay: That is horrifying. That is not the way our democracy works. Been around for 240 years. We've had free _E_ HAPPY 241st BIRTHDAY to the @USArmy! THANK YOU! __HTTP__ _E_ My @gretawire interview on @FoxNewsInsider "Trump: 'Last Person I'd Want Negotiating for Me Is Obama'" __HTTP__ _E_ China continues to be on the move both technologically and militarily. Obama is sitting by and watching. _E_ Too bad I'll Have Another out of Belmont Stakes interest now way down. _E_ Is it true the DNC would not allow the FBI access to check server or other equipment after learning it was hacked? Can that be possible? _E_ Join me this Wednesday in Phoenix Arizona at 6pm! #ImWithYouTickets: __HTTP__ __HTTP__ _E_ 'Trump Celebrates American Manufacturing Survey Showing Highest Level of Optimism in 20 Years' ... __HTTP__ _E_ We need a 21st century MERIT BASED immigration system. Chain migration and the visa lottery are outdated programs that hurt our economic and national security. __HTTP__ _E_ #TBT With Barbara Walters on my helicopter going somewhere. __HTTP__ _E_ It takes guts to win! 
_E_ I will be going to Puerto Rico on Tuesday with Melania. Will hopefully be able to stop at the U.S. Virgin Islands (people working hard). _E_ The new season of the Celebrity Apprentice is off to a great start last night it swept the 10 p.m. hour in every key demographic. _E_ You have to know when to call it quits and when to keep moving forward. Donald J. Trump __HTTP__ _E_ Lynne Ryan just read your great story in the NY Times I am proud of you. Thanks! __HTTP__ _E_ When it comes to violent crime and if we are going to solve the problem we must stop being so politically correct must tell it like it is! _E_ Via @PRNewswire: TRUMP HOTEL COLLECTION™ Announces Trump® International Hotel & Tower Baku __HTTP__ _E_ __HTTP__ Lights... Camera....You're Fired! All new @apprenticenbc tonight at 8PM ET on NBC! _E_ Frumpy and very dumb Gail Collins an editorial writer at The New York Times is so lucky to even have a job. Check her out incompetent! _E_ Vattenfall the promoter of the money losing wind farm plan in Aberdeen Scotland just took a loss of $4.6 billion after dumb European move _E_ You can't know it all yourself anyone who thinks that they do is destined for mediocrity." The Way To The Top _E_ 740 Park Avenue is being robbed all over the place we come down hard on thieves at Trump buildings. _E_ We mourn the horrifying terrorist attack in NYC. All of America is praying and grieving for the families who lost their precious loved ones. __HTTP__ _E_ A great honor to easily finish FIRST in the @FoxNews poll tabulation even though some of my best polls were not used in determining winner! _E_ We will defend our people our nations and our civilization from all who dare to threaten our way of life...cont: __HTTP__ __HTTP__ _E_ I hate to say it but the Republican Convention was far more interesting (with a much more beautiful set) than the Democratic Convention! _E_ I am honored that @BarackObama has featured my plane in one of his attack ads. It was made in America! 
_E_ Now China 'calls in' US diplomats to lecture them on their illegal escapades. __HTTP__ The new reality. @BarackObama is weak. _E_ This month we celebrate the contributions of Asian Americans & Pacific Islanders that enrich our Nation. __HTTP__ _E_ Great advice from my father: Know everything you can about what you're doing. Fred C. Trump _E_ Happy New Year to all my Jewish friends celebrating the holiday. _E_ The Oil Companies collude with OPEC to keep oil artificially overvalued. They need to be reigned in. _E_ THANK YOU Youngstown Ohio! I love you! Get out & #VoteTrump tomorrow. #Trump2016 __HTTP__ _E_ The Phoenix V.A. it has just been reported is in worse shape than ever before. The wait is horrendous and people are dying. I will fix it _E_ Raffaele Sollecito was unfairly convicted. He didn't kill anyone. The Italian government should be ashamed. @Raffasolaries _E_ The Democrats want MASSIVE tax increases & soft crime producing borders.The Republicans want the biggest tax cut in history & the WALL! _E_ Leading in the Bloomberg Iowa poll. Also my favorability numbers went up at a record almost unheard of clip. Thank you Iowa! _E_ If your enemies end up liking you it's because they beat you. You want their respect not their friendship. _E_ I want to thank Steve Bannon for his service. He came to the campaign during my run against Crooked Hillary Clinton it was great! Thanks S _E_ The young intern who accidentally did a Retweet apologizes. _E_ Big protest march in Colorado on Friday afternoon! Don't let the bosses take your vote! _E_ Congress' greatest card against Obama is the power of the purse. Use it! _E_ It's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_ Why gas prices will cost @BarackObama re election: pain at the pump not good for obama __HTTP__ _E_ I gave away money. Go to __HTTP__ to see how I'm helping people. #FundAnything #Entrepreneurs #GiveBack _E_ Mitt Romney is a mixed up man who doesn't have a clue. 
No wonder he lost! _E_ Watch me play both golf and baseball tonight on Donald J. Trump's Fabulous World of Golf 9PM ET on Golf Channel.. __HTTP__ _E_ Because of #FakeNews my people are not getting the credit they deserve for doing a great job. As seen here they are ALL doing a GREAT JOB! __HTTP__ _E_ It is time to take back our country and MAKE AMERICA GREAT AGAIN!#CaucusForTrump Video: __HTTP__ __HTTP__ _E_ Donald Trump explains celebrity feuds: 'I speak the truth' __HTTP__ via @DigitalSpyUS _E_ Via @WDesMoinesPatch by @DerekJ3031: "@ShawnJohnson on @ApprenticeNBC" __HTTP__ _E_ Watched Sean Hannity last night a great guy. _E_ He thinks that the wealth you create belongs to the government @BarackObama doesn't respect the fact that the (cont) __HTTP__ _E_ Looking for Father's Day gift? @Miamimagazine named the spa @TrumpDoral one of the best places for men to relax __HTTP__ _E_ RT @marcorubio: Good #AfghanStrategy & excellent speech by @POTUS laying it out to the nation. _E_ LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_ .@BenSasse looks more like a gym rat than a U.S. Senator. How the hell did he ever get elected? @greta _E_ With respect to Iran we have all the cards they are scared stiff! I can't believe we aren't able to negotiate (cont) __HTTP__ _E_ Big meeting today with Republican leadership concerning Tax Cuts and Healthcare. We are all pushing hard must get it right! _E_ Obama administration had 4 years to prepare for the ObamaCare rollout. And of course they failed miserably. _E_ Convention speaker schedule to be released tomorrow. Let today be devoted to Crooked Hillary and the rigged system under which we live. _E_ The great Barbara Walters interviews Melania Trump and me on a Special Friday night at 10:00 on ABC.... __HTTP__ _E_ The President's speech was very combative toward Republicans—they have obviously not earned his respect! 
_E_ Only two weeks until we start shooting @CelebApprentice. We really have something amazing for the fans this year. _E_ The dealmaker is cunning secretive focused and never settles for less than he wants. The America We Deserve _E_ Entrepreneurs: Identify your goals know precisely what you want to achieve. Then study the best people in your field and learn from them. _E_ Thank you Christian Broadcasting Network @TheBrodyFile @CBNNews __HTTP__ _E_ Think BIG! You are going to be thinking anyway so you might as well think BIG! _E_ The Job on CBS the 15th copy of The Apprentice was just cancelled I love it! _E_ Lindsey Graham is all over T.V. much like failed 47% candidate Mitt Romney. These nasty angry jealous failures have ZERO credibility! _E_ Peyton Manning should have passed on 3rd down! _E_ Still a buyer's market. Residential home sales fall 7.1% in March. __HTTP__ Now is the time to buy property. _E_ Every on line poll Time Magazine Drudge etc. has me winning the debate. Thank you to Fox & Friends for so reporting! _E_ Is Hillary really protecting women? __HTTP__ _E_ My interview from yesterday with #Apprentice Andy on @AmericaNowRadio __HTTP__ _E_ .@antbaxter Dummythanks for increasing awareness of my big golf project in Aberdeen—sales are thru the roof & Aberdeen seeing big benefits. _E_ .@MittRomney looks much stronger and much more Presidential! _E_ I had a wonderful meeting with Likud Deputy Speaker of The Knesset @DannyDanon this past Friday in Trump Tower __HTTP__ #Israel _E_ I am watching the Democrats trying to defend the you can keep you doctor you can keep your plan & premiums will go down ObamaCare lie. _E_ It was just determined that the woman who passed out at Obama's press conference had just seen what her new premiums would be! _E_ Snowden is handing over to Russia a treasure trove of intel. Our politicians are incapable of dealing! _E_ You've got something unique to offer. Find out what it is. 
Ask yourself: What can I provide that does not yet exist? _E_ 'Kept me out of jail': Top DOJ official involved in Clinton probe represented her campaign chairman: __HTTP__ _E_ Today is my birthday. My wish is for our country to be great and prosperous again. _E_ .@thehill Your story about me & the carbon tax is absolutely incorrect—it is just the opposite. I will not support or endorse a carbon tax! _E_ Before you vote think: Obama wants to raise taxes @MittRomney wants to lower taxes need I say more! _E_ THANK YOU to all of the incredible HEROES in Texas. America is with you! #TexasStrong __HTTP__ _E_ Congratulations to Connecticut's Erin Brady on being crowned the 2013 @MissUSA! America will be well represented in @MissUniverse! _E_ With millions of dollars of negative and phony ads against me by the establishment my numbers continue to go up. Can anyone explain this? _E_ Departing for Texas and Louisiana with @FLOTUS Melania right now @JBA_NAFW. We will see you soon. America is with you! __HTTP__ _E_ Only the Fake News Media and Trump enemies want me to stop using Social Media (110 million people). Only way for me to get the truth out! _E_ "Don't find fault find a remedy." Henry Ford _E_ Join me in Columbus Ohio tomorrow!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ President Obama said ISIL continues to shrink in an interview just hours before the horrible attack in Paris. He is just so bad! CHANGE. _E_ Paul Teutul Sr. is a fantastic guy. Although I fired him on #CelebApprentice we will remain great friends. I love the bike he made for me. _E_ Fox News PollThank you New Hampshire! #FITN#Trump2016 __HTTP__ _E_ .@foxandfriends Russia sent millions to Clinton Foundation _E_ In today's #trumpvlog I speak about Clint Eastwood the #DNC and Drew Peterson __HTTP__ _E_ ...There is also something appropriate about keeping him in the home of the horrible crime he committed. Should move fast. DEATH PENALTY! 
_E_ Just as we won the Cold War in part by exposing the evils of communism and the virtues of free markets....Cont: __HTTP__ _E_ ...the Ninth Circuit which has a terrible record of being overturned (close to 80%). They used to call this judge shopping! Messy system. _E_ So much for Washington shutting down Strasburg they deserved to lose. _E_ Surprising a future Nobel prize winner on today's @KatieShow: __HTTP__ _E_ "RUBIO'S GANG OF 8 BILL WOULD HAVE REWARDED SANCTUARY CITIES HARBORING ILLEGALS" __HTTP__ Marco is a politician he flip flops! _E_ Bloggers like McKay Coppins & @BuzzFeed are true garbage with no credibility. Record setting crowds & speech not reported. @PiersMorgan _E_ To get momentum you must first focus on a specific goal with passion and intensity. _E_ New Gravis national poll just out 36%! Very nice! #MakeAmericaGreatAgain _E_ Hillary Clinton said that it is O.K. to ban Muslims from Israel by building a WALL but not O.K. to do so in the U.S. We must be vigilant! _E_ Here's the solution on China: get tough. Slap a 25 percent tax on China's products if they don't set a real (cont) __HTTP__ _E_ Lightweight @AGSchneiderman is driving business out of NY so that he can get publicity for his failing political career. _E_ Less than a week after we leave Iraq the country is already unraveling. We got nothing from the Iraqis and now (cont) __HTTP__ _E_ My campaign for president is $35000000 under budget I have spent very little (and am in 1st place).Now I will spend big in Iowa/N.H./S.C. _E_ Our border is being breached daily by criminals. We must build a wall & deduct costs from Mexican foreign aid! __HTTP__ _E_ Mr. President you're entitled as the president to your own airplane and to your own house but not to your own facts. @MittRomney _E_ My @FoxNews interview with @TeamCavuto where I explain that we need to start using our own domestic energy resources. 
__HTTP__ _E_ .@CelebApprentice was #1 on network TV last night in its time slot and easily won the 10 o'clock hour in all major demographics. _E_ Now that Bush has wasted $120 million of special interest money on his failed campaign he says he would end super PACs. Sad! _E_ The great boxing promoter Don King just endorsed me. Nice! _E_ Via @BreitbartNews: "DONALD TRUMP: 'RICH PEOPLE DON'T LIKE ME'–POOR MIDDLE INCOME PEOPLE 'LIKE ME BEST'" __HTTP__ _E_ Wow great news! I hear @EWErickson of Red State was fired like a dog. If you read his tweets you'll understand why. Just doesn't have IT! _E_ I will be on @foxandfriends at 7:00 A.M. Will be talking about many things including The Apprentice! _E_ Despite the constant negative press covfefe _E_ George also appeared on Saturday Night Live when I was guest host in 2004. A great time! #CelebApprentice _E_ Melania and I offer our deepest condolences to the family of Otto Warmbier. Full statement: __HTTP__ __HTTP__ _E_ It is time to rebuild OUR country to bring back OUR jobs to restore OUR dreams & yes to put #AmericaFirst! TY O... __HTTP__ _E_ Keep focused on your goals. Practice positive thinking. View any conflict as an opportunity look at the solution not the problem. _E_ Watch my interview with Greta Van Susteren @Gretawire tonight at 10 p.m. on Fox News. _E_ Looking forward to keynoting @bobvanderplaats' @theFAMiLYLEADER Leadership Summit. Tickets selling out __HTTP__ _E_ .@OMAROSA as a cashier a big mistake by @BrandenRoderick. #CelebApprentice _E_ #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ My @OraTV #Politicking interview w/@kingsthings on the govt. shutdown ObamaCare Putin 2016 and @TrumpDoral __HTTP__ _E_ Another great accolade for @TrumpGolf. Highly respected Golf Odyssey awarded @TrumpDoral Blue Monster with best redesign. Thank you! 
_E_ RT @foxandfriends: FOX NEWS ALERT: Jihadis using religious visa to enter US experts warn (via @FoxFriendsFirst) __HTTP__ _E_ My @SquawkCNBC interview discussing #TRUMPTUESDAYS high ratings @ToddAkin's statement & @MittRomney's policies __HTTP__ _E_ Set the bar high do the best you possibly can. Be focused disciplined and alert every single day. _E_ Just a few days until I keynote at @bobvanderplaats' @theFAMiLYLEADER Leadership Summit in Iowa __HTTP__ Very exciting _E_ At the request of many and even though I expect it to be a very boring two hours I will be covering the Democrat Debate live on twitter! _E_ 'Clinton Campaign And Harry Reid Worked With New York Times To Smear State Dept Watchdog'Time to #DrainTheSwamp! __HTTP__ _E_ .@realDonaldTrump is going to cut taxes BIG LEAGUE Crooked is going to raise taxes BIG LEAGUE! #DrainTheSwamp... __HTTP__ _E_ My @FoxNews interview with @gretawire discussing the GOP primary my 2012 options and why @BarackObama must lose __HTTP__ _E_ The Democrats have a corrupt political machine pushing crooked Hillary Clinton. We have Paul Ryan always fighting the Republican nominee! _E_ I am happy to hear that Pres.Obama is considering giving Anna Wintour @voguemagazine an ambassadorship. She is a winner & really smart! _E_ Crooked @club4growth has given up advertising in Iowa on me—remember they wanted my million dollars—I said no—total frauds! _E_ Mark my words a gallon of gas will be $5 during the summer. OPEC is ripping us off. There's nobody in our (cont) __HTTP__ _E_ Crooked Hillary Clinton now blames everybody but herself refuses to say she was a terrible candidate. Hits Facebook & even Dems & DNC. _E_ Thank you Adam Levine The Federalist in interview on @foxandfriends "Donald Trump is the greatest President our Country has ever seen." _E_ RT @FoxNews: More than 1 million jobs added since @POTUS took office. __HTTP__ __HTTP__ _E_ Word is I am doing very well in Michigan and Mississippi! 
Wow and with all that money spent against me! Will be going to Trump Jupiter now! _E_ To have a government we can afford we need to eliminate the tremendous waste clogging the system #TimeToGetTough _E_ No taxes the only good thing about DC Debt Deal. _E_ "Success is getting what you want. Happiness is wanting what you get." Dale Carnegie _E_ RT @GovMikeHuckabee: Trump says the chaos in Chicago was a planned attack. But Hillary insists it was a spontaneous reaction to an internet... _E_ I wonder why @BarackObama is now spending $8B to postpone Obamacare's Medicare Cuts until after the election? __HTTP__ _E_ RT @DanScavino: On behalf of our next #POTUS & @TeamTrump #HappyNewYear AMERICA __HTTP__ __HTTP__ __HTTP__ _E_ Most politicians would have gone to a meeting like the one Don jr attended in order to get info on an opponent. That's politics! _E_ The election is trending towards @MittRomney. Americans know we can't afford another 4 years of the Obama economic decline. _E_ It's that time of the year. @TrumpRink in Central Park is now open best rink in the world. __HTTP__ A landmark. _E_ Our country is looking very bad right now! _E_ Our deficits are caused by runaway spending not inadequate taxing. Washington does not have a revenue problem. _E_ Wednesday's debate is day one of the election. Over 70 million voters will be watching. _E_ When will the US government finally classify China as a currency manipulator? China is robbing us blind and @BarackObama defends them. _E_ This is your land this is your home and it's your voice that matters the most. So speak up be heard and fight fight fight for the change you've been waiting for your entire life!MERRY CHRISTMAS and THANK YOU Pensacola Florida! __HTTP__ _E_ My son Donald did a good job last night. He was open transparent and innocent. This is the greatest Witch Hunt in political history. Sad! _E_ The $10 billion (net worth) is AFTER all debt and liabilities. 
So simple to understand but @CNN & @CNNPolitics is just plain dumb! _E_ To all struggling young entrepreneurs stay positive in this tough climate and keep looking for good deals. They are out there. _E_ ...Brande was also smart in not bringing Omarosa to the boardroom. _E_ Happy to have passed 800000 followers. Looking forward to passing 1M sooner than later. _E_ Thank you Louisiana! Get out & vote for John Kennedy tomorrow. Electing Kennedy will help enact our agenda on behal... __HTTP__ _E_ "To be successful you must become very good at finding creative solutions to what appear to be impossible problems." – Think BIG _E_ On Greta 87% of the people said they would not watch the debate if I'm not in it. Wow what an honor! _E_ Many people think that WM23 @WrestleMania "the battle of the billionaires" was the greatest of all time—set all records _E_ The crowd in Ohio was amazing last night broke all records. We all had a great time in a great State. Will be back soon! _E_ On behalf of all Americans I want to wish Jewish families many blessings in the New Year. __HTTP__ __HTTP__ _E_ "Hard work is my personal method for financial success. You can do it too." Think Big _E_ I am the only one who can fix this. Very sad. Will not happen under my watch! #MakeAmericaGreatAgain __HTTP__ _E_ ‎In anticipation of ObamaCare part time jobs are surging & full time jobs are falling and becoming scarce __HTTP__ _E_ No better place to celebrate New Year's Eve than @TrumpSoHo the most elite hotel in downtown NYC __HTTP__ _E_ RT @robertjeffress: Honored to pray for my friend @realDonaldTrump at tonight's Dallas rally. #TrumpDallas c: @DanScavino __HTTP__ _E_ Upstate New York is suffering with record unemployment. Fracking is the answer. Frack now and Frack fast! _E_ Wow FBI confirms report that James Comey drafted letter exonerating Crooked Hillary Clinton long before investigation was complete. Many.. 
_E_ Response to Hillary Clinton __HTTP__ _E_ Leaving for New York City and meetings on military purchases and trade. _E_ Is @karlrove incompetent? 400 million dollars down the drain and not 1 victory! _E_ #TrumpVlog Obama should be ashamed! __HTTP__ _E_ China and Saudi Arabia recently struck a deal which is the largest expansion by any oil company in the world (cont) __HTTP__ _E_ .@GoAngelo—the next time you have a rally @Macy's try getting 12 people instead of 11—it would be much more effective! _E_ Still a buyer's market. Home prices are dropping mortgages are low. Now is the time to take advantage for your gain. __HTTP__ _E_ Once again @BarackObama's speech at @AIPAC yesterday proved that he is more concerned about containing @Israel (cont) __HTTP__ _E_ Thank you to everybody for your wonderful comments on my debate performance it was a lot of fun! Today I will be speaking in Reno Nevada. _E_ If you love your work the difficulties will be balanced out by the enjoyment. Think Big _E_ The difference between @MittRomney and @BarackObama's campaign promises to @Israel is that Mitt will actually keep all of his. _E_ Thanks. __HTTP__ _E_ .@RuthMarcus of the @washingtonpost was terrible today on Face The Nation.No focus poor level of concentration but correct on Hillary lying _E_ I hope all workers demand that their @Teamsters reps endorse Donald J. Trump. Nobody knows jobs like I do! Don't let them sell you out! _E_ Congratulations to @DianeSawyer on her big ratings win for the evening news. Diane is a spectacular person. _E_ Mexico's totally corrupt gov't looks horrible with El Chapo's escape—totally corrupt. U.S. paid them $3 billion. _E_ Great bilateral meeting with Prime Minister Theresa May of the United Kingdom affirming the special relationship and our commitment to work together on key national security challenges and economic opportunities. 
#WEF18 __HTTP__ _E_ Sheldon Adelson is looking to give big dollars to Rubio because he feels he can mold him into his perfect little puppet. I agree! _E_ Just landing in Knoxville Tennessee! Massive crowd expected! Will all have a great time despite serious subject matter. _E_ Healthy young child goes to doctor gets pumped with massive shot of many vaccines doesn't feel good and changes AUTISM. Many such cases! _E_ If Ted Cruz is so opposed to gay marriage why did he accept money from people who espouse gay marriage? _E_ Cowards die many times before their actual deaths. Caesar _E_ Donald Trump reads Top Ten Financial Tips on Late Show with David Letterman: __HTTP__ Very funny! _E_ ...While I fully agree it is not politically correct! __HTTP__ _E_ PM @David_Cameron should be run out of office for spending so much of England's money to subsidize windfarms in Scotland. _E_ Ebola is much easier to transmit than the CDC and government representatives are admitting. Spreading all over Africa and fast. Stop flights _E_ Please keep your thoughts & prayers with Melissa Young Miss Wisconsin 2005. __HTTP__ _E_ Get out and vote! I am your voice and I will fight for you! We will make America great again! __HTTP__ _E_ Notice that illegal immigrants will be given ObamaCare and free college tuition but nothing has been mentioned about our VETERANS #DemDebate _E_ I will be on @foxandfriends at 7:00 A.M. Enjoy! _E_ FACT on "red line" in Syria: HRC I wasn't there. Fact: line drawn in Aug '12. HRC Secy of State til Feb '13. __HTTP__ _E_ RT @foxandfriends: FOX NEWS ALERT: ISIS claims responsibility for hostage siege in Melbourne Australia that killed 1 person and injured 3... _E_ Obama's administration is now openly admitting it expects US credit downgraded again __HTTP__ Thanks for letting us know now _E_ I was invited by Caroline Wozniacki to sit with her family in her special box during her match at the U.S. Open yesterday. She's fantastic! _E_ Who do you like of the final two? 
#CelebApprentice __HTTP__ _E_ My @IngrahamAngle interview on the border crisis USMC Tahmooressi & my fight for the American flag __HTTP__ (15:00 mark) _E_ My friend @GovChristie called it @MittRomney recast the race. _E_ Try to develop a tempo when you're working momentum is something you have to work at to maintain & is an important element of success. _E_ .@TheBrodyFile: Trump's appeal to evangelicals is real #Trump2016 __HTTP__ __HTTP__ _E_ My bestselling book from last April Think Like a Champion is now available in paperback. It's inspiring entertaining and a great read. _E_ ....Some of those they are harshly treating have been "milking" their country for years! _E_ Thank you for the incredible support Maryland! This is a movement!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Remember I was the one who said attack the oil (ISIS source of wealth) a long time ago. Everyone scoffed now they're attacking the oil. _E_ Be in Turnberry on Thurs AM for start of Women's British Open one of world's great golf tournaments. Back soon to #MakeAmericaGreatAgain! _E_ Tremendous day in Massachusetts and Maine. Thank you to everyone for making it so special! _E_ Thank you to @DailyTelegraph reviewer @NeilMidgley who stated 'You've Been Trumped' was so biased in favour of the protesters... _E_ Puerto Rico being hit hard by new monster Hurricane. Be careful our hearts are with you will be there to help! _E_ John Foust is a liberal who supports ObamaCare and opposes Ebola travel ban. Send Conservative @BarbaraComstock to Congress! _E_ As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_ Little Marco Rubio is just another Washington D.C. politician that is all talk and no action. #RobotRubio __HTTP__ _E_ Just received huge applause when I said Berghdal should be sent back to Afghanistan! @SRQRepublicans speech is sold out with record crowd. _E_ Report out that Obama Campaign paid $972000 to Fusion GPS. 
The firm also got $12400000 (really?) from DNC. Nobody knows who OK'd! _E_ Leadership is the capacity to translate vision into reality. Warren G. Bennis _E_ .@ABFalecbaldwin Alec it's not science it's a con read the e mails. _E_ Don't sell yourself short on something that is important. Today is just the beginning. Think Like a Champion _E_ I will represent our country well and fight for its interests! Fake News Media will never cover me accurately but who cares! We will #MAGA! _E_ I wonder if when Secy. Kerry goes to Iraq and Afghanistan he pushes hard for them to look at GLOBAL WARMING and study the carbon footprint? _E_ So many problems in the U.S. and leadership that is hopeless...and now on top of everything else we just hit $18 trillion in debt! _E_ In politics and sometimes in life FRIENDS COME AND GO BUT ENEMIES ACCUMULATE! _E_ Sunday night at 9 PM EST will be re run of last week's episode of Celebrity @ApprenticeNBC followed by new episode at 10 PM. _E_ .@serenawilliams we look forward to being with you a truly great champion tomorrow at Trump National D.C. for the Tennis Center dedication _E_ Our infrastructure plan has been put forward and has received great reviews by everyone except of course the Democrats. After many years we have taken care of our Military now we have to fix our roads bridges tunnels airports and more. Bipartisan make deal Dems? _E_ Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow! _E_ .@danielhalper Great job on @CNN today. Very wise indeed! _E_ Leaving for Albany New York now massive crowd expected. Very exciting! _E_ Don't give up Republican Senators the World is watching: Repeal & Replace...and go to 51 votes (nuke option) get Cross State Lines & more. _E_ the American people. I have no doubt that we will together MAKE AMERICA GREAT AGAIN! _E_ Join @ericbolling to get @vanessariddle to 100k followers. Beautiful girl with stage 4 cancer. 
__HTTP__ _E_ New polls join the MOVEMENT today. __HTTP__ #ImWithYou __HTTP__ _E_ Thank you Florida a MOVEMENT that has never been seen before and will never be seen again. Lets get out &... __HTTP__ _E_ ...9 months than this Administration. Over 50 Legislation approvals massive regulation cuts energy freedom pipelines border security.... _E_ Numerous patriots will be coming to Bedminster today as I continue to fill out the various positions necessary to MAKE AMERICA GREAT AGAIN! _E_ RESPONSE TO THE LIES OF SENATOR CRUZ: __HTTP__ #VoteTrumpSC _E_ Should have settled ... Ft Lauderdale plaintiffs must pay me close to $400k in legal fees after Trump trial victory. _E_ The border is wide open for cartels & terrorists. Secure our border now. Build a massive wall & deduct the costs from Mexican foreign aid! _E_ Going to a Cabinet Meeting (tele conference) at 11:00 A.M. on #Harvey. Even experts have said they've never seen one like this! _E_ If we are going to continue to be stupid and go into Syria (watch Russia) as they say in the movies SHOOT FIRST AND TALK LATER! _E_ My official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_ America is going to build again. Under budget and ahead of schedule. Time to put #AmericaFirst! #InfrastructureWeek... __HTTP__ _E_ Heed the advice of @FLGovScott! If you're in an evacuation zone you need to get to a shelter...there's not many hours left. Gov. Scott __HTTP__ _E_ Via World Tribune The elites' problem with Donald Trump: He's not for sale by Jeffrey T. Kuhner __HTTP__ _E_ WOW! Thank you Massachusetts! See you soon. #VoteTrumpMA __HTTP__ _E_ Thank you Mark. #GOPDebate __HTTP__ _E_ The U.S. must immediately stop all flights from EBOLA infected countries or the plague will start and spread inside our borders. Act fast! _E_ Congrats to Pres.Obama on having 3 of @washingtonpost's "biggest Pinocchios of the year" __HTTP__ Great accomplishment! 
_E_ Credible Source on 9 11 Muslim Celebrations: FBI __HTTP__ via @WKRG _E_ When Strasburg leaves in a couple of years under free agency Washington will say what were we doing . _E_ .@chucktodd is a nice guy but just hopeless. He knows so little about politics and in particular winning! I fixed his rating problem. _E_ Sad only 36% think America's best days are ahead while 49% believe they are in the past __HTTP__ We can & must do better. _E_ Watch my interview with Greta Van Susteren @gretawire tonight on Fox News at 10 p.m. _E_ MERRY CHRISTMAS!!! __HTTP__ _E_ Donald Trump: Anna Wintour Ambassadorship Would Be 'A Favor To The Country' __HTTP__ via @mediaite _E_ This is how it starts. Obama is now threatening to use an Executive Order for gun control __HTTP__ Welcome to his 2nd term. _E_ Will be leaving for Missouri soon for a speech on tax cuts and tax reform so badly needed! _E_ I'm giving away money go to __HTTP__ . Take it from me! Proud of the #FundAnything team. _E_ "Obstacles are those frightful things you see when you take your eyes off your goal." Henry Ford _E_ What They Are Saying About @realDonaldTrump's GREAT Debate and @HillaryClinton's Bad Performance... __HTTP__ _E_ Congrats to fantastic All Star @ApprenticeNBC celebrity & illusionist @pennjillette on being honored at 2013 Hollywood Walk of Fame! _E_ Someone should inform @CNN that despite spending millions of $'s on graphics it is not the Democratic Debate rather the Democrat (s) D! _E_ I will be on @cbs @60minutes this Sunday. A great honor hope you enjoy it. _E_ The Republican Party must get tougher and smarter and fast or it will go down to a very big defeat just like the last two times! _E_ .@Neilyoung's song "Rockin' In The Free World" was just one of 10 songs used as background music. Didn't love it anyway. _E_ Must read quote by @EricTrump in @CNNMoney article "Builders race to develop sky high condo buildings" __HTTP__ _E_ Swisher should have caught ball in right field last night. 
_E_ Thank you Iowa! Great night see you soon! #Trump2016 __HTTP__ _E_ It's snowing & freezing in NYC. What the hell ever happened to global warming? _E_ Press Conference at Glasgow Prestwick Airport this Friday Nov. 14 at 11 AM with Donald J. Trump & Mr. Iain Cochrane __HTTP__ _E_ What a foolish statement by @davidaxelrod he said that a @marcorubio VP pick would 'insult' Hispanics __HTTP__ _E_ Obama through his cronies said the Keysyone pipeline was not political how much can one man lie about even the most obvious things? _E_ Thank you for your support Greensboro North Carolina. Next stop Charlotte! #MAGA __HTTP__ __HTTP__ _E_ Another must read from Jeffrey Lord @amspec: "Rove Email Leaks: Ideological War Opens in GOP" __HTTP__ _E_ .@DennisRodman is always hard to miss especially when dressed in silver finery. But not sure about the silver lipstick. #CelebApprentice _E_ Tonight is the Apprentice finale and it's a fantastic episode in every way with the great Liza Minnelli performing and a new Apprentice! _E_ Crooked Hillary just took a major ad of me playing golf at Turnberry. Shows me hitting shot but I never did = lie! Was there to support son _E_ The economy won't fully recover until @ObamaCare is fully repealed. It is a job killer! _E_ There is. __HTTP__ _E_ Fact – all the countries complaining about us spying on them spy on us. They just don't get caught stupid! _E_ I will take care of the Veterans who have served this country so bravely.#ThankAVet Video: __HTTP__ __HTTP__ _E_ China's military buildup is a major threat to the Free World. We must remain resolute and maintain our national defense at all costs. _E_ They say that if I participated in last night's Fox debate they would have had 12 million more & would have broken the all time record. _E_ I would not sign Graham Cassidy if it did not include coverage of pre existing conditions. It does! A great Bill. Repeal & Replace. _E_ .@NicolleDWallace is really hurting @TheView. 
She is boring predictable and has zero television it show no longer has ratings dying! _E_ I want to applaud the many protestors in Boston who are speaking out against bigotry and hate. Our country will soon come together as one! _E_ I am allowing Japan & South Korea to buy a substantially increased amount of highly sophisticated military equipment from the United States. _E_ You have to scratch your head when the president spends the last week talking about saving Big Bird. @MittRomney _E_ It was just announced that I will be hosting Saturday Night Live on Nov. 7th look forward to it! __HTTP__ _E_ In addition to winning the Electoral College in a landslide I won the popular vote if you deduct the millions of people who voted illegally _E_ .@JebBush has spent $63000000 and is at the bottom of the polls. I have spent almost nothing and am at the top. WIN! @hughhewitt _E_ Looking forward to Sunday's speech in the ExCel Centre. __HTTP__ _E_ Reckless @BarackObama is projecting $1.2T deficit from 2012 budget & a projected $25.4T debt in a decade __HTTP__ _E_ It is terrible that @BarackObama did not appoint an independent counsel to investigate the national security leaks. No accountability. _E_ "Pay attention to the small numbers in your finances such as percentages and cents... _E_ Thank you to @LOUDOBBS for giving the first six months of the Trump Administration an A+. S.C.reg cuttingStock M jobsborder etc. = TRUE! _E_ Sometimes when you innovate you make mistakes. It is best to admit them quickly and get on with other innovations. Steve Jobs _E_ Smart move by @BarackObama having Pres. Bill Clinton deliver the @DNC convention keynote. _E_ LAST thing the Make America Great Again Agenda needs is a Liberal Democrat in Senate where we have so little margin for victory already. The Pelosi/Schumer Puppet Jones would vote against us 100% of the time. He's bad on Crime Life Border Vets Guns & Military. VOTE ROY MOORE! _E_ My appearance on @foxandfriends from today.... 
__HTTP__ _E_ Reuters polling just out thank you!#MakeAmericaGreatAgain __HTTP__ _E_ .@cher I don't wear a "rug"—it's mine. And I promise not to talk about your massive plastic surgeries that didn't work. _E_ #TBT It is great being part of Home Alone 2 a holiday staple. __HTTP__ _E_ American professors were in Tehran for an Occupy Wall Street Conference __HTTP__ @BarackObama's diplomatic initiative?!?! _E_ The water damage to NYC is amazing. The winds were bad but the water was worse. _E_ .@mercedesschlapp thank you so much for your kind words on television fantastic job and greatly appreciated! _E_ Entrepreneurs: Set the bar high and resolve to be bigger than your problems. Who's the boss? _E_ He is out of real solutions @BarackObama's job bill is nothing more than a tax increase. _E_ Do you notice we are not having a gun debate right now? That's because they used knives and a truck! _E_ "Golf is deceptively simple and endlessly complicated. It satisfies the soul and frustrates the intellect. (cont) __HTTP__ _E_ The U.S. will invite El Chapo the Mexican drug lord who just escaped prison to become a U.S. citizen because our leaders can't say no! _E_ If I'd started in business thinking I knew everything I'd have been sunk before I got started. Think Like a Champion _E_ New Government data by the Center for Immigration Studies shows more than 3M new legal & illegal immigrants settled.. __HTTP__ _E_ Sean's interview with Bob Woodward on @hannityshow was very interesting Woodward was great. __HTTP__ _E_ In the upcoming New Year we will focus like never before if we do that we will have complete and total VICTORY in all we do! _E_ I loved beating John Kasich in the debates but it was easy—he came in dead last! _E_ Check it out 2nd video on Lying Crooked Hillary is now online! Watch it here: __HTTP__ #CrookedHillary #Trump2016 _E_ Entrepreneurs: Keep your focus and keep your momentum. Listen apply and move forward. Set the standard! 
_E_ Was with @jacknicklaus yesterday great golfer great architect great guy! _E_ Thank you Pittsburgh Pennsylvania!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ The Republicans better be careful. Obama is out to destroy them! _E_ Television ratings for @nbcsnl Saturday Night Live just came out and they were great the best since 2011. Very few protesters! _E_ Failure is simply the opportunity to begin again this time more intelligently. Henry Ford _E_ Selective memory @BarackObama says that he forgets the recession __HTTP__ Maybe that's why he is forgetting to create jobs. _E_ Do we really need another Bush in the White House we have had enough of them. __HTTP__ _E_ It's Tuesday. I wonder how much money @HuffPost lost today great purchase AOL _E_ One of Obama's greatest failures will be his legacy of making millions completely dependent on government handouts not work. _E_ Am leaving now for Florida to see our GREAT first responders and to thank the U.S. Coast Guard FEMA etc. A real disaster much work to do! _E_ Stop the EBOLA patients from entering the U.S. Treat them at the highest level over there. THE UNITED STATES HAS ENOUGH PROBLEMS! _E_ The new Pope is a humble man very much like me which probably explains why I like him so much! _E_ China is our enemy. It's time we start acting like it and if we do our job correctly China will gain a whole (cont) __HTTP__ _E_ After two days of very productive talks Prime Minister Abe is heading back to Japan. L _E_ Mitt Romney must start congratulating the Navy Seals and military on Bin Laden's killing not the President. _E_ We should not allow @Chrysler to move @Jeep jobs to China after they said they wouldn't stay tuned! _E_ "Score one for the Donald in his battle with @AGSchneiderman." __HTTP__ _E_ I still say Te'o did this in order to get sympathy for the Heisman vote—thankfully he did not win. _E_ Just returned from Pensacola Florida where the crowd was incredible. 
_E_ They should have allowed applause during the TRIBUTE to the departed Really bad production. Bette Midler sucked! #Oscars _E_ America will THINK BIG once again. We will inspire millions of children to carry on the proud tradition of American... __HTTP__ _E_ On Monday ObamaCare kicks in with all goodies of 300% increased premiums higher taxes and part time replacement employees. _E_ Hillary Clinton should ask why the Democrat pols in Atlantic City made all the wrong moves Convention Center Airport and destroyed City _E_ .@VenueMagazine_ highlights the opening of @TrumpDoral's brand new #RedTiger course: __HTTP__ _E_ Great new ad from @MittRomney titled Nothing's Free __HTTP__ detailing both the high costs and taxes of ObamaCare. _E_ Good news out of the House with the passing of 'No Sanctuary for Criminals Act.' Hopefully Senate will follow. _E_ Sadly I'm probably helping @billmaher's lowly rated show—but charity will benefit by $5 million so it's worth it. _E_ #sweepstweet @DonaldJTrumpJr and @EricTrump have the eyes and ears for total surveillance I wonder where they got that from? _E_ RT @DRUDGE_REPORT: LIMBAUGH: By not showing he's owning entire event... __HTTP__ _E_ The #MarchForLife is so important. To all of you marching you have my full support! _E_ Just watched @Patriots Bill Belichick's news conference. He did a great job—smart concise truthful! _E_ ...I trounced him in ratings & Letterman beat @jayleno last Thursday. Brian—are you irrelevant? _E_ Sugar: @Lord_Sugar—unlike you I own The Apprentice. You were never successful enough... _E_ Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_ Aberdeenshire coast is spectacular. Its historic value & wildlife will be tarnished if these wind turbines are built but they won't be! _E_ I will be interviewed on @foxandfriends tomorrow at 7am. Enjoy! _E_ "If you have a crisis whether on a ship or wherever there are heroes who rise above it." 
Jerry Bruckheimer _E_ Thank you Montana! #Trump2016 __HTTP__ __HTTP__ _E_ .@TrumpChicago's The Spa offers 5 star services w/ 12 treatment rooms & 53 spa guestrooms overlooking the skyline __HTTP__ _E_ Marco Rubio will not win. Weak on illegal immigration strong on amnesty and has the appearance to killers of the world as a lightweight . _E_ "Compete with yourself to be the best you can be." – Think Like a Champion _E_ Another new Iowa poll just released. Thank you! #IACaucus #FITN __HTTP__ _E_ Jeb Bush has zero communication skills so he spent a fortune of special interest money on a Super Bowl ad. He is a weak candidate! _E_ .@CGasparino Good seeing you. Keep up the great work never stop! _E_ Rodolfo Rosas Moya and his pals in Mexico owe me a lot of money. Disgusting & slow Mexico court system. Mexico is not a U.S. friend. _E_ The results are in. I killed Wolf Blitzer in our debate. I like Wolf but he went for an ambush! #wolfblitzercnn _E_ More than $500 million designated for Iraqi Army disappeared. Where is it? Our sad sad country what have we come to? _E_ A gallon of gas is $3.523 today and has never before risen so high early in the year __HTTP__ The @BarackObama policy realized! _E_ As China is built on corporate espionage currency manipulation & cheap labor its economy is a ticking time bomb __HTTP__ _E_ Dopey Sugar @Lord_Sugar I never go silent. I was buying a major property in Florida a property worth more than you are! _E_ Great article in the @NewYorkPost by Ben Garrett Don't Blame Sandy on Global Warming __HTTP__ _E_ With @IvankaTrump and @EricTrump at the opening of the @GaryPlayer Villa at @TrumpDoral __HTTP__ _E_ With the two wacko perverts Spitzer and Weiner NYC politics has become a joke all over the world. _E_ James Holmes the Aurora Colorado guy who killed 12 people & injured 58 others is fighting hard to avoid the death penalty... _E_ America is mired in the longest job recession since the Great Depression. @MittRomney can get us out of it. 
(cont) __HTTP__ _E_ Looking forward to a great weekend in Iowa! #IACaucus #CaucusForTrump Tickets: __HTTP__ __HTTP__ _E_ Via @HuffPostPol by @_under_current: "Donald Trump Will End Outsourcing If President" __HTTP__ _E_ Just gave a speech to the great men and women at Yokota Air Base in Tokyo Japan. Leaving to see Prime Minister Abe. __HTTP__ _E_ Thank you to @GaryVanSickle & Sports Illustrated @SInow for the really nice piece about me. March 17 2014 issue __HTTP__ _E_ Will be at venue in wonderful South Carolina very soon. Big traffic back up tremendous crowd! Will be wild. _E_ "A true business only exists to solve a problem and to make life better." – Midas Touch _E_ Amazing Obama speaks market goes DOWN Trump tells CNBC he's buying stock market goes UP should not be that way! _E_ Join me in Florida tomorrow! #MakeAmericaGreatAgain Daytona | 3pm __HTTP__ | 7pm __HTTP__ _E_ My book with @theRealKiyosaki Midas Touch is divided into five sections. The first is the thumb __HTTP__ _E_ The @TODAYshow refused to use their just in poll numbers where I have a massive lead but instead used @CNN numbers where my lead is smaller. _E_ We need a President who understands the economy @gallupnews has US unemployment at 8.2% in July up from 8% in June __HTTP__ _E_ People are going crazy with my comments on Diet Coke (soda). Let's face it this stuff just doesn't work. It makes you hungry. _E_ #ICYMI: I agree To all Americans I see you & I hear you. I am your voice. Vote to #DrainTheSwamp with me on 11/8.... __HTTP__ _E_ We need a balanced budget Amendment because Congress has no fiscal discipline. _E_ Our national debt has grown by 30% and a gallon of gas has doubled so far under @BarackObama. He is a disaster. _E_ .@EricTrumpFdn continues to do important work for @StJude Children's Research Hospital. I am very proud of @EricTrump's philanthropy. _E_ I will be on CNN's State of the Union tomorrow morning at 9amE. 
__HTTP__ __HTTP__ _E_ I have been asking Director Comey & others from the beginning of my administration to find the LEAKERS in the intelligence community..... _E_ The Trump Administration has terminated more UNNECESSARY Regulation in just twelve months than any other Administration has terminated during their full term in office no matter what the length. The good news is THERE IS MUCH MORE TO COME! _E_ Windmills are a bigger safety hazard than either coal or oil __HTTP__ A 34% higher mortality rate than coal alone. Outrageous! _E_ Is Jon Stewart a racist? See video __HTTP__ @thedailyshow _E_ While @BarackObama criticizes the GOP budget his own party graded him with an F by voting down his budget in the House 414 0. _E_ I will be doing Fox & Friends in 10 minutes at 7.00. Many things to talk about! ENJOY _E_ Press conference after CPAC speech this morning was excellent lots of very professional reporters. _E_ Almost no news organizations are showing the satirical pictures. Gee I wonder why? The media is usually so brave! _E_ I don't know what it is but I'm getting totally bored watching NFL football. Too many penalties and far too soft! T.V. off and back to work _E_ The habitual vacationer: @BarackObama has campaigned on our dime more than any previous president in history... (cont) __HTTP__ _E_ President Obama looks absolutely exhausted in the Netherlands. He is not a natural leader was never ment to lead it is tough work for him _E_ Remember Univision apologized! _E_ I hope everyone is having a great Christmas then tomorrow it's back to work in order to Make America Great Again (which is happening faster than anyone anticipated)! _E_ Hitting the first ball at Trump International Dubai 272 right down the middle. __HTTP__ _E_ Today our entire nation pauses to REMEMBER PEARL HARBOR—and the brave warriors who on that day stood tall and fought for America. God Bless our HEROES who wear the uniform and God Bless the United States of America. 
#PearlHarborRemembranceDay __HTTP__ _E_ .@TrumpScotland provides luxury accommodations & a championship Par 72 7400 yd. course. Book your tee time now __HTTP__ _E_ As of September 30th we have a record trade deficit with China of over $217Billion. They are ripping us off. #TimeToGetTough _E_ "One of the keys to thinking big is total focus." – The Art of The Deal _E_ I gave millions of dollars to DJT Foundation raised or recieved millions more ALL of which is given to charity and media won't report! _E_ I'm on the David Letterman @LateShow tonight looking forward to it. 11:35 PM on CBS. _E_ Trump Miss Universe simulcast on @nbc and @Telemundo on December 19th will once again deliver an entertaining and 'beautiful' show! _E_ We were led to believe that Jeep would manufacture in U.S. and sell to China—like China does to us. _E_ Why is the GOP establishment so threatened by the Newsmax @iontv debate? More debate is always better. _E_ Scary Americans private wealth fell 40% from 2007 2010 __HTTP__ But @BarackObama thinks the private economy is doing fine. _E_ Hurricane Irma is of epic proportion perhaps bigger than we have ever seen. Be safe and get out of its wayif possible. Federal G is ready! _E_ A wonderfully written article concerning Israel by @JasonDovEsq __HTTP__ _E_ Sen. Jeff Flake(y) who is unelectable in the Great State of Arizona (quit race anemic polls) was caught (purposely) on "mike" saying bad things about your favorite President. He'll be a NO on tax cuts because his political career anyway is "toast." _E_ .@club4growth asked me for $1 million. I said no. Now falsely advertising that I will raise taxes. I'll lower big league for middle class. _E_ China must be worried that @MittRomney will win this November. They have never had such a pushover like @BarackObama. _E_ I played football and baseball sorry but said to be the best bball player in N.Y. State ask coach Ted Dobias said best he ever coached. 
_E_ Teachers in Chicago should go back to work immediately.Rahm Emanuel has offered them a fair deal. Now they're just acting for the cameras. _E_ The Manufacturing Index rose to 59% the highest level since early 2011 and we can do much better! _E_ Our great country has been divided for decades. Sometimes you need protest in order to heel & we will heel & be stronger than ever before! _E_ Donald Trump to Chris Christie: Don't hire @stuartpstevens __HTTP__ via @politico by @Hadas_Gold _E_ A certain whack job Go Angelo who doesn't have a life spends his time hopelessly attacking me re: Macy's.... _E_ Best thing my supporters can do if you don't like the way @megynkelly and her puppets unfairly treat us is don't watch her show! _E_ .@transition2017 update and policy plans for the first 100 days. __HTTP__ _E_ On Mike and Mike @espn in two minutes! _E_ It's late in July and it is really cold outside in New York. Where the hell is GLOBAL WARMING??? We need some fast! It's now CLIMATE CHANGE _E_ Thank you California! Will see you soon! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ We must remember this truth: No matter our color creed religion or political party we are ALL AMERICANS FIRST. __HTTP__ _E_ #TrumpVlog #TheInterviewMovie A sad day for freedom of speech __HTTP__ _E_ The Chinese are the biggest beneficiary of this post Saddam oil boom in Iraq __HTTP__ _E_ Everyone is asking me to speak more on Robert & Kristen.I don't have time except to say Robert drop her she cheated on you & will again! _E_ Why are we fighting for the rebels that hate us only to save face for Obama! _E_ Real unemployment is 20%. We must simplify the tax code and start making our own products again to bring our jobs back from overseas. _E_ MAKE AMERICA GREAT AGAIN! __HTTP__ _E_ ??? @BarackObama held a raffle with donors for a lunch in the White House. The winners were conveniently all (cont) __HTTP__ _E_ Congratulations to @JasonDufner on winning the PGA championship. Great job! 
_E_ THe people at shouldtrumprun.com have got it right! How are our factories supposed to compete with China and other countries... _E_ Achievers go for the challenge so the next deal is what they're thinking about. They have an obligation to best themselves. _E_ Just stated by a total pro: You are the only one who has the guts to say what we are all thinking. _E_ .@David_Cameron As Prime Minister why are you spending vast amounts of money to subsidize ugly wind turbines in Scotland that nobody wants? _E_ Gallup finds Des Moines Iowa has the highest community pride (76.5) of any large city. Congrats and I agree I love the place! @DesMoines _E_ Iraq buying $200000000 worth of weapons from Iran. Despite so many killed and trillions spent Iraq dumps U.S. I TOLD YOU SO LONG AGO! _E_ Watch @IvankaTrump's Ready To Wear Fashion Show at @LordandTaylor featuring @TrumpModels and @MissUSA..... __HTTP__ _E_ Never ever quit never give up Donald J. Trump The Art of the Deal. _E_ Just heard Fake News CNN is doing polls again despite the fact that their election polls were a WAY OFF disaster. Much higher ratings at Fox _E_ RT @DonaldJTrumpJr: FINAL PUSH! Eric and I doing dozens of radio interviews. We can win this thing! GET OUT AND VOTE! #MAGA #ElectionDay ht... _E_ It does matter! __HTTP__ _E_ Last night's All Star @ApprenticeNBC once again showed why the ultimate onus lies with the project manager. The buck stops there. _E_ He thinks that the wealth you create belongs to the gov't. @BarackObama doesn't respect the fact that the money he wastes belongs to us. _E_ Via @DailyCaller by @samsondunn: "Pastor To Hispanic Congregation Speaks Out On Trump Immigrant Crime Statement" __HTTP__ _E_ The OWS protesters are doing nothing to advance the interests of the 99%. Time for them to go home! _E_ It's extremely cold in NY & NJ—not good for flood victims. Where is global warming? _E_ Friends in NY 9 let @BarackObama know that you don't approve of his mistreatment of @Israel. 
Vote for @Bobturner9th tomorrow! _E_ Nobody wants wind turbines they are failing all over the world and need massive subsidy a disaster for taxpayers. _E_ No one has worse judgement than Hillary Clinton corruption and devastation follows her wherever she goes. _E_ Poll numbers are starting to look very good. Leading in Florida @CNN Arizona and big jump in Utah. All numbers rising national way up. Wow! _E_ At the request of the Governor of Texas I have signed the Disaster Proclamation which unleashes the full force of government help! _E_ .@TrumpGolfLA public golf course features spectacular panoramic Pacific Ocean views an elite attraction __HTTP__ _E_ Opportunities only present themselves if you are out there looking for them. Be aggressive and seize them when they come. _E_ Free enterprise is essentially a formula not just for wealth creation but for life satisfaction. Arthur C. Brooks _E_ It was an honor to welcome Republican and Democratic members of the Senate Finance Committee to the @WhiteHouse today. #TaxReform __HTTP__ _E_ ...for safety. Thank you to the Governor of P.R. and to all of those who are working so closely with our First Responders. Fantastic job! _E_ Jeb Bush signed memo saying not to use the term anchor babies offensive. Now he wants to use it because I use it. Stay true to yourself! _E_ I suspect @JoeBiden could do well tonight. Don't be fooled by his gaffes. He is a seasoned and feisty debater. _E_ Thank you Kansas! The line going into the Orlando event is over a mile long. Massive crowd expected. Leaving Kansas now be there soon! _E_ Trump to host #Oscars? __HTTP__ _E_ Entrepreneurs: Listen and learn from others but make your own decisions. Take responsibility for yourself. It's a very empowering attitude! _E_ Happy birthday to U.S. ARMY and our soldiers. Thank you for your bravery sacrifices & dedication. Proud to be your Commander in Chief! 
_E_ 7.8% unemployment number is a complete fraud as evidenced by the jobless claims number released yesterday.Real unemployment is at least 15% _E_ Just left hospital. Rep. Steve Scalise one of the truly great people is in very tough shape but he is a real fighter. Pray for Steve! _E_ RT @SecShulkin: Our Mobile Vet Center set up and ready to help #Veterans impacted by #HurricaneHarvey in Corpus Christi. __HTTP__ _E_ Many reports that I will be attending the Alvarez/Khan fight this weekend in Vegas. Totally untrue! Unfortunately I have other plans. _E_ "Today's put off objectives reduce tomorrow's achievements." Henry Banks _E_ I'm a former chief of police in a border town. I'm Hispanic I'm proud to be Hispanic and I'm 100% behind Trump. __HTTP__ _E_ Mar a Lago in Palm Beach is one of the most exclusive & elite clubs in the world w/award winning amenities __HTTP__ _E_ Good advice from my mother Mary MacLeod Trump: "Trust in God and be true to yourself." _E_ Iran the Number One State of Sponsored Terror with numerous violations of Human Rights occurring on an hourly basis has now closed down the Internet so that peaceful demonstrators cannot communicate. Not good! _E_ Thank you for your endorsement @GovernorSununu. #MAGA __HTTP__ _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ Tomorrow's the day! Knock on doors and make calls with us on National Day of Action! #TrumpTrain #MAGA... __HTTP__ _E_ Must watch – owner of a single restaurant anticipates that ObamaCare will cost over $1M for compliance __HTTP__ _E_ Q/A @thecelidebiasio The secret behind my success is that I love what I'm doing. That gives me energy focus (cont) __HTTP__ _E_ Things work out best for those who make the best of how things work out. John Wooden _E_ If the Republicans ever want to win a presidential election in the next 30 years they must get rid of @KarlRove. He is useless. 
_E_ More lies and deceptions @BarackObama is having his ex staffers write 'independent' studies for his reelection __HTTP__ _E_ True thanks. __HTTP__ _E_ "We build too many walls and not enough bridges." Isaac Newton _E_ I went to Wharton made over $8 billion employ thousands of people & get insulted by morons who can't get enough of me on twitter...! _E_ You have until 8pm to #VoteTrump Delaware! __HTTP__ _E_ Located in Tribeca each @TrumpSoHo hotel room features floor to window ceilings for a view of lower Manhattan __HTTP__ _E_ Crooked Hillary Clinton wants to essentially abolish the 2nd Amendment. No gun owner can ever vote for Clinton! _E_ Welcome to the 'Islamist Winter' the Muslim Brotherhood is now taking over the Egyptian military and possibly (cont) __HTTP__ _E_ I take great pride watching skaters enjoy the #TRUMP Rink in Central Park from my office world's best skating rink __HTTP__ _E_ South Korea is finding as I have told them that their talk of appeasement with North Korea will not work they only understand one thing! _E_ I'm not against vaccinations for your children I'm against them in 1 massive dose.Spread them out over a period of time & autism will drop! _E_ The way President Obama runs down the stairs of Air Force 1 hopping & bobbing all the way is so inelegant and unpresidential. Do not fall! _E_ Despite what the haters and losers like to say I never filed for bankruptcy but WOW the preeminent gaming company Caesars just did. _E_ .@thehill John Oliver had his people call to ask me to be on his very boring and low rated show. I said NO THANKS Waste of time & energy! _E_ I thought I was being nice to somebody re their parents. I guess this teaches you not to be nice or trusting. Sad! _E_ Our country has tremendous potential. Together we can fix Washington. Let's Make America Great Again! __HTTP__ _E_ Chrysler is moving a massive plant from Mexico to Michigan reversing a years long opposite trend. Thank you Chrysler a very wise decision. 
The voters in Michigan are very happy they voted for Trump/Pence. Plenty of more to follow! _E_ Crooked Hillary is being badly criticized (for a Wall Street paid for ad) by PolitiFact for a false ad on me on women. She is a total fraud! _E_ On behalf of a GRATEFUL NATION THANK YOU to all of the First Responders (HEROES) who saved countless lives in Las Vegas on Sunday night. __HTTP__ _E_ The fake news media is going crazy with their conspiracy theories and blind hatred. @MSNBC & @CNN are unwatchable. @foxandfriends is great! _E_ Rather than causing a big disruption in N.Y.C. I will be working out of my home in Bedminster N.J. this weekend. Also saves country money! _E_ THANK YOU to all of the incredible volunteers behind the scenes in Iowa! #CaucusForTrump __HTTP__ __HTTP__ _E_ Thank you South Carolina! Together WE WILL MAKE AMERICA GREAT AGAIN! #VoteTrumpSC __HTTP__ _E_ We will repeal and replace the horrible disaster known as #Obamacare! __HTTP__ _E_ .@mcuban When Apprentice became the #1 show on tv you tried copying me with The Benefactor a complete and total ratings disaster for @ABC. _E_ Sharks are last on my list other than perhaps the losers and haters of the World! _E_ "My office is at Yankee stadium. Yes dreams do come true." @Yankees Captain Derek Jeter _E_ Reverend Wright was dumped like a dog by @BarackObama he can't be feeling too good. _E_ The Yankees are sure lucky George Steinbrenner is not around. A lot of people would be losing their jobs. _E_ Gov Kasich voted for NAFTA which devastated Ohio and is now pushing TPP hard bad for American workers! _E_ Hard to believe that with 24/7 #Fake News on CNN ABC NBC CBS NYTIMES & WAPO the Trump base is getting stronger! _E_ Thank you South Carolina! Everyone get out and vote tomorrow! We will #MakeAmericaGreatAgain! __HTTP__ _E_ Zimmerman is no angel but the lack of evidence and the concept of self defense especially in Florida law gave the jury little other choice _E_ Barack Obama is not who you think he is. 
Most overrated politician in US history. _E_ People are happy that I left the Trump Tower atrium open as opposed to taking the easy way out. __HTTP__ _E_ I really like the Koch Brothers (members of my P.B. Club) but I don't want their money or anything else from them. Cannot influence Trump! _E_ "Be tough be smart be personable but don't take things personally. That's good business." – Think Like a Champion _E_ Power Lunching next to the #BlueMonster: __HTTP__ via @UrbanDaddy cc @TrumpDoral _E_ Still a buyer's market. Buy directly from a bank. They want to offload properties that have defaulted will give good prices & financing. _E_ Obama administration said that Saudi Arabia was on Syria's border __HTTP__ Wrong. These are the civilians planning the war. _E_ The Trump Organization is honored to be expanding our interests into Dubai. The golf course will be the top course in the Middle East. _E_ Ted Cruz purposely and illegally did not list on his personal disclosure form personally guaranteed loans from banks. They own him! _E_ Pervert Alert! Serial sexter @anthonyweiner has promised to use twitter as a "tool." Parentsmake sure your children have him blocked. _E_ The third mass attack (slaughter) in days by ISIS. 200 dead in Baghdad worst in many years. We do not have leadership that can stop this! _E_ Congratulations to Rex Tillerson on being sworn in as our new Secretary of State. He will be a star! _E_ A great interview of @DonaldJTrumpJr in the @ globeandmail on Trump Tower Toronto __HTTP__ _E_ As China and the rest of the World continue to rip off the U.S. economically they laugh at us and our president over the riots in Ferguson! _E_ Thank you! #TrumpPence16 __HTTP__ _E_ Just out: The Obama Administration knew far in advance of November 8th about election meddling by Russia. Did nothing about it. WHY? _E_ Did China ask us if it was OK to devalue their currency (making it hard for our companies to compete) heavily tax our products going into.. 
_E_ Getting ready to engage G7 leaders on many issues including economic growth terrorism and security. _E_ Congrats to @leezeldin on a great victory. I hope my robocalls helped! #NY1 _E_ Can you believe that Mitch McConnell who has screamed Repeal & Replace for 7 years couldn't get it done. Must Repeal & Replace ObamaCare! _E_ Crazy @megynkelly says I don't (won't) go on her show and she still gets good ratings. But almost all of her shows are negative hits on me! _E_ Tennessee GOP Poll __HTTP__ 32.7%Cruz 16.5%Carson 6.6%Rubio 5.3%Christie 2.4%Jeb 1.6% _E_ As a very active President with lots of things happening it is not possible for my surrogates to stand at podium with perfect accuracy!.... _E_ My @SquawkCNBC interview discussing 2012 election polls @MittRomney's current trip & the US housing & land market __HTTP__ _E_ One point I made last night and will continue to push is that the @GOP can't be pollitically correct. We must fight fire with fire. _E_ .@billmaher was so nervous talking about me on the @jayleno show—I've never seen him like that! _E_ China has just intervened to lower the yuan in other words they will continue to screw the U.S.! _E_ I am pleased to inform you that I have just granted a full Pardon to 85 year old American patriot Sheriff Joe Arpaio. He kept Arizona safe! _E_ He may be the worst reporter in all of sports: @RickReilly of @ESPN. He gets away with murder and most people (cont) __HTTP__ _E_ The Kate Steinle killer came back and back over the weakly protected Obama border always committing crimes and being violent and yet this info was not used in court. His exoneration is a complete travesty of justice. BUILD THE WALL! _E_ We will soon be at a point with our incompetent politicians where we will be treating illegal immigrants better than our veterans. 
_E_ Via __HTTP__ Interview with Donald Trump about Presidential Aspirations: It's all a deal __HTTP__ _E_ Wow @UnionLeader circulation in NH has dropped from 75000 to around 10—bad management. No wonder they begged me for ads. _E_ Isn't it a shame that the person who will have by far the most delegates and many millions more votes than anyone else me still must fight _E_ If @VattenfallGroup dropped out of the economically unfeasible wind farm development in Aberdeen who is (cont) __HTTP__ _E_ We cannot let this evil continue! #Debates2016 __HTTP__ _E_ Thank you South Carolina! We will MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ __HTTP__ _E_ Trump: If Republicans 'don't get tough they're not going to win this election' __HTTP__ Via @thehill _E_ Will be on Fox & Friends in 3 minutes 7.00 A.M. _E_ Entrepreneurs: Don't ever think you've done it all already or that you've done your best. You haven't so don't limit yourself! _E_ I just passed a 10 block long gas line going to LGA airport a terrible situation! _E_ Our country is totally fractured and with our weak leadership in Washington you can expect Ferguson type riots and looting in other places _E_ Advice from my father Fred C. Trump: Know everything you can about what you're doing. _E_ Have passion for what you do and be efficient at the same time. Think Like a Champion _E_ Ted Cruz is falling in the polls. He is nervous. People are worried about his place of birth and his failure to report his loans from banks! _E_ Video in honor of the 100th Anniversary of the Anti Defamation League (ADL): "Imagine a World Without Hate" __HTTP__ _E_ If authorities need direct view from top of Trump Tower call office. _E_ He @BarackObama should not be trying to intimidate the USC justices on ObamaCare. He is worried because SG (cont) __HTTP__ _E_ "A people that values its privileges above its principles soon loses both." Dwight D. 
Eisenhower _E_ Via @WTCommunities: Donald Trump to CPAC: Romney 'Didn't Talk Enough About Success' __HTTP__ by @HuizingaDanny _E_ I can't believe Apple isn't moving faster to create a larger iPhone screen. Bring back Steve Jobs! _E_ Congress should get back to Washington but @BarackObama doesn't want to interrupt his vacation in Martha's Vineyard. _E_ Entrepreneurs: When negotiating don't be an open book. Know that the only person on your side might be yourself. _E_ The more time you spend feeling sorry for yourself the more time you waste after a setback. Move on and quickly embrace the next challenge! _E_ See dummy Danny Zuker who I never heard until this started something that he couldn't finish gutless and unwilling to take my bet! _E_ Thank you Tennessee! #Trump2016 __HTTP__ _E_ "Becoming an entrepreneur is a personal development program. If you grow personally your business will grow." – Midas Touch _E_ Thank you @DonaldJTrumpJr! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Leaving for the GREAT STATE OF SOUTH CAROLINA now to make a speech about how to MAKE OUR COUNTRY GREAT AGAIN! _E_ Great rally in Fresno California great crowd! Thank you! #Trump2016 __HTTP__ _E_ With @greta in Washington D.C. Old Post Office under construction. Tune in tonight at 7PM EST! __HTTP__ _E_ See Lyin' Ted even the @DailyBeast (no fan of mine) says this story came from Rubio not Trump! __HTTP__ _E_ Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before _E_ If Obama has to re fight this fight next year he loses Watch the fine details in every deal The Art of the Deal _E_ Via @DMRegister BY @JenniferJJacobs: "@SteveKingIA ramps up with first TV ad Trump event" __HTTP__ _E_ New CNN Iowa poll Trump 33 Cruz 20. Everyone else way down! Don't trust Des Moines Register poll biased towards Trump! 
_E_ Today we honored our true American heroes on the first ever National Vietnam War Veterans Day.#ThankAVeteran... __HTTP__ _E_ Wow the @nytimes is losing thousands of subscribers because of their very poor and highly inaccurate coverage of the Trump phenomena _E_ Did my weekly phoner on Fox & Friends this morning...sounding off on issues of the day ... __HTTP__ _E_ Great job tonight @ericbolling _E_ Tired of being bullied by the economy? I'm going to help people. Wednesday 11 AM at Trump Tower _E_ I will do more in the first 30 days in office than Hillary has done in the last 30 years! #Debate  #BigLeagueTruth __HTTP__ _E_ The Democrats seem intent on having people and drugs pour into our country from the Southern Border risking thousands of lives in the process. It is my duty to protect the lives and safety of all Americans. We must build a Great Wall think Merit and end Lottery & Chain. USA! _E_ Trump International Hotel Washington D.C. will be one of the world's top luxury hotels __HTTP__ _E_ Thank you. __HTTP__ _E_ Many great business campaigns at @fundanything __HTTP__ Great way to support small upstarts. _E_ Lance Armstrong did himself great harm last night. Lawsuits & failure will follow him! _E_ On the red carpet at the NYC premiere of Dark Knight Rises with @melaniatrump via @NewYorkObserver's @velvet_roper __HTTP__ _E_ .@SarahPalinUSA was 100% correct when she stated that @oreillyfactor used us in day long tease to get people to watch but we were not on! _E_ Featuring five championship golf courses including the Blue Monster @TrumpDoral is South Miami's top destination __HTTP__ _E_ "All Star Celebrity Apprentice" is #1 in the time period among ABC CBS and NBC in 18 49 and all other key demos—Nielsen Ratings _E_ The real estate market is slowly improving. Still a great time to buy. You will thank me in 5 years. _E_ "Money may not grow on trees but it does grow from talent hard work and brains." 
– Think Like a Billionaire _E_ A former Miss New York is the designer behind the swimsuits featured in Sunday's Miss USA pageant—beautiful! __HTTP__ _E_ It was a great honor to be with King Abdullah II of Jordan and his delegation this morning. We had a GREAT bilateral meeting! __HTTP__ _E_ EXCLUSIVE — DONALD TRUMP ON THE GOP PRIMARY: 'IF I WIN I WILL BEAT HILLARY' __HTTP__ via @BreitbartNews by Katie McHugh _E_ Wind turbine syndrome is affecting tremendous numbers of people in their wake—stop ugly turbines. _E_ Floyd Mayweather is being beaten up badly through 10 rounds by Marcos Maidana but announcers say it is even. TWO ROUNDS LEFT. _E_ Wacky & totally unhinged Tom Steyer who has been fighting me and my Make America Great Again agenda from beginning never wins elections! _E_ RT @markets: U.S. job openings surge to record __HTTP__ via @ShoChandra __HTTP__ _E_ Trump Collection's summer line exclusively available @Macys is the pinnacle of style & prestige. Dress your best! __HTTP__ _E_ On December 19th the @MissUniverse pageant will be broadcast live in over 190 countries to one billion viewers. @nbc _E_ The situation with Russia is much more dangerous than most people may think and could lead to World War III. WE NEED GREAT LEADERSHIP FAST _E_ Club For Growth tried to extort $1000000 from me. When I said NO they went hostile with negative ads. Disgraceful! _E_ Dishonest media is trying their absolute best to depict a star in a tweet as the Star of David rather than a Sheriff's Star or plain star! _E_ RT @GregAbbott_TX: Thanks to the Texas National Guard for their help to rescue flooded Texans. #HurricaneHarvey __HTTP__ _E_ "When you have confidence you can have a lot of fun. And when you have fun you can do amazing things." @RealJoeNamath _E_ Great poll thank you Nevada!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ "Generals don't panic then the troops never panic." @SHAQ _E_ You can't tax business. Business doesn't pay taxes. It collects taxes. 
― Ronald Reagan _E_ Thank you New York! I love you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ I know about the "rustic" look on golf courses—but see photo of highly rated Trump National Philadelphia—a real gem. __HTTP__ _E_ Thank you to @NYPost's Robert Rorke for the really nice review of #SNL. So many enjoyed it very gratifying! __HTTP__ _E_ The TODAY Show should call me about who to put on the show— I know more about people who get ratings than anyone. _E_ Obama wants Americans to keep buying crude from OPEC who is ripping us off instead of our ally Canada through (cont) __HTTP__ _E_ Just out new PPP NATIONAL POLL has me in first place by a wide margin at 29%. I wonder why only @FoxNews has not reported this? Too bad! _E_ As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_ Trump National Golf Club Los Angeles fronts the Pacific Ocean and has an 18 hole Pete Dye course. Beautiful! __HTTP__ _E_ Trump Int'l Washington D.C. is a historic building which our entire nation can take pride in & enjoy Opening 2016 __HTTP__ _E_ .@rupertmurdoch is absolutely right it will be a nightmare for @Israel if Obama is re elected. _E_ I will be on Fox & Friends @foxandfriends at 7.00 a.m. (30 minutes). Enjoy! _E_ The journey to #MAGA began @CPAC 2011 and the opportunity to reconnect with friends and supporters is something I look forward to every year. See you at #CPAC2018! _E_ He is delusional: @BarackObama believes that he is the 4th best POTUS ever. _E_ Minorities line up behind.....Donald Trump #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_ Weak & ineffective @JebBush is doing ads where he shows his statement in the debate but not my response. False advertising! _E_ The UN is about to use its Assembly to attack @Israel. We should defund the UN entirely if they can't act resp... (cont) __HTTP__ _E_ I must say that some of these college football games are great tonight very exciting I wish I had more time to watch! 
_E_ .@seanhannity at 10:00. _E_ I'm convinced that about half of what separates successful entrepreneurs from the non successful ones is pure perseverance. Steve Jobs _E_ My @FoxNews interview with @TeamCavuto discussing the Newsmax @iontv debate #TimeToGetTough and the 2012 race __HTTP__ _E_ Another electric car firm that @BarackObama gave $118M just went bankrupt. __HTTP__ He loves to waste our tax dollars. _E_ "TRUMP DECLARES VICTORY ON IMMIGRATION AS OBAMA ADMITS SOME ILLEGALS ARE 'GANG BANGERS'" __HTTP__ via @BreitbartNews @ASwoyer _E_ #ICYMI: @foxandfriends this morning. __HTTP__ _E_ He admits his presidency has been flawed...but @BarackObama claims economy is stronger. __HTTP__ _E_ Ailsa Course changes: #TrumpTurnberry What a beautiful place! __HTTP__ _E_ RT @foxandfriends: White House calls out Senate Democrats for obstructing nominees __HTTP__ _E_ The judge opens up our country to potential terrorists and others that do not have our best interests at heart. Bad people are very happy! _E_ We will follow two simple rules: BUY AMERICAN & HIRE AMERICAN!#InaugurationDay #MAGA _E_ Crooked Hillary Clinton and her team were extremely careless in their handling of very sensitive highly classified information. Not fit! _E_ Don't forget the open call at Trump Tower tomorrow for The Apprentice. I look forward to seeing you there. _E_ I'm a Republican but not a fan of the last George Bush he also was a lousy President (Iraq etc.). In fact he was so bad he gave us Obama! _E_ ...I told Republicans to approve healthcare fast or this would happen. But don't worry I will veto because I love our country & its people. _E_ Commodity prices are beginning to drop as a result of the Euro crisis __HTTP__ _E_ Russia should hand over Snowden to the U.S. but they are having too much fun taunting our leaders. _E_ Why is it that Eric Schneiderman is considered a lightweight by so many and has failed to go after Jon Corzine and big abusers for billions? 
_E_ I requested that Mitch M & Paul R tie the Debt Ceiling legislation into the popular V.A. Bill (which just passed) for easy approval. They... _E_ Why does @megynkelly devote so much time on her shows to me almost always negative? Without me her ratings would tank. Get a life Megyn! _E_ Do you think that Hillary Clinton will apologize to me for the lie she told about the video of me being used by ISIS. There is no video. _E_ Congratulations to @AlabamaFTBL on winning the BCS championship last night! _E_ Just arrived in Taormina with @FLOTUS Melania. #G7Summit #USA __HTTP__ _E_ The Amateur. On his trip to Afghanistan our commander in chief disclosed the CIA Chief's name. Unsafe disaster! __HTTP__ _E_ Wishing everyone a Happy Memorial Day and a thank you to all the soldiers who protected our great country. _E_ Hillary and the Dems loved and praised FBI Director Comey just a few days ago. Original evidence was overwhelming should not have delayed! _E_ As promised on the campaign trail we will provide opportunity for Americans to gain skills needed to succeed & thrive as the economy grows! __HTTP__ _E_ Trump offers $5 million for Obama college passport records __HTTP__ By @AlexPappasDC @DailyCaller _E_ Watch me tonight on The O'Reilly Factor at 8 pm and 11 pm EST FOX News _E_ Interesting article about Atlantic City __HTTP__ _E_ The American people agree. No free pass for #CrookedHillary! __HTTP__ _E_ Via @bostonherald by @ChrisCassidy_BH: "Trump: `The last thing we need is another Bush'" __HTTP__ _E_ First of all you don't necessarily need the best location. What you need is the best deal. The Art of the Deal _E_ Congratulations to @TrumpChicago @TrumpSoHo and @TrumpLasVegas all listed #1 on @TravelandLeisure World's Best Business Hotels _E_ The Blue Monster at Trump National Doral in Miami is doing record business everybody wants a piece of it. Great reviews. Thank you! _E_ A Rod was a great player when he lived at Trump Park Avenue even though he was on the juice! 
_E_ A friend of mine went to @CakeBossBuddy and sent me this beautiful cake which we put in the atrium of @TrumpTowerNY. __HTTP__ _E_ .@Lord_Sugar nice call on predicting that the iPOD would be dead finished gone kaput __HTTP__ Great business foresight. _E_ So many great endorsements yesterday except for Paul Ryan! We must put America first and MAKE AMERICA GREAT AGAIN! _E_ I hope the @RNC is ready for a Third Party if they blow this election because that is what they will face. They must fight hard. _E_ Word is that @Greta Van Susteren was let go by her out of control bosses at @NBC & @Comcast because she refused to go along w/ 'Trump hate!' _E_ Huffington Post is just upset that I said its purchase by AOL has been a disaster and that Arianna Huffington is ugly both inside and out! _E_ .@MatthewJDowd thank you for the nice comments recently especially on @BarbaraJWalters. My family & I greatly appreciate your kind words. _E_ Great being in Cincinnati Ohio last night thank you! Off to Washington D.C. now. #Trump2016 #AmericaFirst __HTTP__ _E_ The Failing @nytimes set Liddle' Bob Corker up by recording his conversation. Was made to sound a fool and that's what I am dealing with! _E_ Iraq's government is treating us like fools. We should demand their oil. _E_ All this from a guy who lectured Americans about tightening their belts: @BarackObama bashes rich people an... (cont) __HTTP__ _E_ We are the greatest country the world has ever known. I make no apologies for this country my pride in it or (cont) __HTTP__ _E_ Don't forget the Celebrity Apprentice Sunday night at 9 pm on NBC for another surprising and exciting episode __HTTP__ _E_ Trump is going to be our President. We owe him an open mind and the chance to lead. So much time and money will be spent same result! Sad _E_ The so called Commission on Presidential Debates admitted to us that the DJT audio & sound level was very bad. So why didn't they fix it? _E_ Scary. 
Obama and the Democrat Senate have accrued over $5T worth of debt without passing a budget in the last 3 years. 4 more years? _E_ Be on time. Wasting other people's time due to poor planning and thoughtlessness will only leave a bad impression. Think Like a Champion _E_ .@NYDailyNews the dying tabloid owned by dopey clown Mort Zuckerman puts me on the cover daily because I sell. My honor but it is dead! _E_ It's amazing how different all of the polling results are not an exact science. _E_ We must not let #CrookedHillary take her CRIMINAL SCHEME into the Oval Office. #DrainTheSwamp __HTTP__ _E_ .@andersoncooper did an excellent job of hosting the #DemDebate last night. Tough firm but fair. _E_ It is time to create jobs for Americans not D.C. We need a bold new direction. Let's Make America Great Again! __HTTP__ _E_ I told all of the haters and losers long ago that Iraq would fall take the OIL or get out fast! Massive waste of lives and trillions of $'s _E_ China just called. They want to lend Obama another $1B for the ObamaCare web site. _E_ RT @RSBNetwork: LIVE Stream: Donald Trump about to speak in Boca Raton FL. Protesters already before Trump speaks. #TrumpTrain __HTTP__ _E_ He has no respect for American exceptionalism. @BarackObama has outsourced our space program to the Russians __HTTP__ _E_ $30M a year and A Rod is now relegated to the bench. @yankees would have lost if Girardi hadn't benched him in the 9th (see my prediction) _E_ It's Thursday. How much has OPEC ripped us off today? _E_ .@reince is doing a fantastic job for the Republican Party hope he gets the credit he deserves. _E_ The VA scandal will only get worse over the time. Our vets deserve the best care possible. We must be open to private solutions. _E_ The super Liberal Democrat in the Georgia Congressioal race tomorrow wants to protect criminals allow illegal immigration and raise taxes! _E_ Watch @foxandfriends now on Podesta and Russia! 
_E_ This assignment has been a challenge to both teams. #CelebApprentice _E_ I will soon be releasing my response to the fact that President Obama refused to show his applications and records to the public. _E_ Things turn out best for the people who make the best of the way things turn out. John Wooden _E_ Just finished a very good meeting with the President of South Korea. Many subjects discussed including North Korea and new trade deal! _E_ The fight against ISIS starts at our border. 'At least' 10 ISIS have been caught crossing the Mexico border. Build a wall! _E_ Palm Springs CA has been destroyed absolutely destroyed by the world's ugliest wind farm at the Gateway on Interstate 10. Very very sad! _E_ I went to @MikeTyson's play. I will be doing a review in the next #trumpvlog. _E_ It's this simple. "Make America Great Again." #debate #BigLeagueTruth _E_ As to the U.N. things will be different after Jan. 20th. _E_ "You can't con people at least not for long. If you don't deliver the goods people will eventually catch on." The Art of The Deal _E_ Wow I was just informed that I'm being inducted into the @WWE Hall of Fame a great honor 4/6/13 at @MSGnyc __HTTP__ _E_ .@MittRomney & @PaulRyanVP get what needs to be done to reign in China. @BarackObama gets kicked around by the Chinese. _E_ Our national security starts at the border. Do you think ISIS & al Qaeda are just in the Middle East? _E_ Tell 'Top Scot' Michael Forbes to clean up his property—it is an embarrassment to Scotland. _E_ Via @washingtonpost by @jdelreal: About that Donald Trump speech at CPAC ... __HTTP__ _E_ Thank you @scottienhughes for the great job you did on @CNN. Great energy and smarts! I will not let you down. _E_ Distressed real estate opportunities can make great investments. You need the foresight and instincts to know the property's true potential. _E_ Rapper @MacMiller's song Donald Trump now has 57 million hits I created another star where's my cut? 
_E_ RT: @thedailybeast: Polling shows the @AmericansElect movement could still nominate a viable independent with a chance of victory... _E_ One 57 is one of the worst looking buildings I've seen in a long time in particular its very ugly skin. _E_ Funny to hear the Democrats talking about the National Debt when President Obama doubled it in only 8 years! _E_ have been allowed to run guilty as hell. They were VERY nice to her. She lost because she campaigned in the wrong states no enthusiasm! _E_ Just watched @marcorubio on television. Just another all talk no action politician. Truly doesn't have a clue! Worst voting record in Sen. _E_ Last nights results in poll taken by NBC. #AmericaFirst #ImWithYou __HTTP__ _E_ This is a MOVEMENT! #RNCinCLE __HTTP__ _E_ Don't forget the Miss USA Pageant live on Sunday night at 9 pm ET on NBC. And you can vote for your favorite beauty! __HTTP__ _E_ Donald Trump is confident that Ireland is ready for a big comeback __HTTP__ via @independent_ie by @AnitaActually _E_ Will be interviewed by @oreillyfactor tonight at 8 PM. _E_ Help save the lives of our troops.Our #vets suffering from TBI/PTS need treatment @makeitvisible Donate to __HTTP__ _E_ #CelebApprentice who do you think won? _E_ Fake News story of secret dinner with Putin is sick. All G 20 leaders and spouses were invited by the Chancellor of Germany. Press knew! _E_ Will be going to Detroit Michigan (love) today for a big meeting on bringing back car production to State & U.S. Already happening! _E_ Young entrepreneurs – be resolute in your drive for success. Gain momentum. Once you succeed promote yourself! _E_ So great to be in New York. Catching up on many things (remember I am still running a major business while I campaign) and loving it! _E_ Great that Pres. O is seeing @MittRomney today—lots of good things can happen. _E_ How does @HBO employ @BillMaher with a pathetic show that he does what kind of a special is that? Complete garbage! 
_E_ In Las Vegas for the Miss Universe Pageant—airing tonight on @nbc at 8 o'clock. _E_ Sleepy eyes @chucktodd is an absolute joke of a reporter. He is in the bag for Obama. He can't carry @jack_welch's jock. _E_ .@marcorubio what do you say to the family of Kathryn Steinle in CA who was viciously killed b/c we can't secure our border? Stand up for US _E_ The death tax should be abolished the Government is simply taxing you twice. It is also a job killer. _E_ Heading to Iowa to a packed house. Just released polls all first place are amazing. Thank you! _E_ Via @ProgressIndex: "Donald Trump to deliver keynote address at annual Chesterfield Republican Gala" __HTTP__ _E_ I will be interviewed on @FaceTheNation this morning. Enjoy! @jdickerson _E_ Another @BarackObama green car loan recipient is laying off staff. __HTTP__ How many billions of our money has he wasted? _E_ We are going to have a big event at the Verizon Wireless Arena in Manchester New Hampshire! 5K+! Join us tomorrow: __HTTP__ _E_ Wow Mitt Romney didn't know that Rand Paul was in the race for president. Very strange! @FoxNews _E_ Lawrence O'Donnell will soon have another cancelled show to go along with his three cancelled TV series Mister (cont) __HTTP__ _E_ RT @FoxNews: .@AlanDersh: Trump Has 'More Credibility' Than Obama With North Korea __HTTP__ __HTTP__ _E_ Crooked Hillary Clinton is being protected by the media. She is not a talented person or politician. The dishonest media refuses to expose! _E_ Hillary says take back Mosul? We would have NEVER lost Mosul if it wasn't for #CrookedHillary. #DrainTheSwamp __HTTP__ _E_ Visiting New York City? Make sure to skate in the world famous Trump Rink in Central Park __HTTP__ Great for the whole family! _E_ Another clip from my @greta interview discussing why Sony should not have capitulated to the hackers __HTTP__ No Courage! _E_ Obama has admitted that he spends his mornings watching @ESPN. Then he plays golf fundraises & grants amnesty to illegals. 
_E_ Wrong @BarackObama's '08 campaign manager & current Senior WH Advisor collected $100G fee from Iranian affiliate __HTTP__ _E_ We're worried about waterboarding as our enemy ISIS is beheading people and burning people alive. Time for us to wake up. _E_ Getting ready to leave for Poland after which I will travel to Germany for the G 20. Will be back on Saturday. _E_ Iran is flying supply planes to Syria through Iraqi airspace. Thank you United States for making this possible! _E_ Thank you Oklahoma & Virginia! #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ Honored to have passed 1 million twitter followers. We are making America #1 again. #TimeToGetTough _E_ The Trump Signature Collection available @Macys offers top new designs for your fall wardrobe. Dress your best! __HTTP__ _E_ When employees are working at home they can never have the same cohesivness as working together as a group... _E_ #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ I will be nominating Christopher A. Wray a man of impeccable credentials to be the new Director of the FBI. Details to follow. _E_ Today is armed forces day. Thank you  to our military service members! I love you all! _E_ Dopey Sugar—@Lord_Sugar Isn't it sad that my golf course in Scotland just got "best new course in the world"—it's worth more than you are! _E_ .@JuddApatow I agree! _E_ Obama's convention bounce is gone. @MittRomney has retaken the lead in the latest @RasmussenPoll __HTTP__ _E_ Thank you Nashville Tennessee! __HTTP__ _E_ Wow it is unbelievable how distorted one sided and biased the media is against us. The failing @nytimes is a joke. @CNN is laughable! _E_ My interview w/ @WendyWilliams on @WendyShow discussing @MichelleObama's bangs & All Star @CelebApprentice __HTTP__ _E_ Great win by the @nyjets yesterday. If they run the table they will make the playoffs. _E_ Remember when the two failed presidential candidates Lindsey Graham and Jeb Bush signed a binding PLEDGE? They broke the deal no honor! 
_E_ Many of the thugs that attacked the peaceful Trump supporters in San Jose were illegals. They burned the American flag and laughed at police _E_ Will be interviewed by @JudgeJeanine on @FoxNews at 9:00 P.M. (Saturday night). Enjoy! _E_ Congressman John Lewis should finally focus on the burning and crime infested inner cities of the U.S. I can use all the help I can get! _E_ .@andydean2014 Thank you you were great. You can defend me anytime. Amazing job. _E_ Cruz says I supported TARP which gave $25 million to Goldman Sachs the bank which loaned him the money he didn't disclose. Puppet! _E_ In the "old days" when good news was reported the Stock Market would go up. Today when good news is reported the Stock Market goes down. Big mistake and we have so much good (great) news about the economy! _E_ Biden's statements on Medicare are very effective. Ryan must now come back and combat. #VPDebate _E_ A clip from my @foxandfriends interview discussing how Newsmax @iontv debate is determining the GOP primary polls __HTTP__ _E_ Will be interviewed on @FoxNews at 10:00 P.M. Enjoy! _E_ I am in Iowa. Will be interviewed on This Week With @GStephanopoulos this morning. ENJOY! _E_ Be sure to listen to my interview on tonight's @SteveDeaceShow. Steve is a terrific guy! _E_ The great Mike Wallace covered me in a much more professional manner than his son Chris Wallace of @FoxNews. Mike was a total pro! _E_ Hagel committee vote has been postponed as Hagel refuses to disclose all his finances __HTTP__ _E_ NY should frack now. What's the hold up? Is Albany opposed to creating jobs and making gas cheaper for middle class? _E_ When will lightweight hack Attorney General be investigated for his repeated prosecutorial misconduct? __HTTP__ _E_ Figure out what really moves you. You've got to have the 'FIRE' in order to have the Midas Touch. Midas Touch _E_ Don't negate your own power. Whatever you've been dealt know you can deal with it. Fear is the opposite of faith. _E_ True. Thanks. 
__HTTP__ _E_ I love seeing that Graydon Carter and @VanityFair are failing so badly. He's only focused on his bad food restaurants. _E_ Very grateful for the 9 O decision from the U. S. Supreme Court. We must keep America SAFE! _E_ Need all on the UN Security Council to vote to renew the Joint Investigative Mechanism for Syria to ensure that Assad Regime does not commit mass murder with chemical weapons ever again. _E_ "What counts is not necessarily the size of the dog in the fight it's the size of the fight in the dog." Dwight D. Eisenhower _E_ Will be interviewed on @foxandfriends now! _E_ New Hampshire has a major decision to make today. Hopefully we won't have to hear any more Mandarin spoken in future debates. _E_ Amazing evening at Saturday Night Live! _E_ The harder I work the luckier I get. Samuel Goldwyn _E_ NYC is under constant threat from Jihadists & violent criminals. Stop & Frisk keeps streets & subways safe.Stand strong Ray Kelly _E_ Wow reviews are in THANK YOU! _E_ The Gang of Six yet another unmitigated disaster. ANY DEAL NEEDS TO REPEAL OBAMACARE. T E A. _E_ Lightweight Senator @RandPaul should focus on trying to get elected in Kentucky a great state which is embarrassed by him. _E_ Busy doing phoners this week with Neil Cavuto Wolf Blitzer Fox & Friends and Larry Kudlow....check out __HTTP__ _E_ Why didn't Gates resign if he was so unhappy about what he was being told by Obama? The fact is Iraq etc. have always been disasters! _E_ Millions losing healthcare plans despite President Obama's promise that this WOULD NOT HAPPEN! What about a massive protest march on D.C. _E_ My family has the honor of being interviewed for a full hour by the legendary @BarbaraJWalters tonight @ABC 10pmE. __HTTP__ _E_ You can benefit from others' wisdom. Not just their mistakes but the good decisions and insight they have to offer." The Way To The Top _E_ Looks like the U.S. will be having the coldest March since 1996 global warming anyone????????? 
_E_ Thank you Bangor Maine! Get out & #VoteTrumpPence16 on 11/8/16 and together we will MAKE AMERICA SAFE AND GREAT A... __HTTP__ _E_ I will be the featured guest on the season opener of @60Minutes this Sunday. There certainly is plenty to talk about! _E_ CHAIN MIGRATION cannot be allowed to be part of any legislation on Immigration! _E_ Frank was a great guy married to an absolutely wonderful woman @KathieLGifford. What a couple! __HTTP__ _E_ My @gretawire int. on Obama's falling poll numbers Americans losing incentive to work and Weiner's sexting __HTTP__ _E_ Heading to Birmingham Alabama and a massive crowd of incredible people! 12 noon will be wild. _E_ Leaving Nevada now for Iowa. Things are looking good great new polls! _E_ So nice great Americans outside Trump Tower right now. Thank you! __HTTP__ _E_ I am truly enjoying myself while running for president. The people of our country are amazing great numbers on November 8th! _E_ Isn't it great that Obama had time yesterday to fundraise with Jay Z and do @Late_Show while there is a record 21% real unemployment! _E_ Thank you Dayton Ohio! 20000 supporters largest in airport history! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Thank you South Carolina! __HTTP__ _E_ President Obama could totally solve the problem with Putin by demanding that Russia sign on to ObamaCare thereby destroying their economy! _E_ .@HillaryClinton you have failed failed and failed. #BigLeagueTruthTime to #DrainTheSwamp! __HTTP__ _E_ Wages in are country are too low good jobs are too few and people have lost faith in our leaders.We need smart and strong leadership now! _E_ Reports are out there that many CEOs of charities are getting overpaid while their causes are seeing very little... _E_ He (or she) who hesitates is lost: MAKE AMERICA GREAT AGAIN! 
_E_ .@CNN has to do better reporting if it wants to keep up with the crowd.So totally one sided and biased against me that it is becoming boring _E_ Was photo bombed yesterday by a wise guy when I left the set of @LateNightJimmy... _E_ The Senate should immediately vote on the Iranian sanctions bill. What is the delay? Iran is already breaking its agreement with Obama _E_ Donald Trump ready to end @ApprenticeNBC for White House run __HTTP__ via via @dcexaminer by @eScarry _E_ We are what we repeatedly do. Excellence then is not an act but a habit. Aristotle _E_ Via NYTimes What's Your Ideal Gadget? __HTTP__ _E_ Jeb failed as Jeb! He gave up and enlisted Mommy and his brother (who got us into the quicksand of Iraq). Spent $120 million.Weak no chance! _E_ RT @IamVicky4Trump: TUNE IN: Maria Bartiromo Has an Exclusive Interview With President Trump __HTTP__ _E_ .... we push for the removal of all trade distorting practices....to foster a truly level playing field. _E_ It is truly an honor that His Eminence Archbishop of New York @CardinalDolan will be delivering the benediction at the @RNC convention. _E_ "@joerepublic1 @mckaycoppins how nice of this punk with a pen to call a truce after he tries to show u up w/his bs! True thx _E_ Marco Rubio should pick a location that has working air conditioning next time especially when in Miami proper plan. Sweating profusely! _E_ Preview of Obama's SOTU: More taxes bigger government shrink the private sector end the Republicans & bankrupt the country. Enjoy! _E_ Think of it the Arab League doesn't want to get involved with Syria but they want us to do their dirty work. How stupid! _E_ I encourage everyone in the path of #HurricaneHarvey to heed the advice & orders of their local and state officials. __HTTP__ _E_ ...to win. The Democrats are overplaying their hand. They lost the election and now they have lost their grip on reality. The real story... 
_E_ First Minister Salmond should stop his fruitless drive for obsolete wind turbines in Scotland he would become popular again! @alexsalmond _E_ I watched Sen. Graham @FaceTheNation. Why don't they say that I ran him out of the race like a little boy and in the end he had no support? _E_ By @kwrcrow: Hey Washington Post 'Only You Hate Donald Trump' or 'Is it FEAR?' __HTTP__ _E_ I will be on @meetthepress in an interview with @chucktodd on Sunday morning. So much to talk about! _E_ Wow just watching the news.ObamaCare and the website are TOTALLY OUT OF CONTROL. Costs are through the roof. This could be ruinous to U.S.! _E_ Offering river lake & skyline views @TrumpChicago's 339 5 Star rooms range from Deluxe Suites to Spa Guestrooms __HTTP__ _E_ There is no substitute for hard work. Thomas Edison _E_ Join me in Roanoke Virginia on Saturday evening at 6pm! #MAGA __HTTP__ _E_ Fox and.Friends now! _E_ Obama will eventually approve the Keystone XL pipeline has to happen but it is very late! _E_ The boardroom has never been as intense as in the upcoming13th season of All Star @CelebApprentice. Premieres March 3rd on @NBC! _E_ All Star Celebrity @ApprenticeNBC continues to dominate the Sunday 10PM slot in every key demographic. Still hot after 13 seasons! _E_ I will not be attending the White House Correspondents' Association Dinner this year. Please wish everyone well and have a great evening! _E_ So much Fake News being put in dying magazines and newspapers. Only place worse may be @NBCNews @CBSNews @ABC and @CNN. Fiction writers! _E_ Congratulations to our great military men and women for representing the United States and the world so well in the Syria attack. _E_ TODAY WE MAKE AMERICA GREAT AGAIN! _E_ Nielson Media Research final numbers on ACCEPTANCE SPEECH: TRUMP 32.2 MILLION. CLINTON 27.8 MILLION. Thank you! _E_ I didn't suggest a database a reporter did. 
We must defeat Islamic terrorism & have surveillance including a watch list to protect America _E_ Leaked e mails of DNC show plans to destroy Bernie Sanders. Mock his heritage and much more. On line from Wikileakes really vicious. RIGGED _E_ I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ __HTTP__ _E_ Work Underway on First New Trump Course in Dubai Second Course in Planning __HTTP__ via @CybergolfNews _E_ Small businesses will have an ally in the White House with @MittRomney. Mitt gave a great interview yesterday __HTTP__ _E_ Family group shot. #WWEHOF __HTTP__ _E_ Just cannot believe a judge would put our country in such peril. If something happens blame him and court system. People pouring in. Bad! _E_ Via @DailyCaller @NeilMunroDC: "Obama's Border Policy Fueled Epidemic Evidence Shows" __HTTP__ _E_ RT @mike_pence: Good morning! Join me in Lima Ohio tomorrow evening at 7pm. #MAGATickets: __HTTP__ _E_ FLASHBACK via @Reuters from 2004: "Donald Trump Would 'Fire' Bush Over Iraq Invasion" It's called great vision. __HTTP__ _E_ Just arrived in Texas have been informed two @fortworthpd officers have been shot. My thoughts and prayers are with them. _E_ Trump Hotels are delivering lots of food to storm victims...we love doing it! _E_ RT @TeamTrump: Hillary's policies have made America less safe that's why 200+ general and military leaders have endorsed @realDonaldTrump!... _E_ The worst show in Las Vegas in my opinion is @pennjillette. Hokey garbage. New York show even worse! _E_ Awarded both @ForbesInspector Five Star & @AAAFiveDiamond ratings @TrumpNewYork's @Jean_GeorgesNYC is fantastic. __HTTP__ _E_ Thank you Charlotte North Carolina!#MakeAmericaGreatAgain __HTTP__ _E_ "Donald Trump dedicates second Scottish golf course to beloved mother Mary" __HTTP__ via @MailOnline _E_ My fellow Tea Party friends in Ohio make sure you take advantage of early voting so you can GOTV election day. Know you can! Must win Ohio. 
_E_ Everyone should watch the documentary 'Windfall' on @netflix. See an upstate NY town ruined by environmentalists & windfarms. _E_ HAPPY PRESIDENTS DAY MAKE AMERICA GREAT AGAIN! _E_ .@ArceePalabrica @realDonaldTrump Midas Touch is the manual for entrepreneurs who want to succeed. Thanks for sharing your knowledge _E_ One season ends and another starts. Already casting for the next @ApprenticeNBC. Great news for charity $13 million so far. _E_ Looking forward to my meeting with Benjamin Netanyahu in Trump Tower at 10:00 A.M. _E_ Wow I love stimulating debate and driving certain people crazy the Generals were forced to do something they didn't want to do (not me). _E_ Thank you @NYPost! #Trump2016 __HTTP__ _E_ The lightweight hack Schneiderman told Ivanka that the "case is weak and more. Meets with Obama & then files one day later. _E_ Sadly they and others are Fake News and the public is just beginning to figure it out! __HTTP__ _E_ Jeb Bush really blew his interview with @megynkelly should cost him big time. Said he would do the disastrous Iraq war all over again _E_ So Obama wants to bomb ISIS in Iraq & arm them in Syria? What is he doing! _E_ In my administration EVERY American will be treated equally protected equally and honored equally #Debate #BigLeagueTruth _E_ Interesting.@BarackObama's 1981 transfer class to Columbia declined in quality according to the Columbia Spectator __HTTP__ _E_ What's funny about the name "F**kface Von Clownstick" it was not coined by Jon Leibowitz he stole it from some moron on twitter. _E_ So many incredible friends said thanks for TT help I say thanks to you! __HTTP__ _E_ We threw our ally Mubarak overboard and Egypt is now our enemy. Great going Obama Israel is in trouble. _E_ Will be meeting on Monday at Trump Tower with a large group of African American Pastors. Many I know wonderful people! Not a press event. _E_ It doesn't matter that Crooked Hillary has experience look at all of the bad decisions she has made. 
Bernie said she has bad judgement! _E_ So many people don't understand I am a big proponent of vaccines for children—just not in one massive dose—spread them out over time. _E_ Sorry but @piersmorgan is a good & smart man who is doing really well. That's why he won @ApprenticeNBC. _E_ Wow President Obama's brother Malik just announced that he is voting for me. Was probably treated badly by president like everybody else! _E_ Supporters waiting to hear me speak in Oskaloosa Iowa. #MakeAmericaGreatAgain __HTTP__ _E_ With the record high February gas prices hurting the economy even more reason to start fracking. Will create jobs & lower prices. _E_ MAKE AMERICA GREAT AGAIN! #Trump2016 #VoteTrump __HTTP__ _E_ Congratulations to @marklevinshow on 'The Liberty Amendments' debuting at #1 on the NY Times' bestseller list. Must read! _E_ Leaving now for Texas! _E_ Wow! Such a wonderful article from fantastic people my great honor! __HTTP__ _E_ If you love what you do you are going to work harder you are going to try harder and you will be better at it. Think Big _E_ Trump Int'l Puerto Rico spreads luxury residences a world class golf resort & beach club across 1000 acres __HTTP__ _E_ A great job by @RickieFowlerPGA in winning The Players yesterday. Finally your jealous critics can go to hell! Good luck at The U.S. Open. _E_ The unemployment numbers released later this week will show no job growth. We must start making our own products again. #TimeToGetTough. _E_ Dummy writer @Clare_OC from failing @Forbes magazine works so hard to make such trivial license deals look important... _E_ Jennifer Aniston is engaged she's a great person and I wish her well. _E_ We have all got to come together and win this election. We can't have four more years of Obama (or worse!). _E_ #TrumpAdvice __HTTP__ _E_ Happy to announce I am nominating Alex Azar to be the next HHS Secretary. He will be a star for better healthcare and lower drug prices! 
_E_ Via @Newsmax_Media: Maher Being Sued by Trump Over Birth Certificate Bet on 'Tonight Show' __HTTP__ _E_ In the plane heading to Iowa State Fair. Will be great fun. Hopefully giving helicopter rides to some of the kids. _E_ .@MannyPacquiao was robbed in his title fight on Saturday night. No wonder boxing is dying. Bring back the 15 round fights. _E_ Good @FLGovScott is suing the Federal Government so he can protect the voter rolls __HTTP__ Florida must be a legal election. _E_ #FlashbackFriday Trump family final week of @Oprah's show @Oprah is terrific! __HTTP__ _E_ I would have had millions of votes more in the primaries (than Crooked Hillary) if I only had one opponent instead of sixteen. Broke record _E_ Whether you love like or hate Donald Trump I will be on Bill O'Reilly (Fox) tonight at 8.00. Bill knows Trump is great for ratings! _E_ Melania and I extend our deepest condolences to the family of Shimon Peres... __HTTP__ _E_ Now every time Islamic militants attack they will use that movie as an excuse __HTTP__ What was the excuse before the movie? _E_ With imposing dunes on the rugged Aberdeenshire coastline @TrumpScotland's Championship Course is a masterpiece __HTTP__ _E_ Al Sharpton said they are even making it more harder to register people to vote . Which is worse his grammar or his thoughts? _E_ Really bad news just announced concerning jobs. Far fewer jobs created in August than anticipated. Interest rates therefore to remain low. _E_ "Donald Trump: 'Karl Rove Is A Total Loser' So Why Are People Still Giving Him Money?" __HTTP__ via @Mediaite _E_ "Attitude is a little thing that makes a big difference." Winston Churchill _E_ Proud to see my friend Governor Chris Christie standing up for Israel on his visit. Standing tall! _E_ Wow because of the pressure put on by me ICE TO LAUNCH LARGE SCALE DEPORTATION RAIDS. It's about time! _E_ Voter fraud! Crooked Hillary Clinton even got the questions to a debate and nobody says a word. 
Can you imagine if I got the questions? _E_ The Tax Cuts are so large and so meaningful and yet the Fake News is working overtime to follow the lead of their friends the defeated Dems and only demean. This is truly a case where the results will speak for themselves starting very soon. Jobs Jobs Jobs! _E_ ... while a 300ft turbine in Ardrossan North Ayrshire erupted in flames the previous month during gales of 165 mph __HTTP__ _E_ A big day for New York and for our COUNTRY! MAKE AMERICA GREAT AGAIN! _E_ Thank you Governor @ScottWalker & @GOP Chairman @Reince Priebus. #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_ More people attend a @JonHuntsman rally than watch @Lawrence on @MSNBCtv all week. @Lawrence is very lonely. (cont) __HTTP__ _E_ #CelebApprentice what do you think of the choices for project manager? _E_ Weak and totally conflicted people like @TheRickWilson shouldn't be allowed on television unless given an I.Q. test. Dumb as a rock! @CNN _E_ As usual the storm of the century was not nearly as bad as forecast. What a waste of time energy and money! _E_ Via @worldnetdaily by @jerome_corsi: "Donald Trump: Obama's Jobless Figures 'Phony.' Economists agree." __HTTP__ _E_ I can't believe the Yankees continue to pay A Rod they have a perfect right to stop paying (and should have stopped a long time ago). _E_ "The achievements of an organization are the results of the combined effort of each individual." – Vince Lombardi _E_ TRUMP & CLINTON ON IMMIGRATION#Debate #BigLeagueTruth __HTTP__ _E_ "Donald Trump on Jeb Bush: 'The last thing we need is another Bush'" __HTTP__ via @fox5newsdc by @EmilyMiller _E_ ObamaCare has brought skyrocketing premium increases & unaffordable deductibles which will lead to less care & job losses. _E_ "Face reality as it is not as it was or as you wish it to be." 
@jack_welch _E_ SHOCK @BarackObama's people are sending paid political organizers to heckle at @MittRomney events __HTTP__ _E_ National Review Online: Kristin Davis's Libertarian 'Tough Love' __HTTP__ _E_ Congratulations to @bobmcdonnell on leading Virginia to be in the black for a 3rd straight year. He is a fantastic governor. _E_ When will our nation's sacrifices be respectfully appreciated? Iraq and Libya should reimburse us in oil. _E_ .@Lord_Sugar How did you enjoy Mar a Lago? It was nice having you there my people thought you were terrific! _E_ I remember when the Apprentice became the number one show on T.V. @tombrokow came up to me and thanked me on behalf of NBC (Yankee Stadium) _E_ Jusr watched #HarveyPitt on @TeamCavuto he was great! _E_ Via @Hometownlife: Donald Trump to speak at Lincoln Day Dinner at The Showplace in Novi __HTTP__ _E_ Selfishness ultimately begets only unhappiness. Unselfishness begets happiness. B.C. Forbes _E_ Enjoying the Olympics. Great coverage by @NBC as well. GO TEAM USA! _E_ .@MarthaRaddatz was so unprofessional and biased when discussing me on This Week. @GStephanopoulos should not allow this conduct! _E_ REPEAL AND REPLACE OBAMACARE! _E_ Clinton camp fumed when surrogate told supporters Clinton planned to betray labor on TPP post election: __HTTP__ _E_ After decades of lies and scandal Crooked Hillary's corruption is closing in. #DrainTheSwamp! __HTTP__ _E_ I cannot believe how bad Jeb Bush looks with his insane answer on Iraq and then his numerous corrections which made him look even worse. _E_ Home of @PGATOUR's @CadillacChamp @TrumpDoral represents all that is Miami: energy glamour innovation & luxury __HTTP__ _E_ Drew Peterson a real sleaze just convicted of killing wife. Change the law so he gets death penalty. _E_ Saudi Arabia should fight their own wars which they won't or pay us an absolute fortune to protect them and their great wealth $ trillion! _E_ Loved doing the debate...won Drudge and all on line polls! 
Amazing evening moderators did an outstanding job. _E_ RT @mike_pence: We are heading to Virginia. Looking forward to supporting my friend @EdWGillespie. He will make a great Governor for the Co... _E_ .@GovernorSununu who couldn't get elected dog catcher in NH forgot to mention my phenomenal biz success rate: 99.2% __HTTP__ _E_ Alison Grimes supports harsh restrictions to kill coal industry & supports Obama's anti gun legislation. Vote @Team_Mitch! _E_ A country that Crooked Hillary says has funded ISIS also gave Wild Bill $1 million for his birthday? SO CORRUPT! __HTTP__ _E_ I'll be appearing on Larry King Live for his final show Thursday night at 9 p.m. CNN. Larry's been on TV for 25 years... _E_ Hillary Clinton's Presidency would be catastrophic forthe future of our country. She is ill fit with bad judgment. _E_ The Generals and top military brass never wanted a mixer but were forced to do it by very dumb politicians who wanted to be politically C! _E_ 'Clinton Campaign Tried to Limit Damage From Classified Info on Email Server' #DrainTheSwamp __HTTP__ _E_ Beautiful evening with Religious Leaders here at the WH last night. Join us now for a #NationalDayofPrayer LIVE:... __HTTP__ _E_ I have brought millions of people into the Republican Party while the Dems are going down. Establishment wants to kill this movement! _E_ #TBT With Darrell Hammond when I hosted SNL. __HTTP__ _E_ I watched lightweight Senator Marco Rubio who is all talk and no action defend his WEAK position on illegal immigration. Pathetic! _E_ Remember get out on November 8th & VOTE #TrumpPence16. It is time to #DrainTheSwamp this is our last chance! __HTTP__ _E_ The polling numbers for 2012 are very interesting will Americans ultimately want their leaders to be 'likeable' or 'competent'? 
_E_ A real president should take pride in saving and spending your money wisely not funneling it to his cronies (cont) __HTTP__ _E_ .@HillaryClinton and Obama policies increased debt by $9trillion over the last 8 years _E_ RT @Scavino45: U.S. MARKETS FROM ELECTION DAY {Since 11/8/2016} 📈 __HTTP__ _E_ DELUSIONAL Obama actually thought that he won the debate __HTTP__ What is he thinking? _E_ Congratulations to my friend @RoccoMediate on winning the big golf tournament today! _E_ In any business venture remember that branding is one of the most crucial aspects of your enterprise. Fight hard for that brand of yours. _E_ I endorsed @MittRomney not because I agree with him on every issue but because he will get tough with China. _E_ RT @EricTrump: I look forward to being on @CNN with @ErinBurnett at 7:40pmET. @realDonaldTrump _E_ .@SenTedCruz had a very good debate far better than Rand Paul. _E_ Congratulations to Karen Handel on her big win in Georgia 6th. Fantastic job we are all very proud of you! _E_ The @WTA released a new #StrongisBeautiful celebrity campaign today. Amazing athletes. Proud to be a part of this. __HTTP__ _E_ Beyond simple justice and beyond reducing our national debt another advanage of taking the oil is that it (cont) __HTTP__ _E_ Premiering Jan. 4th the record 14th season's @ApprenticeNBC cast is the nastiest yet __HTTP__ Major Boardroom fireworks! _E_ For those asking my son @EricTrump makes zero $$ running his charity & raises a great deal of $$ all of it for @StJude @EricTrumpFdn _E_ Excited to be speaking at @frankgaffney's @securefreedom Iowa National Security Action Summit tomorrow at 1:30PM! __HTTP__ _E_ I am leaving for Norfolk Virginia the great battleship U.S.S. Wisconsin for a big rally and really big crowd. See you soon! _E_ .@Playboy Playmate of the Year @BrandenRoderick returns to the 13th season of All Star @CelebApprentice she is smart & beautiful. _E_ .@Disney's acquisition of Lucas Film is a smart deal for both sides. 
Disney just bought a great brand which will keep producing revenue. _E_ Now Obama has set red line 2 with demand that Assad hands over Syria's chemical weapons or it will face an attack. _E_ Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_ RT @EPAScottPruitt: Thoughts and prayers for those in Texas & Louisiana. I am closely monitoring #Harvey developments along with @fema & @E... _E_ Thank you Senator @ChuckGrassley! #TrumpPence16 __HTTP__ _E_ Wind Power Company Fined $1 Million for Killing Birds. Golden eagles among victims... __HTTP__ @RSPBScotland @Natures_Voice _E_ I am very proud to have brought the subject of illegal immigration back into the discussion. Such a big problem for our country I will solve _E_ Voters understand that Crooked Hillary's negative ads are not true just like her email lies and her other fraudulent activity. _E_ "Statement by President Trump on the Apprehension of Mustafa al Imam for His Alleged Role in Benghazi Attacks" __HTTP__ _E_ Entrepreneurs Always remember that every day counts. Stay focused. Stay positive and develop momentum. _E_ Ratings for #MissUniverse pageant were highest in 4 years. @NBC likes me (and I like them!) _E_ Stop calling my office to do your show I have more important things to do with my time nobody's watching you! @lawrence _E_ My thoughts on last night's Celebrity Apprentice __HTTP__ as well as my latest video blog at __HTTP__ _E_ RT @TeamTrump: .@timkaine has a pay to play problem just like Crooked @HillaryClinton #VPDebates #BigLeagueTruth __HTTP__ _E_ Karl Rove's stupid ad made Ashley Judd hot—now everybody is talking about her. _E_ China has done great under Obama. Increased private US holdings by 500%. Hacks our military & R&D. Robs us blind daily.#timetogettough _E_ When little Morty Zuckerman closes his failing @NYDailyNews will I at least be given some credit? Will happen soon. 
_E_ Why does Obama believe he shouldn't comply with record releases that his predecessors did of their own volition? Hiding something? _E_ Via The Washington Times Mr. Trump buzzes the presidential radar __HTTP__ _E_ .@AlexSalmond Wind turbines are ripping your country apart and killing tourism.Electric bills in Scotland are skyrocketing stop the madness _E_ . @BarbaraJWalters made a great decision in firing @JoyVBehar from @theviewtv. The show will be better without her! _E_ RT @foxandfriends: .@DonaldJTrumpJr: Trump has had a lot more responsibility to deal with than any of the other GOP candidates __HTTP__ _E_ Thank you Nicole! __HTTP__ _E_ Thank you Reno Nevada. NOTHING will stop us in our quest to MAKE AMERICA SAFE AND GREAT AGAIN! #AmericaFirst... __HTTP__ _E_ #TrumpAdvice __HTTP__ _E_ Wow they are really killing Jay Leno let him go out with dignity! _E_ Will be doing @foxandfriends this morning at 7:00. ENJOY! _E_ The newly built Blue Monster at Trump National Doral is being considered a masterpiece by almost all who see it and play it THANK YOU! _E_ Crooked Hillary is flooding the airwaves with false and misleading ads all paid for by her bosses on Wall Street. Media is protecting her! _E_ The organized group of people many of them thugs who shut down our First Amendment rights in Chicago have totally energized America! _E_ Five Star @TrumpCondosLV are the most luxurious & elite residences in the Vegas market __HTTP__ "If you love it own it" _E_ Just returned from New Hampshire where the crowd was great and got a beautiful standing ovation! Wonderful people who truly love the U.S.A. _E_ Flags to be flown at Half Staff at all Trump Properties in Honor of the Five Fallen Soldiers __HTTP__ _E_ ... and in my opinion should not be doing The Apprentice. _E_ RT @DarrenJJordan: CONSTRUCTIVE WINS! 
💪 @realDonaldTrump @CLewandowski_ @DanScavino @MichaelCohen212 @KatrinaPierson @DefendingtheUSA __HTTP__ _E_ Jeb Bush who did poorly last night in the debate and whose chances of winning are zero just got Graham endorsement. Graham quit at O. _E_ Monitoring the terrible situation in Florida. Just spoke to Governor Scott. Thoughts and prayers for all. Stay safe! _E_ A good friend: @SarahPalinUSA. More importantly she is a tremendous voice for policies that would put America on (cont) __HTTP__ _E_ People Magazine: Donald Trump Was Right: He Gave SNL Its Best Ratings in Nearly 4 Years Plus What You Didn't See __HTTP__ _E_ Living in denial only 15% of Democrats think that recent economic news is poor __HTTP__ _E_ From Fox and Friends interview: Trump: We should not go back to Iraq __HTTP__ _E_ .@RepChrisCollins Chris thank you so much for your wonderful endorsement. I will not let you down! @CNN _E_ Republican Senators are working very hard to get Tax Cuts and Tax Reform approved. Hopefully it will not be long and they do not want to disappoint the American public! _E_ The Debate @BarackObama's mic and my Endorsement in today's #trumpvlog __HTTP__ _E_ No more massive injections. Tiny children are not horses—one vaccine at a time over time. _E_ The most elite private club in the world Mar a Lago is Palm Beach's legendary landmark. __HTTP__ _E_ "The Conservative does not despise government. He despises tyranny. @marklevinshow _E_ .@pennjillette doesn't like @StephenBaldwin7's cliché line and Stephen says Penn creeps him out. Do we sense conflict yet? #CelebApprentice _E_ Just sit back and watch ObamaCare is such a disaster it will fall like a house of broken cards. The website is the best part of this mess! _E_ "No person who is enthusiastic about his work has anything to fear from life." – Samuel Goldwyn _E_ America's top Army general has warned of a crisis unless sexual abuse in the military is quickly brought undet control.Forces greatly hurt! 
_E_ I will be on @FoxNewsSunday with Chris Wallace this morning. Enjoy! _E_ Beautiful thank you. __HTTP__ _E_ I'm looking forward to seeing you all this afternoon at Macy's Herald Square. 5:30 pm at the Crystal department on 8. _E_ Order signed copy of CRIPPLED AMERICA & submit a question for my live streaming book signing on 12/3 at 7:30 pm. __HTTP__ _E_ Failed presidential candidate Lindsey Graham should respect me. I destroyed his run brought him from 7% to 0% when he got out. Now nasty! _E_ The ones who are crazy enough to think that they can change the world are the ones who do. Steve Jobs _E_ He's hired! Listen to my #Apprentice Andy launchhis radio show @AmericaNowRadio with me tomorrow 6PM ET __HTTP__ _E_ Via @NRO:"Trump @KarlRove 'Most Overrated Man in Politics'Responsible for Ashley Judd's Rise" __HTTP__ @elianayjohnson _E_ Just leaving Knoxville TN what a crowd what amazing people! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_ Reckless! Why is @BarackObama wasting over $70 Billion on 'climate change activities?' Will he ever learn? __HTTP__ _E_ Celebrity Apprentice on tonight CNBC at 9 _E_ Whenever one of the morons say I wear a wig stop reading because they have no credibility & just hate. _E_ Four brave Americans died in Benghazi. Administration is still covering up the truth. We deserve to know the full truth. _E_ The NFL should have its non profit status immediately revoked while at the same time ending the giant tax scam which makes teams so valuable _E_ The people who support Hillary sit behind CNN anchor chairs or headline fundraisers those disconnected from real life. _E_ "Winners see problems as just another way to prove themselves." – Think Like a Champion _E_ Things are going really well for our economy a subject the Fake News spends as little time as possible discussing! Stock Market hit another RECORD HIGH unemployment is now at a 17 year low and companies are coming back into the USA. Really good news and much more to come! 
_E_ Guess who is talking to @MissUniverse at @TrumpTowerNY? Not terrible hair! __HTTP__ _E_ Much of the money I have raised for our veterans has already been distributed with the rest to go shortly to various other veteran groups. _E_ caught he cried like a baby and begged for forgiveness...and now he is judge & jury. He should be the one who is investigated for his acts. _E_ Here's my message to @BarackObama: America is a capitalistic country. Get over it and get on with it! #TimeToGetTough _E_ Mr. President take your campaign of division and anger and hate back to Chicago. @MittRomney _E_ The 2013 MISS UNIVERSE® Pageantwill take place in Russia for the very first time in the 62 year history of the contest. _E_ A day after @BarackObama released a trillion dollar budget deficit he is hosting China's future leader VP XiJinping. America's new reality. _E_ I developed the Wollman Rink under budget and in record time __HTTP__ If I hadn't gotten involved it would still be unused. _E_ My @foxandfriends interview discussing @BarackObama's reckless spending the Buffet Tax gimmick and #CelebApprentice __HTTP__ _E_ Great to meet everyone while having breakfast @ChezVachon this morning! #FITN #VoteTrumpNH __HTTP__ __HTTP__ _E_ .@RobertGBeckel Please thank your brother for his nice words on television. Seems like a great guy and character! @CNN _E_ RT @DonaldJTrumpJr: Thank you Elko County Nevada. So much amazing feedback from my forum today I really appreciate it #trump2016 #ICYMI ht... _E_ People have been asking to hear my Howard Stern interview—you can access it on @HowardTV. _E_ I am extremely pleased to see that @CNN has finally been exposed as #FakeNews and garbage journalism. It's about time! _E_ RT @EricTrump: Please stay safe #Florida! You are in our thoughts and we are praying for you! __HTTP__ _E_ Tune in to see me on @ThisWeekABC with @GStephanopoulos at 10am ET. Enjoy! _E_ Going to Charleston South Carolina in order to spend time with Boeing and talk jobs! 
Look forward to it. _E_ Obama's war on women. "Number of Unemployed Women Increased in July by 227000" __HTTP__ _E_ The ObamaCare website will cost over $1.5B when all is said and done. Crazy! _E_ Massive combined inoculations to small children is the cause for big increase in autism.... _E_ Don't forget to tune in tonight to see another unpredictable and exciting episode of The Apprentice 10 pm on NBC _E_ Getting ready to go on @KellyandMichael two great people! _E_ .@chelseahandler—stop trying to get your hotelier boyfriend back—a lost cause—he can do much better! _E_ Order signed copy of CRIPPLED AMERICA & have opportunity to submit question for my live streaming book signing 12/3 __HTTP__ _E_ Thank you @AnnCoulter for your nice words. The U.S. is becoming a dumping ground for the world. Pols don't get it. Make America Great Again! _E_ Tremendous pressure on President Obama to institute a travel ban on Ebola stricken West Africa. At some point this stubborn dope will fold! _E_ Low energy Jeb Bush just endorsed a man he truly hates Lyin' Ted Cruz. Honestly I can't blame Jeb in that I drove him into oblivion! _E_ Big storm in New Hampshire. Moved my event to Monday. Will be there next four days. _E_ Team Trump with the recipients of our donations in the Rockaways. #Sandy __HTTP__ _E_ Both Barack and @MittRomney were excellent at the Al Smith dinner last night! _E_ Congrats to Barack Obama on April's job report. Over 800000 left the work force w/average hourly wages & weekly hours staying flat. Bad! _E_ I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_ To every action there is always opposed an equal reaction. Isaac Newton _E_ Via @nypost by Editorial Board: "New York's mute @AGSchneiderman" __HTTP__ Schneiderman is feckless and corrupt. _E_ Just leaving Miami for Houston Oklahoma and Colorado. Miami crowd was fantastic! 
_E_ Obama's China 'climate' deal binds America with language of 'will' curb emissions now while China only 'intends' to curb in 2030. Bad deal! _E_ Aside from having no ratings sleazy Ed Schultz lied about what I said. Thank you Scott Whitlock @ScottJW __HTTP__ _E_ Will be in Louisiana for the Miss USA Pageant which will be on NBC on Sunday night. Watch Miss Pennsylvaniaan interesting and amazing story _E_ We're spending a fortune looking for the lost plane with mostly Chinese passengers and that's OK but how much are Russia & China spending? _E_ Remain open to new ideas. That's where innovation comes from. _E_ Goofy Elizabeth Warren lied when she says I want to abolish the Federal Minimum Wage. See media—asking for increase! _E_ The documentary of me that @CNN just aired is a total waste of time. I don't even know many of the people who spoke about me. A joke! _E_ Ask yourself: What can I learn today that I didn't know before? Always be a student always be open to new ideas. _E_ People buy deals & immediately put them into bankruptcy in order to make better deals.. _E_ To every PATRIOT who will serve on the #USSGeraldRFord:Keep the watchProtect herDefend herLOVE HERGood Luck & Godspeed! __HTTP__ _E_ Via @StarsEntLive by Nick Ricko: "@kevinjonas @IanZiering In Celebrity @ApprenticeNBC First Look" __HTTP__ _E_ Newly released NH poll has @MittRomney with a 1 point lead. Mitt will pull away next week. _E_ Re Florida Power & Light—Most important is safety but they have to also cater to aesthetics & not ruin the beauty of Florida. _E_ I will be re tweeting some of your better most imaginative and hopefully insightful tweets. Make them good (great)! Important stuff. _E_ Tomorrow we'll be going to Panama for the opening of our new hotel. It's a fantastic building in a fantastic location. __HTTP__ _E_ Iran hides behind its assertion of technical compliance w/the nuclear deal while it brazenly violates the other limits.. Amb. 
@NikkiHaley __HTTP__ _E_ "Failed show @DannyZuker" I have never heard of you and was told you are a loser after reading your credits I have no questions about it! _E_ I know a great deal about websites etc. but I am unable to understand how our government spent $635 million on the ObamaCare site & disaster _E_ Another nasty season premieres Sunday March 3rd at 9/8c on NBC! __HTTP__ _E_ New Virginia poll thank you! We are going to show the whole world that America is back – BIGGER and BETTER and S... __HTTP__ _E_ A study says @Autism is out of control a 78% increase in 10 years. Stop giving monstrous combined vaccinations (cont) __HTTP__ _E_ Great job @EricTrump! Proud of you! #AmericaFirst #RNCinCLE __HTTP__ __HTTP__ _E_ "@TurnberryBuzz the jewel in Donald Trump golfing crown" __HTTP__ via @TheScotsman by @DempsterMartin _E_ Wow tremendous victory in the Trump University case against lightweight @AGSchneiderman just got the news! _E_ Casting sometimes is fate and destiny more than skill and talent from a director's point of view. Steven Spielberg _E_ So China is ordering us to raise the Debt Limit...How low have we as a nation sunk? _E_ We must bring the truth directly to hard working Americans who want to take our country back. #BigLeagueTruth... __HTTP__ _E_ As I have been saying Crooked Hillary will approve the job killing TPP after the election despite her statements to the contrary: top adv. _E_ Omarosa is very confident that the execs loved her concept & presentation. _E_ Looking forward to returning to the Hawkeye state this Saturday to support my friend and strong Conservative @SteveKingIA! _E_ Heading to U.S. Bank Arena in Cincinnati Ohio for a 7pm rally. Join me! Tickets: __HTTP__ _E_ Via @WPOffshore: "Donald Trump's Blackdog victory" __HTTP__ _E_ Today it was an honor to have @UNSecretary General @AntonioGuterres at the @WhiteHouse. Speaking for the U.S.A. we appreciate all you do! 
__HTTP__ _E_ Entrepreneurs: Review your work habits and make sure they are taking you in the right direction. Don't become complacent! _E_ The United States will be immediately implementing much tougher Extreme Vetting Procedures. The safety of our citizens comes first! _E_ Thank you! #MakeAmericaGreatAgain __HTTP__ _E_ It was my great honor to defend @dennisrodman on @ApprenticeNBC last night—he has come a long way and for the good! _E_ Via @RedState by @EWErickson: "Always Play On Offense" __HTTP__ _E_ My speech to @PressClubDC on Tuesday at the #NPCLunch on the topic of building a business brand via @cspan __HTTP__ _E_ In 2008 @BarackObama warned that electricity rates will necessarily skyrocket during his term. Mission Accomplished! __HTTP__ _E_ No surprise. Woman being cited by Kerry & McCain on Syrian rebels is a paid consultant of the rebels __HTTP__ _E_ .@MittRomney needs to make @BarackObama regret that he ever asked for his tax records. _E_ Little @MacMiller—I don't need your praise __HTTP__ just pay me the money you owe. _E_ Great meeting @GarySinise at @AmSpec dinner. Besides his great acting Gary does tremendous work for vets through his foundation. _E_ Join me Tuesday Nov. 3rd at 12 PM in #TrumpTower in NYC. I'll be signing copies of my book CRIPPLED AMERICA. Don't miss it! _E_ I'll be in Iowa tonight making a speech to a record setting crowd. The word is getting out MAKE AMERICA GREAT AGAIN! _E_ Tune in tonight to Greta van Susteren's show On the Record which airs on Fox News at 9 p.m. _E_ The networks are all driving me crazy to do television shows—"a ratings machine"—but because of Apprentice have been loyal to NBC. _E_ RT @realDonaldTrump: I as President want people coming into our Country who are going to help us become strong and great again people co... _E_ .@PamelaGeller is a total whack job who doesn't have a clue. Don't provoke the enemy go get them and make them pay. No signals just do it! 
_E_ Ivanka Trump defends her dad __HTTP__ via @politico _E_ He ruins the brand: @bobbeckel doesn't belong on @FoxNews. As CM for Mondale in '84 you lost 49 states. Sad! _E_ FM @AlexSalmond of Scotland spent more than $750000 of taxpayers $ to visit Ryder Cup in Chicago peanuts compared to his windmill folly. _E_ Phyllis Schlafly: Trump is 'last hope for America' __HTTP__ __HTTP__ _E_ The #MissUniverse women totally blow away the Victoria's Secret women! _E_ No @DannyZuker it's making you crazy because you don't have the guts to play the game. Come on Danny you can do it! _E_ Even Bill is tired of the lies SAD! __HTTP__ _E_ An honor to host President Mahmoud Abbas at the WH today. Hopefully something terrific could come out it between th... __HTTP__ _E_ Based on very popular demand I will be live tweeting tomorrow night during the Presidential debate. _E_ Miss Florida was great in her denial of Miss Pennsylvania's phoney statements. She blows Miss Pennsylvania away a different league . _E_ Congratulations to @tedcruz on his Texas primary victory last night. He will be an outstanding Senator. _E_ Just remember the birther movement was started by Hillary Clinton in 2008. She was all in! _E_ Shooting deaths of police officers up 78% this year. We must restore law and order and protect our great law enforcement officers! _E_ We pray for our fallen heroes who died while serving our country in the @USNavy aboard the #USSJohnSMcCain and their families. __HTTP__ _E_ It's not whether you get knocked down it's whether you get up. Vince Lombardi _E_ The premiere of Donald J. Trump's Fabulous World of Golf is tomorrow night at 9 p.m.ET on Golf Channel. Tune in for a great adventure! _E_ ObamaCare website fiasco was a SINGLE bid to a Canadian company terrible! _E_ Look what is happening to our country under the WEAK leadership of Obama and people like Crooked Hillary Clinton. We are a divided nation! 
_E_ All he does is go on television is talk talk talk but incapable of doing anything. _E_ If Justice Roberts had made the correct decision on ObamaCare our country would not be in turmoil right now! _E_ I will be in South Carolina all week. Saturday is BIG BIG BIG! Get out and vote MAKE AMERICA GREAT AGAIN _E_ Either Miss Pennsylvania will pay her father will pay or her lawyers will pay. She hurt many people! _E_ Via @BreitbartNews by @IanHanchett: "Trump: Obama 'Treats Our Known Enemies Much Better' Than Israel" __HTTP__ _E_ Via @Mediaite by forza_desiderio: "Donald Trump Blasts Obama on Ebola: Why Are You Sending Troops?" __HTTP__ _E_ Fun fact for my 2M+ followers the 'Architect' Karl Rove blew $400M in the 2012 election with a success rate of 1.6%. _E_ Thank you for a great day yesterday Rhode Island! #VoteTrump __HTTP__ _E_ Will be interviewed on @seanhannity tonight at 10pmE. Enjoy! #INPrimary _E_ The residential real estate market continues to provide opportunities for first time home owners. Buy now if you can! _E_ #CrookedHillary is not fit to be our next president! #TrumpPence16 __HTTP__ _E_ Just bought the Kluge Estate in Charlottesville Virginia (don't worry only business). See Washington Post article __HTTP__ _E_ Congratulations to Michael Jordan on his marriage over the weekend. _E_ RT @EricTrump: Debate ready!!! @realDonaldTrump #MakeAmericaGreatAgain #TrumpTrain __HTTP__ _E_ A big contingent of very enthusiastic Roy Moore fans at the rally last night. We can't have a Pelosi/Schumer Liberal Democrat Jones in that important Alabama Senate seat. Need your vote to Make America Great Again! Jones will always vote against what we must do for our Country. _E_ Watch Obama refuse to call Benghazi a terrorist attack on 9.12 __HTTP__ What took @CBS so long to release this footage? _E_ Life brings you many surprises. As a child I used to vacation with my family at the Doral in Miami. Now I own it. 
__HTTP__ _E_ Trump: Weiner a 'Sick Puppy' That NYC Doesn't Need __HTTP__ via @Newsmax_Media _E_ Thank you! __HTTP__ _E_ The Cruz Kasich pact is under great strain. This joke of a deal is falling apart not being honored and almost dead. Very dumb! _E_ The CBO has confirmed that @BarackObama's stimulus crowds out private investment while not creating any jobs. __HTTP__ _E_ My @FoxNews int. with @seanhannity on Obama being all talk & no action & making America Great Again! __HTTP__ _E_ We need a dealmaker in the White House who knows how to think innovatively and make smart (cont) __HTTP__ _E_ Crooked Hillary Clinton is the worst (and biggest) loser of all time. She just can't stop which is so good for the Republican Party. Hillary get on with your life and give it another try in three years! _E_ DAMAC Properties @DamacOfficial @realDonaldTrump Looking forward to welcoming you to Dubai! Have a great trip! Thank you! _E_ RT @FoxNews: Poll: @realDonaldTrump vs. @HillaryClinton among white Evangelicals. __HTTP__ _E_ In today's #trumpvlog I speak about the chopper recently made for me by @occhoppers.... __HTTP__ #CelebApprentice _E_ Watch @Seanhannity tonight on his show Hannity Fox News at 9 pm. I'll be on and we'll cover the Wall Stree... (cont) __HTTP__ _E_ Join us in Iowa tomorrow! #IACaucus #Trump2016 #MakeAmericaGreatAgain 3:00pm: __HTTP__ 7:30pm: __HTTP__ _E_ I will be live tweeting during the debate tonight. _E_ I have an idea for @JebBush whose campaign is a disaster. Try using your last name & don't be ashamed of it! _E_ How long did it take for Obama to call Hugo Chavez and congratulate him on his 'reelection?' Who do you think Chavez supports in ours? _E_ If you have a speech one that would put Winston Churchill to shame liberals would find a way to make it sound terrible! _E_ Congratulations @Trump_Ireland for being named #12 resort in Europe by the @CNTraveler #ReadersChoice2014 awards! _E_ I will be on @FoxNews live with members of my family at 11:50 P.M. 
We will ring in the New Year together! MAKE AMERICA GREAT AGAIN! _E_ Happy to hear that @ralphreed's Faith and Freedom chapters are at the @RNC convention supporting @MittRomney. We must be united to win! _E_ I am reading that the great border WALL will cost more than the government originally thought but I have not gotten involved in the..... _E_ The top Leadership and Investigators of the FBI and the Justice Department have politicized the sacred investigative process in favor of Democrats and against Republicans something which would have been unthinkable just a short time ago. Rank & File are great people! _E_ .@FLOTUS Melania and I were honored to stop by the Women's Empowerment Panel this afternoon at the @WhiteHouse.... __HTTP__ _E_ I'm looking forward to the Super Bowl but looking even more forward to Monday night at 8:00 best episode EVER of Celebrity Apprentice! _E_ A top Clinton Foundation official said he could name "500 different examples" of conflicts of interest. __HTTP__ _E_ After @TrumpTurnberry I will be visiting Aberdeen the oil capital of Europe to see my great club @TrumpScotland. _E_ All eyes are on @TigerWoods @The_Masters. He's in good position! _E_ Ellen is sadly having a hard time with her lines. #Oscars _E_ Thanks to ObamaCare's device tax Boston Scientific plans to cut 1500 jobs __HTTP__ ObamaCare will kill ingenuity. _E_ Jeb Bush has a photoshopped photo for an ad which gives him a black left hand and much different looking body. Jeb just can't get it right! _E_ New CBS poll. #Trump2016 __HTTP__ _E_ This has been a very difficult decision regarding the Presidential run and I want to thank all my twitter fans for your fantastic support. _E_ Tom Brokaw keeps calling Mitt Romney George (Mitt's father). Sadly time is up for Tom. _E_ .@AnnCoulter U were great last nite @ericbolling on FOX. Our country has become a dumping ground for the world I'll get it to stop & fast! 
_E_ The mind that opens to a new idea never comes back to its original size. Einstein _E_ A great honor to sign the Veterans Appeals Improvement & Modernization Act into law w/ @AmericanLegion @SecShulkin. __HTTP__ __HTTP__ _E_ The real estate market in Vietnam is booming. Growth is everywhere in the world except for the US. _E_ RT @FoxNews: .@davidwebbshow: Let's look at the calendar. It's January 20th. DACA expires on March 5th. That means this was a construct of... _E_ Thank you Ohio. Together we will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_ Trump's National Lead Increases to 35.6% Going into the Third GOP Debate it's Trump Carson and Rubio __HTTP__ ... _E_ I predicted the 9/11 attack on America in my book The America We Deserve and the collapse of Iraq in @TimeToGetTough. _E_ Thanks @piersmorgan! Trump is the most unpredictable extraordinary entertaining&massively popular candidate this country has ever seen. _E_ #MayThe4thBeWithYou here is when Darth Vader and I did some firing __HTTP__ _E_ I would invite Edward Snowden to be a judge at the Miss Universe Pageant in Moscow but would be concerned that he would sell results early! _E_ The coolest story is that John Beale the man who headed up CLIMATE CHANGE for the government is a proven con man and total phoney.ARRESTED _E_ Great move on delay (by V. Putin) I always knew he was very smart! _E_ Via The Hindu @businessline: Realty brand Donald Trump's India venture to sport desi tag __HTTP__ _E_ Jeffrey Lord former Reagan adviser has endorsed the Newsmax @iontv debate with a great article __HTTP__ _E_ We will immediately repeal and replace ObamaCare and nobody can do that like me. We will save $'s and have much better healthcare! _E_ Camp David is a very special place. An honor to have spent the weekend there. Military runs it so well and are so proud of what they do! _E_ So proud of NASCAR and its supporters and fans. They won't put up with disrespecting our Country or our Flag they said it loud and clear! 
_E_ Dopey Sugar @Lord_Sugar You should thank me for having created the platform on which you became known The Apprentice. Say Thank you Donald _E_ China is heavily investing in building its own jet engine __HTTP__ They will end up stealing the design from us as usual. _E_ 'How Trump won over a bar full of undecideds and Democrats' __HTTP__ _E_ If @Barack Obama is really concerned about carbon emissions and air pollution then maybe he should have (cont) __HTTP__ _E_ I am starting to think that there is something seriously wrong with President Obama's mental health. Why won't he stop the flights. Psycho! _E_ Just out: Boston Herald/Franklin Pierce Poll N.H. TRUMP 28 (up 10) CARSON 16 BUSH 9 RUBIO 6 CRUZ 5 Press will say they are surging! _E_ Crooked Hillary has once again been proven to be a person who is dishonest incompetent and of very bad judgement. _E_ ...and did not want to rock the boat. He didn't choke he colluded or obstructed and it did the Dems and Crooked Hillary no good. _E_ Americans nationwide have their premiums double and work hours decreased. @GOP must do the right thing stand strong & defund! _E_ It's Thursday. Which brand of eyeliner is the nation's worst AG @AGSchneiderman wearing today? _E_ .@Deadspin's disgusting response will teach me & others not to be nice anymore—a sad lesson. _E_ Feeling sorry for yourself is not only a waste of energy but the worst habit you could possibly have. Dale Carnegie _E_ Via @scotsmandotcom: "Donald Trump hires top lawyer for wind farm battle" __HTTP__ _E_ Sweat equity is the most valuable equity there is. Know your business and industry better than anyone else in the world. @mcuban _E_ If Chicago doesn't fix the horrible carnage going on 228 shootings in 2017 with 42 killings (up 24% from 2016) I will send in the Feds! _E_ "Being true to yourself and your work is an asset. Remember that assets are worth protecting." 
– Think Like a Champion _E_ Anyone who wants strong borders and good trade deals for the US should boycott @Univision. _E_ Entrepreneurs: Set the example and you'll be a magnet for the right people. That's the best way to work with people you like. _E_ Thank you NH! We will end illegal immigration stop the drugs deport all criminal aliens&save American lives! Watc... __HTTP__ _E_ Do you think Iran would have acted so tough if they were Russian sailors? Our country was humiliated. _E_ It is so pathetic that the Dems have still not approved my full Cabinet. _E_ New Sugar deal negotiated with Mexico is a very good one for both Mexico and the U.S. Had no deal for many years which hurt U.S. badly. _E_ Goofy Elizabeth Warren is weak and ineffective. Does nothing. All talk no action maybe her Native American name? _E_ .@ABFAlecBaldwin They were rising in the 1950's then went back down they will go up and down through eternity. _E_ .@JebBush is a low energy stiff who should focus his special interest money on the many people ahead of him in the polls. Has no chance! _E_ A record 1.2 million Americans have left the job force during @BarackObama's recovery __HTTP__ Don't trust the job numbers. _E_ Pretty even debate no knockouts. However Ryan's closing statement somewhat stronger. What do you think? #VPDebate _E_ "Real estate is at the core of almost every business and it's certainly at the core of most people's wealth." – Think Like a Billionaire _E_ Back by popular demand the fabulous @LilJon returns to the record setting 13th season of All Star @CelebApprentice. The fans love him! _E_ Re @TWC TimeWarner I am going to be switching many of my buildings to another service—this is ridiculous! _E_ "Success breeds success. The best way to impress people is through results." – Think Like a Billionaire _E_ #trumpvlog The Republicans must defeat @BarackObama not themselves..... __HTTP__ _E_ If you want more you have to require more from yourself. Dr. 
Phil McGraw _E_ 20000📈21000📈22000📈23000📈this year...FOUR one thousand milestones this year... #Dow23K #MAGA __HTTP__ _E_ So much for Hope and Change. @BarackObama has already spent over $100M on attack ads across the swing states __HTTP__ _E_ Glad to hear that @JimTalent has put some strong anti China referendums in the @GOP convention platform. _E_ Sexual pervert & deviant Anthony Weiner is polling to see if he can run for NYC Mayor... _E_ "To state the obvious if any business operated the way the government does it would go under." #TimeToGetTough _E_ #MakeAmericaGreatAgain #Trump2016LIFE CHANGING EXPERIENCEVideo: __HTTP__ __HTTP__ _E_ On November 9th @MissUniverse comes to Moscow! Hosted by the wonderful duo of @OfficialMelB & @ThomasARoberts in Crocus City Hall! _E_ I guess Obama's Cairo Speech really worked out. The Muslim Brotherhood stormed our embassy on 9.11. Imagine if Obama speaks in Beijing? _E_ .@daveweigel of the Washington Post just admitted that his picture was a FAKE (fraud?) showing an almost empty arena last night for my speech in Pensacola when in fact he knew the arena was packed (as shown also on T.V.). FAKE NEWS he should be fired. _E_ REMEMBER the terrible 5 for 1 trade whereby the Taliban got back leaders (killers) and we got back a NOTHING WILL COME BACK TO HAUNT U.S.! _E_ I have always liked Ellen done her show numerous times but she was not good last night fumbling and stumbling! _E_ In new Quinnipiac Poll 66% of people feel the economy is "Excellent or Good." That is the highest number ever recorded by this poll. _E_ China has copied our military's F 22 Raptor design __HTTP__ We should offset their theft from our debt. _E_ Officials behind the now discredited Dossier plead the Fifth. Justice Department and/or FBI should immediately release who paid for it. _E_ Great everyone is saying I did much better on @60Minutes last week than President Obama did tonight. I agree! 
_E_ Median household income is down for the middle class since Obama took office. It will only go further down under Clinton. _E_ Sometimes your best investments are the ones you don't make. The Art of The Deal _E_ Congratulations to Bob Kraft and Coach Bill Belichick for having built an amazing team. @Patriots _E_ "You cannot escape the responsibility of tomorrow by evading it today." – Pres. Abraham Lincoln _E_ Boring & failing @NYMag's 3rd rate political reporter @jheil had flunky @DanAmira write a totally false report about me today...... _E_ My message MAKE AMERICA GREAT AGAIN is beginning to take hold. Bring back our jobs strengthen our military and borders help our VETS! _E_ What do you think Obama will do when Putin seizes Alaska? _E_ Big day in Washington D.C. even though White House & Oval Office are being renovated. Great trade deals coming for American workers! _E_ Via @WTOC11: Donald Trump headlines Tea Party Convention in Myrtle Beach __HTTP__ Looking forward to visiting SC on Monday! _E_ Study your area of business. All business involves risk but risk can be reduced when you learn everything you can about what you're doing. _E_ My @foxandfriends interview re: firing @bretmichaels on the premiere of All Star @ApprenticeNBC & politics __HTTP__ _E_ RT @IvankaTrump: .@realDonaldTrump stock market rally is close to becoming the greatest in 85 years __HTTP__ _E_ Trump at CPAC: 'We Have to Get the Momentum Back' __HTTP__ via @WSJ's @WSJVideo _E_ I bet the dumbest political commentator on television @Lawrence will soon be thrown off the air for poor (cont) __HTTP__ _E_ Tonight's episode of The Apprentice is one of the best ever we're down to the final 3 and it's high excitement all the way. 10 pm on NBC. _E_ "Hook your career to a big trend. There are huge opportunities for profits if you can create big solutions." – Think Big _E_ I call my own shots largely based on an accumulation of data and everyone knows it. 
Some FAKE NEWS media in order to marginalize lies! _E_ I have recieved and taken calls from many foreign leaders despite what the failing @nytimes said. Russia U.K. China Saudi Arabia Japan _E_ China's stock market rose yesterday after 4 consecutive days of losses __HTTP__ Their market gains the day we are hit by storm _E_ It's Tuesday how much inflation has @BarackObama's spending caused today on the price of food and gas? _E_ Al Qaeda terrorist Al Libi was immediately read his rights & is now being treated for 'pre existing' medical (cont) __HTTP__ _E_ 19000 RESPECTING our National Anthem! #StandForOurAnthem __HTTP__ _E_ My twitter followers will soon be over 2 million & all the biggies. It's like having your own newspaper. _E_ Trump International in Dubai will be one of the great projects anywhere in the world. Congratulations to @damacofficial for their genius! _E_ China just landed a jet on an aircraft carrier stolen from a U.S. design. __HTTP__ We should offset the thievery from our debt.. _E_ Looking at Air Force One @ MIA. Why is he campaigning instead of creating jobs & fixing Obamacare? Get back to work for the American people! _E_ Some people dream of great accomplishments while others stay awake and do them! _E_ Entrepreneurs: Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_ It is outrageous and disgusting that families of U.S. MILITARY personnel killed in action will not be given money for burials. SAD! _E_ Little respected Club For Growth asked me for $1000000 I said NO . Now they are spending lobbyist and special interest money on ads! _E_ Don't go around saying the world owes you a living. The world owes you nothing. It was here first. Mark Twain _E_ Getting ready to celebrate the 4th of July with a big crowd at the White House. Happy 4th to everyone. Our country will grow and prosper! 
_E_ Just watching NBC News where our potential attack is being detailed the exact ships the stealth bombers the destinations so ridiculous! _E_ Strive for wholeness and keep your sense of wonder intact. Donald J. Trump __HTTP__ _E_ Trump Int'l Golf Links & Hotel Ireland is on 400 beautiful acres & fronts the Atlantic Ocean for 2.5 miles. Spectac! __HTTP__ _E_ Today I was pleased to announce the official approval of the presidential permit for the #KeystonePipeline. A grea... __HTTP__ _E_ I am watching two clown announcers on @FoxNews as they try to build up failed presidential candidate #LittleMarco. Fox News is in the bag! _E_ Yesterday I signed the #INTERDICTAct (H.R. 2142) with bipartisan members of Congress to help end the flow of drugs into our country. Together we are committed to doing everything we can to combat the deadly scourge of drug addiction and overdose in the United States! __HTTP__ _E_ I support K 9's for Warriors a wonderful organization that trains service dogs for veterans. Please contact __HTTP__ _E_ For Entrepreneurs: A good question to ask yourself –"What can I provide that does not yet exist?" _E_ .@jessebwatters is terrific at hosting on @FoxNews he really gets it! _E_ Mr. Khan who does not know me viciously attacked me from the stage of the DNC and is now all over T.V. doing the same Nice! _E_ Ms. Goldberg & her blowhard lawyer should be ashamed for having brought this frivolous case. They should pay me damages! _E_ English taxpayers should stop subsidizing the destruction of Scotland by paying massive subsidies for ugly wind turbines. _E_ Spoke to a capacity crowd at Horry County Republican event earlier today. __HTTP__ _E_ Donald Trump Reviews Oscars: Django 'Racist' Ceremony 'Boring' Set 'Tacky'... __HTTP__ via @eonline _E_ Democrats are not interested in Border Safety & Security or in the funding and rebuilding of our Military. They are only interested in Obstruction! _E_ Lord grant that I may always desire more than I can accomplish. 
Michelangelo _E_ Via @AmericanThinker by Malcolm Unwell: "Taking Trump Seriously" __HTTP__ _E_ How far has the United States gone down when we are reduced to accept the imbecilic deal just agreed to with Iran. Read THE ART OF THE DEAL! _E_ Huff Post His early morning speech drew a large crowd far larger than remarks at the same time on Thursday and packed by end! The facts. _E_ I can't believe my friend Derek Jeter is out for whole season injured day he left Trump World Tower. Lucky bldg. Move back fast! _E_ Make sure to watch Celebrity Apprentice tonight at 9 on NBC. A GREAT SHOW JUST LIKE THE MASTERS. 9 _E_ So how did I do on Face The Nation? _E_ So sad that Obama rejected Keystone Pipeline. Thousands of jobs good for the environment no downside! _E_ Thank you! #Trump2016 __HTTP__ _E_ Not one American flag on the massive stage at the Democratic National Convention until people started complaining then a small one. Pathetic _E_ Our great country has been divided for decades. Sometimes you need protest in order to heal & we will heal & be stronger than ever before! _E_ I'll always like @OMAROSA because she constantly defends me. #CelebApprentice _E_ ...and now Alex Salmond pushes ugly turbines! _E_ I will be doing the @TodayShow live from New Hampshire at 7am on Monday morning. #TrumpToday _E_ .@garyplayer As a true champion you must have enjoyed how difficult but fair The Blue Monster played last weekend. Gary Player Villa loved! _E_ "Inside Donald Trump's Scottish golf course" __HTTP__ via @TelegraphSport _E_ My @greta int. discussing $25000 gift to USMC Tahmooressi Obama's trip to China & the 2014 election results __HTTP__ _E_ .@newtgingrich just said a historic victory for Trump. NICE! _E_ Keep stimulating your mind with big ideas. Be a collector of big ideas. Constantly fill your mind with new information. Think Big _E_ Congratulations to @RobinRoberts on celebrating 100 days in her bone marrow transplant recovery. Robin is a special person. 
_E_ .@stuartpstevens horrible advise to Mitt Romney made victory an impossibility. Don't blame Mitt! Now Stevens can't get a job! _E_ Never in U.S.history has anyone lied or defrauded voters like Senator Richard Blumenthal. He told stories about his Vietnam battles and.... _E_ The Democratic Convention has paid ZERO respect to the great police and law enforcement professionals of our country. No recognition SAD! _E_ .@BarackObama should release all his records (like other Presidents).... _E_ The first General killed in a combat zone since Vietnam it is a travesty that Obama did not attend Major General Harold Greene's funeral _E_ Thank you Terre Haute Indiana!#MakeAmericaGreatAgain __HTTP__ _E_ Carl Icahn said this about me: I think at this moment in time he's the only candidate that speaks out about the country's problems. _E_ Individual commitment to a group effort is what makes a team work a company work a society work a civilization work. Vince Lombardi _E_ Must read via @FoxNews by @JaySekulow: "Mr. President: Will you bring home American pastor imprisoned in Iran?" __HTTP__ _E_ Via @Newsmax_Media by @ChrisRuddyNMX: Donald Trump and the End of Free Speech __HTTP__ _E_ .@melaniatrump will be on @theviewtv today at 11am ET discussing @apprenticenbc #celebapprentice & her skin care collection. Tune in! _E_ 4.2 million hard working Americans have already received a large Bonus and/or Pay Increase because of our recently Passed Tax Cut & Jobs Bill....and it will only get better! We are far ahead of schedule. _E_ Putting Pelosi/Schumer Liberal Puppet Jones into office in Alabama would hurt our great Republican Agenda of low on taxes tough on crime strong on military and borders...& so much more. Look at your 401 k's since Election. Highest Stock Market EVER! Jobs are roaring back! _E_ In any event we are EXTREME VETTING people coming into the U.S. in order to help keep our country safe. The courts are slow and political! 
_E_ Why do people give @KarlRove contributions when they know he is a loser who has no idea how to win? __HTTP__ _E_ Join me tomorrow! #Trump2016 #MakeAmericaGreatAgain Omaha Nebraska: __HTTP__ Oregon: __HTTP__ _E_ Entrepreneurs: Follow your instincts and keep your focus intact. You alone know where you really want to go. _E_ "@marklevinshow: 'PLUNDER AND DECEIT'" __HTTP__ via @AmSpec by @JeffJlpa1 _E_ Lightweight A.G. Eric Schneiderman asked us for political contributions DURING his investigation of usthen sued for $40 million.Dopey guy! _E_ The @Yankees acquisition of Ichiro was a smart move. I look forward to watching him play. _E_ Everyone is asking if and when I will endorse a candidate in the NYC mayoral race. Doing my due diligence... _E_ The brass in #TRUMP Tower's atrium is polished twice a month like clockwork. I keep the atrium impeccable. Key to its success! _E_ Vision remains vision until you focus do the work and bring it down to earth where it will do some good. _E_ North Korea disrespected the wishes of China & its highly respected President when it launched though unsuccessfully a missile today. Bad! _E_ This election is a total sham and a travesty. We are not a democracy! _E_ Thoughts and prayers with the sailors of USS Fitzgerald and their families. Thank you to our Japanese allies for th... __HTTP__ _E_ I'm always amazed when I travel to my foreign properties.Seeing the Trump brand across 4 continents proves that excellence can be universal. _E_ NO WAY JUDGES SAY MAYWEATHER WON. INVESTIGATION SHOULD TAKE PLACE. FIX? _E_ Today it was an honor to celebrate the Collegiate National Champions of 2016/2017 at the @WhiteHouse! #NCAAChampions Photos: __HTTP__ __HTTP__ _E_ Thanks to Giovanni's Coal Fire Pizza of Florida for donating enough pizza to feed 750 Police Athletic League youngsters in NY this Friday. 
_E_ My interview yesterday with @IngrahamAngle __HTTP__ _E_ "Trump: 'Seriously Considering' a Presidential Bid" __HTTP__ via @NBCNews _E_ If the press can report stories from @MittRomney's dorm years then why can't it find @BarackObama's college and law school transcripts? _E_ With terrific Steve Wynn at dinner last night. __HTTP__ _E_ Via @Newsmax_Media: Trump: Americans 'Desperate for Leadership' __HTTP__ _E_ If everybody sued the Journal News for revealing their info (guns) paper would go out of business. _E_ If the Saudis are so concerned about Syria then they should go in themselves. Stop telling us to do their dirty work. _E_ "Every big thinker has had to start as a nobody. Just think big & that immediately distinguishes you from the majority." – Think Big _E_ Est. in 1906 @TrumpTurnberry is home to the iconic Ailsa @The_Open Championship course four times over __HTTP__ _E_ Why isn't AG Schneiderman going after Democrat Jon Corzine and the $1.4 billion that is "missing?" _E_ These Islamists chop Americans' heads off and want to destroy us. We should be applauding the CIA not persecuting them. _E_ Thank you Texas! If you haven't registered to VOTE today is your last day. Go to: __HTTP__ & get ou... __HTTP__ _E_ The girlfriend of Lubitz the wacko co pilot who took down the plane knew he was insane and should have reported him. Put her through hell _E_ Trump Golf Links at Ferry Point is a Jack Nicklaus Signature Design 18 hole course just minutes from Manhattan __HTTP__ _E_ If United Steelworkers 1999 was any good they would have kept those jobs in Indiana. Spend more time working less time talking. Reduce dues _E_ .@RNC report was written by the ruling class of consultants who blew the election. Short on ideas. Just giving excuses to donors. _E_ The people of Scotland are really starting to fight the ugly industrial wind turbines. See Press and Journal __HTTP__ _E_ GIVE AMERICA BACK ITS DREAM! Donald J. 
Trump _E_ "MSNBC'S TOURÉ HAS EPIC RACE BAITING MELTDOWN ON CNN" __HTTP__ It's Toure's modus operandi. He is so angry. _E_ See yourself as having a lot already and keep your integrity intact. It's the best path to comprehensive success. Think Like a Champion _E_ Love making correct predictions. National Review is over. __HTTP__ _E_ My nomination would increase voter turnout. #VoteTrump #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_ Congress was elected last November to reign Obama in not to give him 'fast track' authority for bad trade deals for the American worker! _E_ ================================================ FILE: assignments/word_transform/common.en.vocab ================================================ , . the of - in and ' ) ( to a is was on s for as by that it with from at he this be i an utc his not – are or talk which also has were but have # one rd new first page no you they had article t who ? all their there been made its people may after % other should two score her can would more if she about when time team american such th do discussion links only some up see united years into / school so world university during out state states national wikipedia year most city over used then d than county external m where will de what delete any these january march august july being film him many south september like between october three june well use war under them april we born december link while c later part november further players list please following my february known second u name group history series just e north work before since season both high st through district now ! 
comments because football music however diff century league edits debate title articles john same including could english album number against family user based area became york b life me british international game " above club your until early best west house company general left very here don living day several place party college result keep appropriate four subsequent even class government how called did each found center per style com long country back way does www modify end make public played p won another released added f support games former those films church east line major members good much image show still think below town last system right song non notable section single included align home women television — seed member goals sources book station order old information set own text band point local around river top main language french https named off us note career original age service established located re said website population air german law military } great ii within clubs published president park official $ r case > london times although small third different due get village closed g art player final l community held n again began army award without death built men large site + using deletion white along five central road children free took england include association down j given source x california man version written created media black though php report building la take division comment having king edit stadium died ship research record archive places undo cup records often few received side power education know category water political species field near & co australia video need go island form find served play project o according radio am works proposed every development example live union india next special court region h little short v william province western son france council others royal current street full red too department w san help among ve preserved james open force position head director father track http canada never 
australian id george jpg level late summer society moved office period championship round story songs various file days land business tv reason america million european term al six uk post why produced making subject young total david science related rock archived railway become led students started news described role election albums present indian kingdom books important northern love run canadian press rather k type act editor came schools program once issue social germany production male might awards points similar professional say background enough lead either common overlap data color better • person services museum battle went sports already currently hall buildings historic date deleted considered change location seems must yes our southern least lost something review together robert fact less japanese groups content involved isbn board japan control policy modern human half design event events available done washington real start personal action space areas doesn notability star really china possible paul working taken far going minister lake reported popular married founded europe author away independent process teams character low michael pages light big seen release want episode wrote republic thomas companies via russian thanks put race worked route recorded someone civil police charles listed users template instead eastern body question italian featured week editors texas chief close himself upon match q roman come opened tour sea actually cross playing health institute caps forces green rights evidence originally aircraft arts range probably consensus bar problem look issues alumni average network win shows wife returned night magazine centre joined usually middle completed elected significant african able google stage addition ireland today academy saint self itself continued stations mother appeared africa culture spanish grand committee things fire changed gold female course directed months whether chinese previous developed size mentioned add 
festival peter basketball across move performance standard means give training artist word blue primary announced value christian private catholic artists includes view thus almost baseball seven appears ever provide technology olympics future formed census nd images los results return quality construction zealand front cover model despite read material strong coach henry footballers mark rev organization studies federal richard html virginia car attack conference outside study brother names throughout writer characters musical nothing border medical countries past writing makes interest provided killed medal signed dr largest label fair search bay reference especially refer removed library eventually management references features navy guitar hill sure historical lower daughter appointed reading yet systems debut movement fc specific always actor natural clear coast let got chicago championships ll pennsylvania ten performed individual designed rule etc lists paris thought brown hand needs reliable smith generally base sometimes florida capital valley bank gave ground reached italy energy believe leader active online block bridge families changes y followed industry collection request soon leading olympic sold writers professor studio mexico competition campaign org theatre anything particular empire length islands singer create redirect additional soviet market words producer notes hockey novel code referee fourth sport van mary airport sound status irish placed child perhaps idea foreign municipality isn register eight problems native coverage channel parliament username edition minor says whose foundation units movie runs ice simply limited unit student previously stated governor complete test nominated bill parts vocals theory regional km account vote computer none carolina tournament poland behind wales winning lot hospital mid taking mountain higher cases angeles editing replaced food multiple likely terms sir thing square try topic woman officer categories 
greek recent sent copyright speed templates money saw senior selected introduced politician true required regular awarded commercial cities contains trade mr degree anti birth sun finished longer rugby earth access prior seasons journal beginning software famous religious appear martin el god bit hours running brought missing economic structure rural remained decision certain quite hit minutes spain plays whole joseph lord web decided operations function louis assembly queen security uses ohio owned jan operation call successful legal russia prince mean jewish staff establishments goal towards agree bad attendance populated nature allowed captain mount township calculated structures hard saying manager earlier elections meet box lines democratic success · associated singles traditional rest highway matter particularly wide month care admin cultural commission didn plan therefore practice command nomination jersey parties michigan entire en anyone seem overlaps approximately master noted usa stop cannot feature engine response needed illinois afd experience highest engineering silver separate takes secretary dutch lee recording prime le themselves rules uploaded trying youth scotland iii houses heart room stone shown deal drama scores dead key shot turn occupation scottish executive plant promoted whom villages languages internet leave feel covered merge mostly numerous ancient attempt property programs picture finally ships fiction looking secondary nations majority edward annual digital mission wp lived claim seat bbc profile dance prize doing georgia port pacific castle pass transport organizations ratio recently fall global era wing opinion commander fort effect opening fine purpose winter genus congress overall activities met income massachusetts comes older peak lack bass super complex academic stars accounts appearance asian asked friends kind et till financial entry asia sense meaning actress map → intended bishop boston rate literature forest voice jack pre 
justice britain champion double polish numbers columbia temple defeated administration ended claims z jones mm parish israel actors sister nine scored table attended pop cd else newspaper friend unknown winner chart initially loss sites starting architecture relations upper supported tracks contract face directly spent girl clearly junior francisco politics greater presented mar ed cause volume caused pagename tom flight candidate passed matches claimed except oil assistant surface victory regiment stories represented gets speedy weeks listing allow jr branch retired communities train paper adding provides remains victoria metal wrong larger direct frank miles blocked launched mass chairman comedy relationship knowledge format creek meeting failed officers draft goes fight figure faculty camp ran variety owner statistics raised heavy alexander alone understand episodes gives educational daily williams latin completely products dark attention religion referred von mind oppose corps administrative cut scott becoming footballer jean mayor pro beach descent nearly latter leaving highly cast territory write towns forms joe inside wanted solid individuals authority mention projects del continue cost vice drive notice johnson forced basis looks reasons job photo hope log parents entered mike basic scientific amount spring oxford kong opera tried critical simple founder hong told husband useful technical necessary believed operated mountains importance musicians hotel girls crew feb boy ontario nation defense wiki champions golden districts faith racing mainly auto unless lives swedish hot entertainment turned net soccer creation product tower increased votes squadron px contemporary subsequently regarding focus marriage questions naval details forward memorial peace ip kept iran korea analysis winners poor grade cricket judge electric bc exist corporation hold featuring campus brazil chris beyond fifth increase summary remaining statement broadcast getting piano des 
novels serving hour moving resolution concept alternative brothers attacks encyclopedia republican representatives politicians difficult ability studied host wall immediately urban pakistan becomes marine physical dec troops interview coming semi suggest emperor letter couple fellow duke tell gallery follow windows tree hits jazz protection relevant count situation reviews containing classical offered lady netherlands reports influence address linear consider machine domain elements minnesota types nov serve sydney ministry blood distance bottom giving boys potential toronto edited infantry jun formerly oct conflict workers steve philadelphia helped der nationality dispute scene method titles berlin conditions arms races maybe discovered iron extended churches otherwise positive santa nom imperial composed ball width quickly correct responsible possibly indiana soldiers examples korean timezone genre fish senate effects gun check appearances plans renamed sign reporting sweden consists heritage tag primarily doctor leaders lies inc rivers du crime liberal stand bob existing publishing industrial answer split apr sex mixed acting personnel rail die premier approach wisconsin sentence root ago standards comics earned miss specifically horse actual contributions carried lieutenant wood plants initial origin environment pretty rank bus gas direction guide resources affairs accepted animals nor activity levels laws jim creating cambridge ones composer remove agency reserve atlantic supreme weight pp ask fighting jackson widely rose operating treatment linked andrew trial expanded daniel certainly info vs sciences fame everything avenue travel scale break oregon housing produce capacity smaller fictional exchange actions cited typically settlement agreement translation males kansas managed bring charge fails dedicated estate nearby residents piece growth trust applied drums issued murder normal twenty commonly avoid tony norwegian criteria context suggested revolution 
fully wars prominent aug leaves advanced distribution medicine garden reach turkey females publications impact households survey height morning honor deep argument publication arthur elizabeth disambiguation worth colorado median maryland falls zone solo learning pay resolves choice vol flag engineer cars farm wilson principal acquired constructed secret poet build remain orchestra versions follows fixed fm efforts documentary equipment ray yellow guard pressure grant prison freedom norway twice sportspeople store taylor quarter designated independence platform rome ad teacher copy effort nuclear pictures models sep everyone easily thank di description agreed £ institutions covers facilities target stack rationale stat combined bronze sort hosted programming sri railroad unique defined ocean cell missouri concert improve biography loan shortly contact holy tennessee sub safety competed stephen policies painting price entirely mexican leadership flying message municipal serious headquarters officially cemetery memory × fields generation join copies finals fox continues representative destroyed feet guy philippines revealed organized serves conservative share maria disease sections philosophy ways arrived divided floor labour logo meets yard largely cancer offer tax expected traffic concerns graduated guest jews formation meant economy storm tells mile protected bowl letters providing begins classic damage harry offers davis challenge views marked allows density literary fa htm ben transportation kentucky sales fleet supporting captured extra recognized arizona compared theme francis moscow interested heard behavior transferred environmental blank musician starring assigned seats tennis percent logs display convention ring joint brian deputy planned universities yards communist agent difference animal czech positions exactly stay titled combat palace card ordered opposition attempts understanding stub wrestling critics growing establish hands participated revert 
poetry materials ga turkish paid promotion apparently battalion mobile additions row merged metropolitan figures existence eye longs louisiana lewis melbourne austria brigade screen risk conducted lats ban da labor legislative definition indeed draw application un steel presence expansion earl max wild planning comic adopted easy plus happy acts classes iowa grew save wins theater exists roles chance prevent candidates object felt powers birds spread defeat cape identified gained regions mine sides jul showing teaching guidelines simon depth lyrics christmas declined greece express federation journalist intelligence occurred connection displayed portuguese declared constitution presidential standing sons plot dates firm proper ends pilot relatively receive educated opposed manchester queensland americans introduction directors vehicle stock vehicles israeli frequently hills performing northwest drug visit portion residence walter pov interesting moon limit minute bell athletics reduced wind oklahoma architect ideas electronic crown younger anderson step weapons unable neutral connected switzerland expatriate armed weekly rating programme squad medalists multi dynasty cold granted socorro alliance methods sr sam alabama albert tropical vietnam dvd refers heat fans surrounding purposes credit commons boat iv boxes ethnic speaking fell arena roads core dog kill athletic oldest negative confirmed sixth edge jesus tools colonel weak chosen brand resulting nfl rise supply tradition elementary household spirit task slightly howard incident develop southeast sunday discuss stats climate topics purchased communications chapter broken singapore alongside situated ca license haven deaths passing citizens guns trees gone greatest improved visual pope officials sat glass miller resulted posted estimated contain brazilian sexual defence respectively concerning rich myself fast properties taught extensive exhibition speech les proposal straight ff internal effective solution 
fashion foot orange argentina brief performances adult allowing newly identity nominator singers inspired discussed require ex facility transfer egypt cells patrick quebec connecticut scoring anthony permanent phase audience motion blues hungarian arab trains sets wasn ranked unlike begin setting eyes database studios criminal commonwealth finish communication scope accused divisions accept warning alan objects diego contest fighter finds coaches beat extremely ford swiss sorry houston worldwide showed holds cathedral losing advance reality broadcasting adam vandalism enemy entitled youtube assessed billion buried belgium respect rare detroit graduate colleges explain authorities killing maximum neither fan notify painter hamilton returning attempted universe passes obvious suffered pieces apply actresses competitions aid driver folk dan khan baby denmark tokyo billboard calling anne happened danish wants formula interior kevin weather powerful muslim registered publisher preceding sounds eric approved achieved douglas provincial fund portugal athletes bird bands audio cat bureau centuries valid chemical items lane holding counties update ncaa speak finding domestic ali false equivalent caught christ ending toward puerto perform partner romania aviation wouldn failure ward strength onto knight nominations hungary concern keeping recordings ep juan functions mississippi ok calls criticism involving magic gordon treaty antonio selection rear colonial motor obtained circuit wish compilation harvard islamic determined geography arkansas fuel artillery medieval locations inclusion recognition northeast chamber moment somewhat grounds anyway succeeded historian condition physics newspapers instance represent allen watch kitt protect grey launch dave philip dc iraq changing ukraine municipalities mix tamil shift shared austrian door investigation institution princess trail ultimately parks applications hundred aired requirements talking kim ltd metres gray sector dean 
agricultural unincorporated incorporated escape orders corner commissioned founding mill mrs subjects temperature settled remember miami promote values spot progress learn planet oh occupied usage southwest refused borough truth clark sufficient equal administrator persons factory fought derived outstanding magazines flow peer attacked generate shape creator requires option lincoln starts stands carry establishment selling causes mp budget battles sky legend sourced arrested forum metro broke strike injury ryan zero converted violence significantly statements controlled welsh dropped roger pdf distinguished samuel translated papers detail chapel frederick thousands banks herself offensive kings factor rename replace museums resistance junction tries tim engines contributed medium device profit dream enter twelve universal typical seeing skills bought passenger cleveland funding agriculture parent decades receiving signal reform organisation prix column defunct utah managers qualified indicate ukrainian gay amateur obviously flora gene soul op alt discussions montreal turns walker entrance path nice string influenced occur developing abandoned humans pair flat sample contained banned moore strongly visited pm increasing attorney arm mathematics canal charts thinking dublin suggests whatever surname brain pittsburgh blog economics seventh alex employed heavily authors paintings concerned recipients navigational scholars controversial controversy reverted expressed josé bodies conservation maps ahead marie arguments chain focused readers carl cm violation offices wave circle apart invasion jimmy opportunity determine orthodox voted formal describes seconds cycle doubt golf walls productions constituency closely occurs huge andy representing indonesia sell aren mon drawn diocese tank advice senator manner generated malaysia asking finland causing leads lawyer seattle gain index saints runner crisis cinema matt hollywood reaction medals documents reader lawrence pattern 
archives atlanta voting reviewed looked bear perfect restored bruce baltimore baron pan commune fantasy duty chair scenes broad opposite stuff aged streets nick anna billy extension kent parliamentary kelly shooting ready pick ma songwriter aware jordan dictionary composition salt stating bangladesh bot successfully benefit lands interests scheduled teachers closing advertising contribution maine retirement scientists dam ny blocks las print techniques participate anniversary requested discovery explained expedition citation assist und meanwhile hampshire creative maintain pierre detailed facts frame finance socialist script camera returns engaged assistance experienced underground sale beautiful jane abc supposed successor classification tool mining producing cabinet fr bytes ross russell citations maintained evening singing fifa gender venues lakes mail jeff electoral emergency mode christopher heads proved priest funds investment romanian session capture aspects reduce trophy abuse prefecture walk faced normally regarded snow shop dakota bush coal inhabitants headed gary employees error invited cable protein accident decade measure watched patients downtown animated satellite johnny combination courts sequence hook clean wed owners twin distributed describe ~ defensive islam photos ottoman trained affected routes ministers wine elsewhere biggest li lanka carlos landing collected revival rio communes saturday mps guess drop sarah laid swimming membership edinburgh fit harris dallas degrees bachelor personally briefly files conduct extreme courses hence homes reaching na sought vision demand vertical updated marketing jason consisted appeal plane quick victor dyk solar ages neighborhood fairly wings acid scheme ° matters rfc constant additionally hip admins nova ceremony chile composers nazi scholar liverpool hero designer learned instruments welcome hair consecutive movies adjacent pool tue norman collections belgian corporate austin ensure driving phone fly ian 
window document adams collaboration margaret kennedy leg videos assume attached dry expand bible matthew depending serbian instrument covering random represents participants thorough mentions portrait drivers airlines franklin viewers finnish differences venue vocal cricketers element regularly rejected relative illegal stewart roof leagues argued colour morgan prisoners facebook attend nelson survived insurance expert steam cards manufacturing testing coastal yorkshire rescue territories thu thailand struck choose vienna journey storage costs singh distinct notably soldier il colony evolution taiwan hurricane judges gardens poems consisting removing driven responsibility sentences birmingham engineers visible ft substantial gulf installed revolutionary inner trip restaurant graham \ stores rice happen prove reasonable skin committed volleyball _ chose factors hundreds injured devices phrase stanley lemmon thompson suicide advantage automatically disc minimum goods charges alfred operator merely finishing fred identify producers ss ann campbell portland helps latest releases victims explanation operate threat crossing slow poets stopped strategy wayne ranking disney wright residential associate hi significance ruled excellent shouldn observed threatened friendly redirects temporary masters peninsula networks passengers assumed artistic safe earliest festivals compete png hunter moths alaska mi partnership maintenance monitoring evil relief charlie poverty hop cc fri encyclopedic suspected filled nba decide breaking argentine resigned oblast handed drew hawaii brooklyn whilst historians pa speaker moth permission wounded racial marshall kg gate springs roy photography helping knights roll progressive contrast continuing processes terminal executed shall svg spouse infrastructure principle painters painted properly frequency shaped joining robinson waters ridge bridges ceo monument mental carter karl mac orleans portal parallel regardless thirty giant qualifying 
murray afghanistan assessment counter bears purchase expression backing uefa improvement madrid closure wheel ambassador desert bringing iranian reign uncle severe rain admiral fishing existed raise broadway principles grow tests roughly tech trouble rico paragraph bat prepared measures robin hired fear merit participation massive designs agencies whereas technique alberta egyptian clerk knew narrow adapted commissioner rapid credited dating businesses bomb capable poem stages honorary dragon charged propose modified fired mlb send proof practices arabic attractions carrying mouth fix licensed symbol organ damaged warren exception costa unfortunately jerusalem replacement indians besides soundtrack virgin thousand vancouver legislation beauty credits buy organisations serbia christianity opinions cavalry tribe richmond chess channels claiming exact baker allied involvement anime referring donald sisters willing requests unusual yourself impossible colors cook drawing wikimedia jonathan removal moves indicates admitted ownership shore monitored nebraska regulations crash guitarist enforcement supports abbey deleting nevada barry tone operates indigenous personality reception transit buffalo flowers bond jay adventure definitely guinea horror rangers pointed apple popularity occasionally coalition franchise starred critic journals rolling percentage silent laboratory microsoft movements charter suitable alternate offering missions sc experimental rooms concluded reputation accurate versus websites interpretation tagged endemic chemistry achieve knows manga journalists forests cbs comprehensive symphony promotional electrical tags meters jerry tigers commerce remix addressed phil automatic gang afterwards printed oak warner tend ms quote separated bishops glasgow essentially wait input battery favor benjamin apparent shopping patrol eagle mainstream pc angel martial restoration delhi hans indicated morris railways centers mills helpful delivered components victorian 
legislature tourism treated extent kids barbara essay circumstances repeated plain superior strategic similarly duties effectively blp considering arranged ken grammar amendment alleged relation habitat spoken eu shell mounted entries conflicts philippine montana appearing triple boundary caribbean hosts signs seriously bristol warring mitchell industries colombia comparison basin eleven ill putting pradesh charity output dna carbon boats desc architectural representation commentary rising visitors markets plate giants processing landscape dick hunt em summit rr psychology ride greatly guardian closer terminus losses balance democracy submarine nicholas unsourced usual peru eighth instrumental hindu amongst defender riding arrival evans turning imply prose cargo hidden volunteer bio holder sugar daughters wildlife fun integrated partners rates grace feed childhood accompanied milan photographs honour soil server manual concrete possibility ghost confused tunnel larry styles elevation muhammad considerable stood inter lose phoenix sweet waste operational tall ongoing qualify constitutional sporting peoples acceptable fruit decisions depression perspective longest midfielder crystal monastery resident seek cincinnati tied surgery steps carrier stream alice dj kick furthermore strange predecessor bernard nigeria pain ph influential punk wooden suggestion interaction retained achievement mechanical drugs missed expect trinity classified minority businessman grown coat powered alive nbc nhl keith bobby harbor behaviour croatian maritime terry virtual indoor periods spiritual easier croatia lions archbishop luis merchant azerbaijan lots contested editorial initiative charlotte pure borders persian marks armenian romantic replacing talent unlikely panel jump animation agents employment trading parker statue ac dated wonder filed provinces friday jobs cuba couldn aside são scientist schedule waiting familiar suspect disagree suggestions turner forming formally locomotives 
barcelona se consistent recommended desire happens patient bulgaria vi vincent outer hear texts belief visitor vessels basically continental hole fail passage sees wedding archaeological layer designation clan revenue seeking couples entering suit soft weekend approval democrats crimes collins expatriates horses wear visiting overs supporters cash somewhere dennis resource sculpture practical harrison pink oliver limits cooper illustrated sur hell statistical referenced wolf warriors incidents fresh editions roots signature clinical premiered volumes worst adults contribute necessarily immediate feeling theories essential completion conclusion technologies strip bound praised stayed hull − diamond origins empty eliminated valuable cite doubles branches @ honors brick experiences beijing tie lgbt sa wickets liberty repeatedly siege baptist ron hebrew affect portrayed decline widespread coaching alpha equipped identical submitted enterprise touch transmission rs platforms cave filmed inch cool bulgarian debuted liga manhattan destruction activist weapon clay keyboards dangerous viewed lp email biology increasingly bold bowling os compare treaties affiliated sock assault regards monthly foster cousin urls hispanic logic craig trivial pioneer muslims lay rated absence amsterdam publishers tribes percussion runners themes benefits guards flows attributed stubs athens herbert celebrated sponsored raf regard asks delaware neil pole ref historically tail tours stable decides vessel identification delta telling dealing writes mediterranean volunteers reply attempting stuart marvel luke grave odd hearing uss mall penalty solutions secure hugh steven sole architects characteristics falling spin clinton villa select metric criticized surviving roberts standings biological lloyd munich belongs adelaide belong harold norfolk butler coi rival acoustic posts adaptation greg reporter url absolutely nobody scholarship vast exit facing inquiry dual belt noticed patent mathematical 
relating rarely submission demographics crowd rick governments bonus tourist mystery click settlements walking nevertheless voters rifle component civilian partial encouraged birthday eddie christians denver petersburg researchers partly photographer runtime jon obama picked seemed clock violin highways holiday distinction artwork makeup catherine font farmers occasions au guideline photograph struggle timestamp produces yale options pen procedure jacob convicted touring transition anglo legacy denied relationships ottawa derby surrounded libraries competing speakers grades hudson administrators sacred signing rob citizen dogs argue believes annually cardinal nepal intersection discussing reveals defeating disputes beam overseas perry nickname ruling syria wells contributing ultimate ranks danny retail favorite vermont begun download trusted appointment ballet jefferson anywhere sand angle sessions recreation wearing kenya accessible ralph thread disruptive spend ninth arrest choir trials mines injuries rapidly rounds competitive opportunities meetings commented wang woods exercise jacques objective demolished preferred resort pedro robot venezuela segment studying edwards aim dancing eagles demonstrated tribute continuous encourage spider acted convinced heroes describing rocks bed gap reflect mars participating cooperation obtain gothic protest hunting rfa frequent conversion stress manufacturers voiced innings traditionally jose adventures tiger totally voyage concentration sing rocket electricity shadow boxing senators doc stanford machines vegas clearer saved jury calendar noble tommy guilty leo affair handle extinct responded shares scotia manufacturer tales implementation truck spelling item load customers adds spaces cap orphaned ferry prefer push lie berkeley lebanon madison throne attracted ie lion retrieved manor promoting saudi serial abroad rogers lights grandfather gauge concerts elder reads renaissance uniform chase aka computers brisbane susan 
raymond flower col thai disaster survive involves clothing murphy sharp behalf explains yugoslavia buddhist publicly meat literally spam telephone moral sung partially lawyers citing interviews brunswick radar spending grove tea ap elite bright improving sierra heaven athlete aspect answers ted consumer funded exclusive ibn manuel allies reviewer missile mechanism helen withdrawn intention mini casualties establishing diseases rhythm pat catch poll deck newcastle antarctic leeds lasted ranges listings ordinary insects suffering flash worship boundaries blind pakistani assuming interstate patterns arrangement globe honours gross gilbert applies gradually youngest managing experiment radical gov legs opponent diameter supplies pitch utility cleanup opponents regime revised plenty genera diplomatic germans seal gregory corresponding concepts sword purple pending virus populations bull drummer presents holland congressional bias merger remote sean messages rebellion premiere physician ha victim con cloud angels noise heading duo beer palestinian copper jurisdiction implemented improvements ski peaked hms loop renaming drum dramatic saskatchewan talks earthquake rhode hat requirement den tanks presidents societies min defending alcohol dominated sang eat graphics constituencies asp coffee batting chancellor destroy tons cruz warsaw exclusively connections rush heights playstation outcome apartment cardinals fill recipient correctly traditions fundamental copyrighted thin chan resolved mario departments dame thereafter shield lowest fighters ivan writings bosnia sentenced violent caption harbour margin auckland postal pirates collective diesel liberation confederate devil activists sultan rider amazon florence marc wider arnold shah si blogspot reduction contents genetic somerset locally milk romance lacking intellectual latino failing mason pete advisory es arbitration interface hitler default accessed lifetime sheffield departure hindi anglican suggesting mistake 
residing remainder raising embassy murdered sox sleep suspended sum mythology bengal confusion bakhsh oscar programmes therapy occasion exposed assisted possession defend devoted graphic warfare milwaukee informed anonymous reverse soap territorial lisa paulo northwestern playoffs boss nasa sockpuppets quoted byzantine idaho poster geographic rebounds ho congo venture cricketer worse hoax restricted advocate doors naming situations instructions sullivan tables leaf shoot substitute restaurants contributor spoke errors enjoyed framework rocky kerala shakespeare quantum immigration mirror certified assets potentially presentation cotton sitting tournaments syndrome checked forty aimed sourcing journalism upload unsuccessful towers conductor hospitals bone essex rebuilt wellington ideal raw sharing labels leonard watson governors posting harvey bases hello rabbi hardware ensemble monster pitcher emphasis irst recovery respond aaron lesser qualification organic exposure palestine thoughts needing drafted maurice immigrants variant lap legitimate autonomous wallace † succession throw monday reserves donated increases kid guidance delivery joan fifty slave feedback columbus stones manage cgi initiated favour printing variable theology todd parameters traveled md canton han reed celtic characteristic commanded searching inappropriate switch ties tube otto debt outdoor navigation eligible experts expensive tier gospel newton essays shanghai conventional campaigns formations feelings bath venice cats variations emerged socks ch connecting flood documented custom touchdown profession layout academics settlers merging sony competitors phillips hasn grass reservoir artificial novelist tip prague abu faces guitars laura fellows internationally attacking johann dreams hughes suburb understood specialized warned pearl chorus dependent restrictions killer oakland romanized trio influences blocking mtv wikipedians à cattle gear gabriel traded skating fifteen palm wikis tale 
demonstrate vary liquid cycling princeton respective voices faster friedrich jet horn erected burning worker atmosphere characterized syrian welfare java monitor ye graduating columns reportedly repair bin stick dollars organised parameter truly resolve buenos parade backed awareness depends define spencer republicans conspiracy dies clarke rough engage pine equation feels democrat permitted cutting button attending brands queens abraham neck forever est drink sheriff miguel aires irrelevant poorly montgomery vanity gift riders functional crossed diverse ru numbered quotes slowly xi attitude mouse justin protests gods amounts variation smart prices prayer terrorism beta durham householder counts iraqi detective josh placing somebody linking compositions oval extensively filming perfectly mw pr indianapolis fn funeral recovered southeastern farmer protestant lt cameron focuses ranging unclear indonesian mixing mumbai nashville danger rally narrative camps surprise manufactured deployed kate solely molecular unnecessary isle theorem homepage colonies cyprus wake brings winds magnetic conversation sussex ab fl fastest gates ram plastic electronics restore stockholm inn buses connect forth guests radiation receives lancashire playoff cork generals intermediate ba verifiable cheers filipino reaches oriented hamburg creates orbit massacre dialogue illness wc dress codes dawn isolated nancy violations perth tenure ladies autumn ratings incorrect scout difficulty pupils wealth hart toured allegations regulation watching lodge eggs disputed citizenship specialist tasks intent instruction ceased pride banner friendship panama corruption sunk harm ernest pilots pursue tape emigrants cancelled revenge revision dominant fee computing examination chen matrix das biographical kiss nationalist luck crosses heavyweight bid appreciate ce enemies mercury interactive math debuts preserve nobel grande keeps structural marry airports veterans airline axis execution cult reducing sp 
colin chester ticket belonging im entity judicial explicitly bombing recognised originated applicable founders fitted wilhelm suddenly parking absolute françois locomotive preparation nintendo declaration presumably burial governing jamaica knowing vladimir beating avg methodist utf challenges kenneth evolved celebration discipline bearing belonged fauna manuscript experiments chiefs compound tampa arabia associations equally dealt shut targets alien withdrew depicted sergeant diffs subsidiary thirteen thick extend dismissed neo wire phd measured fat visits linux teach flights verse bennett warm dynamic shaw breaks grandson monuments lying lords michel treat raid congregation shorter temperatures testament drinking companion manila km² punjab imagine consideration veteran doctors eldest carries ruler wise shipping afc worthy registration directory wyoming manitoba vietnamese ronald cuban burns justify divine suppose sequel fate rovers cole oral trans deemed boards span bryan santiago episcopal terrorist okay waves invented landed sandy acres paint actively indication stops excellence integration thinks bibliography farming nonsense marathon beliefs redundant freestyle aerial € preservation altitude freely landforms simultaneously psychological fernando cultures taxes marcus stakes dominican franz coins oxygen ^ incumbent civic hardly isaac spell craft inspiration pairs vector arc professionals vii contrary accusations approaches baronet slaves mad spectrum client dozen travels symbols plaza banking inherited legion symptoms mosque guys lab sailing orientation virtually generic reasoning stroke unions efficient opens impression discover relocated novelists roosevelt dancer phenomenon preliminary recognize anchor arguing abilities procedures emotional timber fisher prod cartoon disorder fled demands lithuania continent fellowship lock relegated warrant pictured recurring overview wealthy acquisition eve filter addresses independently slovenia observation roster 
disestablished challenged threats fallen protocol judgment grammy colours distinctive namely opposing landmark package controls completing sabha prisoner signals owen capita inaugural intervention arriving cylinder tenth liu tested renowned shops dome philosopher epic stem specified kinds davies collapse allan sight albanian canyon samples perceived celebrity priests louise workshop herzegovina claude fortune bars cornwall palmer presidency tiny fk appeals istanbul hp rookie expanding calgary shock pulled stevens employee yang housed tomb earning innovation streams unity lucas grows armenia interchange sized proteins proposals swimmers mainland seminary hamlet timeline realize coup newport negotiations exhibitions malta hate westminster installation enters goalkeeper julian morocco efficiency chapters aboard helicopter fewer fortress ani burned displays compiled ips contributors torpedo giovanni chat catholics herald chuck pit supplied optional desk garrison sprint exile surprised achievements biblical rebels te denis geographical sit alpine bills glacier aa binding indicating estonia eating saving chi developer indie difficulties doctrine worn fork simpson moreover maintaining theological upcoming vocalist temporarily hotels edmonton developments literacy currency missionary arrives hammer dollar ambassadors reverts twitter centres solomon recommend descendants ruth handling customs collect grid secured certificate destination albania indies euro consumption feat pushing constantly survivors mansion cardiff temples blake sheet lift confidence cuisine frankfurt galaxy ecuador breeding outbreak legendary handball georgian copenhagen trek ignored arch keys proceedings enjoy quartet aims propaganda wu disk realized ne neat funny punishment accuracy businesspeople meter theoretical suspension graduation flew seeds lighting jennifer smooth ah customer armstrong southwestern involve philosophical escaped powell kills taste allmusic requiring bros assertion boulevard 
northeastern brooks sending atomic antarctica strikes reconstruction chronicle traveling leslie ellis devon ghana gen rebel duncan pianist canon nc reformed pack iceland solve cyclists payment suburbs militia pronounced exhibit mph glen eugene compromise tactical discovers switched uganda jail yeah townships somehow withdraw holmes promise deals convert dos afternoon noting recall arrive warrior mammals dimensions surrey gaming lutheran ports amy survival responses badly collegiate scandal widow swing nights polo linda adr consist probability farms conferences zhang crazy witness nephew sensitive mutual hd diet clients fringe passion rings stronger millions dialect orlando undergraduate relay wet cruise henri publish joy julia kitchen abstract snake comedian motorcycle nadu reverting arsenal millennium assists thereby bow andré serie dimensional travelled eurovision firing suite doug gravity stored departed optical frontier evaluation graph hybrid oslo earn metre keyboard inducted nearest jamie decorated complicated nathan slavery circular operators armor mechanics bradford leon rachel footage strings header hood inspector warnings relatives plains defended wheels criterion ace arrangements penn approached joke sailed religions authored grants andrews moderate stolen tributary commanding pin carol owns prototype copied canterbury midnight quarterback duchy bailey arbitrators performers handled exploration diversity sixteen findings repeat brussels imdb planets theatrical reconnaissance shots complaint batman exhibited espn investigate verify discontinued absent girlfriend resignation fossil explaining tang inches proven yu franco dying tribal tyler surrender glenn substance focusing luxembourg colored scholarly administered explosion pushed generations duck porter permanently memphis salvador emma mit zoo gibson wording emerging mere notre portions macedonia ethics depot curtis rescued gaelic slovakia elevated jeremy listen impressive bradley surely egg conquest 
rod cdp algorithm burn thesis lover capitol comprises remembered ferdinand marshal judaism balls nacional wrestlers ahmed sin holocaust edgar saxophone retain curriculum wishes prepare ruins ibm rochester nigerian pitched jesse malaysian atlas telegraph performer cannon encounter emily dissolved catalogue discrimination er refs myspace reveal wizard teen spots bomber foods quest connor screenplay motors minimal muscle prestigious sustainable chelsea strict kingston sheep andrea complaints xs née connects nursing defenders richardson triangle nato teeth occasional strictly harper fluid bigger fed newfoundland disbanded comparable documentation brien compounds pointing edmund instances naturally forcing ussr laser lat sculptor guild observer worlds imprisoned wrestler praise parishes bones css cox contracts consequences provisions circulation butterfly hugo abolished algeria edu sufficiently armies separation spy cliff technically reactions lithuanian trick curve accidents horizontal uploader legends enzyme freight lacks hydrogen broadcasts viii caroline pull plymouth twentieth cuts mediation airfield catalog dale synthesis rape seoul engagement coin lucy consequently platinum twins memories robertson verified anthology milton geological defining dinner hosting thriller retreat albany abdul ignore migration carefully magnitude sudan closest manages duration henderson explorer marco fusion aids gathered privately reflected afraid presbyterian automobile estates fault pound allegedly delay developers semifinals belfast arctic ps kurt mayors windsor assumption plates fourteen nominee disruption monroe hearts belgrade victories extending pale pursuit glory destroyer deeply lectures affiliate preston deceased speaks gathering angry incomplete enrolled configuration brad skill intense tasmania commitment loved reforms rulers uruguay sustained napoleon confirm breed auxiliary enabled discography licence refugees adrian pipe karen altered budapest designers fe heir advisor 
illustrate authorized hide announcement compact tissue particles refuses receiver civilians marsh vinyl delayed unrelated encountered wednesday checking chilean hey chambers demo nationwide agrees ahmad santos paying interpreted submit desired followers observatory problematic springfield kit remarks burton mo inns coached monarch observations footnotes beetle promised palomar cream presenter potter su favourite transformation mcdonald bavaria kumar nineteenth severely gaining mixture browser endangered mate everybody lyon illustration kyle afl brook geometry ping extends aggregate variants baroque iso collapsed neighboring integral jake hopes cornell modes servant gt td kenny hurt mk maker inline carlo lynn stability hoping beneath imposed confusing mt summaries beetles joel cf jets logos vital malcolm winnipeg kilometers songwriters buddhism nose respected pace thunder centered physicians bolivia forget implies crops halifax toll monk extraordinary lessons pub paralympics monte maría segments deer wireless whenever commenced mysterious consultant fraser formats jam chicken enable idol reid births amazing pet upset loves stretch nominate striking striker accidentally louisville hopkins eds goddess burials resumed satisfy notion voltage betty marion geology consistently cyclone export lightning impressed maintains logical aggressive jin julie fbi yankees ludwig fi pond suburban enlisted moments conjunction interim argues lucky targeted lon speedway regiments picks prevented toy bicycle purely pd interactions fraud lang arcade lecture sanctuary dragons copa careful nurse rivals module supplement lens patron commands trend superintendent gerald rap geneva ash blade disappeared array patrolling predominantly committees loose boom sailors beaten smoke assassination lancaster reynolds divorce dust saxon healthcare separately grain executives translations zimbabwe thrown cohen puts diving neighbouring carroll accounting mesa prussia intelligent cherry underlying tobacco 
cleaned varieties bench directions ellen padding measurement paradise alexandria complement witch attraction diana personalities colleagues busy cia screenwriter rankings aboriginal commanders salem wagner firms sanctions americas endings instructor nobility divorced varies tomorrow manuscripts unified clarify scouts investigations silva derek agenda provision humanity admit terror contestants trinidad distant burke circles assignment releasing recalled shrine sail willie karnataka celebrate ranch jo collaborated vampire unfree playwright sick associates heinrich ethiopia flags tel drove learns shorts drives accomplished autobiography recruited uprising edwin velocity terminology raiders coordinates brighton viola para morrison propulsion boxer finale sh shoulder disabled joins div tactics ernst innocent rapper settle privacy boeing cites bunch emmy indo distinguish rosa accordance thermal flute marines feminist trustees sculptures nationally bacteria introduce landmarks disorders rivalry prevention honored healthy circus speculation burma sec ka ar quiet knee deliver threw hypothesis referendum travelling estonian pastor sofia tribune lasting permit priority pounds cent consequence rica conducting furniture macdonald honest innovative estimate atp rotation syracuse lecturer automated obscure kosovo classics julius appreciated naples sebastian activated varied offense advised barnes acknowledged exceptions martha quarters drawings barely hitting refuge maharashtra conventions elliott diplomat unused searches brigadier particle malayalam thursday icon ulster genes infinite considerably vale portraits paste randy ec saxony convoy annie excessive believing rhine mineral implement surgeon badge charleston clause infection electron walt cnn likewise tonight confederation accommodate casino doctorate aux guatemala settings mask shelter dorothy ethnicity hopefully elimination heath pregnant richards theodore delegates blair fac phrases crashed preference janeiro concerto 
headquartered bits construct tune unofficial bulk lighthouse stan highland mascot squadrons acceptance tight considers ai hub mess wilderness routine reviewing dubbed dozens spotted harmony entrepreneur wwe apollo runway ji naked anton moses legally wa nominating fake biased división revolt equity varying providence investors reliability tenor fights pocket sad troy treasure ion rendered transformed roberto nn adoption decrease reserved forgotten lok crop licensing advocacy collecting treasury trumpet johnston uncertain norton collector cluster dear georges roller pt clothes sovereign enhanced compensation consent outline holdings jorge darkness penalties sk bombers hometown holes blow cooking vfd aftermath trainer rican measuring lawsuit retiring chip consciousness archaeology latvia telugu blogs protecting hardy nicknamed scorer stamp nat fur redirected estimates lit ritual locality trace marble foundations politically nottingham derivative boxers dimension touchdowns crawford bats yugoslav tanzania succeed motto streak concentrated dirty hayes xbox identifying likes genres galleries forbes councils adequate brass bach alias inland counsel wore comprising tough advertisement protagonist trails demanded claire mistakes bruno dylan bag throwing churchill tan spelled climb witnesses storyline thames anybody kazakhstan bots constitute presenting highlights jumping prof slovak skull missionaries ordained eventual hoped myth mandatory stern fees bet monks dancers quantity endorse inventor cairo graves proximity seemingly sue armament barrier creatures logan je erik leicester ds silence jessica plateau finite precedent stationed walsh zones intensity exterior murders paragraphs costume bike neighborhoods imprisonment suffolk forwards remarkable undelete differ tin garcia madagascar cameras ammunition fires viewing explore builder minneapolis occurring bullet kerry subway arrow economist bread lou strategies rubber precise rifles cognitive governorate nest slam ancestry 
portsmouth miscellany convince audiences boarding bonds joshua inhabited casey attract nonetheless eb kilometres pump feeding prey ain mathematician diary vulnerable inscription dubai michelle lebanese productive guided happening accordingly gp researcher della baden upgraded demonstration equality philosophers spacecraft trap gb clara invitation marking expertise admission sacramento certification precisely casting reassessed submarines prohibited supposedly governance sometime frog vague tackles mhz secular tracking spa publicity armoured cleared watts gibraltar renewed reflects fever melody supporter elaborate jeffrey discusses surnames useless swift tuesday silly empress capabilities newman scales onwards beatles ko clergy jacksonville sara lifestyle bee holders baltic czechoslovakia brandon loaded maya evangelical enterprises imo mature physically sequences breast beast raja für anymore educator bang griffin rhodes preparing proportion enrollment itv ana ceiling rainbow demon prussian equations answered ← ist perception distributor entities jackie dynamics fiji insufficient algebra homer larvae limestone johns bce chaos chang layers crater ki chad db seized webster depicting excess bombardment hurling ashley dot gif translator cowboys counted hanging soprano interviewed governmental workshops terrain belarus liked console nascar teenage applying vandal graduates jungle ballot placement fairy tourists reasonably performs quarterly shifted romans rpm diploma circa environments collaborative swan carpenter petition boris berry invention southampton prairie bend app finalist questioned explicit bo draws governed slight drag maxwell quarterfinals planes everyday oriental manufacture airing acclaimed coordinator bombs mohammad bassist superman colombian philippe felix bengali greene voluntary floating montenegro sketch lo mann flooding escort dressed astronomy sudden variables arbitrary skiing timothy cello rainfall rafael sphere ought rewrite georg cinematography 
canvas chest krishna provider va frances crowned wanting carved poles cabin civilization broader avoided pl lisbon tongue endorsed newer eliminate gng panels darwin cheese easter rat papua insert descriptions debates informal castles cry loyal surfaces nicolas institutes humor madonna worcester cooperative substantially trips winston introducing llc drake lunar fountain consulting tends patricia ers garcía quit rid maple aberdeen verb illustrations mechanisms reds posters underwent leipzig santo engaging roland expelled evident staged telecommunications mc strait nationals unanimous misleading sherman stefan sleeping alto crucial thomson directing yo prehistoric communicate alphabet zagreb fu chef ix tubes carnegie hostile prizes eighteen shirt weird namespace pursued loading edges plantation vernon acre practiced wonderful missiles br guides finger garage savage technological rely rises shoes climbing barrel biographies spite assembled ham hon geoffrey valve histories commenting osaka beck analog monkey tackle listening integrity sits definitions critically operas baldwin troop objections marina detection sixty differential linguistic venus salmon monarchy stade depicts francesco abortion monica elephant polar prompted trademark ton provisional counting tu corn trucks sank airborne lengthy deutsche absorbed par tension pablo aviv cleaning judgement physicist catalina preventing disasters ni lesbian rays withdrawal walks realm trailer janet tornado aunt distribute genesis shallow makers mentioning requesting floors violate scattered boyfriend consolidated determining catholicism luther pleasure seeks constructive webb battalions alberto magical deliberately sacrifice faction olive corridor compatible receptor molecules kiev lb vista castro preceded wei pleasant miscellaneous lineup ultra nowhere bulletin clinic itunes batted patriots pollution treasurer … herman questionable sally rex mvp stuck proceeded openly afghan inferior vatican folklore affiliation cruiser 
princes silk agreements trilogy armour eng stamps harder specimens consumers differently tallest defines cage vicinity ferguson correspondence illustrates tehran cheshire pageant amanda pos bye employer browns static tibetan nixon sociology recreational capability addressing panthers objection systematic violated nl chocolate helsinki bedford cambodia grandmother rounded fitness halls midlands jung brilliant libya transported angela creature gather volcanic android filmography deposits mentor freeman apartments csd uci puppet carey hiding sends meyer strongest gameplay topped lima freeway kirk batteries chains ucla recover wong ear servants combine decent cubs deputies hawk contestant cumberland rational altar ufc organize exhibits subjective determination chennai dong threshold myanmar proud custody surveillance ng threatening allegiance discharge penny ra kannada astronomical rector petroleum bug proceed chapman mud decree regulatory honey diagram infringement broadcaster complexity riley displacement cr sponsor ira condemned careers haiti wisdom pierce stupid alma alexandra continuously shuttle hampton eva mickey atlético gastropod relegation yuan sv lynch johannes ecclesiastical destroying trunk proclaimed genuine adolf sharon acquire guitarists hung corrected theatres gardner artifacts handful mandate locked mohammed dish terrible revived reagan rebecca toledo indefinitely divide recommendations gymnastics alter ideology carriers constantinople fatal modeling monsters notified bulls harassment censorship favorable bore commercially inserted paramount luxury yahoo ecology reviewers emirates neutrality chronicles deployment lepidoptera blast jenkins omaha detected attribution harsh cn bundesliga overcome monarchs facilitate hermann morton ya sake conceived roma accredited echo bangkok achieving bergen hawks availability organizing lung andreas cedar stake bankruptcy flagship lankan unassessed basement spirits torture toyota wheat examined membrane knockout 
coleman homeland maiden dhaka dental christine curious tucker evolutionary responsibilities viscount communists pavilion fantastic ceremonies augustus loans altogether joey corporations tagging moss parody yield beside sizes armored neighbourhood cheap stuttgart dialects backup forbidden sanskrit neill councillor knife wade humanities premiership gonzález yesterday percy expressway transform vegetation tickets sf dams adopt braves throws plural filmmaker pomeranian habitats sicily justified speeds zip updates replied restriction lin siblings measurements uttar courthouse tide laps dynamo predicted uncredited charitable feast poker promising po kane trevor reprinted attractive sidney factual holidays wondering initiatives comfortable sw outlets fletcher airways floyd snail genocide saga undertaken scots kingdoms violating weren archdiocese lu fossils buying commissioners smoking enormous subdivision ta leather def patch unreliable porto valencia subfamily tracy downloaded breakfast subspecies andhra mercy botanical peaks ore lone bottle tunisia kw angola humanitarian decorations slide invalid generating trapped rao guru freshwater telescope dense satisfied criticised coral xavier obituary helena rowing albeit anatomy binary merchants controlling wound conviction publishes dixon kuwait ridiculous suffer nsw blacklisted nord caves regent riverside shane balanced fitzgerald highlight payments tendency lópez joyce belle lifted davidson decorative rouge clue rockets cure bolton slope manning boyd organisms paraguay preview dodgers qatar import buddy iucn dock clare penguin rapids exceptional papal shed cologne providers beings norwich challenging anthem supervision investigator ming lease yours cecil toys conrad buck hang bulldogs italia portable dell sd irving bells tibet composite councillors salary blacks enhance staying realizes sysop dropping smallest fragments objectives caesar tragedy macedonian unesco investigated cameroon carson slot rams icelandic sophie 
granite jenny deciding comparative apostolic venezuelan pirate paralympic slovenian accompanying seasonal delegation combining portfolio machinery miners embedded keen charted gandhi fiscal testimony duchess cu reflection middlesex carr freshman booth rudolf león courage midland riots develops pier isles formatting woodland toss auction cbc specifications acclaim secrets crest sutton decreased ups repairs que bypass prominence slopes phantom hartford alps liner ease cet investments auburn flame exeter din kay rent highlands induced verses crane simulation schemes latitude lance networking keeper wanderers fixing asset prospect knocked merits prejudice highlighted stance gymnasium gloria trent choosing horizon yemen readily riot arose renovation quinn excluded elena successive anger hierarchy ml nm decoration nerve disability giuseppe boot exceed scan rca pioneers canberra mb coupled newark flowing fungi sensors battlefield shells ashore arguably cottage honduras resides shepherd reservation elderly pbs mapping mongolia commentator bi nash sharks ricky shirley stopping exercises stepped ribbon infant destroyers mild vikings peers correspondent renovated proceeds reproduction sings mercedes compliance sided bombay jointly barack premises hawaiian poly handbook rené screenshot imported transactions statute congressman jerome shark choices doom bare raids diagnosis referencing firearms stoke bride prolific selective criminals lindsay advocates automotive wartime worry vacuum heating rushing jp lengths radius johan inactive advances valued celebrities meta estadio rehabilitation synonym asserted corp ambiguous leone sailor racism negro tears inspection ancestors honda barn controller assert abbreviated singular vacant depend ag burden encounters fruits ruby lunch nationalism photographers mick classroom btw delegate clyde forums cups challenger recommendation shi instant falcon loses planted nordic hannah coordination hammond dances barracks skilled qualifier saves 
hardcore precipitation bucharest ranger lily surveys peterson albion wicket destinations calm photographic schmidt subtropical contacts asylum deserves hyderabad visa twelfth contacted matching pregnancy blatant noteworthy commit stale expectations carmen erie harvest flies kashmir theft relates occupy flames wesley pga umpires beats arlington sterling chances donna petty lets sonic tr transparent dedication clarence katherine nervous facto excluding everywhere sunset anthropology minds reward narrator wingspan antoine ally steep primitive gateway leigh slavic derbyshire lopez chronic donations imaging suited europa duet ignoring intel winchester broncos buddha pig whale beaches approaching archipelago attributes minerals justification frozen inherently fischer gains dee comparing accurately participant peaceful typhoon hank manually verlag shortened tram aerospace alignment arabian ivory hastings taxa oz holly traces constellation allocated healing frames rey lorenzo specification risks professors viking atmospheric pottery bloody eden dd mormon nielsen consecrated peruvian coloured relevance interference candy bitter flown longtime struggled monetary trends individually funk basque bangalore synagogue latvian elect pot interact crews yi paths var guarantee intact lacrosse raj immigrant aus patriarch themed blessed tailed oversight servers ecological rabbit phi sectors corners powder mega vfl conservatives trivia clips coventry wives plaque institutional inform vic obtaining summers detect precision conservatory sunshine motivated diaspora linguistics gm lc accepting flowering nazis savings underwater deeper mccarthy avoiding toxic georgetown screening unaware phenomena inaugurated rolls liver aided benedict commodore hollow mob dawson emi brett alert seventeen bounded insisted loyalty reunion patterson collision lighter linnaeus zl filling enacted matthews willis secretly vowel whatsoever encouraging finest drainage disciplines shooters lambert seth kai emigrated 
bermuda wow noun reversed factories valentine plasma virtue heated phillies bryant carnival notices bates partisan amended ballad dairy insight sorts wolves hay kolkata trafficking momentum alternatively peerage labeled ricardo devils solved endurance blame pearson gaza hal teachings convent dealer ct neal frost bout bugs consistency demanding wrecked steady explored fuller shannon watershed specially hansen implied mg extant reverend lanes whitney mistaken pupil intend adequately limitations lateral varsity lovers intentions comply securities prohibition acute container emissions upgrade notation omar taipei torres dining survivor corpus advancing shaft nicole turks marker locals observe dishes cove seating dont pour loud hu suspicious expense prosecution noah hyde eternal terrestrial ethical countess tender reflecting yearly wholly pushpin masses modest dudley barrett explosive fathers pas revealing juvenile similarities conquered relate colts violates surrendered denomination feminine contentious softball compression micro sovereignty reef brady walked scenic eleventh shadows breach behavioral monasteries rope owing spare cache wimbledon practitioners duplicate wards notorious parkway ronnie odds andre ½ mos maternal transmitter steal identifies tertiary pension hc scenario constantine specimen mare dolphins derives gloucestershire metals teaches incorporate oceania galway mighty lightweight processor prints somalia unblock og switching possessed hr phillip bilateral descended gloucester pyramid eps emeritus nina val wolfgang gill cooling hiv electorate sexuality dash doctoral rita outright constituted nam timing peters tribunal stalin stadion organizational memoirs pitchers presumed mayer beaver ja sm gonna gustav motorsport facade spike titans possess apps cal carlton klein hassan dundee translate flesh medalist clayton pistol mikhail expired unnamed clube watchlist valleys benson profits acids travis distances enclosed asteroid cow accommodation catalan 
warwick textile brewery sanders clarification deadly qualities marching assess gotten witnessed marx chassis intensive usd phones marcel weber diane detachment operative gazette thoroughly specify fritz marketed cloth doyle statues nigel qing sheikh cafe sandbox breeds orphan monaco folks meaningful invaded norm sen berkshire illustrator sl celebrations beverly macmillan strain motorway contracted constitutes williamson pronunciation disco presently talented deletions forestry rodríguez exam coined struggling advantages mood wheeler karachi blackburn unsuccessfully educators beth swimmer tunnels wagon masculine offshore darren gerard consul excuse unreleased investigating mohamed salon checks trigger reinforced enabling shoots pipes atoms invisible ibrahim memoir intro inception implications coronation tobago werner gifts livestock companions differs confirmation volcano jupiter prophet taxi disagreement dresden originating refuse cyclist deserve detention persistent scripts diagnosed iris dad abbot elvis dorset mel laboratories localities responding lorraine fabric seas shin starter clip villain chemicals replaceable multiplayer nr astronomer immune comprised bologna flexible modifications samoa tai gamma reformation convincing cherokee posthumously homeless triumph prefix carriage obsolete communism unreferenced consortium stark locate greeks inclined effectiveness advocated surgical souls retire cs renewable safely cameo dissolution unemployment racist torn baghdad malay dana clarinet sodium moldova backs crow interrupted columnist quantities arise pseudonym eclipse organs philharmonic christie incorporating château christina countryside conception angelo dir lincolnshire wished regina rectangular transfers yacht picking secondly algorithms nassau constituent sánchez falcons termed bart genius defeats informational punch specialty clash continuation charities abdullah remedy judged josef loosely uninvolved comfort branded broadly cadet helicopters nave shooter 
lotus enlarged dwarf logged affecting packers antwerp glad fold knox billed terrace flanders tri socialism sandstone renault fleming judy infected informative happiness warehouse miracle leisure kindergarten leopold barker synthetic waterloo sage adviser accepts trim basilica sitcom cartoons irregular atom normandy financing admits embarked passive modification shiva tudor breakdown alfonso badminton fiber fiba mice carlisle unveiled subdivisions tired mu wage fingers cest loving worried gambling basel randolph allison mothers capitals auditorium vault haute persuaded descendant poison gujarat royals yan damages eleanor parsons invested calvin dover mafia resembles offs anon deny shapes madras wears professionally essence emotions myers saturn transmitted paperback freed domains hanover mets pike sketches cellular augusta sinclair examine unexpected sponsorship ae dab bmw uncommon aria realistic barton miranda dodge stint oc sunderland plug orbital spreading puzzle symbolic diamonds jesuit viable flee lined roses competitor munster midway germanic urdu rides scratch lahore pipeline protective levy rodriguez seymour affects pac jules archer kicked promises remake forts accusation reopened phases rendering progression complained liberals indefinite lacked clement batsman mackenzie judith chartered salisbury discoveries pursuing primera polls reactor cope beds howe calcutta stems dubious desperate lafayette uniforms outcomes tips partition ministries conclusions destiny sigma reporters daytime aurora goodbye relisted webpage coordinate steering pepper drunk folded seals scouting swim cunningham lexington persecution expenses theaters controversies agnes higgins projected functioning query seine byron nowadays accomplishments pérez averaged fool downs brave mozart gil dirt mathematicians aquatic noel inventory ethiopian evaluate meanings reluctant rgb imagination barber slogan reich hurricanes desktop demonstrates amusement hunters drops letting sept demolition pi 
hague bronx dukes natives ancestor preparatory fence retaining rna strikeouts turin crush algerian accusing exploring reunited wii narrowly cone nz württemberg urged taiwanese capturing continuity enforce clouds verification owl struggles namibia pratt tate romeo reduces regulated lonely deaf shire structured newsletter ink dwight swept menu steelers pointless declare paolo supernatural privileges pleased commercials faithful knock launching jubilee finishes lamb resemble hesse storms simpsons storey flemish logistics consort satellites kb dub elisabeth apache antenna ashes fin advise trump kidnapped enables wounds alternatives turtle greenwich cake tributaries heather neologism hobart pulse camden meyrick eaten irrigation kyoto labrador vera honestly ninja midfielders transcription validity employers winger redesignated pays curse breakthrough cycles node europeans expressing recreated thirds privy habit viewer russians shields analyst exposition interval decommissioned igor oath filing generator welcomed pg bahrain pressed stays twist griffith aston bordered nicaragua bacon bury frankly smithsonian premium ned mayo walton stevenson colleague curves fraternity xii encoded jacobs hire isolation raven dig introduces verbal appropriately noticeboard patents usc holt tx utilized searched clifford sonata cb hemisphere solving slang thrust trout aesthetic predators panic tulsa aaa teresa warming advancement hunger ears tract laurent vandals resign burnt splitting courtesy rodney encyclopaedia treating subscription perspectives augustine eighteenth midwest pasha amber dayton universidad revelation designing overtime rows functionality ave pts mastering deity shocked autonomy kuala conway frequencies magnus protesters inflation academia rc hancock unacceptable passport interpretations creators appealed baba viral clearing woody ventures tab marginal meantime inscriptions shame genetics motivation yoga rotterdam marvin teenager northumberland vocational afterward laurel 
overhead balloon sponsors firstly wha greenland demographic rahman attained ivy eg screw talked wigan accreditation vicar surprising sandra overwhelming rats mali baku stranger installations payne mod projection ferrari collectors sands subtle strips shoe basket moroccan fisheries gfdl tally hadn voter resolutions silicon rage brent aliens cao imagery nucleus ceremonial practically simmons circuits strand definitive promotes tooth uranium complications sega clusters zambia employ ky isabella abandon marriages tunes probable amino accent incorporates garnered playwrights robots slate weston parachute escapes lauren skiers belarusian clearance snails martínez mercer crimson incorrectly meal islander pharmaceutical dominion chemist medley proves porn mirrors remnants expressions vegetables teammate leningrad psychiatric laos traced wiltshire tense ant hectares fears curling fits af inning rotten fernández fury underneath bihar kraków confident pioneering suzuki mobility fascist marquis registry numerical catches olivier halt chronology polytechnic randall aligned dunn scrapped alike leased chandler sophisticated spur acknowledge laurence textbook arabs bother xx rewritten sealed verde landscapes curved reilly spatial commissions databases approximate inheritance counterpart cease ugly fraction traits rivera transaction decay lineage substances sued guam ensuring syndicated fencing zhou finalists mir electrons § avant turbine install majesty zürich outlet eduardo vince mint brotherhood iconic katie bizarre ign gr swansea announces wheelchair refusing slip shri gore herb rep stripped crescent possibilities syntax genome hale railroads sonny favored jonas hawkins similarity sheets suspicion molecule winters launches tonnes nautical preferences cds casa tricks burlington partnerships confined terminated salvation adjusted bb guaranteed swamp uc wwii overly hanna dove jennings beneficial annexed sutherland gym patriotic mollusk suspects squares ads collectively 
staffordshire eet revenues müller bremen licenses seventy cultivation credible ingredients benny stafford stephens shipped xiii dramatically racer translators beef password modules comeback precious dei computational apologize moist aging garrett dinosaur emblem plague broadcasters wan christchurch pie spiral marcos tensions admiralty orchestral screened ev cement refugee continually chaired madame viewpoint miniseries luigi ieee donation numbering café hare correction jensen crusade talbot developmental brooke extensions cube migrated goa vacation maggie boots submerged ambulance disabilities sanctioned ligue tyne byrne importantly luna flint skeleton burmese kurdish lawn école weaver mai einstein forgot gould crosby livingston titular wreck niagara demons treatments sinking capt vintage perkins gdp pius stereo troubles tina resume unlimited footnote molly robbie attitudes thoroughbred decisive mans boost norse neighbors restoring portrayal ti afford rim procedural assam abundant wendy gran feud jockey breathing alison niger prone whereby emmanuel birthplace doll oaks umbrella elliot bahá statewide edith campuses showcase rotating angles tolerance survives insignia berg combines resistant paula affiliates memorable traders sink concentrate prop fp um hiking mansfield refusal rama paved afb airplane demonstrating violet northampton blamed wordpress globally constable abbreviation ku gentleman cosmic shawn piper banker villagers definite bullets holden lionel demonstrations daniels drinks prosecutor photographed inaccurate blown bollywood irene chamberlain dive sworn anita apartheid ada simplified reject abbott stephanie longitude mack lucia smile classrooms feared injection calcium redskins jill nurses sophia swords pokémon prominently multimedia buchanan declaring proving judging sanction employs sant horace darker emerson mandarin garde ipswich borrowed bred vaughan criticisms ambitious philanthropist discourse rajasthan collar honolulu inherent jamaican fcc 
advent priory inability brigades sunny talents jakarta natalie chin lester macau barbados activism consumed spectators wages undertook silesian inmates homosexuality teamed recruit sensor cp floods thereof weakened twilight lowe nitrogen assassinated harmful denotes sergei kirby displaced occurrence firefox strengthen américa leonardo settling categorized irc brittany lds rite proportional speeches una contributes digit lobby supervisor minorities spectacular skaters bays bean bg blades nyc pitt printer genoa brandenburg residences thumb wikipedian lublin isabel civility sol luftwaffe seated overnight marty segunda kerr butterflies televised hostage lennon fowler labs instructed casual rude cubic commentators soup loch battleship dissertation denominations python extract peasants throat celebrating finn mound mozambique fortified inadequate uzbekistan stained punjabi caste prisons stirling limiting senegal beirut touched repeating sexually wildcats deposit olivia colombo observers hubert basil ou batch nh greenwood breath baton spokesman caution meadows turnout aluminum revisions stein jokes italics bark azerbaijani diabetes flickr framed lesson tenants captains factions conscious alley schneider bert subset bowie geo meditation rue undergo promotions diplomats critique browne succeeding purchasing pastoral liz horns emergence canceled marched damaging trustee resorts taxation miniature conclude dartmouth clarity incredible resided forewings chevrolet qualifications poetic void bosnian cab hereditary passages mater principality increment contests omitted gentlemen nj underway tumor torah catalonia cricinfo memorials warn geoff jimbo locks bees vanderbilt commemorate equilibrium dorsal realism adaptations johannesburg bent nonprofit intentionally becker bs eponymous assumes advertisements humorous shores hoffman enforced chung juniors oman renewal flank okinawa outlined helmet straw grouped downstream tens anticipated incivility steele goodman nutrition havana 
messenger wi robbery cultivated unusually ducks castile screens peaking bowler huang proposing packaging int nu apology trauma tt reyes subjected capped draught whites vols laying extinction troll divisional pulling attribute proto proxy coding canoe gasoline podcast commuter bloc encompasses cpu borderline displaying geometric chaplain tended chord barrow angus leinster costumes domingo cowboy jew ge preparations qualifies sculptors koch sts bedroom prevents minded declining choral interpret grape expeditions airs limerick khorasan haunted offspring xp marxist aluminium permits tomatoes savannah tapes quarry australians disappointed cola crafts irvine curator rbi davenport receivers accidental marseille opted gibbs rand insect sergio combinations appreciation enzymes awkward spanning pressing dioxide theologian progressed receptors encourages corrupt donovan ht hogan sg nominees logging ole evacuated doubled conversations questioning gavin targeting implementing ceylon depiction converts symmetry excavations reuters hercules martín devi sheridan unfair pile muscles reptiles repertoire sorted samurai cassette stripes ak juice proprietary fierce necessity stealing hilton blu giles compelling stomach rushed abusive buttons gram hinduism fortifications reggae homosexual apprentice partnered hussein sells disposal pharmacy synthesizer liability cam dinamo sci contexts cannes willow hbo starr statutory winged concurrency recruiting periodic slower pagan matched defendant specializing crashes pizza evan rwanda mls avengers thornton ghosts exploitation xml ul colt ensuing pulitzer supervised flooded repaired integer instantly mistress securing ethnicities mortar plausible cop activation lyric cypriot assumptions methodology ios gale dominic nominal bud melissa skip hapoel invite rt reside martinez affiliations medina debris successes frustrated engineered arranger » infinity corresponds cretaceous bite nodes bavarian marilyn rumors neighbor suffix fare ordering lifelong 
campeonato fuselage wanna capitalism harlem warwickshire delivering pigs centennial rituals justices joão bolt wakefield infamous automobiles damascus listeners simpler orioles xv dried ke hai assignments porch hymn ⋅ indirect verifiability uruguayan vhs sioux convenience statesman detached practicing chu abs hiatus banning sul paired relied watt kite offence sylvia frigate convey dramas shade trades marian coastline clive lamp camping chips earthquakes assisting indicator bloom emil prayers attorneys napoleonic mortality bruins processed pathway treatise malik res cliffs caliber bordeaux concludes lowell atop relies marino scorecard recognizes heavier prevalent ruined kidney emphasized uncertainty documentaries teddy omega cheng denial sunni granting outreach ordnance neville brush rolled alain mama evaluated ein styled turbo pietro demise hydraulic eminent apologies baxter addiction anders gubernatorial emerge zu solidarity posed maid confluence physiology algebraic brake loire jacket spinning brendan scarlet explorers impacts dante gerry regained granada cemeteries royalty ins confronted boulder announcer inconsistent amphibious spears mariners mutant oracle slim erosion borneo positioned coordinated explanations transformers evelyn profiles beethoven digits usgs raises parole constraints notification vein frankie economists terra bahamas spark kappa potato interred impose acc kathleen unconscious filmmakers cl endless absurd apex evacuation lil mortgage considerations zur suits lagos hm realizing quotation representations pilgrimage truman ashton yields taliban concise remarked olga prostitution flyers prestige contractor xiv expeditionary bubble mauritius deliberate incoming panzer antiquity convenient wines directorate approve gifted wilmington tempo dexter skipped upstream islanders bp perennial kerman rai fulfill api hatch unionist unfortunate immunity snakes nile rm duel assessments bern daisy authentic attachment xvi lawson laureate rupert mccartney 
relativity nikolai cord exclude fencers shipyard entirety emotion cobra sensitivity intimate soils plots holstein investor huntington neighbours retains barons enrique roth psychologist kaiser darling judiciary vicente streaming jeanne replies converting klaus anarchist oxide removes overlooking occupies stella dose zombie canadians dwelling backwards liberia blackpool modeled offline norris serbs harp rental relieved botanist guilt fried alexandre concentrations markings verdict cyril mnm hertfordshire discovering aces shropshire spiders compulsory woo bonnie crack incarnation oldham liam positively mgm everett quincy firmly rode utrecht volcanoes britannica amendments proposition threads triggered humphrey bust fitting discrete peterborough zurich deliveries lumpur wolfe unfinished drill exotic teenagers sauce probation loaned archie javier exports cartridge clifton nagar deluxe commando deportivo feeds optimal rhythmic kernel springer austro dramatists bans tragic credibility resurrection conditional ind gaa usb belize researched prosperity standardized comet pools troubled negotiated tasked markers kenyan archibald lava victorious promo creativity protocols beloved asserts esther objected peel heidelberg chooses groove lantern savoy saunders facial confrontation ladder bohemia thief guyana vanessa halloween sentiment appointments levi rodgers homestead realised plc sainte pune sarajevo announcing dinosaurs dominance precursor laugh financed lars viktor scrutiny sandwich apparatus utterly pornography toulouse tap incredibly alarm cruisers preserving rover granddaughter exams halfway acceleration raced shelley adverse competes intervals nichols induction privilege trombone aforementioned airplay edison eyed magnet martyrs suffers captive codex wool forensic exciting majors surprisingly vowels seller platoon dia fog honorable cornish bt triangular yuri hiroshima selections wash freddie nationale undrafted signatures treason babies persia wilkinson whoever sacked 
glaciers sustainability leaning recognise lumber receptions ballads pillars turret residency reginald doubts zhu une owens lately veterinary guggenheim reputable hector lounge undergoing presenters sacks tara jumped curry goat nightmare burst yokohama behaviors secretariat spans illusion wta icons electro teens romanesque shake elias resist planetary pseudo alba biodiversity shifting bluff anxiety zhao rogue dolls pitching barney sikh suffrage feathers richie marathi habits blend extraction courtyard turf desirable expo bhutan guerrilla esp deadline ea bon exchanges whip farewell cardiac sensible deities replica smiles sympathetic reproductive cousins terrorists acquiring heinz songwriting financially lizard zen remixes exiled negotiate axe flour jade insane pose fry ida goose scientology chiang dom pact garbage regency fulton reorganized sixteenth riga mom xu accompany nw hasan mosaic icc ark subcategories joachim bahn rutgers comma crude taxonomy alcoholic rom caucasus charlton villains eliminating eighty occupying superhero mao baptiste paz sid loads lime raleigh crossover karate strengthened brig dickinson drought delays socially maccabi ajax chargers rejection schooling caucus counterparts reconstructed investigative catcher prev thor amnesty tracked surfaced thirteenth bohemian failures soviets wichita barriers lottery grateful outskirts turtles meadow electromagnetic conan likelihood endowment wiley formatted patriot deacon infrared dioceses specialists julio paso kang bourbon tf promoter nineteen surgeons greco militant gable quoting thatcher weigh parma librarian chairs vc comedians stressed watches undefeated communal tablet palatinate copying flip descriptive shelf upright nursery synod packages unsure disclosure emission visually correlation ideals av disappearance gong derivatives sammy tong possesses adobe deficit premise identities avon vega rosario tko leap transparency everton ri lakers sic debated ruin sophomore retailers malawi payload subdivided 
gf substantive devotion punches audit méxico steadily noon hist inequality lowland strauss informs confession mba stunt monopoly niece surf spells nationalists nathaniel gentle patronage transferring kc playboy penguins transgender eisenhower opus linebacker chatham communion jewelry probe sharma gases malls ro organist pertaining intersections russo candidacy concurrently laguna elevator discs ironically lenin breton patience pedestrian peggy tucson cares issuing nickel verbs catching orchid calculate comprise screenwriters maj manned alexis instituted namesake libyan mentally ir liang quotations oxfordshire townsend cw tear unpublished subordinate heroic shine regain determines onset sounding branding pf recreate adjoining peasant rebuilding infections violinist mongol botswana exclusion magistrate unstable murderer deborah uncivil leicestershire promptly arte alejandro killings orient gymnasts balkan pascal readings venetian vocabulary cum destructive judo signpost translates differing eli cane innocence npr arrows calculation innovations croix abundance cyber pauline ego litigation ua hooks nwa bangladeshi motives coats mia possessions meridian examinations bayern cv satirical reissued wrapped harriet audition enjoys sampling brackets insists newest amor hubbard merry backgrounds fragment nottinghamshire beginnings busch ancestral dalton magnificent lethal banana guerrero usaf friction comparisons madness routledge mutations assassin histoire canals harding conceptual daddy occupational guinness inscribed adler acronym forthcoming carpet felipe hamlets buzz unfamiliar recorder intake unhappy seventeenth fundraising laden byrd homage chiefly fuck cooke hulk grandchildren suppression mae establishes antony natal contention unix conform ernie chandra beard coca usable teammates concord macarthur maltese discretion sorting tottenham bel worcestershire danube garland builders wetlands côte mapped cooled bas temporal misunderstanding boyle moody detained beacon 
coaster baja blah nobles oilers arches examining hazard titan cables psychedelic qaeda forrest realise obligations osborne somali mma reminiscent recruitment flats obligation libertarian weiss corrections wembley debts answering rigid flores enlightenment sect focal fielding abolition gps citadel gravel secretaries oswald noir martyr institut myths anterior sticks nb suppressed analogy golfers labelled zinc beans mclean shrewsbury turbines ourselves textbooks ang tractor typing borne sting pic cents excited speedily scandinavian atari unblocked inlet fairfield pounder minimize substituted chronological satisfaction remedies polynomial butter fourteenth posterior float persuade cyrus wherever om laptop spartak derry rhetoric sunrise equestrian render nhs plantations enthusiasm repository propeller morse stadiums nk maturity outfit inflammatory habsburg bombings schwartz drain nate strasbourg lemon norte brennan ions workforce honourable predict bullying graveyard afro mortal ± nuts visions poisoning combustion commandant enduring mn exceeded por clans tuberculosis warships eddy caldwell eco foul bentley physicists ankara geelong organism beaumont gorge mcgill retrospective nolan procession rb weir panther foremost aragon palermo jaw explores baritone kilkenny annals ceramic pony cornelius detainees neural moor ssr abbas collaborations tidal hui announce calculations congregations unification cartoonist improper panorama dividing nt grouping mural torque hatred productivity dans exempt soundtracks futsal monumental anaheim spends economically wolverhampton spire centro brakes predecessors jays expresses basal versa packed landings categorization accomplish warden wholesale dial asphalt clarified blockade laurie middleton flynn toby mole nicholson cheaper piedmont refrain otters patrons corporal sparks berger jain trolling coliseum aero turnpike historia offerings smell moreno oversaw bamboo lockheed meals charging dal unchanged foo observing setup metallic 
respiratory militants correspond rowers lean proposes sweep meredith purdue nissan calculus steals samantha constructing babylon huddersfield rabbis donor smash putnam drowned hut salzburg devised dillon pressures mountainous rented musée veronica brock galicia pal gus abused famed tiles drift brewing canary humour olympia radial bk córdoba nude currents reservoirs feminism resembling québec transitional straightforward waterford divers shia insee averaging famine willy greens ramsey honoured guangzhou teatro unsuitable metacritic ling summoned indirectly reflections jurisdictions wyatt cfd manifesto shan cadets depictions clicking utilities figured explosives paradox minsk conferred chrome scroll tramway solicitor niche crap lifting expecting doncaster regulate defenses experiencing agf shirts ts marquess undue wax motive hutchinson overturned tango lara strokes infectious reinstated mont pigeon lyons stole daylight fertile stairs patrols updating slender ut botany dignity madhya ideological grip shortage analyses skater clone ravens gu foreigners takeover westward recognizing retrieve traction brewers humboldt alternating lenses opposes pulp salle visibility plata picnic decks czechoslovak concur worms boone lam lagoon soo cruel threatens allocation buffy recovering río affordable presley randomly timor mackay tire chestnut pillar accumulated dt diagnostic dem imho unanimously popularly choreographer simone bernie bags champagne norms düsseldorf musicals complexes endorsement neighbourhoods concurrent hydroelectric carrie mughal monmouth forested mccoy args ignorance squash conductors invasive closes burgess tavern wmf petit reno az racehorse jong kitty reinforcements seahawks workplace offset benz cha racecourse reissue rebuild motorcycles sevens covenant robust dislike minus weimar hoover dolphin conditioning ella investigators glasses bowen hindus handicap ware jurassic amphibians implying postgraduate siding trench spi weighed redevelopment sanchez 
inactivated wishing imaginary revue incidentally hs imam radioactive consultation tipperary tonga adapt lovely erotic hg manipulation belmont farrell thickness discharged torch lois ramos filters damon mongolian employing premature preacher ballots rubin pornographic katrina etymology attracting ambient subdistrict feudal antagonist dare insult diplomacy claudia neglected literal middleweight complaining crushed seniors brunei dots postponed lowered vegetable siberia collects birch syndicate crowds wwf scholarships polite confirms stall shifts wired directive aide theresa biographer gma tissues benton nos marijuana commemorative gnu tuition resemblance lsu gao sundays lac watkins passionate waterfall genealogy discouraged centenary empirical charting bd hq react snyder psychiatry prescribed educate fairfax devastated confronts testified rails westphalia routing exhaust twisted vitamin alvin routinely chromosome mecklenburg weakness weekends puppets nippon jealous brutal absorption shaun kung canonical worm akin viruses odi butcher farther lim disagreed andersen separating excavated eligibility césar weasel graphical itn mock agreeing kara atlantis inductees freak amtrak wien accounted inclusive eliot unrest specials speculative semitic mla dismissal harmonica outlook elegant mast crystals resting climbed dug heirs profound mitch uae depict abel colonists temperate alexa dar enthusiastic cromwell annoying iihf frustration kathy kensington guiding surroundings kidnapping playground pomerania det incorporation cfm españa exported texture fancy tor doris digging cocaine rites bauer erich mainz dwellings spinal ramp socialists semester che unwilling prediction rollback upheld samsung albuquerque ó reconciliation freelance stretching topology neurons assertions retention woodlands standalone cobb halo graphs grange mendoza aquatics lip speculated raphael unprecedented baseman sadly amherst builds researching seeded lyrical colchester gallagher generates wherein amos 
pitchfork adopting scarborough quasi northamptonshire cooked optimization vacancy aggression dressing contingent sympathy lea juliet emperors staging df paternal principally schleswig fresno clever suzanne ee uncovered prolonged disappointing liaison polling abd sunlight tyrone syed compressed humid assyrian touching gravitational accession tutor darlington tar complain geologic singaporean integrate ing pioneered sar execute sedan antique rf morales choi disappear stocks surplus furious buccaneers mutation ghetto satire vp velvet astronaut gaps concacaf punt ljubljana cfl archaeologists irwin pad autobiographical yukon interceptions instrumentation rockefeller interception captained shining spokesperson toilet pol orchard rutherford rfd kramer romney nas advocating pueblo nuremberg flavor hypothetical ‎ finch grammatical knots remotely utilize divinity fixtures invest straits jumps retreated bacterial théâtre shy buckinghamshire sai sino amid aiming surveyed misuse continents refined solitary spectral desmond odyssey hiring mysteries phosphate bombed wesleyan imprint caledonia exploded portals darts animalia cancellation autism knoxville peacock syllable pianists depths michele shipbuilding sleeve cumbria quo theologians reigning pamela montevideo andrei assemblies stanton tones saddle disturbed blessing inevitable reprise selecting iaaf portray jasper eaton fb pits hanson mcmahon robbins vine sparta espionage fifteenth poznań clown paddy troupe relying yankee vaccine welch captivity packet replay qi boiler belly iphone excerpt competent nightclub symposium jewel generous statutes entertaining odessa cockpit nets bucks detailing headline tremendous mailing hicks fiat alessandro alec centred stretches clashes leiden damn surveyor paterson yong aristotle dáil tent nouns atkinson persona mig distributions playable nuns rotary angular foley slaughter switches rejoined distress ariel corpse peripheral accelerated prasad fixture voluntarily accord conscience ass daytona 
accountability novi burnett coconut giorgio drilling khz anniversaries travelers dominate lazy ordinance semifinal piston cody gómez bravo crete bravery theorists novgorod analytical inventions extracted metabolism provence stud stratford bella recruits countless posthumous originates amir morality fife tombs credentials proclamation sahara presided papa rus boring oceans ismail intercontinental cain choke ahl compass freeze profitable haifa southbound reeves gmt elaine compton blonde sultanate curtain deposited ot royce dispatched ud submissions crossings operatic buckley golfer vita mirza fra termination hitter burkina reliance superseded propelled liquor blackwell ciudad flexibility arbor monastic pe adjective boer wicked hewitt bilingual constance bleeding perez vilnius loser fond lasts stranded bottles monkeys sheila exchanged participates reel kicks invites bureaucrat relics washed mx ew jessie blunt olsen sims hk skinner canoeists elm resonance faso declares franchises kurdistan coffin sights italians bothered recipe alright elephants greenhouse automation hampson cascade forge aquarium romero tsar disciples donnell specialised cutter sustain scream pavel approx ratified generalized activate processors garner satisfies northbound andes shareholders evergreen kicking killers postseason meteorological digest handles mclaren subscribers sparrow marin dynasties shankar mat wally primetime snowman grapes crusaders boroughs underworld headmaster ravi substrate cheltenham melodies mankind prompting spies tuning insulting creed senses specializes mona reorganization confederacy stockton accessories supportive programmer swami torpedoes motif itf cortex epidemic ambrose za unsigned lyricist courtney edo mustafa shrub germain whales française encoding concluding crossroads consolidation calcio willem telecom goldman briggs alonso sumatra anchored kapoor boycott museo forks consulate firearm banjo frogs pork contemporaries emphasize arises kazan surpassed inverse reddy 
colonization assured obliged eruption analogous friedman ideally exits keller remark nad fiddle horrible economies entrants pasadena fungus escaping scanned libretto benin ecosystem navies portrays joanna usda graffiti mystic obstacles fda bing blanked ants reddish barlow lent deeds doe plugin futures horton brasil cannabis serv entertainer glance soloist repetition sparked verona attracts armenians coupe wit chrysler hobby jaime merlin sindh wight cropped lama connector mccain tm magician guangdong wizards advertised mediator burger bernardino catalyst glacial rhodesia tbilisi protestants hindwings stretched gossip metropolis beatrice undercover authoritative díaz cannons naturalist glider illegitimate juventus xxx disciplinary occupations ty internally sheer arithmetic spokane newcomer sami earnings programmed aba ns overthrow allah rancho dump merchandise patches humble shelby avery denote worded javascript panchayat padres unverifiable rewarded presentations hurdles versailles generators happily dungeons seville systemic daly haitian patented gig renovations stellar med mates sans convinces strengthening porsche undertake skyscrapers buckingham diaries arrests wilde mandated adjust immense rot kv hungry fremantle tna midst sgt waterfront celestial levine combatant nicola minh engined exceptionally sheldon halted playback giro hee poe proponents inauguration bind needle courier excavation spurs exodus quad climax potassium ascent volkswagen lydia reprint connell lattice unicode godfrey rossi gonzalez prospects decreasing rains hymns qf admired orion pledge modernist blacklist monitors academies derive shit undergone garfield vishnu evidently fest cbn overhaul flawed cynthia degradation bracket pray tex utilizing abe pam artery appalachian plagiarism leopard piers sensory evidenced bunker spherical regret ulrich socio vickers supermarket customary malone weights elders tornadoes corey vacated charm petrol blanc ik signaling buffer melting sensation subcommittee 
finances caracas vernacular regimental judoka psychic gundam denny fatalities zach ju burgundy exemption otago simultaneous unite eager composing rothschild booker weighing calais hint punished crying bunny spl libel antioch gangs castillo arrondissement appoint urge palestinians favoured hernández backward ambiguity approximation grocery restrict cyrillic shoulders harley dealers diminished unopposed ret surge reservations bald seminars rudolph vijay wagons devastating remind bn tallinn praising campaigned nasty pants fleeing analyzed apocalypse archaeologist grief dispersed allegheny consulted hydro legislators staircase bernstein bundle commencement textual prospective moose chancel consuming minas consonant nun variously melodic knot null tsunami adventist defendants protested valves brewer barred ruiz weekday ordination rpg hillary spun racehorses erin balkans prep ville yiddish entrepreneurs crimean sq intersects welterweight bratislava mushroom mosques humidity alicia emilio dixie © len gradual trash litre chasing ponds greenville midi convex rejects seminar cart russ insist stationary toni larsen orchestras bandwidth seize plato auf tc fibers interfere punctuation clair collingwood tn portraying imports gradient respects gregg philips megan quiz alterations howell guardians highlighting tasmanian mf surround ol loops symphonic hospitality rae intellectuals junk cod bf winding sb estuary discount axle reliably chun willingness neoclassical painful amelia hussain exhausted responds provost luca umpire indiscriminate ngo behave podium quentin jakob pneumonia lao ethan committing eliza deficiency coherent rudy mantle woodward sac julien copyediting liberties therapeutic arising spill estádio semantic chloride confront vanguard vendors baptism rv famously planting valle í bonn claus mono intends penal lips opt eurasian ramon maxim zion uploading alfredo ia gs policeman treasures ceramics africans embrace culminating bliss wonders bowls universally characterised 
playhouse goldberg caretaker guadalajara archaic risen ranged terminals campaigning pedal wen managerial immortal marrying suppress cambridgeshire rappers deposed mistakenly recycling intentional bei fishermen alloy malmö assurance lan stevie marne contractors spine maximilian gala glucose parallels awesome migrants quaker zionist detract palazzo doping ramsay js ministerial blanche moran crab butt canadiens hanged electrified burt ambush flotilla equatorial moderately minors subsidiaries conflicting herd luc serb shea collier nickelodeon apostles sherwood conducts archery cyclones oceanic potomac conversely captures shootout eton loc adb blanket paraphrasing expose schooner departing lbs drv awaiting disguise brenda nora beams connie como hayden ld wehrmacht warbler rhineland advert toe astros ditch polymer yorker subsection thanksgiving transverse houghton neutron morphology mythological cho locke modelling bois façade leafs piracy evolve compliant fulham superstar mechanic perimeter exceeding hmm franciscan detector arrange wires vertex bethlehem wharf gi carmel medication infants auguste broadband bali rift henrik delivers mitsubishi leak nme sharply formulation bisexual sichuan sincerely bricks kendall countdown supplier plea affinity het fw replaces fined walters analysts kyiv marketplace kits lamps lviv boyer preserves expulsion favourable biologist debbie stephenson tanker domination margins skate herring disrupt worthwhile ffff steward proceeding jacqueline cindy watford theodor restructuring mysore baronets diver philipp disguised gmbh accuse convergence prophecy nuevo wills outfielder sanitation tortured luton govt ankle backlog coil collaborate cinematographer undisclosed demos predator tops livery coefficient sentinel recalls alphabetical inserting ponce defences volunteered wilkes overlooked vogue leaked middlesbrough torpedoed soyuz janata milestone imposing shades deed freud campo rodrigo redesigned gwen masonic summarize jstor monterey denise spear 
zoe graf dev fertility carla vertices successors pleaded ventura sins mastered culminated expectation asteroids wat prima serpent stepping farmland fixes viaduct christoph initiate remixed dunedin grenade ao analyze satan français folding earls christi flux invaders nail modular squirrel offences sloan boilers liturgical ballroom vida scenarios tablets martins neon trader tails saxe lamar thessaloniki dictatorship sperm differentiate conjecture taft mckay melville kris mating ridges tabloid northward decreases battleships descending polk announcements hara supplying état hears otis milano nikki pearce proton weaker rainy diffusion clarkson bordering hostilities awakening sherlock tyson vengeance doi pont slalom comune bowman sack leroy elk applicants mister nobleman hamas vectors disagreements fs thats freezing mounting quintet baronetage counseling khmer beaux fascism reproduce andrés walled costly jurist malaya gerhard executions flagged foil jammu urgent cerebral tajikistan typo colorful whig deception mariana hooker akron crimea neolithic narrated viva fia krai feasible immigrated canvassing qualifiers awful hübner engraved coke conquer introductory raaf hazardous certificates directorial hume dl practitioner disused periodically ferries pathways abuses scrap meaningless anand docks illustrating falkland shale annex whistle glamorgan isa aft creations sms jaguar hazel shu sellers vaudeville tenant willard boca uni nagoya ransom stokes redirecting curiosity disqualified emerald fars shear nokia interfaces pereira genuinely chalk pest steamer illegally guillaume mixtape compelled decimal ascension technician wasted denying melanie mutiny hind impaired unidentified openings michaels donetsk ching confirming presiding motifs defects urging capsule buyers trailing gomez astronomers clues disciple jared apostle grossing compiler jackets obe gan acquitted magna nan preface ensign uh dracula mandolin patton turkic naomi unmarried snooker lena annexation wasting yen 
tying dull concession valerie amiga donors purge algae jesuits sinatra disastrous pathology containers airlift jpeg blanco rory handsome dvds kabul archbishops rip brigham ginger bangor registers pembroke diagrams disappointment champ indy nicely unexpectedly uncomfortable kahn caring cinemas summarized postage nut peculiar loyola equals ww authorship obsessed veracruz tunisian escorted wavelength spawned relocation headlines squads colon fist frigates insults guillermo avalanche unpopular dickens deported asa fulfilled retaliation miner lol grounded yin settler cbe segregation exercised substitution kelley incidence kinetic bernhard fearing blurb skeptical hereford zheng benedictine hz nairobi sinai gypsy falsely optics touches tanner hitchcock manifold nests moe dependence pixels yves prefers xiao denounced gymnast mop helm eduard bis vie pilgrims merrill bail rigorous sha gem circulated saul duffy totals fashioned landfall ramps hyun offended rockies waltz medicinal epa ambition disturbing pardon moot linguist strangers camille tb uninhabited beverages vila lend tandem semiconductor palaces od nomenclature browning muse silesia antigua isis tires simplicity fuels interdisciplinary fluent barony swindon padma bounds hostility gabon theoretically bankrupt masked poole maud mohan ritchie insurgency moisture correctional mckenzie burnley sermon venom tha elton capitalist dominique compilations ramón priesthood awb geologist revive whitish veins sarawak organizer tehsil galloway bengals referees sud patel tripoli protects cantonese wr sulfur eccentric qc pba vivian desires pak fatty scorers feng expiry protectorate ottomans tobias adaptive federico nike emanuel manners tuscany documenting tao scripture rusty mediated shout stronghold spray eastward rhythms rooted pixel tile ornamental intercepted suns webber cis josephine preaching distortion roofs hail evaluating bayer culturally paradigm dewey indicators distinguishing fg markup paranormal compatibility grasp 
kyrgyzstan hardcover atoll amalgamated ensured mythical rufus atheist warns masterpiece mis booklet montréal postwar île heats notoriety ortiz lever symmetric doo blowing lobbying exploit ib choreography saloon thieves sabbath uhf zeppelin bernardo mv mystical société fundamentally qb attested metaphor chesapeake lokomotiv faa salvage oasis beverage sufi stefano galaxies shelters anchorage upwards reminded sexy threaten quantitative guessing parentheses instituto hutton lai treats rink arid trams hailed washing stony skies barrister flourished vampires gum bathroom bartlett benfica crowded harmonic psychiatrist guido fas qin ethel glossary cavity aziz forgive sardinia transylvania stadio dai reggie repetitive uncopyrighted dismantled currie miracles roc fam moines reassigned pumps kindly sniper pod intercollegiate kin obsession ezra ninety thy gertrude guthrie lola anthropologist goodwin blanking hellenic hairs mutually harrington parkinson sums hormone audrey gut archers drummond aperture goalie digitally misconduct mammal knowles spotlight seldom spice galerie assistants fitzroy ic outlaw cougars harald genetically rotor mas splits peabody encouragement instability drafts periodical multinational anhalt rayon sylvester archival mil claudio witches onward tomas destroys apples arenas medallists sabah motorsports napier lucius oxidation lighthouses realms vargas headings pulls grazing commentaries resisted emails dictator croydon enthusiasts montenegrin periodicals commitments laughing efficiently tk negatively ames unavailable reluctantly usl predictions preferably precedence clergyman potatoes debating costello libre opener screenplays frederic offenders ars announcers lede reminds sweeping fore psi sooner transports nil antrim kilda purchases stalking protagonists cigarette académie stamford racers clinics upgrades tl snap dunes griffiths mca chick recipes ghanaian initiation ballistic appealing eh barrels roche inspire satisfying attic attain consult tuned ala 
matthias chesterfield viceroy disturbance besieged tau lauderdale dumb sawyer tacoma holloway maldives vuelta langley barnett lightly slater liège cassidy jaguars ks existent dart boiling ferreira cullen browsers insertion dortmund macintosh undated lille packs université chittagong resolving reproduced glover millionaire synonymous dion organizers urine sicilian influx pets noticeable mer beckett fukuoka nanjing pledged wes buyer mal stripe envelope rosenberg overlapping trenton bestowed faber consonants richest neptune barr khuzestan characterization tolkien forged nero cecilia edible dice asserting breeders separates skier mausoleum monty reelection yearbook shafts masks faculties encompassing dismiss santana swallow clint prevailing transcript enjoying massacres ensembles malaria oro staple telangana fender trait lange outdated contamination cska differentiation advisors gilles downloads grains psychologists tow axel wt tattoo siena depressed cass rowland lund hearings rosemary parrot adhere lindsey kemp ryder peninsular blaze limbs furnace sergey fools phelps dickson slovene pretend erect rainforest enclosure analogue legitimacy tirana recession affection ska ef shipwrecks aesthetics hayward aol waited np tito djs mag perpetual swap adjustment bertrand navigate fairs mourning mounts steiner fanny postcode draper fortunes cancel hides spartans sears fullback lal lex stimulus tactic presume cabaret thou transforming confiscated undertaking canopy inverted graeme drained withdrawing titanic airfields gaston engraving wonderland spontaneous warranted spirituality dharma ying propagation textiles olds gesture alumnus kamen scandinavia bonaparte repeats undoubtedly knowledgeable reconsider magnum richter clemson parry nfc grandparents miriam pontifical diocesan harmless dictionaries mart fumble gettysburg bey shortest cylindrical tiffany physiological safari screaming centimeters faults owed proliferation limb alliances malicious farmhouse admissions commodity 
intending ndp inputs abdomen discarded félix impulse stricken crowley jiang penned vineyard businessmen yielded rationales saxophonist kobe arbitrator louisa admirals texans orthodoxy dirk chattanooga creole drafting garry bloomberg fuji cummings gothenburg pamphlet patty stiff marries honneur scheduling cheek bucket flaws vapor overturn byu protector carleton woodstock lastly geographically freiburg ufo prelude cory lynx liechtenstein examiner sharpe atkins ffa blogger priorities akbar meg airbus isil translating homicide avid sanford heels discographies levin lau shotgun emigration slated homework fascinating casualty guernsey populous concealed jumper diaz waived techno lending theorist compose lively relieve masonry armistice camel revolves edt waterfalls til madeleine titus catering delicate quietly glorious redemption injunction isfahan nana liturgy cosmos togo baroness exploited improves fig chant quran derrick chairperson trance elmer respectable trophies bari dangers haryana taekwondo microwave morrow allegation ras assessing insights gangster viewpoints yunnan danielle marshes io msg dino bishopric deserted internationale pricing cz heron unmanned avatar yates aleksandr walnut marguerite seneca nrl confidential interpreter navarre remembrance gemini torino pfc chords fireworks acquisitions scaled scanning compromised pointer pitches dye oversee betrayed serena readable unreasonable petersen gdańsk gardiner convictions collaborators craters entrusted satisfactory emilia coincidence susceptible industrialist lawsuits feather competence nasal roe metadata elevations denoted dyer transporting coupling norwood kiel elbow hats understands forecast ample dispatch traps franks thistle pb partisans duff billie heavenly huskies katz instructors accessibility robotics lausanne perpendicular brains plaster rumours knesset buster trusts token guantanamo brest coma preferable zeller opp simplest centralized gee fernandez goalkeepers barnard submitting cathy believers 
prototypes pops raped collaborating cheyenne heroine citrus timely empires salford zoom encyclopaedic bilbao dissent aground inclination intervene gail cairns murdoch commemorated vows slayer interacting siberian vinci rowan oliveira baylor wilder boise keynes ridden dragged cerro excel jeremiah addison inventors huron celebrates gators frontal murals denies sharif harrisburg evolving installment wai energetic bafta craven prepares palais provoked popularized monsoon mara attendees berth assure safer bismarck whitman matilda weed sails mellon surfing biotechnology cary queue scattering haul subgroup arturo consume yeshiva erwin coordinating carolyn hartley bournemouth mata genealogical torre formulated authorised soda rendition suriname prefect insurgents agg sinister rec sim phyllis parental reminder rp fishes seaside marlborough cy rhys dodd nails cylinders browse immaculate sounded intensified accordion embraced joker hendrix elector voyager ita juno virgil grab pilgrim strawberry bounty vicious towed collaborator sabre celtics remarkably disclosed decca uniquely synchronized microphone fang dhabi fracture colliery brethren maze comparatively iberian asleep succeeds ‘ sprinter conceded hidalgo hack johor hum unbeaten owls congregational pentagon categorize rests continuum computation schumacher sas freddy liable somme kangaroo mlas qui taxonomic yerevan barnsley augmented westwood ari galactic superiority ioc heck fiona chiba emotionally illuminated kidd interventions slade ale absorb vain robotic staten prevalence braun selo shandong thorpe wolverines hints tug lied cambodian commemorating knives zeus edmond bluegrass enrico zaragoza averages clerks sax concordia appendix familia baird otter spw gillespie seminal nf rana wrap mead casablanca qur babe coincide penis mckinley verdi severity complementary superliga veto accountant theo charming colbert ineffective rushes sui criticizing donkey roadway encryption puzzles misunderstood hokkaido worries hairy serra 
unemployed jsp retailer hotspur anal haynes bartholomew untitled wooded judah deco entropy helens abnormal analytic reinforce sonia romani vt blows cows clutch gupta bolivian ramírez manipulate knighted bahia sliding shower ipa everest audi shetland cooler outlines orbits purity hawthorn marianne skyline ignorant emerges implication policing mariano hoc turkmenistan limitation prosecutors weaving transforms aubrey peck busiest wikipedias prosperous rewards precinct bu novella wikia diagonal bowled sbs authenticity journeys detectives kinda basics putin aviator churchyard alderman culinary rosie luzon connectivity ava panda bankers prescott entrepreneurship destined goaltender biomedical doha morley outing newcomers schedules sire samson cheryl liberalism caucasian dolly flu traverse ceded pieter coasts grossed foothills collided tricky wb envoy seizure erupted sweeney contra disrupted morale enhancing caravan fortunately barra disappears bahadur presses independents rack reactors designations printers algiers lehigh tam wexford fibre tory navarro chimney yellowish romano modernization schultz robson lyndon wandering rowe tutorial ignacio iq distributing vertically remastered alta streetcar gloves impressions invade warrants imminent reese rematch unitary mei sampled cinematic federally volvo kosmos halle hernandez refurbished ineligible mayoral rhyme suppliers esperanto wentworth javelin herrera landowners cooperate heroin talmud alsace ely wee royalist melvin chico divides incentive constructions chili kern landslide cochrane compensate deposition flyer gina terence slots weightlifting paraguayan vh donate marius fins corbett jihad ate marquette orphans serge mecca wrongly muller sap palatine goats borussia handy upward analyzing cheung landowner refinery indexed manson ethanol recognizable runaway corona synopsis cebu johnstone tightly consoles crewe indicative informing jens vance scare coincided mari bloomington scared nara jargon scandals saddam neglect wnba 
moldovan mcgraw zoology spanned confuse guerre slowed metz drowning bsc pasted engagements connolly edged communicating mcdonnell fonts elongated picasso nueva disagrees certifications regeneration futebol tristan mercenaries telenovela nikola wrist motions hornets awarding beyoncé dreaming inevitably scar mein dora lombardy rochdale prostitute dk pistols implicit saturdays hygiene armagh emphasizes maclean orléans coptic alpes authorization amenities yun thorn owning bureaucracy articulated timed bedfordshire aleppo deprived invitational banners dmitry transmit compartment grimsby natasha massey amc runoff highlanders weighted abbr endowed sabotage wasp wedge ulysses pins fir iss evenings augsburg roach carriages mold saskatoon improvised alzheimer shoreline censored oaxaca münchen excerpts originate delight galileo tendencies exploits haas avenues flanked hostages invasions discourage positioning symbolism chased gaga foreman contender bison encyclopedias drummers ecumenical hazards disregard ballard injuring discus contaminated hilary barangay bios clocks bandits styling remembers mcgee delist lego systematically initials learnt astronauts manfred sermons westbound messiah nationalities invading cookie confer repertory lansing preseason kaplan coated boogie belts fx wrocław curated epithet jarvis impacted milford junta petra drastically harcourt akira calhoun amour slash obstacle repealed coded sickness stm neumann explosions eileen sensing imagined proponent trojan cher stockport oricon jewels insisting neuroscience bids gotta duplication condensed negotiation realization inviting cathedrals doherty epoch sociological département populace seychelles prc laundry catchment bourne fragile nes pickup oblique youths cabal informally breakup rye investing organising proportions rees beg prompt anatolia kicker intercity eastbound legged adjutant elgin olson zeta blew princely fern prohibit warrington cristina blitz strains exaggerated truce rods oclc induce thee 
legislator suez mortimer vandalized carver metabolic ong integers cavaliers pennant markus confessed revisited meiji conveyed shareholder shaping huffington mainline memorandum overland accumulation peach certainty migrant efficacy listener mussolini congestion upside amounted oils compares nyt khalid chinatown inspiring remnant striped pumping göttingen sumo wilkins mixes broker navajo faroe vascular knocking admittedly inflicted hua possessing unconstitutional ue spruce enforcing fairness paramilitary yds accuses repression yamaha abbreviations jiangsu backdrop shreveport feeder stout benito ensg gareth pivotal rolf densely pharaoh dynamite astrology predominant chichester italianate rehearsal evasion jing liberated drains remembering medicines doubtful classmates spheres requiem startup insistence ernesto giacomo topical pretoria idiot longitudinal ratios helium fargo penetration conserved saharan sediment licensee sundance ecosystems yvonne convened legislatures maharaja routed electors biochemistry rodeo exposing queer easton damian differed handel commemoration revolutionaries lore ur eugène bicycles ellington lookout interpreting ek gull sleeper medici trimmed replication hyundai auspices moto facilitated arroyo hacker sour ob alfa idf butte starters downloadable inverness traveller sociologist maureen andorra hunted heel suites annapolis damien isp gunn scarce proteam rosen consultants katy plagued controllers energies linguists badges houten ness cruft grassland vineyards bikes ccc smuggling narration ensures dane puebla marta bully martian donegal dyke raided calder contradictory lenny exceeds salesman artworks betrayal doubling olympiad distributors cappella abruptly negotiating pencil prescription statehood weekdays novelty directional bureaucrats wolverine heal contempt reversion spectator ger fairbanks anyways professions soto gustavo rebranded joints grandmaster bharatiya antiquities sacrifices rashid nutrients sudanese joaquin witchcraft lithium 
ironic nepalese durban defect conspicuous regents spaced musa maynard unwanted ich martina pubs anchors sighted warship chaplin dow marjorie foundry johansson muir feminists travellers fuse nicknames opium guiana shrines groundbreaking benefited amplifier munro framing supplemented christensen bromwich nouveau foreseeable undid yusuf kildare wrath baptized shepard silla goldsmith drone novo underwood quota karma vis mcc ogden cigarettes kamal geared confessions lieu jazeera skeletal yr cartridges gardening checklist bog dietrich rhône judd globalization sal downward polly evangelist diverted launcher euros engages stereotypes hedge forster jae navigator polished fabricius implicated xl monde delle pk henley kawasaki domenico whitehead uv hannover disadvantage dietary mesh lucknow pulmonary inadvertently counselor compiling rig fisherman kathryn stephan vojvodina anglia louvre gregorian kat essendon foreground cant vu thom posing marches brentford monterrey meps louie bidding pensions scenery arched almanac siemens jewellery wallis battling mateo spamming unspecified gibbons potsdam gladstone caspian revoked fatigue ensued toro hash hooper hopper quay souza govern selangor claremont assign slipped brandt bose pluto balancing highschool aristocratic guerra goddard windmill iata dept homo archeological wolff adjunct perceive projecting montane stylistic carving humanist sahib vanuatu gerais mcgrath carlson alf sachs usher strengths waterways indications sesame leary meritorious breathe scribe vastly linden palma clad amidst barley sportsmen sloop spartan meteor balcony bored cute gospels ayrshire neighbour gigs pinto jailed acacia retarget leach asiatic nantes reefs mandir refueling glee coefficients onion maize gogh purse lsp amman sequels canons annotated lambda fortification toxicity dependency conversions emir sirius wetland auditor goethe cottages baths punish abby distinctions armand ecuadorian norma doctrines entrances sewage axes pretending subcategory idle 
gems nehru pistons shocking nebula laval lungs manuals aztec vendor honoring curb incentives manifest whiskey gallantry salaries graded biennial marcelo unitarian bakery traveler tanaka webs trainers zombies bai contradiction dude eritrea sexes manny distorted sudbury retro restart manly köln orphanage jericho newscast aquino nesting ribs jumpers underside disclose penang saratoga convict recommends narratives purported stables jürgen yao brownish nair takeoff chávez comets heiress wo stargate gently admitting kermanshah uppsala cypress zhejiang restrictive economical republics integrating nico instructional occupants soc pawn secession tyrol avoids tian grady anarchism attends downhill guarded natalia sparse ambitions syllables keyboardist hungarians ignores jc hated bathurst macon psycho hanoi superb buys hodges hl disposition suffice rooney chloe hearted levant foundered unto bala ritter banquet haley clade kan turnover lazio ventilation gears lifts shelton mccormick mcbride jang snout rotherham nemesis reconcile coating teahouse lombard alvarez expenditure aviators toes parc comedic walden outs canucks supervisors cheating harvesting olaf breuning carthage paraná gottfried examines reputed dunbar elijah fordham idols barge rihanna cartel angered antilles radcliffe capacities fueled woodrow mustang overseeing organizes wah asturias beaufort palo ama quezon darius motown westchester trenches dcc contraction shrimp dungeon supremacy crust scriptures postmaster tal roanoke covert yeast mongols maxi glove gilmore muscular ste weighs deux ewing commissioning clauses dune fictitious razavi storylines coa frankenstein burundi bearer raul stylized pines prakash rani dm cleaner indicted kinase dupont guadalupe progressively landlord fleetwood migratory rejecting guise brewster nomadic caliph calculating configurations jar readiness insulin dagger goalscorers ren sasha perfection ursula novak wrexham spaniards orkney donaldson tata auschwitz halftime transitions testify 
trolley tae earle konstantin oyster severed img reversal concessions sampson pisa mcleod trojans paige twinkle antisemitism comp practised caf lining suicides collateral xinjiang rugged fy blizzard esteem stimulation pity inherit zack cadillac frédéric tokugawa twain shen photon munitions incompatible trolls toad revolver prevailed synonyms ingredient enhancement overwhelmingly légion als mounds regis login bogotá msn fungal aryan revolutions bonding sven nicky packing transformations esq tastes keel installing bl fis vivid plaintiff decomposition groom phonetic synth censor über tianjin hari topographic bwv davey expands occurrences relaxed outward reg crafted americana implementations experimentation argyll burr konrad gazetteer basins darrell localized deploy lineman messaging spence fei locker badger gearbox palin cumulative spellings loyalist wharton peoria knees mannheim yd ser argent cradle mclaughlin confesses detrimental clockwise prosecuted locus regulars wakes linkedin jordanian kaufman domesday disliked viet rust notions porcelain semantics southend eclectic cared horseshoe ci topological macleod bloomfield longevity variance highness guo harrow ebert insignificant transplant lib anticipation chao lg terre voyages sumner razor veil luciano arcadia formidable tides argonauts tick notwithstanding unpleasant cantata columbian netball chevalier policemen animator weddings quartz dumont abdominal renew affirmed computed primaries laureates filmfare clones flashback utilizes traumatic outdoors hoffmann construed jesús pushes riaa locking supplementary agrarian blossom octave jude ptolemy queries ropes albrecht haydn cdc grenada fade aspen roi deteriorated egyptians boasts suárez lifeboat groningen sevilla hybrids babu depart skins burrows paisley terminate ghent reigned usernames lowering desperately seismic eastenders pow madden crocodile abrams interiors dent marlins betting hagen republished reelected chong fiesta projections muddy invertebrates 
paleontology novice rower accolades prologue cinderella cyclic amalgamation berwick blatantly freedoms transmissions organise reflective stabbed simulcast reformer denton oppression foam monograph gentry chemists gabrielle dresses lectured maneuver nerves adulthood bray milne daring hamid utter herbs harness cleopatra brno emancipation phoebe reactive christy reset bianca implements correcting fugitive cicero bono synagogues invariant mindanao pleistocene attackers viz hebei defamation relocate superheroes optic dowager fuzzy specifies anthologies amin poultry helmut comedies beech fivb lori northernmost simulator ding southernmost apprenticeship msc raceway khyber pensacola royale macro insider righteous mirage sweat banda phrasing understandable rutland inuit tumors cavendish voodoo pun gambia thematic altering pyrénées saitama reacted sint tulane boiled regulating progresses warmer usefulness intrinsic stainless lulu vegetarian tracts alam kissing annoyed raúl shattered noaa dusty lilly interestingly foreword ives seo neurological vibration essayist poisoned invoked frontman archdeacon renal jiu triassic priced internacional fabian trailers lillian poses uneven turrets harlan cruelty storytelling virtues gorilla rochelle gui auditions oboe remarried bounce radicals incidental sonora guarantees orchids myrtle charters superficial patti volga hilda hamburger multitude dire aeronautical programmers quicker clinched att tnt packard bubbles stunning garth burroughs crypt rewriting gonzales restless overwhelmed miocene fused granville influenza tomato nagasaki medications engraver distinctly waist salvatore tunis puppetry fremont semitism recycled commendation scorpion damned encore inhabit shapiro argyle bingham interrogation gamble bridget regulator advertise classifications butch thriving wiener jalisco administer henan workplaces newbie librarians ub filmmaking easiest ngos welles devote shrubs attendant aerodrome follower methyl grayson vaughn bodyguard byte 
splash peña classify lieutenants bridgeport technicians manufactures lennox swans canoeing planck eastwood formulas staffed directs gérard envisioned musik repeal mach mori rabbits prostitutes compassion labeling synthesizers hen delgado bosch cue dh kaye fielded hawker barrie hawke ithaca cornerstone incapable eureka anastasia cayman barnet fronts clippers napoli deviation quakers keepers mutants peng rum hahn spacing britannia muñoz parasite topography conglomerate amusing outflow offender waller mabel intercept iroquois perceptions nic honesty faulkner mined cluj blazers abide lpga pontiac abusing turmoil rhino kilometre packaged trois aspiring inhibitors barrage piazza truncated trondheim capitalized busan phased dank outlaws pronouns ignition evade buddhists kobayashi woven mute fai irony cabinets persisted gc potent subsidies gin nuclei procurement eintracht pictorial maroon prem inexperienced hid designate eats macquarie booking adherents icf hove caliphate ox tolerant aristocracy plumage claw backstroke migrate tilt hillside avalon wasps temper corvette edna chopin glendale chaotic assaulted mahmoud devotees padua matrices dilemma fide eine plum attacker googling pertinent bourgeois mani graz mosquito euclidean cub echoes misses assemble ethernet bait scholastic dip schubert mauritania lev crisp totaling multiplication larson breaststroke chefs suspicions ngc groves ingram adriatic knicks outpost darmstadt rhymes commodities fashionable sediments punitive léon skipper irina grassroots sticking blaine capitalization preached sheppard magistrates nadia explanatory mina tensor signalling euroleague estimation ivanov keystone imitation biennale salamanca islamabad connacht converse bradshaw unseen daryl mbe adolescent skyscraper montpellier gag dormant vanished partizan eastman nunavut attach volatile caleb moniker cardiovascular nec reza melt disks pri broughton dx zoological bodied portage supermarkets assassins rn earnest cosmology amar seaman ejected mandal 
scrub phylum tyre havilland gotham brabant premiers fay skopje decker mermaid outspoken comrades karim climbs archiving slain amplitude appellate fishery bragg excitement horne salute inflammation málaga surrounds friars backbone petals pegasus moselle aisle hobbs armando kharkiv trafford ridley verge altitudes hates midfield contracting cocktail outsiders experimented maguire bard faded alternately unlawful eternity convection kimberley lute huntsville darryl schizophrenia mcdowell grasses ugandan pollard prophets quito truss outsider cambrian newmarket hound staples narayan illustrators drury barclay preferring faust maha gage alleging thence downing elf octopus interceptor deserved fines someday hangar prohibits beau ebay albans deutsch lucien contrasting hannibal aegean mcguire dil filtering hourly johanna stacy naturalized erica compute evenly terribly palms kickoff withstand naive kylie vase azt dominica azores outgoing rollins internationals mcpherson barre jd bulb crusader spines fielder macy nakamura greenberg goldwyn ode guildford aqueduct rubbish vasco simulated arboretum oleg notch greeted choirs lew biologists taller centric jamal hermitage footsteps wiped booked trier parramatta iain lakshmi moravian peppers versatile mundo pedersen whaling binds dim bazaar featherweight halves trieste jedi luz firefighters causeway magdalena mist arranging galveston comte forcibly parasites isabelle amend fijian fidelity sentencing shenzhen offenses baked nea lure roundabout listened pointe parasitic solvent vested modifying karabakh scotch valuation severn diversion goldstein cas mornings hunan dummy resembled hb kelvin beavers calvert injected artifact manipulated musically italiana dialog fluids slab walsall grams hillsborough lizards moonlight cantor carole saigon telecommunication gunner sj stray brightness molina pseudoscience obey prism impending octagonal universiade sorrow jarrett dolores fronted gunpowder babylonian curvature colonels vip borg torquay 
antibodies cracks sinn heidi yep bergman christophe marko mavericks siam apologized unauthorized daphne mozilla jenna replacements frustrating francesca takahashi passports claudius scent charley bmg susquehanna scam danzig stature gunfire rallies emory dependencies serials drunken stalled clapton compile huber obesity fourier sn infancy hyper palau siegfried candle allowance islamist strikers principals oversees stimuli jai hodge mathews parcel welcoming shouting godfather cuckoo breeze hrs drying mitochondrial retrieval minogue duc syriac sebastián hurley cms emery madeira chihuahua kali bloggers civilizations hezbollah currencies frankish vibrant ingrid sentiments indochina electrification danced acquainted chow originals wren convoys waterway rotated phylogenetic welding husbands vigorous congenital fulfilling tolerate menace eurobasket spectroscopy marek mckenna showdown shrew rebirth gujarati identifiable unprotected strained lyle booster stealth fayette liking hodgson decatur newsweek moshe musique greyhound koreans contacting zulu galician ferris ripley merseyside ostensibly abducted floral kilometer mazda sequential entertainers gaddafi yielding narrower rivalries croats czar coinage ter newtown ghz sousa lynne kepler ● sheriffs reworded mohawk hawthorne laude doses ape nazareth doubleday jess magnesium societal intercourse murdering yamaguchi armada gilan functioned chapels athena canning southport taunton favorites gladys vincenzo clerical disrupting bogus gatherings privileged lst pollock kant warp wcw domino pockets ops disneyland resurrected radiant henrietta mersin sonar acquaintance clancy contrasts appliances socket spiegel harmon sikhs modelled brasileiro muzzle bitch supervising rotate impress pantheon duran ignatius pv lcd astro clermont scary genie despair undermine joanne buff bosses jurisprudence andersson dialogues sabres enfield gastropods cutler rostov congresses jaws grupo goodwill grim trustworthy atrocities cowan kassel telenovelas 
supplements synthesized steamship proprietor grimm publicized props shortlisted superfamily erroneous rouen cheer buena amadeus caledonian guatemalan chateau commended downfall mazandaran liar paddle opéra appropriations hostel mandy vedic jaya gwynedd depended josiah theta medial regimes sticky mallorca vent motel jena karel turing superhuman psalm bal hellenistic exhibiting winery backstage tipped bromley primate historiography discounted rave asst taxon mcintyre bae pause zee garment nikolay catfish reworked markham kisses botanists disclaimer sig sabine defiance aj lucrative surveying rudd oneself ara biz greensboro campos nguyen topping cactus identifier spaceflight unhelpful firth milky mule beasts scrolls teller serum nevis tarzan kindness tempest swinging administratively amateurs blacksmith upstairs deportation bland sincere empowerment motivations mildred boating ♦ awake subsp faulty fran closet symphonies intuitive admiration pepsi alla multiply reuse estrada junctions manpower kei converter avoidance marley nsa massif fucking federer brawl redundancy erasmus offending fairchild untrue dramatist thinkers residues advises tesla housemates protesting circulating forerunner galatasaray rodents soluble zimbabwean colloquially brace newborn tangent nominally ruthless valentin corrosion neue regression equator ict ascending classmate glam groundwater marylebone vulcan nih sibling clutter stacey blackout erskine dade toulon poisonous derivation believer eindhoven accompaniment trajectory apogee batters fallout beers auditioned becky qu anarchy arias hacking calculator heraldry gama marital scripted bender britney leukemia coyote horizons advising thru fitzpatrick eun esoteric omnibus slice drastic laughter glands samoan professorship salts tchaikovsky poured austen spd leftist charismatic dominating lesions southward simeon halfback abstraction cid mammoth slides gainesville healy fallon volta thorne assaults clemens meme fabricated temp meath rumble malacca 
titanium osman grover adele och pinned elsa moors soho accelerate tun bissau copyrights xm claws braga homunculus constantin mustard regal bestseller wessex bolshevik isla karin grenades blackhawks hun steamboat leonid secrecy grading flock fukushima essen breakout prostate tariff exponential yarmouth acknowledging upton dumped trillion slant williamsburg analytics judgments descend inhibitor conservatism merton shortages hierarchical janice reigns scuttled ellie contrasted improperly wilfred sheltered lothian methane abandoning thoughtful bending keane mediate battleground pollen revelations unambiguous dieter ventral leah stresses yoon maru chill salazar expos liszt abduction morphological murderers turnbull cater jealousy ui mandela diva bethel maximus fontaine mahmud specifics curtiss wilcox stakeholders exiles yellowstone radically captives macbeth intricate horseback thrash burgos coe clarendon amd litter israelis sfr summits constabulary residue orr hess keating recounts pendleton waits zelda laird sikkim ceasefire gonzaga stimulate seperate crushing primates emphasizing typed equivalence spree mora incurred permissions unconventional bites sprinters moi legit validation rasmussen humane storing cosmetic showtime bullock eel hampered isotopes abandonment provocative heraldic concentrating msnbc lazarus elastic gideon scully elves vulnerability rockwell relisting larkin hedgehog bryce adjustments authentication crook hearst anwar consecration olympian amassed ina camouflage weaknesses temptation turk cisco fontana sql stuffed lobe aden lays parted pcs squid mesopotamia dogg jamestown swamps seafood shortstop milo rms cantons curly ryu overture kitts ipad crested woodpecker jimmie fabulous endeavour ctv throated newell partridge evicted jonah mcmillan indictment alton wellesley oven atm captions dormitory univ jolly clubhouse ubuntu temperament guarding encompass charcoal indigo havre marxism horizontally anonymously permitting leung lordship zionism lesotho 
tufts salad atheists nestor spared airplanes magnolia willoughby pinch needles duluth raoul breaker tee excelled dustin socking mats eskimos confinement gal vaguely endured unnecessarily messina haired mons indus jasmine moons leyte curt haleakala epsilon gustave manchu pediatric opole constituents yeomanry browsing innsbruck aachen olympique eyre intervened snowy banknotes karlsruhe hyperbolic expenditures orton modulation somerville reeve frescoes frederik isu khalifa cops epsom applicant riddle apocalyptic foliage günther timetable yoruba gorman therapist predictable gutenberg affluent hottest casts transcribed guineas jameson münster peyton federalist kraft haarlem inexpensive peugeot montrose restricting sustaining standardization strata realises derogatory locale ashland mentoring abigail glow ionic radios massimo jst chechen proofs alexei purcell dp cookies telegram fips boarded branched mic deanery harassed pathogens rene demoted breakers criticize sportsman acknowledges longstanding restraint swear oops nils journalistic protestantism sous este orbiting kk outrage commandos flattened ferrer costing lace wilbur blackish veneto retreating deciduous arrogant elle moderator nepali brescia forgiveness cones aspirations wilton maitland kota tore ellison malabar cesar swaziland collapsible westfield fabrication seton realities injustice barron calabria montagu denison syndication csa tp coco seater praying irs correlated aarhus boo kitchener pei nightingale dentistry fujian fencer fresco pairing lucha chopra villiers accents specialising investigates modena cellar dubois hugely sligo hackney paints bellevue isaiah weaponry foxes fortuna tanya mathieu avail spices hangs gland additive angie burnham nexus brit headers vivo ls educating modernism kr kimberly occult parodies harvested sykes overseen menon improvisation wainwright override picturesque kathmandu redesign unión achilles battled cabins thuringia annette ahmedabad stir merritt gershwin unbiased spoon 
thierry lasers magyar bautista oddly helix worldcat supplemental grill baseline ono cautious perl branching evacuate ike readership rockford tubular pursuits imperialism havoc malley tolerated faq executing atheism hauled chicks gillian manx wondered intern worthless sequencing allegro merges shamrock inference cpc sneak medallist antónio arden exemplifies ric peat bop asher pharmaceuticals festivities barrington princesses bargaining reuben bam ringo rumored rendezvous reckless margarita venerable unpaid inferno berber omission judas seminole harassing steamed fabio mercenary grafton tempted nicholls wendell sable intermittent svenska rag theirs castes azad gardener federated coloring dahl spikes gallo billing homogeneous enraged inca bulldog deprecated whereabouts residual fountains pleas hilbert hurst katharine royalties rumor jerzy whichever dissolve duane moritz hemingway zum apes springsteen southland jehovah appleton erika penelope regan puck mcgregor heartland differentiated footprint distinguishes scoreboard eras blanchard cognition lowry masjid panoramic kingsley beforehand adjectives numerals amp yoo microscope balloons aggressively kellogg respondents colloquial serviced showcased umar santander forehead conqueror hyphen emile augusto alameda cruises epstein discretionary petitioned faye gnome kwan instinct itu mechanized pulpit flees handing extracts tashkent viper quarterbacks vassal deutschland crowns losers foolish wheeled almeida substrates penetrate pau reused correspondents townsville stallion ws michelin brill dorchester homme colo plotting horsepower valladolid duplicated raiding hermit cues tung mergers crashing eredivisie adm dns chariot pavement braking bureaucratic gunnar grinding registrar apa sogn imperative rankin nonlinear orchards charlemagne combo pentecostal recaptured beit nasser himalayan transmitting arson singled ellsworth breasted greenfield transatlantic hardin sakura ordo timbers maison peptide rescues nawab tg consultancy 
sway invariably descends privateer toilets transient premio frazier ayr maths tailor nietzsche supper piotr gaul botanic smackdown aw nocturnal magdalene contiguous reprised unilateral hilly classed champaign schloss feasibility borrowing darby trough cured ax albanians squire meade symmetrical heller limburg subcontinent deportes telescopes starvation abdel jeans gamer lid rajshahi terminating dunlop carmichael embarrassing snack rampage jalan photons naia disturbances azure mun cali vittorio tendentious baronetcy aiding simulations fenton contradict cj baccalaureate henson deepest izmir assigns prohibiting nytimes deng barbarian maori nur petitions environmentally ish familiarity sulawesi massively holm kgb indices donnelly flair polynesia antagonists clover soy scaling wingers valiant couch hines rubble winslow aramaic polynomials strive confess spreads siegel andrey accomplishment lexicon declines bakersfield rant colourful sopranos olympiacos bump athenian paranoid gel slept risky bathing nbl duly mixer luís inhibition nonfiction tracey consolidate reasoned fission barrio lear ascended eviction alteration corinthians coburg borrow bandar rightly temperance fraudulent catastrophic kanagawa gesellschaft sabrina scrolling funky sato musica pronoun acidic mania mango copeland restarted confederates tracing roberta sargent astra excludes banco adolph biting dealings mustered teaming punta prank aldershot jelly fredrik turbulent reinforcement arg climates oder jana hires godzilla arbitrarily maestro davy lega tvb eroded icao blanca ambushed nicolás redistribution capcom gomes battista advisers eminem ivoire scans manifestation screenshots mentality orson sonoma deletes blended cosmopolitan triad fills crises upgrading pill yue marge jonny crooked cbd aqua detecting carbonate fjord godwin iqbal yuen graders motorola betsy folder breadth cheney cores ipod conscription batter flush hiroshi vertigo incarcerated keynote armory jeopardy poppy shady erroneously hardest 
beads creditors coward mimi grimes tenerife navigable dakar banded slug promoters squared pots compendium nazism mimic pu endeavor anatomical mccann runways experimenting faint andrzej nizhny scala insulation horned aromatic depleted registering douglass rodrigues hatfield misery explode tyres breakaway chickens sporadic unilaterally intervening mika custer perak vines lyman rss conditioned undesirable doesnt cursed urbana configured hadith disposed aeronautics smallpox crusades lust logically maverick gratitude tentative befriended naga fallacy niels enigma insanity platt leighton workings whisky woodbridge selector lycée eocene partido ucoz ami parisian pursuant lucie commemorates cruising mort baggage isotope conical kendrick slick launchers binghamton sigismund aquitaine m² bodily hurts parliaments saba weakening popes sadie balfour homecoming osborn gaius pasture jimi unionists bassett mira +, kia unsupported waffen nadal alcoholism powerhouse grasslands outrageous contend millwall tomás segregated hepburn aix router centrally plastics francois premieres π distracted relational stéphane furnished airliner dockyard webcast anarchists verbatim collapses inquisition pst wwi abkhazia euler shortcut letterman aldo zimmerman taped exhaustive shootings jeep florian ramirez zachary favors mauro optimistic successively gadget cramer józef lacey ligament spores presumption ashford persist utopia solids choreographed agatha lad bouts kimball cg projective dentist adi schwarz sulfate aichi corsica herefordshire cosmetics upland goddesses skirt jaipur esa csi inorganic phosphorus facilitating accelerator ifk smyth completes kettle fatally char mccall triangles pills reflex norbert dauphin arundel hammersmith rectory elaborated asahi unicorn diner perpetrators jharkhand nilsson primer beatty moray wrestlemania murcia grenadier reductions javanese sedimentary ata ours crawley mammalian linn scientifically sheng heresy layered frans fyi sejm tak initiating lifespan 
privatization pumped bukit aguilera irrational finalized widowed uta assassinate indexing klan anthropological rk dfb brody seekers stalls coincidentally oldenburg venezia chromosomes steeplechase gaulle huh enjoyable gigantic illumination humber { proficiency mbc spit fianna pads solos vidal apologise hornet kaunas expects polydor porte luka maximize chairmen libertadores legions livingstone dunne nikita dod jura originality preschool enhances bjp learners influencing membranes forsyth laborers oskar shiny widened apt immature visualization femme twinned mcfarland felony marriott ruben gl ymca spouses yemeni mummy edict cries sima violently flanagan flycatcher climbers puget horatio ito pointers uniformly eugen horst tds deans inmate clashed hartlepool emitted locating characterize fences réunion philanthropic kowloon nirvana enchanted gough exemplary falkirk biplane conn transliteration chavez magdeburg dwyer filippo undone enthusiast tabs pronounce wrought ultraviolet parades berliner remade feral francs polynesian emu latina bydgoszcz casinos tuna mushrooms español whitaker abstain ima zx rebellions sportswomen asimov resin greer joseon vest pagoda wal vendetta zamora gillingham honduran wiesbaden commits bayou casimir nutritional kickboxing cochran tempered humbug chongqing cholera salsa fusiliers andalusia braille salman jen commence ornate malvern hallmark situ egan archduke wd bandit vans circumstance afi baptists feldman mcnamara aguilar bash pamphlets condor necklace bellied perished italic hormones liter himachal newscasts sewer spurious multicultural outlying vanilla doorway quartermaster constraint faisal socrates dobson kun forefront rigged rx quake fulbright robins nicosia selectively calf loyalists enriched billings lorestan antibiotics ferrara selena heap cymru maher observes vertebrates joshi microscopic lei guerrillas sleepers vitro seaplane coarse lister peshawar astor bookstore relist harpercollins cigar orthogonal rampant hypotheses vitória 
sacrificed aquinas fwiw hartman illnesses bless standpoint arterial rudder diablo pia carly silas patna mana indira multilingual deserving outputs blindness revolving wanda iar internment embankment abba mojo conus asean enactment bellamy hou dumping rosenthal absorbing vortex roughriders redwood anxious dalai reclaimed rizal cocoa kmh electronica kazakh expressive flaming yucatán externally carvings roadside pains rebellious uniting cassandra kok kew bolívar distracting ammonia mahal rousseau vr vulture trujillo writ peking ethnically equivalents filly convertible accountable notts riff ortega impairment anthropomorphic sculpted alleviate magnate spp steer impartial rojas ent wadi totaled tad mathew jacobite estranged webcomic cree sham chancellors joaquín cryptic spoiler nell cham mathias saturated martinique pious resigning moravia diversified afp hospitalized doomed notebook cultivars adoptive pendulum gonzalo himalayas leuven inhibit haig raft clements disadvantages yamamoto boon occidental vogel deserts westport authoritarian zodiac goodness bain marcia transvaal mcconnell quan sixties matteo bra barbuda topeka grenoble grabbed cesare cancers braunschweig traitor uphold convicts hinted starship proctor comfortably cock accessory krasnodar artefacts debuting weightlifters ivo adorned haji extras exchequer fullerton continual rhetorical skepticism internship sync philanthropy marshals spec informatics kristen cuthbert greeting horticultural lame schuster durable memo ddr closures rg outnumbered wellness hermes resurgence ponte yuki orchestrated gamespot posse beverley uploads slap watanabe lesley fleets trident subunit trousers craftsman orthography waldo wrestled jiménez kunst packets triathlon bcs croft consultative stratigraphic molluscs fina plurality naacp warlord saline collisions outset majestic aloud spelt rhein vichy ipc bastard launceston nfcc bryn fascinated cracked entertain binomial simulate afrikaans gustaf debra brandy knocks messy ratification 
biota tweed chunk iberia shipment cipher pakhtunkhwa hectare timelines mendes fellowships frontiers jj quartets tatar dissident nowy prefectures negligible valor umm delighted nyu alphonse hackett selby sapporo fenerbahçe nagano mobilization djibouti venerated dmitri protections aek revered lana macpherson elizabethan vhf frivolous cos immortality angolan arrays tyrant folds indoors jia aura zanzibar sine quadrangle immersion specialization simplify thicker sava reunite pinyin papacy escorts cfa nonsensical adherence paraphrase factually specifying scarecrow osprey incheon papyrus maui quadratic motorized gracie mercantile worshipped amore kanye projectile liaoning serialized refit nightly beautifully stagecoach féin antennas wynn adapting villas willamette violins causal adjustable michelangelo extraterrestrial raju lodges wreckage conservancy labourers assimilation zappa renato thurston accessing entre lubbock unresolved lorenz statistic intra refurbishment etiquette creeks swiftly metallica bertram plainly visionary peanut therein persians uncontroversial heroism trey budgets middletown hays xviii degraded routines chet noisy disappearing polygon ova hampden dorado spitfire posture eater luisa fo samba bantamweight transmitters ozone muster screenings colegio cara defected dempsey waikato bertha clicked deficient getty brighter excessively celia schism laredo maternity penultimate assad relax axiom evansville bloch pickering distillery upi airmen aiv tres mueller tba taxis londonderry trumpeter gutiérrez uri portico thrush dumfries hendrik hurdle allotted enrichment heterosexual tijuana biographers fidel renders medicare hadley plaintiffs splendid mcintosh mifflin gateshead psp hirsch ballarat pinnacle catalytic unfounded maneuvers bladder apc servicemen prematurely singleton dotted mandates ascii marsden ferns devonian cong piloted republika livelihood selects bharat regulators obligatory salim vibe hex smoothly responsive lux mortars interned respectfully 
fearless cabrera kanji opined pandora cst siu periphery imperfect lush hooked mustangs harpsichord sawmill danes chorale kochi nugent lobes anson scranton hurlers alot salinas frs libby gallons parti iteration quechua rainer constants postmodern werewolf climatic fayetteville dissemination diffuse elevators eta quadrant burbank flavour tatiana antalya flyweight lourdes gibb seung gestapo tartu kf acton huey bumper dusk cleric commonplace tents anzac windy paine unc illustrious powys brink wildly wounding vandalizing relaxation chai inversion glide reclamation customized psalms puppy gall prodded vicky dumbarton lehman townland pageants crabs akademi buren planner juveniles garlic gar violinists mehmet turku oldies yak rediscovered afforded robbed fitz taboo leverage bm bananas général biosphere persistence truths recounted cereal enslaved counterfeit liquids plovdiv antelope pied martyn roar rohan flex widening weymouth widows alerted tierra bao mildly reciprocal rotational kyung mariner champs peacekeeping airship tombstone brutality reims embassies transitioned essayists stipulated elektra prequel honeymoon sanitary grind shalom canto gifford ballpark sébastien manu alas norsk philanthropists clandestine notifications flinders midwestern ported guadalcanal distraction avec waking vaccines guadeloupe roderick tuvalu saxons silvio intersect lacy duplicates fractured regatta cracking eared whistler tamara hana commuted ascribed stampeders varna elsie corridors elemental winfield heaviest quorum herzog mohd inaccessible persecuted handheld utilization burgh bret alloys lowlands delaney hers theorems combatants interpersonal adjusting transistor pico chaplains sealing sizable seizures spaceship jerk crows moo pianos taj barren disable begs detectors britten attaining prelate eau zenit passions hydra kidnap szczecin revise stampede hogg chalmers boulogne bochum reincarnation minnie gottlieb communicated stills millennia bipolar anthropologists sentimental hf grays 
billions grizzlies escalated pedigree expressly banished transformer hepatitis fd dutton peacefully mulder reborn landau stimulated flavors embodied qualitative backlash colby gliding southwark kristin advertiser dorsey zebra ditto coeducational breasts stigma nudity leland berman statistically olympus fares humanoid finely suspense townshend valencian demi baking verizon mfa infinitely surfer cochin bootleg impoverished predatory schoolhouse fokker beak hurry prodigy auditory specialize dorian benchmark rang palette fortunate gables laughs researches cafeteria kts cornerback reclassified fluorescent accumulate swat planners childish eels contenders constrained carp perm caller dictated augustin interstellar encompassed coyotes piercing blockbuster catarina crowe lark sochi accountants tj abolitionist olimpia expansions retracted ubiquitous mayfield enroll wycombe catalunya linz noticing nellie leith fuchs dani heer politely polymers hershey fernandes anticipate antonia stu rj helmets extremist robe haut rescuing acm lag tripod wavelengths merle nutrient overdose giulio cahill backyard headwaters interlude schulz kalamazoo smartphone snoop pandit chew rehearsals mri intersecting lucille delisted christgau separatist wilfrid intellect larsson necked caterpillar varma facilitates bargain jock reunification sarcastic friar loeb suspend kala conf haskell antibody jitsu asthma floats microscopy cultivar leasing qld skye gregor oise horde asha swarm andover frey susanna dizzy spirited rada uno spielberg gallipoli candles hôtel expire poorer chiapas cinnamon empowered myriad anytime impedance embryo cans hh rko salvaged shang alerts biking labyrinth parochial categorised curate refining moderne srpska implicitly metaphysics suck funnel discredit remington ardabil nasdaq brahmin tbs yachts parity arjun ante davison axial barring divisive nasir unexplained maratha ssp cheat marginally sherry boniface wheelbase sparsely exploding silvia coolidge brahma goalscorer yada fats 
epilepsy aidan sly nath mukherjee holiness nectar cleansing flaw agra bowers midtown spurred pvt palmerston artemis raider noël darcy australasian sausage mezzo kbs thermodynamics kilmarnock brochure cardboard guzmán proficient barangays amt tallahassee defenceman adolescents elliptical breached departmental electrode hampstead erased sistan taxpayer minerva bonded pusher pastures strands whiting greed clearwater condemnation blackberry mh colón jure pullman tariffs sitcoms gotha cartwright corpses deh kurds resentment behavioural elects warehouses argentinian smashing apparel meteorology allahabad bute amplifiers shipments montague huff balochistan ballets extermination stamped preach signage unsafe roscoe avignon referendums emmett txt lászló shaanxi bearings arne machado sana tildes davao wary bonnet fantasia permian youngstown rsa nhk magdalen checker haines pinball unicef patsy frequented reclaim ornaments mcqueen insurrection azul elliptic supernova fujiwara crank fabrics rulings kt benoit ousted deterioration eclipses herbal literate vowed exponent verve geologists maximal motherwell linen engel spectra englishman guelph pygmy crucifixion extracurricular interfering adultery uzbek esteban bacterium handicapped fiery cloak héctor accra biases rubens hickory gestures obstruction sheen clarifying ivor erickson béla sewing terrier dew ay vito lough nv pathogen pearls taxpayers casper corrupted grabs immensely mehta jeong checkpoint forte echl regulates artisans chola tahiti cassie sulphur gk senegalese cpi modi reversing walkers saône eucalyptus thrace coppa headlined astoria paced ashok grumman ledger figurative propellers fundraiser narcotics ems competitiveness sects lambeth looney crustaceans batavia metrics constructs beale fragmentation vulgar hose cedric directories nagpur dawkins vox duos würzburg pea ecw nit hanley nano bundled mooney chengdu simplex catastrophe unpredictable interspersed piero marietta gladiators brahms muriel psychotherapy mortally 
sociedad academically smiley villeneuve biomass carte meyers replicate benevolent gaye walther cad weakly fbs weaken cobalt bursts cranes rad dissipated pigment königsberg tailored superiors lettres dea courtroom ravenna obverse whereupon falmouth resultant piped sociologists manipur nagorno promenade knob rallied discoverer mastery boutique reykjavík rocco generalization clemente víctor parenting mallory plight strategically bien nai bosco henning danville loft shawnee skinned micropolitan bandy illicit instrumentalist linkage dough sigh microbiology demonstrators pipelines inoue spectacle hype wv relic tedious terraces militias callaghan comcast mins cessna priscilla abyss xxi urges aris coates gallon radha animations jody ayala kampala commencing crystalline purification undertaker antoinette soloists coimbra dalmatia dorothea kl monsieur eruptions favourites reopen mariah craftsmen domes hubei recital devout woolwich pressured pros dispersion bmx huts planar foe shanxi fascination sala scanner comprehension coroner covington fledged adapter prentice ripe berries pali flashes sanjay flipped cyborg rennes extremes milestones fcs breweries ulm banksia summed durga cookbook duarte ganga dissatisfied hiram tintin daley miniatures lawful proxies mala skulls concertos dispersal syn lanarkshire brevet lovecraft anjou yangon administrations independiente alouettes richland freshmen spawn nostalgia nach overrun stochastic wheaton ligand quarries addis yoko roxy unsolved slippery downey folio obedience ardent oakley didier napa impetus geek summon horticulture sidebar kampong legality peg falun polarization naughty stab colonialism panathinaikos vf syrup patrolled heavens regnum madurai flushing gael thrive pejorative sigmund salaam aberdeenshire kenyon afterlife chemotherapy luggage adopts catania troublesome variability trafalgar pulaski antiquarian verne undergoes streamlined lavender vocalists fáil mountaineers inspirational bolts horsemen islington cartoonists 
seniority thyroid plank camogie timer duets leaflets cristo speculate stadt lodged servicing diamondbacks exemplified garments racetrack peanuts eurasia unarmed salamander nino haunting puri ríos keefe reluctance enjoyment newbury gunmen draining praises omit lite bolsheviks slightest glue lavish sacrament isolate provoke amiens blur aggregator jg worrying rhinos coasters stroud kremlin teaser lobbied rl grad counters explodes wisden mysticism millar snowfall osama modem sadler philology francophone bello blending bombardier plutonium pax cakes ses billionaire argus postings receipt nelly nativity whitehall calligraphy andres selma ei malware signifies israelites warhol rhapsody lasalle hussars travers commandments duct alexandru narod clumsy marrow tercera meng rijeka subtitles str saeed stemming cossacks niigata converge flanks bethesda alamo yamato reorganisation reacts alphabetically coloration alistair schema geese yosemite sweetheart internazionale daniela yeovil spongebob intimidation contradicts tomé ateneo donnie reina nonstop oprah affirmative scuba dre tweaked schumann observances rufous luo localization badgers perugia vow mor plotted torneo deva haworth eyewitness realist accompanies winthrop ting conservatoire stunts coles shelled aquila mississauga bethany erp misguided recap nitrate mennonite midsummer mahatma anomaly cai vibrations garonne benign thrill prado undisputed italiano preventive mentors sectarian trainee caen malibu bradbury dundas cheetah terminates schiller sy nac rivière operatives irrespective subsistence aerodynamic puff redmond odin françoise wardrobe sos worthington eccentricity vicki lamont pci kristina wm tw hubble tees aet lyn almond symptom childbirth covent flashbacks postcards inquiries dragoons trudeau morally xiang appointing grudge chabad captaincy alves blvd docking rib buxton shenandoah stricter gop twenties gg rash yogi divergence permissible consolation creep début boa alegre rodolfo rahul addicted sorcerer habib 
defends meuse floppy hoaxes owes cloudy hmmm showcases franc optimus hao fatima collage brampton telecast coli fermentation biathlon deformation tortoise dessert moat reintroduced dissenting metaphysical basalt ideologies lauded adventurer probes optimized favorably conceal supervisory carnivorous jerseys forecasts folly karaoke superleague cervical impeachment smiling cato bribery wbc saleh devoid acquires tobin ludlow meier smoky raptors refloated chieftain fútbol departs pee onstage predicting formative gunnery ghulam porta artur morals skirmish rosary modal sectional landfill nuggets sanderson linesman incompetent bw rebbe rbis chandigarh trapping modernized austrians cercle strangely glamour robbers olav await chadwick frightened boar enoch ringing breda insulted strife sorority flea slavs impractical boomerang sheds cults calvinist corinth invoke transcripts vom perfume bony lakeland pla bikini potts petite racine patriarchate shahid abolish genital circumcision ogg recommending pelican plaques memorabilia unearthed tractors glowing relaunched dominates reformers shutdown slogans dashes marlon jian ripper tern appropriation arranges bullshit squeeze inventing spade cooperatives lawton highs dg fetus electronically disadvantaged waco hesitate blaming youthful harass artificially calvary exercising pillai cholesterol stabilization ppp martini ecm thiruvananthapuram nuestra nope imf gunboat basing ukrainians postcard monographs apertura regensburg rebelled dysfunction parked vertebrate conor melted kodak montserrat triggers hoon blessings farrar mojave saab benches repairing engravings illusions bowlers asap stringent irregularities huntingdon andean airdate cartagena uttarakhand assigning shaker digestive afonso hike mosley kruger kreis stunned acorn bluetooth iglesias ine greyish humphreys vfb hopeless antigen lockhart sturgeon joo vigo teutonic larva crockett monmouthshire stormed obs insurgent buckeyes ganesh loco walla bingo embarrassment sited peas 
disqualification rested bst attire paralysis calderón sao senna migrating romanians mansions invent ural financier vasily mayhem tammy jacobson som endeavors rensselaer troopers dementia staunch chop abrupt splinter connors sponsoring bagh thrissur endorsing kirsten indexes spawning stepfather newbies vertebrae suny salient skeletons allentown sunflower chaim gallen neutrons commas hog domestically gilmour hideout adamson furness whitley shui massage gd melrose suresh reliefs deficiencies capo tübingen racially lakeside wipe melon nakhon carney rourke rostock sardar ether fetal sash daimler sarasota connectors ruhr entertained martyrdom prometheus slough bridgewater earnhardt confederations forgetting subsidy ceilings freemasonry stacked ericsson headache jb farley massa foraging noor frontline beograd degli geothermal luiz zoologist penitentiary reforming summons contended gregorio flare comb luxurious chlorine footed bashir escorting formosa arthritis fragmented timberlake surpassing negligence barbarossa ¢ yadav glossy marcello drumming lizzie onions therapies shuffle cryptography abi cartier upbringing unlock escalating specificity auvergne falco alaskan rh montage nod björn quercus kn selfish whitesmoke restrained torso avro dias piccadilly bertie templar boyce ethnographic childless antisemitic colleen corresponded callahan exiting ripped uma prom seward federations mileage shaking concede mubarak cora anglesey countered mcdermott arsenic ashanti prefectural reinhard sari geraldine coleridge leyton fluctuations antoni pleasing scooby asheville housewives arun mahler horrors lexical showcasing kinship orbiter pancreatic madre percival avenge thrower lm goblin leases brodie filipinos villanova dons reopening cabot inspectors orwell boyz embracing laundering arisen overlook tome bancroft opel rubio tagalog uncover australasia crawl barbera yung salvadoran oberlin olya hathaway fractions scalar whalers fables adair westmoreland heightened orissa fridays hibernian 
elisa duisburg sup electrically surreal latex gsm substitutes sonatas creighton etienne styria narrows trumpets defective huxley broom mehmed manifested dyson erection dun bhopal gown brutally cameroonian falklands stubbs undead counterattack aruba chevron culver recapture clasp riviera manchuria renumbered geographer expectancy contractual petar achieves wiring bastion pear manipulating rejoin teal psychoanalysis aur unofficially fledgling kibbutz expansive exceedingly aldrich importing wildcat lockwood kingfisher angrily curran pulses nikon monoplane earns kato adolphe maharaj inferred durant tuba lutz hesitant tre supersonic shek annum utica barking swanson mcneil reactivated toast orb rosters dumas savanna rounder mckee scunthorpe benefactor expiration blackwood begging furlongs freighter khalil throwers swedes rockingham prehistory licences toc sloane shutter minesweeper anc interwar overcoming classis nicaraguan cba sightings voltaire fewest batista kilograms populist percentages nicolae scot befriends yoshida rhin marconi pigeons barbarians gower roux forrester picard reversions sutra observance rooster stereotype mangrove shoemaker decrees poplar unimportant takashi ola puritan linebackers embarrassed reptile ons oecd concepción signatories promulgated cádiz pelham cavalier aggregation interchangeable cellist ver autopsy hutt reis lancers morrissey lettering devotional enclave dunfermline spoof categorizing reversible sadness objectively interrupt triples bandleader lobos disarmament parque wealthiest hewlett sankt unrestricted steppe lui kuhn mckinney redeveloped zoos petr marche shutout cutoff flap leyland magma scorpions mollusks plume esperanza telford bellator sleeves hajduk sampler rocker guan sleepy scotty blond fortresses waldorf outraged collapsing sorbonne padilla spearheaded wyndham domesticated maribor garibaldi harbours ashby cio visuals bantu góra kielce gaines futuristic err meteorite panamanian axioms noticeably macclesfield denominational 
taurus quarterfinal dix constitutions magicians dunkirk multiplied riyadh owe walpole rei beheaded eno valence validated jima recalling gunners cine melinda elegans rwandan durango bhutto latham deane volt reconciled filtered bede sponge vizier goran wrongdoing deutscher expires baptised sprague netflix smells wheeling quadruple bong saginaw liza monaghan pyramids hendricks pembrokeshire blames robb carousel grossman introductions vanishing fraternal xin motivational rui preachers micronesia herr juliette lander ardennes aladdin boolean flamenco deadliest needless ramakrishna curaçao har suk shaded renee dynastic franck clique adolfo raging astrophysics moffat dolan phonology delicious usenet iconography inductee endure nanny masts tabernacle ballast midlothian fireplace clinch westinghouse khanate transporter stain inflow fading paddington piccolo shinto governorship sindhi enquiry greenish comptroller lopes cons rds airspace arnhem tay pryor courageous salerno measurable stump gilded archangel adana alkaline asunción zoning gras vigorously milling dwayne toto hilltop psv acetate erase numeric ascot hilarious pragmatic patriotism unverified devonshire layton dalian leverkusen kickstarter freeing emirate conclusive waugh mathematically arteries mobilized pedestal spoiled carboniferous arb kirkland propellant cossack recurrent bulbs cdt polled modernity gastrointestinal tracker stimulating mau hinton specialties whitby qian bedrock inflated encountering grille verifying stalk alligator weld terminator souvenir witty photoshop hebron yeung ipv categorisation fung incarnations mpeg liao kettering wabash pows fairies mediocre tankers precedents fillmore sapphire angers gianni hayley markedly downed rowley ultrasound aleksandar marlene tarn hasty galilee relocating northwards anno reindeer champlain carrington faroese whitfield zimmer myles cabbage stavanger kollam pomona wastewater emergencies osbourne brood études hexagonal oran barns mitt prefixes smoked 
dissatisfaction pty saviour dismay growers carnatic crippled manifestations balliol smolensk lax exhaustion margrave marques uncut pitted deccan breeder terri optimum hebrides asbestos samara misc homosexuals pillow neutrally moro gino accademia howie stitch northrop unfit dopamine astana obscene gamers relays dubbing lyme gabriela loren invincible excursion enact finlay ghosh revisit byzantines overboard bequeathed headlining tasman irb waivers breaches amazed cosby uptown challengers usn frenchman ultimatum unleashed dsm medallion chromatic cultura heartbeat oda narayana outbreaks lipid toxin sublime curricular juniper lübeck stumbled undeveloped kickers ♫ sideways mongo outfits ferenc scoreless unequal ost salted somaliland enzo bea upstate interviewing contesting combs bulgarians odor innate tolstoy partitions breslau discontent lucifer cena riches tuck rembrandt obsessive primes sinhalese starboard madeline administering camilla compressor pj logistical dio relativistic slips waltham frameworks structurally notifying piles inserts agitation waverley persuasion rune nosed ricci viability baluchestan unorganized seventies honorific robyn interchanges yamada passers pacers kam guangxi roosters iptv respectful abnormalities hedges evading gdr dewitt chases incomes sharia kearney ly callsign pedestrians drifting hive arcs gemma abbasid richer practise bun repulsed whaaat inconclusive suffragan epistle perch groupings latent partnering zenith canaan hastily hasidic larouche shorthand ordre philatelic swelling mcmanus unlocked naruto calle finishers mangalore forecasting asians eure spontaneously pfr stockings adriana fries championed blink shortcomings stove congolese reprints rustic yangtze guevara screams volunteering catchy niles payroll solitude obscurity toothed drones quotient softer bir reliant stravinsky rebound condemning entrenched palacio cascades lv tangible oratory howitzer bona haus intriguing vous remodeled babel burrell caine vaccination shizuoka 
horowitz dimitri anhui addict xia persists prowess distract wuhan elites sendai hartmann hwa stabilize refuted tropics tagore kenji scarlett recorders cavan travancore grasshopper hanja lyceum mrna hohenzollern grooves yeon progressing deter dandy philosophies mcclellan enamel magellan penrith tit readability juilliard tapping anil thwarted gallant terrell tycoon finley elbe tod looted yaroslavl tula moby bonner acta elective grease cortés digby furnishings demonic signify lick thinker bearers kombat begum liber excelsior brightly farmbrough robber weinberg aitken farc rarity rodent shipwreck detonated antics bypassed dragging lodging equitable universität yom vodka zane misrepresentation mantra thinner caldera angelina padre collège oneida inquirer minted gangsta agile boasted hollis malayan conductivity asymmetric emulate scars scripting disparate insular harmonies aga sclerosis susie alger antiques plenipotentiary appetite invaluable rhodesian landry khaled superfluous roaring wrc jutland enlargement horner devine gis dragoon riemann timeless pyrenees herbie hanuman showers fray ethos jansen reggio sober fundamentals friesland tentacles carlyle liquidation involuntary burlesque outlawed rehab é parnell mondays volley juris declarations phnom haggard irt bundles patterned nouvelle istván accorded parr mails kaliningrad duval persuasive dnp sugarcane inhibits roofed heyday appliance spas moderated oncology kh diminutive siren rajput washburn hospice gwr iodine temps misrepresenting astonishing pumpkin evaluations divergent deforestation usaaf rents inappropriately cynical swapped confronting divert onboard oranges workload likeness mechanically lakewood cohn wynne ebook swings menzies fong ramesh enlist mccormack replying bobcats unblocking xix redding infirmary chelmsford renegade imaginative tac fumbles bests greetings photobucket spaghetti gamecube coveted realising karan norwalk warcraft pics bmi glastonbury relentless sion lorentz climber intensely 
incarceration slit haq vo holistic mather lerner ayp andrade ester venomous intestinal transcendental institutionalized arrivals palmas maude pastry hangzhou namco shorten sunken suisse deferred sinha miroslav príncipe docked subculture tori pong stubborn bedrooms justine alden packer tomlinson puma bongo substituting tinker christened dwellers choctaw amarillo tess exe sortable paulista greenock polka subdued spalding stabilized embargo pilar inefficient dryden maastricht unnoticed sta dada maroons horrified smashed mitigate dwarfs nel likened korn remakes forbade perpetrator nader gaia grosvenor disrepair flanking grosso capacitor costal monoxide columnists recited mcnally resisting artes lovin nationalistic comanche welt provoking revista yuma ppm inconsistency echoed falk epilogue cnbc gunshot winnie dsc dario visitation seamen liv kano triton calendars palladium olomouc stonewall silhouette awaited pretends simons concourse mountaineering impressionist heathrow hobson movable barges cervantes alchemy tapestry yeh mantis schroeder hoop paving lineages embryonic witnessing discriminatory farmed distal hinder npa implausible cessation glimpse conner bieber epitaph inspected offaly stratton conserve flutes saipan lobster muscat vikram logistic rossini assent nauru sancho doppler gladly altarpiece py sidi noises articulation martino ramat citroën thunderbolt kwazulu windham mountaineer vomiting roh courtenay denoting distributes byzantium gigi riccardo informant felder mahesh anecdotes subsets triggering keene encrypted bows loki limassol québécois relinquished helper jacinto ccf cheerleading prosper steinberg namibian tolls suicidal ordovician previews retribution hardened mays avian manifolds maidstone guinean kuwaiti fostering turbulence streamed valdez mendelssohn politburo pritchard snowball canned prevail chores aborigines appropriated jaffa walkway surprises parkland dso ros pliny toxins saber flo strickland oates mello père throttle valentino towing 
augustinian largo trademarks holman nye coimbatore stanza cougar hickman afaik ajay shogunate polonia overt sweets sinks cornelis stormy nazionale correctness placebo ombudsman patriarchal hamlin consular hakim garter cummins prized manslaughter complains moldavia standout subgroups cremated siva kirke nightmares leaks dat gladiator hunts bloomsbury intrusion tonal contradicted undoing lullaby grievances orderly conquests pdc tranmere lemma sleeps slavia brant dina witt sunil socioeconomic clio gaussian skinny fitch romanticism connotations fractional tonnage boucher dutt kari zeitung olsztyn dignitaries eugenio fisk potosí buick inspections mum thugs florentine entomologist amnesia ordeal hwang psyche stockhausen manganese jepson ilya monza gills serrano vicksburg sena saffron compensated nagy mta moncton tait ecstasy wreath warhead ashkenazi blyth ruse pivot lament jagger finer arista damp bridging forbid gpl cagliari whorls troubling entrepreneurial footwear seeming hoyt allergic inertia plumbing courtship fractures bibliothèque georgi assimilated bela hadrian cornelia complies dishonest leben ado keenan facets unregistered retractable fated forgery anonymity gabe shelly chand paranoia luge saud electorates deprivation farmington aik practising microbial emit unproductive bunting abstracts baum thumbnail earldom deteriorating bateman disapproval blackmail mikael mitigation minuscule priestley attila spying tamworth clerics ollie dredd goya aisne mash restructured interscope boxed richly bruges woodruff diffraction coll openness couture waitress española futile superstructure handler offshoot carrera moreau accelerating miserable sher mahabharata sportscaster slowing driscoll colonia loretta conspirators looting djokovic steaua conscientious knighthood granger paxton pir lig juárez cornwallis umberto gillette pokemon overlord fascists bessie salomon maurer obscured liners codified hester sama clément joplin safeguard infielder skirts fertilizer surat adolescence 
opaque sandman levski hue neuron flourish grassy melancholy bonuses cor tat solemn chants wadsworth shimizu kristian agustín explorations digitized boosted strap batsmen chancery thigh shepherds blasts novosibirsk crat cursory renounced polaris effected lichfield shrink tcu belo brew barbecue pauli biochemical tending reichstag jfk britons electrodes pharmacology montoya radars gw debatable jingle rajya widest coronary sidekick vas sagar gliders soares sporadically podcasts thunderbird revamped baskets zeit lesbians devin worsened comoros sexton starling blount loughborough bowden paw electing unfavorable tongues nihon alvarado lewes gaze regrets fearful seanad indra pdp diminish yam varanasi shrinking burnside adversary protracted liberian amur gheorghe sloping romances intuition punishments lobo cctv luminous scrum roommate ftp mayan blum pep thayer pilasters overflow stacks albright negros vinegar embark undermined helene forwarded elgar crichton rea miraculous mott lucian accustomed lucerne bronson diy epidemiology jurists mitra overthrown carcinoma arnaud pugh grunge pls assemblyman escarpment calibre rollers lago administers shipyards yee aggravated westerly seasoned nazarene bassoon magneto mentorship seizing contextual brightest ravine snp capitalize shakti bleed savior sensational dae accommodations finder nuisance pedagogical familial een savoie enix antibiotic conquering derelict dismissing farce scoop collide transnational australis retract hoy publicist platte fédération claimant surrealist belleville flowed pretext kosher gadgets pavia gorgeous fps gurney condemn < ealing hopeful foreigner infringing illustrative celsius hobbies faiths italo katowice nbsp heisman fraternities retitled downloading riggs camino assamese parliamentarian grossly antennae hardship discrepancy stalker drilled atv lukas piping gleason shillings idiom kuomintang shaman bsa lsd atrium nypd crore terrence misty equestrians gian pyotr completeness kyushu epirus grantham 
civilisation sturm sorties lest punched pastors marston firmware cistercian borden professionalism amg svetlana juliana accommodated musicologist reboot visas ageing leandro inertial textures monash pods outfield blurred mundane innes harman ids billiards constructors knopf agnostic conformity yup lymphoma ppg inward stewardship xxiii bogdan multidisciplinary thermodynamic envy girard projector loneliness hubs behest detachments uw britton nga cutaneous trunks lends melton spins starch carvalho kessler patrice estado ashamed feyenoord cooks awakens ontology hainan fairview ltte hmong streaks activates decreed dreyfus ning stabbing slaughtered robes ‡ clipper firefly fansite deceptive circumvent sonnets gia dispatches gator madman dime tanzanian respecting cloning chisholm universes matheson translucent observable malice sidewalk mover dag ephraim smokey psa noms sired avi fugue hautes shiraz removable dall suggestive vauxhall disciplined matsumoto malignant hoboken minden barbie voicing konami resigns conti hof entourage chemically storeys murat fiorentina trios viewership tectonic shearer fable deployments nps exporting patriarchs behaving stooges alban galleria profoundly mato brute krzysztof avenger overshadowed recreating positives unaffected scooter glazed nearer pedagogy interconnected shack tq rabbinical subterranean hominem raman slack mirrored chopped probabilities sanity kelsey gecko multiplier shook chhattisgarh eaters supervise phylogeny schmitt rocha wastes cavite daniele mcgovern scripps kedah cristian jacobsen stevenage necessitated inducing objectionable subaru reacting razed beware pip grail dehydrogenase barrymore fergus longhorns subsections ez vicenza toole eisner spindle ast heineken egregious scotsman ossetia islet smiths lehmann interacts cleavage radford subordinates clausura inconsistencies ffd cx disgusting bale warszawa disregarded compartments aiken nightclubs pepe instigated deduction woolf mcculloch isthmus septa edmunds tearing maxine 
lisp operetta tadeusz baseless senatorial pacheco ♠ notoriously leech interpretive oahu nitro tver nighttime bolster stereotypical cowell aeroplane kipling hillman thorax iwo harlow delegations propositions midday tributes zhong summarizes sculptural lugano speeding swine weeds molten kanpur critiques weinstein angelica stepmother predates ganges adept ley flourishing crocker butterfield devlin keaton locust hain pharmacist cordillera parodied allegory lair conveys afternoons huston figueroa macgregor kml greta interacted floated moreton margot donating malagasy quaternary pdt bluffs purana ordinances budd oi hangul infiltration mcgowan anu fiddler exerted dissidents gaz andromeda mould ig mauricio hitherto actionable hempstead monique vat milner nylon harriers sharqi geophysical shogun wick cyclops mcclure kazimierz georgina indycar roque purportedly receipts yokosuka alchemist recoil tentatively charente pesticides graphite sounders mantua typeface lees liberator intergovernmental departures defer shelves tricked parchment hindered saturation kemal britt oed organisers miyazaki amr miki paola vg jive alastair gathers parcels originator medellín mouths uav mecha shun reaper pneumatic mace ecole hijacked melo msu pudding seasonally quark quintana recovers filler bungalow elusive aqueous consciously subtitle nanotechnology bac zamboanga tiberius convocation barth crc loaf dashboard kaiserslautern squirrels akita pens carpets duquesne hama petrov pentathlon focussed poorest bowles beauchamp tripura seinfeld oblivion jams sonnet noodles frieze consequent eastwards charmed mrt doomsday synthpop hormozgan appoints treble braxton vin treviso esquire bergamo eighties wolfsburg durand normans fittings waiter veiled subscriptions malt brandeis clicks chanting stints socialite pickett transplantation brothel meek koi surya waterman dyes wittgenstein chatterjee kal stemmed pashtun upscale khrushchev roper excise raza ahmet farah eriksson methodists arad madam seán mullen 
midget figaro northerly sault yorke jaffna grenadines bearded passerine dearborn confucian zac tis roswell ignited fascia hatton industrialization rabbinic dangerously bitten gare dubrovnik puja helms thani pathological richelieu exploiting rouse infiltrate mexicana rectangle chn darfur lorne zia cursor supercup brokers smear spartacus hardness mirren sucks beginners bleach cadre ducal sulfide millard organiser resumes chunks lina stare arresting humanism deb yat dowry cultivate megatron overlooks totalling meir alder waterhouse gambit ebenezer anomalies dole superstars boardwalk chippewa fandom conte andaman valentina traversed spacious concussion soleil pests subunits prosecute eucharist elise haze penetrating haplogroup gutierrez mysteriously synchronization renown marlowe breech bandung bulge crunch nürnberg predation holliday hypertension pasta crouch spacetime flick undergraduates banu drexel edmonds untouched herds disconnected unsubstantiated carriageway vert sinaloa eastbourne extradition truro pennington flashing unconditional sideline rockers kool immersed delphi fredericksburg sterile fours iloilo zi retina bess castilla peacetime mcmaster audubon wrestle aaf gansu luce unremarkable exacerbated cfb stardust mishra surfers articulate fuentes welded takeshi flaps burg kaohsiung zoran belinda objectivity dames gunboats chomsky longford waving pskov habitation rubén weil ethic messengers disperse tort npc animators eliminates blackstone coney syd campania vlad commuters hawkeye golan circumference bv optionally carmine rajendra conveniently borges kirkpatrick bakr collin jacks fundamentalist yap stewards catalogues inspect marlin lowercase definitively devastation leander frisian obelisk yarn famer langford vaults nasl knitting cohort gabled crosse birkenhead bartender migrations dialing profiled unmarked swearing landlords endorsements rtl gabriele singularity corinthian anthems legitimately kuban playful arf thakur dn toowoomba stair rewrote whigs vere 
allusion gayle demetrius hypothesized bandai gopal durability mahoney mst reinhold makoto offseason mains steak validate exited dictate boomer duma warped bypassing professed sx gimme garza tonic rockin pelvic westland mormons disillusioned velasco screws livorno kimura bennet setback geffen knapp cdr joking ath realignment densities laptops apis coherence brownsville interviewer bh canyons schuyler wmc appraisal snowden primal penh whisper voronezh reconstruct hauptmann shelved moorish rawalpindi escalation kar portrayals aida bette ares litres transcontinental parlor leblanc lapse forage unjust chameleon evaporation sandoval brownlow shocks discharges charlottesville mathis analyse masse mancini cornice sprung bethune pang xvii siamese dubuque chronicler embroidered sei campground alp rajiv haywood thanked cheerful millet ribeiro vet beckham albedo wylie lemur thug henchmen dis delft courthouses calories herodotus tame bribes montfort behaved brecht isnt gatehouse shoals spiny harbin warmth boleyn predicts fawcett wicklow tribunals disgrace casas ashram jiangxi aroused dekalb celine tiers almaty fff sphinx brigitte inhabits bhp talkin disgust gambler fluoride fasting lovell creationism obtains meteorologist greenway welcomes offend admire monologue johnstown weller pandemic romagna kidding pascual lump depots architectures bazar magee symbolizes quarrel aristocrat defamatory embroidery kami plentiful flc uyghur cushing decidedly schalke kriegsmarine utmost colder tyranny astrid trooper alienated perrin mondo ballerina castilian etruscan rook jermaine rightful simms gharbi ronaldo offenbach hye conclave recess shortening chevy ¥ vandalize shouted mediaeval culmination palgrave lotte cloister geiger daredevil pacifist moser mérida purposely sagan misspelling rooftop tabriz thrones danilo bobsleigh loma autobots tendon degenerate franca boil templeton corcoran sighting erratic limbo subscriber fiercely stanhope arno vigilante acoustics angled delegated coltrane 
thunderbirds maldonado baer henchman mj bantam deus ritz makeshift diligence pouring piety suitability leopards twisting rambling tambourine bicentennial hsu jug undivided grizzly amphibian commute acknowledgement adrienne virtuoso karol youngsters fantasies usability theses rounding refute deploying exempted yell cdu justinian pardoned grandma ouest cornet tompkins brentwood enlightened ananda memberships coruña amplified stucco annan cathode sentient voiceless playlist hurting empathy wollongong chapelle mtr sender topo weakest conjugate requisite galactica suffixes fulfillment acb ze suppressing iec hawley conflicted farnham favoring rahim crewmen monstrous summarizing alia tapped mervyn terrific orator aggies ligands oviedo koh motte caribou wildcard niall intellectually aisles reassessment accords landscaping bends hurler vb disagreeing schoolboy faux coulter mersey eberhard chewing mcrae macfarlane junkers marburg nathalie resorted fiancée northumbria lakota subtitled protons burying conde bro blends yew misplaced flamengo neurology overlay pence utilised drayton notables uneasy illawarra generosity rin fonseca mocking orchestration bracelet tramways murad acs percussionist watercolor nasr pahang kiribati homemade extraordinarily daegu mich gilberto bnp barbed chaser blueprint alveolar svalbard barefoot medford bounced kenton daleks gangsters atc fanfare seam caa canine geocities mak penetrated enhancements belles himmler goth luthor petri chromium tamaulipas alluded hereby rajesh ire rockland cabo gresham steen tundra bribe kita kwok chimneys indifferent thinly fink harrier elsevier stapleton coping tiling hardships enid westernmost advantageous exert stanislaus iupac missoula goo erstwhile oscillator needham friendships piraeus monorail granny arthropods unorthodox sumerian fatherland postdoctoral miley silverman apprenticed oratorio dynamical krause alianza somers dijon editorials khulna tossed gilchrist unbalanced reestablished nested irvin volcanism 
bianchi stead duality refinement quid cfr dives wander prosecuting townspeople blofeld amounting diets admirable isi sss carpenters suleiman surmounted complied majored hummingbird kagoshima cutters bisons rigging marred aia scarcity ishikawa yahya recognises sprang fil consuls pleading assembling nomads plummer undeleted univision rupture goode mullins nineties referral existential bullpen atherton sanger aborted visconti harare sane gully orpheus spotting assorted belgaum colgate tardis atatürk negatives replicated stairway numeral bum potentials werder magnets sewell lr deficits kira bestselling applause treatises renomination anemia lila ornament sidings entails wilhelmina serenade neustadt diarrhea pierced llewellyn vicariate timmy unfairly unanswered tokens kazakhstani whitewater greedy hammerstein inactivity bendigo göteborg gifu infer wrapping abingdon shackleton dreamer hajj bushes biscay hertford quarantine exchanging boldly concensus amen dacia dms gol klamath fax bondage arable pineapple constituting reconstituted ypres tsn fen twickenham evanston minesweepers netting utopian redistricting cui hindustani careless stoppage rachael terengganu scissors synchronous heywood counsellor sps ribbons cortez melodrama aleksander matchday seeker invertebrate belongings businesswoman wordsworth nearing pursues steamers nada villarreal rotunda shakespearean unwarranted shiv lombardi alonzo mulligan qasim melee finns castor airframe soaring swam interruption zhi polarized whence qazvin goiás holotype linemen regionally succumbed underwear extremity mahdi eur wittenberg screwed steroids passer igbo ababa heater missy chah boulders motocross drifted yehuda estero recursive tasting mundi discomfort spinner hyatt vélez bumps locator manley walsingham passwords vehicular priya trimming buildup mexicans cried barisal sidelined cohesion lewiston masterpieces bottled galen transliterated profiling lind hwy pervasive advertisers hasbro brahmins cirque shaken bernadette goin 
rosh apologizes cpr ivorian firewall bower música apse simplistic lupus felice schwarzenegger jawaharlal cough mikey amazonas kennel nakajima timeslot reappeared verdun occupancy kayak jos lps mille throughput briefing antitrust altman lyrically nui neapolitan undefined ile kashmiri louder retreats buffaloes winkler urinary ptc johnnie uniqueness anglicans michal extravagant substantiate goliath landon iit induces blinded upbeat dispose ibis košice jnr sickle ordinarily unnatural containment mccabe campaigner misspelled vivekananda analysed reiterated possessive ascertain overcame barkley markov bobo petrie swann plead avn sandhurst minami aegis condensation picket ventured derailed wifi specialises reminding stork amplification vacancies furlong staudinger emphasised discredited sarcasm electra tho voss jiří milligan kandahar sevastopol unintentionally hostess invitations sloppy oldfield mekong levied swallows stalingrad bree timur southwards complemented parrish heinemann mong hegemony tabor willingly intolerance imitate honeycomb dewan lockout rallying nucleotide xd gaetano puzzled hacked isidro kart babcock guillotine mentored alluvial mie garnett kitchens immoral westbrook kor dinah earthly overkill maa smythe oss rcaf comprehensiveness predictive strives nemo kabir jardine caveat celeste apprehended expel sizeable tenders alicante extracellular paget replicas disambiguate tamar whitecaps negeri palsy delaying hens leger samar euclid jpn ennis vinnie libertad unconfirmed omer auctions ofsted attendants disproportionate assortment drank sco linus justifies forfeited armageddon refresh maury portfolios pinus shutting normalized sheehan optimism pershing oppressed weavers vagina kieran cajun distressed outsourcing freiherr seriousness annihilation cpa lagrange chic supérieure gens stag std underage formulate escobar discourses parton nikos brutus pooh hyperion hollyoaks proclaiming manors seaboard chechnya winfrey tully schofield ide gh rommel pleasures coupé 
mosul mbta jérôme flop núñez adhesion vidya uprisings ezekiel nkvd psychiatrists waite merrick tyrrell mogadishu jars massacred provo precursors muay arp irma niño benoît ott misconception frenzy decommissioning handwriting saad easterly stacking knoll aguirre atelier nascent bernd paler profanity imposition intermittently implant kafka stowe heist pasting msa congratulations kwon telephones sungai hydroxide viscosity scarcely vents qaleh gaspar codename obligated spicy mainstay shahr whelan pathetic voor thelma trotsky columba ladd upn hickey genevieve recollections hegel juneau revere confessor housewife kitten discriminate estrella amphitheatre polymerase batu plugs oppenheimer mya parting tomáš flavored contingency inaccuracies fulfil tennyson silica peirce gorbachev adil goodyear hrh almighty dread lonesome björk rewarding cheeses cartesian parapet sora bara planters segal barclays thrilling seaport stara tutelage boswell joon chronologically leno tennant wis preparedness landis khanna lingua gorges fragrance cider awa mehdi netted serotonin stew hofmann eugenia eros flaherty giulia excursions compounded hardwood wye causa emitting caters acharya unbroken tomasz pesos soapbox intestine dao keyword gent lethbridge tromsø dedicate indistinguishable ams harmed bets amish corfu admirer rhinoceros gpu poke coinciding vigil prosecutions approving brugge hinckley siècle hades dogma zeal raspberry objecting misused nikolaus fetch leibniz gibbon spore snare acer oswego bahasa guilford depletion smartphones conduction shree yorktown rentals yamagata acosta smyrna fullbacks irons eid counterpoint euthanasia dunk exposes rsc regia tungsten groeneveld jános greatness easternmost healed subsidized belvedere unplugged ultralight wickham cooley bef mastermind episodic bouncing obispo ravaged planter mesoamerican quill dislikes assisi fireman civilized jayne bedouin insofar abbess faire handgun cordon decorate alarmed josie shaikh wavy wikileaks intimacy homology synthase 
envisaged jag mio knut josip seren dolby scrooge pompey lancelot bennington diode slum presbytery solves friuli balances seri twists sus calibration shutouts harbors louth paralyzed harrogate outlining vallejo penthouse lippe afflicted councilor shines vieira thunderstorms beginner lech ensues pai airstrip olfactory noticeboards corbin reinforcing ammonium allegorical dowling forging gainsborough eth decider coo pigments paco curators transitive leaking wagga volts participatory hansa mutated schoolteacher gays symbolize parishioners coahuila ecoregions propagate asterisk undercarriage crackdown cunning fringes corrugated shaughnessy alexey peaches cohesive wig namur distilled solstice adhesive subscribe dunham tackling nephews cyanide intertitles downwards polio hitters xian estimating mosaics caspar caricature mep firepower cupola friendlies contour criticizes personalized ayers bodybuilding tackled equip vienne anderlecht humiliation usmc funerals hawking provenance decisively withers unjustified jennie swell freemasons sherbrooke messed imp thebes implants conceding wiggins viennese keegan encode dalhousie slc plated cro concave fruition skateboarding diminishing mawr rabat lorna persepolis koenig enshrined ralf jumbo slander chaparral reels immanuel perpetrated cale workman anatoly auctioned healey comrade disparity ufa brides littoral kangaroos awhile mou allusions extratropical imran jenner typos seq progeny tilted madsen wont schenectady globo barbour kelantan hochschule stripping mancha necks mesozoic infect jeffries kiwi uncanny lutherans disobedience valery aditya exemptions grammatically dreamworks filip kandy coincides brittle bangla normative recurrence pn windmills vanish xxxx gunman mixtures camels yosef searchable freedman foss usages rancher radiator repelled hoard nizam cline imbalance retake blackbird weldon visibly fonda shelling fab puccini rwy bullied atypical rudimentary hiro bevan quail moseley extortion councilman salam xie wil remo 
cameraman oa pcr vio snapped nee playa leavitt neath integrates barrios blocker casanova yar tatars panhandle coppola nadine repay janis furry stallions gehrels dnipropetrovsk rudi allman snapshot terriers adnan shakedown domed hendrick diptera learner latitudes satanic luckily tajik steamships weary smu gated organises lindbergh desserts godavari larval elmo howrah handley aceh winifred hypocrisy crocodiles roadways ethyl spector pim hindustan acp einer organisational chiropractic reinhardt fuego deriving yarra num dope sabina sumter tubing recognising horribly controversially interoperability severin surrogate villanueva reformist tampere staunton montes bane thon dumps foes functionally fleur devonport cortes autosomal cooperated berne frees estudiantes arya essentials pushkin warred pantomime threaded ans surgeries linton unheard cockburn pavilions asin nominators chatter rumoured storia hossein conduit wheatley kottayam hamiltonian weiner romantically strategist planetarium romana tsv lansdowne kumamoto persuades banff pemberton ruskin streisand bolded genesee maloney misinterpreted misinformation tbd girona arles myra fixation suitably errol slew vodafone banat overweight incline chamberlin uniformity hacienda reviving alienation ionization gretchen beagle biscuit colossal dum nok kinney bruckner expressionist wolfram incremental bahraini cords ibiza gulls luminosity needy wsop egerton carex ionian lua bleak wielding infante awe tray vm townhall tur intracellular lar ramadan keyes midpoint manic cemented distillation tulip qatari freeware rinpoche lms aki walmart intermediary attributable felicity petro footpath naxos snacks dues hoosiers henceforth woolly hutchison pitman irresponsible sobre formalized improbable defaults dv georgios lengthened conroy miyagi mcfadden robo stylus empower macaulay unsatisfactory bayonne mischief leila annoyance buds tout combating nonexistent chanel ronan burrow fruitful biscuits microprocessor disgusted bottoms spock fer 
habeas speciality cantatas bradman clovis jeanette tlc refractive isd morecambe kursk johansen monticello concentrates annunciation alkali facsimile podgorica comprehend catalyzes acadia thanjavur preclude meetups csx estelle embroiled stearns jie kinder stocked epistemology anyhow sidaway kamakura encodes recommissioned goodnight casing carrot escalate summa maia zvezda skelton imagining hertha fairest cronin recollection paley leela abilene archiv iranians baiting plover kissed parallax primo pretender ramayana minimalist recessive albatross grandpa frazer footprints rebuttal stardom ditches forfeit jeffery seminoles unusable ust biblioteca sharjah forbids oo vegan canoes fostered zh abridged prayed brenner sticker livre interfered skirmishes esteemed labelling laurier resettlement comical whittaker redirection menus gilman fiberglass ashe disputing chilton herod rub networked indore jainism petrel doric josephus fdp spinoff keywords incest bobsledders kaufmann propagated intrigued sakai billionaires saarbrücken gendarmerie experimentally catechism garda avril distraught invoking muppet toolbox genghis adventurous mandalay shao nordland hippie agostino teri bani dynamically downgraded bnei grandstand sander brahman unrealistic más preventative poisson bock ludicrous osijek volgograd agility ov thoroughfare tongan calypso activating sinhala shakira watertown conspire nausea mackintosh gmail litchfield zhen woodbury mane concerted genomes carts mansur olives rushers homelessness apoptosis cosmo furnaces anas steroid tama illiterate leakage terrified bos polluted aquaculture azteca tuscaloosa ecoregion informer aerosmith romain excuses vandalised adhered mahogany chaco populate grotto footbridge bourke bamberg hmas extracting sandro lured comédie astral adirondack secretion layouts dragonfly druze thorns pollutants afar bsd consorts dowd inquest paratroopers summertime schuylkill lata herbarium incense agony aptitude sutter coda einar grenville dawes birdlife 
messing henryk fermi descartes figurines reruns thoracic etching tinged jojo organists equinox censuses valea urn hitch yeats nicki fief tarot deutsches vipers hug higgs seminaries schwerin carta headland erfurt umayyad iona dooley uncles swore ancona guizhou mönchengladbach vogt motogp bmt ohl beggar cunha moog whitworth hanks capacitors tattoos findlay accrington aung herschel merthyr webcomics testosterone confrontational ebony skeptics minster waged vos fyodor tangled signifying magi foote vázquez larissa samir branco choreographers corrective escuela hoops fortnight alludes ascend soaked hélène tutorials ironclad sweeps hillsboro autistic allergy prout niches chinook medusa northland iu balboa joss praha philologist cyclo shortcuts overtly ovation goofy heterogeneous hotchkiss thrice royalists vologda toei theatrically decompression blender kimi paleolithic figuring sed josep leprosy suborder porcupine bugle unintended complimented rioting widnes pliocene latency haag coils superbike doña catalogs audible tipton ornamentation empties pooja carolingian ovarian nvidia garnering gerd indifference ricans maeda sourceforge khl mindset smuggled payton biggs solicitors bhushan spectre harem khartoum selectors proudly germ childs baruch seconded hsbc interfaith nijmegen lackawanna omitting altai constructively ven lifestyles pyongyang stoker honoré ripon kiln günter viktoria elst longing rake lycoming trax dara sita amit poseidon axles taranaki ¤ neuronal teh revolts attica épée tuttle blythe castleford salinity anecdotal horrific fifties puppeteer mus jma oversized hush striving peril azam halen iyer adriano atalanta cuenca iba fiancé rumour magpies clarion tonkin metcalfe ncr classifying transept aya embedding idris harwood sledge stirring interpolation newington rein humphries fetish starlight wand mocked inglis vending fad sparkling bhai mcallister stationery sirens mormonism dreamed ziegler dist antônio gila sree layman computerized ramona desai tillman battered 
curricula tub hardback priestly perseus bombarded jax interchangeably swinton seawater lrt passaic fao márquez fermented environmentalist hydrocarbons lecturing recount verma unifying vader nicol emailed tuscan watergate punishable molar adc precautions charisma irregularly nxt homeopathy grands genomic gentile natchez connelly anglophone cairn sse hyman embraces bowed caffeine patently neu unhealthy sudamericana darkest hamish metallurgy amon sore mallet burgeoning ck tra moira californian texan harz chou jk elvira overload polity homeowners toi behaviours chairmanship incubator kip festive shivaji dwell condé guilds farr chernobyl pore stefani curl asl hipped fictionalized gunther metaphors polygamy deem microorganisms entomology lancet academician uranus vaginal npb bayreuth taman orienteering scribner hamadan malleus meetup carinthia neale ghats charger paulus ridership knit animosity sayyid methodologies frye resilience faraday cafes acrylic dictates retrospect heh wrigley desks sympathies gauteng clustered drills cristóbal gaya outback mcneill programmable henrique redman projectiles tainted beresford minimally aye hindenburg sedgwick austerity centaur michał wiseman adrien repatriation bora piet sia deformed á famicom playmate verbally grotesque vaulted delia fernand excommunicated aif mercia arras lonnie medvedev zambian sandwiches elegance abortions wallpaper scottsdale leeward fuss atwood plywood casket lhasa bulletins kingship toolkit plenary unreal lieberman fractal sikorsky grandes bowel kline huguenot luv transistors bibliographic caitlin uniformed carlin cupid zorn annotations seeding cognate samaritan narrowed scrapping emulator sakha krishnan ironworks drown jed groupe pup bunkers harshly wineries prospered outcry catchers beattie beneficiaries pom sackville oscillation boron tutoring asbury reappears sensibility harden pfeiffer rennie collectible chapin desperation caro persistently gauntlet gob woodford bigelow ucl adversely pedals chappell 
première wordy xt voip salty slabs overtaken markazi frau directives chien cumulus afloat creepy genitive ceres malian clough cambria ard htc springboard ayatollah contradictions rainier questionnaire ais seduce duckworth sooners catchphrase unwin rockhampton universitario whl distrust tramp faked grabbing condominium dundalk racks versed cubes expressionism compagnie torrent hesitation narendra interprets minaj soma gbr phrased quirky shabbat observational apollon delano husbandry disposable outpatient amis maris regaining adoration akhtar jw resonant vasa homophobia mayflower grundy slovan stv cay beneficiary technicolor minimizing fifths tariq trawler guildhall branson charleroi trentino fared topographical ormond carew amphitheater standby imax predicate impromptu filtration ovens hemp aspiration conventionally panned demolish vladivostok hamad savy ruff revitalization thérèse liebe morten zara uxbridge maronite headlights toed ofc reorganised snippet bernice ashraf montclair dressage asu bering issuance cantonment musketeers agnew midshipman renée métis forked henriette attributing artisan opting motorways banerjee shay hitachi yeo nunatak mailed skeptic importation chekhov sacrificing multilateral vassar xe aldridge mendel eyesight vijaya macedonians tweak haider repel fleece livermore typewriter harlequin deletionist masons duggan feats osnabrück germania longman inge intrigue roasted merriam televisa lanterns sucker environmentalists disbanding shingle jamming maddox unveiling aes muhammed armin greet aoc breaching osu jn newt ultima ebola cereals marist dq hertz antlers obi aberystwyth ayres evoke hopping shiloh embryos attest upkeep hilal fanning tele ccm frith skit tidy helpless albemarle deacons cryptographic lugo odense centrifugal idiosyncratic finned subversive behaves governs liabilities argo littleton sieges dojo roaming montessori gita bolivar pitts brice kirill slipping thanh pasquale foundational morphine disallowed booty laing chaucer digger 
misrepresented platted rocking beni doran inert garnet wanderer udp attaching ahn zadar protégé assyrians hardwick moulton walloon autograph interrogated jeddah hikers williamstown hawkes janus engels thrilled mcfarlane hounds basemen serpentine agua grêmio sliced speculations derwent phipps slices artie restores nandi stanislav polski preferential xing terracotta silvery klang swahili funerary reinstate hooded navigating aldermen criss vosges penance paperwork kola mandel goldie gourmet ord debian purged handmade buckland iirc lucca syntactic profitability budding toolbar colton curzon vincennes interplay landslides anselm culprit haunt dionysius fluorescence aman roo patil photovoltaic necropolis lawler wba isps michoacán arcades pertains è xy banja coordinators medway assures westerns malo epiphany slavonic randomized firefighter colossus urgency sultans assay dickey refreshing kana metamorphosis martel basra serbo chakra theobald bouchard loudly footing sho nag kraus solvents overton omni iglesia fathered passover detonation portman pap knowingly ema sera cci parkes compromising footscray eq dutchman intangible recife gazetted schlesinger addendum primus awakened hysteria plethora hauling klux leveled touted greenpeace housekeeping uptake sampdoria gaon cartilage pratap emblems aztecs presbyterians bayard roulette roscommon emergent inequalities chau gloss kubrick dipole kidnaps raya mccullough holyoke reparations zapata foray enormously keats saltwater ramsar peregrine niki bundestag centurion insured englewood mulberry chez alamos motorists phra tangential electrostatic steeply halsey paganism rosalind prudent urs apulia llp incubation canis marvelous gpa peckham aac parser greats maurizio mink urbanization sled widescreen sonja seok camped moderation janine salah motherboard zanjan cowley numerically mediums mackey censors thunderstorm jalandhar giordano radiohead discard monet religiously usurp triumphant compositional solaris oppressive fragmentary 
bellingham turbocharged metalcore multan okayama ipo curving kiefer ultraman optimize ovid tempe refine dung hopewell plo clockwork popped wpa matriculated trailed turkmen justifying bleu versatility perú alleges clapham bonham brunner lê andretti eustace punching tma buchan gridiron grounding bielefeld constellations kirov contractions fugitives unify velocities downstairs arunachal lynchburg hamm cannonball coeur southerly gard installments instructs procured matador plundered deterministic gerardo kershaw chrétien improv gaol kata retinal abad surveyors rabin hora sulu mashhad barbary fpc wildfire galbraith popping abram clamp laughed dormitories psychotic remit serif conducive empowering heartbreak innovator chr rotates suv wac corvettes luanda tsr signatory encircled rawat compassionate stirred plough castello westmeath foucault sis sephardic wrongful ulf unneeded skipping lichtenstein reeds tanganyika anguilla millimeters arjuna doctrinal macedon teixeira gophers reaffirmed guts approves dagestan pringle pere kd peptides pimp cinematographers housekeeper shakhtar anthrax centerpiece pathologist boland interpreters tum forbidding penrose ute rubinstein dst langer garrisons intimidate lupin eto blaise panasonic happier misunderstandings debussy capsules gerrard struts burman bainbridge pseudonyms jahan wola kozhikode warlords rapping eos rosas györgy equine tia dielectric cagayan portuguesa toned concentric eaves abbots multiplex precipitated agha hijacking halton frontage millie seduction ararat saraswati dealership mamluk ridicule townlands valois nicht mime hussey jinnah quotas loomis unwillingness homs prerequisite bile icy freetown confucius kuo iggy reckoning lk dunno wie piacenza weave tenets diplomas uwe disestablishment deo renegades unborn jurors petrus graceful emo slums kiki dividend counselling rescinded vfa blackwater broadened hdtv devise claimants whyte echelon doubted bom gastric robby gruber cultured fermanagh thirst alfie officio trolleybus 
hutchins bayesian connaught airliners ounce allele theorized forewing winnings mera fribourg silt rations oft serene blossoms curie dansk tph dey staggered maarten withheld italicized housemate canaveral conley medicaid beryl hopelessly dios cellulose hla huntley equate bbs latimer creeping trumps lazar vehemently blogging roast ounces jockeys concubine didnt banca guwahati worldly myron potters interscholastic carrick cymbals microphones yip broome sully uphill geometrical marr quixote gippsland tesco conseil annales criminology loot edvard selwyn cleo incompetence caulfield liberate totalitarian initiates stylist ugo armbar tahoe yakima standardised therese pino downturn mortgages puy bint langdon deir elmira binder juries sinus abode reword apostrophe moresby appellation mobster parliamentarians buggy samuels emptied vane cie libelous rosetta recitals reputedly chaney untold spar bayonet provocation tvs misread telephony gauss lesion amritsar fabrizio brokerage richness secretive faithfully gillis garvey kon seamus ticino subordinated thesaurus syllabus snowboarding zedong hips kaya firefighting meats valour flocks pathfinder phoenician llanelli exposures scramble altars parrots winton naturalists guanajuato dissection anesthesia courtier garrick signings pickups poltava workable carnage loader jovan fatality goss zimmermann cranial grosse hydrocarbon tse gimmick atf abies basingstoke fahrenheit medically wearer deactivated summarily suwon beasley magically stoner kaur skid wim burgas turnaround ackerman plutarch sybil dorm bec claiborne rondo margo renfrewshire qingdao appended eiffel hakka leif fram hallway rescheduled coatings suzhou marcin cmt confectionery headaches tot subiaco barb permutation quintus semen recast scipio mineiro sycamore bothering proverbs fibres jilin chee lager jørgen pye liddell hymenoptera frankfort phenotype malek kinshasa patria redacted wrecking pernambuco ascap aron pebbles curtin inventive raffles bund oxides typefaces wali castel 
stalemate bains thirdly naturalised searle safavid projectors horizonte kidneys mcnair dieu respiration heaton counselors tempore lf jekyll marginalized poitiers harms clinically affectionately augment suzy krauss chap aussie extinctions mascots environs ncis rota hispanics xtreme extinguished engravers brookfield remy nite aps ,, whats icarus sugars declan edouard almería todo unison ayn jocelyn phosphorylation ici duplex hails qr exporter jackpot compliment kalmar rephrase rem bachchan incomprehensible materialism decoding patras neatly warhammer inflatable holocene masquerade aire astounding leavenworth pleasantries promontory valletta mollusc denys palmyra ferro blooded emd oblong synaptic chieftains collared rho kuznetsov beecher complication goto odis girolamo assyria protruding pons benghazi exquisite fertilization mcnulty retailing riffs carrillo canteen rajan opal monochrome parke uplift unbelievable summarised disdain cheated woking stressful whistling sandpiper isidore judson zeno lili tragedies abdallah paok shenyang cassius airbase postulated bram olsson elevate heracles carmarthenshire enigmatic merkel forman hons hobbes mercado belgrano nmr giuliani wallachia gott presumptive peta obscenity könig traverses guessed arabi kean conveying emulation zaire prophetic jaroslav outposts karlsson chowdhury acquaintances spilled sina coronado reproductions patten aylesbury saladin molded merced pew tcp kmt els lecce altercation obnoxious earthworks teton receptive storyteller mares mobsters pinter everlasting suitcase ravel subgenre lofty chute donny peek assassinations vue lien karst restitution ponies refrigerator loudoun zine doves krasnoyarsk attaché parra pixar staffing rotations arafat ancillary druce humility undocumented jelena phonological lh thursdays superstition dordrecht dismisses raccoon moods heidegger sasaki authoring pao southside fodder kale pebble observatories torsion tweet saratov correa animate cnet isthmian caicos sarkar cruzeiro tapered 
lodi stubby wto compilers puente shrike authorize folsom saito cyp fsa selkirk cumbersome ln pahlavi rafvr filename xyz dysfunctional milt hadi idealism isaacs cassini rearranged photosynthesis glutamate forester chiu humberto fitzwilliam calvados relapse marquee swallowed mainframe susceptibility maoist sonya adidas dominions mex mazowiecki olympians connotation broaden typography fiasco habitable croat stiles galley peralta jodi agonist gordy zeros temptations adverts jacobi matured edifice kishore hearn tighter jovi oligocene uda carbine genders erick elo accumulating dulwich ralston vedanta alston emigrate larnaca murdock bachelors hj dravidian señora juana goldfields calmly russel yogyakarta porous selim niccolò mcdaniel concorde bogart robles epithelial diagnose battlestar savvy kiran stl everglades zak skype feces dagenham phineas pancho comique argos ledge moldavian edwardian straus annulled ryukyu beached setbacks janitor narrates oru bullies pectoral willows criticising marys esque hackers ys hogarth sil estrogen abrasive accomplice atonement dragan hwan bok messianic westwards phantoms earp iceberg backstory sigurd ornithologist chronicled westlake deviations revolve adolphus conductive aalborg benn officiated clowns primordial brabham subscribed quits loreto ibsen dodger headless follies cooperating pinky paleontologist stipe malays judgements reflector jellyfish infringe worldview swe marxists revising sosa pelicans shortlist instruct groin infested intl betts rosalie vocation surabaya blenheim sacraments varese mihai cabral flamingo remorse phenomenal archivist capone abner minstrel epistles astley constables pouch smithfield riparian kiryat pyle degrade fashions sae resented xiu kofi moored bioinformatics peroxide predetermined archbishopric tilly starving reproducing kelso flor nunn aoki colman removals extremism rath whorl aland rnas resupply devolved pressurized nuit gramophone kendra usm hemorrhage vile subtly methuen uplands shaggy burglary 
wiser biker ivar acronyms bluish elia prohibitions enthusiastically messerschmitt summation extremists pamplona nik pittsburg willi consecutively tipping armaments unethical yankovic blankets cautioned rupees frowned taro wager pediatrics susanne contemporaneous misfortune recombination convoluted berlusconi qa vladislav allende rav majid bowes undermining primrose aristocrats jordi azeri putative shrapnel rump bellows campers cantilever formulae obese booths carpathian transsexual aphrodite tumour raga nantucket intravenous auditing subclass partitioned torrance deduced notary udinese timbaland whitehouse hustle kirchner simón tierney toros gironde smarter inhabiting plugged sengoku luoyang nous ragtime arrogance certifying capri hayashi raptor veneration dang arcy trainees aq jardin handwritten foy berths westmorland rong herbaceous felicia stride joaquim mes pilipinas tú uncontrolled primacy magnetism staring vw islets cavern taranto kansai protestors brewed shaved ivanovich shading sandusky aloysius insomnia algarve consolidating cmg roddy insensitive lawmakers wyman ayer rug pendant confines stoughton skewed tessa hazrat neilson parris corning rollover yazd saxophonists xuan husayn farina sexism rede windward mchugh onondaga hotspot hoo airstrikes allowances monsignor foxx raquel stickers reins jm marlboro irkutsk distinguishable esprit assemblage hornsby weeping yelling eugenics neutrino frightening irritation fairey timo qt tweaks epping chained lolita rishi tora vk dreadnought malfunction coldplay shlomo gallows strathclyde enlarge murderous helga breen gervais grote józsef mansell gerber assaulting popcorn oceanography vermilion doctorates praia overdrive cortical osage wag buffett contemplated haste supergroup fatah anatolian buddies souvenirs convergent cern stylish eradicate vitamins arriva recourse grit yolanda ucf rehabilitated biologically kyrgyz bayan eradication ergo embarking maas artistes okanagan moulin sorcery tuskegee decepticons janssen yau 
rcd incoherent tivoli rowling prophecies takeda vettel bundy intertwined philly abstracted baha mela ethnography ludhiana ostrava characterizes perfected dreamcast attainment davie tagline codenamed germantown remodeling lakeshore apron vetoed hodder franciscans lonsdale watchdog ignite mostar laois ophthalmology bookseller woodlawn fenwick crtc gypsies brasília cosmological swollen mcdonough tyrosine andalusian classically pussy bitterly drier traditionalist plebiscite associative modernisation raton butterworth jat howitzers besar cirt streamline pranks compromises marcellus lichen narnia freaks steeple matriculation eccles arrowhead blaster ridings shasta exploratory lawless sangeet noam progenitor furman toon nang acapulco disseminated lavigne mathilde francia raster partick jeju repercussions fluff humanistic paternity burner repressed undecided stamens bursa lis pentathletes bubba vor hover beggars parenthood trotter sparrows crucible forcefully getaway pyramidal tougher salas kavanagh göring rococo moriarty musgrave commodores detour basie miloš valeria lisboa afrika hath knott facelift decathlon lingering tommaso miscarriage fates armitage alle aggregates maxima cracker navratilova felton caruso moratorium hitman unlicensed emirati envoys griswold othello soe waterline luxembourgian beatified ailing donahue cray dunlap jt luger superboy illini susannah pickens hallucinations psychoanalytic superimposed materiel dato triplets kampung hirst ≠ indispensable puppies newberry marais geographers demeanor dreadful anvil sturdy biden permutations santi therapists centrist atl intruder horseman kauai realistically sludge frasier winona mules holbrook schoolteachers rayner renting stalks rioja laramie furs jetty radiology watercolour doge cadence kumari csu basso stabs aides recognitions deli agar tsai charms wrapper donizetti cbi molotov proclaim caballero affirm genomics punter interpol avenida avionics mayfair attrition rimini centauri puberty wraps prompts lymph 
privatisation industrialized persson insulated lala rok dividends stumps stahl kites nymph aurelius whois tikva laterally tarragona rapes capsized nerd tomsk digestion pesticide bougainville palisades mardi phys mods immunology daffy copley teng bab midrash powerless sanctuaries abruzzo saracens jamboree surpass glyn flamboyant bosworth zola sakamoto ojibwe loi myung basses justus dw tsui placid prognosis unbeknownst videotape heilongjiang akademie bhatt igneous lenders vv xue substantiated settles mislead cleary supervillains pies esposito grandiose clustering loftus instincts dilapidated favours kamehameha reworking kenney satin rafi domínguez nozzle tilde vantage unam hikaru vaud nowa jeannie pid irritating cordelia silurian broward stringer formulations headlands coyne rigby cpus kee detriment smelting arthurian malnutrition kodansha tenuous verandah quitting ephemeral occitan hei quests viejo rockabilly maki alternated curing detractors taxed kirkwood sakhalin rab viacom acupuncture kinross punisher troyes tatum subchannel chopper buyout blooming nicolai inflict quigley leona pio lilian dilute supplanted lobbyist enrolment arif fades haymarket cropping discern apologised homophobic slobodan quang samaj lighted veracity decaying taoiseach swain tricycle beatrix sanatorium pescara guimarães refineries statistician ive papilio gurion blanks feasts biochemist taoist leaved olt haim consultations idiots flak simcoe motivate coercion riksdag husky bailiff infused nmi horsham trumbull omsk conmebol harming sikhism brunel kde competency schaus disapproved caltech sheath jodie velázquez hypnosis charlestown nome fulfills herrmann valenciennes cauldron proc cushion apg neuchâtel ece khomeini joni capitulation rosso huan shostakovich activision incursions indre pwi crux src alphabets constructor invocation blazon corals darjeeling lynching querétaro anode jour gijón interrupts additives colvin humiliated kindle anticipating normale thumbs sockets undersea insolvency 
acadian greeley eukaryotic dermot correlate chia sprawling lunatic silverstone vitality anus whittier echoing koo acropolis rockaway plunder koblenz balearic poincaré hittite theravada vn ittihad antonov audiovisual ubc contends coldest skew presidio chilling kamikaze valparaiso cornered turquoise rectified comté freshly watering virtualization waring millers strung pitcairn insufficiently severus videogame dobbs gritty aesthetically halliday adversaries camelot bodyguards surrealism overarching glamorous gcse aomori kashima zona vandenberg hispania anaerobic shellfish finisher lyttelton pediment jozef qureshi conceive francine majoring clears penzance enclosing booming newsletters snell darth kohl backlogs barbra decentralized effigy determinant armée normandie inbound eskimo pará sapiens milošević hatchback ukulele bremer harmonious autoimmune sncf magenta acl neuter sincerity guzman donner perón wallet addams bibliographies pans uterus youngster lieder dwarves knockouts hinged cardiology foaled aircrew guyanese abdication metcalf rout waiver revolved pumpkins parse ntv glitch compulsive ♪ brixton blazing bartolomeo collagen allowable campion smallville subic curia mano combative despatches filament deprive lga icd corolla liquidity wits nsf kindred dianne lng carmarthen infrequently deliberation taos regretted stillwater accountancy elmore ozzy bj dutta degeneration encampment ipl freehold jokingly hammers loom adamant hobbit haiku racket counteract absorbs intimidating actin mowbray dusky lleida publius solicited cma humiliating haile presidium caw pinoy arlen sch chantal blindly petah quack longhorn valais gamba deepak slump ryo cavities eusebius attachments goh mythos sewers nav hippo lys arco physique maserati workflow sutcliffe denouncing bonfire palmeiras tpb guayaquil lecturers rangoon marque città behold cronulla landscaped gracilis systematics mechelen roadrunner multipurpose nahuatl antagonistic genoese choruses shevchenko wray fairgrounds mcenroe mena 
headphones womb reiterate wildcards executioner shaolin transnistria méndez autobot honoris speyer storming circuitry curiously sibley tunbridge václav ridiculed ginsberg divas livonia fibrosis voltages fte qom mellow hecht poughkeepsie iliad xerox soriano stoney schoenberg lucio incendiary panchayats hernando surrendering arr bellini tran mystics oscars remediation egon deducted eesti obituaries sem weill cur quetta compostela handedly renovate kos sayings minions mchale glitter stamina mol svensson giuliano taping reece uninterrupted intrusive slavonia pretended yevgeny vestry dismal octavian pardo nix virtuous eclipsed anecdote watchers pascoe refrigeration unambiguously tonne virginity bianco inconvenient graft locales ricketts noodle blooms vj tack gestation elisha laotian aerobic flax rw sneaks disintegration sacking irreversible healer kierkegaard methodological mizoram resorting cotta villainous shoal oats sheena outfielders pell renumbering channing twa cameos thurman repainted evert revolted naismith freeport hatching diligent bader ejection sceptre irreducible hanlon oriel revisionist naylor pleads patagonia privateers laporte odo bankstown mage widower popper negev oshawa resuming naïve chino segovia theodosius vassals netanya grainger olmsted parable mogul ichi tutors berks implanted pentium dasht svp misconceptions charterhouse mian mme reared blasphemy curtains dickie espanyol hauser dodson snl opioid octagon memoriam combinatorial flooring lothar fleshy forceful euphrates broderick belfry hulls aan freeways naturalistic zorro shorty radu unrecognized bruxelles gwangju softly hilo fanatic nishi corrigan plunkett wading blackmore bonanza pandey tigre widen jewell algonquin postman optioned prejudices superficially lombardo dlc sages micah guerilla talladega platonic buoyancy reverence kinks mah exponentially tsang slugging brilliance doolittle nanda clam vee analyzes martyred phage yogurt sexist woke kimmel lauder darreh bonaventure warburton langston 
telemundo swamy tecumseh ceases nicotine certify arkham pasteur recessed geronimo nameless punic biosynthesis nadir abidjan walcott majorca woodside bohemians ry shinji iraqis parthian executable broadside jalal infiltrated buckle donoghue unattached adhering discrepancies principalities osiris daft dashed strut suffused beyer sill evils baloch ipcc morelos balmain sind disgruntled xxii recite degrading onslaught gustavus democracies roam gilligan adhd matty wojciech recognisable murmansk rud reuniting allocate avex kermit amato coburn purified terrifying propel waterfowl kalimantan ulsan indecent steaming modesto notebooks católica schreiber precedes gre amazingly crackers dirac leaps footer dine potion ventricular tamils banknote arlene rapture sip anew veer appalling snippets sibelius neologisms owain carlow swastika pawtucket sdp bernal mccallum hinge giancarlo timpani vultures corse ozark eritrean mujer prerogative delightful hamza ioan tenors overdue curtailed janusz spectrometry avila conformation jolla matisse ney defiant walid scalia trajan justifiable forgiven rattlesnake agm oxley fibrous goaltenders nawaz hinds asc swifts jörg farmstead forensics pencils televisions cle rockstar woodville botafogo inglewood solitaire himalaya marisa kneeling totem sein heartbroken lovett femmes jester pronunciations javed clegg sportscar revivals transmembrane gora benedetto dordogne realty sacrificial unintentional recitation strikeout shaky tavares tok palate spades bohr cages yonkers sssi limelight sparking proprietors anfield oy universalist exp sommer dy waltrip aliases voc simplification mutilation kleine centimetres lx stalinist riverfront westside taker hydrolysis devotee pointy languedoc dismantling anomalous twente gaseous ticker dunstan decorating robberies rel crawling rearrangement waverly unser pocahontas générale baillie loon bladed issa toyama lomax toluca meena interlocking obstetrics bequest schematic ahli caterpillars newsworthy stalked prog terraced 
kilogram konkani rags pheasant applauded aha gees polishing inhibiting hennessy atchison sushi scarf prudence statesmen neckar kippur goodrich sunglasses montreux inductive ballantine yvette chimes transcriptions cherbourg diversification brushes evidences nicobar ria wirral vadim decoy blackman downes heian kapurthala historique sorrows agendas onscreen spreadsheet espinosa lorient lindley laine tories kfar hendry rakyat marcy feuds georgie perceptual jukebox nederland makeover prost amending airy monika inhibited aoi depressions earthen hallam conjugation ehf spitzer authorizing rolex crucifix adaption judea expressways fillies consumes aves apiece riba plunge golestan mohr camacho flanker lexus blondie hotter manche adventurers maxime milieu valkyrie tuff cretan parabolic newfound skunk iván attaches pristina wednesdays vasquez kabbalah insecure postscript indeterminate submachine filaments buffet goths zeeland heralded estuaries jigsaw oberst occupant kidderminster sylvain juicy hough chas fooled bakshi inactivation yisrael unloading strongholds zoned cocktails rainforests mahendra taps rerouted staffs brookings uncontested tarsus gauges yitzhak ica cre instructing cheeks catalysts asymmetrical ironman skillful huffman contraception dachau rivières misunderstand acme refereed ashoka barristers ncc cuneiform hypnotic daisuke entangled rationality typhoid banter proximal trademarked ranches boyhood homologous reactionary classifies amine poirot grandeur aortic unduly northfield gist transitioning enclosures mcclelland leven cocker emmerdale enrolling paulina embodiment reiner inquire lender yakuza vail sheryl yugoslavian goshen westbury briscoe bate favre birthname hurd polska affections relinquish torrens poul woodwork lwów meri outweigh spoil drosophila daw heine jervis schmid reap nlm treachery imperialist ora elegy pertain plaid klingon barnabas kurtz footpaths redoubt beira madan sucked cannibalism goebbels elaboration amalia devereux curiam celts workington 
lott reuss burnaby becket vitae oeuvre epp workmen roundhouse lancer sheik astrological stele infrequent valera auxerre resurfaced aversion pathogenic hyphens ainsworth treacherous cef edn resettled stool ranga bofors scorsese olof irritated testimonies wacker nomad modulus filthy haram gorky noblemen kayaking romanov gmc luhansk infidelity toughest colonized designating snipe ise rosenborg dismissive haakon collaborates weightlifter enclaves startups arca addictive oxidative colonisation ahmadiyya taco setter upholding buzzard dvb oakwood maids stanzas listens orf sitar jackal conspiring constitutionally austronesian imitated pinocchio yp groundwork tequila orbitals worthing roxbury hedwig detects powdered cranston cremona lenox aleutian sst lilac eparchy achille cleft landes mises toussaint pastel firestone esteghlal deseret creamy deerfield toshiba dillard bmp flagstaff caverns parishad steered portmanteau sentry sternberg comm omen rts symbolizing premierships seibu secluded vries gaels endocrine mufti aspire stomp starcraft rarer ces appleby chime brazilians mendez ansari cybertron depopulated compatriot lanier hemlock sparingly boils edie candice nautilus lemurs reagent redeemer pura choo unharmed ochoa hainaut bidder ogre squat fowl flares magister ifa coop pancras blurry rivas disparaging smokers zz lenient lewin evo cottonwood wilmot vir mosquitoes prefixed hydrophobic consensual condon lewisham rios furthering boosting equivalently ekaterina denham snaps goldwater métro deteriorate downside aosta cassava bhubaneswar grappling confrontations jah watermark cremation placer benefiting forearm bak nonviolent ashcroft goodall hpa sidewalks burley phu boi aspirated incl arkhangelsk slr revd gros pushers aries shrek wilma foiled diesels colette asw bluebird tien rewording guntur pierson technologically revocation abercrombie bryson contreras miao biathletes erode placeholder fend inspirations esplanade magda danse diverged flammable mcknight chelyabinsk ej 
rhondda molde cadbury tahir calumet canonized salter req mcg shyam diabetic tightened timberwolves starbucks ronde grained relaunch architecturally alois usurped dosage nostalgic ghazi eglinton desist estes evaluates renounce farthest mahon facades imams fredericton gilliam demography beowulf stave pubmed veterinarian frick chf irl gallup doodle farnborough hinterland betray cusack alford solano sibiu kamil vinod academical sembilan diagnostics heinlein proactive niven laurens cannibal ext grilled jolie osmond wielkopolski gcc nitric democratically spills hrsg merchandising unhcr condominiums multiplying fanbase antidote disabling popeye warranty overcrowding gradients makati lagoons sokol wasteland debit prabhu methanol tiered jacky zealanders susana outbuildings tucumán intrinsically nez impede streetcars gatineau edda chauhan volatility martens schoolmaster eo jethro chambered adsl nonpartisan sanctioning amends hoa searchlight airflow rockville bebop vibraphone dekker aclu toms opposers unavoidable chicano clueless tigres confessional hangars rainwater misdemeanor kissinger khabarovsk mugabe jamieson bowyer telekom schaefer touchscreen erdoğan jpl resumption sedge franche bode ade loanwords nrc gait outcrops instantaneous noche vieux itc enforcer fh batches sag ilan laplace hops sachin backhand heed koji alternates ladders cures placements negation confers archdeacons trivandrum retelling eminence lada umass delusion hindsight bor fanzine callum gazelle towson hallows irrigated thirties thaddeus dsp bystrica championnat baikonur veda nin remarking cobham marlow timeframe vesey craze varela cuff shanti antebellum startling guardia diagonally maulana haaretz pizarro pluralism grouse tripartite moraine barnum regenerative knuckles bene finney toa lifeboats maturation incumbents plunged prokofiev soundcloud ridiculously corinne parentage shaftesbury flatwater fv teamwork conceivable pq gwent spitfires happenings gaunt nucleic entitlement busby sputnik universidade 
bourgeoisie mircea ibadan dalrymple hessian patrician priestess pease corazón oblivious nationalized silently unknowingly scorpio recuse unloaded manifests counterproductive wiz ati discounts swung azhar purgatory reddit vicarage smugglers centralised sola accommodating dfc descendents carbohydrates petrova needlessly abdicated reciting beatriz contaminants ramones nth baptismal congressmen nesbitt enumerated beitar infusion moma screenwriting procter rattle hailing hashimoto unicameral bloodshed tarrant googled intimately epics nailed busts cantabria tui deepwater allgemeine speer katha calicut footy tuesdays muni ahly defenseman discouraging dickerson dissolving aster disband ripple thracian overtones polarity blasting gannon morin rideau ryerson ambulances specs shaykh sightseeing paderborn hradec halley kcb rhododendron unconnected bracken lino seong renaud fenway gardeners cartographer sepals buckeye drc schrödinger mchenry scatter ddt sándor archimedes composites tania ucc corral bigfoot poetics shetty seabirds snuff shinkansen slovenes capella bibi masculinity chickasaw reminiscences debtor adrenaline fahey inf aram dominguez exaggeration ahmadinejad thane yad candid communicates horacio pinot phonemes dominicans dipping exponents lowers sculls swapping assurances crt firsts corsair solubility multiples basu valdés franconia supersport grids militarily grimaldi rawlings nitra musk electrolyte tread applicability omg nightlife elapsed netanyahu banbury ringed kerrang tekken pruning hovering manipulative buckner sith zephyr ifc giraffe hispaniola mcmurray mitre kaz ttc hagan turntable philo amused mackie unwritten epidemics bumblebee ics fedex womack barbosa frisco esper blight seeger imitating subtype shaffer gilt thrived malden lorry procure pw fujita overthrew thiele limousine casually butts gondola amorphous persuading shriver budgetary nay rolfe burnet displeasure synergy søren riverdale diss drogheda comstock orthopedic calculators boardman cadiz 
agonists braces brendon redefined laced smoother comintern chainsaw pte conceptions holla whirlwind relayed acetyl bogs hernán freedmen ishmael ingenious dyed overhauled straps newtonian primavera liars valet soledad bothers hashim evolves esc spiritualism theophilus participle byway strabo categorise podiums translational processions possum brando capacitance stupidity disrespectful dichotomy leahy munroe evoked romulus vallée rua affaires intoxicated sprite woodson rocca mercier standoff phobia anz zbigniew payable tas gli crowning probabilistic gul amory tupolev robbing stainton convincingly karla consejo jagged stitches mesopotamian ghostly metallurg oberon feynman islamists sow iz orally hispano contradicting curlers deterrent lozano credence braid squarepants hervey baa wannabe elasticity visayas checkers unquestionably ariane waka asad fireball quell orne bubbling ganesha whispers adel mediators righteousness retroactively metabolites reciprocity limp jv shouts kingpin kongo stately kino grooming hordaland decepticon marv carbohydrate sequoia unwieldy anglian gq outwards dit grandsons amps hydropower hessen jama república segmented renfrew gwalior muppets brat aau backers boggs croce dingle denominator quaid uva kickboxers tigris quantify affirming hy kendal chanson abstinence salome memes seigneur excitation kya disrespect worsening bouchet ponder sucre mahmood rants charlottetown ewart shomali clem lytton rosy kristiansand torus kerosene cpl pcc archeology fleischer strangled newsroom inconvenience davide barbadian punishing prairies schiff phalanx rhea confucianism rojo ranjit noi ame storied lessen lynda lilith hw monti carlsson lira mpa intelligible garuda bengaluru worshiped jef yea tonbridge stylistically harriman strongman cowardly psychosis ism prieto dodo sangha sanctum cárdenas qantas agni rhyming confidentiality ramifications lte gopher iterations rectify curfew nurture boulton booby condolences snapper karelia hanseatic mcguinness slime peso 
odysseus locomotion bauhaus catapult smh excalibur dominik gags mackinnon novello reductase alleles byers preoccupied signaled paraphrased druid scant visakhapatnam nenad apoel warburg spoilers oceanographic lanark scc aliyah habana decency pals vanishes etudes decimated eliezer xun appropriateness mitochondria sod ccd cunard holst donne rajkumar slapped aligarh kalyan honky tomography apprentices prinz symbolically asif evgeny gert millimeter grozny hanau warlock sophistication dentists melayu actuality parson irreplaceable narrowing gon uber sunda booksellers nie qinghai anh catharines tonk dsl artois catharine chem williamsport colouring chp circulate scratching overseer aslan hatched ferret athos ubisoft sinners glynn cristiano songbook expended itinerant vedas separatists shilling doria msp moreira eunice supervillain pores affirmation morristown adonis maritimes watchtower pristine yomiuri masterson capitalisation talon rayo fingerprints maimonides flipping kuan naidu ★ vick accolade nagaland nx sensei tacitus geneticist burgers serenity shih categorical gypsum stingray drm aragonese antiochus harvester virgo ewan deceive sneaky caron belknap editorship phelan wingate pashto vistula oliva whipped benetton antigens clooney polymerization loup zeke haldane begged zoltán melzer sturt carmelo photographing sainsbury bonneville bastille rollin ratchet oedipus reprising sowerby rizzo endlessly overruled matron conclusively walford katarina parvati separator oakes skis eulogy schott sich aquifer cate esters sugiyama bataan cuyahoga braddock cultivating alhambra upsetting impulses forwarding fortran robeson cthulhu shovel ingestion cambrai urmia legalized docklands scandalous dalek paleozoic jodhpur truthful nuanced tangier allie scania aikido sverdlovsk commoners tus highgate fairmont ansar restraining dike krupp pompeii omits avraham depaul sinner ragged wtf monarchies utilitarian horus fournier georgians trad mots musket hulme runes pledges merrimack cuomo 
reinstatement censure dartford trusting underparts aloha kfc hiroyuki shanks django bessarabia withholding dacian linearly meltdown thornhill lj renard oka plano roshan distrito dougherty catered mysql enrich sinfonia nederlands parte shams digs wallonia alum interracial overtook vela lw suture cassell carolinas cumming lamas shakes cabernet palme sarcophagus woolley brann ancients aff athenians matti towel bol shielding novara iwata magpie suva kiowa associating zig permeability cesena shoppers memoirists jäger shroud poisons gallatin qajar mites turboprop disturb cuttings cornhuskers lachlan monteiro chorley intifada roos deflection heinkel propensity dejan abolishing mambo aquarius waned howl golds eagerly gulch craiova cns vigilance hydrographic união ilam unlucky manoj connective entail restricts pentecost collaboratively solidly bead dredging bickering saucer soler gusts impressionism entente boreal ascetic fannie nominative harvick rearing disseminate randi sondheim minimise saarland rolle contemplation dinners fluminense shinjuku characterizing subfamilies olin vinson tomahawk matte seabed kinsey fsv indica nuke drawer psu cuevas sadat firenze carbide webcam shave schilling oceanian conforming rajah gotland overhears synthesize thong mica transmits arran bdo auditors disregarding vivaldi subdue chakraborty milburn nutshell perseverance mesoamerica popov molding westerners prosthetic astrakhan proclaims albury srb boycotted nou boast sofa seagull brackish overpass leviathan bagan unter gerrit upheaval mississippian underdog nanking sarkozy storting hemoglobin bake modus solidified fagan zis corea airway renata spooky joyful rarities diurnal svt alte hearth enfant appalachia lisle meticulous apra edmondson radom anya sartre chiesa dessau reconstructions blasted geary totality zonal wir lebron roddick berklee ppv tarantino bild agitated adalbert abiding usaid respondent willed dwindled correlations limoges longfellow triptych espoused pallas mccracken seamless 
napoca levinson nava masque gai hera crankshaft changsha moles amer bonifacio oscillations coniferous badajoz stupa ponsonby copernicus schindler ihl scalable pawnee attendances vases coffey grips maja elie battersea manageable artistry dulles oni ellesmere kohler automaton middlebury cáceres hae waldron furthest invader entailed wulf rolando suri liquidated defection chalice bandcamp rephrased jørgensen bharati niko schule oxen countermeasures precarious dyslexia kudos beltway ferdinando mcdougall windshield selina prešov appease pontus custodian sanborn esports purplish mmorpg cfo bitterness existance bloemfontein briefs inclusionist inflicting talkie cheerleader brine shedding kootenay napoléon advancements humankind businessweek syphilis hmcs improvisational yonge microbes marsha coulson sepulchre allotment valenzuela ♥ sxsw transfusion wares munch vyacheslav sewn jakub definately trending heston awaken closeness gruesome bucky calif watcher artiste rambo heng madrigal haw maddie brod chrysalis crossfire petre pedophilia charlene meister intercultural commissar albin balzac asquith latinos overriding mcphee herding lucid muskegon iban eights funimation fruitless rainey idealized guesses marbles loy buller domini presse rapist wrongs lancia elongate saliva thanking matchup wooster osteopathic istat krueger soulful sprinkled sodomy irresistible giorgi latif emmet inhibitory meh constitutionality republicanism kostas chiral slows cartographers whitechapel ailments sucking carina thaw sula bbb baal hildesheim rookies excepting benning designates untimely chum complexities hiller contemplating agro cartman swampy brilliantly ryazan slugs lucinda strachan cruiserweight jewry rehman fsb foyer tilting roost gallic annotation noire cta disingenuous desi dared wisely botha aroma sabbatical beaulieu ioannis genitalia cookery combe intentioned redford appalled pelvis warblers darul dipped ® isaf pancake nazir imposes resurrect encarta cartography stripper xc gad crooks 
squires musashi mears chairwoman chromatography asociación infuriated materialized woodhouse graubünden compel plasticity chartres supercar emphasise lutheranism bridgend theodora semiconductors venous bookshop aerodynamics caetano mattress centrale wodehouse affective gravitation substation vx maggiore sergeants bhakti commuting oxidized warheads ethylene muses alibi crucified outscored jak impossibility iau shrubland alarms rebounded biddle stiffness lahti parkin missa krieger finitely directv waterbury funniest bouquet initiator fez shipley storyboard diluted fdr paes neonatal fim chagrin okada niue gaylord coon grist leaping alarming vulgaris padova reconsidered usf redistributed brookline yost futurama cnc naturalism unlv drago ecac ooh inciting disruptions parlophone peri hewitson mansour coursework modesty babes sensations kinetics hv marathons petter sororities endgame marseilles ligne lise stressing asi whore mackinac kalam sharapova bathrooms osceola repeater harford iff snatch sterilization webpages mcewen ™ elam pickles auerbach apparition seaton orientalis coyle orientalist coups kintetsu dobro reflexive oaths discontinue aude ntsb reload mammalia luang ingolstadt gunpoint hatchet pierrot quatre nola hmv midshipmen roundtable liters yui xinhua stig facet mezzanine riordan nuances batangas pacing ferrell hilliard cla bundeswehr crowther lizzy rif papyri bogie leed maricopa enlists clothed girlfriends roadster suomi diaphragm philippa unger paddock nha inks misfits emilie maneuvering gurkha etched debrett sie slanted olimpija winch ntsc necrosis endogenous traversing capitalists shrunk giang étoile skink loring murakami whipple lampoon mckinnon mujahideen coronel penchant bridgehead benevento inhuman floodplain dystopian juggling canaria slur thwart inject jahre kush unwittingly workout clientele sauvignon thi rubbing preamble twigs secreted antioquia misrepresent borealis coalitions redress gábor marquez conforms reusable outfitted eerie freezes margate 
messier soups lettered postpone sportsperson teo paton kp auxiliaries majlis printmaking cyr nicks affixed gyula weathering relieving supersonics proposer blackjack depressive stryker hulu regains oe alb viewable recon falconer alva tremendously sling morte dwindling romsdal ell arbiter talmudic joked braganza recoveries benzene commences renzo sadr queenstown probate jem ginsburg porky ehrlich kenosha schengen liguria speckled bulacan peloponnese pdl quarried industrialists csm dann abandons nymphs strapped complements skates disguises onslow sufism goulburn brevard marissa supermarine jimenez chl roskilde unionism ervin sadistic dimitrov radiating nightfall amalgam nyse palawan fuelled herne checkpoints stinson hustler helpers bastia amazons limbaugh presidente bosniak wh debrecen bowery pampanga hcl routers hydroxy juanita roch precede codec mademoiselle cit roewer noonan insightful lynette orville mortuary ftc cellars régime frying roofing megalithic leopoldo politico caithness formalism gauthier autobiographies vitebsk identifiers condescending nap noone carmelite maharishi sequenced dodds mcarthur volodymyr räikkönen spoils shaheed sydenham murong europaea servitude equated streptomyces mordechai cytochrome santoro rac perjury angoulême kao commercialization youssef brow ranchers pala rosberg kowalski corby lyne tiled chautauqua kirsty mischievous ruining cocoon changchun weser mattel esmeralda turismo azalea smt istria cosimo standardize jacobus cytoplasm allege polygons playgrounds spammy centering xenon spraying centimeter ¼ bowdoin minimized deems waddell crafting malformed inundated monolithic habitual relaxing sever ats saki hostels levee occured northwich schweitzer stoned islami excommunication mendocino lindy orchestre roxas conning zap mockery redone tacklers palakkad ranching bulky detectable unwelcome medic albacete shad countable jaeger auspicious complimentary seams levante zeiss restorations folders modernize telstra independant chimpanzees 
bodywork doubly riva fetched nuncio arcane mombasa kirche srinagar leamington modulo cca ilocos takers sylvie darin alexios retrieving saxophones ornithology cusco rejoining castelo mueang indistinct louvain cobain cev pde melchior criticise docs toxicology collegium blas pellets printmaker beaton suarez barros dutchess bfi eldridge childers manon glebe frankfurter motley whispering meera innumerable algernon spectrometer braced grin attains recherche misled fenced arrondissements alluding subversion illuminate outcast genève bellas harrell sidelines rote lambs laughlin boosters crag hud circumstantial seng jh rik nist iaea kyu branko mutt curses sahel montero meningitis gravesend naturalization corman suharto pane showbiz nittany repentance distortions whitmore adagio gatwick khazar pur fey abdulaziz restructure coed winn odeon metrolink leeway referential latrobe lapsed newburgh ruptured garnier towering geller reloaded wandsworth schröder wattle workstation vole hingis leanne ferocious captors pomeroy cooney dehydration plainfield satoshi proquest illuminating destitute gurus razorbacks cheque farnsworth adkins groton piloting realigned traffickers matlock shima sportive nepenthes cobras langton obsidian ibaraki tryon navarra parametric croke eunuch dill stow foreclosure nevermind lowery caius hoskins dona athanasius taskforce fright polemic motherhood lefebvre rampart davids lettuce micky baines impurities terrapins aube tem mens hassle excellency evangelicals putney methodism oysters dissenters caesarea fulda identically dalit forza archetype foresters cassettes arousal semesters rugs continuo silenced hotmail bumpers enacting hendon ias tartan deathbed lowndes ow mishnah countering cantonal uproar insecurity nimrod droplets marianas untreated crests deptford emits warne northwood beacons flyover nationalisation walrus ibf setlist mutilated redshift glaring chauncey scriptwriter klub lefty smack testimonial briton pollack symbolized minot mirroring 
schoolchildren nuneaton buttresses tawny whim conveyor carrollton typepad deadlock astonished gonçalves helical hallelujah lia pkk beret vestibule upsets epitome balinese tenured battlecruiser gibberish shockwave winterthur displeased cassel jacoby distanced traitors angelic manta tabla wield fahd lakeview unforgettable scilly shabab nsdap macabre uconn moulded dera agglomeration molloy ciel priori hijo wilberforce painfully scion ubs akkadian episcopate umbria cristal slid beet phan progressives biopic fells dalí logarithm tumultuous inhalation mindy acquittal centenarians waved talisman rammed kabuki rosewood frei hamasaki reverses holger incandescent poignant rcmp ogle diocletian brasileira spartanburg chak jackman unfolding aleksei azur waldemar reckoned goff unaffiliated wolfpack hatcher registrations divination solicit admonished alters repatriated disuse ayumi rodman lindberg hornby roca padang soaps sylhet betis sulfuric cadmium impart molds treehouse subsystem bottling saar lipstick prudential cayuga jillian vagrant bipartisan berlioz jails inexperience blois dukla solis ciara eloquent bakker abt parietal punctuated danbury lows splicing impatient dido besançon martine whitlock iaf fondness roxanne thea restraints remedial whedon christos nîmes tenths hepatic foto passionately cucumber waldeck bicolor icing helios guang redeem arboreal redgrave homeworld badr cda salvia lamborghini zhuang ration hawes shakur fingerprint izumi jeune restrain vai hin geopolitical morelia ceredigion revoke geodetic cloned lula comme maniac featherstone parsing lingerie wald miklós belligerent candace stumble altrincham beige ancien dalmatian nocturne belcher raining iota leds baroda afghans inspires hippocampus immortals almagro lavinia khalsa rutledge radiated disproportionately bionic hafiz earths germán huntsman shing spaniard visser inclusions dol sørensen supercomputer ludovico illogical retarded pfa pundits mordecai unproven antisubmarine leconte remodelled bdsm cité 
newry camberwell lightfoot outburst libertarians precincts flavia snowboard correlates resolute asynchronous plastered discriminated ansi kachin zealous penney negotiator haruka neve aretha ince algonquian sandford kingsbury emigrant mundial repositories silky tribeca disorganized hotly thessaly legalization ary sprayed humorist skateboard stains rosette bimonthly dietz mortensen tiki duan injections tron mediacorp negra resistor marimba chimpanzee capra somatic sardinian masturbation mandi putt antonin synths rasmus deakin corvus confiscation morel wanganui vardar wec bahamian daybreak peloton crispin undetermined abv dieppe alberti chauffeur ssh slo joliet pickle regalia schrader radium intimidated nigra steamboats sailboat brt disprove vives pitbull mgr bhosle udine sedition bachmann bacchus roussillon ocampo harwich swabia mindless fukui collusion karlheinz trofeo hattie wrecks huerta injecting sade rework taoism martinsville redeemed philosophie unworthy howling osamu brecon duvall utensils infield sensual tyneside despised bhattacharya taiping manus taras bwf pondicherry linkin highbury abreu gulag pollination sforza karzai garrisoned unclassified pave haha darnell dionne cris clarissa dulce llywelyn albino osvaldo ssl inga bighorn feu selassie federalism knudsen brevity intruders lenape doreen oleksandr perpetuate ghostbusters sangam xr nicer touchstone paramedic buda hoi narasimha summarise soliciting fora newbery singly higashi ugc wg dax phenomenology rudyard admirers bfa radiocarbon nunataks hone rebranding pakistanis converges musharraf tumblr hawkeyes nah pears officiating valdemar anglicanism bjørn homing kol ukip noc gretzky jacopo confuses bhavan lorca solihull adv ornamented ponta etymological torchwood enlistment thresholds zaman jahn clarifies rigor firewood marín gorica parenthetical oriente ambedkar abbé yourselves sephardi ennio glaciation papuan dassault occidentalis aerobatic counterculture gdynia rationing divya madhu acrobatic keita decor 
gambino tic nir cordoba individuality squeak chariots barak minoru halting olimpico macho hybridization lun brazzaville acknowledgment maghreb pharrell byrds kirkus edson laila tiananmen lmp tweaking cin secularism taxable defy drawbacks hsieh pituitary norwegians bracelets hoare oxidase quarks ethnology hamstring plugins positional ker poitou capitalised entrant sayers noh posey shoshone frantic gyeonggi peebles paloma moll neri hamster galena deregulation necessities biopsy spiritually agassi argentino bullion drawback baring mug lash addington scanners distort christiansen thorium molière reappear gisborne mayr wallingford analysing cheetahs refitted wada eldorado gabby faro mirko abuja dravida trimble budge chastity cenozoic tugs holby vandalising livin purview asuka intermodal manawatu chivalry vendée fedora chişinău hauptbahnhof mortem rabbitohs kreuz santorum glenwood arbs fertilizers exalted poaching copious fortaleza syfy celje supermodel shastri causation indentured karthik ahern miyamoto granular famagusta coolant baronies kot nicolaus ghat sauces vaux bernoulli clutches etchings ripping bligh husain breckinridge gaby xander metallurgical padded rioters incursion seashore cosmonaut rhonda coy cowper globular summoning mumford hales militiamen menachem uncensored oud parakeet seaweed predominately starfish paradigms pacifica lakh mauritian beebe cyberspace ingalls whittington jamison taichung salons bran ssc conceptually enlisting monsanto cheering mcmurdo cuisines cmc paragon martí oldsmobile brookes instrumentals stipulation charing yeltsin diario dnipro covalent nutcracker atr pasco schulze bridal ores underdeveloped headquarter libro sto multiverse clap sqn hermione deduce zayed donbass brunt nong chua slashdot bamako bueno fresnel mottled solace berthold sturgis caravaggio regularity eon selectivity lifecycle enthroned spinners swallowing devolution sdn stonehenge mountbatten pinkish renoir rowed cigars situational addicts tweets sds oeste kearny 
beauregard visualize replays radioactivity littlefield wanders newsday desegregation tribesmen gorillas lotto lovejoy frida apocryphal innocents indented affectionate itt midwife ringer scr argumentation lse athenaeum abul camilo oakville safeguards microprocessors extradited africana japonica mismanagement gulliver warmly cote sitter sextet showroom saville turpin trucking zionists rami menstrual dogmatic pima delicacy napalm torment sawtooth deceived subduction appomattox tabasco ghar poised contrived methylation squeezed juventud pagans depressing factbook dac hotline creditor craftsmanship gtp salina galleys guidebook mimics istituto brothels contours standish altoona paras yakovlev razak mangroves rotting thiago cq margherita caucuses rar kiosk repayment capillary christiane kemper dancehall undisturbed fundamentalism gunter erm bushehr rascal feinstein staining notched kamchatka canoeist sufficiency alcatraz bilal stettin simulators approximated bagley lads brantford lubin comprehensively ueda carthaginian anchoring feeders instrumentalists triomphe galle zim penobscot concurrence hitchhiker kula modernised myer repressive madge bohol norske exclaim goldsmiths hummel carlist darlene fateh indiscriminately ronny jonson liberec marten erotica czechs octane leaflet albatros baz pinkerton heisenberg bloodline scrimmage hufnagel loo fils durch sata afa conwy ornithologists conifer gaiman southgate ikeda raith iwa gorham tensile seti railings colonna greaves pate regenerate stuyvesant bannister northridge monrovia brun munson dummies atlantique mavis seer brough bartók dci glaze zahra fourths mailbox silo detonate hula patricio kev talib achaemenid dilip imre nostra acidity aec plating mahayana chaudhry mobilize celta andante schooled caterina dresser nene hala stigmata tzu railcars superpowers frampton pardubice defying tweeted underhill coulthard airman heuristic personification masson shangri eyebrows barque gamblers preakness shabaab mahan iridescent hervé 
commoner sceptical davos meteorites banque batten saif shaving billiard fleury ryde cordial aimee vasile kelowna salvageable israelite darn giorgos foggy abstracting autobahn sprints tms detachable accommodates ilo wildfires mckean gillies afoul sukarno harkness customize borgia reliever pvc fleeting ephesus proletarian grasping electrician mineralogy tallied amstrad stances patio meditations mts goodwood türk osce worshippers sars thrillers eukaryotes hoff ramparts nontrivial ferrero baez wacky haddad subjectivity hellas sauer eicher refund windhoek chaldean retires hossain ratu portraiture evangelism overuse ronson afield manuela wolfson brompton kama demarcation meghalaya urea elicit funicular gunma risking pravda mouthpiece caesars sarmiento pryce imogen duque millimetres metaphorical thankfully onus grammars juror overgrown causality eurozone ascoli cation chaves isolating attenborough trampoline waddington scriptural evangelists numismatic wrt solomons wop wests ploy spooner viru crappy abbotsford sacha hedley tiraspol aloe frankel cielo preis ratify zoologists leninist pedantic diagnoses ostrich germaine outcrop spleen rigs parrott dismayed rendell panelist reba gleeson measles conceivably carols sá wayside shingles bailout refraction sheva stetson unqualified fiend fürth starved autry subsumed revamp occupiers boasting propagating confessing wingfield spacex ticketing stringed tiempo rattlers aisha cif clipped florent buoy pikes layla kamloops comforts mandeville psychologically embellished mull paramedics cetera cotter coker ragnar udaipur polygonal reshuffle battlecruisers kazhagam sasanian pharmacists collie ange senhora ghazal trnava particulars manhunt osa livy florets iwate fundación urgently fosse jeb homburg vacations tithe salta rossetti polyhedron laity shelburne tisch fille chul tilden anemone judeo pinochet mee loophole intervenes ump deletionists lassen dossier feuding shearing kutch geddes flirting camillo roda pawns evokes hamill lamented 
evaded cormorant tremblay barrows arbroath kagan bbr rickey plumber hoods intersex rah rowdy multiculturalism sortie retardation canvases astaire cafés savile borja tobruk colorless regimen emptying hagerstown billboards geophysics trento antiquaries afs ellipse elwood pancreas sanz régiment hopi opportunistic hayek amulet ambivalent npp prek belgians tenement ingested maidens pixie platz schaumburg kaoru perpignan legume sooty dmc lovelace elon gödel tatra favourably cir dunning stewie alvaro ducts scratches lindgren vern powerplant delineated muerte norah mongoose sik pompidou extraneous nurturing bursting fastball ocular balconies farnese feathered siddeley bretagne lichens hydroxyl conspired hammered healthier bmc logie publicised fieldwork characteristically pars agustin exec overwhelm reunites prefects kievan studded camus nuno harland paralleling ingham eleonora conscripted slugger stabilizing brushed triumphs antiaircraft disclosing bentham lobed boilerplate interplanetary momentarily kanawha evocative caricatures shmuel turkestan yeoman titleholder snipers dignified meager dou embodies automata yarrow fermat capricorn excesses latterly fieldhouse hypocritical inhabitant bakers lawns speculating clipping oshkosh copland shibuya haat decapitated rung borrows strikingly fawn grodno nomen nuova flurry aggregated stott cubans fuzhou rodin doubs suing bearcats haredi afrique fait aso skidmore overran disintegrated yoke reine tossing slider haden aleksandra umeå topps spinoza ota capuchin metamorphic rapp dornier caldas discernible scalp wilt exclamation freemason courant zoroastrian verkhovna scourge haller impacting tioga virginian impressing sayed sèvres hijackers shou oerlikon juba agreeable shortwave plantings spitting scrambled harmonics administrated campsite retraction knack pes bathtub militaire chesterton nra jammed izzy shatner qiu alms gallimard hump copyrightable uribe polytechnique peres generalizations fragrant barnaby vl pcb mccarty kunming 
bookstores congested depository pilkington consulates mew oem maranhão requisitioned rajeev breastfeeding overcrowded vapour bobbie bhd hebrews recounting oulu brion pundit punts jett sloth troubadour holographic resilient jaan impresario altenburg dorn amigos netscape bossa dui veritas physiologist argon complying girder volker valli commandment nicaea mite photojournalist dahlia beeching shimon weighting egalitarian nagai motherland gilgit figs piedras pathé majorly uzbekistani woodwind bebe deference parenthesis delimitation tutu loudspeaker orsini wayback marshy kwong geneviève emanuele vladikavkaz mynetworktv quattro tangerine gelderland piss advisable winnebago deceit succinct dissertations curitiba salih monarchist reappointed quickest affidavit cordova tek rogues nis abound amputation elkins hagar heretics crowell ratna palatal stumbles damm melaleuca hsc lowther adventists melanoma crump linkages censoring bight silvers valdivia potency faisalabad diddy aragón udo dionysus dello followup reflexes eased appel afrikaner discourages plankton mangeshkar endothelial paleocene museu mystique lufthansa impersonating katharina loudon cnr crabtree mawson preludes povs braintree simulating overtaking mclachlan melts snes comin supercharged nossa stifle feline federalists mythic swordsman messrs spangled ifd acceptor tuner ayutthaya hoe steph bunbury ohrid gillard kohn fluency galina führer wy palacios dissociation nadi gotti stocking astrologer dimitris sarthe nigger adrenal vivien chipping experiential antivirus aut estudios detainee adheres basset bau councilors mariam condoms starry abbeville presuming mot warmest onyx symonds neuro calamity globes sunnyside nabi futurist intensify babble amador footballing preposition penalized dinghy bras volk microsystems reagents busted librettist envelopes paf ucd orca arian chesney drs securely humanists flavius holme medea fausto ironi philately tampering labourer profane herons rapport mayne nationalised barnstable 
callaway displacing verdean defied wallabies posen bromide kristine gnostic acetic deflected carrasco clarinets kz bannon viscous unoccupied airwaves filth korda mailer genovese quilt titian trish nothin ligaments keiko bakhtiari parlour importer pharaohs pittsfield monogram preeminent battlefields supergirl congratulated partitioning reorganize crm zwolle lipids vitale rebuffed predeceased soar urbanized cygnus columbine seedlings vientiane reedy birdie rigidity furnish vostok csc ogilvie pounders downright catalyzed endo pelletier donato relentlessly monogatari lineups nabokov lithography daejeon operationally culminates uterine cortina ela betrothed péter hsv assuring cahn infanta idioms enmity rotc customization upriver propane woodcock renditions delusional lajos hirise sadiq rotax argento klondike highlander bovine lbf ramblers autoroute perera topper brandi wiping heerenveen angelou sangh tanning repechage habsburgs insure staffers rfid trapper sura stagnation strengthens regio puig maier pena sandown manoeuvre obliquely depeche slaying rethink hilaire dykes rachmaninoff daria tainan jafar doubtless fla carmona cached slurs cede megadeth rochefort creationist pussycat interurban allium dilation extremadura kurosawa flakes argonne entertainments molybdenum roleplaying childress hombre telepathic laughable bourg vinton freitas wields ffc vid altona guaranteeing monkees aleksey partake sharper cardigan trotskyist injures bia datuk vedder shukla amal thatched oriole infinitive bandstand fontainebleau episcopalian mallard busters rubles disbelief goldfish psc cornwell miramar intoxication jeter botched sérgio regius isobel amara wurlitzer pärnu junius jab deliberations dyck siri hungerford harmonium differentiating cerberus spillway mpc rhoda bons mingus seitz derrida dillinger rediff cmos tactile witherspoon unattended jac backpack ptolemaic kincaid wgn goalkeeping wink hinges ramesses masking surinamese furthered shocker snapping chipmunks ect outta valparaíso 
monologues encroachment amitabh absurdity smc itinerary risked carsten exoplanets shahi trilobites desoto werke verifies julián pulau kingman dissimilar lagrangian incapacitated bigotry sills slams aliyev intonation stavropol virgins spiro pretenders paweł tana pissed suspensions kebangsaan blueberry converters matsui rogaland blob hodgkin cpt caches gunfight marymount cleaners proteus spectacles premiums kestrel tec parva blumenthal redd sandler termini lowly turnovers kwang crompton signified stratigraphy goulding superfund expanse brito heaps homeowner purchaser drip hildebrand serialization delusions legge prato tibor extraliga shameless qe cheapest shiga tireless triumphal roadblock annoy bottleneck compress contagious dinesh emulated agios trombonist ctrl libertarianism brd believable makhachkala whistles lumen kalisz feliciano catalogued guarani nacht angst tsing everytime tff restrooms waning precepts asterix telecasts questioner whittle executes bondi sprites puducherry bicameral musicology pineda ninh mayotte noblewoman undulating ccp hyperbole adp pola livejournal rosales complicity perceives safest limpopo convair forgets beal prussians baumann jahangir carrots tandy caddo sewerage bristow athlone stadia nieto sten logarithmic lotta warringah aymara dela jonsson kaluga innovators flawless tohoku bunk avondale canvassed reggaeton emden locates nablus mito weakens lithuanians orgasm predate wintering bumped quicksilver venetians carbonyl courteous shp angkor harps millionaires betrays exonerated genitals götaland greig krypton untenable tilburg fuses tinted pangasinan sagaing forel vibes headingley calibrated inheriting aeneas cinder morph kenilworth tête fumes stepson prenatal capel aust borland tanager dnieper adrift colm hinting diem misérables motoring polygram heroines spaulding bhagat gautam corso whistleblower intermedia strikeforce paulsen beechcraft conservationist pups workhouse shikoku grub richey lyra leftover graces disconnect trafficked 
corsican spezia merck tableau knuckle dandenong disraeli theoretic balaji bischoff fuente persecutions sunbird determinism engulfed pohl chadian colliding yaw penske coaxial koko swabian cystic hanshin toads mandala lasse slapping wielded lian trombones medan ellery virtus maura endeavours lookup moa milder liberating defencemen alcalde formality marchand shari borisov novellas tripp chevaliers sevier guglielmo ueno underlined ustad sweetness orientations gracious pdb hounding pythagorean tease bosniaks heyman muda hyphenated insiders hackensack staggering ophelia coiled carve naa manure resists georgy adelphi modulated lanham sportsnet furtado antiquary austere rucker reactivity huesca ramiro overheard edm grinnell fronting dugan airships vaz wicca waterproof msx urawa coadjutor cosgrove jna tunneling lockdown misaki lillehammer annuity edema nucleotides jayhawks wellcome suspecting demonstrator tiff mockingbird aca lugers rasputin gabriella luise overturning neurologist crs nco ael jabal vigilant beauties kwh crusher gerda discord approximations sanremo intuitively realisation nanoparticles jacek folkestone bibles segundo celibacy chivas werewolves slasher roald vulnerabilities boredom mário fane cel thx cadres byrnes bogies flathead nunes kda exegesis slipper formulating widget powerpc mandible clarksville dodgy tibetans raping herein gutted björkman repost splendor bolted sotheby gazeta accrediting paypal negroes karina skits lohan dominick joystick dispensary tull crickets posits shearwater madero ils renewing leclerc mossad gisela octavia hierarchies ochre polyphonic miraculously triathletes lner cellphone apo hayat seva individualism provisionally keighley pampa koper suction ‑ metroid darko carrara pocock akers sdk ply loci gravely carpentry reprimanded cheaply timid repaid affinis hetman zacatecas morgue kwai cheerleaders bedding conrail tectonics bossier prods elms symbiotic sagas abortive moustache ethno léger wandered environmentalism cardoso bhaskar 
evers mobil jsc rajputs caprice nestlé cuny peasantry herder sati sexiest florin incite fuzz khimki friary malloy carillon xa interment equalled barbershop haber tobu dma vodacom signalled totalled watered dewar elba kauffman euphoria crypto styx airspeed wolfowitz risc sensibilities acr mfk pogrom alerting ayub cloverleaf hazara chromosomal gingrich arezzo eri hippolyte feeble smit howland iced fielders chemin ovary vd orozco northcote sheaf riddled diogo huguenots stabilizer blundell embezzlement mss orenburg hokies kharkov broadening winless wilfried aurangzeb paredes devotes babelfish motorised deflect herron consented metering tsu coalfield bypasses assessor teague fluorine merited sportswriter linnean landers southwell mga hofstra schaeffer computations berkley watersheds niro andalus cru verso subtypes ato gris fabrice rife hoisted puritans yaakov isuzu ldp belfort birdman perish freer leek willson erhard bridger condom kojima grevillea burglar amenable uvf tibia transferable broth executor annihilated tighten northumbrian carcass westpac midas sangre shotguns groot ochs lateran complicate drysdale cosa spiff vevo duchamp sympathizers tiara spiked rsn dewsbury fishers completions borrowers mlc unconvinced rodger legation gimp buch ohm rawls wilkie backstreet kikuchi deafness terr ludvig lilies recharge spc eisenach pct minamoto salix syndromes abbaye maryborough narragansett bolstered mtdna morbid kenseth angelique kernels ouachita buford thun hannes burch blazer sausages bolding andrej smoker melodifestivalen indemnity urbanism hargreaves vitoria nomura keio ruc khel nid costas ammo tenn mughals petrograd mpg vga galois csp allard redhawks gravestone teleport bih gluten numb ashlar southerners interferes omani utilising indulgence ridgeway schwab saddened directorship calculates theosophical rosenbaum polyethylene finistère metalurh tet decipher isabela abort indebted kinsman pharmacological jagannath tps ecu zielona hennepin arduous bayley nonzero 
ultrasonic nemzeti myocardial facs celle aab satu pekka lista amo scribes hugues tenacious huns infernal journeyman pall ruddy uncertainties bonita africanus centrum mistreatment mysterio immaterial mdc thaksin ruud injure rena shareholding havel anaconda interns muted fleshed trang ivano ismael anjali leonean cordero aspirin ladakh théodore sunbury kirkby paranaense odia awami slapstick lapd retaliatory tipu caching nama ruslan amphetamine cashier smacks calves sequentially palmetto poetical usp mechanised myotis absentia calvo auger tripled droughts saas dataset niklas neto boas idealistic gravy tantric analyzer kalinga euphemism conspiracies subplot deen crowdfunding genealogies iguana wrench disbandment mcadams integrative gopi uncovering naka gaba navigators scrubs delinquent eastside kilpatrick eisenberg stefanie grocer tfa oireachtas campsites iterative mda legrand giza satya countrymen spires ramone halliwell hysterical popularised tragically día encyclical herat sathya torrey taming ecologist mais brie kittens reservists multiplicity usefully hillcrest purposefully ramachandran ukr satisfactorily yekaterinburg rove farid mirnyi bedtime pag barcode sylvan starfleet whitlam driveway brier reckon yulia finnmark vigor hoist hamer cruised mea dolce infestation supremes neutralize karting csr specter ninian agassiz overlaid ¿ cinta noses draped vishal houdini appendages suzerainty jaffe lemonade histone deirdre reardon levitt hillel neglecting gwyn godard fudge gustafsson inspecting moya ajaccio pittman steinbeck semicircular livonian chasers protease gauguin theres undiscovered undp eyck examiners chanted barbican urquhart octaves upa albano visceral charan femur sio eldon tomlin plinth elis abolitionists thales denbighshire trekking diverting vegetative alcalá herrick telemark facie amundsen publicize praxis banach subsurface shielded wootton skanderbeg mpaa leonora jeannette capua strenuous biodiesel vireo catskill cautions narva nab heretical collectives 
lumley itis calcareous momo icebreaker stimulates deductions bodleian opengl levelled ebb behar emus withhold magnification morningside fosters zvi runic clams poli fructose lovato país csf paschal monopolies avenged diameters segmentation mila voiceover ecstatic awaits gianluca ridgway pews angling lowestoft perched inman subregion shigeru disrupts tochigi kidnappers bachman recieved gateways hideyoshi perforated phenol demonstrably prunus loyalties bureaus cytoplasmic managua eines natwest bacharach aip kostroma nrk worsley erlangen sentinels rethinking antonius marathas andi waterworks aguinaldo goings telescopic kuznetsova unprepared ostend chronicling anglicised suspending shekhar umbilical reprisal poking cour psd yumi carburetor endpoint gx lem fracturing collars unites reintroduction wafer solon rtv coen divisible calloway alkaloids audits kom soybean riverview fatimid plow danforth siddiqui hokkien yachting anastasio denounce underprivileged mulroney victors purports mediating tou receptionist idw meer ariadne auld iraklis alcohols slay chipset smelter lingual jus kea swarthmore wonderfully bataillon walkways shattering shem separatism prickly goody normalization retrograde braithwaite muffin yokozuna condemns decayed msv linfield loa sponges seduced leste ptsd breakwater devoting rabindranath iva heretic undertakes tammany cyan fervent doon zamalek cottbus concurred emissary shameful dali polemical foothold firemen sorceress excretion alyssa audiobook montpelier tyrannosaurus wsj narcissus piquet rohit lemieux speculates ainu thankful thunderbolts epithelium pfizer atta touré deluge maputo tantra ingenuity handguns brezhnev disjoint beckwith turban atrophy qs marauders novices ille llano uncharted chuan stosur orthopaedic negate swimsuit deuce jeon tasty trappers steuben clinicians akershus khitan disarmed vinny leng leanings kidman rescuers picky griffins inanimate onshore bx heartfelt handicrafts handlers nederlandse goon sportivo favouring marti 
coffins scathing sulaiman barnstaple indulge testifying renton messi chrono alphabetic franconian wildwood hornbill hideo oliphant allegany sheboygan flannery artistically playmates cranbrook sakurai gac jazzy hillsdale leveling acceded reprisals hardie gelder tinto planks stipend houseguests bushy lederer fairbairn putra newcomb looms humbert watermelon ziggy pinewood megawatts lenz wikimania responders mussels stela sachsen hibiscus discounting toba graphically chiltern rehnquist opinionated mcs insurers burrowing trabzon sweater cac ticks overlapped generously uttered chaudhary subotica smuggler blackrock encased earmarked mimicry wollaston beckman tyumen piggy jerez gripping yule blackheath corbet stockbridge grigory fredrikstad busta plateaus jani hormonal kwame hohenlohe withdraws thrasher stoll pumas ardenne snider michaela ashfield mainichi tuxedo flicker troupes plazas stiftung buckets origen wedgwood biannual sedentary phonograph reuter montmartre apathy mohamad flemming satirist concepcion drags carlsbad transylvanian unauthorised mcclain hibbert fonds replenishment bahru boarders pdr displace tiebreaker uf nia lusaka lifeguard configure droit artnet welland decorator hur paintball gazprom dic avatars paleo xf stalling rhenish spammers minaret gluck intensification denser oba macrae sampras ont garages huelva ladislaus iga gucci acf morita reinforces taiga moreland wbo freezer extrasolar fists sebring accusative espresso bodybuilder sca ceos ranchi reassignment seddon ivana muskets kodiak borel sorghum elaborately rima merv lambton harpers sefton diwan mixtapes bandmate petrochemical studi serine riverine lashes tetsuya preside dnf receivership corrie cec yunus equaliser ceasing plessis disparities sperry flavio burdens jha zahir bourgogne butchers korsakov replicating bandmates noyes shelbourne swayed thoma crores ❤ congas invalidate saale leonidas fiennes nordiques sinan tumbling grissom mimicking zealander pickford leong rfcu accented ethereal 
platoons glenelg structuring relevent investiture imagines homebuilt ecliptic pritzker kragujevac sloops topless timeout rer ariana kearns lundy cormac aptly sideshow bolshoi hydrology vitesse instigation pregnancies fortescue tracer pepperdine radicalism fling pint rigorously alessandria heaters circadian gratuitous sprawl webmaster rebate chemnitz folklorist cripple nayak wayland osgood pesaro append gatherers huntingdonshire unfold inept redondo quartered transplanted gaspard crosstown hommes nyasaland oriya paralleled gv drinker airdrie exhausting manatee valeri gta shayne barksdale sopwith waveform mpp mulholland govinda sergius generational capensis zag halloran antietam lupe puzzling ecozone darien mated macomb steyn herz galt deliverance psalter upc stags malevolent biloxi barium negligent recycle unwise aran suppressor estimator rencontres hitomi horta protectors nashua pinging ogilvy darter sinful handover unstoppable fernanda childcare stinging stillman harlequins diodes tupac tightening individualized twister vacate archetypal tock governorates netherland rak stitching gust prada pps maluku laments kyo renominate emphatic rafts norrköping morehead distantly digitised superpower plantagenet subconscious phish presto sabin secessionist omissions spammed psych roadshow hairstyle wicks courtiers boathouse undeniable mako gearing faking rafting bikers arbuthnot asymmetry informations crematorium poo goggles forking leitrim valhalla corneal cubism goku foresight tgv daewoo probing unfolded golding choppy wealthier dov forfar uncompromising mitigating equates adrián bahr vitali connery hem apparitions cisneros retaliated scouted hatches retook aristide cee henk grenfell brash awfully chimera khaki cataract mindful bryansk barthélemy edd intertoto newsreel chor fortitude mucus airstrike laban stalwart grapevine ecb cleaver aviva guardianship grahame benítez heliport sportsmanship tcm diverge lloyds bozeman pharmacies deepening governess hein fluke grampus 
manoeuvres neotropical offensives ediciones vertebral tera midwives menéndez goers fueling confidant neurotransmitter naik contraband oncoming wap discharging yaroslav abrahams declassified hairless attractiveness inshore havelock pontoon malhotra racketeering gunned mennonites cabs flung exhumed abundantly increments chatsworth refrained sutras euston nao miura towne hideous ehud burgundian lice editore zuma mistral sarnia alamein stanislas looping mouton arellano sunbeam abteilung watercolors bihari dissonance breyer tunnelling damping grieving natured eriksen goalless kapp jesper unsettled bibliotheca emc qd pinning pocono coos outages fibonacci morrell rabies notations discworld greifswald pearse morgantown ganguly pepin colville baffled macroscopic infrastructures shipwrecked gogol juarez toho buffers krebs kyi apl iodide ethnologue northside vfr telepathy mayagüez szabó fastened homers monteverdi reclaiming palearctic haase brazos viscounts flowered oboes iis agosto pachuca soria scholz costumed archduchess cps mauser stockwell dilution tatarstan dalla bennie gimnasia moyer toma yamashita artemisia mota sourcebook custard insoluble irishman ehime spearhead ajit mccord selenium tsonga silicone eiji brits perce sharpness krajicek foodstuffs suceava carbondale nauvoo linguistically alcoholics metzger descriptor galapagos calvinism whitey sonoran masahiro modifier gory lasker whimsical tapering lapland machiavelli scofield categorically timetables betterment pcl nws fluffy shank bittorrent paradis isola yasser quinlan extremities tester tendulkar revolutionized jurong espírito firebird vermeer proletariat impala paulson textured ende alignments overtake stratification renames dürer liston gendered tranquility decoder viborg howarth glorified christa nouvelles transpired borrower hearsay rafter hounslow ewell avellino cochabamba ber antichrist pecos automate sasebo ishii raffaele enclose greyhawk swaps gaffney jaén elinor lune filho neva transcriptional toddler 
crate zell overloaded outro daman roving saban bales antonín beatification catalysis xb monolith algorithmic unconvincing hooghly xtra delirium meehan dimorphism escapees murchison simplifying pharma workstations kasparov cypriots dormer jesu buttocks semper gurgaon ando tegucigalpa coelho wakayama umno brandywine eunuchs silverstein electrochemical delilah luxemburg misinterpretation rhif obscura perthshire pennies bronzes mistook pilate konin nei taper nfpa mahindra forgives leal likud misnomer lyell granby blume sandboxes huggins ofcom sz yum bacillus porno joubert hom clp outbound opie esl magallanes distinctively natalya necessitating lollipop mmm muldoon sistema manassas szeged climactic koala kenan rügen paradoxes posited møller bhutanese scrabble durrani agricola posterity heathcote conjectured countryman anglers sorensen kentish ims hau draco aurangabad bárbara annandale hergé rollo aristotelian donatello quarrying falsified intensively mend apprehension hells cocos undersecretary bustamante mycologist tusk hyung puffery analogues nationalization villers natively romo burney catenary cedars sauk rcm cleverly simmonds oryol picardy reciprocating filings mindoro makin gurdwara statisticians babbler firestorm circling srinivasa mauna tropic undermines trs hordes rasa prescriptions blockers sketched harpoon swede tosca tapestries daemon cybernetics emin stoddard garret patiala herndon unpaved dystrophy canadensis runcorn ccs teamsters gautier vivek iverson groceries allergies tere gillett eglin brews livia callers woodworking minotaur kashiwa sotomayor stateless underestimated reiss stakeholder erecting quieter cluttered outboard birthdate turan cade fathom restaurateur après halim suzie coronet leto unfaithful estoril belgorod eniwetok impersonation beersheba hangover decode kj katana commandery narcotic pylon agrippa scotus shunned amicable transduction bronco anglicized datta bolzano heber annular bulletproof sitka terminally slits atwater munnetra 
minutemen hydride gn diversify rusk aru kazuo giroux slipknot debugging piled ewald beaked blackie loew usha wuppertal flintshire defectors talons gambier ragusa galápagos peep gwinnett movers kiwis breads accomplices therapeutics looming kilns upanishads remission sideman griggs affleck gers amity amma stepan anatomist lz sefer encirclement intrepid kuching amassing setúbal typified patrik carlsen backer carrion whitehaven revolvers supra trot rood hawai casale masquerading peake crease skåne symmetries marat alkyl carranza caligula dor freda flavours asgard dealerships lifeline scratched communicative intercession grinder spammer jl eeg vesicles moltke phonetics rebounding endorses locarno antecedent duplicating perlman unsuspecting scarred lae pelt mikko endowments nadh gannett bou licking yardley frankston cardozo gartner ptv ruston nicolson flugelhorn wallaby alloa siro kandi varga baie coleraine woodman frome machining dialysis abdi typhoons mahony seaforth authorisation anguish voz moussa chiswick minto mudd motivating historicity understandably gosford trina cuvier nagel tenses polyester quarto telly sedans unfolds guetta finkelstein powering calorie aesop camshaft biotech ewa araneta orientated corazon denim dunmore kanal suitor everyman leonhard paladin valerius colonials covenants maidenhead brickwork shredder sidon bangs deviate hyacinth gamecocks narita deg bambi unconstructive peripherals schweizer colorectal git commissary legacies slammed grier dohc grc murali ullman mitchel synchronised heathen maximizing deuteronomy utilise bnsf bastards ivanovic theseus subjunctive kyoko redfern nubian geyer avert soundly seljuk squatters technische dahomey renters hutch studs cul familiarize mizuki mura stockade emporia dissolves coetzee impediment posner satsuma uns enveloped westcott transplants bah salvo jeffreys bulawayo amplify briefcase hx sixes ningbo cephalopods staircases rightfully caramel dales bribed vmi huracán plexus townhouse jenson pretense 
yamuna brechin epo keppel nanak nunnery nus cams cbbc ferrers palladian clemency bloated oto stylised meghan cramped magnates dishonesty benigno precaution sens inseparable jurisdictional infecting salting elmwood uplifting frustrations unaltered ulcers foggia elvin plz mellor dissipation prr isoforms nasional strawberries geodesic advaita esperance innocuous ingeborg konstantinos kirtland hijack multivariate camara semifinalist bashar mayhew redbridge nevsky wilshire enron ​ havens neutrinos laborer baikal ganz amarna greyhounds erasing usman lipton unequivocally fondly tydfil witte pym campeche kenichi freebsd gallardo isometric abbeys tezuka phonemic kbe habilitation characterisation ayurveda stuntman trisha hensley subhash carpathians beehive candida regrouped cram bubblegum conversational curley beheading imperium clausen printmakers shunting jiao ufos rhett ingersoll ushered amalie contending ena skywalker fluently sundown shadowy remuneration dubstep ibarra affords antimicrobial prospector cyst cilia dispensation unclean mythologies timers hereafter ove ach radon peachtree labelle whitewash hulled didactic biff emigrating flirt baptista chopping aller sokolov novell sialkot watchman syrians desiring cima cyberpunk gilroy biel aneurysm xiamen rafters crick sau wirth runescape basse tenet piling natura droid crumb thoreau salm arie geisha whitefish zandt quirk cinéma altos psoe blige torrential devanagari vetted bashing convective mikado murthy raines rumi vestiges marrakech whiteman assailant correspondingly wicke multiplicative gautama wps asperger loos hanford beauvais ceuta clementine rivero milanese ficus catawba feudalism thule philippi debacle qualcomm recombinant rcn minoan astm potable andros mariposa fenian kasper pca swears geospatial asturian purging cashel negara rifled odell icu cysteine falstaff embraer espoo boro portia teasing supremacist resent pats kangxi ontological anni carboxylic tsarist wasserman tokushima harada dhs classicist nfb oca 
headteacher gutter aligning aquaman generative llb narbonne burnie venu cleansed deadpool vytautas headstone tormented frisia psychedelia garra tendered eelam rehabilitate refuges coercive bpm wicker liberalization finalised payloads corvallis collegial trigonometric mussel estefan diggers nhc puns keeler nyg discreet côtes takeo koehler baffin edging opossum psl wga deepened internships spoleto wranglers hoyle raghu linares bukhara loveless grt conscripts errant glaser eisteddfod gabor skytrain cranberry castaways boers mutton meticulously nlp depp rizzoli sportswriters cenotaph wholeheartedly frail borac biometric cirrus mooring palos stuffing converging megapixel toppled burghs reverb sanga tempting zwei worshipful azarenka ncl mamma superfortress pavlov dictators pyar starkey pda coexistence birla patronymic mhc hares whomever syllabic stumbling fondation dass dispensed unimpressed gamboa slaughterhouse sheraton dissuade rewind contre tattooed kontinental scaffold randomness exxon fanciful disallow steampunk stretcher polje steadfast lagging templars zeitgeist gamal hawkesbury proverb spicer coldstream zapotec thirsty synapse conformal simferopol hurdlers taunts mismatch placenames getafe margery unfavourable zoey injunctions spares adaptable robredo paleontologists pedophile fiske marinos khamenei nsb ivanhoe alcock tunku biofuels augmentation mummies diphosphate locos tous cherished bernese unilever vignettes thorp girdle vitaly forsythe benthic payback perpetuated pepys nla hundredth parapsychology nella hipparcos minkowski corrosive lorena greenbrier precipitate sukhoi cuneo geyser superintendents suraj awkwardly cornea golem velasquez classicism geocentric seagulls steppes mosses demille adama cwt farber joakim sek yaoundé bpi siglo lamarck português lennie taunting wexler gts boavista slocum defoe gonzo vengeful etta purvis scsi montmorency luminaries elbert eoin czechoslovakian thrift borromeo smirnov lobbyists fukuda aunts hairdresser methamphetamine 
ferrand headmistress stellenbosch rocked wraith educates longview duels stagnant kalan jinan silencing cliché corroborated brainwashed xenophon huai moura courting agk thrashers banding tomislav dismantle coerced ewe akiko taxing soles communicator bittersweet insecta glyphs localised klm vasil lengthening majorities cliffhanger ieyasu dryer woodley blotch cano sme hyder twitch shimbun keynesian darpa remi tiago whipping cubist gosling semnan railcar invokes crvena apostasy ettore banked rimsky tabletop oyo fairmount clippings vilas spokeswoman tropes noriega hating ashdod meissen honeywell affinities nance derailment contemplative matos sauron sweetwater ronin puncture lomas midwifery farman prescribe schiavone akram allier refering verity redlands malin cancelling congruent glued agnieszka bbl antecedents tsuen peacekeepers eas coining waging opry otero sela lillie arequipa hoya farquhar pidgin godmother babylonia kuiper bartoli wadham principe madoff plasmodium fema beloit safeguarding sharkey annabel disgraced woodcut harsher cézanne beset impartiality gooch acadians gilgamesh aloft akp crates impostor hieronymus enquirer layne brinkley ribbed pres luzerne maginot metra tiresome yer kuh trumpeters maung symptomatic bucknell frisch schmitz silvestre limousin tart sandringham mertens dôme reimbursement weathered multnomah bashkortostan fecal xxiv allocations flake reputations chandos corwin boko fiance maître pinpoint tabloids nowak libri wargames collieries netto nudibranch farrow osi inpatient circled amélie schafer proms transformative naturelle priceless predicament unsurprisingly gant oxyrhynchus abnormally rediscovery headgear soapboxing rakesh northwestward haircut puerta hellman cuatro rotors blau predated kiri hiss whitehorse elitserien studebaker slag ranji tryout seeley hausa clarinetist rarest dah egret courted volkov unedited bayside purdy ajmer dardanelles ginny musicale paws irani convene enforceable inez fripp dux trudy christiania hippodrome 
pounding bijapur absentee gruppe crumbling gowns godoy intrigues reissues luster dysplasia changi redox crozier aer straddles yann infarction grandchild bhupathi nakano agora brokered midori steed menezes mellotron jacobean tricolor cfc roundup trajectories impunity tits gretna peder iia dupuis sahrawi postmodernism mcclintock xenia balsam embed entomologists kona getz gozo massed jonathon yuriy ominous ruthenian apu crayon washingtonpost summing nsc señor helmand aeros vonnegut morissette zygmunt festa dermatitis manchukuo macapagal hérault riddles haddon vella keeffe glens flensburg fiftieth focussing suvorov dugout attenuation tenderness disqualify rehearsing wellbeing infertility grenadiers dismounted inaccuracy nanaimo occam perilous aurelio frontera chirac ¾ léo buri manohar desolate orbis shred individualist yama seclusion antagonism aureus cog daunting appointee misha pirelli praetorian euphorbia halts navel inherits uscg kilimanjaro franken aker shiite scherzo hurled taggart macintyre pressings revitalize atrial justifications herpes barks nikolaos empirically geologically hacks shatter caregivers rehearsed flamsteed eucharistic maldivian simi rafferty cfs substitutions indent marmara blockading maccabiah fleck grasshoppers hobo yoshino lookouts mobs desertion nast lismore sveriges barricades obenberger mornington grievance motta oren dermatology menlo intramural multimillion oromo confine movin emphatically nilsen bettina gallus heraclius publ accomplishing outwardly glock bioethics disarm norden newham watchlists randle scrape delights camper hashtag meigs septic gloomy haida centenario moores valverde wollstonecraft plucked outtakes flared elche granollers saha grieg eda usurpation emblematic kinsella consistory wycliffe stary beaks ticking crossbow wakeman mohave intestines katanga langue thomsen grubb crystallography wettest briar materialize emanating moderators undetected typhus chievo lassie teleportation wrists rfu yonne roommates inset prevails 
gandalf uninterested circumnavigation unido scheming falcone tavistock hairpin tfl ascends praeger fearsome pula khao exorcist vero ahmadi voluminous yannick deportations dammed coolest dijk unset ashworth danvers todos rainbows mang kilmer cura harker mukesh burleigh charly divisor nurseries lovech ducati joyner sauber invercargill bengt emptiness donkeys catacombs dalby sbc foyle expandable muchmusic pilgrimages ninjas emiliano approvals aerosol scarab opposites adjectival killian buzzer alkmaar pasig asm mozambican samiti wayward pippin pacemaker neanderthal hansard thruway stonework boltzmann concealing mihail pld firehouse jungles wetter homeport sprout júnior laminated stamping grunt config neely ender queues argentinos peacemaker shards aberration disarray monson reconstructing maciej chickamauga hymnal wcc boxscore consulship kars misusing spectroscopic quantified kayla valerio bondarenko nadezhda rathbone quiero cherries arming yucca olympiakos erc eloquence irradiation centurions raritan parley isc parachutes dispense nürburgring amputated keyboardists francisca arn palencia kaori betraying thrills oper clinching licht faunal ascertained athabasca gallium sfb douala hamel axon geordie sisterhood dz hideki avis scares giselle nanyang meeker fairytale haro alasdair aeg skole dene amicus audley understudy mathura figurehead scopus congresswoman entomological saleem boyne vaslui impeccable joys rbc mitzvah seductive hunedoara oration benefactors perlis watchmen jingles toda ibc bivalve provincia anselmo congratulate kultur carruthers mohsen moynihan fossa gallop clergymen rectors lido madhavan amboy fridge psychoanalyst sook samaria byng mcghee holley shahin lumped lauper beatings incur aime trolleybuses hitchens revisionism discontinuous harte bermudian sundial microfilm rabaul pulsar honouring angelus labonte fsu hogs cursive dupree carré nepean corollary holborn chantilly compliments coexist vividly brindisi mnemonic sinbad pessimistic lubrication drei 
futuna alessandra aoyama coachella leica quinton widths vegetarianism wolseley repudiated redrawn surrenders kilburn ahvaz monongahela infinitesimal panionios olney bins inexplicably ruthven forgeries nantwich contraceptive laver petal usac pelagic talia rasheed goodies motörhead axed delany bacolod hieroglyphs pastime staley crassus lumière teodoro cusp kerouac fiancee tacit rebroadcast abdur embody strom flatter tunic mair bde leveraged donohue alles epicenter pardons interrupting meu roddenberry zug manet washes uavs hdi egbert mcelroy dpp moderna alix tuzla parlance jabbar berhad disembarked preto anak vers rou dari bahnhof darkened ugh rina redstone confides rationally hoof beastie belgique pella minuteman mistresses mcloughlin competencies cccc sufferers perturbation etat circassian authenticate moraes headley peninsulas evangeline garrard moonshine rotorua hikari vijayawada morehouse astragalus sheba shoten oficial jamaat zeng tatyana dinh overheating wenceslaus dreyer aguascalientes trickster invests esbjerg satish rupee samarkand psg armadillo ashikaga sherborne grebe durante appointees etobicoke barter fumbled bergh cysts kath rpgs denning xxviii lenoir gander orsay grandis fortnightly cdma foals glenda blagoevgrad astrophysicist libero hauls ailerons ashgate metabolite unsustainable biophysics benchmarks typology oberhausen leafy focke pevsner koto nimbus scaffolding disclosures sketchy dennison gangwon advisories trainings chowk unparalleled megumi dehradun zafar virgen mads palliative yeong misreading terek hangman hardtop vibrating cassino alisa hyland asante ganesan cappadocia deterrence cockney infra biarritz torrington bizet rtf bylaws whalley technicality obtuse blackfoot trove comprehensible izak unruly tutored ganglia bandera btr urbino matic tabitha dufferin bedside termites glare parsley lawrie macrophages dap tugboat elevating jains laymen meryl manmohan whit blowout fusing drab dior chiara tricia blinding gara forde vikas recursion babar 
archaea lymphocytes macroeconomic nostrils swaminarayan balan caper hyena hibernation amazonian sloboda flips pylons astute weg lessened quadrilateral benelux spoons inspectorate andries englishmen asti woolen powerpoint customizable doorstep flavoured mathematica particulate krakow causative geeks brookhaven ebu conifers ktm editable amico stinger overpowered woollen madera blackadder sutures kinases maddy dayan pfalz remedied isley homeopathic nadph yelena farragut gipsy stoves smuggle edelman gwp tsubasa wishbone babbitt thrombosis militar colfax kurgan witwatersrand auden napolitano stallone rawlinson goby gau xhosa tambov crayfish nieminen bartolomé psychoactive anima ciphers refutation cheadle comité saavedra tripping sovereigns prefabricated asker interstitial siobhan vixen dressings tut scaly transfiguration draughtsman trivially milos upfront darrow savages factsheet claudette deathly orientales yakov perverse geforce druids extrajudicial göran cocteau uthman elixir enumeration entanglement kunsthalle scoreline steffen yury mourners conjugated mémoire arcana cana eder pmc darwinism flagging alpert streaked penicillin ssa niu arbuckle binge lukewarm gagnon mek dimitar dbe agamemnon vittoria arslan tint humayun uab telethon sped energie yash motorbike drenthe deranged septuagint wea aberdare refrigerated baguio jute falsehood germination arendt frantz kessel inquisitor pickled corrado stadler huckabee karam rawson trackage ventricle pari rosenfeld woodbine tanjung coughlin yoshi doughty leiber lumbar shulman shamanism resonator insurer interruptions dozier mémoires crippling broodmare leans camarines contemplate fellini aargau dazzling warmed disinterested govan ako eastleigh determinants proportionally bibliographical perceptible abl flagg finchley crib pero coupon cluny refractory uncovers subsidence idiotic hus alina retroactive substandard seamlessly woes klagenfurt universitatea raytheon lathe quesada geostationary handout bexley reo tiwari parganas 
pixies numa msgr lithograph froze isomer polanski flask canard belli cauca coogan roan lendl ideologically icty obie dafydd sanctity deadlines tidewater curler exhibitors bulbul toleration pseudonymous criterium caravans scavenger compounding chargé powerlifting boson freire barrichello demeter ¦ morro courland glyph vicario mariupol masami qadir heraklion mackerel yl ishida unscrupulous jonesboro crespo kravitz juncture iww bodhisattva degeneres tosh lysine raps marija hajime prendergast coworkers cochise judi kanyakumari prodigal gamaliel crittenden financiers bakeries durán acevedo kanda meth conic insulating orcs porters rolland akon glas tilak stepney tofu innuendo marsalis eurosport sab symposia rmb veneer metroplex iskandar chuang mitterrand daugherty montezuma impairments gadsden skoda labem georgiana carefree chambre pensioners lyre menard emotive gorton tif ambrosio roebuck hexagon tupelo hamar pantera discreetly clunky vagabond montauk giovanna pentagonal orbison gillan carnivores anthropogenic meerut mcafee corbusier xxv sopron tru silos pog placenta wcha dingo barreto brainer étude pennetta aqsa arti tarmac yahweh chardonnay lejeune huckleberry bundaberg xxl descendent drugged luft donal sharpened kaleidoscope bertolt euripides ribera grayish taiyuan deconstruction lures heatseekers impulsive tokyopop tva reentry popularize kailash preset grau yvelines matlab luncheon barracuda jamil racecar lifes cockerell postulate windermere hellfire degenerated landlocked angell velodrome crenshaw feeney xxvii scopes ey manna spotify lyricists hurdler cruces hurtado ewen wenger uso zf outlived bergmann chansons rims greys deville suitors tetra mustache gorizia rida captaining dendritic uncontrollable disappearances santosh teak paleontological televisión doubting juxtaposition monoclonal domestication howells macs lunenburg dendrobium tulu tirunelveli sieve facilitation inkscape iwi dvorak spinach phuket intelligentsia hals spirals deloitte harney cocks ged cnrs 
corte andriy gogo fireplaces infects registries bajo yasmin heredia spilling gj kickboxer alon monies riel chit saloons ner gaynor allianz naoki purges ungulates nimitz privatized mcnabb seu trespass peppermint lombards dweller compatriots juices rimmer consummated silicate hdmi machete tinsley posh brainiac festschrift fallujah bushnell viana hausdorff kbo murex dartmoor nugget unbreakable alfalfa mips bystanders authorizes zdf glandular giri disulfide lousy modestly lodgings uml austral glimpses precluded noteable toki analogies jochen udi muammar madigan telemetry macao trabzonspor eject choudhury bletchley bagpipes aap mulhouse cowen vercelli alejandra aku sassoon catalans creatively ranchos brookside mishima cto snub claudine fundraisers calico caustic mellitus booklets seamount broadest gakuen burnout scraping kazuya sneaking vanier lucía subsided radiological bloodstream greville junkie tsa ilyushin shakers nicklaus yuji shouldered fergusson tedder mainwaring cosworth clearances beretta intolerable kenshin assimilate isolates meandering cleanliness forties especial lcs goswami shoreham picton politique dispensing astrologers boney porches kain hasegawa equating inferences theron bluntly mvc bogota emmanuelle platelet disorderly tiller isomers mpumalanga speciation ludacris northeastward scuola outlandish lurking minna rushton origami tos möller erasure herbivorous payoff kennett köhler jyoti cyndi westgate stirlingshire isotopic accrued newsreader sono huntly fuze kagawa retrospectively baudelaire legia fished timbre forfeiture engelbert kremer morison informants callie brochures ulcer pinellas paroled terrance balustrade sparkle chaka formaldehyde interdiction bannerman foothill subtract snoopy tippett commendable commemorations whoops longstreet marwan implantation invicta littered tomo ctc pohang hatchery bnf utes ili serfs gilda lachaise pattison noida glazing royston neves glycol kareem catholicos ospreys spitz kjell revisiting kui bowing bonne 
williston grumpy taganrog natchitoches bernhardt saf senseless pradhan boilermakers bledsoe educationist portillo raged lessing cryogenic byproduct quinta hallett catheter famitsu delisting unskilled ordinate amigo ponting espinoza smog imageshack stagg mungo zan franciszek jara interregnum masted normand affirms tarantula verdes ssi pitting nuestro enlighten warfield bata mitte texarkana bowser jacqui dfl decays yalta captioned staines electropop usurper pantry fenders qpr saatchi outermost bodine brainchild maracaibo powhatan chests cherie atmospheres amaral tampico ozzie sedimentation springtime pogroms elitist maw adela datu devo porridge geeta wellman tunstall riflemen nehemiah pirie paperbacks journeyed scams bourges etihad puno hagiography vicars mam diogenes cvp fsc curlew pseudoscientific rangel chattahoochee kafr brontë grandmasters andrás clef taoyuan insolvent deming greenleaf hynes dares ní obit wma brava nussbaum memoria tasha failings impasse petrovich dialectical unmasked kendo ano pma justo stoic marianna damir hues cathcart castell kolhapur scum obstructed taylors blakely flack hindrance balanchine showings bruton cours triumvirate hanoverian converged randers harrassment ahan gori mcewan circulatory litt coombs sacra zed bento churchmanship audacity unprofessional larue menacing blackhawk odes idiomatic maktoum shirin cvs thetford goblins bespoke ramsgate chitty teas jeunesse elbows hyo larisa ornithological mim najib theropod iridium residencies grosjean stitched desolation pervez morricone tamura ironside livres vive scheer shutters covariance frisbee enfants etymologies dagmar haddock enugu epoxy genk yume clouded conferring headliner spaniel kaspar convents gamefaqs scrivener forma wik merrie fiore hec predictor hospitalization duh hibs byes modifies muhlenberg nuance bonny inflection kenyatta loris yarborough polis constantius imtiaz yukio subsystems foal barricade yiu abhishek mikoyan discontinuation reminders veena reread cleves steels 
applaud broadbent batchelor haugesund evict eff cagney anorexia biscayne ringside dore siret bouncer gagarin rémy dailies cortland rejoice hunchback previewed yells orme easing wardens ascendancy runaways marika bruised scanlon nrg croton colima dictatorial licentiate moorhead adjourned gato eilean muncie stipulates masato oup aliabad engl thelonious chrissie prolong ouse marketers muskogee maryam reproduces celebratory miró zo energia boldface shafi konya redshirt payout arnie rabi isr elkhart churchman ecc vilhelm xiong reconfigured archway carapace dries loughlin gamespy disassembled nematodes heide smelling rekha rayleigh hydrothermal doorways microfinance gabi marysville dimaggio imprints bleachers mujeres uriah decapitation diemen banfield pires autonomic banishment phony kiyoshi sascha qazi earhart conjoined conquerors bugatti universality alleys supercars brouwer unleash taf papadopoulos kondo tithes railing perils egmont suede cargill coldwater falla meagher policymakers udupi pontypridd malla jee carteret simplifies boden tenacity infatuated welker reenactment okinawan complainant paphos martians repetitions parsonage bastian multidimensional startled canonization rambler sambo unaccompanied carnaval resins mcmullen imprecise hartwell besieging endicott silverware benefice chilled darshan waylon sagra neurosurgery printable zagora chae invalidated tucked suspiciously sorel loudspeakers unfriendly audited tisserand symbiosis seon verifiably marrero spat seraphim helsingborg grime representational orphanages equipping acrobat apprehend autres nicene dearly ibid chalet weep pradeep starz penicillium moretti ait thankyou ocd admires cardona granary sebastiano hani seyyed unreadable sandstones joyous polymath habitually glycine fjords notting whips cromer complicating perpetuity aspired exorcism lorain hardening triplet oust imelda assailants brownlee separable colley ardèche subs spenser sams république cladding molson dissipating npl beulah chisel rostam 
carton wushu laguardia mykola arxiv spyware voigt noxious ayrton strides bled wildstorm uic chancellery petrified jhelum uppercase adenosine nighthawks civilly mundy birthdays tuam fallacies freund geoffroy deanna latour chikara rona relented confusingly tilbury lunches kilo partying sleigh stratum ivanovo genet narayanan castellón pape foxy disseminating lcc puddle ug pasir alphanumeric underpass novelette haskins expositions guardsmen cheats clays businessperson srebotnik voorhees mohanlal encapsulated levon hemispheres gloster ppi intermediates dram stratified bastions coven cools frontenac bian esophageal mclennan abn vernal tls leones kinabalu renominated transcendence despot daf gwynne jest rosebud bridgwater sandberg cautiously ila rhone fishy adf miyuki cmd typographical thighs lesnar dahlgren sculpting patrollers bravely torches yana solothurn sported glade recurve carmela allotments minefield filipina hammett unbound credo greenhouses kelp pennine inventories dropout salamanders leyden fakes mower espace nematode crossley nansen faddle darkly unchecked ite anbar verdy rinehart pho accuser hateful cuttack eisenstein trna kovács komi shuts indonesians decentralization collider hauts berdych hylton teleplay coincidental vander swinburne kamala auch abb blm axons cruciate ganglion tuen condo octavio conserving sheltering grzegorz belleza cession elicited marple trestle sandals brodsky purchasers vpn mithridates withdrawals conch hollister vigilantes infringed colouration rewrites duchesses galvin waziristan robison draftsman framingham alumnae bla irfan olde bbq datum pacquiao pentax kilowatt gamelan stiller vilna earrings metropolitans lucan carradine ansbach timmins stink bhat orthographic asexual toru instructive professorships unchallenged safi verbose cristobal fih lacroix envision arce worsen boynton hurried polymorphism koran mannerisms deviant prov remanded inwards admissible hourglass leveson oppenheim renunciation pusan shogakukan mccaffrey kure 
lundgren keswick shaheen moorland tz transporters scapegoat jorgensen lossless pappas nuys materially tottori diseased kirilenko aeroflot wormhole revel cracow perihelion plush wsu jinx baltistan khwaja anachronistic aiden harvests annabelle talkies weblog luxembourgish amasya dysentery pál zn buell retrieves mortals whiteside slowest bunyan uga uninformed undertakings xo soybeans unstressed auditioning parlement observant evesham fullest avia trill puppetmaster nogueira terns ordinator pra secunderabad griffon langs rupp salinger numan elan stourbridge armas linde incision gauls hammarby lenovo octavius kerrigan equalizer rumsfeld panicked plat blackfriars garvin fret respite janie grosseto delve sark literatures buf armchair headway installer asshole demobilized bradfield diderot slovakian recitative baek leviticus oxfam machinations antimatter einem sacs ntc crossword viv nubia melba reiter origine roadmap petrels simba eurodance turntables hunterdon ogawa afca nuovo yanukovych krajina haverford raipur ecologically christiana placings disengage maclaren glenorchy choking nakhchivan hsien reston cruciform meine achievable hollander strove impeded lionsgate krista iman vented dialectic samos biao transcendent ltc macdougall zohar qm invades kish unseated enver carnation viterbo vial livelihoods faiz viswanathan ppl calton bismuth arguable swenson mfc thad narvik codecs serpents herbivores semple hajji cimarron drammen potawatomi mullet eso tyndall laserdisc taran endanger throwback geometries counteroffensive mores commend tengku whack lull argumentative ishaq hildegard eyewitnesses bindings daggers cockatoo céline lighten proust mór mephisto superstitious indentation amaya takao defensible nurtured kuroda norepinephrine fouled issf jogging belizean blackboard rifleman akshay novelization sodom tristram brigid diez nutmeg boing schenker heilbronn guano hologram pnc aas crusoe obliterated illuminati ludovic enzymatic wrangler venkatesh blower ribosomal figurine 
correia reportage negotiable amway yeti maidan dependents sauna enforcers lévy cic fryer selva retaliate tenancy autónoma vashem amun dubs mccauley congregationalist bullard ribble shillong kartli centerville fossilized cordell macneil diners melkite slideshow rodionova infighting decomposed culled siddharth cockroaches awry pretentious lindisfarne kennebec heaney discriminating portability passageway storefront erupts customarily tableaux swordfish poco murky hoosier hime hisar outings mayall thematically kalyani koda selves sbk nerds minton hiawatha teck shove priyanka bz srebrenica crain wildflowers subtraction daedalus aly bukhari hightower prospectors incarnate vespers caged masaki boll strathcona vladimirovich gwynn tristar honorius prospero oxidizing introspective disruptively lepage viic roja perpetually tsinghua huw pnp capitan inna lubelski teased murfreesboro olmec motets engle enrolls radii cascading blueprints killarney reeder thackeray kigali tuareg cerf starbuck nairn janne premiering semicolon adapters asparagus leduc standstill cayley hyphae babyface channeled faff fangs ₤ miquelon hallmarks najaf sacristy unlinked trinidadian ramallah undergrad cmu fanatical manolo pictish eads lipetsk jalil bolo restroom monocoque emitter lasso applegate candlelight schell schenck stover immersive scraps vdc gosh clumps trusses edgbaston braden miri cypher dedicating antimony tappan cece insistent southall keogh condense ruffin thrusters symbolist compensatory sensed dipper marg sixtus gestalt weevil gsa vesta whatnot cricketing haasan paoli tyrants chula huon sdf prescriptive groovy stasis lounges turkeys dreamers predating armani mommy berbers donuts dcs uneducated overtures gotthard anatole happiest sancti bartley cyprian understandings reformatted berio nipple balthasar driest sou karpaty ferrous banshee koizumi manhood bgc moulding eminently edicts slashed taiko linder beaker splice alef silliness kanazawa stillborn ryszard oregonian chetniks bojan rickard 
oradea buckwheat gourd sawmills zé mindfulness lockyer airshow irgun fluctuating howes broccoli ammonites lieut malawian millicent endpoints metrical elphinstone accelerators leu magazin kampf yolk villager saddles scca firsthand moderates triangulation pratchett auntie manoa morena checkmate ionizing minima telluride uid uwa debtors cotabato supercentenarians rechargeable landsberg karbala granules yrs whiskers perumal transcend treasurers admittance natak amerindian fortis triumphed fawkes idp puffin fraught promos scapa colonize binh goan vindication prays mankato quebecers coors amicably peale thang nether fess dhc vlaanderen marksman homelands shona cts siegen soweto oxnard tycho laissez infractions budweiser paulie seaway quintessential scorched redefine helmholtz montagne nonconformist skated upsilon anorthosis callao bolu greensburg streamer grenier slovaks sweating interconnection salamis fls hine waseda xn multiracial capote keri rowdies ifl comer didi accretion canister folkways iro yelled seceded sacco sihanouk shredded platnick stereoscopic ichikawa bigot zambezi stubbed shanty dynamos ternopil mexicano sienna renner demolishing mq raiser rosedale cady kwajalein declarative vsevolod kuk zora seiji twig ticonderoga mesolithic gondwana assertive hippopotamus ets cohorts hypertext adversity merovingian votive barrera cacti sti zander vomit pago escola inflected azusa dimitrios kidnappings boardroom halford astray benazir años rbs takedown puntland tisdale arguement salyut neared astrophysical horrid neel año forges chitra chroniclers conspirator saya submits derided nuri alborz ona meara swapo lita capes aggie overlying bobcat adderley caveats martindale fouls campinas ignaz federación weald ratcliffe derail encircling memento tulare colosseum oas resubmit airfoil regionals simulcasting callan jacksonian alstom fruiting aether dissected mrc eugenie ahsan reconquista ioannina acreage hdd julianne hdr forgiving serrated molyneux kruse buttress urchin 
courtyards seuss springbok biome azimuth promiscuous superconducting gambian askew liberally vellore levers laxmi aubin hermits mashed haslam shipbuilders raphaël murillo bradenton unscathed complexion blasters frescos pimlico bev poblacion christiaan yuko thickened landlady boleslav coquitlam extensible retinue suo reichenbach brevis karlovy lantz ziegfeld hollins njcaa texaco openstreetmap maur madrasa strang acland asn alessio renovating acne payouts agri signer liege tov scorn kazi gamepro longchamp confidently cartels trowbridge dostoyevsky roseanne dolomite vergara jabalpur glottal eliyahu siem dnb lobsters nakayama skyway carnot sinensis solute gatorade datasets autocratic sloped undeniably conferencing muck rtp dislocation soften assembler kaduna cybernetic alun ribbentrop alabaster hemel dps janko nakagawa binaries apocrypha benzodiazepines differentiates stun tiverton safeties jewett rishon faris bouvier vivienne ronstadt burgoyne yorkers waukesha inbreeding staffer cem mourn episcopalians iceman ichihara lackluster paganini dún conchita interconnect douai youngblood román ndtv jal aylmer flimsy unplaced attentions bridgeman fedor bassists newlands debutant tinge ido unita pigott darya knobs bulleted nox sopa rinaldo competitively physiotherapy reclusive karlsruher kingswood verdasco trainor schäfer creationists lichtenberg chillum harun petitioner apia parasol rashad deliberative twinning teruel sua murugan outage osorio dandelion benji intersected msm scarlets atolls blackface emp drunkenness weis fiu energiya hooke yamazaki witold unsung fliers zar borgo wrongfully genji buccaneer woodforde donut kiosks gobind unholy cartographic rosita nikhil asd radiate lactose sor pha scented winemaking muzik pipers bonilla handshake ambulatory chih universelle hesperian aggressor molasses paradoxical sorrento piranha snapshots jamia beveridge haque naam sovetov otc pater provincetown pacifism pinnacles disclaimers fatwa léopold hypoxia acetylcholine roadrunners 
caxias clinging wittelsbach orestes groomed propositional barbet lemons xanadu govind broadsheet radek countywide moy mints shiner gazing hunslet skylark mustapha prosser champa orc cnt trapp maynooth brasenose dharwad distasteful larus intermarriage stingrays flagler jarman maida whirlpool collated pragmatism subcutaneous incised impersonator chuuk suis sameer rashtriya rumba hayman bem merwe mistaking rivadavia distractions manipulations daze chlorophyll fabre brooding taverns nani diatonic softened canby paducah shortland shales uppermost linköping deportiva worshipping worley indignation expelling cea mayoralty conciliation retainers bjorn borderlands intents hussar troms weezer hbc fitzmaurice piecemeal kitson stranraer huygens yazoo croquet adl cutie prodigious venturing patchwork mcentire ozawa ossetian segura dlr poulsen duda willful ihr humorously bohème uneventful belém delinquency prioritize draconian intolerant pz knud outsourced aorta nashik budgeting doordarshan ravana bishkek erste jaco fredrick fingal rerun woodard krazy finnegan seafaring swartz wyeth qmjhl adeline turbojet impersonal sherwin mycenaean tetrahedral nalanda masa hendrickson kumi navi dian balloting michèle knowle selene trigonometry mingled harmonia commedia amaro cinco sdss conyers winded ursa youzhny wilmer icp miho sojourn petrarch ebro deadman sadar incessant breisgau emporis bitumen zigzag roseville aquariums kx servo cowdenbeath wozniacki contingents plattsburgh itch kulkarni sundar odom honeyeater pausanias keck triennial pulsed yuwen retainer grievous vimeo richman severance kinnear coconuts pant chanda marquise freya nys claro scarface mannered ichiro aam despicable bulldozer ayurvedic salish tvn garo solder usns nefarious humpback menagerie commenters xmas thermometer drydock geosynchronous ietf prs eradicated gera beda mestizo mitigated dative aslam ascents resignations tufted limes schopenhauer whoa vladimír airtime quicktime siedlce miliband sufferings valuables ardmore 
mobilised bridle zaidi extravaganza ,and propriety heptathlon hn handedness angra kut guaraní beltrán ife corry kafelnikov forgo beep tarc idi speck aileen riverbank godolphin gehrig karelian steinway cymbal suman vindictive pairings flywheel catwoman bracing honoree coulomb gor bada pazar vices shallower cathay grierson stal nzl modalities purify compiles cwa presides ganja chromatin incriminating giver hillingdon lanza aaaa defensor hospitaller pubic averted deserters bridgetown faw gorkha huck rudeness intertidal silvestri salto incitement cerezo wairarapa interwoven epigenetic adherent medallions defies yuba hansson pronouncing pelé petroglyphs rau indestructible keyser caregiver supercopa stockbroker haverhill bogged whitchurch abra prepaid masala bodley miz haganah byung berta akan newscaster hazy blandford escherichia chats phenotypic quayle lyonnais telegraphy goblet bedi accesses nightjar alo nascimento mms oglethorpe yardage edgewood ruy sherrill preponderance psychopathic bitmap occipital eller theban responsibly alight tymoshenko shueisha travolta pipit warrick ketchup defuse molars visigothic laziness midge methionine clydebank flagrant industrialisation tortoises anse drôme squarely mikhailovich antofagasta lehrer cosine makassar azov fatih reebok altruism patiently herbst counterfeiting mesquite heim embossed vindicated conklin marcella neuilly gyeongsang donn lomond composes cathal avellaneda americanism grieve canvass christo castings handbooks newcombe helge reznor rubus mayweather haru schaffer être aranda nanotubes pdfs pcm tremor widowers miro trott lockers accumulates conjectures ballymena takagi fujitsu ellicott bowe nazaire schooners paradiso biswas proviso krieg unbearable tantamount kcmg wharves tumble returner photosynthetic decadent anadolu heartache degrassi kkk kublai zindagi sociedade esophagus aline intelligencer nodules synchrotron celt malpractice ashburton excite ankles positron emmons sparring foresee damnation vz yc amidships 
dunbartonshire penza nida woodpeckers pulley aar popularizing glitches cécile hefty babol badd bristles mayday kono boutiques apricot gomel tiberias socialization statuary ordinated aldehyde webbed jf toyo evangelista canaries eamonn hamada thc almonds mccready acuña quarrels indio semiotics cetaceans coupons kristy progressions dixit hypothetically seongnam farkas ailment extinguish dixieland mamoru oscillating zr stopover sel grammer bcc winkle abstained christendom salmonella togolese conquistador utterance travelogue buchenwald aftermarket belton succulent lucchese abo evasive hanns unravel reunions compulsion param shiro morey tokelau amol saprissa changer edberg biafra carioca ncos predominate prospecting antiquated algal diphthongs eec decorum bearcat pécs solidify preemptive enlarging gat clijsters leonor overrule slayers covertly anacostia sidekicks apaches subvert productively surinam foreseen babak controllable adverb hangout badlands recluse danske cinque shalt bentinck zec suu lennart anointed pledging uninteresting jindal deepen graced huo yamanashi kellie lode bracketed ursus anja prewar elland subtlety flashlight kpa aqueducts donington illiteracy mirabilis rafe smallwood drifter ccr kel rowlands cme josefa nominates strode simona frosty pergamon repulse wale filipe beryllium petros preferentially genotype checkered whiteley dama eurogamer garber araújo nuffield biofuel acker gustafson trebizond phonetically isham subordination woon duchesne shrewd shaka reb tipp cob prospectus dreaded claes trickle wasteful buchholz streep lacs allude clocked venter germane brougham camber halved salads fut gemstone prejudiced shunt inked remixing amx replenish downriver eamon commonality janaki universiteit holdsworth beardsley geri stanfield assays exclusivity baru germs ym excused unforeseen gpo spirou quantification wk manipulates dicks yousef arantxa abrahamic smock watercolours bombard zoroastrianism uscgc provençal sophocles atsushi kadokawa tauranga 
apologizing voix becuase mithun powerfully pickard kasai qasr bergeron forcible unsolicited longwood esch synonymy sparky monro tyrannical kozlov lauda montparnasse prizren pzl leiria orquesta dimethyl uru stasi cushman nevers narcissistic hilde desalination hollingsworth famille objectors ree rajasthani immunization prepositions mariachi dukedom fenn faraone grating chios overijssel blakey levies bernini kilbride ribeira maliki pontefract samadhi hariri terme dislocated picardie characterizations facilitator flue sheeran pettit taka qarah minter siti hiroki selfless icbm greenhill togliatti demotion modems amharic marla barometric bonsai fabius torturing conservationists transposition racked greenwald damning yeager shuster ricard magsaysay pds dilemmas widgets breuer plagiarized soden cahiers momentary guilherme jagiellonian getter zipper slav bolger epithets heralds singling norad crazed offa bodmin somalian oakdale osasuna flattering negri restarting wer empoli mastercard optimizing jig divorcing brereton gielgud alexandrian snowstorm clot emphasising galli nar nacho franchi chs obstruct esta gliese vukovar blockhouse prius reuptake scraped preoccupation feelin hino crewman placekicker liberté woogie gab anatoli roush premios depriving steely femininity hexham dura marshalling merino concubines hes contravention minesweeping greener keeling gascoigne scrutinized subdistricts generalize choe scholl srinivas crandall evoking dex olivera richfield boz sabotaged leitch barroso llama ruck rudra dif enda satie cheong graff injustices uyghurs aalto bahía henriksen abdulla paseo seabird shura cantos zvereva detracts standardisation ulterior tso toth declension pellet donates cupboard excised rectangles gennaro antonescu lavery factorial scythian quantico jari hock rabid preta ibáñez misgivings capping meher blurring kortrijk maximise marchant libertas kahne fec stolberg burgesses futility fishman randal tartar smurfs salma conspicuously silverbacks lifesaving 
islamophobia exporters middleware eifel kalgoorlie bothwell bridged keselowski shazam oneness mabille steiger democratization summerslam drava nuttall jud suffusion morbihan qiang surgically traoré groth leszek befriend decadence moffett paratrooper conga phasing winehouse tangentially kees ori rmit occ parsi detain newsom seaford lumsden rdf redux inversely lum academicians taito pastiche chatting utv hing kasey mansard cowardice periscope anabolic sneakers heckler gosport marquesses dolph diploid woolworths exif sla solanum quintero prat feuded tirelessly dikes kingsway rationalism honed punks aveyron phong starve canfield breathtaking gorgon modality bayes sweeper kenora spectacularly obscuring leake eltham unicorns lucretia ✈ kakheti geraldton obeyed ure carling basaltic grader rearguard nimoy sufis emmys kleiner ibanez epidemiological marte solent sandwiched henin fissure dualism rips shifter castaway carotid kotor disproved broadleaf sotto pauls degas ferraris stalag methodius nonviolence camargo downer paraded bestow viagra deuterium srinivasan gazi bicycling exclaimed eternally couplets nutt nevill aro trailhead takeuchi brownie psychical distorting hovercraft mitcham puss twofold distaste mutineers nullified newnham amina tamer invents clichés succinctly ij megawatt buddhas dushanbe chandelier darwen factional faure mercator hyuk chipmunk patched bioavailability colne zoot authenticated supercharger koichi diffused unattractive mattias exchanger alternation jarring vejle debug bathe appreciative loggia inés itchy arai extramarital octet adcock yuk galego timorese bhi prune generically benedictines oily marrakesh mizrahi becca tupper irena panics lightest chidambaram maksim arabella ballistics ocala obstructing csiro inyo lattices overcomes fca intergalactic begonia fiduciary watercourse dempster resounding pericles repute aharon femina migraine grohl zhongshan rheumatoid toughness soot pruned imbued quibble brea severing jaume mami colonist narada garb 
mejía irv neuroscientist discarding hippies branford jarosław unsatisfied macaw provident carne oic opm tooling menominee hillbilly karoo pyruvate linwood lld cyclical luleå soa gish davydenko ih ula waldman siempre ketone deniers accompanist cariboo hap maradona mccollum carnarvon braided schlegel galeria magnitudes sudha etheridge eloise throwaway vann teapot futbol inlets shard almanack adorn hawaiians yearning haunts rowman campaigners prefrontal pauses ruggles actuarial graphene nichiren honorably oscillators hives danza pacification hering bookings kham slotted ilford norge villar prescribing adjoins subprime suborbital escalators bessemer raine kashi disinformation picts leppard metzinger shim personified lahn epistemological xanthi christen booted wildflower boulevards chilly collectibles dinar steadman sagebrush maturing geer rochus belenenses reggina vmware steyr legalize casement elizabethtown ini gregson minimizes gam widens oita bola hak ttt escher nika lacquer beadle roasting mmp dips frenchmen mestre moveable brisk dementieva wtc modoc credential smoothing schoolboys postulates nyman alfaro devising yuka philological mendip heffernan cancels ashkelon kells rika outgrowth orlov debilitating recep kirwan mci rapporteur faerie anagram firebirds crowder wilhelmshaven mishap jaber hisham abed mtn wook naya barranquilla boulanger tanja phonographic halstead commercialized ventspils encephalitis reichsbahn willett nameplate cytokines cotswold exterminate raisin tremors buffs adder tyndale dangling farsi krusty booms pacifists aest pgs limitless humbly cranmer ghani boe childlike ismaili taunus sochaux aamir ponderosa serjeant everard hyacinthe mbs cottrell coote repubblica surigao tejano sivan firefight makarova tremont replayed depreciation beecham kumasi bulkhead preposterous clann dtv scientologists errani bulger charon allocating atacama knuth pais repose tolentino lingo protester passau monogamous lora elven leash cot tyrell longo arthropod thorny 
sluice mcauliffe escalante courtly trespassing bur underscore unlocking willa hitmen filibuster wawrinka catharina tasted condenser levitation hermetic diligently lehi symons alanis campgrounds corleone headset diction shabazz pupa topaz gaillard moron mcdonalds tutti fallow doin goodison ux sani shampoo carnivals horsley shimla evangelion citigroup oar regroup bayview hindmarsh rogan verein savant pythagoras gleaned wedlock yatra pastoralist keyhole grimshaw machinist enforces hanger venkateswara barbaric gulfstream gsp eleni masood beavis menzel redcliffe afm openoffice almirante iffy culpeper cheeked innermost amedeo gollancz alania theotokos radiative waterville elstree pathologists reclining riverboat corky valedictorian makerere amply rawlins denali splendour azevedo schoolgirl dpi richthofen pregame sportswear abdicate seaplanes radiance leaguer fluted cri uil soared leichhardt wane rube thessalonica nieces windscreen marbled fogarty discoverers bungalows arrangers hobbyists schnyder babur noe alpina impassable dens checkout plumes mobilisation aubert edina evaporated gretel intermediaries mehr honshu galindo impenetrable ionized anyang novorossiysk makarov prins atholl amanita cichlid marl lumberjacks karloff ultras ataxia rothwell enquiries ivey hazen inaction qutb cdf yellowknife offside bicarbonate nordisk hurwitz trask eben pastries vestfold owings betancourt lackey gianfranco doane gabonese hondo halal greasy skips pauly vallecano mischa feller skimming giraud hazzard skeet plump abellio cutthroat reinhart ilona chubby dripping erzurum dyeing sinestro ocr faceted bards devious columella langham archeologist chara electromagnetism orinoco nll persephone modo unconditionally musicologists cowles barneveld secrete welling whaler camila pancakes mattie dredge emphasises toothpaste ucr allenby impregnated budgeted greets understated arvind brunton geist furnishing anesthetic infraction mahut imitations guin foxe plumb frères camaro secretions bolero wnt 
newborns luk fatale cataloging stavros precession requester bream inexplicable machina krone dufour outbursts sofía minolta alcoa interrelated intermission isaak paparazzi baht matamoros intercepting lass blitzkrieg rebekah pyaar plundering tabled lauri vadodara meadowlands lázaro mannequin fcl nevins unregulated bana angeli trendy gto popescu goffin mcalpine genova quine gynecology ayesha copacabana amuse archetypes deadwood leonards plagues viticulture midler percussionists ranting snide cand restituta stilts lansbury villegas brac melodramatic bewitched hasse clarifications chasseurs mollie cogent salo loach wilkerson birgit morphologically châteaux nkrumah arminia imus bhg starfire eventing crass diverging hydrochloric roslyn maleeva hüseyin hugs halcyon mardin zoë rationalist novitiate miramax debunked ulla catalyses ufl fleurs gympie lassiter nextel tei antidepressants hesitated feist quintets soir wolcott riverina cornerbacks harrowing viña awardee coro slaps headdress steinbach hillsides algoma scissor milly macaque vaucluse unjustly tala lirr opec sayre ould stratosphere pegs clackamas rosemont goring expiring murrow colonnade corrientes polyphony transactional pippa hocking ladislav arora hamper granth tralee ilk espiritu closeup freudian patuxent maxillary pedagogue mycobacterium apec nss parallelism effendi mullah minimalism billington sverre nishimura higginson pomegranate phosphatase chaitanya tilley interludes trouser harju scrambling cwm pompous kohli hutu rushmore comatose specialise montgomerie laszlo ingo hetherington equalization solís gelatin onsen ptt chilli sivaji functionaries trianon watermill assange abou shute falsetto gaeta senecio hatter shamrocks ferrol matsuda steuart sparing ved nudes alderson krug lippincott sepia aig abreast islas varney surry morrill outed sé goldschmidt cadogan cabell snr yon wala nissen bahram throats gout impounded lpg antisocial lahiri homeric narrating slattery liqueur spotsylvania springboks doubleheader 
danmark taira edgy sketching spate consequential atticus bally commotion aural yay eichmann arnulf portadown lazare fives sweepstakes matriarch capitulated gatsby mullin vermillion millington stralsund juli cleave worrell edgardo islamism rit sdtv elías witten fitzhugh garrigues ormonde quin unsettling kolb midseason gamut gawain clades devalue gascony bystander stepdaughter mocks cleese emeralds metamorphoses aurelia sirjan meteorologists meltzer repulsive sewanee alco worden mulcahy swagger binoculars pillsbury mamie swallowtail vel ophthalmologist lotteries cistern kell alfons handcuffs brubeck doped articled compaq subtracted hotspots fondazione wretched homegrown internees ihre upholstery telstar ysgol dimensionless stato cpp antananarivo kemble tuanku matías undamaged bustling conquistadors hander winans secede walthamstow rauch flatly asb catholique crossbar armen valdosta amrita dunams synapses ringwood lithgow rostrum cleanly mordovia flashpoint horgan bullseye ipoh mem ketchum disposing marmaduke blunder seaward halogen douro blinds provisioning clary cancerous chronically freelancer ifpi nabil disjointed yutaka uptempo mtb backups legionnaires hazlitt sandefjord caja gog percentile omonia rcs clave heriot morello barstow precambrian grammarian cavanaugh avenir circumscribed hernia tocantins mitford courtesan trawlers aimé clapboard folkloric thurn blurbs sorely balmoral unknowns odette pauling craddock hasta wholesome nasi inflamed bosley qualms apologist caswell kilgore shingo yn peerages marinus writs huta dorje predefined mazar brahmaputra hospitalised destruct pon expedient breakdowns sasuke sorenson magnusson rubbed superdraft aic jenks greenbelt lourenço steffi pankaj bsp sma ssn moms zanu vez backdoor toaster jami nva cirrhosis fucked sleek tappeh ura dearth montebello marton encylopedia stimulant dauphiné gilpin transhumanist wigmore fajardo malfunctioning refreshed overcast fringed gaiety tendons urbano pogo oha cockroach caerphilly neurotic 
blockage tetrahedron beatz efficiencies mapuche gwyneth semifinalists philharmonia affliction oiler gio felled waffle ecclesia violoncello primeval designator mannerist pheromones transgression chops perceval protectionist herkimer juggernaut whalen laverne bodybuilders orig flightless robespierre meditative pml flashy civ uaa crake riddell assoc reticulum incited browed lusitania incensed graze emissaries hércules fitzalan brackett lolo muirhead cheddar hellboy ilie krylia meteors asaph bolivarian kryptonite harish psy casals werk entice wenzel berge charcot oddity glossop dory onsite penniless raglan susa hamelin albers solver wozniak judaic peloponnesian lom ghazni bernadotte luxor fes sidecar bistro raina medica articleid mouvement kronos bahar leroux frankford missal earwig sanitarium sano vanadium fetishism rushdie gentrification fj catwalk paymaster moline kingstown mahadev lawrenceville withstood probationary imsa bracts korg mayes umi galton incompatibility abbasi coauthored straddling wicketkeeper interferometer meenakshi sommers wenn coutts edi bremerhaven strewn candies doraemon liaisons unannounced hardwicke taunt technologist androgen storrs helmuth venturi veritable kaifeng edgerton massoud drifts refuelling cav faceless canaanite epidermis clinician superstitions someplace valerian carpi yf dials agarwal baidu conservatorium terse abril snohomish gruppo tino wolfman fateful mehra xhtml cerebellum marshland seema stedman visigoths preexisting ophthalmic smk unintelligible horan fiedler bizjournals hypothermia snowboarders headmasters ermine mccook fila trm azadi reconsideration lymphatic serfdom scepticism tamás jours dena pandas lisburn arndt vang iri rinaldi kelli sda cllr invisibility hafez tatsuya csn conant impropriety holton mcm llandaff glazer crediting sici dsi sowing wz warlike violas emulsion afterword delirious semarang sabatini tripped contralto gasquet consumerism strontium kerber overground presque reiko attributions lectionary nebulae 
airforce coren cosenza emmerich resistors ilija brees orang bülow annika hachette fierro polaroid calderon antti bolling kardzhali gmo obregón dramatized shamans lipscomb stjepan savory incubate metalist milli rajinikanth garbo isl nudge phraya idyllic coxless cannibals counterintelligence selectable invariance imparted busway ves lalo unprofitable pil principia vihar traviata obedient exclave ablaze athol caesium kennington actes safed crüe anka katerina quagmire amerika goons taku ilse pantograph castelli taff underlies meru holcomb primitives brockton sandeep steers unresponsive hillier wold mitrovica klee necessitate watchlisted contador terje ane immortalized playfair weizmann messe mz connolley bonin townsfolk averse redeeming kenner operettas outweighs munk adapts officiers spatially mtk rustam schaffhausen retriever incognito sherpa tailoring ankh alcorn formby taz knvb milliseconds barham erna latch corus nestled obstructive consummate denominated assesses ovaries janson velika lutyens mech anupam anubis stroll critiqued guerin elucidated islay venlo wpt zuid woohoo branca donau equus pta velde cyrano perplexed tsushima sweetest tirol campfire visalia percussive adsorption precautionary malwa hebert demarco sarsfield stalinism givens mks entablature calcite bayt hst agence permeable tyner yancey exposé mucosa stalybridge flava putsch regulus chulalongkorn cylon chinensis thereupon halides bosh dost doj rauf barratt keble californica clearfield censured morsi csl rhs inferiority viridis worksop messiaen plas holyrood rapists stoller dulcimer signers fitchburg fictions goldfinger toshio sunnis timbuktu monomer rayyan calligrapher ragas rrna lta lightship reinstating raison schur sieradz brushing perrier johnathan citi massie hyeon plugging cacao birkbeck entwistle foramen sabadell turki aleksandrovich mulan perpetuating hamdan cuando manitou suffices xena zhukov urology trincomalee nabisco estuarine warehousing nineveh dup unreserved evacuating zemun krüger 
maule dermal chita neanderthals srs fansites mies saran drawbridge bikaner margie tailplane thrives swivel farouk teenaged dissipate xerxes famers enos curtail riau narrowest washer poon jhansi ramen dependable dupage veitch preliminaries fredric methyltransferase atlante bouches collages analgesic cecily carcasses axl shag rundown smurf soi galilei messes disfigured dexterity shafer unsaturated birger bethnal castleton shortfall ssd apologetics ovas homesteads dearest accelerates smuts rothman kesha tahrir panelists sizing quilmes merciful masterful shopkeeper vests morgenstern seger pcp amines bamford walworth gauri qianlong vasili practicable oswestry juma leda amu tbc disservice beards brooklands marmalade naft tracery operandi saguenay beaconsfield randwick consuelo snelling overdubs armoury antithesis leesburg batley karo matures longueuil realtime pellegrini fogg improvise mockumentary wiccan perverted timbered gatherer crevices sepp everly gaiden eger meow mcwilliams emacs scotts diwali permissive manilow carman ppd selden rickshaw karsten hardworking tadpoles shone dawg rijksmuseum wort discontinuity bergerac kashgar lug homemaker anglais purposeful addictions flintstones handily gorda montego cadastral quinto evie vertebra vendôme bathhouse gabba bloor hexadecimal moulds ddg philologists bretton smithers electives fhm glengarry eleazar internationalist herzliya jossi gwh reassembled serhiy globalisation karna igf trobe cisticola kayseri coagulation lapses mladen noda kamel alten dungannon emcee inger dabs cantrell unrequited oceanside miu facundo crawler trueman paulette usfl installs atleast cfds vestments shanahan matson katakana machi payer aeroplanes powders mathers rigoletto gmelin reestablish salcedo melaka boydell skateboarder morden lilley diallo haan hermeneutics wahid tmc kivu esso duk infallible hermosa stuttering concurring breakthroughs bremerton squaw uncalled thanos marbella winder libra bleaching procedurally kimble freeview indictments 
clashing ebrahim marengo luzern durrell skim morgana sucrose elmhurst elks castellano plenum nami mise giannis vaulting trackers gaurav suzuka finesse conceptualized livius brooker semaphore faria inhaled perf drucker glan codices macbook leaky scooters voce pilsen foix reconnect trapeze hewn booklist swinhoe lenore conurbation certiorari disparage glockenspiel lactic thrashing forêt glaucoma scone daydream reyna lorries escalator brahe cava yeshivas passively krugman contemporain higham fairport reus infantile stoltenberg fiume vespasian xfinity borghese schenk hansel stenosis alexia symbian orford kul frieda merdeka alene repent baca clapp lubricants sluggish vying eckert downton sigrid longitudinally shibata barca lifeless ldu tutte miserably goetz alexandros litany proverbial laurentian zvonareva memorize marvels calming redevelop stash demeaning stilwell profess casio dacre negated secures bonanno swims lq pounded immunoglobulin sapienza bakar underrepresented fürstenberg doggett belgrave congregate bitola millionth lectureship cargoes gaulish slumber llodra hsiao docg positivism gingerbread singha sequestration metalwork bulbous serna sponsorships né wasatch mut gcb pohnpei chonburi broz mosby vetting castiglione hydraulics responder bhagavad masterchef ruan kum gentiles oars madly kamp brownstone septum inadvertent delos cbr milliken imphal neuropathy sokoto fitzgibbon layering ntt barnacle progesterone uli gullies sutta inflate nafta rhizomes toungoo decoded verano straightened improvisations femoral marchetti pellegrino ghettos pele bharti néstor clerck effingham inconsequential transponder sys podlaska nikolayevich categorising lockport altair myrna akiva capacitive samuelson sympathize keiji annoys acumen sadd rappahannock damme bisexuality scuffle loiret saa melina chasse untrained pontiff exemplify compensating inadequately fso edirne jehan kimchi khun gwendolyn monasticism csv rialto sweetened trope mistrust mouthed lusignan clos formulaic calyx 
whitefield nesmith spandau orden seb gennady zelaya matchbox emulating worf arnaldo ambushes mistreated unep trollope joris handset untouchables militarism masterworks asmara plácido marchioness spliced jarrod enc nitrous carlsberg attentive pigmentation ainslie cofounder tsg nazar urals earthwork steeped dredged artem recreativo sandia cheered baia creoles auk spiritualist laconia potash detergent shrouded léonard shrews whitbread ⅓ bodo klf lightness boulez leper svoboda munda statuette satyajit zor dived mirna cellos noncommercial denbigh hermon glycoprotein fairbank timon plebeian otsego loam haj hoch formalised pediatrician isolde edoardo roundel likhovtseva janelle fol elongation satoru chernivtsi anda iago polychrome responsiveness ssb sais aurobindo irishmen repton ferndale anker endangering dueling ronda audacious jerónimo flautist holtz mercilessly nevin savona vcs chrysostom politiques tiring ytv estrellas outweighed hardman synoptic massenet vowing deleterious instill existentialism magnificat vitreous analogs kennesaw pessoa catedral rels homeostasis vouchers grp syringe loggins beeston raisins succumbing flushed ilyich elio vimy nda gregarious hsin hillsong ferrier amethyst antidepressant alvar dazed oxo roms dewi schizophrenic motagua supervises nieves arnett tiësto rephrasing adore mumtaz licks dru grammophon naha scène sabu quasar sadc robusta coughing glycogen comix ashlee abkhaz admixture hartland deniz witted desportivo virulent ignace caliente bagpipe depose dateline abnormality lasky connoisseur strafford bridgestone safa fuerza tortures kennedys hager seto concealment sila prater dobie metalworking olivet underestimate riker wishart sextus tubercles ernestine zacharias leaned antioxidant ridgewood chancellorsville meiosis biju tourette nagle ayodhya orhan commissariat heals olympiads expounded laud haris prentiss delacroix greenery barisan pollinated munn porgy impair bracknell ecotourism aaj manama bandicoot arecibo raye snows loopholes 
helices dengue mpi sopot goebel archon reptilian crotalus xaver bosque hel schirmer copperfield claret raab kolar gy galleon enright robustness juridical lint kiko ponzi woodcuts weatherman dibiase oam alphonso kirkcaldy crossovers rhizome cognac woe moen kolbe tachibana cmj thurmond bsi badged gargoyles guantánamo recreations replete malmesbury oilfield qiao juniata amstel godley marvellous junkyard diop thier redhead zm mexicali boycotts spiel purporting rincón carlotta tabular pender michiel rhee roslin ohne musicianship millennial peculiarities annulment wham lunchtime radeon chrysanthemum evangelization compressors curled adversarial durbin doughnut wav unbounded spitsbergen tutsi northeasterly negativity upstart tani distributive lacan casal zaporizhia cavanagh groucho toyotomi ess perrault esk arl waterside ² inasmuch bendix amhara encamped projekt candelaria meetinghouse lakhs wipo offhand noll malachi waxy sarek storks splitter bruni paiute neues conglomerates aruna spars bizarro beachhead alderney omnium nim nebulous bodhi dizziness amado blush sassari badakhshan wemyss tiffin holyhead metaphorically tuscarora aerodromes visor zeballos condensing burwell alcott nankai beata multicast niemeyer tether bhatti yasin glycerol matsumura sultana islip workspace sandhya biogeography greenlandic ypsilanti goldstone farooq ober trophée refuting defensively foch hallucination storehouse solvable precocious siphon crags gunsmoke campbelltown moni tamaki nutty evaporate auctioneer belvoir talley sepsis heusen hvac stormwater workaround highfield quaint caliphs satori ambiguities uaw shuttles haney alito akash roberson junge rained ensue jayson mejor impaled aspersions vali vocally méliès nigh uday paywall breckenridge gracia bollocks refresher ellwood glentoran minstrels dpr atrocious garten tickle dedham oban sada westfalen heartless tca azmi algeciras ferrying winwood crystallization basques hiwish saxton desde juxtaposed encoder lod looped unelected bisected clout 
tournai intractable grapefruit apothecary sohn salgado hollows denman backfield mesto raghavan aum ucsd garay pabst keng vibrational verner walkin fallacious mok rigidly pelosi bernier leia fanatics jeu exoplanet unnumbered denier kankakee americano egyptologist factoring resonate wilford giffard pacino lucrezia hogwarts wenatchee thruster carbonated yai cze cfg underline orgy loathing vasari stockpile inxs caritas tid rutter mucous vandalise sema arashi motionless midline sookie sarai sgi mandibles submersible abington chiles cleve compost rundgren soong scuderia usk prisms otoh sevenoaks bailed imperator threefold parthenon seafarers gnomes reykjavik cherish epi doukas sombra idolatry unload retirees brantley frederica professionnelle michener dumitru embankments aiba reconciling jinja mariko claxton smashes agave pinckney gerson chica uchida sumatran pietà centipede vistas tzadik ghoul catlin defaulted rollercoaster wahl pgp battaglia woodblock swells haigh vesnina gelsenkirchen tris hoot perceiving floris cheval scimitar setanta soak grapple slicing deere nutritious briefed stéphanie alexandrovich techcrunch annecy sfc summerfield rashi fête polluting xxvi palisade orientale mermaids acad kovacs bramall cna artis sundanese overzealous dorman kennet macroeconomics lemay forerunners thq kherson freeland almshouses dori racquetball tact sustenance takumi ravines jansson beckley veliko kader headliners testifies profesional saka moshav songbird playwriting medics mixers shires bobsledder proportionality paribas tyra kroll transgressions narrators rosanna kokomo seguin vecchio enslavement clementi kubota rushden arin grégoire elko ethically hilt zsa mitosis offsets miskolc ramanujan hastened kirkuk docu legislated precludes prelates fens comunale camouflaged roz keele aue ikea chetnik lillywhite thiel levees omi woburn appreciable cheol pandavas dur jourdan anthracite tremolo spud fundamentalists barbieri lifetimes bingen sanhedrin prong lillooet racquet sown 
lorimer picker ranjan malick cheeky banging sian aloof plasmid verden vásquez tabulated erde hjalmar duress edf quirino dogged kempton renate wolsey goalscoring zuckerman kharagpur retaken visualized biweekly orfeo omnia ariège chippenham acetone chandran halide bonaire ethers mariya testicles rasht couplet chaff bassey okazaki penning dfa minefields snowflake superposition gatekeeper rsl clapping urgell dsb pyrmont autographs codification reincarnated pkp ján kargil babysitter soissons romford unworkable ignacy osc incisors estrela commensurate redo carcassonne quiroga busoni periyar manukau unranked philatelists insectivorous electrolysis holed unbuilt domingue dayal marcela bridegroom hyperlink ebooks vought mahjong acrimonious primogeniture congruence tritium legible bassano sohc effie misbehavior blücher tilton tetris yvan restorative hikes fouling fylde panthera resourceful irreverent lucero pathan moroder galloping nemanja cowie pui populism hefner adriaan adria liquefied eschatology taki recherches snarky dien tauziat palustris impeached bap aeolian garson flagpole tejada baehr epileptic katarzyna delimited disenfranchised dep epochs bayeux arion misa phobos moir subtracting fauré jeeves gloom disengagement unaired rta instated queenie valse lope matias varley disordered searchers manaus möbius mita tomcat placename hpv colmar msi fiqh bruises lci tlingit slippers mccloud frente epps storch webby overjoyed rih technologists yanks caldecott freaky poppins utep aurea aishwarya tirpitz roussel gilead forecourt jeux platypus mbeki soundscan devotions kempe dba woolworth cogan uprooted blimp kantor mensa veloso testa amyloid lopsided kau subjugated sls merciless vouch homicides thumbnails sagittarius webbing bramble newland baboon bivalves firstborn mukhtar hailey verna joost emplacements rubik cidade barman antigone threading specious deeming kloster yank liberators programmatic founds citibank rother neutralized suomen auer paradoxically bromsgrove knopfler 
haydon pimentel unplanned tawi lietuvos chocolates émigré belediyespor circe xiaoping rusher mino ales gerlach adverbs bloke merlot blok gunning garrido recursively mckeown steeper hitchin madrigals clearest inflexible smitten apportionment endocrinology impure ganj nona curiosities wearable diu trovatore fajr diarist newsreaders immorality boomers perfumes tân etiology expedite bollinger girders sweeter embarks rebuked mötley suburbia onlookers kaine ƒ cabeza microbiologist nook erupt koe ridgefield eames semyon ort virginie laidlaw prd kazuki collett tewkesbury amjad avocado shareware exuberant warangal mccurdy hasselt swirling crum strathmore ene whining graceland mère smartest takayama gst strindberg mobilizing nazim shaver rigg resale bil triads autre rapa glencoe creeper ujjain sunfish xj excreted jenn skillfully shipbuilder workmanship saltire thermonuclear hep goodreads hearne thundering jenni attenuated moloney berets mur willey lek torsten willfully charentes babbage vitis misadventures semblance angelos hardline kroger gawler rundfunk rectum uz girardeau okamoto dejean dts cng tdp alienate distilleries handicraft anakin legendre khans equalised swelled luttrell implosion minnelli continuance régional faintly issuer swindle broomfield rubicon molluscan ths intrusions barrack blockaded deering lamina sustainment abyssinia excision alda insulator selig rascals turdus dashing jolson appellant straighten leniency vinay nrw splc maneuverability subcultures transjordan saws ftse gálvez staunchly pleasantly fromm maes gordo mati elen airbags shimmer raccoons avenging lexicographer aja vuitton izz bataille cling stratovolcano hatteras ulaanbaatar impassioned infanticide schweiz fingered pirin kellner cynicism foreshore cooperates haveli octahedral cse mckinsey conflated cueva firebox mmr aspires aboriginals cozy generalised overbearing manchurian macros bushido interstates industrie munir kavita vangelis maga ruggiero superannuation prejudicial chub pontic diehl 
nain rowell refereeing riddick naca euskara spiker vesper overhanging parabola convolution proportionate equaled barents shashi rensburg yavapai bhojpuri pauper mond burdened superfast pseudomonas verdicts motet savchenko creamery jas citywide idiopathic durst quashed adorno giessen sicilia abyssinian sobieski ablation diverges legumes psychosocial laswell buenaventura matterhorn papi hoffenheim bassoons manhunter nogales inhumane cantus cask concordat exemplar essonne tarlac correspondences jemima sarcastically icann patterning hydroelectricity funnels repulsion abstentions impressively wied bosphorus putter runnin bailiwick hypothalamus darío albee taha danielson heike lexi graben cheesy magus crewed embolism jackals laker deeded bittern tubers blom monochromatic awoke abbottabad buoyant watertight noguchi lipa proclamations stour tlaxcala zucker libération pathos tempera motu stockman hants overthrowing vcu suffragette rockport pica bounding baile enlightening pennsylvanian jón electrolytic cowling imaged jebel metros grayling souter freighters gallica tyr kossuth pathogenesis pettigrew daugava staphylococcus rcc warts factored mitsui casco levan mahadevan labours fairing savoia calmed pilbara sickly sequencer dupri reachable imaginable kaneko rousing safina inefficiency ulmer frederiksberg zavala maldon vico lookin bayonets cumbia dhawan musculoskeletal unlockable ishq barat niculescu eventful politeness debunking ayacucho geneticists kavala procurator capoeira afon piney parables whitcomb turbocharger audax magog meander ancash aaliyah superlative valens fixable wertheim shaquille raz domitian plummeted heydrich flatbush hannan emporium johnsen prichard watling grasse utada jobim pattaya hab natale qwerty pueblos doré nsl illyria craving mikel ecologists lurie wheelock fop corrects bmo fae intensifying ⁄ chasm holbein gordie antonis revitalized poulton subpoena harbinger aldous edgewater carthaginians komatsu edgeworth anuradhapura sassy tinian computable 
attlee cluttering yvon minibus palembang batgirl condone labial underdogs flirts ecija toccata autopilot ,the mulk kluwer mahathir scythians uddin gyrus noa jackass unlawfully rüdiger larne rickenbacker aryans haye nighthawk kabaddi modernizing akhenaten collides counterterrorism meriden rejoins resentful abell abbie yoda floodlights cliche chillicothe veterinarians mame lidia metastasis redbirds batang imperatore mobley watchmaker mey gayatri blouse volumetric etna skids abbe sylar taiji rickman adjudication stormont unflattering seduces citizenry gottlob aphasia lire hag postcolonial interrogations lye disaffected asteras arthurs duffield solicitation mcauley exerts negotiators nervosa cyclonic veronika marga aleph ferried taboos coastlines predicated francophonie theremin xenophobia belge rha sra tbm gargoyle determinations unp empresses gonville fergie gnosticism jla shijiazhuang dwells susumu voldemort selfridge frse sundry wiggle belated redeployed sump contemplates pollinators gbe defaulting stoneham flyby alsatian landless hesketh hindering mappings mikkelsen lithographs proscribed wiles ferraro cosmonauts thinning ginn sanjeev flipper qua seizes retold deviated crisco paix franjo bauman tvnz monckton kyrie fuad socialites pictou evacuations sayer roethlisberger toggle unmodified ubiquitin ther hythe stockdale vuk gujrat depauw sukumaran minos bankhead trotting akane sinfonietta aardvark methodical anis emt roa dilated wabc tethered hoyas mónica lalit oxbow alexandr marksmanship brunette déjà mariusz dormers heyward stingers teardrop sew fenner dailey ridder karolina carbonell holmenkollen akiyama oftentimes leh freestanding esau epidermal humanoids eac ascribe messer warr holi fertilized symantec kuru grinstead jeet reassure csb loveland fain fittipaldi manitowoc gharb diaper narain dimer theosophy sveti candidature rehash dss honorees ung caernarfon veronese chandrasekhar coritiba distracts kress scholes konkan iam foregoing watkin germanium finches 
wessel astronautics anza reprises guillén sharpening optically morgen kirkman abomination rectal gruffydd royle econometrics crowding immobile ripening ulyanovsk repackaged nursed stax feliz zinn cowes misspellings tapia outcasts handkerchief laughton eilat brm melancholic transiting chaffee miko traumatized benefitted rearrange hoses hezekiah gums alaric pth gasol sacramental gyro relativism nts sandinista queried tizzle mountjoy aeneid candlestick tuan romer bucs veal thapa nitin wilber ahh vitus dazzle acoustical albi permafrost truk srt cursing keir jujuy maugham aristophanes mineralogist blackmailed emphysema entombed roughness radiotherapy egil conformational bunko ttn considerate swath montt ivanova tiber hectic ruano skilful ries pix henriques rtc headlamps chuo bootlegs clerestory neurotransmitters surged awakes manzano blacksmiths tirupati nota aronson olden quartzite malkin willingham uit backfired lemuel batty elly schiavo constitutive ekman kushner backcountry dominus stockholders undertones stephane typesetting hitoshi arakan earthenware ywca seifert lett danica coughlan nour kabhi neff monarchists dragonflies despise showy eluded pronged hummingbirds iaa quintanilla flamingos hamed andrena satyr obstructions seria santee atrocity dodging solberg indium fujimori liceo eakins netherlandish prawn roemer pallida luxe diminishes shapur rix scifi tol ack suffragist sankara ethnographer gigabit devaluation pearly exacting rothstein michell radley bba transformational vagueness jihadist forecastle leaderboard westview accomplishes bebo patchy sundaram prototyping platter weibo abstractions jessup melilla procuring abergavenny manos bushfires pare reals laure consternation untouchable hoxha violetta hutchings murs ulu raiden virtuosity remand khash choked undercut zhan jussi surfacing voucher bushranger boku monahan thanet nines robocop kellerman corroborate wsl stine snape cyanobacteria hors uu bedlam stereotyping astonishment ede grose sacral masthead 
abraxas skylight bagration prohibitive hunch safin fluctuated definable submissive pillaged pontevedra vasconcelos subgenres evita criminally weidenfeld soca ache dmk gord bloods tvr cunt bornholm fifi insufficiency gasp ruf bragging batt greenaway squamish subliminal primorsky princesa tdi capitalise lindner marshfield kosovan personas morbidity purest acura trickery aveiro orel inquired catanzaro hodson gounod patriarchy totnes pitfalls blondes wigs renwick kora parodying huy impersonate dreamland kirkham bolan stilt sprouts sturges cholas predictably insemination haringey linger opendocument ashdown rann plantain libellous slurry somethin tuft bestsellers moti galería aníbal berea subchannels bernardi salar mandating masterton sherri embattled fella gratification computationally paraphernalia franziska cantwell unexplored disrepute multiplexed jarrah tema finsbury mose indisputable enriching inv tidying lamia heredity yt directx rooting breezy landshut woodwinds darrin aotearoa alligators jacobites rehoboth itching woefully sebastião rayne anschluss tombstones sterne cerebellar fluctuation testers corvinus agate patrimony insecticide sundsvall dissented synods défense kleist hosni traceable uttara eurocopter pita lyase ovw clarice beauvoir modifiers mcveigh anderton shamir tes gur siskiyou cpm reposted tseng gaziantep coopers callisto sandys linga inclement hejaz nodal showrunner tribulations yazid daigo angler testicular pours fara emmylou signet priming panes rimbaud reprimand valente apologetic ricochet leib inst motels virgilio kiva darley annuals kook neverland elsinore fervor garhwal mattered derecho baritones cloisters cadena jomo skynyrd cirencester gata dasa fallin intelsat aeronautica roxburgh arica donned bohdan pacer exterminated prismatic dollhouse infertile blenny faraway margareta mingo emf asymptomatic cunliffe radhakrishnan clc marlo più allround intercepts franke shirvan scribd supposition ashleigh schuman noticias triathlete salesian concours 
banyan supernovae piaget redfield meaningfully pge chamorro cannery misiones zain reorganizing ackermann osha carronades mandrake nigerien jezebel raitt durbar eis forays innkeeper rnzaf spokes ferb jor overwritten rpi menstruation unabridged witham wipeout hippocrates texte pareto blindfolded playlists bharathi welle ulan frauen cyde plotters predominance passable powertrain neruda oligarchy amenhotep kettler reps oj mahoning wallach shipp damper conquers smithson validly hsp zootaxa interrogate plein resistivity synchronize svein barometer fleas mitchum squatting chantry occlusion legitimize strasburg belmonte prema ― bcl atletico copulation pakenham timişoara ccl palladio cancellations evacuees prebendary polyurethane scarring darwinian landwehr ruta nand grillo excavating dedicates ronne birding riser olly grassi mansoor zirconium touristic androids tanglewood usps oakleigh winningest mulatto geriatric tangle crammed pata fredericks komodo orangutan brosnan ciro ansel sikorski blister deductive instituting frémont chitral interferon bigg satires resuscitation kenmore rochford natures newbridge juha crescendo cloths barthel diversions columbo pennell heo cobbled carle transposed freemen papp hvdc osh gba bookmarks scherzinger iwan macdowell obtainable thurrock offbeat wordplay chagall inverter igg fürst poulenc daggett dispel bca lawfully galatea arta serres woolsey iep bounces morelli blackened andrée niebla classifier conservapedia quays lashed geraint infact platelets lyricism goaltending tarleton booed pollutant emulators inaccurately trevelyan frodo fob flocked krosno bua kuril creme morea brenton wdr henschel henman forsaken dorms bibb amba sulpice cen leftists allyn bein transcends ladysmith totalitarianism captivating practicality pashtuns kenai humerus panay spacey divested tonality worthiness mercian amputees ballon satanism stanislaw goldeneye grandin kurukshetra prabhakar axiomatic dmz chatto tbn exon rubra skipton backside buckethead morphed 
neuromuscular gascoyne colle freshness overrides armorial brownian heil pug glut pallet agincourt chamois seder reprieve tio dicaprio digitization conveyance igniting sculptured pcf josephson erlich punto streamlining bombshell stolle garfunkel acuity posada radnor lard cert salve manas lumps tovar metastatic eliminations fiddlers dha ahem seco misdeeds krohn gyan galactus futura bartow showman edin mizuno isma pretorius brockman briarcliff soros haka misidentified fro tablelands tailings jiro bauxite omnipotent leeuwarden annenberg iti daisies maccarthy sobriety zhuo swp neuroscientists fabien farhad whitten frauds jarl incurable furiously arapaho coromandel battambang amygdala magnetization enterprising companionship ouagadougou fon nikolas venables opa surtees lafontaine thera ramana stung friedlander delphine suter sanam booting merriman veera cra brauer farris watchful generalitat umpiring skimmed refinements abramoff hashem carat marsupials gemstones horrendous atco cassia ledesma fricke adj hydrologic streatham paused nanchang lak brainwashing tage pegg flourishes siouxsie storytellers ratchasima memos hakeem obra neko hapless ballade horseracing translocation kuti nutter sont criminality remus sanctus onassis qat incubated blacklisting sunnyvale viggo bumping breeches lintel franky wily efendi papaya dispossessed maar mui gooding demonstrative domingos potro nonesuch hirohito weeknights duomo waitangi hayakawa exoskeleton jost legislate tcr concertmaster lupo pavlovich gaborone cortlandt alana apnea dprk asda teramo pickwick sleepless mauritanian adjudged fantastical caddy saxena rupaul navan concordance newby remodelling peeters axelrod newsgroup dispatching tetrapods bina dougie banquets politehnica nadeem arginine exxonmobil grazed shuffled ibero phenomenological nhra holon armidale quranic mayberry urns sophomores termite drifters sona perks alienating legio dargah sprays zuni juke cae tiga unequivocal bidirectional dutra hattori dasgupta luciana 
dunstable alumna hema wuxi lapwing phenotypes pottsville semantically draughts generality rajapaksa boni reaves overridden erred balthazar sorin coronal tonto grainy kashyap havok diagnosing carmina escondido celluloid mallow lain haarhuis poaceae strada apuestas hina associazione wallenberg martinus goodson sheldrake varnish scaring rehearse noose safeway hemphill seidel soni flogging tokyu contributory farrington ennobled aquaria sieur deformity wgbh burdon hoodoo lyudmila gettin scat coxed gelding hayworth traci aleister yb abstaining macleay barone girth unmatched burj sparc mahone infantryman vizcaya castellanos crustacean hrt saintes capitalizing gravestones vets pepsico sarcoma badgering forefathers cyrenaica hollande bilingualism wmd mondadori gunnison thf fiefs brom retrial daud sandal hornblower schütz waratahs snowdon sixers pathak autoblock supranational milking foxtel spb curragh aiko cull ppc championing serviceable poop ellipsis vorarlberg blot killa venizelos debs wf creeps kristofferson carcinogenic lobbies italiani wein straws fulani miyako lamy gente suffragists magnified mandibular cropper creuse adrianople quai canopies karpov christus ibrox prodding ostia ça cosplay atms amiable reliquary rayburn benet raving dispositions flange pentateuch ese cooperstown zakaria walleye kinky ischemic econometric oude unaccredited gaudio matsuyama tranquil osteoporosis versace shenhua embarrass sreekumar sappers hardee wazir soaking maxie modulator recused tsuyoshi vesuvius robben tunney stackpole visayan aggregating treadwell deon volpe fart tubman auster khon hillock rawa fabled overseers heft inlaid spina apportioned emptive imperfections lubricant arundell welwyn insertions unmistakable utley golgi buganda coq carswell recruiter infiltrating geniuses gow freeholders adenauer mander thyme canute jeroen porfirio thucydides nolte printings kensal levantine cleanse inquiring petitioners killeen tallies là leveraging defaced redditch marigold nonstandard 
oromia noddy blotches jefferies agong risa abscess antal daycare kavi acclamation handcuffed hydrological saussure strawman hasten perelman punting behan plunging zetian hark pequot biffle villars slingshot thalia pec unstructured aaas electromechanical mashup birthright martell bitches nip ramu iana quirks absinthe royer rangefinder watery heung vaduz dfs rind pbl custodial radiators troublemaker conformed levett cann stretton hindley lezion lindemann konstanz lawlor culex conductance canes enthalpy panoramas flops looser hydrostatic cybermen plos hirschmann thefts halberstadt msdn handyman absolution videogames remastering grafting lavishly armee lilo lytle iida osten giurgiu vik conciliatory groening fátima sidewinder wendel hattiesburg baran kidder bellman camellia nlcs phoned muskingum rawhide carpentier consults maddalena dottie caster waveguide cayetano fritsch pakhtakor ffe nusa gangnam latakia meanders shopper belén nita prez waive ashish notional yuichi yerba unscientific masaryk wilco ciaran connexion hertogenbosch zuo bouzouki irregulars rackets chania vasu pina nightwing wausau mcshane szabolcs militari velez corbyn lanao caserta detritus eea kani dudes tdt poodle concisely castlevania flume digested kemerovo polemics gana pagasa hoshi enniskillen misfortunes kimono caras swamped cosmodrome recoverable cormier knickerbocker cofactor tradesmen ousting zahn inker travail bottomed chela biodegradable soundwave cytokine bava inductance bramley nagarjuna vibrato hammering dili cgs categorizes frantically heathland springdale watercourses aroostook artefact nieuwe slp demonstrable swazi onna butters geraldo clocking divorces krystal ember thoroughfares punctured yuna devas winterbottom norra dwelt stuttgarter interleukin glades theistic superscript saheb townes azar placental revels earners aarau pontius moshi offensively enchantment gymnasiums mists intakes tubby chucky lyall gunung rapides naps introspection fta grattan cayenne mohandas balázs halfbacks 
clawed bipartite cramps arkady delisle disenchanted zhe archeologists easement joule ventilated kimber possessor homeward kura bidders goc gliwice dakshina naan bren storybook planeta rosina mmc bloodlines sepahan yusuke harburg vickery maisie beholder degenerative polybius hoh tarawa cresswell fillings quds mush yousuf nayarit arusha telus whiz coulibaly fata meriwether swanton pomp goble troup desecration illusory hopkinson rugrats kongsberg caribe paramaribo allure udall rectifier gruffudd ballerinas milligrams kraftwerk sibir machida backfires lieber nichol marauder chaste narcissism nunez simulates imc gormley ruch screech danko devito wilks lorde starke electrodynamics testaments gainsbourg bolder pipa hewson beazley treblinka zou futurism drinkers optometry awardees repurposed schiffer falsification bexar popularization meza arent korps justly timescale belafonte moh instar cort thickening larch machen aws hor disapprove combos cleland gomer koster fondo chipped ruckus suetonius labuan kanto dismembered floss sudetenland megafauna moksha wanton impressionists caplan recites statham tailors samsun marduk jülich selznick fuseaction halil linlithgow effluent khas golfing wendt kwun letitia wendover blaney arrington naim deja handouts segmental valles reinserted watters irrawaddy frets celso unspoken downie bruin backlogged chittenden rameau samad kas hameed valero overpopulation yusef bipedal disgraceful cinq miramichi bajaj tench banal gid viewfinder rive genocidal hooligans evangelistic interdependence boutros emblazoned jaques klezmer avalanches overflowing fulcrum nya sofie ulama erudite cautionary jenin sauvage wilds foundries hammerhead martinelli mensch slut luring mourned noli glarus lingayen pontificate eliminator argüello székely dca √ hackman flounder fairground juraj ambrosia nifty asaf menial martineau contraceptives snowboarder polypeptide tiebreakers diddley shrank stereotyped greening pegged unhappiness abusers domaine flotation efes importers 
burdett morais devizes firework bhima quinnipiac underwriting hijab smoothed cavalli anare caterham mbbs burners umpired reductive merengue landfills chawla adored metternich cooperatively frontispiece margarete tosses townhouses stora populus hiphop cañas kindergartens energized isner mccool tannery darla regenerated umbilicus indecisive jog bromine muscovy cetinje cavalcade memorized populaire walkover scrapbook larvik kaj herbicides understory bicol grunwald mcevoy mcginnis clwyd resnick biathlete fujii binocular buoys shimane aggravating meo porting clubhouses hick smederevo soured humming tarnished cates necaxa metrorail victoire encircle otway elfsborg mccay easley cgt predrag pyroclastic brasilia strident cff watercraft airdrieonians bixby decompose southpaw flutter cookbooks ganymede yarns fci blantyre derivations hôpital pentagram brianna objector kou alcs elizondo dak echr joiner brecker fundy renderings isro soundgarden azerbaijanis laborious choate desjardins widener mistletoe straub profited ilp chahar stansted acutely chine craigslist harewood grammarians procopius vento freie condensate transom nootka broussard roamed ooty kronstadt berkman redefining cholmondeley questionnaires brienne pori hydration propelling acorns quicken dugdale maduro oddities fluvial tropicana ucsb osmosis azuma exogenous diversionary podlaski pichilemu voided stc berners motorist underscores ringling impurity grandview relish layoffs postponement lewisburg iulia breathless ivica puffy screamed remarry militaries allis ildefonso spor archivists underbelly hyenas ineffectual expendable abstention incurring truckers trays redneck retrofitted hubbell connick emailing yasuda fábio popups actuators herzl servette bega hamsters tracklist ercole academie effortlessly insidious coc andreev constructivist markt erectus fandango metropolitana margrethe assembles adjusts belorussian bimbo dennett ovoid hann cpsu wile ethnologist molestation infatuation svend maxx reade stateside wept 
mauresmo supplementation carola celestine kohen xylophone pham fronds perla petitioning etude berserk été wieland blogosphere gurevich júlio bcr tempus rerum moda meissner stp transited feria trillium antiquarians licensees spender reimer larynx magalhães choy thr hss tegan passchendaele discos decried katja reconquest tumours snipes couplers parklands cañada healers autodesk innis haight hewett musings experimenter loudness const remodel ourense teodor quizzes ionia boccaccio configurable viareggio imprison cuesta salivary toscanini melvyn flycatchers ejaculation bountiful mcp caveman axa winchell dob eyeball chemins iyengar electrophoresis symphonie agribusiness tolerable inr whaley singularly andrus marsupial spooks perdue weiser ines hasharon curacy waal dench ert dvr oli referrals galvanized gec graaf steelworks licensure berryman descriptors bayliss noto reparation potocki anheuser elegantly coldfield rie nysa ozarks brophy resized grahamstown corrèze pka slowdown estonians bagwell máximo devour mote wouldnt maples vasas whiplash centrepiece therefor moeller noord araki occasioned pullen recur horrifying tricking wahlberg prophesied baronial tympanum thorsten shadowed baluchistan gaskell regrettable mirai turbofan chroma enderby maroc cynon charred hooves glittering stratocaster côté nilo hyperactivity lucasfilm pasay bax matinee trt tondo organically peary pouches jeevan sheung qp neha perot washoe universiti bourget mishaps germinate stopper paddling deviantart oso veria arse blossomed mangled equilateral nettles instantaneously cutlery wagoner parkersburg surpasses barbaro monfils drillers famines rampur tambo dano wahab kher stitt despatched kotaku needn shl fortunato counterclockwise brimstone giga joly desh antigonus scythe reentered phoebus preyed kerk pendragon badu caso leer lucasarts hortense malang minting southey curatorial gali arak flammarion towels chui firs dmca warnock munoz ligurian haulage hrw anachronism elaborating agnès udf snowmobile 
lapointe sheathed novikov benevolence autofocus lindwall shir frege arsène janez civics offsite mataram salmond gush diller corregidor rime maquis retort msl vazquez böhm smalley addie brel gyroscope brazen coals skagen educationalist revives behemoth montgomeryshire arrears garmisch latinized demosthenes pygmalion basildon coombe actuated grower husserl potok sebastien prt wareham avars slaughtering montauban gulden canines conservator newsprint driftwood hamptons videotaped lindström inconspicuous crucially adrenergic mada wishful capstone tarim cartooning dunhill resold dreadnoughts felon mayu burgenland majapahit trefoil hakan bolshevism kanu baka shuttleworth melanesia disintegrate mannix hedda barberini peshwa britpop brasiliensis peterhouse vigour imi kenta miasto cajon cdn treadmill bandra absorbers virology moroni psychotherapist squall heikki hhs jointed odour thurgood twp jasmin philosophically singularities blavatsky dannebrog zebras tanjong mostafa talavera bossy pandya intensities kinston hirschfeld georgette fano rfk dany kardashian urbe kiernan swarms bse masashi steppenwolf ayu bruford sop historie tua aeon generale ransacked avicenna soule hollies kido tunings raynor jagdgeschwader hammerheads dogwood candela centerline counterattacks geun amok tweety refuel plante universalism infantrymen regrettably pharyngeal maan moguls cortisol surmised worshipers strobe timings antiviral misinterpreting ephemera bideford plausibly energetically interlaced sealand unsold keung imprinted lucena comforting jeolla petaling bookmark gks banc notte filer arman laminar eck fixer cutscenes phrygian tero juku goettingen ied callas batak elfman mastodon rnli tikal translink machinima lugar basile warping renewables logics wyo gaither frideric langlois nsu unattributed cashew bagged biltmore wotton corgan paucity não moyne excels daya saarinen benitez sheesh metheny alternator hurriedly madhav wormwood sieg tryptophan rien jaap darken naas gans frawley gertrud hywel 
mohs pillows cranks obeying haruna forsberg deauville grogan berra appendices shuai troopship unitarians exmouth manoel steinman seki dislodge takht nandini haphazard rogelio mizuho tabula grosser rubric motility salaried macht hindemith metropolitano hammock hanukkah geert vermin wsi scherer imparting ident functionary memorialized sissy monotonous arouse mendelsohn jager wilhelmine reigate uo loner kalle aor adiabatic javan teaneck whangarei ciao mdt mallett gash harpist jessop materia ies rza classy xls complementing fatboy seasoning crackpot metabolized implicate detach bungie soli absences swf bewildered imposter moiety ricoh krupa complicates steelhead andalucía dnc miya lamenting kapil tcs adipose limo repeaters blistering mobutu haughton kalahari abeyance iskra lionheart humphry philibert sessile rilke dialectal toddlers buono motile conlon jeffers idealist preah crabb putters wiggles ashtabula aeschylus conger morozov grigor benneteau supplementing bongos notepad rothbard phoenicians omicron gbs endearing tir signa mantras blip malton arnott dhl timmons waza thirtieth shoddy ultimo repelling tarek barda takuya bouillon tanah psr hyperlinks feldspar chilliwack sapper edsa admirably weigel ambience rebuke reprinting fincher collegiately bissell denzel kiddie accumulator meeks villard francesc zwickau aggressiveness wylde inroads speculators bahawalpur admiring heinous aquifers zong hersh petticoat gales rossa caxton intransitive knowlton chested sadhu rooftops perturbations serre refrigerators seca saracen offends booze marnie mendenhall zverev burghers floodlit janeway acceptability sharman rearranging domus spout zs arbitron arbitrage tahitian arianna nena koninklijke bulwark audiencia deduces skylab lorelei ironclads promulgation polyps moskva matta madhouse knowledgable mccreary scudder contrabass sporty suckers medio amide betula gault synthesizing piz ledges follicles argentines marcelino vibrate anuradha chicane gulshan berat scm umbrellas prithvi 
voids celery softcover stent reconstructive waxman thais wartenberg deschamps shimada unidos lexisnexis aza maggio joes evangelicalism kirin habs christer latimes coxswain insurmountable intracranial krantz kreuznach eyelids fords tusks alnwick cdi kingsford pheasants ripples gyms bareilly pssa principled artful hsinchu mckim zina porpoise notated gerhardt griff nathanael larceny oceana meijer gaeltacht soldering mvs besson orrin outnumber irvington nepotism veit aud debby wilno jonestown scioto bloodless shana yano weaves katya shusha westlife mickiewicz zeroes instilled angevin flemington mccloskey deum provokes compacted kalinin tippecanoe docket neptunes bleeker trembling remaster lineal distilling minced poniatowski dmx andreu privatised guingamp skaggs creams dervish underserved judicious asghar liliana serialised tyrolean sengupta kalpana bure chol supersede warrnambool labors daltrey bade beckford timaru pruitt farmlands fogerty tantalum yek extrinsic modulate seafloor emmaus corneille neretva hassett roughnecks improvising ionescu rimes tiflis velma tomboy voyageurs cleans milepost gilmer searcy noriko omagh ♣ armes encephalopathy hookers disillusionment ratiwatana bord muscogee slumped wyn eia ingmar leathery equities landline camaraderie boehm paralysed stratus sociales sabri subsidised enríquez monotone zala poi zhengzhou truex riki unsound charlottenburg knightley mha olcott shel thatch comers unchained disapproves cbo pisces habermas spengler hieroglyphic constructivism gump daniell forelimbs crh reintroduce ghq collette exteriors mds vegetarians centralization singaporeans photogenic momento hayter matrimonial baroni micheal baur clog alighieri summarises dbs riverton lebedev encroaching rescind transcribe clarita burslem baraka borrowings taupo fairweather taxicab urchins olli lemaire atman nightshade roadhouse dezful backus ü ooo ambon delfino attesting morphs lottie masur mirpur cutbacks unreliability sappho deathmatch quimby quartier succumb 
maritima catalyze shoved awd corkscrew ensenada rancid burge destino frobenius tenements stunted stari morpheme filial gracefully subsonic seuil megami riverbed hargrove propagandist gabala ivf roden barrens antitank scheldt ingres eusebio snodgrass alemannia mna danton suiza wilding sugarloaf waals shoreditch byline deceleration navarrese courbet chimp subjectively stoop diplo mccomb zabrze dancin bateson magnifying palestrina unsuited felis dreamy pavarotti timeshift pikachu boatswain glick ramble ogun immer flore petronas carrey hugging popa bellow wofford trinitarian spink violators reloading sandalwood deshpande warder edsel vestibular antares bernabéu westover wanamaker mers baptisms laxman moodie oued prithviraj canarian allosteric cundinamarca haughey maybach chesham radhika smalltalk eddington abubakar ratan sqm coutinho grok boughton bunnies slashing whitelist eccleston shears savio bayamón aryeh maréchal hatta egyptologists houser vamp alyson colburn menopause vorbis malé otaku lexis samajwadi payson beatle gauche kanon infill besiege flèche parco nau nonprofits kenwood banaras logano fisa agnostics dispatcher receptacle carnal wunderlich afzal tenorio nouri cuddy smalls kapur lgbtq okhotsk deshmukh civitas arborea screwdriver tazewell insecticides engendered brassey headlight cuffs shonen fodor minigames fairway titania horrocks greenstone equidistant alchemical npcs molineux calatrava louisbourg fairlie aircrews cullum asen fal udon symmetrically whitewashing keenly tsc shep niña egos helder bandwagon icrc refreshment laut pelle zilla howlett ills hemi bloomingdale rti jeweler muddled binns mckellar strayed nalbandian krefeld somber frosinone thicke mondale chabahar airtel jsa lapentti instigator halmstad erebus pooled eason fmri marchers cienfuegos cowl validating hingham tsukuba seabrook reappearance piezoelectric fleischmann bidwell annapurna mahi greenspan lathrop volusia fawr nmda hodgkinson criticises thalamus lynyrd sensationalist bodie leonel 
volition korolev behr foyt constipation tallying briefings scepter exaggerating lupton tojo cep brightman clickable venting kyaw playmaker staked hazing talal yoshio scheduler trapani snatched devoured gobi llewelyn ramadi galore pastoralists revues calliope angelis backfire stonemason abate adic midterm aldrin gaultier landa brera pgm michelson hjk plesetsk berthed keypad mazur bluebell fgm hamburgers newness crohn grueling tayyip boac nahin dagbladet poznan geranium punjabis minelayer sheeting vehement ahoy bds eckhart throckmorton pétain pheromone eishockey memel strelitz bülent histology coincidences skagit ischemia softening laye janes refreshments murshidabad sedative dismissals albarn kamara kinsmen sociale rpt ibo kinect scotti breakage shortstops mooted katia vivendi châteauroux avram humiliate arvid urania intricately conceição chur cardiologist toner naughton dkk guida surges fujimoto kingsport cellists barnwell egrets posit disappoint pianoforte counsels eyelid keeley grudges baumgartner mainpage nación beleive toh stings oxy zenica yemenite bullen farhan windsurfing espy lado aquileia zulia sumptuous revolting alu degrades automaker despatch craton fabiano duhamel dusted felling coot molesworth massillon activator doorman limón receded tunica thickets formalities leaguers tuileries terceira topple trenchard eustis orchestrations loewe hoppe culling piedmontese hazelwood aspergillus hsi scap younis vara venn bacteriology aftab alumina arima castration rajkot drexler sheerness wogan cantal denials strachey paterno zaki bomba fitzsimmons winsor scurvy drawers tomoko cheques rheinland undeclared yucatan interagency oma maire adaptor breathed lyndhurst embittered tomkins homogeneity hummer welby kampuchea pairwise murrayfield sorcerers ondo helsing natsume henrico deterred sledgehammer asr kotoko romantics brainerd marqués beaverton kamran teja airmail junko otello hagia gimeno ojeda rowena taal bhagwan newgate cranfield horvath globus immunodeficiency 
beets tsung largemouth sabercats hams bifurcation restatement sawai recuperate fulk appalachians backwater yassin unreasonably mab permanence erikson mireille captivated elkhorn moctezuma albini trung overstated congestive sibyl blackness anzio wwc stevan trolleys applet struve donning breguet downtime gpp solemnly assemblages varietal outliers wvu montefiore bonney calum blackmails poirier lawman christening traditionalists stumped bookkeeper kesteven unisex indulged dictatorships ramming lighthearted morpeth divina sustains norodom sarath mib auteur phg drg viaducts neurobiology kitsap abdulrahman rulebook poppies rapier chairing funchal colson sekai goole castletown scalability alga matrilineal altruistic seles bhavani spoofing precept swingin impervious technicalities codon oksana coulee zakir wasnt allerton momma disposals busting marshalls fidesz stencil aforesaid menuhin oporto aphids specializations ezio capriccio manger ehc taut pietersen oshima origination wedges persisting industrially rosecrans carreño bilbo coria extirpated centrality depositing gyu outstretched jarre brats catamaran canisius superdome desktops artvin spiced grooved wheelers beto anamorphic utterances grindcore fortify whitelaw fayard respectability shawl wru hallways rollout urbanisation kidnapper theodoros bul vickie treasured lansdale posten thomond mashonaland abernethy centralia milhaud ammar overpower hadfield extrapolation bruising homebrew greenlee kladno ionosphere gastronomy crofton lindau crewmembers burwood coverdale wingman deplorable fluxus bermondsey chunichi tsukasa snark litex libreville jeweller immediacy stoddart vesicle abernathy hannon amparo gatling paediatric werden bole nepomuk lascelles haar vod kaul maurier brickell ouster zsolt hilversum millstone stabilised facies vanquished grambling pleural scruggs fulfilment charpentier barba terran niobium cutaway squier municipally chumash phytoplankton soas autobiographers quackery confided kabylie bronchitis lipsky 
gah mediates caracol socratic subalpine lorenzi kunal pré cantabrian hedmark moribund giveaway loggers witton convening moffatt hoarding adda grills nisha subways presumes théophile storer spyder gatos montañés manorial cyclopedia technion kilmore blanton porphyry cédric berenice narciso shipman subservient complutense homozygous mccrea carvajal mcalister reynaldo babs tonawanda solway pettersson gorski pierluigi moorings burne unlocks philemon ludwigsburg meads professorial tabu berisha maintainer agata grisham conakry albay incestuous sprinkler charade nellore unmanageable springing franklyn weiland sunnah antiwar valérie damodar tulsi faithless klara grasped weeknight reusing obligate brower bemidji fiorentino damiano counterbalance turco thuringian martinsburg impeach hotham michaël hennig secretarial peterhead whittingham slava aami delaunay broun eagerness macklin kadir chappelle parachuting canción rivington provable ashmore rincon subarctic stoning petits phishing wyandotte maurya metrology lindholm monomers michaud etymologically dissociative junaid culminate gottschalk wataru angelika tacked neurosurgeon okeechobee alkaloid corio wavelet titling iom lumia wheelchairs durations marblehead seahorse asim cyclotron declination shrinkage footwork rtd mclellan piatra aguayo caving annihilate assis grammys marketer boxcar phat whistleblowers mcgann feinberg lochs olivares tolerances mowat sulayman pressuring stena beery wheelwright teplice gini voa testes sportscenter murata prostrate rukh brittain perryville ringgold cubana toit silvana boyfriends henshaw transducer milla ags frighten hondt airlifted yim wholesalers suncorp carreras caretakers kawai ethiopians sauropod channeling raat haldeman tudo ruthlessly bynum rewa swainson intensifies frere ungrammatical deleuze macdonnell sniping geometrically flowery eloy yoshihiro panagiotis squeezing bootstrap behrens gantry bakewell allo boldness pared hoary itm downplayed bling querrey ravindra cohan sayaka pistoia 
concertante erling parkside ugg crabbe pooling sleazy uploaders gyatso halpern zetas rooks biol cognates keisuke middelburg apus saurashtra personages toshi naser cousteau veers stieglitz bunt dower glassware silhouettes wobble franchised nlrb hiragana sitio moab composting norges replenished overruns absurdly asus northgate letts drugstore refectory rimouski nen narmada tuc realtor tupou assr saver nozzles nca yisroel taub gallegos characterise josefina appeasement infamy apostate burges ciencias bamber knowsley abominable maryville maltin hieroglyph ammon nagging mehmood izu indigent lucent disowned dampers seamstress weisman marshmallow ovulation clump arana breve pms kalman ozaki downy benedetti aurelian craigie bushfire federative budokan bumpy ballantyne ino maasai ncs dhar millbrook zonguldak luch fertilisation dobrich lemoine schoharie mutter gymnastic holiest molla meddling sugababes doak alk epc matra resistive rasul keren sleuth meldrum gtr interceptors intrastate braşov contes streptococcus harmattan fooling skydiving crotone faustino uspto giraldo koro naveen materialistic smothers picayune eca ris vestal tsunamis sharps gamla landmass cortinarius silkeborg robinsons shipboard dunkerque baen wildest atul blackford dfw monotheistic subjecting ecclesiastic headphone polystyrene moyen rhp pierpont gatti variegated instinctive viceroys cashmere squamous networth ignazio brumbies mildura expropriated amersfoort alor undercard daz szabo baudouin farmhouses caviar ghraib whitewashed antipathy computes delray reassured seppi marini bragança bingley duchovny ginga jaro seale gristmill gleb qué apprehensive anaesthesia shinichi tuva jerky veliki kampen marionette motivates stelae schwarzenberg koller hok paging navigated duplicative effecting buttercup harrah ltu milled sorta bundling concomitant sprinting imola gazebo suzi colloquium revell wiesel coetzer ghee keyed varun fizz pacify ¹ iib bottlenose gastón pyrotechnics matsuo culprits russula kho msf 
multicellular wagtail hiromi ribosome cerrado keirin eibar mcnaughton hier corset folktales scs strafing pfg clitoris quilts jukka karp militancy freikorps ayyubid effeminate dávila mcgarry vino nouméa indulging prolog laon akari tomar trachea baskerville marciano airframes ploughing southfield circuses nyj bunn nott shattuck abia tillamook cancún carmelites chandan stortford gcses ruble zidane furies türkiye rebus addy caledon necklaces sinéad vaca evanescence mountings glanville ifs shinoda agen fier bessel siddhartha shortness gobble americanus suda antoninus oneonta bolden surcharge pbk foxtrot harv rothesay acheson tupi coverts bayerische arpa globetrotters mayonnaise isha reforestation tasker thrusting salvator schleicher icj mukti idents platformer exaggerate vena cobbler hemmings wargame vasyl unconsciously sixpence marinas violist anathema leni mages koren pawan ament grendel tamarind pyne homesick dropbox skippers giménez transitory nambiar underwing paces pinafore undescribed courtois fibreglass gbp pox silverton pickers werribee reise indios ase trixie banshees sfa albertine embarkation antibacterial rafsanjan matanzas smedley pinheiro benno brak hypocrite bosons basilicata haeckel sanctorum matin anabaptist coupler muhajir bogdanov indepth hesiod deceiving capp nadar plana ius trimester balu drowns hdl ojo horthy heeled aber atlases sammarinese spyro arquette michaelis delmar islamia colitis antillean toomey jarrow nazrul infringements onerous tecmo burris knitted bly zila aco arriba weintraub hooray differentials ogc fightin scharnhorst cawley abutments opioids nci mayumi victimization fireflies freeform jerks encloses ashwin veranda gakuin shenanigans welk rhyl dethroned reburied wad yonder saks seditious bridlington neglects rogério mccluskey rebutted clavinet nihilism nagata kwa bertin hooking subjugation grigori cbm pagodas cayo corliss bma cabana brooch nonverbal cobblestone prerequisites beveren danby irises omnivorous rian carnivore summarising 
amputee nalchik revolución palatable cti scrope damming tolima ketch agung rader voetbal eclipsing monck tandon rowntree bootle luba winemaker polisario boaz ral quatermass expropriation slanderous tubbs naoko neuf corsairs pinker medulla oblige deviates geldof gilad chalon amorous incubus ferruginous dede whisperer lovable buss culp snowdonia assiniboine raimi dreamt crushes tuners empiricism germinal faustus manorama seedling andorran banka paulin creeds hannity countesses plurals ouch pronouncements fides pcbs barreled walling crafty allstars grâce citrate repo seaworld intermezzo shrinks videography telecaster flapping secularization hela bratton ntfs delius camry iver jeonju vanda enschede fairtrade dads bambino blackbeard laoghaire edel sbb gotcha rereleased vollmer microcontroller felixstowe speight yeni billingsley bigoted bettered cliques faulted marikina piedra thereto reeling botticelli apologists chutes honeysuckle distancing hobbyist lombok goshawk cmp escalates vignette gandy maersk racking stubbornly trampled aurel goldsboro hargrave minorca ravenswood koen bielsk syr melanin pou benzodiazepine dukakis industrious francoist augustana bustard yolande paramilitaries timm kjv tokio brin featurette gouda romaine pah lumberjack uan mauve iquique subramaniam gte breslin faulting dota lage rallycross fahad mesnil ustinov movimiento muang coverings yushchenko hashing burford molokai coster nacionalista zalman hast homan eggplant kati beall reserving pineville ratnam distillers zanesville heenan guilders ohv minimization lusk maliciously grocers hercule asic hermanos asch shinty pelton schwyz horny stimson michels overlords equalling lugosi hoshino ekiti vantaa kanaan unthinkable unanimity avowed sniff bache pitkin venetia unt nbr candide koga vier waders ried cavalrymen attests kubo earring minecraft nanette hearty foresaw kirsch azadegan dispersing patrilineal constanţa virulence limpets nickerson deckers ullyett lofts massing laryngeal thiers baños 
sequitur beaded tove gondor gyeongju maniacs ritualistic tenochtitlan mago paleobiology tallaght spillane pleases howdy alm jayaram involuntarily starfighter dwarfism exasperated leoni machined buckles wester carrefour dolmen waltzes distressing cooker magellanic prashant caboose greyfriars andra moorcock empresa uffizi coriolis barbier magik aral valentinian typographic asano tobey tuco sethi iambic severino boosts lesh ruckman nettle pursuers interlinked hopman shvedova panelling lashkar buns cryer trawling boletus abacus lapis fba maio brim amiss belushi landmine rosslyn newhouse suave underlie shelbyville longley lancastrian yaya psychopath platense infiltrates rheims geopolitics hob majumdar superconductivity mignon floatplane neeson béziers deflections quint dispenser puna mussorgsky adachi stansfield khor hickok dumplings vinh loyd eyebrow bombardments gib vinatieri inclinations tosa bushman sastri mariinsky clowes eris genealogist blackouts oxymoron bested mushtaq burl shruti ilm everson samford wl capillaries pyrrhus vaulter bullfighting smoot alexandrov corrupting heterogeneity cbf grazia saco ilia somoza mythbusters restate gruff birr cumann honeycombs sica hughie wegener heartbreaking beanie glutathione rhoads bogor catalana mahinda kristi armadale vig incas kristensen geckos thicket reformists uninvited novosti wich baldy fullness reinvented stradivarius reinventing peart birkenfeld independencia hooligan asami wimpy athlon baghdadi showa michiko ens violets unrecorded orly sudhir campana egyptology gunnarsson lubavitch sive kanata preload epr gorakhpur ningxia sandnes catamarca subatomic kowalczyk mothership syndicates clearinghouse embryology faithfull qaida neoplasms pirated sportif yummy waitakere erbil relaying motorboat allama tmz jago rut chacón abid hypersensitivity spg speculum mikkel vilma multifaceted ipsum jeanie sylvania malfunctions baldur excavate endzone sadhana edgware sugden sliders diable dalles nahyan lauro saskia coloma bartram 
solna iim lengua merion zlatko novaya ingraham goiânia jeezy mamba tdc purpurea piggott prescribes geetha nostrand viseu cardwell mckeon kink lacombe révolution ameer shinobu steubenville gulfport hiroko prim amalfi transgenic abduct wroclaw ericson culpa bukowski starks indefatigable confraternity pynchon bedchamber heligoland thibault lilla demobilization vacationing paddles paraffin colloidal wightman sandiego articular pondering arshad bonheur striated themself bobbi desam subcommittees hol hartwig multitasking asme babington dmitriy abramson cochlear rankine marthe reinvestment burlingame shonan khoury copyleft keyserling masao buin tallulah mizo brogan nerdy bek mountainside discerning thickly eurydice opining samaritans palaeolithic friel matsuri discerned violeta underrated utama gru cucumbers ojibwa ebbw reaffirming culvert dancefloor recruiters fukuyama citric whine tye ragnarok courtauld villalobos kelleher deflation condorcet besa florio aho misogyny absorber mauretania demilitarized goro skirting balling zúñiga finkel paulinho cws buffon bloomer demir pavle grigore striata lop itza holyfield corriere rossiter hialeah predicates dumbledore ambivalence masaaki bellinzona locket ibex princesse khin wronged intendant eldar couplings iterated relatedness ulrike geylang familie phylloscopus populating cvo unione philbin curd kurtzman gurdjieff manipuri profusely pearlman hohenstaufen tripathi fem starlet eko sawa sze tasso shephard prensa ngai kitsch hsing janette picnics protour pamir rijn skene quand cattaraugus bez insanely reinvent verney yearling mamadou brawn ferruccio daugavpils dupuy upmarket lehr sires uruk tuolumne progreso dury fellowes hadron bremner mraz wesson swirl ligeti schulte hyogo secondhand concedes unsupervised farren kmart headstrong accipiter iac sampaio crystallized altay endeavoured cuckoos ket brome artigas commissars aragua nourishment longworth hos audiobooks hébert nieuw woodworth incomparable batumi graciously kutuzov thermally 
corr ichigo scanlan brickyard tamper decomposing prokaryotes bim calabar smtp mosher rebrand wenlock ishtar ilex tightrope pki glancing friedrichshafen aphorisms mellencamp chaplaincy kyrgyzstani rallye ruger myriam hopf cyborgs teesside skien nellis umd talkback mixtec giggs horten raonic tachycardia faqs savitri fyfe erectile avar skeeter tonya titi tomi murry cuticle ransome diospyros dtm sasso qed mattek kurz atheneum augie britta bathed buñuel schwarzschild tubb redwoods pforzheim burnette jools buzzing jpmorgan otra edwardsville ratner gioia incontinence schatz mccutcheon bangles foretold partenkirchen sanada wondrous ripken rosebery ahab frontrunner eugénie anointing novus wuthering folger diggs blinking kagyu hod ttl copycat gershon chars wilf gyllenhaal cavallo sockers hawkwind chaperone clitheroe icl harrelson reflux cleaved irate feisty crumble retd headstones tengo gce mcnamee samut krs hommage porosity chiyoda inalienable sarge calaveras keokuk bahama townley fap misrepresents matilde lhc augustan sueño sunburst robotech bre roseland bagram haggerty shizuka alireza bridgnorth coauthor juventude interuniversity fergal havering jbl licenced janesville lene giraffes thine teater gung colwyn maior jis tisza feld autonomously stichting enamored friendliness lani panning howler sisko rosset maxilla charmaine renshaw haddington dua brunet cond lydon renominating wor elina apolitical tapings bester jaye haplogroups arkwright filmworks aviary rattus dhillon uar antena khawaja wailing rahal georgiev durkin svn sarin carell strasser bande kumara maitreya kalat fairer delves ricks tooting zoltan dicky conformance enna nala wls godot isherwood pettis wissenschaft incompleteness merah fath blackwall meagre figuratively unbecoming fannin affordability sailplanes vidin mmx proportionately howards cdk etoile proline exons diphtheria intricacies disguising spook concha kuma fonte eastlake junkies conformist weems scavenging shuffling liana belive participations inbred 
subheading tauris owensboro romanos deepa howden cityscape wcbs bloat latins obadiah leticia nigam lika pawar officier plums pontypool benchmarking redcar raff oryx melendez krill defector taunted illingworth unpopularity silveira pageid freiberg milltown stonehouse rebelling radix arma monotheism humus bishoprics appraised haters mcguinn executors kevlar denby pampas kurtis shania gurudwara shimmy septimius janvier dnr uplifted lundberg braniff keg lecco havant spindles mittal quadrangular scribble aschaffenburg kraken smes transparently dalziel boc microscopes mamet kluge phenyl motorcycling poked deschutes reale pierrepont merckx elihu mcfly bodega daviess meyerbeer greuther vyborg bellucci carers ashbourne enactments luan garratt capistrano fibt strummer instalment austerlitz fantail seagoing araujo galilean chay fragility bazooka eazy perceptive bts knockdown ouro bursaspor smetana cleburne doktor squatter houseguest refounded aligns dinos compressing bhumibol klaasen dally masri pilipino ecs legg dour cephalopod nisa sanya gua daa toshiko rudge windfall burundian locusts piscataway naperville nombre hyaline siebert snopes dft hayato campanile ghostface coastguard vanden datsun › apologises faltered mclain specialities clemons bjarne mazhar verandahs charlevoix ramachandra resonances cromarty selatan neurodegenerative krishnamurti frostbite albus shirazi northerners chaz privates bedingfield sorkin canceling currier odors fairleigh lokeren käthe ambrosius librarianship jawad dalida normality citizendium strangle menswear gaslight racists physiotherapist aéronautique bohm joh nostradamus testimonials radisson cherno saros caproni slims eurostar carib lida phitsanulok bogan cheever minangkabau wysiwyg dnepr tights eretz slates deventer cardenas vani roni lovat uninsured bandages tks jct ccn kis ferré rte sandi bru wyk juni lecter forst undergrowth decomposes veered arup pasternak pfister aml vcr quipped ghalib monteith lewd switzer normanton huawei comically ssg 
cilla errand psm wedderburn uanl intelligently minion universo zest nri groza torbay bolognese turek imd trophic colquhoun hedgehogs borer ballou joffre galante kamo glorify rinse crist troubadours copyeditor kaw geoscience lustre dentition forex tote beja scorecards amano bribing foreshadowing heme septimus flamethrower eif montaigne oxon berner benavides despises acrobatics zob shorta socializing sleaford turmeric ferri fistula sify moonstone ludmila petru misrepresentations minarets mims topsy asistencia tve multiplexing sharan rizvi medico approximating komsomol orienteers gremlin denunciation previn chapbook todor camberley unitas ccg subtribe relinquishing pankhurst palumbo uriel fou exacerbate diouf chabot divan kishan stenhousemuir hanafi condoleezza yangzhou havent gade tosu orifice instinctively beninese homewood scottie yardbirds structuralism tempos gunston starwood ghibli composure taser hannigan galland rothenberg cronkite scrappy mordaunt vijayan rabbah rejuvenation rwd guernica muzaffar bharata rubenstein pce adarsh wafers katrin gametes vaillant roku adac weatherford amf dunkeld moyes chiefdom trilobite hofer abitibi malformations bromberg fiddling emplacement nal finlayson nephi rikki clydesdale sulcus goma gudrun panhellenic albertus westphalian kolding torii abrasion ingushetia naresh lengthen outgrown ersatz montcalm bogle waterboarding alka mattei echelons chivalric adhesives dabney menorah pyrite maccoll oostende counterweight moderating corzine tali lavalle philosophic deformities organelles bioshock transceiver granddaughters milorad hinde roubaix expulsions hiratsuka unapproved belatedly hepworth melanesian northam tamayo spaceships novela divider gpus lignite matchmaker crept sth cornel chiron powderfinger macedo mangal meisner janse ambler gunslinger jónsson aguiar tik womanizer hata prototypical pernicious ryanair hurtful overused gallego appreciates sindhu sopra malini realtors pelota sinica taffy saipa conditioner gneiss franchising 
freaking dauntless dissension vax nipples lazer seh julieta unni elin cicada duca billet screwing centrifuge upheavals dcu davila indicus bubonic bumbling jenkinson bolingbroke wigner chandeliers gaumont fanclub linker marca agee cycliste cytotoxic kannan politic axillary misra prue nimble pulsating perfecting cont mulla bombarding monegasque fantagraphics lynched opine marbury zhuhai shiba northwesterly visage landmines oost gravesite secs marisol changeling tine begley pirata eponym targ gigas moller prides skiff constrain hinders isan lombardia gawker yle androgynous roldán melanogaster akai dmytro natsu mohegan kennard guinevere spontaneity coffeehouse weinberger authoritarianism opulent othman perrot reconstructionist fusarium premièred reredos salvaging thoroughbreds etv minutiae ergonomics pinion sprocket falsehoods hbf adequacy receding potenza socialize crawled slaven troughs parham millen nowell chemie thutmose fogo aad deol frobisher debugger mcnab amadou roby tule ayaka pinchot tachi attestation farthing manipulator magruder voight saulnier escudero advertises wreak beano schwarzkopf thomasville wayans quakes nikolaj adoptees sadi nris succumbs papillon brainstorming blabbermouth maccabees fanfiction gasses lesbos alfreton vapors yokoyama ramsden herbicide primers smokes tollway prasanna chenango uncooperative viciously footnoted fsf peduncle moreschi neuen fms doty fabricating truffaut sagrada jeeps smithy mourinho drapery bq carnes khoi myst rete africain bankura unsecured waynesboro wsc pitta guha methinks redundancies ditched epworth maron façades baffling fsn marquesas lingam bilaspur biochemists piel mcdonagh bunge trims seiko perestroika folie parle poona chloroform neuquén serviceman npo hristo brookwood jule ladoga maithili erb viticultural retrofit moisés pedimented shwe grice lnb myeloma solenoid basilio ferocity answerable phyla unveils artic wither udr doble yossi akc batson asses stipulations polices summerhill praga reggiana sheela biggar 
interdependent ajith penarth molested restorer sistine paulding playbill klug gowda chepstow belen bcd propellants stowed marginata alesi rafaela messner marketable internationalization andronicus wyre telegrams fluctuate assunta purnell kpmg renzi görlitz ofi bathgate detonating rattan leyla gebhard frusciante bley ocs litton unfulfilled traynor appreciating prosthesis cardamom legionary boomtown openers foreshadowed massimiliano purebred trivedi sandrine balikpapan oruro blocs annexing blasphemous stomping annexes allyson sager lysander basements wrasse hirano immigrate ilkeston ksenia yul countrywide letras mende breitbart lipman ) sheehy pugin cybersecurity ursuline bolin khe viki bushland patroness codrington irenaeus deejay lettre bharatpur pecking aftershock reka stoneman referent neufeld yolo akins slamming regionalism inadequacy saldanha counterexample pentland polskie audie morgenthau matsushita agrippina llyn almada cerritos jaded acu newburyport dez sohail photojournalism musser stabilisation petioles tng forgave ssm heuristics backend diatribe charteris cooch follett bituminous librairie threshing stromberg victimized jib cruikshank patroller klinger bluewings seacrest cim pembina lockerbie attainable bfc alentejo wip khanty bss arjona southwick evian untrustworthy redhill créteil hazleton ilyas mcbain anca basilan rorschach ventriloquist peschke dampier masayuki ilves willcox brickworks irreconcilable reflectors supposing eliade amass dickenson reversals hanbury infielders bhatia vicomte arabesque kosi docudrama clr rappaport leggett appendicitis jardim danson interactivity concocted mtu arda phenolic violator shrestha mtc bier vannes phylogenetics lossy alagoas cmi shinde cultivators takada lalonde wynton tagus mccrae mccourt fulford hardiness toasted masada coenzyme pessimism ponca mors gravelly civile czesław matej polities dropkick elucidate ecommerce epistemic humberside sallie goulet imai croc lumbini panini pervert eglise jurek palomino 
aggrieved goodfellow nae demesne sift dubh trichy lanzhou lordships magritte gigantea chaumont reay truscott trapezoidal newsmagazine remoteness rabe wynter identifications seminarians macaroni frills mayen awan elaborates medeiros shahrak innovate huss kutaisi handbag vrt milena dothan concertino munger mstislav momentous sitara berk silliman hartnett sweethearts photometric youngs waiters actuator dandridge probst scavengers restated oldman menschen penetrates chertsey tsim reminiscence acha harrisonburg cardiomyopathy franciscus tipo justicia appendage ides picnicking spinelli roofline bullitt faulk barnstormers loder abwehr pattinson shinawatra verband skylar reforma pompeo annesley emg montreuil gallaudet loka spinosa redistribute maturin ketones hairstyles charleville presser kissimmee akio gz bartel bulkheads grandparent giotto desertification guested fouad campanella rupa cementing smoothbore questia romy nirmala excerpted presbyterianism gigolo hutson ratnagiri jip moderato inane patronizing buie wetherby saxo accentuated jur sonne yoder xperia expend penticton disembodied astroturf colonised ghaziabad deactivation táchira talmadge pterosaurs tokai mcluhan troilus childe baghdatis raucous minardi weathers loaders hellcat luxuries pontchartrain landskrona fea dunaway lamarr leaching himmel cribb dribbling etruscans cabriolet canna potemkin kranj kilowatts civilisations drina fitter phobias poussin wetzel foundling scull resisters ossie shrublands imparts sharpshooter sumitomo barbe nightcrawler xxix griffey broglie valenti brigantine injects ypg mec thaler vernet unlisted arcos follicle donaghy mannerheim osmotic tenney fragrances jayanti sankey mdr peppered ostrovsky nowitzki tlk sandor anhydrous mansa koma consigned specialisation pyre synapsids columnar trigg lefevre exploitative smokeless reprocessing plantarum mire comercial forestall autographed sisi ellipses domineering toots winemakers jalalabad speechless almere gass foie usta geotechnical prob erg 
saltillo alister zaza horváth nakata aberrant kuch yasuhiro deceptively pvp showgrounds lacoste nape cormorants godman qassam sunita oden castlemaine dhamma ardea deport selfishness politicized sivas jacobo pigmented maplewood talleres askari endoplasmic constriction helmed sez eroding carty lumumba shahar bookmakers nitride muon gerwen proximate lakefront zofia matsu kilt substantively uy valuing zaid maddison gabs hankyu maharani makkah fanshawe rocketry trumped uppland jassim sasquatch volos wooten koirala hearse vasiliev smp stannard poroshenko bpa schuler nobilis coatbridge rovere quidditch educación terrors aintree tutankhamun seligman magyars cipriano discrediting stich duterte leapt paycheck conjure iversen canter orla amenity crosbie humpty somalis worshiping natur psoriasis rangpur chunky dysart inigo heliopolis sunder castlereagh mustaine kahan ramla olivine virtua shinya asta falaise reformatory bozo vachon dlp gibraltarian vrs morpheus tirade bonaventura sativa dione joël legco satyagraha hakodate guppy docile antwerpen churn papier baugh uomo overwrite assistive etf bungee phosphates amplitudes optimist langhorne lippi solapur oracles bij cudi kopp gehry janner grandmothers cataracts sergeyevich maslin pesky popham ambroise notches soman philpott confiscate palliser thereon beautification derg antler tagger lally munition kenrick ohno automakers cerevisiae amami verbiage schechter emancipated clanton townsquare voles wct nicodemus farrer unconsciousness cashman salonika shambhala maruti grandfathers tarnish lactate ohms haridwar kass dalits poachers esher lbc bentonville potions panelled lunn prioritized qadri polarizing finno bonito digi emirs lluís microgravity recuperating qh dorking kudryavtseva alekhine loca patric fpo boxset nothingness wyler thabo wholesaler coerce manchus toot matsuura dragnet capaldi singin coghlan sinkhole boban hoang margret bohlen tampered zita zhongshu shubert lamm televoting fibula chills utilitarianism holderness 
sheringham ethnological leominster contagion mastiff savannas lauer linke stocky desirability furrow luminance westin faints fabergé trulli grazer protectorates kur mitral hydrated pudong sheri upendra hallucinogenic rosetti loosen stoppard naito commandeered cablevision culloden unchanging metis squirt abney countenance portobello stebbins damsel mammary kerremans riveted isao marit renan gere barrientos plath hollowed omniscient maltby patronized xinglong ruthenia militaristic rachid aizu beckenham uncompleted taint pentathlete leakey takahiro findley pretensions oki mcginn tancred herts sharad igloo spits gtg madama campanian miquel palatial motorcyclist crumbled tethys dragomir mantel indias malling midtjylland fdic limehouse shilpa macinnis pastels alen marooned elysium sakharov traian ricordi reformat wsa whirl malaga nicolò judean approachable daoud cadaver catapulted qumran soothing gahan germano papandreou whitford thrall lanchester preservative tryouts pecan simian pantheism molise avispa imbalances redline overlordship blackett cpe llandudno ojos kebab hadhramaut foz overhang oglala whoopi freising painless photonics trinamool incapacity elwes finalize loudest compressive pani bocelli bcci fabricate putty conservatively sheaves numerator umts zambrano secunda poggio geochemistry logotype earthy frameless ezequiel pinal solheim aniston conversing bartsch hillbillies matchups bellary carmody fazal rudman anthers scotrail pascagoula nep hakoah fraunhofer wert pandemonium aikman sachsenhausen apeldoorn schwa nordstrom hydrodynamic clutching gunshots kock afire kosta avanti drenched odile schreiner quadrature nyssa anabel moultrie envious faversham girish rohr publique riaz llamas sawdust aldeburgh kandinsky willesden lemmy albanese berkowitz stairwell lsi arlo amaury undaunted embers rededicated nitpicking reeks bintang subhas cheyne comyn xenophobic orangeburg overblown auc alcazar irradiated mhp geir begg nagi zechariah adem hadassah skyler cushions 
schüttler capitaine grama extrusion conditionally tadashi jokers zooplankton psilocybin mola mcd altamira syme oat pfaff allegiances crossroad cuauhtémoc fads tourer foils pinar seaver masha prishtina boars apm pleiades trawl detonator indoctrination broch exertion understatement anticipates fdi lindell judaica duniya manifestly publicizing impresses brenneman crouching buttes cabling spurrier accrue tevfik salami karr menken empowers chard grete hauge belted abenaki fareham sámi sapp isoform sedaka mmo awad ldl sargsyan mosman rabinowitz waregem schelling newsnight combustible downsized carles pedophiles gujranwala plympton bandini turandot dougal darnley petrescu voy curative steinmetz sennett pendle cadmus vestigial mintz blacked brive chipsets limestones riveting passo glyndebourne stallings animax neuville hennessey amb bulgars rmc lacma ramey kling leatherhead rebadged gillen astrodome afridi écoles malatesta fennell vlado maguindanao illuminations thwaites amateurish poder ubon aerobatics postures hanno beeblebrox warms higginbotham dvina capers flattered typewriters rhesus ecg achim taguig norcross alleviated yah doh blossoming khatami neiman latifah blakeney overfishing logarithms glo docent crustal pik gynaecology legazpi wentz tourney duller icebreakers reactivation homogenous schönberg spacewalk episcopacy northstar petaluma tenures litters thymus straining sussman sobriquet autos sande altarpieces bluestone châtelet effigies scotiabank swoop hyperinflation rabindra bloodhound religiosity hyperspace necessitates keil cls inclusionists swipe khimik droplet kristoffer shakin nev buuren gygax sensibly proteases poltergeist zambales mcginley bruner fireproof hemming mearns communis ruinous minuet gandhara iosif lewinsky raked shouldnt hoke cervix orientalists underpinnings counterinsurgency pietermaritzburg viscountess karadžić petén jigme aqaba pennants jugend benham matthieu suki győri lances sqft battlements rq bayne izhevsk sargon cuttlefish freese pika 
ragan beamish emmen blackmailing anatomically consenting womens insubordination mildew toga ¬ wirt palladino subtext inboard subdivide pato bassline chestnuts molinari suppresses cassa regattas culpable prick slavko farcical varghese lodger pallava hemorrhagic daoist nanoscale ismet scotties chuen cbt rolla libor zeb suvs necromancer novae pataki nacl riesling merman inlay tuber ballo khazars kiha murmur festus knighton quatro eee outtake svay cynic oswaldo fidelis trappings nando pgc kop nullify pölten thanx foia unexploded bevin fraudulently vllaznia asio ludwik veils haystack sonu precipitating pomo scarp upfa glaad skonto limpet lomé gti nizami dangerfield vivre urb mailman gaudens quintin purport changeover plunger millenium italicised hopwood amulets caballeros håkan cocky nielson dimorphic proteasome zb windings rebut torr endures antonino arakawa spad grampian nachrichten pby marchese displacements virtuti piledriver tams insinuations burleson roode dedications brca wegner xfm garros masi miletus seul visionaries hooters turpentine politik ere unreported llobregat awm serling partie idolator donati storyboards inoperable sphagnum amata albertville nono ceiba tsutomu silken guccione semite sumer transponders howlin puppeteers lenticular pickets kaos ramblin heep popup poa taube grate capps scolaire simulcasts aficionados coryell sfd siddique quod shawinigan indulgent woodhead ffs golub fangio rosettes margit kerri harmonized mired hashmi resents genealogists decently diabolical lca wuxia rawal tft langan quadra beerschot zito bermúdez ulrika interpolated artforum sausalito chubb supp toil deploys buk doro kawaguchi fissile faithfulness transference potteries thorndike folios repellent disengaged lemberg stb wardle yayoi blaenau consortia deductible clontarf thar bikram población schnabel cicely pers microchip lakshman obsolescence perishable ponty nestle cbl walken tribunes bagot desiree mowgli beltran comilla yeomen parachuted prahran schola agena thump 
orchestrator animatronic rodham borthwick mne nesbit wyclef microwaves thurgau huntress womanhood bascom krka anatoliy saintly approximates samman manse infiniti beng bergson cobden amphetamines coliseo bsn macartney deprecating etruria sanda manju efe ashgabat fingertips furukawa agus pungent arcot crédit passe firmin vocalizations paar siting victorians linh besant dslr kehoe aguas aef ble candi almelo hamamatsu opc candler thameslink geodesy professes minnow bandura inositol mansi hotelier schiphol elway loh estella synthetase mechanistic supérieur recreates wako bookkeeping hosea tensei thurles takara ludwigshafen inconsistently rosser setups herders setae rammstein mdma triage sarab decryption kratos cdm recto mazarin sensuality dolor brut nobuo chakravarthy saugus lomonosov simonsen nabc chechens haupt gilly arsenio docid warrenton grantee armagnac flexor brockville behrend manda jona bengtsson anant vms shar sammlung glorification bonnier lilli donskoy nahum observatoire malheur fdl risker telegraphs fue aage navas batticaloa tractatus scientifique fermions lockett strasberg dunphy bandleaders sukhothai favoritism eic bellaire venona moluccas ett aykroyd stiffer saadi nitrite onn paralympian zephyrs ¶ tsugaru bastrop giffen boylan playford godly nilgiri bridport rnc cherokees haa inconceivable coley brdo gigante sarabande tarr priors maharajah fetuses hammam lagged boundless lagoa ghastly noelle estradiol brainstem acte farrah imperials ratliff defensemen condors olympiastadion splashed haitians kalb alanine saransk fingernails luthier tricolour birnbaum fanzines tidied cacique chairlift utilisation maz exerting ctu pti regine gediminas hosokawa mountaintop hinsdale daan snappy dingwall hoyer esi chaudhuri ubu naranjo blacksburg inaba boateng lemos anambra canmore manish daredevils gerbil pljevlja bogey mindedness deka jaworski curation américas gumi kolmogorov delong nore softbank kann sania sesto optimally printemps corelli clerc supercomputers colons 
etcetera unsympathetic nicolau tarun schumer complicit dima liste andria zant follette birdwatching provincially excitatory mauthausen dorf slashes telecoms numismatics celibate política oconee kincardine mags buen staccato donegan fretless frustrate mantilla daventry cus spool transcribing nuff fiorello leszno systema malan noronha naves kyun euphemia circumventing lilia kha talker boomed blackest kuantan fivefold jacobin simca thomason peretz harboring laz kash telefilm alpini glosses utm hecker gaur commutation braque penner individualistic musso telephoned maus lansky meniscus matthäus xk ledoux inari burnell linearity inapplicable guarda levelling bamba haploid tass indelible silverberg hayate nightwish ferment concolor ngs agia maeve diapers bluefield smelt cuzco couriers becher abkhazian renouncing sinusoidal enduro sba mariani condiments augustów wiesenthal ordinating kes rosenblatt scud feigned hermosillo oberoi postmasters shapeshifting giannina mcmichael spinola remixer stipulate bullhead oui reassuring razi apd bron gabler angiogenesis chambersburg vandross fognini clavier colter clogging camperdown anstey amusements masterplan bizkit ultramarine hemings geomagnetic ruda hipper bloomsburg knin incrementally montour seiner rudely golubev modulators harnesses dmt pinged phosphorylated piaf schweinfurt cobol ocarina maggot tatsumi affixes aspinall bernburg neuve merlo sparrowhawk hetty republik clapper palmdale puckett ceding dav alsop angering blackbirds meld jahr sombre boggy shirakawa umlaut hypo objectivism leftovers oke surakarta chal gnr dijkstra salahuddin kamaz millville dimes concierge blowers fibrillation hairspray keiichi stengel yug enlai wetmore wtcc lunacy flutist leite utilises jalpaiguri microcomputer clotting ullah koei biggie outlier perro rhubarb pocatello rata rydberg elster kripke malus specie bonar lampard dubbo mankiewicz lebrun toynbee painstaking blyton laycock véronique guerillas federals hydrate szolnok leucine vellum inflame 
norberto telekinesis gerontology umatilla carabinieri horia halland prevost meneses naver swimwear guanine hoppers orpheum fingerprinting wry vac mette bosom unsw lambast babi kym bookshelf klitschko winstanley thetis sperling myeloid donelson vitriol chewed muscovite fogel eklund superbly haugen dini cepeda resubmitted dniester dawa francais visualizing hrvatska mandelbrot styne taxiway sailboats pvda transliterations unirea denizens regt processional indenting abidin riedel mcilroy joséphine schultze peeled foreland ryoko noize cassation sawan preble alcatel sfl navin zaporizhya nexstar artemio tursunov masuda ormsby sevenfold sabotaging spoiling albertina uniontown effector ballinger coconino wielkopolska beltrami jps nikkatsu morioka mcu southbank doucet intermarried edwina browder impersonated loran etty centreville rapunzel palpable fictionalised seaports brainstorm fortin paraíso bahu soraya pressman seafront synthesiser handsworth brazier kiley filamentous badalona qigong tabby seducing terrorized blagojevich ayman gastroenterology krishan dewhurst schramm reger endemol dogfight rathore copula bellona zayd cso chelan kimiko rotem cornices truncation namor huddleston lintels mcsweeney mauri grouper matabeleland solange dakotas carelessly beleaguered immigrating madina beekeeping wisbech schorr kyra chama usan celina armavir sawn officiate expedited bumble undesired satomi incidences faustina valjean nimh ehlers sharpen longoria tovey karnak eyeballs rutan immovable harmer accusers dorr surety pariah evergrande chilena despondent klaas boothby feasting maelstrom naumann blomberg tirumala overdubbed eastland littérature selman alcantara engelmann bombus baskin cosmas yukari expunged pompeius magno theism originators fadl wachovia protege boyars mcnaught fanboy halmstads straneo berle divinely chavo opacity bod irreparable sanyo scape sabc floodplains canale pera mow excepted bic mladenovic hasidim panova hebdo lando wiese burghley wtt enz hydrogenation 
aphelion toul ulrik bronisław krishnamurthy cremonese neoliberal solana panties heyer proscenium s, shuttered camarillo downplay swizz kap bloodiest beatport makai debutante crofts admonition diagonals dwyane saverio sridhar ilhan prévost updike jyp schön gcmg rudders knorr résistance predestination samsara manfredi vlc hadid abitur nishikori glossaries counsellors santini resurrecting rapprochement ludington parsed sordid hater chrissy cressida cmdr captcha soka branagh paa monaro steinitz whidbey nuthatch jans dalarna praveen chukchi espen shrug champlin rundle fatimah aigle benedikt fuqua transpose duckling stabilise drunkard sdi lurgan cataloguing darke harkin outkast fanaticism halakha scarlatti cranium varvara mathur harnessing polonium newhall remedios likeable indisputably shankill grandi foursquare fieldstone thacker electrics whatcom elim grandfathered reiterating eisen downwind espada popolare irritable bulloch kroner naf sequoyah boop balakrishnan reshaped taw deland où agostini reshaping tannins shadowing defacto syriza reinsurance muralist kabbalistic patagonian annealing petipa carlito rothko thana ducky jaish symbolise cvt combats bourque burrito zweig michie secondarily sáenz bergstrom retour hijacker zenon gullible kravchuk sprained atherosclerosis dreyfuss yitzchak wisniewski grenadian anopheles peerless plotline jinping classe doth lawley gct reconciles logue hungama banga natick climatology rockdale kuna cogs cisterns ramin flier verónica sellout impeding luria wellingborough annemarie hittites dunblane astride mechanization wrangel skateboards blanchett myosin ravages khatib hospitable muskoka hounded mattingly souk peya mwc terrebonne wheatland pedra hth oye cve prefaced rspb djurgården columbiana balch seltzer strayhorn embodying wnbc bickford bough ryuji vad shorebirds bosse weddell cartilaginous collard modell vsc untested pillaging wilbert moana thurber ncp helsingborgs germplasm buckshot toney adenine chattopadhyay permanente wallsend 
mineralogical southeasterly mclane giannini jvc londoners noailles wahoo bergisch melodi hirth betrothal mra smeared utara prokaryotic scg tring ossian chiayi earthworms vacek agronomy bactrian castiel kilobytes cecile discloses comiskey verlaine mumps popolo darkening pascale tami groundless calmer arenabowl sewall libera markey ortho badi targa kaitlyn pluck camillus kurzweil steepest southworth ventilator ebbsfleet kosovar duped anniston parkdale retiro avoca sabino straddle loretto fissures chevrons adt psychopathology angiosperms aftershocks raskin blane sniffing chatterton rosenwald laroche centaurus evgeni embry hasselbeck aai così fcm clove shimazu shai invoice costanzo trelawny carelessness obeys ogasawara jeunes zil rathaus connally mohawks absolutism stairways wickes htv miku svendsen angulo buzzcocks hanrahan malachy sverdrup helmer starstruck drosera shabby hass psychics heures kerch inadmissible weingarten kingfishers durkheim taqi comunidad tortilla ginzburg lurid tumbler bangui plaine margaretha avp cycled ÷ affording trustworthiness predisposition borda nigerians bufo millonarios françaises tpa slats macular rockwood starace caer broadmoor holywell subscript resurgent ganda jintao mitzi duster kah restyled kuopio lennard screwball choirmaster crowdsourcing amadeo motherboards mateus flashed zanetti menendez globemaster racewalkers gladbach euboea hankey olivo esmond engler erroll illusionist portnoy brubaker saori ugric seely mitotic miyan chaux torrents repin erupting belsen robustus woof eero crotch socon spotter pran berwickshire vanni texting instigating parthians muar decibel indomitable sharpshooters cohabitation montano balkh perron wrinkled goad maclaine dott jel iai unleashes gero adorable ironwood rainiers sfx solzhenitsyn nightline immolation madura flexion costner barwick whitesnake einen wynyard metrobus underpinning encrypt suckling bille urethra aberrations proboscis manrique tahu malleable melodious zainab dariusz alachua imereti 
hizb spirituals superspeedway sealift futuro estaing remittances marcelle bowker exmoor anaesthetic johar lumbering leaded candido geno portishead goren pips swank petrovsky rakhine girondins derbies decked badawi dhcp venkat shuja dravid nobile trucker iconoclasm forger magne khaimah lactation cli spreadsheets seimas hippy tnf cascadia loin bonkers wolverton cruickshank mpla jorma wizardry adieu grander geum articulating duplessis bhardwaj steakhouse rodolphe biotic mccrory rps hori cga lifter chakri thorburn burra ironstone sealy mcca ritually helier magick arba hamdi mariel comforted subtopic convener jacksons collis sleeved nebo magnetosphere walnuts letterkenny prosody peacocks taleb refered swc weekes waldegrave tweedy banish résumé kayaks maneuvered ibs bosna aed sak surendra strenuously viol carrickfergus liffey nouvel demiurge scarsdale mcdowall mohammedan macgyver rightist leckie deadlocked vil fireside nortel convenor novembre tanka mercurio vortices seismology pausini juried hydroxylase amazement materialist orbs decolonization riddler clarksburg gendarmes gaël dimas smoothness wayanad airings nisbet animus hecate hoge sympathizer partisanship zachariah eaux rumford vyas nii bsg aarti tcl funkadelic dink relevancy caudron parsifal gracias balding asylums obsessions neutralizing kapustin mews epigrams rastafari kaizer vallarta corel ghoshal entrapment dalglish creswell maglev quince vasant firewalls oxidizer sonics helin ntr udc saaf dworkin mascara carabobo plunges polymorphisms camas maceo brussel africaine boyar speculator juju gorillaz ashburn kidz airasia mccandless tite infrastructural kangra sues akuma grout deciphered desdemona deceitful bpl amat incisive teddington ikki hartmut symbolises sudoku likable jtc docherty carpark introverted csaba sansom adlai houma customised rivet streamers bks corrales yasushi comeau witney shabana silber peet confidante chastised salis lucile nagisa furtherance joana godin narrate unrwa schlager preventable 
callous novas registrars tempers baranja makino insatiable blériot demolitions bagdad joao themis naturalis srivastava lazaro maul götz maffei cip kare redaction saalfeld candidacies railhawks katayama disables absalom mantegna purser barris dongguan clift dunkin conundrum upham domicile lovingly nigga ginza potted bes harnessed nakai talat heresies adoptions inoculation kats cartan rpc dhe aventura riana sgs paramore buffering metropole wn gurung ubiquity gels wrinkles hucknall gla sobota goodhue bulla fakir xlr snead lanning nicolay eschewed jellicoe butyl cuza morell strangler gheorghiu otu mantova recumbent pommel bodil spivey mónaco växjö interviewees carril inversions berrien joann pavements storefronts lanius compañía csis raeburn poindexter shani caton tumulus creditable pimpernel oram deified annibale buries rosemarie nahe recoup elisabetta qamar briain hollandia mobiles teflon encapsulation kirklees alwyn myo existant filles vajpayee sinuous janakpur triceratops bellerive churning juniperus shaukat rosewall airtight abydos sensitivities aks materialised orazio gresley greiner kingsland bastien nassar socialized lampeter eggers medicina coquimbo defenseless spacer vy kaushik shinn carbines feedstock liberalisation shockingly dik warplanes repealing krasny venegas scylla augmenting sass concertgebouw ahmadabad rives convalescent harbhajan jarry hase peinture sarma raimundo wahda canova asceticism sargodha polygraph convulsions barletta rajat lorem mlp uzi wortley devises gascon theropods korakuen pederson viasat surreptitiously fipresci harel ilić jaa kerner malisse saakashvili progenitors runciman bibby flavoring expansionist pasa eidos vide hairston ingush opts depuis wageningen mullan otl deputation amici strathearn banderas wiiware claud lumpkin helle aswan sitters parler harmonize keelung moorhouse tudors doable emre langkawi industrialised shoji spoofed dictating leiter tadpole córdova backpacking calvi khosrow eleonore caricom petey quicksand 
circassians brega tactically aragorn abhay premeditated russellville eldred ussf demarcated valéry smokin strictest kuerten kwak grönefeld surnamed sauter delanoy hayne adige veolia annuities subnational hypoglycemia rocher hydrophilic junhui signalman sgp linea balto dietmar pottinger cissé quibbles brattleboro ppt brotherly anxieties galahad salafi castres odnb caltrans temecula antonina demerara occultist inhospitable clasps neuroimaging schwartzman tauber emb rez manton compiègne subscribing immutable superiore santurce fz fiori remarriage ingesting groff glabra hutcherson loong wallin labored karakoram reinterpretation lessee mimicked tolbert cherub misinformed broadhurst plunket mewar disarming derision adriaen sittard allin trond checksum sublimation cyrene interdict rinks kogan undifferentiated mccoll mcclatchy punted chastain humana grafted lissa llanos mcallen paneled byproducts bardo lollapalooza euphonium cowdrey perversion apollinaire hiker ramanathan masako replicates hisashi litvinenko kabila ciphertext debunk payers ballesteros silks primed magnetite savar jumpin cutlass stratos retirements lifelike pbc halesowen timeform tigray neuer sarandon yehudi mutinied maac gnp bahman esterházy cloutier albumin bulwer hominid abdus tcg salutes peuple naoto unseemly dynamism rakuten crispy alarcón tatjana orem hegarty stroudsburg breivik caracal foi sirte althea camorra inextricably snares paavo ansett fusco repress regionalist dirge hierro mcduffie crutches leonie coots mwe histamine ecr bloodbath gdansk sasa ishihara dodig clostridium jair carcinogens mädchen vajrayana dispassionate tulips enders ihor burchard neuburg lindstedt coriander modigliani occuring okanogan silversmith palaeontology photonic godunov haileybury uninjured tropez eka emmeline reitz lewandowski avilés infidels voicemail sassuolo generalist danubian simo waleed guesthouse matkowski endymion telltale colourless tullio arendal tamed németh iras kuni shand hydrochloride alleviation ghazali 
merz yonsei ogata stol detergents mav egregiously chacarita camphor ghul siddha geni rawdon flattening sailplane hornchurch madrasah dionisio canisters vexatious dla scapula jono sereno misawa microcosm gabashvili lully falsifying rous supplant carthusian plaines cannock sepultura alençon hoffa ratzinger sedation nemours irregularity barbora orangeville biogas whey dreux overstreet finders nears niv helmsman raghavendra fluidity koon jez clogged coolness regnant roti highwayman goldfield manifesting socotra leeson daum freiheit truthfully closers chishti khamis eraser nullification stews stillwell heli saucers ullrich cascada mangan cuernavaca lioness stabilizers alpi parkman austell lithographer séance unrelenting zanuck levu tantrum sandhu rufino depopulation shum laurels disintegrating rifling romp dulko midwinter glc fordyce marceau helipad glassy braunfels haliburton biogeographic ruislip glauca mitsuru taliaferro ril angara monts nukem egy tornados phool yehoshua conceit matera preselection napster berezovsky baltika dawood thyssen amistad tuskers lleyton immemorial entitle germanicus bayezid enchanting curren templating apennines rekindle cooder kitano babylonians indecency diorama ept chloroplasts cbgb mattresses friedl astern immunological podcasting shamil domine klose thermo trialled carneiro escorial wooing ffestiniog giambattista neoclassicism fco patan peltier lehtinen onegin rusticated bohun qwest embellishments rendez boarder bouncy cotto bhangra gallifrey radially cripps fisichella paprika rossellini hawkman rutherglen recklessly maisons karlstad lelouch alcibiades cradock helpfully sequestered monogamy simla rhiannon strays sliema chaining lavin midden pasok spelman sado rosamund lingered fama madhava wessels gusto unclaimed quelques kaka huis epcot adar reproducible westphal phallus orman centaurs forthright tsi kneel barua coves idc trailblazer fpr mckellen colston panjang workgroup rodez halliburton bedridden whiteness nirmal draftees libido 
polymeric dubliners bemis omens spinks encores threes impossibly diablos caux monotype durie darden nem placate rocko flach ullmann drax rationalization fett hirose wallachian cfu marginalised sachiko painstakingly rooke congleton schon bronfman ilha bailly parachutist rounders gpr moca hallowed lithosphere croker abalone sauveur ganassi ombre polyvinyl ciarán bacall caracalla pritam memoirist pilatus archipelagoes lacuna jawahar lpc hofstadter toscana conjunctions tanasugarn meltwater wreaths geng atg dmu carolus switchboard outhouse appenzell subsist ligature appointer overloading wether logician nera bourassa statuses nhu haemorrhage stimulants sejong bassi dostoevsky russes turnhout promiscuity boult strangest obstetrician zentrum hurrah iberville egger musee dvi escapist pewter nissim civet franko crampton rais jammer phallic biomarkers abuser ephemeris souths unprovoked disproven mattia pavillon plausibility honorifics katsura daihatsu submitter delbert hato mcinnes fuscus spadina icosahedron cadell icm nanna purbeck misnamed periodontal ghg numancia gove marmion apostrophes aldehydes putrajaya retracting dockers loosened arango purim asakura sectoral chieti befriending bago impolite minsky signe unreviewed yaqui vocations beanstalk pyke dianetics lanzarote molto humilis aleutians newsome notaries calibers subroutine shockley halperin huda bung oleh fuca alli sorrel leukaemia monoamine unimproved recoilless kickapoo officership counseled earnshaw inflationary carbons outrun pentecostals simonson nevil jepsen shopkeepers munshi canales sweeting toya capricious garin grg talleyrand photovoltaics tpe vvd weta bysshe idlewild altamont clannad pennines bdp taliesin kunz lodhi ched lra dissuaded bartolo nema wipes blunders staves heavyweights sabo reiser chartreuse synthesised borderland amityville cramp sores calles rebuttals ictv etcher usnr lsa appian protozoa cocked itasca offshoots divulge citron doron macmahon bato moslem littlejohn wildman rivoli folklorists 
hippos bilge ingrained sketchbook burdick pensioner bandon escapement ambala eroticism gopalakrishnan beria wrinkle enticed palpatine kqed tynan jumble noboru nickelback samo eap lavoisier petrucci atn vesna hiatt honiara leeuw tejas aalen infallibility bursary hgtv cmb pnl niamey lynott caruana jap museveni santino periodicity ambika kushiro acetylene casings roped capa joffrey bullwinkle ecumenism gymkhana liturgies sofer carden pinder kudo modulating unrepentant banton jeri metronome guia palmar arkin grooms mangoes gulbis nyon refutes cccp hyperplasia peeling hombres lofton bereg detonates firpo cheri atsc resection bevel felonies prion adaptability pasolini ssw severo zines seawolves voisin asx ouen generis coffman bcm littlewood uncountable marden interventional liangshan polypropylene underboss swarming andika söderling claypool oatmeal survivability patricians armistead wallop biak moult kcvo meditate takamatsu pinkie lawlessness jawa googles darkroom testis weirs skyhawk findlaw sprinkle supercentenarian nwo forrestal affluence bmj sandwell arnaz tvp watermarked fealty tailgate norval sardis pestilence moncada areal birkin abdullahi désiré imitates velazquez newlyweds mov wels bayswater varèse reticulata enhancer oratorios kil hrm heute loess rectification orchester juste overprinted pel crocus paulinus cydia puisne pappy enosis cliffe vaccinium cala savers androscoggin showgirl pna airbag pisani janina landform goof rogier stille está theresia hew bopanna akhil caguas testbed electrolytes kaga camcorder kalev ruthie andere hoey waalwijk congreve wart msps tomic venereal choristers magill rafa afterthought merkey teda interviewee veli lashley documenta tasteless parfitt californians encyclopedically monarchical twelver plainview alchemists nett truckee tinamou immaturity naturel interrogative calvinists kahane iturbide montesquieu troon indebtedness cnut shivers parlors gleaner paulet growths crave kutcher machinegun powdery rafah boatman troika armature 
dairies sali whispered mercure similiar upholds collectivity coalfields proofreading astrolabe shoplifting sigel gulzar woodburn holler jaramillo electrocuted nicoll kaneda harlingen ramjet fpga manmade saranac egm bib oculus olle invictus eres mcdiarmid backroom hippolytus fujifilm prankster yoshiko verte tenuis tfw patella secretory cranford fuerte cataloged khai avangard querying deserter vaishali taíno inedible ejecta changeable standup yy agnus cranky perforation imago meralco neurosis mudaliar zod barbers misquoted oases soren dra tilghman cucamonga strangulation entebbe gilson machu flamurtari propagates harborough blatt escambia habe mickelson pujol margarine wyvern schemas husk spaceport melodica latte lindo kak temasek reapers smrt elegies clemenceau gramercy fet undesignated egmond thermostat shauna ziff kohlschreiber willys mems afshar endoscopic arapahoe vaudreuil olmos kio kupa patnaik ashura bulkeley telos assassinating smokehouse commendations crone bustos dominicana hoss vitor cupertino pollux sanctified dud herning dialectics pallets effortless henkel veen ileana rimmed damián resetting fittest spassky allee untranslated chemung xpress atheistic kunstverein cusps petkovic wonka nuñez prakashan studer bomberman morant graaff ulnar publican reintegration waheed hashemite chc kyustendil ehsan bidar temperamental interments stinky valeriy nebuchadnezzar ajc keitel orth mathieson grigg tormé postsecondary wef banten webzine librettists heritable rohe bachata sinead alleviating lafleur hambleton senile purba cabildo foxnews histogram yoyo crusading midweek ceti verges agüero circumcised klerk fauquier fein yakult prosthetics mineola sharada takeaway hipster sisak abject mickie incipient deepika figueiredo couto hira smokescreen simmering insinuating conflating hemispherical snapdragon hah edelstein killen oop kuang mapper kif vallis hindman bublé chambéry ferretti teotihuacan rostislav melfi sunspot skinhead miser radiography hypertrophy vta burks 
hillfort jcpenney antibes ferrocarril tetanus popp rosol doan bilecik otranto amro judicature stanwyck adress artaxerxes leninism agama gropius oddball worsens evp nanning bardot ahar integrator dois withering vps quadrants extraterrestrials takuma fruity nant moffitt rox spanky caan gotra extinguishing polymorphic albertson mucho separations thallium damselfly ringtone foro homilies branislav blockbusters extravagance refurbishing falkenberg jadwiga asan hehe forgettable hardcastle egremont burdwan tsb glimmer carb braham notifies ingle aalst jerkins eod cgr adamantly heiner cracovia heitor rda très allred eaa universitaria lawyering antigonish janney unpopulated sonatina buna galeazzo polygamous excavator goalkicker mula decals smeaton vityaz thornycroft eustache daniil spellman kir birdsong constantino usatoday waratah exchangers imac dels heirloom bleached wallenstein scola rosenheim comtesse felons befitting wokingham ehrenberg obscures sohrab parka dahmer gonçalo canandaigua rcp buno duns lavra abysmal exclusionary lafferty syndicalism philatelist rytas interferometry bramwell choudhary fluoridation utd sobel elston girton abductions reinaldo crankcase kazuma merchantmen simcha instigate hearths tgf chhatrapati sif elitism alif disembark aci taguchi glial zahid yuliya yaoi margolis cera microtubules reynard turnip shiver aback horsens killah avc lier sns branning enya thoth goldsworthy sverige leyva bruch godalming geolocation ccha urbain dado geophysicist barret scouring lawford reaffirm assen andino mausoleums frits endeavored lavas immunities eba jaundice norrbotten plimpton spangler coalesce kahlo suffocation methylated samarra shimoga overlays vikramaditya scallop brabazon polyglot belvidere onda histórico maumee carlota quart undistinguished snellen satirized kokoro midsomer rapoport eubanks tredegar lutea rauschenberg rheumatic fabián mendy ryuichi cipolla gns grated oulton autosport hagley yamanaka follicular davina hanne injector toyline hoorn reino 
sok xeon pillage godiva orientalism reprehensible domo existentialist roane pana jessy knotted sram smits clotilde unseeded inductor borderers hermaphrodite sternum russe ambrogio critera sajid bauder conestoga naqvi kieron tfs colonizing mccarthyism nacogdoches rockne sukhumi exportation urination allendale hardwoods eateries yas flavin halfpenny valentín idem optimised kuki bangladeshis malformation ngati mell dauphine roadie pegu microelectronics yoshiaki rheumatism quimper casson kucinich intelligibility suspends alberts sunnydale phrygia ellice sete haplotype noirs kaas sapir jeffersonville confining goldin rohde badal chicoutimi scriabin vaccinations poste agathe harbaugh jagdish duro hartnell niosh coax gira mortis grado compels pritchett cleaves schaub bettis sriram suggs ico dissonant buckler coachman saïd miron lohengrin pimps macauley antioxidants guoan esse huddle discursive rivne borehole mohini premonition insulators tpc travesty kryptonian confounding betel civilised petula aei corot canciones neuhaus leche sylva lorre takayuki languished visby denounces bulgakov yamhill pavlova tallis phs gari chenoweth complimenting chapultepec diethyl blowback pepa cookson chileans quelle menotti elmendorf gestational btcc pissing esme zips invert greenback zagros vodou moros ludendorff djinn mcb theologically montalvo unassuming agp twitty otome zai agusta frat yankton orchestrating bulmer castañeda illuminates macchi scientologist refill classifiers tattooing artisanal misfit scrapes crombie jokerit subtleties symbolising rechristened balaban lukashenko suntory zoya weighty lavelle phoenicia affront wilander javad inflating hyperactive enric lumpy otsuka callus robey dhoni circumvented piaa boles fujairah chinchilla koshi agricole excrement amrit sergipe harington reinterpreted uesugi broods hislop fingerboard pinacoteca mccray smokies undeterred boaters wilkin madhuri leisurely nicolaas polotsk páez stauffer reassures nyack stratofortress gagged inti tarik 
sulzer batons aerials zedd décor tura montenegrins maronites argosy mcginty schulman bukidnon khadr axils kerstin gsk buddleja portraitist scorned wraparound yala quevedo mongering mannar chungcheong seacoast emlyn smriti elson concerti biplanes siete dunster heparin donat perturbed zahedan fido unrecognised fid overcoat surfactant hefei roza orix foursome spearman eni parnassus deployable angina brownfield initialization sangster ointment berenguer dresdner galactose attendee publicise torts luneng zardari battenberg snag hfc sandburg iw eze nouveaux schist outperformed crudely feroz tati sdram istomin vérité restrepo cath monge shamokin boylston elope evolutionarily bakunin llagostera bgm spellbound brentano regretfully rivets zwingli equatoguinean aphid laetitia virginians vulcans fsm alexius literati miyake anthologized biceps bandaranaike denzil prestwich osgoode tutt vira zebedee peculiarity zeman kinski rhodri gaucho fijians marmot nima marinette florist santarém maxims konica unimpressive stuccoed pitiful ulema umbra eötvös uup schaller precondition rosea rossendale tearful alhaji majoris ryman keke nob belk goodrem trg verhoeven occultism raimondo berkhamsted fevers claridge cather anic nitty cerebrospinal perches estrangement olbermann ahram minho insensitivity doğan microrna rouvas navidad heartbreaker vall bellum outland weyden pati laa byways hickson vios ndr gavia kla efi ammonite powerpuff arachnids padmini mediocrity petraeus vizianagaram stifling allegra facile shoaib spotty kushan amédée firma gervase negating thornbury atkin sikandar stn ulloa overlain delineation bottomley pada lauzon gossard ricki kinsale plantar flattery mechanicsburg medlineplus valdes ipso wimax yasir akagi quartermaine feodor rekindled canadair silvester hatchlings bast khali tubs tico minimus aljazeera benched dormancy karlskrona ecco borno hite uttering harbored timurid cno lashing cheatham chamonix nds mymensingh tiebreak bsb uist stonington curls newhaven sdl variances 
messines lamprey vauban conjunto pizzeria gadd subantarctic awadh angustifolia perkin fasteners skupski gunship distrusted scalloped smiled payday supercoppa usefull nukes frankland mimosa dernier cantopop dominoes bookmaker plying archimedean hyped aerobics chert kisan lisieux heelers kubica ethnomusicology specks gossett purifying velho caesarean deltas obasanjo bentheim swr allt puyo alisha miscegenation hibernate drusilla figueres boudreau magica unforgiven shreya verdon babos musselburgh outgrew benoist legalizing theresienstadt ebdon protrude herbivore tucci accede fortieth natan pacs hooky wilbraham sprinkling oya handsets gell mardan macdill antonelli gamasutra kilian waveforms heartily mancuso plumas standardizing firings meson cheb kufa interlingua dilla epinephrine nastro theroux keira incessantly rahi mensah methylene brodeur hibernia doren beckmann stares vidarbha woot riis palmieri cumbernauld babysitting oyama phosphor guava withstanding xenu defraud equipments lepidus affaire muséum hydraulically grb abadan whitburn justifiably cres trinita rkc culpepper pansy biographic argonaut bygone macduff jinn hideaki gentler satchel considine meaux undersides perri mamas mehran rct hualien scarves sarita profusion lovelock chace downsizing extrapolated ager lich hazelton fag kilroy cupa barenaked drivetrain gutman maggots bct cauliflower nbn flirtatious haft afton resp martyrology sleight welshman equilibria borodin barricaded castrum congregationalists omb eloquently kellen ackroyd beith issuers supercomputing hanes wmv vvv medellin gtv uninhabitable ilias clarinetists heliocentric redvers mairie mosca refitting calderwood hauer rieti hangings dinara alemán uloom prodi centerfold plowing nbs unwavering beechwood tyldesley theoretician aravind duxbury naturae eloped rigaud guttenberg achievers dalmatians personalised ordain dayak andrija stg clonmel vied heretofore thrusts divest shepperton akwa dumpster naberezhnye quiñones manatees someones slimy delineate 
mossy tongued ruggero paroles palenque handbags manisa shader triestina rjd mildenhall unleashing defections maslow anchorman pluralistic slipstream reddick sigint suga vieja harpo krogh syncopated cockpits condos slovo manasseh ribes paneling elma biella danner helpline garlands wulff garbled cotonou rifts spr émigrés rara aguila baggio rog mii tepe encino carpio lifeguards harrigan chloroplast foolishness bonelli erving nobis mandan anisotropy nitpick valiente hardinge annam swale scops cerda apoptotic sastry webkit leur loaves hanan fws biagio uncompressed showrooms emerita bna diadem extruded kristiansen sayid batik fremantlemedia ded beinn zuckerberg kitakyushu eoghan angiotensin dort christology multichannel mottos debarge firecracker toscano sociable qualitatively supermajority ussher passant kubot popstars devouring finlandia fwd hdz carbone curveball durian phantasy viernes admira haughty newswire así untamed mbp undying demetrios chukotka kofu swedenborg transits yanni giorno kurnool topanga netaji knightsbridge paltz heatwave schuller yawn kreuzberg ingvar sintra rivaled galesburg teuta loosening millais corporeal lundin antiseptic battering simcity stallman altus shipton straying bourdon alpena cpn costanza ipsc goodell inuktitut ballparks berton manne ministered semicircle mcneese tereza maisonneuve scharf dumbo darbhanga mosquera lexie hashed lerma wsb passos sakata libris pushcart kenzo sella birkenau scanty halep codeine mohsin lessening delinking iconographic comunicaciones macworld deciphering jewellers zayn quantitatively cased fau jamshedpur atlus teleported surfboard josepha atromitos solvers baig earthworm prelature hatun gerasimov alai delinquents assizes centraal honing emboldened misappropriation tulloch clarks mazatlán headroom publics debi muti démocratique lunatics araya wombat abcd tengah svd rainstorm brachiopods baddeley bautzen stalwarts extort friesen laney ashur camerata blighted kii manzanillo hippocampal molino visitations 
namgyal territoire daiichi viti darussalam gobierno skyhawks cornering saccharomyces halsted refocus macerata marlies bascule professing boos jablonec schnell wetting mousavi scp gakkai flacco lorin amores canseco spinster stang conceals bunches sauropods huth darwish joinery rattling carlile ziva whitestone nfa evi nrj stanislavski winslet delgada dunhuang corinna olmedo joannes sakya inkjet floored otley varner dehydrated mouldings fedorov jutta mcavoy newquay uplink maharana saman passivity workouts rasta horwitz humorists faizabad harries prestwick olmert kaun starman functionalism woomera bms eruptive spousal funen prakasam scolds waikiki enriquez xxxi fbc brzeg factoid twh motorcade thirsk suro caixa satyam turners outlawing corpo tommie lonergan quisling patrese kumaon demented benediction seedy howson tacky gryphon tanabe gola paragliding volendam graciela earthbound keely monferrato openbsd sng athabaskan cammell allston rueda carbonic jacquet wsm sanna biddulph suh perret batsford hulbert raves leann friederike ndc moyle indecision touchy gaudí angélique roker unscheduled nocs gann linings prescot bores zakharov protectionism aktobe splintered prolonging juden lufkin reassurance hashanah reni bleacher evaporates simonds devries hort sholom warners pollak kasi cajamarca iftikhar photoshoot cruyff sagarmatha jeopardize mammadov menem sisu syncretic reposting alwar revoking hebe splinters thiessen covina jarkko seldon strutt exclusions ambiance antipsychotic underlines cassin azim lantana sib poehler roop fln doohan wmo gouverneur aitchison sime conceiving wildebeest chelny acolytes domer shipowner frictional solms senza mondiale rfi maven scour maruyama asf stocker wisteria georgescu impatience groundhog vissel siemiatycze tighe waterpark ruddock annexe bodin eyeglasses bayfield neh marginalia industria minigame benjamín goncourt pirot refurbish gemeinde hirata quesnel caricaturist faisaly orcas strahan mele mañana vijayanagar atria queanbeyan lieb aras 
reconfirmation carburettor cardio orellana vanbrugh fractals reminisce mostyn ucsf paraphrases prine lozère pisano prowl monastir implicating roraima firmer acqua jaina bernt rpf joop abercorn pubescent modernise graziano walkie bosman withered turnstiles basti thrashed pozo trac dunwich undemocratic shota warminster wedded hilario castrated soundscape megapixels iap erinsborough niel ammons husseini shackles dashwood lafitte lautrec repudiation pekin kamrup geraghty mohinder mummified kalki maclachlan prerogatives tautology stefania langevin kurd jørn owsley ramaswamy truthfulness rovigo photojournalists sarkis harshness apostolos montag bls bulldozers repugnant nailing weng lipped somos contras aubry dsv defensed undersized mugs cloche henze evictions dabbled cvg leys vma oku irrigate kanna ferrante chitose frunze murrumbidgee tinea tilapia kodama npd hakone jace polideportivo patrimoine radovan symington yoshikawa universitas bbva avoidable nahal wailers sterilized worrisome amoeba crj sabor norristown kindersley tearfully blumberg charlesworth homicidal myelin vvs matsuoka foci salesmen tenzin nadja echidna morphing gratis octahedron ‹ spotless burhan sov ottavio linklater srl stirrup contingencies rance clásico playbook loathe isf baynes constricted extinguisher golson orff rauma materialise ppa kiedis gremlins chanter odawara ovals miyoshi devan phaidon mcintire ⅔ topologies sandow grappa glace msk hilson superconductors bohn makepeace stuckey formalize elude marginalization keratin shambles berates roxana mariscal bolland apollonia manalo carpal wetpaint smb sardines collinson bie heidfeld collings saudis komm suan reproach gulbenkian rotator gauchos impotent trl tribulation londres unaided blazed tetsuo longed obata madoka jamuna eland destabilize hawick mahé malina speakeasy synge beekman mcmurtry newhart weasels migrates kwara transversely shadwell shaul pieve orta futurity manlius officinalis internationales infidel inchon irrevocably gramm generalizes 
transferase michaelmas exupéry chetan glycolysis hatay texel electrocution pelts immobilized hokuto betjeman apprenticeships wheatear cosme coalesced sacrum amoral keweenaw whims hypersonic indivisible smg corny tseung scindia naar mcinerney floriana fontenay lgm eskilstuna childrens hinkle paulino branden penfield decontamination glories wimborne dishwasher muscatine foreskin supergiant wario ibu dovecote watauga kateryna valuations était agusan fixated queensway exhilarating bootcamp brocade appa readmitted renfe devereaux mvd pns fainting alauddin plowman caliban summerville recapitulation sattar catullus mikheil compactness jaxx ghouls mucosal apathetic enquire macromedia proofing mle attar rino cruelly paraplegic avm marché excelling chimps frontières keying zulfiqar aji fiefdom tajikistani zayas shrugged zin vliet tonopah marvell aditi keels corticosteroids granados pharynx nucleation demoralized surah haya rêve overpowering extractive hubertus ophir suwannee minella zemo workbench liane hashemi izumo marsa haggis amesbury plautus metin predictability geosciences eyewear bartlet fé bourguiba haswell springvale sweaters vaulters arty macadam worships carnivora schick coincident postpartum bereavement sentosa sede balaenoptera juillet dressler pse venera berchtesgaden welton concierto cheerfully mdot nods punters berthe maracanã spt chubut stabler grecian communicators maint vala arousing brackley motilal celebes goyang kalamata exhibitor transfusions pisgah bbwaa enqvist maximilien soundscapes brunson kore litigants semis vell recanted belasco kashubian galls custis capablanca bregenz mammoths schutz mcglynn worrall cavour lob malkovich pompano aspartate substations grudgingly steaks trikala drover sufferer pompadour yevgeni vojislav yarbrough cordless lycia pospisil technica newline rabbinate thingy revivalist delicacies steadfastly shenton rousseff doings downbeat riche lauryn shaan kinematic airlife qq chequered collectivization proliferate bedded 
telegraphic ecclesiae salles squawk xxxiii drivel dinajpur lakas gangrene integra piercings irfu bama hotbed zither amiri dimmer zooming ostracized lightened extensor guaira regurgitation llm shahrukh bertil kunstmuseum irritate accordionist overviews forthwith pissarro kms resize pertwee aprilia mogi cyl rosenblum sunray culpability juliane bankrupted perdido gchq frg prawns mycology bijar myrrh torreón bormann naja asphyxiation supa trunkline neuman larks quackwatch khoy kircher taney ravan randell wordless wyandot nachman heroics flintstone roundly spiteful alexandrine posturing sanctification basta spm lakeville isamu naz tvxq ponts rubies janos murphey boardings hopefuls vq caputo decoys dyna pujols gazzetta radoslav rew graafschap arbitrate solvay inp jungian istana moulins humes scrubbing mudge jolley bocage entitlements bandage floodwaters broadwater cpo koyama samplers foodservice penile sabretooth sunt tebow nazca redshirted helsingør condiment edifices caloric headman arrowsmith akt pupae manufactory ramapo hpc profiler banger alkan vespa cowbell pavlo disobeying novartis broadsides hongkong exaltation vitruvius stanmore omnipresent nayaka ihs eurosceptic deftones souris hohe myint lesage grandiflora rôle stedelijk pleasurable colonizers unabated cvu asuncion yoshiki philistines gonsalves malaysians beluga marly corgi swag biophysical pordenone vegetated diya telepathically downgrade satiric cheikh billingham ostensible sociopolitical uhuru luque creosote punishes dreary cubitt bioscience rectilinear lamentation ozu amours stad reverie hanssen chota baldock connemara agitator swa fiends brokaw freyer chore accumulations rackham spee hoar totten fela unveil tfc buccleuch regionale mazes treme ameliorate murrell samaras lighters leadoff impotence farwell charmer npg famas nisan lanny attainder bickel thermopylae dinwiddie wouter carrow duiker ubi hammadi gelatinous manresa evolutions exhaustively desperado alejo sugita buttermilk carel inferring katyn trp 
upturned cronenberg agrawal dagon pythons lawmaker panhard fillers concertina leprechaun presumptively contactless reassess longa agaricus hilarion wampanoag mcclung oppositional juve hermine borax impressionistic lymington trapezoid housatonic hinkley elspeth eisler italie stauffenberg lids distorts brunch bila underscored crosley oko andesite dislodged roméo lajoie romanus reuven tentacle sergiy pyridine adjudicated cellini attilio guardsman lande adamawa unwell proconsul rockman wain shalit megiddo mef vitaliy ojai zemlya sangamon homenaje editorially christoffer soko wrappers placeholders yrc indulgences swordsmanship câmara trop belew pathet bisects secker mendelson bhawan boycotting directorates fula goalball detmold baited salieri aerosols shroff marita balad zt poh selhurst koufax republica granitic talus partita infestations godhead paik kirchhoff disheartened parasitology communicable gopinath digg marxian layperson ballooning subsidize strat universals microarchitecture reminiscing fritillary kosh advertisment frederiksen enchantress gev ryland dds mielec pidgeon brus pentecostalism tillie pederasty nisi usafe catchphrases toodyay aldwych cataclysm strolling vivir kohlberg cpd vanya zaria dioecious unstaffed ksu crestwood teases tenggara handicaps moz masayoshi manistee zakynthos ajmal funhouse agronomist diaghilev âge jiri transmutation tirtha greendale helpmann chuckle pharoah mhs imperia harri bastos lcms nodding yearwood drogba pauley eckstein dupré frederico animas karimi wargaming optimizations shrill afresh libertine lgpl lemoore heterozygous lalor criticality deism weatherboard mosquitos haydock casks provosts troubleshooting bellarmine tripe sanh zaheer ditton typ hersey northport bresson liezel blohm fabrications alberni spoonful relents bruegel insectivores contaminant steinbrenner muna aadmi amri espouse atb anatol cantilevered thunders micron lbw sankar latium canopus friis fingering maddux aznar tolled gnat spectacled ventricles deviance 
doobie delving cerrito staffel panthéon consents bix precipitous otus roig stakhovsky benares usu jabez birthing jesenice minab kray azs beneš cornhill hammill satirists merchantman purposed levis lanois showground agadir flaubert polonaise bugsy foghorn krumm banteay estación hov reflectivity sittings amand alexi karenina publica tigran stiletto meatballs ime collet zev terai vaccinated perfectionist shimonoseki sacro otho trompe whopping hibiki agnosticism statler riverhead moskowitz chinn rejuvenated swaying dhanbad bergin edwardes colborne crowbar roselle parisians executioners univac disliking lesbianism aswell keeled finisterre interlaken humanitas akali ards unwitting harmonix karn heterodox niebuhr myrick roadblocks twas savo lodz critters fce intracoastal iworld wiedemann inverting bognor srikanth moats sultry wdm junagadh doa koontz apostolate changwon saarc jyllands modernists rahm infinitum mando railhead resurfacing kikuyu dictum gaff zarzuela anisotropic balm inevitability celled louse bootsy anurag daniilidou sansa moyá ecce objectivist shyness crumlin perle marlena jolt enticing kölner fft haring buzzards upsurge redland kooning speke setzer gawad ctesiphon lares volandri rusted endre seaborne tapa dinky simran regressive juin ferranti brownell lochaber droppings takings shias calabrese lox typist nassr absolved moraines hemolytic rickshaws aah sanat troicki wpc barco mudstone combing herpetologist backseat kaminski hati companhia parmentier abatement perfusion gautham nozomi zenobia chippewas dowell chetwynd advani minutus mcas haviland overworked pleasanton phenylalanine stupas joie broadview millot cleethorpes towpath corley cylons whanganui wau indignant perching atwell metamorphosed caio jerrold cistercians parisienne lebel flopped svensk jenkin hallowell herero rheingold bioengineering searchlights sectioned vieques bronte eider pappu blindfold proxima polyhedral indianola masud pounce proudhon dramatics milian rize whetstone vcd inácio 
taxiing poynter golda micronesian plasmas goodricke dummer btc strider pmo gela hardaway decrypt triphosphate disconcerting faison bowditch bambang chaps moyers madi fortier wernher dubey bindu priam glencairn aep glutinous uhl liger evander slane orangutans hentai mugen magnesia sizemore salud ksc morty graydon icelanders queuing disobeyed illegible parisi unenforceable rmi cholet farmville tanna bickerton regenerating strabane whitton sprouting supercritical druk síochána melchor miodrag ´ sontag stinks ahi wolof farben multimodal otro gaping ika cronus kieffer conjuring rondeau mingle conflation ahs pretence bombastic hidaka prager harwell tinkering pajama santé sphincter kazakhs takei arévalo saltzman sarum kodi pindar standardise hypotension congratulatory frosted ndebele nicktoons phaeton nol ranieri amberley kokoda creston foligno hemsworth damas époque darna salton theocracy lma adage pus brumby bezalel toshiyuki pinkney conyngham masoud penalised mbt lisicki ruprecht qara morn lvov frisk mersenne taber escutcheon collyer jettisoned badass earnestly actuaries curiae safar mll dershowitz novelisation arrowheads statehouse kiwanis wuz hoban antes saro incinerator basf soames polina pbx disagreeable cobo potchefstroom katalin attache klass băsescu monkton mudflats archdiocesan rohini sergi clockmaker tpp practises funnier funafuti kull miwok macha spectabilis epistolary wot anstruther squeaky reimbursed liebig grupa rosé hoffer cpusa catskills guidebooks diamante birdy incantation macaques rages jann biomechanics obsessively superlatives cys tubercle unionville halden burgher tridentine fessenden snc dilbert laterite recesses tiro first palanka wenzhou lugnuts bsu macphail semiotic cuz rlp paltrow copts queensberry kapellmeister gro dramatised preying calligraphic nambu rajas mencken showalter furlough loja prowse marillion terrains lentils hella lory hoisting picchu shelving guaranty veined fuga glossed ratifying eminescu initiators retaking boba fuchsia 
metrodome nees lah glorifying sheaths telemann mammy massú tacitly hornsey ebel desa crawls repressions gilder barnaul taishan firebrand aurore ors eigenmann pique yellows clipboard glinka yoru picketing ramage vocoder bharath lemmings cloves unquestionable piglet halleck touche ailsa bist deforest deke cefn delicately acidosis fertiliser henna seamounts mouscron anaya astonishingly birkett mondrian necro junichi dosing mondial subba subfield blisters nari arbour abusively spanking tailback juvenal ced dramatization diarra evocation umpqua tecnológico earache kinoshita régis mathewson wreckers barnyard storekeeper beyonce milagros apalachicola speechwriter foxborough brierley allegories wawa horwood altan resettle cineplex olave frugal jamey bott lullabies recuperation kellett razorback sculpt landi hibernians subheadings masaru utero granta timestamps clyne bubo bartels azaria pulchra novum xxxv macbride foshan aten norberg dinas ginkgo tzvi ueber suppl lhs hedrick bencher didsbury tincture elke prabang jurgen proliferated rade discards tilson bunsen barbs subtracts letchworth shamanic buzzfeed halberstam infomercials bannered rufc armband gatefold jewelers wittman kawamura pingtung biliary dallara lali syunik pleasantville pogue lefèvre basking tasking segregate pulido basilisk underlining titusville ionesco quantifying tarrytown baltasar critiquing elfin blain nse conny cervera ketchikan interpretative subang martius vermeulen maracas vivi cashed báez bubbly chatty redknapp piaggio hannaford mogens prilep weitz dubin tss nuku reams thalberg amkar vocalization pincer purses drdo weise paxson mourns shahbaz collation pastorate belladonna triste envisions disturbs ginseng fetching newsgroups unassisted grazier yori pajamas cavaliere chagos welder azeris même sharples sainz friese gumbel belluno pyro faubourg publically tualatin magni songbirds custodians années eustatius logicians lav nanterre colditz simson quasimodo exorbitant acoustically mance houthi frontpage 
friedland jabba fyrstenberg clearings dramatica aegon albinism gorse latreille gratefully mdm zastava gano roundabouts mullingar cliches ingen gosforth aizawl evaluative salk ringers destinies tinnitus niobe logger heeded waxing tash prelims quiver bourbons purulia excavators icarly frontale suma airworthiness kops surpluses trashed catapults seagrass pienaar fatma tillis bolelli klas hordern masoretic eavesdropping brisco monnaie planktonic microtubule mcquillan cassavetes efron flipside mentalist erzgebirge abercromby myopia bromfield birney twining wrest kishi fibroblasts silverado cofe verging ramya dé sfgate stator abbasids imad anhydride indi troglodytes skokie bork gbc cranberries dudu pozzo baryon montez sesquicentennial throbbing wanderings ambiguously herrings hse trost seeps masterminded prideaux underpants anemones violette gripped calamities pharisees feds titleholders hortons historica wallington eighths imaginations macneill lindenwood llosa illegality swayze hundley streeter programing phds reitman unpowered jeg pastrana rebooted contessa nunca strippers ably aare wheldon sago sangli chiaroscuro defensa kuro juncus holograms chernykh hemant imitators dormouse needlework wbz goslar photoshopped issei sentimentality xxxiv shinobi berri neda landy uzbeks larimer takoma harsha woodall apertures sebastopol sviatoslav bskyb omo platters trebinje invulnerable koppel seagate handlebars separators narcisse ligier bostock tachikawa arvidsson trevino tbi halevi oceanographer oxytocin kitzbühel aurich cofounded hoxton whampoa pis gals surmise saddleback inundation flagbearer ingest cherwell oblate curvilinear rémi obfuscation headship nefertiti carstairs obstinate frenetic turgenev esm alpheus extender tomy sabra sira sayonara barna quik soya choreographic tacna defiantly berlocq misogynistic bayly warhawks nek sneaked mossley artesian isdn gandaki mylène harkins chera korail mimo puller rusedski mirim mung mellen confounded khasi boisterous kamui covenanters 
pmi holodomor unia florencio brae rothmans jaspers tadić westmount carlingford likens glaxosmithkline caucasians lakshadweep tch roches bitching sergiu shorelines denialism collaborationist clairvaux treacy beamer wadia septal paxman anise animating megabytes farrelly corretja bartleby glück springville lightspeed marksmen ruthenium tham rufa hmc preece golders hideaway sellars muthu johanne daughtry soderbergh tallon megamix defecting hyperbola pfl carburetors ksl redpath cathédrale internationalism jello odenwald benoni joong supercontinent inbox lycos prüm parbat daffodil smr hypothyroidism thein ribot westley helgi nettie pooley kartik halla doughnuts atahualpa overflowed vulgarity matrimony tetsu nicklas hogue splatter infirm lermontov paedophile ternate baywatch exclaims flatts backhouse unassigned sneeze yoshiyuki melgar kenna hygienic yearbooks dalia teletext roark ditko washtenaw maximized synchro takarazuka jit agi bogue noob siyuan kroq millett baltar restarts lillard swee ramses holmberg meditating benue nef mochizuki penna swire gerhart quarterfinalist koa huánuco ternana kristallnacht cfp cloaked goldfrapp spaceman courtland filipinas ucs bristle gunfighter istra tsingtao espousing petersham sherif blacker trimmer southwesterly steinhardt reinach buckman callender tarp oktoberfest reshape enplanements weeklies vater spiller idiocy azzam lifeforms arabidopsis pok cette bicester orm capitale wenceslas waca marfa vit vmf durrant misconstrued edmundo henricus majuro shimmering cadenza chibi oakenfold eradicating robs unsc chicagoland takeru shahab courcelles plucking purists mapa villon attock aznavour shakir cutts unending lapel retard nikkei iredale disloyalty defame imperio latifolia mortgaged yomi crunchy bequests birjand gosse esquivel rebelde dees oun semana lepanto wiesner crumbs hiccup gort jardins kahani gauleiter rathbun fowey dismemberment sociocultural bizarrely quanta ose ansell kuno aub mahabad rsi streetscape pardee irritant cupcake sone 
banksy cormack fdc hoek helsingin eyal stoyanov boreham chek turley vinaya olam lawes minn uiuc jailhouse chiloé sandstorm vishwanath campagna northrup tyrrhenian proselytizing antunes lynton adulterous hooton stilted muri hamann melatonin millman rendsburg migs consignment meléndez diasporas takin iic superstore centurylink bagong impregnable joules baboons bruguera consiglio ippolito madang unidad faruk campagne rabelais cabinda landholders plumbers ashmolean dillingham yorkton pointedly xli shultz pacified squashed ohr sctv bouton derulo nijinsky morumbi khoo abm jusqu changers chater naught whitmer fitton sain condit covey gallbladder stephanus treviño hangin catastrophes faeces sunflowers latvians dined eurofighter lymphocyte capitoline llangollen alouette dhivehi enceladus deft vrije prabhat olivetti brownies hardball karajan wardell negus iles bechtel bhadra flt haidar scrotum bitburg hoyos colliers brühl kundalini metairie loxton iommi navigates dellacqua stranglers decile sliver olmstead pano miers theodorus saku bosanski mattson clerkenwell cinemascope oxidant gavan scn toothbrush terrorizing puc lammermoor gacy wagstaff brayton irrelevent chandu wrangling srp demetrio prus bijou waukegan unbelievably shirai eastham bicknell fugues vivace plebs breathes giddens wyss menelik stutter sodor coxon paynter richardsonian maggi paracetamol lucida wrangell hedging baggy whirling runyon softness doppelganger webley assessors jurado snoqualmie milam voracious parkview nomi trig counterattacked extricate weatherly abp digress reconfiguration bucher fabolous bulging abrogated hamming stunting misdemeanors qala candor theorizing yachtsman mobb oppositions perigee bakri vanna penitent competences satrap nib rickie morvan mantell hypnotism mazurka bialik muskrat hanyu smelly skynet stabia kasem slowness scolded florencia bayi florins dunwoody hematopoietic carracci alcázar argentinean misrata duchesse ige nusrat traill challis elli reciprocated berthier mgs analytically 
gurkhas uq fragilis expertly tolerates zam bruise beate sutjeska balogh tadhg masjed kerwin brigs fredonia woken witter insinuation moc jovian sidwell acidification natsuki michalis airey arial elysian teu malinowski doublet anta gringo lavatory beals gmp mcl staatliche bugging tassel landsat avinash ferromagnetic duvalier suffield heighten kimbrough zarathustra dubose gandhian superheated hamzah clinker reoccupied mutate mirabeau fatherhood dwarka goce fracking bushwick ivie chiranjeevi dsa kasim roosting eberle inka assi variegata cannabinoid santorini haverfordwest pyrotechnic zuzana veng extractor cliftonville kickoffs ploughs versicolor tapers queiroz extrapolate kenn belittling markos amazonia maywood doig schneerson twine lenihan wimmer antanas holroyd istres frisians karts trusty tavernier amora trusteeship crossrail nargis affix bluebeard carotene kweli mallon munna generali tribals muro pádraig nizamuddin conscript alde cytoskeleton beamed djing muñiz uncritical panellist serf hélio earley uninformative profiting unmask fiercest siraj addressable gregoire grasset conover magnuson maro manningham isidor repeatable bdd russa kunkel ecclestone judgeship malaise zamoyski remaking plm yikes tapioca barthes gmtv psychopathy brutalist halfdan albani offical sbt chiaki bourse onside eczema artyom barbarism oakey syntheses naso einaudi tedeschi jaz semitone gizmo chinatowns castaño lazlo akashi looters huskers heartbreakers kosciuszko dermatologist ascendant weirdness usury stowell allegretto reinstall rangi martello swamiji dmus coppi delmas karak xxxii piso wasser charbonneau ozma wrens cixi bactria vips elegiac taconic kintyre npt kairat collegio synergistic lightsaber bhatnagar oiseau authorising zealot encapsulates committeeman spiraling angeline halfpipe beek straightening alvis topmost osorno ripen matins briand behn vasudevan ridgely unexpired cdo stornoway zoroastrians spied dau skateboarders ermita intoxicating philipps flynt machin salut kaman pressburg 
fantaisie homeschooling apcs guilin wolters marcher enslave bauhinia perros hornung sture carli mlk herold calderdale bair micrometers ravage immaculata peering amery moan dern pretzel libs rsp mexicanus yorkville cortese sorbian ment erastus gornja quanzhou poitier sayles roseau animism gurley reichert rudo sinuses altes photoelectric pacha sukkot toothless secretariats altimeter trappist ediacaran debreceni lilium atlantica dockyards cota vineland azlan panna malaspina anarchic dda otherworldly rustlers lic xfl inwardly fainted crean kuntz scents tuvaluan offstage ayuntamiento tuo albinus bechuanaland inverclyde stalkers helden mykhailo blomfield bedouins hku loughton kozak toons interrogators tripolitania bloodthirsty impediments schoolcraft tots bicentenary neuberger contented rearward anser tavriya kone pitney satriani liman birley mdy clamps gongs mamelodi blainville desecrated padgett astounded meshgin euan hulst collectable refunds preempted juanes zomba reformatting boyband supermassive gedo kitamura adjoined seg safran rondônia birt lamentations shimin qx carn bailiffs sickles flon deplored unease erol otomi fangoria undirected spratly yael ornette peddler pratique waldheim feltham supercell obstetric generalizing evgenia flippers nikko computerised ballina zang uerdingen bresse dhofar legalistic wark doce stuka européenne wagram buhl franchitti ponders superseding savin deportees fada airsoft soulja indiewire yamauchi hadji dechy hobgoblin wenham hrc shahzad stoops hurl luzhniki jowett pterosaur guianas sundari hubris mejia ory psychotropic tanana dwt kanya bernat binnie bps laparoscopic dala depositors baud rustica mikan larousse tellurium appleseed unblocks dibrugarh microsd hoddle halvorsen celaya béatrice bonjour barcelos housework polak mauer tamale altstadt prieta chardy gwilym parkhurst kramnik econ soane tolman clandestinely glu yagi archana accelerometer deca peacebuilding funnies avesta reutlingen interbank bci iveco duopoly mcgwire levesque 
friedkin malthus turbidity tsuchiya daniella yuvraj gopala quadrennial reales berwyn choc gastonia hooliganism redoubts habla thimphu herren awol aparicio sakaguchi patching divestment greif sheedy triglav presidencies gooseberry commonalities caryl teleports fancies djamena leibowitz tangents sasi patera unsurprising selmer eggleston zainal zobel unwillingly waggoner seabury kaito eckersley prohibitively mocha gilani hutcheson awash unspeakable bertone surging paro siang mmorpgs dobbins insomniac nta tolling spoofs carshalton culbertson wawel atal keto strangelove lato ranjith biron heathfield adèle subduing maximo nats viera capilla newshour crutchfield atone saeki squaring erle kiril sylvanus playfully miraflores palaeontologist hcp wnyc cadfael interrogating atar toten feigning yokota haruki tert jenifer tsk dewa unsalvageable loe constantia tapas treen upped leadville westerner tableware sarasvati starships reacquired wwa amboise stench cheetham stymied bleeds chutney belarusians hartigan holcombe divertimento ryle transcending vim trejo shakey xerez monger shovels homa tsuji yaqub florianópolis fluxes ebba bridgman johny stubbing adulyadej factionalism perusal stockholder quarterdeck bown copywriter psers garcés ramstein gelfand sujatha syncretism perfecto backgammon pmr borbón sharm heydar wfc partway spoilt mendonça tameside kournikova kashan placentia uusimaa wael osuna seaview cavalieri ashy breezes hubli bambara nonpublic newspaperman kru varian lilienthal miyazawa coworker tussock você manoir cristoforo technik busing roxie hematology bilder libyans koya atty gynecologist clevedon majeed tanto trammell hoppus belov diamant harnett ashot divo diffusing márcio perinatal zulfikar redonda minivan estudio teletoon sante perc patronising folktale metellus icehouse jérémy spurgeon bluth gertie berthing milf edulis schifrin anoka annunzio paule yuva ledbetter gatt lsc velha hyperbaric fairhaven ansgar subplots ucb duce eastgate overflows gunderson borys mbh aib 
dga vetter wyckoff camborne paston vakhtang martti weeps osler benchley mohanty kastner burgin redruth learjet iguanas inelastic villalba revs wolds mangum basalts henares tenebrae grayscale decrepit halter muzaffarpur zk hazlewood denizli rhymed rancheria stewardess transversal predictors reservist optus raheem capello brinkman cumbrian newstead moldings elissa circulates rojer pints salomé sfu telenor frenkel milutin fader szymon herpesvirus ravenscroft jue duras gasification horvat temporally summerside jeffersonian barzani magadan vulva carcinogen bogdanovich anastacia pollinator punchline bruck chhota lambeau stokowski sedona blinky chirico kobo ream foetus halas deactivate hatshepsut hazmat nxe reciprocate arnim rml stratotanker fado jrotc hannay eef raghunath friedberg carbonates bridgeton wavell stovall helvetica rafiq artical givenchy geysers fusilier baleen lorenzen gretsch bolkiah alts mcdougal estevan pedigrees ayo lakshmana icap thigpen tarja vallance neutrophils songz edc hildburghausen heald ducats endemism bsf playboys wipers trapdoor mohit milkman gilding uca loesser salleh lorrain obstetricians peekskill abolishment morland wałęsa lardner trygve systolic pallidus stourton gandhinagar historicist mcminnville scallops waterlogged moradabad crataegus socialista amplifying venstre paralimni unaccounted xalapa ferran communiqué ignites darab nori aleut mccune gratia annulus pipistrelle createspace sherrod sainthood kirchberg keun moaning intercom compositing vetch overlong surbiton vise tudela carmelita massy mycologists powerbook watermills salvio nmc trutv scraper lasseter hickam jalgaon cullman turkana jupp gleaming gripen grandnephew salé oort interleague wärtsilä osmania camrose janitorial arcturus creamer rhythmically duclos callow bitrate tanga cassano memorably forlorn ouverture peto lavey mulgrave tsp tallow risto gentian eliya hershel cyrille weeding godson switchover bayshore bpo mannerism freckles lbj archuleta buckskin delorme crouse 
heinze buckminster granularity thermoelectric nanga hosseini turton bizerte sunsets transkei deformations unrated arguements oakham linotype centralize stiglitz isidoro pumice citra adelheid abetting labatt entitles jayapura fushimi sorrowful sso southeastward endoscopy zapatero vfx raking keanu parvathi solveig embarcadero copepods guidlines lidar femi rutger pantomimes flutie dendrites superconductor myr ibises monolingual outram landholdings fujisawa leet doppelgänger skillet antonine tigger redact geomorphology pressurised fira delorean captioning rasch undetectable amira stools doss yellowhead circuito sponheim sinologist tollywood aquí gimmicks développement growling tain flashman vtb jarred baptistery sle muy mcduck glyndŵr safekeeping ordóñez seibert dempo unfairness dorcas meridians secaucus benidorm takako snowshoe pomfret oranje wildland videotapes backbench interrogator coniston gundersen kathakali nakanishi ligation headwater stam botnet silencio brolin homonymous torched mish maigret wythe sardonic oscilloscope taher mcardle ibge downe tabulation elohim melling excellently gostkowski tempt snps ipp pitti spasms cutty fatehpur wcl sreenivasan thrushes harmonization biosciences exhortation hereinafter aonb exuberance tete helmond gianna tarski jatin akihito marriner dodecanese convenes caries songkhla tanjore penafiel burmeister ayako zealots manik lunt arrayed interviewers bancorp bereft guiyang brydges skerries neymar lor mestizos msw echolocation manipal johnsons balaguer malachite horsemanship accusatory esslingen saff genotypes toft khatun horry imprisoning parapets spokesmen nawa howth clawson bere ejido akiyoshi towner wgc zarqawi sensuous riera sombrero lumping stomachs toriyama proportioned purnima bishan arabica playfield looe vartan quarantined disregards superceded calabrian fabiola maxillofacial wedged negotiates foerster thao geisel scoops pumila tremaine developmentally smet bloodied risso marquardt burk bollegraf unapologetic spivak 
unbeatable battlefront aoa disputation govortsova bisbee sandbar polanco spiritus dictation luján hellmuth grata tangail mossman zelazny bhanu interoperable minuit haqqani hegemonic bromeliad erudition becerra unfettered bottlenecks burdensome bolus trite almodóvar bereaved clubbing arvo piran rosyth gridley inwood ectopic biograph pkr pinhole upenn liebman franchisees mudslides pulleys canonically cragg imamate carrière retford bhandari romsey edp synopses gtx alomar tapper adopter bellflower cresson iea cobbett libretti alleyn nilgiris adjudicate stuffs marri reconquered kaki datong yarmouk massaro marchenko lalita sibylla yakutsk dagobert hovhannes michi venta arcadian crochet sako longterm pancreatitis innovated graphing bahujan circumpolar aldi kemi wladimir contentions caldron sheepdog southsea esr shinee hopton cots bradstreet prowler babelsberg miwa mauled wohl bricklayer ivry semites experimenters juninho amersham magoo powis liveries canta lessig federalized srf predominated docteur dorfman rego amadora karmapa chante kinematics spratt fancied mckeever tpm humaine tocqueville malory whores athleticism diametrically ordinariate chadderton plame qadi kovalainen lensing thornley saanich fcw salado pininfarina temperaments molnar hamlyn chagas lagan grandest ingots keener amritraj reinsert graziani pando almanacs herford betz kavya wartburg lahr sigourney exquisitely zdnet nucleoside bustle cabrillo thurlow tarbes charest contemptuous yadkin pavlodar hendra bucaramanga corpora jaxa salonga selectmen seawall betti flippant marigny radiates lydian mfr heiberg paladins giddings immerse nanometers orvieto saboteurs hermano tasteful helter yk straubing solti sidereal geolocate tolliver brio pacelli cays asma ravichandran bonfires legnano vilar proteomics gdf airbases esparza manheim bagel disjunct grafts kaa bedell tren espnu nozze chengde cella flankers rylands kord outclassed gubbio shoah risparmio colonie permeated bassa sibu escrow leacock coldly brandes 
abolitionism weft tripos frc smearing sectarianism yalu absolutist isadora trevi mcvey hosiery khalidi kudryavtsev norrie privateering yannis hawa predominates metrostars cumberbatch fiestas varman samhita knightly kanan bryden dislocations kathak sargeant subsea belair beardmore expanses rampal bardon bestows elkin oxidize stationers heterosexuality sanguinetti weybridge papas ugliness sharepoint carbonaceous reconsidering chiricahua werft universitaire abdoulaye mauricie wickmayer rationalize echinoderms kilwinning heiden subterfuge refrains cataclysmic odie potosi hippolyta chairpersons technetium ajman commonsense ringleader fof pillbox yoshimura panton misinterpret weevils volterra twc banister haiyan sarada yeasts bunton shahnameh ungar viziers alleluia peronist endometrial sotho frescoed rediscovering fouts savoury swart directeur romulan inlays congeniality shirk ablative vtol leitner discontinuing menai nak förster strategists sunburn gumball novelties prolifically supernumerary kotaro ashman oamaru stonebridge vei proactively stoneware jaclyn pagerank underpopulated belousov coupes pratapgarh cartouche spastic kohat staatsoper harstad socialiste meshes gauging goulart sundowns corsa goldenrod brockway trombonists alena nata hairdressers cornette madinah ulbricht cyclades yttrium maclennan gagne barnhart nenets melas charlatans mukhopadhyay tamim jetblue liquidate laudable bestowing avni universitat localize recapturing yara kelis dudgeon cobble twomey crucis personage gentoo thumping slidell prp retargeted masaya souther morts markku wilders marjory meanchey ogier unis studium kummer covalently attu howitt barrages resolutely ergonomic growl riza chacha henny autonomist linley juggler kross degeneracy bharu yoshinori suchet traver letzte hildebrandt geoghegan cortona lymphoid cajuns intranet rostral auth kurstin blackmon dibble lithographic cvc bainimarama parada phenom bux synesthesia ludo fitr whiff chalky diverts sawing informers switcher newsstand 
wedel langa penda harrods capriati suse blatter predisposed bunce staveley scutari kisumu adan hypericum mindaugas nanotube mcadoo hanif airlock apostol matchplay mlle tomita countermeasure fos davi rhodium wooldridge uhm conceptualization pref pinakothek cooktown piave mercosur lagi linehan frenzied questionably ebden muddle herc seepage fulfils encircles frosts dimensionality supersedes werther nelle allying holston leeches copyist victorias brinsley rabble sueur calvinistic madawaska vicarious aprile awkwardness tedx beater karoline indah nicanor englund yasuo cottle instalments goldblatt geopark bonk podkarpackie eurythmics devitt kilos conduits illuminator sagara counterintuitive analyte quelled pastimes coloratura etsu pincus susheela repaint ferber desarrollo cialis sneha nyingma uchicago cambio darkside tennent chitchat lbl spearheading taniguchi grampa redwall barracudas shihab parkour bravest cowichan aetna ventimiglia smallholders sajjad flicks netbook sonam testud expansionism permittivity protestations scorpius soulcalibur corina roeselare speedometer fitzherbert ardrossan lozenge striptease athanasios barbu huila palmes risqué chatfield spadea xuzhou uktv tribus fathoms valmiki vlsi dominika gelman waterwheel yilan cpg brule lxi kaaba involution tradeoff absorbent belmore tilman malia rougher rpo galatians tá creedence invective mateusz folate morningstar pizzas wasl brigands rfl reapportionment helton whiter iol successions pleadings destructoid fallback piri penstemon bagmati xlii auberge nscaa alderley sepulveda kagame ergotelis caloocan corydon dro sempre thicknesses caprica ionospheric smallmouth maximiliano ishak frigid lysis disapproving zevon rosamond lill soliloquy conquista rtr metlife profitably keystones rlc barmaid doosan arka dissipates valls arrhythmia greenblatt faenza rarotonga calamba pobeda pallid piedad kawakami latta pathophysiology silmarillion tanis surfactants halos straczynski srivijaya menahem distro bacteriologist tah varuna 
dependant mottram telefónica duong iphigenia akimoto xxxxx unfamiliarity arash reinvention guesswork barnegat pulteney inclines fenimore vesa stile transcended athist preemptively signor slung fraudsters yani shrimps hanyang congratulating cheech novy todorov enjoined unseat pila useable stef metamorphism oroville steppin toujours ferrylodge bessarabian concatenation surrealistic asgardian lowrie lynsey katsina darlin efl peckinpah curtius smylie afterglow faun hadar hurdy pandan tomorrowland glasser diatoms collarbone valenciana ophthalmologists entwined rotenburg alphonsus hassell veblen rsm atsuko menander ots wass entertains hasina antipsychotics paya pastorale ake offutt devendra redness autocracy checkerboard nuits zondervan assize angélica dok jaunpur gardenia sensationalism clamped blass demobilisation supremely taxonomists nagas rigour dharmendra positivity pwg lexicography bellagio merridew cristiana rhyolite ornata burrard ecma craybas crocodilians aos tromp nettwerk imager rajab sorsogon lougheed agr unacceptably carpentaria binney reasonableness conditioners langues homeschooled colla chandni recht udea kac archiepiscopal outpouring darrel bpd patrolman nhtsa gsi meso drb opeth orthopedics arabe broca restaurateurs paralytic thampi manifestos kaen corroboration pram realists lafarge racy jugs lapierre piemonte brag shoestring dein griselda janissaries patronised vari pekan oscillate estadi ftl neverending mcclendon hazell nels iupui disloyal ovidiu scuttle plage fistful xbiz stretford amnesiac innocently boehner hensel multicolored antifungal provocations afferent pucci ceremonially orators joanie depleting persie cowgirl garman snicket baltazar otoko lexicographers cbp tryst daystar epiphytic alimony moniz bruna frascati hollyhock modesta nickels zika peruse herta conservatories membranous kennan badri occultation uas marach leffler nagesh fussball sorrell eagan fru kaba neutralization disinfection tsvangirai podolsk castroneves swaraj diarmuid 
mounties hodgins conflagration sifting undertow margarethe niamh confetti ppr backdrops decathlete brookville gyumri diomedes kaminsky peculiarly raa portas vinayak iot fuckin comox samus sayyaf hage rocksteady higuchi beefheart caversham compaction morphin sno reinstalled transdev rennais denigrate blamey nourished kipp serrata ihf wos konishi squandered parasitism resurface commercialize compasses revenant marjan wynette traversal romberg historiographical caird tactician glazes sharpton despotism shudder raindrops reappointment circumflex gammon antiretroviral unionized ouellette miscommunication mynydd ddos fattah brannon victorino nygaard subclasses inordinate birdland kad transducers hoyland quba hamden cloyne pokes comercio osteoarthritis carthy skua funders antun malatya sacré strapping gumbo tehreek chauvel fazio rabbani dystopia commenter tangerang frère mí urquiza uap sloths isleworth rane glinda laurentiis oberland poesia anju bartle bashi spry dando yami yz concussions closeted uto runge xxxvi kayo underpowered sbi bluffton carbuncle efa lushan shacks ashbrook molnár schoen dube chesnutt herediano rayman injurious jewishness romina judgmental beslan parkways cosgrave carron kropotkin hed articulates menelaus canwest liberman merlyn josephs shor murphys sliven ofer deviating retargeting policed redesigning romandie distinctiveness alytus istrian zapatista abbreviate roehampton iza bullfrog aspelin parekh venable haug ruffo chock staton wangaratta rudin fenice gardaí cherian propagandists badia helvetic mestalla spoonbills magnitogorsk humperdinck sarwar halligan rehabilitating nila dungeness quam upwelling spotlights harpsichordist drinkwater brindley inescapable bionicle propounded glenville lalla tedesco khoikhoi slingsby nahr falwell quench eagleton appending cuxhaven demarcus rfm kirti imap benda unctad neosho talkative arjan dominator tesseract rodina guk preconceived mendota pozzi czartoryski rippon toppings abundances tiburon wildside looper erne 
kalevala lisi lysenko dreamin vada hypnotist kastoria amphora ampere arching oeil congrats svu stenton discernment contentment troughton hospitallers murr elmar nagendra josias misleadingly quarles lwt monumenta lix rajon echols dentate classique prolongation blizzards retouched erakovic gassed sosnowiec symon siliguri inflight disembarking aksu botolph arminius devalued lyubov oswalt switchblade toddy venda refugio capably beerbohm calc pastes microarray pocketed adu chewbacca stagger galba nathanson donde ottumwa newlyn shtetl gagliardi khana kasich kanta commending naqshbandi nosferatu blaxploitation styrene canarias patrie dwarfed revitalizing pulver ayre mcguigan wilsons balk compensates quinine monotonic cringe jelgava daula spurned dcm absurdist kiu spinnin breakbeat wisp stelios willems gaede hore bornean schoolmates bhiwani equipe hunza seurat unadorned faraj rossum servicio electrifying wj adelbert feldmann pso hesitates fothergill fletch wankel isambard basho khayyam benjamins hedvig gcvo bera equalized verdugo ramasamy connoisseurs bannock bolaños carrer negras ullevi gnosis aurum commodus everquest kron nyquist libert paresh réseau tvt scoured ewood jagan magnificence neem pepi extradite yuya mallika transcaucasian cordes gallacher soldat pinyon veneta walkthrough transportable ladin coffers kantian sklar encroached disavowed phill dolittle senanayake holladay fairlight backwaters metrication zubin gangland crème razzano twi cements gharafa negi westville init allocates capitols phlox wey agu einhorn tarango comanches salva uncontroversially embellishment maximally archangels destro bathwater berti uptight sadf striae bushels jetta skelter gardena hac tristis swope catena fock wenders haga rohmer urinate bawdy capitán lateef talas tbl perforations gordian ctx mateu walser colobus wickedness krystyna skyrock shortcoming glennon falcón memoranda bunga rescuer nishan farquharson defused pasión shrubby claymore heathens koti blekinge thal arik scrupulous 
drooping commutes hawthorns spasm tholen securitate spinoffs tro spanner kunio fassbinder ejecting displaces paintbrush schwarze bibliographer petting upjohn lacus overreaction scipione midgley fidelio laurentius vlach audra pediments petersfield ulam disappointments ascribes springhill chorlton gameday cardiothoracic proffered qaboos nisei flannel kesh insubstantial smug balmer dalkeith salama bengalis romany barbell limon darbar maly bohuslav soga utters matija ndrangheta ductile vals alok planing troma jagadish oddparents hambledon jolene lachine deadpan unobtrusive dennehy merian tellers tubules marcie allstate drudge valjevo talktalk gooden apotheosis cag peacekeeper tcf crookes duodenum subconsciously jogaila sitwell sanandaj budden orto paramus bugis cocke iki pavlyuchenkova tmp henke trotskyists pallavi grunts millimetre cardinale gül aujourd fredy functionalities intruded titration recidivism encrusted johanson strictures mytilene hapkido moped liskeard hirshhorn vogler tpg deleon reva detoxification beachy temuco massless dyfed oxygenated culebra genial louw levallois akatsuki ansaldo wands intermolecular achaean turgut akrotiri marja oneworld clementina rosalyn stillness kosova rattlesnakes papillae bushell transworld novokuznetsk bahri lito bastogne hysteresis sommerfeld rosehill gulab butchered stifled rupiah falkner nog thankless southbridge battleford inu deimos falkenstein milhouse dropdown francolin sncc ashbury pentatonic rosalia scrupulously copperhead haygarth alans gprs milkweed crappie recharging luongo clonal jairo coty hendrie roerich campbells ranbir normalize klimt webern screeching menton unterberger spla catt corday peay gioachino quonset strangling kurumi hypnotized diplomatically pontifex glenview amniotic chuckie ladybird multiparty outrigger insel tiburcio gargano mpr acción vats firestar unforgiving mollis ovc makuuchi geez wordings melcher aftenposten copse ciliary prepositional shimano niwa ofm eulalia rendezvoused foreclosed 
farida showered korona zand amalgamate breck zabaleta marzo intermixed mutagenesis franchisee shitty nfs corroborating fip poacher sportswoman rtm kalu jardín ashokan rohtak hinson eggman vong ecowas hednesford jada telnet transvestite grus xuanwu wic infomercial kohlmann flagella weisz interprovincial sarcophagi endow keltner aktion affidavits minehead ummm dragutin odra brundage eclecticism irrevocable iwasaki xkcd hannu altiplano manstein molitor mafiosi lepers shiina sof gsc campestris pastore cotswolds negates guna smethwick creel volleys baar canuck islamization comms vegeta estimations jorhat recieve rearmament mcreynolds didcot ridiculing certifies passageways copd moated archaeopteryx preeti coffs multifunctional jörgen morcha indymedia mendis msr taxiways khar wielder gearboxes schoolers delved expediency courcy toucan chaturvedi yugoslavs sylvestris petzschner colgan nitrates baki orillia alvear satsuki breakfasts hematite motorcyclists indefensible formica tiene briefe sahu issac kayes tulagi nsaids matara kutty farian penrhyn dakin spira boren lindstrom collison centimetre overpowers calms pavlos maatschappij giallo chipman bienville relict mujibur heb trioxide mpv arto danielsson mobius transpires hagi ripa bunter helmsley newts kauri sahni lodovico aspartame forebears alcántara quire iru testable weiler maintainers mudra nias tolland harvesters nicholl recyclable camilleri vlaams toppers dred abhimanyu medias toyland deluca riddim raney refraining ministerio eshkol ablett substratum ghazals bhagalpur tamarin einsatzgruppen rehashing aunty quien morne faridabad kasuga ammann godless cancellara truffle procreation sancta pokhara mvv equaling stenographer congregants muharram schomberg antelopes appleyard cueto vfds forklift gilliland coarser disorientation modularity agape schoolyard underpinned vill butorac authorise bagnall detested ploughed irwell blt diptych thespian battersby radioisotope vitally athenry maree jessen strozzi guardiola mandurah 
politica inquisitive lupino nandan moncrieff pryde ferric grabowski yoshimoto igreja froome clots ticked señorita budva phraseology lazarev kalashnikov partaking deprecate jy demarest manzoni nieman erdmann nicolette passat disoriented cristatus erlanger genki nocturnes condoned takasaki tikhonov copra trine wolfenstein priestesses costliest mahila sdr dolomites walkley jakobsen minder shariah cancun ulvaeus benaud skinheads kunwar tartarus unrestrained aeromedical preminger porpoises cheema slipway prut riata mardy isar gowrie epidemiologist kien schoolmate gcr soeda permaculture boileau yeates eyewall handclaps kem caerulea besser karun ising signora rinzai kava euthanized multiplexes anabaptists amaranth muralitharan meadville infocom shoo unavailability exothermic alexandrina seneschal ribose inhaling raffaello saluting hindutva soundness deyoung musial friezes goodbyes igarashi schellenberg stressors garrity diversifying cellphones porcaro scholten gra desha gurdy headstock vam horoscope soper plovers adverbial orin glbt helio mangala lyga sassafras baluch atyrau asada philippoussis holtzman lancing elca telfer congreso aang dhani leena sherburne parmar mamiya lda lewistown lts mercurial galleons boscawen impaling ragin jadhav ackland basilicas joya hillis potentilla uttam raden buckling muralists hydrazine underhanded cvn fie ratha brite astin probert ivrea mulch toppling earner delon philby hypermarket malika overheated wral stockyards trigeminal carbo sdlp sultanates nelvana khair clerking adal lamoureux reconvened essa yams taek detlef rumsey korfball bostwick docker airworthy lauter misia jovial kanepi solemnity francaise horticulturist cranwell ronaldinho saybrook recusal crispus myna starlings millipedes cogswell banzai contemporanea ltg diamondback agitators ilbo photographie ciclista belittle wisc esb padlock quando cronies shalimar harith indu zog ped burkett kaji malai abutting hitchhiking coram psh cua sentance tryin sinnott peyote furred unearth 
ero ultramarathon forecasters buryatia myitkyina outfitters prospekt velvety langmuir sah belden domhnall candia tuf heise scl adina launchpad urrutia ude tepper nudist tutelary thistles plasmids nonhuman uninspired agnelli guises tannhäuser peniston pande homily disparaged galea maneuverable positivist noreen hempel kapitan rocío staats grameen julianna plp moreso haikou elevates stopford monadnock ancaster riz inking caricatured bluesy poèmes hayao sidhu insides maximizes chandi expository handa mrsa nephrology alberton aegina aina kroon underwriters hir gasket fagin acd woolston ghali bolsover bogomolov ushering westbourne rearrangements trw vocabularies entrust serdar tangut oreste retorts kawashima frehley depardieu ruisseau umaga blubber strother tretyakov tesoro presupposes slaveholders thwarting xxxviii nagashima marcantonio gunships disreputable nyan voyageur sighs cornfield uts timesonline banjul highsmith dreamtime batesville markowitz songhai bunyodkor scarpa reynaud grinch holloman maleficent morganatic aramco tranquillity edelweiss dolenz buoyed niedersachsen gowen alc karlin adopters wednesbury inventoried lukács somerton coorg serially caceres rejoicing fionn iconoclastic orp tikrit bathers kurihara ouyang hake gorchymyn einarsson reactivate septembre quadrupled dramatizations karunanidhi bude smears maza toponymy usurping arithmetical vocalion operable ostrov khammam cetacean unfavorably odious spv cotten beaumaris abdelaziz schalken carus cutout constancy schwimmer dushevina nuuk alamance solingen musculature saumur slimmer qvc ntu tusculum heseltine dpa flávio abelard mirador shrunken bemelmans citroen raekwon arsen apna budgie unidirectional lemke yogic dromore workflows ktla greve deteriorates karimov cer shoving tfg fergana skirmishing memorization jigs deaconess strangeness hormuz trenitalia lade zing glanced peleliu vissi hammocks otherworld revamping stoica larrabee junker preaches bringer undiscussed freefall barakat cavell baily ebi hoagy 
boorman moebius psn babbar macromolecules hallie maintainable hainault vernier kaisha mahajan vibhushan hesperia romancing singlet wingtip claps discontinuities unobstructed seta rachelle chiefdoms kiska bussy supersymmetry hairstreak ocho kyla meguro haynie newsreels browner maka bleecker brazoria michio barossa tetley echizen mosh dutchmen burnin pini anastasios hasidism ritmo officeholders robards marsan kumaran giardino ewes arrigo solanki siegmund caraway steroidal dovid ashurst wallops vlissingen stane rodeos afoot iliac rpa kingsmill turtledove archpriest optimisation reflectance enumerate thinned hockley ahrens limbic triviality glutamine cornett southwestward esd gambill reformulated sna condolence unsurpassed naumburg ean derringer glaucous raffi churchmen brijeg localisation reapply erlang olusegun motorbikes paiva rebates padstow xuanzang gatto nostril bicyclists kitsune ohara snook hermite honeydew maysville miscarriages cobourg carnahan stylings grisly wille infotainment fma schlumberger muds kumiko jolo soest jumbled sapphires spinout steller furrows ault reinserting flac slacker jornal anoop timekeeping saharanpur spurt platon mlg savagely depravity yousaf mella checkbox tomes dungan vinicius doorn thien russet wesel turenne tgs bails repositioned urinating veering refrigerant sdsu grits kazuhiko bellefontaine weedon dib herzberg labia ingress parallelogram ynez talkers staub wieder musicological histological quenching crowes kix doma benguela newbold unaddressed toyah jvm copernican gynt naturist castelnuovo couturier alanya mehboob pinchas ossuary ihsan stabilizes clothier foolishly cassiopeia osx daichi intercut clogher aristides twiggy stockpiles icebergs willpower despina reith lofoten moriah imps gradation guderian croom mgb devant unevenly miike tagle smelters skippy lytham kaczyński tetrahedra kamiya lazuli perchlorate nebel tidbits muhtar seychellois celesta nakuru sousse discoloration minnehaha beeson geneseo yuval público deposing 
berserker multipliers naina kinnaird mainframes auchinleck renewals slush rykodisc sidonia ente jabs alamogordo rereading terrassa tomaso bachelet protea osun physiologists hayabusa pausing shs woong bachelorette lavrov beaverbrook maran conservators mirjana troublemakers subscribes facetious tcc lik kibaki perea vadis khoja sounder patapsco surmounting chorister alsos dayne hatted malvinas mcphail scheveningen juliusz ead tribhuvan jehangir mtl dupe bryne venezuelans adélaïde fois ashburnham mastroianni ogdensburg talbott autun collina decking admonishment mohammadi melli xamax lube reducible aspera upto leaped southam octubre chon moylan reit newmark galvanic lupa firewire bushrangers contractually comenius chronos deluded nusantara zoomed purley moorehead longacre syrah easel critter caligari kunlun dryness sandpipers polansky guile hornaday undress warhawk pern harum malate klassen bravado abdurrahman veg insecurities montigny khurasan nordin kristof quilting wilts dwi deandre amalgamations noosa taubman dinu skyrocketed malfunctioned porterfield knighthoods yoshimi annes mvo algerians bellerophon betta stinking berglund kiya sackler annunziata despenser culiacán eccentricities zdravko delmarva enrollments comisión belkin pith nuked stabat quitman grégory kerguelen mideast lemond bakke purine conjectural rummel cobos thynne acquiesced yudhoyono milnes macmurray schurz daiei haman passy falconi plebeians pss giulietta swati pawel wahhabi stickler panos jugular mulvey ikebukuro danubio alcan stampa harmonie flautists leghorn woonsocket gargan hagman morphy weinstock volunteerism velásquez francorchamps talcott bishopsgate prunella portes lustig renuka ravensbrück nongovernmental gbagbo mouthparts eula ween incineration allardyce sudirman swarup largs mayakovsky melanchthon fnc belloc ratcliff oif ehrman reimburse mantled angiosperm foran etr alawite fareed westboro penderecki unoriginal chirality tailless monsoons shmona pelayo babson foulkes emulates downtrodden 
berenson grosses krall knute regicide nikitin minke birgitta bpp mowing credibly heathcliff zulus boothe tib niemann invoices mael radish bioware initio sealers chaoyang isolationist szent canongate vagus modicum matar pendants coder nacelle brueghel wangchuck bonde flaccus popish gannet mse ironwork devolve ironing ebs beachfront gokhale pwd colombes blr hird segue tarpon stagnated ashutosh yh mucha malet lightening bannockburn reneged expat sportiva peruvians congenial unhurt vanes barrowman organics atra ateliers hernan frater duero riptide fons shingled laborde necrotic morpho godparents platforming terni ilhwa melford libros goodenough kersey mbr cyberbullying kalpa druzhba roel metzler pereyra lenten tollbooth danziger treading langur stonyhurst leganés micrometres mcvie opiate baalbek teluk kawada balin gütersloh crb tabbed aoba pitre aqa garifuna freeborn futa lari kedar duryea lowenstein vaqueros pdas shreds bootlegging coterminous bertelsmann calligraphers antonella carrizo ampang sackett tapir freeholder moonraker karaj boothroyd engadget mochi epub toluene bowell dinosauria sierras colic starkville archenemy harpe bewick flintoff nationhood winterton abarth regimens maciel paice nureyev santamaria inbetween coriolanus ferme agitating lorber denson nff spokespersons enthralled personae cherepovets reminisces demotic fervently farmingdale gir pattani cori byblos upshur eltingh grubbs friedel poule psychosomatic eves nots encampments readied zw schreyer tayler tecos simão fatimids días microcontrollers deepdale iterate statoil skeena vitaphone heyerdahl aat gushue almanach seaters alluring périgord unsightly gohar geste societe akiba razors oxalate nihil bashkir polikarpov marauding bacteriophage yongle broady carrión khanum handcrafted amalric voskhod suzette goud dolphy majora furor mordred hariharan nawabs vivant moja rsd mro friedmann orland bethpage ayckbourn newar yoshitaka lubomirski meacham sakarya lectern harpy esf kittery frew dumpling penne 
giganteus yair quedlinburg spahn droylsden painkillers muharraq brusque folies uncharacteristic cunninghame harvie puyallup brinton moret brazo terracing fudan traditionalism deena maryknoll barretto cookware harshest discolor körner fico pulsars yuriko khlong kudu kmfdm rorke encarnación disconnection khotan vishwa bicker lagarde scoundrel flirtation anatomic cornejo masahiko carinae polder assemblymen patricks tig convento ubaldo sandgate alki burglars antipodes iguaçu snuck hildreth mcgrady shenzhou tashi matz kavanaugh atonal saur raisonné loons kinnock zavod tepco earlham pivoting gaughan jumeirah hissing fabbri hirsh joi rogerson porteous hyang feliks gatlin injectors wasim latrines devos jubal kismet forceps katholieke accenture trafficker visualisation sittingbourne hatorah escapades pflp itzhak zürcher sorbus priddy karat sinop studious xenakis electricians pgk gsn enacts maile vassilis medes khalaf haroon harty président codify bangabandhu incan secularized sattler iditarod spouting wq timms mertz villafranca qubit buber anura destin tigress rectus beesley laminate martialed seashells alitalia kapital laika mtt fenix jermyn rohrbach montiel jakes tenma setters riegel promissory knotts interrogates greenough panicum nicotinic olean balaclava quincey langdale folha infamously rattray platini nariño junqueira klotz alcala relegating sachem polarised harlech weyburn atma artaud mascagni velella juri noma bugger batra vidar maximization aklan theocratic resonated tarkovsky headshot hurries buccal lindqvist jaune butlers aquabats pacts giray saic mrg vasilis teasers hamblin simpkins fitzsimons egotistical tork arpanet sakuraba confection furtwängler chyna ephesians cipriani bananarama pathologies coddington gesù kdka penryn jes multilevel typeset lantau corden azazel thresher mogensen khattak aag blasco flickering preys regularization olena yoav endearment echos isak villette thusly winstone pyrolysis hibbard delores seep younghusband kabardino wasson 
habitability shaven flatiron voest canticle bronchial workday jao kehl educations lenka reassert arteaga yavuz samy murnau sapna fastener morag spf seagal jamais mabry peony pwa stereophonic speedwell stubbornness musics stamper tranche aficionado gais logroño postponing kael polycarbonate polycyclic scituate macready grob kroeger vidor staal phaedra zyl faience langlands calthorpe bowels swathes lynwood amyotrophic toca rifkin osei faune compresses napo kreutzmann conflate yukiko lentz diecast garwood disputants nypl agulhas cii giscard iuris computability mothballed persimmon legionaries aedes heflin nasrallah uppingham tradesman hephaestus buru heerlen xtc napkin conjugal baoding zakat cloaks oudin midair introns giglio interventionist arbiters mycorrhizal outperform colfer bahini sepinwall robie middling backstretch petworth hulton hustlers aspirant centar raffle coquille sct tada nazarbayev lipson jeanine bhs raichur aiello sepoys braids incompressible suwa normanby uilleann unlink hetfield errata zydeco statically mciver erez inguinal heatley maren troyan squibb delamere utsunomiya proofread recharged wigram angora enveloping handan dashiell kelton fireballs pruett geodesics zoroaster ordway castaneda corrêa tld pav zakopane btn hkg dannii vamos neurologists impermeable pusey trivium billericay roeder chieh doghouse criollo dammit shiromani extents thome rtb attractor sardine tellier cruze taupin gdi sli raspberries clairvoyant contravene irn harringay haciendas killaloe ffb nainital bip sportscasters eisenstadt bask guiseley lrc graveyards choon dogmas seiu csg gellar pillboxes doctorow iwc plied molen maladies snorkel alleyne ligatures pez osb mso kinesiology dapper gambrel vivisection brücke naguib kcal divx uric baumgarten precipice eman ltv internalized naro defecation lookalike phosphoric mili bethe cowgirls caisson scratchy hinrich douce schuckert snorkeling heterocyclic rennae shellac incongruous padukone transportes ostrowiec danie spiegelman 
moroccans ltp fotos intercede kloss regress haneda krull bivouac mrp populi lampung atcc nadie fleshing bellshill sowell ambit whooping canadien unraveling comba retardant carolla kunduz fino fmc kathie ibom shide lapham puffed cavalleria tanz tmf convertibles hovey gérald skunks algirdas hiroaki adenocarcinoma bathory pasar yx eirik tlr tongs martijn penises bda congrès jaafar kohima gaddis mcguinty skegness bcp macinnes lacko hardt interfacing nicest cromwellian statuettes doggy agustawestland daun elucidation sheppey potrero gutta naturals skala epl ashington abutment imperatives caerleon recommenced steelworkers santschi duncombe blida renderer plotinus villosa falter dhu poprad igm tooele concoction polytheism komen bff untill raha skimmer oiseaux banna balsa minimalistic middens unproduced underclass mohali franzen uther bmd irie parathyroid megalopolis gtc puffs tola facilitators dreamworld poston ramakrishnan maryhill daren purpura aloys dhanush ascetics cowlitz pob ogaden icbms malignancy targum quads senghor nondenominational artsakh beste epigraph debaters musketeer amberg unguided alaa likening jahren raycom maler vestige changzhou pfeffer talbert moccasin wrights kadena dianna alg bettencourt randa rasool seele kuipers mainstays jaden ramus helplessness avaya bulaga arata changhua gowdy boykin discontented blick bhakta manhole capsid unguarded snobbish untied castlebar teeny prats meares tacos egoism aimée gms neuropsychology comando eme liquidator claudian werth megara lightnin striping olathe agostinho shatters kingsville preveza abducting panoramio icr buncombe valkenburg bakken authorhouse hmp praja propylene clo frightful kaku espejo pagliacci taillights abramovich jobson jeffersons shobha sandhills dalle swisher touraine pinched dundonald michail arbeit olhanense minis cribs baobab peripherally houseman sidetracked anatomists sandakan karolinska bendis warranting naively tortuga ranfurly meteoric sabato cabrini voir priories subfields spyros 
topsoil gani eek clichy sassou wobbly incantations pronounces cámara ulus emplaced bric declarer livesey petrochemicals advertorial aimer freestone jessore cleeve maligned froggy antica shehu ctr mcgehee strix tufa sejny harbison civita mdp thunderball departamento intron chika jousting mediaworks chardin sienkiewicz usace fts prabha nikolaev keizer zany shimer hoards adat prefaces aliso nuh swordsmen liens irapuato ditty osho theorizes cimino relieves bertolucci oncogene comparably annis jailer fanned mournful pendergast sambar asec mikuláš descents painterly benford sulphate rops emitters orbited preservatives manzanita maoists guanacaste cama lapeer tyburn coccinea trampling shapeshifter cryin acuff rhymney costco canarsie euronext tomoe cleanest sgr misbehaving ramsbottom levene elric resizing troposphere grandee homonym neela iheartradio bampton susi elva calipers stepbrother friendlier marilyns constrictor willibald kannon palmach delancey niazi hopelessness ifr dems curbing mortlake ater choline zvonimir nuria fending coons melons withington janssens stranding phenix freezers micheline golfo lampooned jisr nazario yarder sfs elphaba wearers preempt payette mccombs yamasaki burgundians imamura marketplaces cerium alanna glaciated tita abuts broadcom numbness babb arduino misbehaviour brp bazin dehn muertos misunderstands chloë imt microstructure arabiya acca trix bottomless intergroup totti heimat mccarter tanger finial namath etch rootes spiky skylights asterisks previewing probus renaldo bookshops pasi beaune crowne jukes keisha upp elkton dainty cimetière kandal benedictus feely milliner rerecorded binet malam sango mülheim sines tuy pertained moxon eliciting welbeck bushmaster maupin rydell segregationist makan duckett motti bowland substructure drang morimoto seda accentuate pto broads norbury gangneung cassady pellew shearman nutritionist nordhausen lucullus verstappen dcf rolleston rakes sonate yoichi kariya colli wyck billiton raby beli variably 
dokken shippensburg uncommonly perso travertine díez herrington disputable sida saddled pinnate romblon dykstra parcells hedy cabello saffir amani vaticanus starburst piceno toning marinelli aldred klingons reincorporated hebden ipr nibelungen wayang wapping eup agartala piura bobs bonet fickle secondo moderns gollob gentil cheka gook tachyon jobless parading isbell yongsan radioed skool microbe alcide klemperer gorka serendipity cybele askin sanu breakin grewal noyce beuys deform bisson symbolised hauraki infosys mcchord dolgopolov lindquist snob unopened northbrook svetozar gemayel abducts welshpool faribault kusanagi coombes mòr washers canvey pln carding blalock whiskered jetstream forehand carinthian sympathized bfd skatepark sonali sieben drummondville metatarsal inefficiencies ghia catterick japonicus choppers mackaye santangelo ismay rosanne alegría galil ske pemba boxrec pontoons conveniences wallpapers amie goldene artichoke legalised ovi jirga burkhard dhoom vociferous statens mckechnie colombians swum butting inayat groen mulgrew frock magar lumières offsetting wintered andie saumarez dhruv sushil fermo istvan boonville cornerstones romanticized newseum howick florentino xliii sigur ł cichlids trabajo zea poulin spada bamyan liquors overtone analyzers mascarenhas postmortem borrego suitcases raad wantage engelhardt gonads gainer polytheistic polizei firman quarreled melle huet decadal headlong zeebrugge stx frink clacton pogues threonine devore lebaron tancredi secularist currant protrusion fmp zune castellani clarkston tarquini veni ellerslie nasri pavol villena sashes kadri blanch perennials iconoclast lynde ecp despotic kazuhiro twyford captor exemplars rsvp dialed blais trice waid aliya acrimony feint cotes gicquel vana monnet evaporative grandstands juhl cryonics wilk egalitarianism generalissimo campy somersault almas poppe tolosa lommel runnymede yverdon grilling nuptial wix sikes touro outre kinmen methadone holmgren kuri anushka luso zutphen 
tyrion craighead moka relive flintlock unbridled kapadia engrossing terrorize dizon auditoriums shintaro marseillaise fbo mismatched chartist bayless carvers mukerji nyerere shira zemin studley dawning hastert lewisville looie tuscarawas svante stirs nsr sudo achebe bifida ouellet didgeridoo aea culverts sturluson bearden exhausts feira personhood garofalo keres quietus lcm bruns aleksandrov hirsuta counterfactual sonnenberg shula pocklington scooped flatbed plowed lif estimators merida warez zawahiri huq corda spurring siv monumento kuba pachelbel indochinese ushl tozer gramsci rosenzweig cne lcl khufu musicality hirai liss optima ttp whist chicana moriya telecinco pcie uman mcgillivray strived asper risley alphabetized blaylock noranda feilding hawn talmudist adare explaination typecast probed clerked stipulating chameleons usga nlt tnn catatonic sobers llanview burkhardt hakata nidal feuerbach lechner rocchi metamaterials tatsuo closets chana gura bitey supple coton mandaluyong clas lyte deutz channelled doboj valance dimple nivea detaining landward nutley gemmell vandeweghe kadam coroners pangaea nilly edinburg subtopics uncomplicated gowan waistcoat nld racketeer bnl jelle figueira ré npf hijos glyphosate wmata kahl anshan tabid saveh categorizations promulgate chartier pharm shue neoliberalism shreve aird risorgimento tractive cheonan letcher deprecation deftly maurus halibut comas yona oriol roshi dyslexic jacobins gci dusting func iranshahr magnussen chowder fratton rheinmetall antero koestler frelinghuysen reanimated sharaf firmness neurologic gigabytes biomarker unconcerned shoko retorted mccrary lalitpur bobsled pequeño jovanovski uncategorized phillipe bcg mgmt mugshot soundings oxbridge abdu yerkes cranbourne reiterates wanderlust macdonell overworld ikon citv intelligences sagamore glennie talca chiao saboteur issy thiamine jz flatten defensores bardic hallo aon alin furlan refocused mokhtar teacup sedges recklessness assn shankly stawell occident 
commentated horseshoes rado frode infierno peron neave flin mystified matchmaking bardstown cheesecake liquefaction sedalia forme grable exacted waddy davros stepchildren sfio waldhof moberly manes fluctuates yazidi corti retraining aby cromartie saison tyga antara melamine goalposts bolsa fick hatters confusions taproot syco sicard capiz romuald absa clothesline gripe paus shopped hobbits cornucopia quartering toland gena guyon stiffened freelancers celebrant bratz raisers fermenting peritonitis iola radomir gajah hampering stobart hcm contributer afg hockney niños consecrate holzman mistry federica willkie redstart tauri hund pvr dolina erratically papeete marischal petes strafed actuary supposes brune unstated lupita numero radi houlihan breaux referenda festivity wab prioress sveta moonlighting nicollet amager sprain punky providencia sefid srm universalists igcse cinematographic lanz morrisons chicopee celiac hirt baytown cosas sète jere sympathisers melvins terrapin pinner xps deferring nofx peddling flatulence gide supremo kusatsu placards lorelai suleyman scuttling heure hasselhoff paramagnetic rapide pharos gauze hosmer omid sailer minx reclassification mcdaniels excitable korman vaidya switchfoot hafeez reseller mycelium vania psychometric nonspecific ultimates barden meles pagar reviled dorsett lags wam ignatieff morrisville liberace tempelhof poise lacrimal aruban appropriating villefranche risdon chalcolithic watermarks orde cobbs tolerating goldenberg ameen pneumoniae mathilda resonators schenkel wertheimer imro overestimated structuralist kettles woodroffe knaresborough savigny diarrhoea bunyoro bitterns pieced trimet turnstile boyden frisell damen ridin mima lez fete trolled lotion pinhead dereham gille kraven chok bushey megaphone tanahashi cardin nondescript boli draupadi spey ladino finbarr magny cartons barty memorizing muta autódromo prophylaxis glazunov pendlebury roosendaal shabnam brocken uncontrollably tynemouth niort fontes droids skinks 
tabas lipoprotein madhavi ulcerative repetitious dudek deron asides microlight congratulates anantapur bhartiya geographies threesome barras wethersfield darkstar confederated metered meunier angelico ilkley struthers crutch fichte idyll baylis furthers mutates chlorinated posadas closings aromas nagaoka trak brose wok chavan usat comptes breccia fédérale handpicked locksmith gouache delphic idealised tizi concretely sweetener enka ponti progressivism nestlings tibbs lutes enthronement picturing fennel stewarts scandinavians oboist branigan zawisza mutya khuda koprivnica mcw jarrell xuxa arsonist dzogchen neptunian elroy residuals quackenbush dtt rudiments preemption gwendoline jsr rubi uncharacteristically extraterritorial poppa mateos gottwald demirel schönborn livio atti icse awlaki stragglers telkom voltron rethymno mopping ecclesiastes starrett contrapuntal rollergirls dharam ouija bronwyn abhi eiko triumphantly suze deschanel hovered walz liqueurs watterson barro moga cdna dishonor csds echeverría formosan claudel sandhill rozelle yeshivat inder slumdog incompletely bourdieu palettes lannoy stoudemire ferdowsi harappan tuomas longueville smattering headband voda bhaskaran barrois larose nuncios botham etowah saal macgillivray entercom zululand tyrannus moche schwarzer vitriolic silvano olongapo adelina ziad hither eran khosla rebuilds seamanship curio basuki disinherited conservatorio yaga cumin jaffrey graça geochemical homem buhari blevins storyid dais doer wasco noland pelotas medline shuang hemispheric tetovo gaspare rationalisation equalize peeping vítor masterclasses ittf fledging grund pail eridani kakinada disallowing abdou interconnecting romita surrogacy glycosides interfax pbr leverett crisler mcadam meac expletive murasaki binders philomena kairouan painkiller purves mtm stratospheric thermoplastic khaldun caput hasmonean chirp minnows taksim remittance disobey oystercatcher kennon polytechnical newkirk bellefonte gomorrah generalisation backwoods 
winer lubricating interreligious radiometric mortier sob dafoe bulimia lingus silkworm shakib olinda kanako berrer rspca gating devaney smithville nguesso gorgan caiman erdman focusses aniline quacking keiser modelo rousse jordin awning nwt anole nikolov anke genting antipolo advices cinemax saaremaa rukmini suffragettes aphex sothern lunsford muncy nollywood correlating moorlands gou selle merseyrail bijeljina steptoe reynosa entendre cowries amisom lewy persecuting pettersen sturtevant transcaucasia mendicant erigeron bronstein netherworld pygmies younes ksa kocaeli ballin degenerates rediscover bann mfg sulfides tiptree concurs masanori runt saylor naomh glonass overexpression wpix belgica agron kellman fowles kadima epigram wisin falconry mapleton setia okafor kazumi personalization koli spetsnaz crawlers malabo gusty contaminate mckeesport dugouts valentia neuss chouteau molik bruiser accompli holzer tattersall softcore cowards taxman cameroons libres erdem wusa chihiro dlamini sprigg darragh velo spirito sorge regretful kallis plantes sumac reopens cockermouth paralegal disassembly kadyrov subdividing sangakkara rádio biscoe belltower psf flaky shere impoundment scrapers bierce lightbulb powerline truffles debater patter virginis casein carpe dingley neuropsychological vagrants wolter bldg vanua njt estados bucking hieroglyphics simoni mctaggart sga clampett fussy swatch mader dribble usma sandon stenhouse orchestrate witcher symes kotla neelam stigmatized showmanship smi agrigento schatten vanzetti councilmember aiadmk dita cittadella murtaza revilla senussi gcl otros rotherhithe cadman oncologist soham escudo barbarous colorist poc mafioso hsl fernie kurunegala recurved gsfc minar capon catesby propos ismailis schutzstaffel wyse dejected mercyhurst matei idm kayser passerby fairclough geauga sublette finnigan asana majin kretschmer idate uehara ringmaster comandante jocks nordrhein lfp shikai tehachapi afghani greencastle mundus virginiana dioxin sunspots 
carley saari zc errands ega sanyal obelisks privat sylvian waris mujahid rsv notching meghna ahrar krum chiropractors drainages marky tenfold trentham kootenai raimondi sybille greenlight bugged fakhr schaal ddp cranked iifa universalis sehgal kage gohan gosselin konyaspor eupen zinoviev puffer leasehold attuned raghav pirandello raigad attentional rosey annabella longlisted morang weehawken savchuk stuarts overwinter pronouncement sixx sanaa jls profil hele khatri smarts dizzee gullikson distrustful petronius tarred detox undercurrent forney shuman turlough blondel luminescence halmahera dzong limosa philander arrieta rola yoshinobu starrer rubidium weaned telekinetic desiderius anteater liban inhale lemoyne trilateral snatches downforce kalmyk plex nimmo mackillop vaal aldgate lans kessinger unhappily joventut jeffs gigantes shama alim partei amputations balla factoids drywall pyjamas gibney millstones planking ratel tuberous unpaired pry yekaterina westermann cadw hovers goddamn mkd inborn barnacles bauchi haras yugo precast cuéllar maman testator urdaneta hobhouse solidifying summerland sprouted rodwell idar evades litmus senescence tmd middleman aint lunda manis teasdale mortification zebulon archivo fréjus swarbrick docomo sekhar lynden mccreery chouinard crayons signum yevhen peele apropos opines kovalenko falsify billups tustin larkspur mightiest krautrock silencer seigenthaler toyoda chiming hsa lapped specialism dammam adornment burly exhumation bevis dalkey valentines sunan spann hangers expound soldered hito cinerama intramuros izquierdo maltreatment demersal doyen benbow bub unm banqueting puntarenas ananta bluegill ghostwriter arius montaña tej gse omdurman frown courtesans zico tianhe rubs mien tendrils wisest anette sikander farscape historicism envisioning gren stackhouse oblates flexing belligerents rossington lep glean aparna repairman henge ulises odakyu bestiality ensigns reconnected smothered kfor kostka cotterill subjection harrold jad 
matagorda bakugan hasakah eitan kosciusko uti arenberg perse fornication doctored geld panicles gaskin kibbutzim nesta spinnaker nares bobruisk jagat christof aérospatiale bargains wirtz zygote erykah lawry sneaker iiia decimus savi nadeau loney putouts unfeasible cometh knievel valkyries shippers transsexuals centripetal erkki pectoralis inflections imperialists motivator misato protrudes garmin masques lukoil malle schlitz ferrets malda marj ooze fayed uris walmsley crushers kipper combed sinjar intruding newfield gibran xxxix muniz utsa wilkesboro penciled intravenously tehuantepec twang rapidity argentinas pankow guyot gratiot diuretic jankowski marray validates harmsworth dugald disa irrelevance landover theorize fractionation alessi midnapore woden gergely miran reisman brigada gourlay caracciolo alcove triana steno coupland edmonson kayserispor bexhill paltry joystiq faltering kurri bisset padi esotericism hilltoppers homoerotic senta vampiric cantabile hadad silences stylists pelli jadavpur hbs pisco dihydrogen complainants pwc isosceles ruffed rigdon slays amarnath renmin rehired oncorhynchus commends rehovot arsenals baudin ratko stroma soundsystem shylock broyles ilham pioline nahi southerner cirebon cibber reefer featureless sundara salvi fillet straightaway pronto sfo unhindered youtuber rubella openweight jx soba carmo unga houseboat chordal cloistered uncomfortably janey jobe depraved iarc nishikawa straights elwyn droop unquestioned teemu vina trialist jayasuriya ungur reaping sentries trike metallurgist rampaging unpainted reynold shallows traum arcata nigar groban tankard herpetology garners iscariot xavi rias gunawan couric gravis borah sables cdad citta gasworks tessier pizzicato ambiente kolomna tilda algo rosin colas cardenal pagano khedive floodgates tissot nudibranchs todt pragmatics cattolica shortfalls westfall ahora guanaco borodino peritoneal oxygenation thame icac brandishing reappraisal winsford swaths marae bruun wack chopsticks 
heritability reda hsun giacometti dever gollum yeshe downlink dinars fontane mazzini emsworth velenje nalini vassallo cloverdale saini epiphone theatrics karine eponyms sagittal carmack sclc transpersonal tollemache coolers delonge gromov pyrenean blancpain kumble petersson tugboats marz rebuffs indenture cte melos dawe jésus kennels bawa lengthwise carin haussmann northolt dawid jezreel cros zhukovsky mafic bme yelp murano zárate malbork muto pettibone pageviews commandeurs gekko hiroshige belichick velu wallasey ceballos forecaster ljungberg abounds chiari merci redecorated girardot caisse dunder kennewick jönsson marquand bausch ridding marchi mannion michaelson ministering lamarche watan tonks iie camelia lemony vergil domodedovo janitors utr hoyo jonze marios thapar shilton ponytail rogen lumiere lunga chron densest bova transnistrian retrospectives mapai jameel fauntleroy bunin sibilant longshot mook menudo neunkirchen baltics feelgood alamitos lippmann chanute vandalia aranjuez moyo fehr dells adrianne attenuata hedonistic suisun magnetometer moria tingling phe dentata movistar snug jutting scalpel varia chakwal damiani bibliophile scd pieris brokeback dacca corrine ordos lamport depositions craniofacial edrich cygni dilworth catharsis circuitous lalu mln rya quasars eggert offload shapeshifters hortus sapo volpi eschatological munday insula kajal gruen vélodrome minutely boman cobh assuredly noy roadsides nido blobs krona welter mayaguez iuniverse splashing hern sweetly kingsbridge brundle holography hashish puglia polley tyree naum dieudonné woy ichabod deighton fratelli whelks armida bladen bcb rafinesque befell ajahn demeanour kremenchuk longbow marinated roused sinden nivalis evaristo spectrograph goch middleburg geral flanges rgs postgrad flory milstein epigraphic sharda lector thenceforth memmingen legitimized mccaughey prata corvo continuations shush nahar givat neurosciences flann humbled arvin mosel alby carberry emr crisps stormers weizsäcker cien 
ransomed tancredo cavernous alcorcón killigrew guyed talwar élan petropavlovsk xang animaniacs phillipsburg dodoma evidentiary schembechler facelifted nyx shoup greentown taxila adirondacks sait eachother whincup sadism tuxtla goldfinch pretrial pistorius kunda triunfo ridgeline dogfish langen foretz stoltz sterility krasner cubicle youtubers tamarack thorold chamba sinfield cosette cockle ansan jumpstart whitson jihadists yancy kaskaskia wrestles touting kanak overshot ramenskoye lazzaro tortosa downers maschera comunista laff frentzen bardi aksel delinked iwf dufresne rdp schacht configuring ■ enviro wolstenholme ntl aggarwal tindall tempel gameshow impartially mendelian jadid sixtieth botvinnik stritch untapped hominids hooley junoon trabajadores ninomiya macrophage religio blueberries prequels busking dopaminergic rbd circumvention gambrinus allay seagram boscombe krim encapsulate mirada cinemagic pratibha chevelle palanca tidings iwaki housman momoko pharmacologist angiography basha dissociate corbeil crayola lambie nishida pontificia namie mowry hellraiser chuy jourdain doki persecute fiz clausewitz tableland salalah marcuse siskel chandrika aggregations pathfinders exclaiming unreachable endangerment bouvet argh jetstar mobo polyp omri cassatt transitway uinta bums personifications pent golconda kimmy hinman byd unifil clung stander surin scheepers biko poliomyelitis nua krafft wingless littlehampton flagellum irène janta midsection minimising righting fairyland medill troppo glows pili gillman vevey pyo shootouts liddle praça exd datasheet pergola sleepwalking precipitates ices undefended seibel selsey gardel willemstad adjudicator longish panchen dumpty relatable accreditations unfunny caitlyn disciplinarian rajaram passé aeruginosa casebook agc cytosine xlv decal smelled nurul livable ossining pantanal salesforce nbb daws qos galliano afrikaners chronometer viljoen wiper bimal lamberto kenobi zork gourds glycoproteins enescu drawdown traktor nyborg shweta 
tobe proterozoic verrill ynys autogyro gangetic jailbreak complacency rejections revisionists redeployment discours castille ridged tisha streaking pylori viento crevice jedediah evermore genteel warpath eutelsat artesia konstantinovich pedrosa cbeebies rial tutto revolutionize adios tajiks mcduff toshack ryong attwood dbl ragnhild decorators hadera sabi flav mutinies implacable antithetical bowerman sikasso terranova airmobile ondrej watashi dahan predilection kob placido blockades mccullum mobilise meerkat doux boosey ungainly plagiarised tantrums charney andris necker orpington fourfold proofreader obscenities smee albina narváez ceauşescu irrationality telescoping duvivier entrée dogra wiegand honegger saru mockup unfocused pian alida wittig refueled lortel automating attainted airpark diggle minchin herzen josquin sobral tetralogy balochi dop anesthesiology roosts caney texcoco bethell screed jujutsu braidwood rov leadbeater mero tivo mystère mousetrap otwock ramzi furore piazzolla lozada valiantly griese ucsc mirth americus oeuvres gavel guerreros lampe pasts letterpress dogger rumah thoroughness sinkholes siltstone tni redefinition chapple rava embalming hunk thacher peshmerga commentating koç skelly kitagawa poore khoda hoole gamera fassbender kerb batalla spano freyja ube lunde inflows cair unnikrishnan dodgeball lesueur guarantor sentience mcanally labouring kapiti eurohockey karaganda départements okey dingoes impreza lela alibaba pua rollie malfeasance kovalev alpe cilento fundação abyssal sehwag tibial chambliss bojana bonhoeffer levying rafale hsiang tunable featherston twee markstein foulis intermountain chunma loony disjunction casillas convexity tubal ditching dialling filippi tauern webisodes kocher jagir lfc groote bluefish feingold sumi scoundrels trc forearms nauseum cognizant opportunist mannarino habitations dth bloodshot lloydminster mitsuo diffuser rochambeau nock knave jory estevez urmila saucy heartedly tsx bycatch yoghurt warthog 
lakhimpur swv ligure buryat analogously panch dewayne cathedra surrealists gravina epigraphy poonam mcnutt yeoh gilboa unattainable codd iberoamericana toews quantifiable girardi cosima mostra wristwatch merr marwari baw decibels giovani akbari burbage pandava fourie dissenter ormiston annotate phoney busses northallerton glans yatai konitz methodically shinagawa golovin gair xlvii abet bélanger varennes hathor supercross imprimatur nsp parsecs bluefin callback twisters seaquest lugger tawa rmx serengeti tonnerre quem crippen taa lethality avr traylor bossi hdp tannin scheckter arjen hic ursinus pinsk tunguska herculaneum walvis cnd ribas inaugurate stormfront eudora cultic biomedicine bermudez gossamer namm nass falange meaty cil granit genocides gstaad ssu feuer contraindicated gombe lupine madani humanitarians kunar mutualism chafee hafner clamping méditerranée owusu babette pinehurst diener clattenburg dissociated seixas hallé insead bellew intrepidity imagen runa tripadvisor thinkpad rookery clubbed memon souness hoag sayyed ormerod shuler dcp timah ocelot walkman walgreens ovum muhsin thevar squids samu musashino viaggio folketing hosking vigna triplane jasenovac babangida postgresql gracile endocytosis roadkill murine celica ljubomir arcangelo mcgurk bungle rapped kwiatkowski totems raum kiruna isin uke incisions cantt biya cambodians reintroducing tins denney picador spe avs patois criminologist dace uol suriya perplexing jamiat laces piya emea herlihy hmrc flatness friz studd alopecia narrations kph miso tullius cocking marins afterschool capper sev piranhas auteurs niet freitag coosa toffee beauclerk neverwinter footloose cookman morons pook unpunished stimpson gutters faggot foregone hyannis tynecastle modulates gabel andalucia osipov puli eguchi farnworth justina crips arche buenavista microbiological inventiveness francophones solly machias caliper searcher urvashi brandão invulnerability quash eurostat caravelle roping dinka unplayable equivalency fwa 
connotes aveling instilling lso schwann lumet begotten grouchy tamu infective sepik xxxvii izmit kevan giddy contigo upr rfp sabir mayport wss kinki longitudes bbfc derisive leva tello psychedelics havasu sergeyev makar workbook participles coit arago lynchings keo bice bano quantifier stooge nueces landrieu miraj wallowa leonis nobuyuki striver insinuate preclinical magmatic caillat caccia deflecting empted dunston kabaka polyphemus drakensberg bando pema riverbanks interlock kesselring belper interdenominational eldredge maku trias kriek keef crosscountry moloch agriculturist vandalistic puyi sheared fisch scrubbed morbius ahmadu achiever patanjali uninstall merrily davor telegraaf orono frizzell shimoda naturalisation chingford cristea petrobras radiologist bowlby godwit tatishvili yorkist ratt nls sawed lebens torricelli iqaluit abacha recasting schouten situationist efficacious percolation carpathia trotters mannequins particulates lankans colorized srna salat bercy tarver immunologist boogaloo leman meistersinger duncker porth subaltern carballo spotswood easa prostaglandin tew hcc birbhum haplotypes neubauer millidge tice movimento kiro christodoulou antonello otomo flexed baits fortuitous brotherhoods dissecting joc arum pcd dwan buscemi tanned electronegativity simha camshafts cottontail weo dabrowski granule debauchery chiller fantasma lisette timbales gulu sce frelimo irby rsk meadowbrook dioscorea baldi gaithersburg fennelly burdette funes fledermaus mahdist pikeville beiderbecke quetzal mccarron sutil tevez goossens cva paszek jamshid uprated unhinged oms cartoonish mahfouz deok mckernan retaliates glynne casado horwich fresher glasnost campeones junco unsworth finality ejector dimly boyes unearned subjugate kirkwall tramcars clea ekaterinburg chari engelbrecht chom saplings quickie mog scab megaton allosaurus jyothi blodgett goliad albatrosses lerman cheshunt councilwoman melancholia sisley appraiser guajira aisin williamite iplayer buxtehude tics 
faut scheffer puritanism pearsall whittlesey intermingled noguera rhona oberstdorf clin anan bulgar stegall petrosyan alexandrovna plosive ghc fryderyk areva cephalic mishandling weidman kein immunotherapy burckhardt bujumbura anvers frcp annul pless plies yount pastureland stolz sceptics telex tyme lattimore synch niclas cress brannan epson fortean appraisals stepsister antonie lactobacillus routh weightless breaths crunk bns sojourner rookwood circumnavigate recaps ipanema blis masterclass bluewater cento epicentre languishing grilles senkaku cationic headhunters bhargava chim joslyn bourdais monopolistic divinities bayh jansz chlorides heyden rostropovich shandy wisner arema nrt anam aldabra winnetka buffered changeup mcconaughey bozorg jonker approbation andresen kareena ravenhill tolna kickin vaporization kaif reconstitution zarya bandara keillor beim flavian mitsuko dror anthemic lugs honk linc ume mccown holgate murtagh newsstands rustin schöne adamo matabele vardon overhangs innisfail levites regalis jamaal skirted adorning meilleur mirchi kristoff hsiung ayan eatery plon physiologically newsarama thracians tauro urartu wuchang env ayla protists decimation ziv subtilis saddest spyridon ohlone musick emerick partridges joof jarno callander tomomi logitech ruhollah daddies prunes abri abbesses najran ghaznavid doodles samrat kathrin nightmarish badfinger ostentatious hoople deduct hsm grotius gaijin gusev uj baila thiru borman spaceflights bermingham patina unrecognizable tittle conversant contaminating aching resell froth erythema biometrics urso madejski artifice whittemore serafin loraine quills treads tootsie juntas paracelsus lacunae revisits nutting stealthy invocations hoshiarpur transgressive slings gfc liftoff comelec tci unitarianism sawant waltzing erato espoir tonge fanfic chessboard sneezing oiled jughead magan galant bodkin kdp teatr ormskirk obp bhabha setar pillay glushko tanith arriaga cyclase kirksville itanium mandell soling septentrionalis 
octopuses dryas liew subglacial wallkill corunna truong couldnt nothings normalised orgs mcmillen superoxide shifters slg complementarity leukocyte rainmaker loomed antisense goel calkins kermode ginepri eci bilirubin edgard gonda nexis khalistan mowers tena mazembe benguet reestablishment slayton derick mycena nwr folksongs zarand gimli dobra kori stringfellow writeup aisa ldr mossi wheelbarrow anp astrobiology windus elista finales legionnaire swinger weatherby dreamgirls deuxième hijra pippen rifting northup chasseur macalister cranach cavitation jambi brathwaite janikowski microbiologists impedes averell bhola dox adena karuna mujib virginal soundstage henne hammurabi waterston pyrophosphate rathod bethanie facings mousse almonte greenwell kraj applebaum vinogradov pager bowring overexposed americanization glycosylation katniss adcc hitfix dowding crespi marshlands benatar hoad gauged gholam ostrander ashi thieme wallflower rso whitehurst anaïs ardagh wail construe bobbin muff gavaskar safra brachial burghausen loggerhead unsteady illa unfiltered soapy pineapples hadiths saiga digitisation farrukh horna facsimiles wrekin whitcombe tremble meridionalis slopestyle ddd schock hashes imprinting norrland midhurst hammad koop shamsher amerindians btv empathic ussuri kolberg mctavish merkin beehives adak syne hypnotize geel multitudes railroading neg mapp trilingual supergrass westmont vietcong exacerbating petrozavodsk psb tari menin pompeu umayyads hartog countertenor carrom rambert rossignol gigabyte rabia hsr iveta overestimate troutman uka overalls jvp frontiersman iquitos manabu daphnis almeria collegians angelfish pramod pallium iha transhumanism americanized yeux samizdat dunton kneale tante jaane toledano pasto thermometers cstv fmr capell corsi sayuri hansi volcanology navarrete alge kuen turnkey skra psychoanalysts wallonne flirted matriarchal equalising cerretani excruciating eugenius collegian terraforming bludgeon loko diyala paternoster cattell kuzmin 
phonics godsmack riverwalk waterbirds ahimsa deliberated ahlen assuage siddons allon peut tomasi uba bruijn clackmannanshire nanticoke rosner cingulate barbatus keshav inquires oreo sissel poeta merwin ffg wyllie stegosaurus moulay keystrokes fowley plaça rahn smirnoff fester hadden thrifty siddur naish rivaling aozora mccaw tumuli wilful outrageously musculus slezak fellatio destabilizing lenton dake butane libertador eschew stradivari tiepolo pinches nyanza helplessly loro competently fume constitutionalist brainard anastasiya interjection shoegazing palio malloch resets borgnine lignin knockin weasley neodymium dierks berankis eadie fishguard dismas glances templeman merc ilana phillippe gaal basle parasympathetic stereotypically titov verily bonnaroo polytechnics koning macri dogon wigwam tamir walküre soltan khattab persevered castellana maunsell reticulated huac podge ummah guanabara cissy perp arnoux repopulated photocopy mogwai brita glenbrook balder margulies udaya verdad guðmundsson swac nq guadeloupean fantastique pushrod calleri folksong almshouse titicaca infuriating blowin mineralization interbreeding agitate yoni tortuous rivermen preconditions cwc exton downpatrick spruance dancevic policewoman stenson marceline divino eschewing hola tosi plauen carpeting realy chinaman droits riefenstahl mifune panavision haben woodham wilfredo mcleish repeals heady romulo creatine headsets taitung bursaries unicellular sudeten canola ringtones safford pecuniary prabhakaran ledyard cpb menno verneuil barbeau supremacists edom verus walkout ninjutsu krško arcseconds eren fenech cockrell australopithecus ondine henn dwindle collinsville heiko subcontractors dejohnette holster greeneville houthis railed paisa shepp perusing bluebirds kaunda beltline dmg chauvinism aiga thibodaux gloire milgram orations hmmmm planina columb nucky basa bence breese ushers infeasible lambe journeying sugawara egress rajendran engrossed folklife mops rctv shur medien searing jgr 
presumptuous mingling sneed codeshare khodabandeh gillani chiasso villes amana broach chifley aileron altamirano tuk sunscreen windom bri teg habu clarín desiccation warri raymundo burgdorf abovementioned flogged aurochs substantiation okamura taaffe layoff paarl wilke bazaars zapp kabyle bloodaxe allingham macalester khost koubek creech milkshake rayong karas melvill adrianna crp macrophylla csk heifetz banting autochthonous brooms blackburne papin cadillacs nihal chea teardrops sidmouth sobbing biwa downpour sdc kobi melia olajuwon peal elizaveta kanchi bapu mesenchymal marella bish opportune feder desirous tatler shalini ordine hav trifle neumarkt cct helston trani prk thelema duleep marland matchless balustrades formers garnish owari dadu sundberg ldap krylov avianca emmerson granaries sahar quacks criminalized neary youn elyria ardenn goldilocks pdm metabolize minty waxed stendhal faroes epg longwell bánh kamke likenesses girly neoplasia akkad occupier cochlea velcro aggressors editorializing aske howser neos preservationists vials abajo galvanised hendersonville penghu thanatos wfl sarno melun quakerism ménard torben hopkinton afterburner eke wetzlar polarisation relocates enola athene bolyai striatum dagens gunsmith showering pias hahnemann arazi lowes wbbm vahid syncing cada anthea kaitlin tranter radnorshire antagonize dinger dropouts shunsuke martingale santhosh berated balti komarov tuke ratepayers estela cleverness bantry combinator vianney bodden cloaking ∧ wrenching blumenfeld cmr melding quis kunsthistorisches sigler chloé chlamydia farmsteads acrobats lethargy dileep dpj maf irian easts allergens fraudster intergenerational elated scudamore rumpole sampath tarquin tuvan letty monod gwang ineffectiveness hypoplasia loni nekrasov vilified pott assimilating harar bgsu flaring outfitting yazidis hedlund haydar boethius aacsb triglycerides hyères rosi ulna deterring millikan viswanath downham aprons leaden palle samhain reappearing epitomized butuan 
escobedo congdon illit quli gcm silicates venango saps cadavers wideband snitch svr atzmon nayyar anjouan efta vishnuvardhan salmson fringing codemasters teheran oled flp microseconds fini bratt opiates comorian limiter auteuil bemused valderrama hasler kaneohe inebriated ampex bonzo mashable radioisotopes meda peixoto airbender jhang unkind scholasticism wealden guanzhong fili unpalatable stewed odissi ous poppin convalescence cfi wintour rhinebeck omnipotence goosebumps silverlight bourgoin criollos brc mangano pollute emam kukushkin cheviot contravenes tendai tarom belford lampang pushpa benefitting nahda wilkens bewildering swingers zynga pareja veselin arrhythmias takano physic commandants erythematosus busey deuces centennials linnet newlywed benassi resemblances addo idling mahwah catton styrofoam kelson schlosser muskie maundy overbrook borobudur jot apothecaries matthau shaki clogs hausen rizzuto ckd willington hopped alleghany dmv shifty fdd dioguardi spennymoor celestino coursing northfleet gidget condenses patrizia vagabonds brunetti grinning parviz unscripted frcs benvenuto skymaster umesh discipleship maclay strasse khushi vijayakumar carrol bamberger gardermoen kilby gladwell naeem reena scrutinize dolmens salonen ngoc ilium proyecto rurik kudla peretti samoans duplicity beatnik biennially landowning schisms arild threepenny kristianstad upson gillet leeuwen hinojosa woodhull schematics keeneland geezer overheat ziya hullaballoo quintessence cletus insignificance cuervo nerva aberavon jove fraga rittenhouse monotony steerable nervousness nasp horsey philistine prescient daron vfw straddled parisien petrovna clarksdale ascari chamfered nobs stickney bleue lifan battuta ambi bulges javi barfield mesothelioma repositioning cannonballs recurs compuserve sokal brigg fàbregas hudgens ahr forages prés breakups unimaginable gethsemane abeokuta slieve aggravate cannibalistic rockfish blissful spirally leftmost rebreather galo declaratory pokey cei shoemakers 
kopf lapin pietra cecilie derails bep palindromic elocution dorada propulsive shida solari kranti highrise arron wiggly avigdor pleated baldini gmos rois bullfighter loosing efren drongo seram soundboard penta dowie lazier waimea lief oocyte perris sunroof finalizing obc throng sakis gdc curating heya boole msh sune kerem melodramas wavering hynde garnished cuiabá lanthanum badan liaquat mukden hypercube kenyans temminck imitative rell accosted todi panis pandering kourou panacea gleefully pathe cahokia aino unum kotte prokop cherubini caridad erk frolic mul idx sugary tze browsed sakhnin tambor jitter michaux malignancies tharp selleck jaisalmer edl slighted phares gws alpen wx impairing métiers confide razzaq akihiro mirzapur sympathetically possums moriyama charlatan bocas bandyopadhyay apoplexy fretted waqf luckett molchanov pandu bassin dri meisel cordially metastable turboprops skanda authentically superfly salutation harrassing prophylactic galla pagani topi etsi pira kelman easements delicatessen dodgson sinaiticus rimfire ketamine kcrw snatchers rahat mcquaid eggshell naseem galiano poached anionic ancestries formalization spier boozer silvestro heroically ruthin boal seahorses ramada sakshi conolly karnal micrometer tritt yaar dunst krasnaya steffens harpur nio rione capsicum moos ovo mackworth chakvetadze lucretius rusticana batcave stix baldry henty hassler warsi homesteaders raze reval waitresses patra arnoldo onetime realschule filibustering gifs yuchi pusha raji grylls benicia kalidas parasitoid chanakya transshipment hartung wishaw trillo prettiest galvani synonymously wfm dischord chern appl decentralised unfunded hairdressing naac norsemen spero lue spyker naugatuck longshore freescale cheam airpower drivetime goldmark montel orthodontic microcomputers fastening hellish uninitiated oldbury crothers fearnley chalfont santonja oster shami charette jcp facultative supergiants milland cashbox somnath torments nove maddock distiller donmar goldoni 
campi tress tswana chacon penitential clearcut prudhoe siloam kugel aitkin psychopaths drewry chisnall adorns fook garneau shas kasay ironton unsavory ruminants locsin reuther neutrals goodridge cuellar hanwell underwhelming jurgens wrightson erith masai tynes orizaba diageo figo cupcakes coexisted stumpf kendricks veiga gondar sidcup recalcitrant laudatory dazzler canty cotoneaster muenchen hardeman coolie wfan resonates fernald kunze naypyidaw quoins schalk tecnico sav remover lasik eliseo disinterest fok mendeleev pintail garbin clichéd liliane gianluigi kebir yogesh codebase fecundity teignmouth pliers patchogue adjournment masaharu radin ffu fawlty boule intrude sí posttraumatic bokaro rhinelander montaner marshallese newstalk krabi anticline sympatric plotter gagné amica ture potty palmers conkling volo bickley timespan fios fluttering glidden bestiary unpredictability sisson katsu brannigan besiegers metathesis brassica giacinto cinematheque tux synthesisers cantina khirbet newcomen mysims boma reverberation swerve varda mcclean glasnevin merk coste mauch peripatetic corrs sijsling manado tsimshian bandmaster jorn constitución stepanov brighouse puran menashe moraga dispensers southwold lavoie rudan chosun oddie tarsal hatem masterly phospholipids vincente subregions grinders zena exasperation jawed malakand itineraries unsanitary yogananda umc machel nakazawa altmann rhoades plesiosaur flatland sibi wku kral frequenting caffè maxis curtiz abated farleigh vtv plater wintertime braulio pilcher élite teniente alyn sori goldmann elbridge xylem realizations lifespans corin emerton tramps jayawardene keough schoolgirls retailed totò yiannis shakuhachi whines kcr moises gestion wonsan unisys venison seite bdg simard unissued passersby brudenell siew scoliosis araucaria meadowbank aéreas gisele agon ezrin kult soley taschen covenanter regrouping sahab carburettors ravensburg internals brendel secreting besting bartolomeu lederman folic groundnut montecito sorokin 
justicialist auctioneers jinhua acquiescence peruana antic darcis flotillas boobies subcontractor orthodontics brushy miler selçuk bracciali folke berrigan fastnet rosella umma nitpicks skyblue willmott pillared oilfields fulgencio famiglia fagen binyamin motherly nyberg ehr bigamy shuki neb subside bodyline minibuses runabout hollinger shapira vtec burglaries airlie sayin nian damped hurston sph pollster camões eucalypt lfa dockery traumas adh pws maribel littlest merganser twos gatton subpoenaed millwood ibuki ahad postlethwaite jawbone andal teena rolston dyadic evatt biotin coking kalina buehler counterpunch howley sutlej roud tendering omnidirectional wilmette coho philp karyn davenant bharatanatyam malraux barbette cavalcanti carbery pavan loïc asplenium fervour tskhinvali okehampton lgv bioko grasso resellers mustela khajuraho inoki holles radula agt pastorius orangery shatabdi banstead keneally stringing patrika lisbeth knutsford coshocton sydow vitalis roxette stargazer whitehill patenting pourquoi sleds physiologic unknowable powerball ione massena nunc asis nosy flexi recessions euphoric numerology csir anscombe wolk suffocated lota kase nro sbu comparator boleslaw rieng cmm kyw zooms neurophysiology ramalho cky misr adn bathsheba yohannes unanticipated sacher kirshner beas pupal hulse orkut jiangnan vande kkr médecins malahide idiosyncrasies berland lineker legato lotz vidmar nif reinterred piqued fulvio tewksbury toolset ploughshares nrp shevardnadze egidio torrence ramblings cantacuzino midleton deegan madea rajahmundry chesley lympne diggings outmoded reyne madox izzard virtuosic manville marder gents alamosa suprised kilsyth gorey goltz olimpo klemens cryptids demers mainstage electroshock herm quantrill oligarchs longhouse tumult ischia blackmun icicle tammi ravaging philharmonie backpackers neisse succubus saruman liwa sanguine functionalist demobilised toulmin greenhalgh subcompact rokeby nauseam anticancer pondered rst batna micaela retooled 
levadia paediatrics vaporized italicize swanage inexcusable carretera coxe binti saudade tingle sproul velarde lodewijk sonically dryers gladwin pelly judaea randomization overwinters afr boatmen leonov novato outsource spank perrine shakuntala jamiroquai crema endeared scantily superset mto azrael varden frist bui lundqvist misjudged cust weds throes zabel heino viscera pricewaterhousecoopers acetaldehyde racemic alertness drescher anomie meted basheer euphemisms nikolaidis botulinum grossi beetham henrich maryann heaped cicadas polysaccharide siggraph vacuous netware dumfriesshire kawabata raia valur palatka palimpsest bitters lalande invalided waxes gsx projet asiatica orsi shaming exhale gilford spiderman tamm nyb otte kanaka khalili métropole powwow kugler velociraptor corbels copyeditors wassily yakubu fridtjof hhc rocque wimp eir depredations griqualand gneisenau avonmouth recompense unsealed darlinghurst mccafferty matica steamroller preterm annick playtime finials sexology wealdstone polysaccharides kubert astorga coz coherently casuals jsf oeiras idler chedi thurso ptarmigan herodian brahim gerund pipistrellus tiran dzungar halpin pancha waltons abcs maimed voskoboeva nul hopkin kina shemesh edenton parvez contrôlée mnf abdelkader panjab avebury imperfection billets hideyuki wimsey sarojini viale metaxas vieille wrede stonemasons allegan nihilistic polecat calcification biologic diluting borge rhind apsley shivering hailsham basham immortalised pooch haberdashers ahmadis casuarina dualistic fuk sastre stronach guillen prolactin ikarus dorji milpitas haran colombier brodmann lamu occidente roblin cusick braine mycroft âme volcanics haver chehalis byram susitna peppercorn wace unité vse goleta guaynabo warde reiki mistrial haiphong herrero chichen entanglements jessi overwriting maiko nex immingham nacelles liviu rinne rpp matsunaga suisham storico mesurier tirso marling mousa bosporus sylt libertines footman delage duxford theodosia puk kesari kasabian 
mineworkers gnrh bednarek bogra ockham colloquialism ambassadorial prosaic stowage urie dalston nrm roly ciconia guenther sorocaba homeroom tiernan fiscally ancre owerri diacritical toshiro anup reorganise homie barasat sandbach foraker hydrodynamics natter simm pwn comitatus crescents oau unappealing mouthful hadj rthk forename depreciated confiscating orthopaedics mariage murayama wenner hitz takeovers dtc galas northbridge dphil cavett tmnt weatherfield akim warty whisker revelstoke voyagers unbanned borislav summerall costin ousmane marois pollok komorowski shanthi bokassa jeppe dienst highwaymen rabobank apf sievers musgrove signposted syntactically wicomico servile megabyte brawley vrindavan villano avogadro kellaway graeco mogollon mallards juvenil everitt nne piha brighten mâcon mahidol bruguière overgrazing polygyny royalton brindle sukkur detours bradwell khader basquiat sorkh doorbell crus karno seni maqam mik bylaw zhivago cecelia polanyi setagaya permeate sot cron milledgeville domiciled chortle foxton donnybrook syncope oded iiis unsophisticated iy eliane gimelstob sittin meno subverted trotskyism calera spillover zouk brzezinski albicans lessor sethu busse semitones blogged lawnmower yusof durocher zircon snafu boudin nommed koné silverchair bagging kuchma knolls kander stanko maupassant mcbeal basutoland benzema tippu sibylle recollect kajang mlm tecla kaia christopherson untagged cumnock attleboro idb nowra haywire trakai helvetia coden geeky barça amarilla filton bartenders ceri sgc tla rheingau headington hypothesize maryse deserting placard boigny yasmine outbuilding topher victorville jayant surly sayeed ksk janowicz mpl bhim menke gand antislavery fraggle puffing cephalonia ramchandra incredibles ossett fmt whelen cryptographers unjustifiable asner kazim awdry perche testicle dismount goer biondi twista optimality delco typographer aldine keine fumio darters irigoyen cetus ratchaburi bhuj pama quant grantland pert wierd veb devdas puka sadako 
mogg assemblée webcams sylvestre sandie sweeteners bellinger shanley belz timescales feodorovna latha magisterial anaemia backbeat figment demain globalized megaman proudest racemes hamtramck navia irakli jammers verena obsolescent juho daulat flapper jhu ghs incubators karmic eez collum amorim monophonic colusa perpetuates salesians renfro whoop aglaia trellis nephilim isaacson kjetil zvornik shamed malvina rovaniemi demidov goenka cibc canavan smyczek arv ahmose malign medcom kickbacks photochemical grebes wcs pressley goldsborough rouges harmoniously sabana wasabi katt schnitzler seesaw koons etzion saqqara haywards mitchells chokes anoxic tempah yohannan gyi berndt bissett sjöberg pedagogic collectivism archdioceses jaakko klimov liggett socal tomko britannic elahi quebrada mansel nuba sobolev quip siles balked correggio lifehouse rahmat huancavelica lordi kogi poti tamagotchi diniz juhi acculturation cinnabar arbutus kerns weblogs michelsen vidas campanula granites vulgare explosively aldosterone cervus splashes narco repressor tonio overwintering cbsnews aphrodisiac warfarin kanter crier diamantina twit laga lackland whitty accomodate catriona sayf mesons abides genaro smpte accel clank ragsdale gadfly aramis benelli tantalus formless cova necronomicon millisecond tomoyuki sickening catalano adamski chaman fishburne chander birrell madder areca voi susilo mazza iya unfccc minnetonka constantinescu khat phenols mayans ruppert ninoy modernising synovial skeletor whiston fretboard luma kri chiudinelli loath smitty holdout programa fairley samanid chaminade internazionali dhule llandovery conran skied kawase lawmen netherton chocolat arna carriageways jumpsuit infomation hidayat malcom getúlio curries crackle potholes bagge shipwright nitida pares militarized homages noé bowness squabbles cheboygan ejects casares hirasawa keno majoli plumer naji taryn minimis karg spithead stranglehold hca iht airstrips bilson thora wijk scariest eaff lysosomal cosy montford 
vandalizes kyrenia laffont gosei vasilyev drapeau tenancies inniskilling postoperative manisha entrenchments aal buran interchanged analgesia intrauterine redraw vixens heros shabbir gompers kye bsl ddb anjana amari seljuks nardi ferrando couched medrano geht belgravia yantai zebrafish mysterons shahu wiretapping piccolomini auber seaworthy entryway sandbags trevally yesterdays risi galaxie stricker biro ligonier keat caven effusion fittingly suba shelford dosages esh linsley pirro decimals jamnagar walcheren künstler montpensier aylward consistant mobi musing arsenault sceptic fieldturf draconis vingt firmus malherbe tano reallocated mateen wiggin hatim odum chaffey stuffy tenby subsoil zabriskie nobby gräfin fuelling jelinek azzurri yoakam shins cukor sandilands milind ragga sanomat supplication kinloch lolly penfold rond eriko marionettes stumping nakashima pedley yardstick darian rubbers minelaying hoes carcinomas rotted millikin gravitated prag giggle valentini deplete sidgwick cantona barré treves icici blakemore yesteryear stabling hoverfly acerbic phc rosaceae devarajan mcv antonioni dahn seraph godlike synchronizing clarin clumsily seahawk tant potlatch mork vacating whatley hertzog gruden donohoe tannenbaum couperin actualy defra glides ander duplications quatuor chettiar quatrains mozarteum comebacks ngan sluiter muruga essequibo theodicy jurij proulx borat dartington acknowledgements gowers bloodletting ppe ctf zhemchuzhina chincoteague toltec dct poche croghan galaţi passerines ostriches biofeedback tans carlebach mildmay mwanza infuse scheuer warton prekmurje bellegarde conjures karamazov burak fetzer outsold kabc aventures bextor kike tattered disque eberhardt madtv liddy floribunda nhat urszula defaming suruga milosevic wheelhouse honeybee pbb grottoes ankit flail meola wallets kritik sudhakar goldblum warmia frequents waqar joven aven gea clamshell crossbones frostburg oji bicyclus enel wagnerian middlemen houtman halladay intemperate ganapati 
llorente sepoy capuchins tann cottesloe telarc beqaa chenab vegf shamelessly morandi zaw burnings okuda gundagai kaká kothari enea carnell bech autoharp youghal synchronicity exhorted carrack cavalera foams diarists magnify bhaktivedanta podestà munhwa baffle ssris armadillos jamshed paglia rusting tissa seay intuit thuy cottingham aelius eloi paquito engen levey conall preconceptions fst cozumel natacha mendon clack pedestals hampel nakahara quickness formalist rdx linens apolo currey raziel brahm monta rrr laboring helensburgh raunchy hambro interlibrary ampersand operant rmt kolo obafemi milani josette riquelme krems behaviorism pul disbarred huma jebb farringdon antar heave telemedicine kreutzer presets equerry wirelessly dabbling mcgriff mullally stil rutten hamon goldmine physicality wlan gelb saúl konak spektor icefield checklists nangarhar patiño habibi reunified ascribing muslin endor eide jeez ribosomes barenboim karamanlis sert kolchak malak heterosexuals legoland neuroendocrine allaire plateaux rediffusion jongno zermatt batmobile enteric coakley smut coaling anania yoh biomolecules moralistic fuerteventura awl necktie rila catlett boccia whammy shuri petites colquitt macgowan diehard thema madeley householders kirkintilloch thoughtless tessellation sandbank charli tumi karaiskakis tvc jaar adjuvant outlay eratosthenes outlander emad hypochlorite gendarme conejo averill sidenote tbh oems astara plotkin catagory sedgemoor alawi kamla spiers cuyo sacd carded skimpy neoconservative calderas stalactites kgl ruthlessness molester boobs masbate puke sourcewatch hydrolyzed hts farage stimpy juggle lio topp betelgeuse aransas sondra menorca broil engender gaudy tco invincibles bioactive atherstone kairos dorados mphil copán charlotta feliciana satyricon jonge bannatyne alternativa adduced ivs stephenville lockjaw msds saluted bellanca camagüey korte provocateur liceu système denigrating wana monocle creepers arethusa roebling abf grassed spawns chatterley perk 
flournoy yagnik bunnell ahluwalia ciutat sternwheeler kurland buzzword simile kittel moise tempering osmium riggins cabals licorice scree sear storeroom daub rulership cerebus cmo reticular quails huppert lambasted cassar theophrastus loverboy coursed florey skrulls engr flers nayar pärt nikolayev mariette isation hardanger klemm ekberg baader kako comrie mvps aarp tiziano shindo malka sarabhai deadweight indict unicaja eyepiece clairvoyance florid teleological koopman ipomoea borger ston colombiana fettes overrated plotlines insulate exemplifying provenzano sabr thomaston caltrain amram whitgift clymer commandeur joslin smoothie cattlemen begawan teitelbaum africanist communally subpoenas contemporaneously proserpine followings mctell ugarte sephardim arcuate nobili luci taichi integrin geonet leftwich gss menderes eisai vivacious billabong sleet passmore shanties jamaicans sika fungicides rdc kier scrapyard unabashed shortcake reimagined stefanos munsters schroder hopetoun pbm gpc grosz pinson veganism godaddy sog mince agb shuichi tars sagging gwern indeterminacy ptuj ishi paignton sidearm thammasat duy dbms seguros solvency zinta mengele aponte impersonations yhwh monongalia glassworks raisa betsey petrology balaam sympathise vaio upmc thrips coppin dustbin embezzling quetzalcoatl circulations stf corfe battler eniac surfin picaresque implore família vanquish assaf shackled nouakchott cullinan castrol evens castries hijri suboptimal russification mpls juhani vesting brieuc lefroy gatekeepers decoders arbil misapplied pnr collectables gentlemanly wechsler kuhl envisage olbia demme microeconomics cedarville lazo baume dva turunen waterson warley khurshid biosynthetic cryo thickest fel divorcee vreeland charnwood margolin icahn cowry gast guri mnemonics otani glassman mcdevitt byrom totenkopf cundiff allergen ramdas vss magnetically pippi catan beeches retief elvey macaws furst biohazard gridlock turney golly fugazi parveen klum warrantless wanstead glenmore 
babysit soh scullin derbent toxteth absolve sacre seguso lites capdeville demigod dvor oldcastle dti tokat llanfair roderic yeol sagrado kwami truncate eddies paquin kiely blimps dundrum mopti ascertaining nontraditional timex monstrosity mustering spurlock adductor zeki wavefront brashear amaru xliv natatorium muhammadu chateaubriand aiaw cabarets jacque momenta ptah quilon consett mycological allahu mackinaw thos loveday profumo macaca vereniging deeside harriett fibiger casi battisti fiorina dishonorable erases tcdd karras kooper frederikshavn leukocytes colombe snatching dirigible koroma eartha kke luchador difranco boyzone turi nilotic septet lond eerily brawls encyclopaedias rotorcraft primitivism enamelled umbc kbps fersen bordj selwood levert kpc apostates chaloner affable bais popovich kujawski bookie wittmann saluzzo befall arraigned claustrophobic jetties minyan asoka adha deceptions kediri articulations popstar offal dotson sarnoff goodale raimund roache republication calibrate clemence tegel baggins miskito sacchi namen iizuka marci chimaera chirag jablonski livid inductors prioritizing waterborne templer driffield sugimoto suazo sota heydon appellations democracia performative wijaya greystone interleaved isiah lemming sladen vigan outflank selly amstelveen voort killswitch fss dupin stelle snot exceptionalism reserva juanito hava rittner fugger drumcondra padraig mosfet crim ster fpl dutiful hawkers punchy dystonia demetri mineo crediton phylloxera sumba tidwell jordon gesualdo bridesmaid issaquah mérimée ratp kazarian hedland baffles deryck kamar fescue sintering cfe matriculating ouattara lindh workin melita jessel vlasov meester oundle chihuahuan dorney inouye gurdon coretta mcneely denisov dongfeng mastectomy sienese thibaut piña tardy mirabel impeccably evaluator sitges alfieri shoop yuta conjured ariz ebbe manzil xiangyang mmhg kunitsyn naturale nanometer halder wnd cardiopulmonary suffocating ostrowski ybor asser strassburg sainty korner 
repopulate nizar incurs eyesore hegelian scabbard tamsin ultravox temperley scaggs framers communique congregated bloodstock matanuska relegate jayan syringes samia coldwell ospina drupal empathetic jubilant fungicide gantz coterie sylvatica morlocks reconfirmed dimming maroney incinerated danni seeman schapiro sivakumar tarde uninhibited hmg seminario kingfish anteaters fengxiang chitwan neagle sanofi guitarra reevaluate lapping thambi francie danaus recoleta dignitary pillory fusiform fala frieden cartago spewing contraption easington hyams restlessness sotelo laboured yupik continua stonewalling karman macias jadakiss greenpoint jwp monasterio sealdah kalla fsh queenborough mfs bjd molteni cardiganshire antequera assailed dhruva scherzer ritu katsumi eleftherios marmont mothra seperately zd fasten tullamore behring coubertin baloo odinga seascapes kengo partington urbanised oscillatory washboard crypts khani todmorden zolder provably tackler myopathy turrentine superstock homologation wrecker frança scorching shastra marwar codons akmal chancellorship instabilities foragers brawlers chilensis dumoulin sahl oversize explication osment contrition homophones gimbel kanno altdorf completly leonhardt pantone esv spal oum appellants moyles sancha ouray smasher rizwan forti diabetics acanthus avion sullen penshurst magnetron bookman fastback nulla moonee sieber faz gibbes pattie mahakali lanyon masatoshi familiarise hooft ugliest numeracy formic slaton plaguing anquetil arra storyville escapade goosen madoc wetton mckelvey tourisme grandad lameness schum amoroso esrb glasshouse giffords chianti rustenburg toshiaki technocracy feuerstein bartlesville selborne ahuja cameramen cmf bioluminescent resurfaces lateralis transients triffids luhrmann culkin bhagavan calvino lapa kensuke pantai rost gim plummet millan zaccaria intercommunal tobi brage lorrie samira stereophonics pco lenard neyland ethology homestar albie potsdamer nue mironov bangka conlan weekender moxy 
cecchini argumentum yaroslava okumura pinsky salmo walkinshaw kermadec jayden amelie neots jedlicka baltica ordinaries pae urethral lampert grigsby nahua sieger haphazardly celeb iredell tanu whiteboard khachaturian kaspersky myskina trumbo waterton legalisation lamellar deangelo kirton undiagnosed jcr atk quarrelled kotzebue saddleworth shockers shantou neoplasm lento itemid barringer rist teletype cambuslang bricked externalities dilma deira eyelashes smokestack unleavened botero cabezas acv gaster masta stunningly saenz viator bondholders fleisher mande nasiruddin campbeltown connectedness caballo bramhall reubens seabees mannes goodspeed greymouth forego verdant ballance lundquist reinvigorated maoism sawada colloid btec plainmoor siad equivocal ssk finucane wannsee abdullayev pilates samling idriss kashif hundredths ethane ashwini gershom bodhisattvas vespucci transcriptase sabotages iow crosstalk mbeya zapped itamar mephistopheles woolmer runnels haroun minding characterises anat ynetnews reprogrammed holyoake implored emeryville dziennik canny osada kawanishi nourse judit lareau strathspey spo skerry moj konstanty nica dispelled reiche memri hazelnut kliment bunty halachic calabasas preservationist shiho ohs awoken grosset akureyri multiplies aew statesboro enfranchised shemp scolding newley savagery tillage expander nbd shriek shimura voll kawa disorganised illya naftali homi millfield turkistan petroglyph londo andranik hamline coders pangolin extendable bundelkhand kanga equatoria dereliction denholm leurs crosswalk electrochemistry alenia interpolations gioacchino valen asexually dialup restating chappuis budi dimanche grantees tightness matheus anesthetics jupiterimages stupor ipn pieters magia kuku hiked infoworld stange asce hypoxic faring theon novodevichy racetracks leventhal encroachments mourad santamaría baskervilles easyjet cantu braemar visualizations pithy delias butkus stumpings forgetful nikkor hiccups horatius derisively wurm suzaku piro 
institutionalization sugarland grandison carstens autumnal steger valenciano jinzhou cornelio regen meller costigan sanctionable conv idleness bombo annaba fastpitch ifrs reauthorization gallaher iñigo lachapelle magen perdition hauschka bleep garr silene madlib burstyn apses kania neige sueños pulsing torrijos ady paradoxa tapeworm acehnese chipper solomonic nasik avelino holl huffingtonpost boucle halicarnassus migraines emas esu archbold driller mayorga unh spiking phillipps poesie motherfucker wattled emmitt sood maillard roto wrested berke mycenae eichstätt sherds paraglider bulfinch ods ptp shiel rahway glória thievery gratz jajce ouzou taxidermy machinists lumina cerna pombal youmans hyndman glickman cleon vls hyrum jer arend gravels adab andersonville owyhee wingtips powerplants kuei strychnine ascanio daguerreotype rennell lymphocytic basara yura vandalisms hota bayhawks casimiro irritability tamang benalla shredding thorson marra samper jansch mikhailov savoir killam mitton ickx impairs rejuvenate aira spitalfields singletary treks goga unwed pilla segrave knollys coauthors cdh mez biomolecular crustacea odm allmendinger stratagem paraiso fernandina burro disintegrates sapa fdny anges linseed slv pluvialis allu jaki futurists epm safeguarded preez corson kiara unsportsmanlike harmondsworth moravec sarat veeck hentschel bernays vincentian gamed syllogism ivanovna naw goli lautner astrocytes makarios saragossa lightyear rationalized gls blameless harpenden willingdon drews xmpp demote apricots barrhead slinky kamov shikari bounties northcott tul burchell bootable santis konjic unsanctioned sevan droitwich corrector sull sivasspor drovers pierces spotters perpetuation nistelrooy leyenda lenawee biofilm zadeh hamedan takis ligo ahmar orthographies wardha ddc vigils niklaus kresge attics ivins customizing abhorrent unf polynesians lancs lubricated shik kgo zlatan hollingworth riffa calmness kronor ziggurat sandino siskin dysphoria orloff upwind madcap huainan 
repudiate mongkut hokusai trastevere neptun siebel upu moorer lieutenancy grech tilts prestel karta asakusa mimosas swarovski roadies conspiratorial mrm floridians complacent honka nazarov gluing slavers longwave neurath phau streit rma storie refilled disappointingly papel asiago ranelagh codeword kab wagering mitsuki vaccaro payoffs obrero mangold orrell camra sauvé inoperative redmen scavenge sexualized downplaying grazie backbencher outcropping pathans okra niang stablemate freightliner bienal bartosz naiad stroh voci bruning pretzels wiebe lonestar amplifies harb audiophile walkabout wari intramuscular dagupan riad droopy dzongkha delightfully matosevic perpetua cemal appreciably warranties thoughtfully omnivores steinfeld eufaula ukiah nichole nucleosynthesis clench ballgame determinate backpacks osbert iagainst magnificently nagato inadequacies herded byelection oppressors hartwick braveheart queueing snowed marcha wretch galán armpit dastardly ibni offloading apatow cowdery schoeman irrefutable farhat compleat sidra brasiliense fhwa vian butthole heilman pardes rockbridge mallarmé unrepresented consummation penalize corduroy prot zgorzelec ferox nyer oreille anciens nappy acbl mrf raed reichel jacq cordia tikhon jérémie bonnard kaddish smollett wrongdoings blankenship hedonism pangea neccessary galata glenavon objectification sisto nickell becasue murree brinker electromagnet rhodope histoires pacifico dabo jalen favreau mackerras silting geest kirkbride hypocrites guttural mdf kooks euridice coulsdon liebknecht iwai downgrading fib isopropyl saluda cecafa legibility ogura henshall mnc cott antin backlot rebeca scotian haircolor carignan swissair yumiko coalville milch voluptuous normalizing maranatha prijedor devvarman sweetie drapers oac binky choa whs vinifera synchrony bachir bayerischer arish jammin leilani archaeologically aeolus scottsboro carters akerman konta nevermore legaspi pud sach laypeople aggregators victoriano rambles ury gerland molesting 
virchow flamborough hoodlum binay squabbling gushing pelted crematoria eru unbalance sustainably perky prettier aguadilla farces izard almora thiam khaleej beekeepers ovechkin kurupt savard hilmar martelli pend wonky benita deller balaton websphere leitao tumbled meira milroy unu jamali westernized namesakes harrod obelix shafiq aeration iryna chattel gametrailers allister zaleski luscious waa digression eiger negroponte ongar ifn overdone asclepius saadat sociolinguistics solidity marcellin wooly rheumatology fainter pejoratively qual â asam ayyub doggie longue opelousas taxicabs siracusa naseeruddin stowaway lancasters listowel berns investigaciones cornus dodged rhetorically disbursed montepaschi arthouse gudgeon gantt suspenseful beppe mahe submerge alkanes mercalli nogent livers botetourt moulder marand lakehurst deflated rectifying handlebar roadbed obliges maroni autocar dressmaker hitless guayana henríquez wooley ayat pnas trashy nantou beaufighter truely soothe tejon schaff ilfracombe tda headcount tapp spikelets technics sledding pontoise sewa newall rhd saltash mauldin weyland schall almoravid mico jeopardized gerrymandering charité overpasses iveagh loathed wehen tamla seppuku tayo blackthorn coronations lha emms selke sweatshop sudarshan balakirev auriga vex knell chippendale bifurcated seiichi falsifiable overdub falck alamgir lior pandanus tuque iter presbyteries adjacency lamine mothersbaugh büchner coghill arita araucanía singlehandedly vallee salacious colonialist viviane kojak fibak alevi delink ekta homely diosdado wistful mirv uff buckmaster horvitz disinfectant burana yoshioka hely herre ege schwank lawal cantorum zao cruden wenbo memorialize kian womanizing notionally hagel baro helpdesk inclusiveness laffitte pinang chaya wfp receivable achtung ayaz virago atriplex accelerations featurettes quadrupole godspeed veliky sahil kottke prolapse tarantulas strath petermann webspace legates masturbating tomoya isb aquarian midgets autocorrelation 
augur intraocular somatosensory leachman heras overman arliss akhbar analgesics evaporating kirtan discordant hassanal zellweger ramah shara shanna bushel jedburgh parken murti unsatisfying poème overhauling tsars khiva underfunded herbalist ojibway sdt andropov cellmate scotstoun investigatory cols ‬ barajas ismaily addressee trailways lachin sugi slippage normed neeraj mandrell fanon riven disastrously lingers recant plasterwork putri teeming casus halina gairdner acasuso interconnections quoc badging phrenology exclusives koan macfadyen moulana batis hollands timeouts yorba hardiman tvi sheikhs sammo andreotti pickerel irreversibly dieting insinuated clares equinoxes bukharin ottaviano tarnishing tamalpais jordy holford butterflyfish gsu avocets arouca kampot pdpa roseburg rafik peavey pomegranates laurin ulica simeone platts mics druitt thousandth twitchell spessart scoble smad kwik computerworld tigh doorkeeper goldfarb stryder retells castellaneta autopsies cavemen xen furioso calcasieu argentia cannabinoids grumbach domenica svff chiropractor tennessean rosse telco alleyway thyself hawarden tzara belleau everhart kamm brittan sleepover jimena inverters mois tithing shiraishi kesey makara hašek evaporator uchiyama uttoxeter luminescent resistances mordor quayside bettie gaekwad junie souci farias miniscule waffles ipods benzyl disenfranchisement wardlaw tiong noroeste tubulin piter mumbles cpgb conmigo ahd barc repayments inspects deion prehensile invalides hijackings insaf himes grrrl animatronics berthelot podemos abruzzi peu quist antão pudsey refractor esteves depositional hanzhong taung ghose penns pedantry gizmodo rós streetwise elrod shite dewolf keflavík tolan moosa natty latinoamérica blackfeet unripe notley seance shaoxing milgrom ryback dalzell pacemakers annabeth ximena sru acapella lachman shigeo bopp hathi muggs parsa mechatronics losey muffins cattaneo quatrain regretting collectivist vajda ruel yanina gribble roco numismatist medullary 
seasonings aarons lucienne sécurité bullocks ostinato mcr villani wernicke sarbanes hamra soter berto lompoc ituri littering worthies mfi inte extensional domestica comber lubumbashi telematics housemaid unicycle antihero agriculturally valiants tns pentameter gruppen dramatisation turok jala oceanus sele aleem koerner bannu budo granz bristly ambani compressible moxie dii gouveia beaming curacies jwh molo colds kiarostami enthused cayce flatt maistre arundhati seis nunciature miniaturized depress rushen bilderberg deitch frio gamepad deu kassa inexhaustible adsorbed lanyard alkylation mpd camanachd gülen sycamores nampa spats tintoretto sfax tisa labat fishkill pittston opitz balbo jemma noakes pinnock adaptions tidbit felicitas hippocratic caterer dedicatory apologia stared accursed vagrancy tracheal rhb sde mutoh sweatshirt onitsha kraemer travnik ramis christoffel parenti mahanoy confirmations arbeiter uea sturgess lyles glutamic overhearing sisaket caning asir broder drumheller politika downfield defile krakatoa biffy freakin manakin internecine rmp aquitania tablature borrell inoffensive lins amartya mtg khong irritates cuerpo fisker capilano dods papillary gambetta propped pasqual dacey pwr cordier ruíz walhalla menthol decorah seers dodges quintuple samuelsson exocet fania medaille ges axum jayaprakash wriothesley pensée reveille knossos vca oga leatherface huangpu shinohara zawinul idolized natarajan malarial suet zama daimon neater komo kittredge digne pongal niners paratroops overprotective whitstable eutrophication brushstrokes atterbury rodrigue abounded yulin paratransit makedonija mending matadors nicolle gazelles sharmila flocking ricker commercialism tractable heider hedberg circumferential myc christel feathering topos searles hydrocephalus cvr ermisch mcalester skagway enunciated barbizon nits millward sjögren gutmann arauco terrill comport thalidomide sauerkraut socceroos siaa furler atmos symphonia sukumar stift yuuki subverting philmont 
scoutmaster blurs usatf amaze sparkman fallowfield cason seit tilling lowenthal mapk videoconferencing tormenting haggadah undine fatigued bergkamp rosenkavalier foner amnon confédération arabism fichman egged uighur hoagland whitwell handsomely altenberg baldness scotto incumbency drona chadbourne straighter tortillas tish geisler ismailia rattigan nanni bairro overshadow debord enameled compagnia beato baptistry emeric halsbury undoes jornada sva ratchathani ifor tillich saris palfrey permeates blatchford lescaut charioteer meany cheuk ceyhan flodden vink supercouple brittanica perennially takács valorous shortens trini voronin wnew buggies zillion combatting elicits shure kosygin hilltops copp squabble seaborg sbd hammerfest sahiwal plesiosaurs screamer paquette modeler pmp harmonisation cally optician afterall beaty redden iphones overwhelms apologising hechingen daydreams aphorism yenisei italiane durazzo brawling kony tomohiro edgecombe bondy breather conroe undoubted sigman opportunism ulric sharecroppers poulter gema irr anelka pilote skookum archi greensand collinsworth backlit kazuko bodrum dreamlike vicissitudes seghers sunn tynwald conclaves pneumothorax nsd ozzfest gwadar chartering tolga userbase choong telfair hegde embezzled dzerzhinsky reticent statistique semyonov generalities falsity kaukonen exempts piñera wormholes usbwa passacaglia mongoloid masterwork strangles repainting involvements interglacial rigel kishimoto dimitrije manoeuvring monroeville keach suffocate weaning bocca schimmel upshaw kalispell cero maseru mcaleese dorff oversimplification boru zurab quotients états ebner althing rony rattles nessa prather hellenes enumerates crf scholastica corroded conwell bretons delacorte afn sammi unmoved aish siddiqi bolstering oleander kodály rebroadcasts scrolled recitations carrigan estrange stroheim proliferating tepid webm evarts pairc courtrooms sobotka tourneur verbena emanate bersaglieri homocysteine bolognesi onerepublic laundromat 
lochner enfer zink subpopulations orthoptera fünf tetrapod fouché metastases pitfall joffe dispensaries zhai guerrera smallish wasilla sweepers expels pron coady ryn vado cezar viscountcy swastikas bernanke vibrator gullah shaban bilston winkel qw fach yenisey cees passim guidry kirchheim hotaru gjakova critica ipsos screamin indicia akasaka alona melodie leiva movietone baas blazes schilder kaho canids bodmer greenbush mélanie crawfordsville tabi dermis evidential gaj línea savonarola lusty masry stationing melmac brockhaus zilch inla roars neuwirth esri tawhid manali kacey honeycutt zrinjski reema agaric spottiswoode spacewalks chislehurst winfried bisley malnourished spiros bastide kranz inaugurating touchpad photoreceptor turnpikes cde levitate carolinian dinkins lacson mallets skara reordering vasectomy preprint rogge hickox secours oppositely zilog glycemic thoms mykonos lathes levent vasopressin griot harlot extolled muara tulku campobasso haine brydon casbah alii weissman goreng rumbling efraim arbus rollbacks leptin arx wreaking ostracism levite overwatch riviere voith nishioka lilydale kristie millán bohai stipends busier frond tenderloin fermín oleksiy starkly fibroblast ethnographers kazemi nlf cafeterias fazl schnitzer vergine cobbles isothermal bink refreshingly joko camcorders sólo shareef wilmore shivpuri meli laverton steinar makings blankenburg thunderous androgens catchments aftonbladet inegi precentor merrion dreher wep estoppel rootstock balasore epidural adelson skiffle adhikari leverhulme nurmi lindon maughan barquisimeto jinks repatriate bandaged siebold serenata airco rishikesh heckling columbarium cauvery yass timeslots amet talc treader archambault theorised phang soulless ostrom mcminn watchtowers komal idps hothouse populists quel downtempo electroweak sarala atx mullioned normalcy premarital novena hagerman schillinger whc cryptozoology clinicaltrials aussies rafsanjani brix sangeeta rivkin gemmill heppner alliteration bakhtiar 
danceable charlemont byfield accidently conglomeration dni curates abierto akademik capetian secord peden ultrasonography gude denpasar dumitrescu giller faryab seyed baranov antico sequeira leverages slinger stolp burtt persico hooch occidentale monaural hockenheim altaf oer fantastically hashi petrosian funke yoshitsune capron bassam newsboys slinging uncultivated robina brack albinoleffe vaporware moravians differentially hassel filo airi itchen uneconomic charron burchill habibullah rustavi shee hollie balestier hrithik bordello vicepresident ballymore saddler wgs koryo giamatti lessard bardem naushad edinger middlebrook girvan therefrom skeffington fingleton schreck phy contrarian overhand erina fuld chabrol piercy fynbos cookers auge bracewell muerto voivodship sef farrakhan wickliffe mmu détente goodluck sacrilege counterexamples fana buttery retiree eigen yordan peterman bookworm smithii supertramp saree saburo treatable kearsarge iat adressed pål barreiro adsense batum exegetical omari multiethnic classifieds meatball sinned gebhardt corran ungrateful hallen shriners edgefield blaue lpo epicurus scaife reasserted viviani siler muntz intercooler skydome mulliner deceiver microtonal waupaca subramanian joos haffner abaco mbi persevere dhx dragster aboveground guillotined mcneal winterbourne crystallize groupie toler pär deregulated blazoned honecker dampen diminution enema storehouses acetylation telephoto faired medevac westerberg kallio stonehaven jamar huard mti baldomero brixen emanated pironkova waff domesticus sulfates anika ailes procol savarkar kasur seaham smita pizzo artificer perdu belhaven manti radicalized mycoplasma armories favela hamersley sociopathic fels eckart sufficed blockquote moorman burstein vorster cropland andong misdiagnosed snubbed devastator farnum pickpocket beekeeper hohhot roared phytophthora unmistakably estienne genista rohingya nieder rayna nemechek polignac cheeseburger padmanabhan tripling kasa goucher weirdly sujata 
kalmykia marcial heralding cwgc spacesuit alveoli wakamatsu schulenburg annamalai sogdian rieger oettingen arkell alkalinity abelardo loge harbourfront spedding fcb jacinta filbert milkins filius carom kyd aop taskbar schoolroom dunya birdcage ynet hymnals iwm yegor battlegrounds hilfiger hif reordered mgh pervades subpar tablas navid outrages tiv suzan magnolias romanovs reisen manami bonobo oriana ansonia tunics spongy tonite tinder sharpest milsap adoring majewski medved koreatown hagood honeybees exaggerations kookaburra fiendish subroutines corporals devgan loddon antifreeze hci maritsa cathleen sderot licensor boner determiner obamacare kreider karditsa pivots arpeggios hercog maurits boxwood calving anouk prognostic futterman liebermann electrotechnical metuchen urbina poiana resende avifauna ethnocentrism amendola leathers flawlessly vandana saurabh choco yamashiro calne kalahandi racewalking thung sturrock winchelsea magnetized lazard carolinensis suzanna macnamara brittney untidy aiims noblesse spie flinn pranab babbling onomatopoeia taiyo dadi trashing backlund vini elles badenoch bartok deepens scintillation upm goolagong askar yandex nsk waynesville footballs vorontsov elses molt brasserie bushmen macarena travelogues thet barea cuthbertson waddle irbid dividers flotsam jame arby vanguardia civico phanom doylestown grue gaea loit rajdhani pvs kiama espino exertions barrick anjos robichaud stationmaster kalimpong meech verger clatsop rosado hopkinsville adminstrator zippy blore nvc alliterative superheroine cabra hydrides sylla murrah hagedorn jupitermedia volante silvanus galan kandel undeserved schaffner manco beeswax sro soliman agressive danae ohana gratuitously buntings alcaraz katona walzer shirdi garish boonton errico cachet toute toboggan yigal exterminator monophosphate corsets whatcha exotica stapp nettleton maxfield rustling keeble brics lerwick sardi kroeber intaglio gravitationally crossman coby flay rrc amanullah sinker litigant regni panj 
kühne galvez bacsinszky repentant majdanek amini craggy recurred decoratively casserole appaloosa anxiously nfp aldean overgrowth huertas mongrel crosland reductio wintergreen bavarians baire pdi silversmiths amanmuradova mcglashan giovan guedes ferengi reenactments dowson licencing cessnock regrowth schoolhouses abductor anxiolytic maritza crosswords aikawa neurogenesis daimlerchrysler bayle gso arina bfe buttressed peintre flipkens vilanova bosko rochefoucauld limbu haptic ivens fredriksson doused alfonsín piller cmh contrabassoon yurt pageantry therion gorno korchnoi obote odonata baeza ajanta brickman crl welford yorks palouse sufyan gyeong kennelly misdirection latrine kingdome carer mohicans gethin dentin swayne garou porthmadog brandis estrogens erichson redeemable morillo curnow fanta hadleigh länder onega defibrillator giger riotous tinie unleaded magnifica embrasures tsm auriol fortifying gavrilova cheater monopole sisyphus tamerlane biberach nagin laboratoire arcola hibbs mabini jcb bacardi northamerica gili unimpeded lingen invalidates tallmadge foa tussle qarase ventre agglomerations sirhan pilsner dairying bigby maclellan massawa uncritically tommi exempting dilly salesperson purdie guillem sheathing rhododendrons elit trollish torvalds ngawang impulsively barberton pipestone torp catalonian jaggery trekkers silverdale footers lamartine manto farul opossums qassim dinghies druga adami infest diced fermilab workhorse quelling dissect channa plaisance hybridized acropora middlewich breathy traugott dioramas cobi daemons fairings cordeiro udit stacie bitty inglot kirkcudbright duguay milovan wykeham wallerstein buffa turbans gladden reprogramming aconcagua shunning talaba arsenide danna jesters rnr kwinana heinie busk kickback mansehra zemeckis utu leavers spanglish woodgate petunia cockayne zaha snowflakes extinguishers quickening szymanowski methuselah bso sats turcotte cassegrain saura newsagent shiels rodd solidago shackelford triangulum burnsville 
virologist fgc gravitas prefontaine shaab simbad growls betamax accompaniments aiguille evelina cuarto gromit ingles vem aswad emaciated trackways kirishima berlinale qn imperiale baldassare espacio dobbin moronic imprisons steeles unmasking bregman kurata reichardt hatt clubman espanola detentions ymmv doheny wid fava stubblefield berlet appiah tropospheric rathmines helmeted birchall whoopee anticonvulsant ceramists fagus nadim lauria officeholder obfuscate kcbs teresita sarang scruffy decoupling wiggum oculi menor puddles nasu mckagan ranunculus bunning wasa moonbase handloom lipinski spoilage eivind zer freethought celebs cheon fekete bullough dorval buchwald aches leclair tensas queensbury bronxville mnr epfl outpaced swayamsevak mudie drouin peaceable fermoy jarmila czerny privatize drowsiness minamata terrestris flints zakariya emphases ruptures piezo garamond barbiturates recede ghazipur cadherin manioc mkii hanke anupama niacin bagger estoy metastasio sissoko leapfrog voorhis keaggy froude guanyin inheritors beath punctatus qabala zhongguo avco birks cassis marth apprehending hornish artista factiva poncho dtl silted apennine bywater aucoin sailfish vitagraph eugeniusz jol crewmember tappara wombats frankl esser videoclip gilley payam ladbroke powerhouses gaveston reacher khurram expats huish unintelligent uracil lanegan sardou okun reaffirms tsh sedgley schlieffen sssis saath mckie vainly ineptitude iop pubis pythias podolski nodule heaths guez ipads nello intellivision sette porcupines biomes doody barceló petrich damselflies gangtok hatoyama ramzan grog schleck création galop micha avantgarde partha torfaen kazoo matric crock greely panola avedon bargained roomed zandvoort valvetrain meddle argenteuil unafraid tkachenko elman sealer maxentius omori relegations sadri christabel swanwick lamarcus ceu ‐ umeda taisha bellville hotdogs parodi engin rosewater kinkaid kaczynski blakeley dildo calvet tumba pingree vögele despairing supine tarts relaxes kosuke 
basc natt ruspoli karvan inde suppressive macey videographer teufel slimane cradley rop awf cruse boxcars krom coch endothelium silke saraswat scheel foremast anar signy birgitte manji emanation morteza flamethrowers bourn ghatak fossilised matsushima windisch osan drummed hasnt patties bacher gladiatorial determinative revulsion cagle kolyma educationally erma frag amirabad diao mislabeled basler multibillion hayton kanwar undressed cml calero amaterasu monopolize clings interregional barrelled mottola widmer abaza powershot yahia eichhorn tpi hinchcliffe nibali doby sayan collen tweedie tonka lias balfe weeden adenoma pushkar greenslade bramlett elyse pieds surratt megaliths bugge concho exonerate amoy berggren overdoses annalise burridge ecl vroom hsn renu bostic biggin telemarketing hammarskjöld cavani bedser whorf piana redesignation leonardi upsala hout devoe niterói underarm lafourche triune loeffler komuro phospholipid lilliput rabb hausmann boros overshoot mumm pitzer lothrop haruko dougall bertens saltonstall goffman vamps trackway topalov powerplay miloslav repented incubating kardinal macías snep pva msd hypervisor bahauddin hedin hashomer bevilacqua radicalization braschi disqualifying madisonville gwan imbruglia hatley biafran johore katra ballpoint gelora abbès wani syl khurana highpoint anchovies publicists rioted taino semmes falciparum larcher tautological mccarran uscis brooksville pcw jizya lupi taza schottky lcr ankeny lingnan milagro renews micrograms swaminathan hammersley clarisse fabri paderewski cappuccino holter cch sagi grisman lupu batiste grodd clute uac vibrio impetuous gallantly segway coolgardie hued karsh ghb shanklin scriven crusty drumline hydrogenated oid metacarpal zeitz arruabarrena goldbach tannen prosecutorial hayle futhark steinem wotan wattles covarrubias anticoagulant morice chavannes tomei bhairavi crn upshot riina droste lgus juozas rigors grossmith carats defrauding páramo ventnor cursus sprinklers ershad extolling 
orgel traceability optimise amel adores wunder wishbones protégés macdiarmid dramedy treeless dolla scrutinised sanson skt verapaz théry pollo dawned untoward poornima engelberg mannan initialize bachinger incumbencies clavicle leelanau groan gerardus conniving chiquita manso penumbra quiberon natsir hugged hvar jaromír vagnozzi dahlberg lodgepole escapee tetrachloride tsuru remitted arsonists admonish rhizomatous guus neurone perfunctory freehand morrie gristle leipziger fahy esg drogo nuer kbc replanted lariat saporta raiffeisen alper fha kerensky tassels flashlights ashfaq pilings alessandri monachus psychobilly capito hallyday shuriken stürmer admissibility lhp unpretentious sanjana bandana crooner sidestep nogai llull levenson issyk emoji chaffin velikovsky boyish conca evangelize ital broeck hematoma razan lii eldritch secreto rushworth barbecues kalin viburnum niranjan schizoid zuko rangan angewandte obtrusive rapti martinson longhair catheters strumica morass schmit alawites schlafly plaudits niobrara adelman westman andijan tayeb quinault kilgour troi langridge bovina fanlight ieds conformations heneage bueller hurlburt nevado ald lamond bruneau penwith irshad mesmerizing seiler yoel homered zookeeper milken mortified margined bosanquet sombor labrum saadiq rausch cornaro carcetti dyess condescension quetzaltenango châteauneuf synchronisation zubayr annacone glucagon rosendale kaisa scrip fisheye richa torrid yamin kazmi gannets pomorski pandian betweens linoleum aedt loopy punchbowl plantains rotana linnaean aversa shiites undervalued choses mabuse shrubbery mns grimston hayami mqm hyphenation omak trimmings tyger spica executables natividad maksimir dinsmore niemi insuring lahaina ofelia bloating vikki geta firecrackers dinero hoopoe nimr sourav bindi dibley defamed mcbee oggi banbridge wheatfield derechos luitpold vegans semiregular diavolo epicurean chakras corozal shimomura stromal commercialisation ducted mesas cheery namba gosnell invalidating csar 
fawzi talos fbk koike srv hydrangea hopson intervarsity subramanya proudfoot lippert wachtel ketcham villainess matsubara shannan turturro swearingen trainwreck sitta netizens sabers fouquet hillyer topol overhauls glucocorticoid ivb damietta hekmatyar offensiveness arnot severs lipschitz uys finke partook psk jiabao sherbet lohner saroyan mahabharat stereogum wmu rearmed damavand pinching jesup zahab endorser refunded reynoso selah pinna coppice alternations peels maddow finnegans protrusions jeou ukhl bildt fpu buckie razer trisomy ettinger airdrop rytter nannini banke espouses lach aiff inducement traub beecroft risque headlamp gute frickley asthmatic peeps lud camerlengo delroy cosmetology zorrilla rashed ziyad harbouring lucina phillimore misquoting laboral viner airedale lmg kalas petoskey millburn sayle horsfield freelancing uji intellectualism senor fineness bagby demonology difficile dcl puerile lubitsch merrell vittore mip zarina heilmann tourniquet gyaltsen taormina telmo colum callosum misdirected niehaus expounding bilaterally histones mcferrin milkmen belfield blotched croatians auditorio fermor butadiene derailing deval ,i laudrup luscombe propels colorcode madureira messick polyakov pacey sremska emigre gocomics jogi symposiums frac collegiality erzincan fusions kailua beed fracas heysel girija myeong soloveitchik ginette rottentomatoes postmodernist razing zamenhof mukul pediatricians tunkhannock perreault eschenbach mianwali accross bolshoy izzo handcrafts alem sleeveless jerboa extrapolating kek prefered percutaneous iliev morven snowbird ferrera meaney intubation banishing vanities scheherazade alami abele kalika expressiveness distill orihuela oflag fujiko ethnomusicologist dougan thrombin caravanserai fabia dundurn roxane pochard kcc biddeford layup arboreta macqueen ainsley goi dossiers tga bandeira joli oppress snoring uncompetitive claymores terada overstock devilish silviu bator portola postural misogynist zealously yaron strikebreakers 
asadabad simenon xinhai lawndale tillotson kühn baksh clun srinath ailey hermansen tween deneuve bitlis rup malmaison laxative disheartening infringes nuwara atlantics flasks fredericia tumbleweed applauding microclimate yoji usi pesce heusden ferland bhoomi nicolaes alou jabir brearley blunted venosa mugger zouch pfeifer agl santuario comprehending blanchet lepchenko ideation icw coningsby whither shafter fabiana cloture sandžak eardley appetizer loafing daoism clu helicon mauger fontaines impersonates jaguares speedboat tsw fabienne menino renounces barahona commerzbank wheelie heineman arleigh shawls throttling deconstructed throwdown aguri lmfao palabra defa sld amerigo suna trautmann monocytes ilchester tongarewa munchausen lindow jehu quorn ambrosian vjs compositor munros servicios dotty accomack najibullah basanti torsional tayloe billeted rpr diuretics spoonbill misericordia detlev reaped connellsville berating westenra prudente corporatism inaccessibility hoppy scw homepages hybridisation konan comecon hedgerows ints hmt assemblywoman taba cesaro ringleaders speranza cavers capitulate kober reivers linford odsal cashing speedwagon hikmet disqualifies georgiou jase disposes equalizing stoppages iskander postnatal estrus bayelsa blanding ete tibi jugglers exner tympanic aldermaston syros gigli sohr electrophysiology hulked menteith vegetal striations antiochian gnc usain riessen dyffryn astbury miah monocacy papery mitzvot navegantes ichthyosaurs gratifying fairlane anaphylaxis lebeau deewana bex viel avarice snowfalls potgieter pesa monosyllabic moonlit valmet hamsun yow bnr vacationers wyche mismo desu deren insp yanagi bookmarking meatpacking mengistu nitschke itami parrett vpro faia levittown subarachnoid trelleborg nilson diastolic primula gonadotropin lcds chigi dbr kryten pander consigliere eston figc anschutz wiretap pichon perdita harts takaoka aldebaran ironed cordite baps montville premi valeriu landor eam merkle olivieri daro leonore hocevar 
uppercut usui palas birendra bothersome lestat kickstart silja doolan slu aled nativist klia hertzberg andreyev hete dht gassing fengtian sonning itil olaru angulated ahu giudice groupthink dard tsurumi gregori strana cogeneration unruh kingwood qam numbing manan iapetus paré buc leery abhinav tokaido burberry skewing wadden rorty wagers jil munteanu ogilby dorp wiig seedings resta gidley underlain casta shuttled jealously canting lincolns retrofitting daeng misao sorrentino sall colwell thawing californicus hornbeam banditry yacoub katori lymphomas amaryllis panicking chillies bobbio schomburg alds queers dork worshipper icky schismatic ppb archean goverment walliams scb nederlandsche saami unordered chinned litigated shushtar bansal arredondo teleporting frankenberg junji mignola zadok atif wadowice geico pichincha coto furneaux unknowing ilsa confusa tagbilaran pion krush agouti towered simultaneity andor jordaan kishida egf ctp khoisan yaman rugosa unpleasantness pinsent joensuu aztex antifascist watteau millipede personalize stonehill republish prongs grünberg bnc fertilize rietveld joust schenley navarino olefin leka akutagawa metohija sequencers scoping florissant bisexuals stupidly crawfish zeppelins scabies abadi bdc mpt ephrem tichborne willian eichler commonest parthenogenesis turvey chronologies squalor bvi arawak bitstream mabbett somesuch kole wayfarer tortola laan leeb tamiya happenstance horford enchanter pilon coir braz snip waycross sadan digests zab garib ideologue irreligion snaefell yoram amphion embree annoyingly chuquisaca ipt maser clallam homebush nadler mustn hayride quadri guid prow gpi vyvyan signficant fidler hirayama sains staubach lammers duffie jjb wollheim forecasted karnes erlbaum insurgencies milazzo stockley viktorovich cappadocian weisberg fallopian lautoka belies rosenkrantz iwakuni krk kataoka roye aspinwall detaching kultura fahim obliterate bischof reseda hasdrubal iko javelins tyco innervated manhasset rancagua musées 
maeterlinck adware tonelli belmondo latinised ordinations kattegat martz marotta arcing liquorice calzada matted mizoguchi sandoz kalutara annihilator icke elizalde interminable zte chemokine takoradi margera tipper paedophilia malar shearwaters hulks hammel flagellation nasheed pardus iolanthe pyrimidine sweaty greenbaum teodora shotaro starsky yasuko sumpter baptize dolorosa kakavand whistled bett kheri fonseka vetoes cloaca subedar biber sergent arianespace seager watsonville biomaterials cutscene handbrake shola coffeyville kashmiris maillot cronk letizia paralyze twigg aggravation nisqually mimetic piccard lebowski freyberg ahura quinte joyride gennadi neta dungarvan schram adua chesnokov halve staking toivonen parkers eckhardt tdm equalise readout gaudet dta homeported speu baeck semenov reve lillo toribio contarini palani loman brilliancy lnp pisan hajar reasoner agonizing katara intransigence hymen paralyzing afra dorrit toombs breastplate bulldozed dreier timekeeper hirta abr glucosamine arrhenius backlight americium meikle saifuddin landholder terese pyrgos evangelos formulates pipefish verdier téa klinsmann pushy boyko hammon szd brossard xenomania blanda decentralisation interchanging pappa contextually undisciplined disadvantageous bodom mutagenic pistachio wilsonville debora indentations dhahran hodgepodge davit rulemaking sambal tozzi willet fiordland souto mdi kenmare bailando uncollected pinos iligan yantra shunted heifer triennale nadel darlings biostatistics pretenses onizuka bcn sepúlveda nini konno distaff steelheads asli skyrocket barta awolowo morphogenesis flett wpsl shochiku mujica subhadra mané dohrn trotta kerkrade jeremias camerin imperialis leoben ffi masochism musta lauterbach saxifrage fus beefcake pestalozzi bion frid floristic theophilos conceited singe macdonalds khanates turpan bartered coldness unrealized yadda synergies roanne rebar butchering lohmann piñata delaval contributers aventure materiality veniamin llwyd palance birk 
süddeutsche ebc cliburn sartorius cahuilla imperfectly ornstein imputed panne haymes siriusxm shawano upminster ecf adalberto borowski upperclassmen autocad sampo pulis solas louisburg bouncers radboud bogen incapacitate porras timberline repurchase danja briefer lydda qawwali interscience affirmations shiksha whopper tallman criminalization nitish epica remsen fisted longhurst lavie sheree malthouse retrievers swashbuckler shill pronghorn beaudoin aydin zzz glib guilder ginetta chito lolland montjuïc underweight delimit andersons bwa diedrich divines downsides paas stor generales kune dorota aalesund medi modding hanmer windowing forestville foment superhumans dkw villach hdb glendon lovesick zukunft nanoparticle whitacre herzfeld dosen stabilising ayam rickards wladyslaw zal eval historias almaz longmont midrange bilby unmolested pownall honeymooners ampleforth tatham registrants helgoland manzoor sumit eubank kington stubhub provisioned tweeting gambled scrambler preloaded landsman shawcross pavlovna malady kade sakae otakar deni wracked abbado tulle glenfield pottstown orbán lustrous bridesmaids saviors guttman langtry tamas pensive lumberton immersing tmi maharajas watercolourists akihiko mitten giuliana céspedes valter sonet plows rerelease arar vezina jerald utagawa detonators suleman gentleness keothavong holberg zephaniah triplett kazak sectioning sachdev fetishes charnock dtp hcg partito albertini wel guinn jarmusch bleszynski omm nemeth underpin ascorbic hunnicutt bloodstone svea ging santhanam bellis strum salvagable meagan coheed televangelist gundy watermelons emeka vaught creatinine lukin capelli jefe cabarrus marano limping ladybug qais solow amalgamating hebridean pratensis moin surinder newsradio epigenetics heymann yari lobato hereward kulik appelmans mandelson ministership odometer cookstown eakin relent nipper cacophony liye infers amante nourish katrine compacts headbutt bandeirantes johnsson backman immanent bumblebees enamoured beane newtons 
emptively aimlessly faf putih arachidonic chennault poss ndi castleman fintan schiele lehane standardbred bisping uvalde mohamud inimitable corned brainwash gha masan darr guidi khanh nct toolkits purist styrian masvingo solanaceae lurk generics colenso boonen eber acceptably wellsville densmore mamta chipperfield irreligious yoshikazu bettering swabi oliveri stradbroke uprights armatrading golgotha woodie cascais shotton himesh luff ift karachay rito bruyn cocoons rattler tints museet tenison vampirism raikes larocque halsall salability waterproofing fingerstyle porcine voces matale cleghorn ickes cibola fentanyl topsail hodgkins typescript brisson rast gusti morio pestle tsarevich melange stepwise piccoli stipendiary europaeus feo strobl purveyor phaser xzibit fazil arians protozoan physio albacore decipherment cahir environnement dorling combi ticketmaster baga hermaphroditic arnau unlearned iniesta manservant ruber mourne primero siciliano obed supriya bathinda bruder vitas goodfellas lynd teva concourses scobee colmcille willowbrook blasio harriott rabbitt tussauds makeba bizzare dumber giv zorba arenal leeming kageyama skog reutemann intuitions porvoo bonavista peppino cardus coiling scurry rajamangala sunna alwin macrocarpa pittsford sevendust boldt sonntag frenchtown tokamak canaanites pyrrhic hexafluoride militarization intros brenna phantasm oconto nhp abaddon massinger malaita zoologica hdac magisterium blemish baryshnikov loya gorm requisition ipm merano martinho fudd upholstered lumberman illegitimacy cowherd castellated workloads takaya garside wove gaddi hesperus herriot pesach asci dawud overrunning urich carotenoids surrogates spittal resourcefulness themistocles manilla steichen decrying swarmed interlochen dacha seascape maypole funder coarsely jackrabbit colonialists margaretta iduna braff brassy wrx cois eckstine morán nandu jelen anthropic pinilla bolles riske nabucco goold waynesburg rhus showboat oilman kamini vinland leggings osmani kajol 
hadrons politecnico fairytales toomer returnees peptidase pièce slanting vining swadesh gos się pinkett sextant lamonica sifton lofgren sambora oktay vagaries liancourt lausd allington rajagopal twombly ivi speedster juche saroj palmolive soloing urticaria bellies apalachee clg extraordinaire unearthing lowrey legolas succinate kojo taib alphabetize retinopathy juggalo copilot ribe ratifications fledge sedate erdogan dugong shammi estanislao bedworth salal cornu heisler rossiya ménage ammi holguín bamboos videocassette condoning kanzaki wafa macromolecular elas yasukuni aimless harrisville fountainhead sakuma ifp evaluators monteverde temnospondyls librettos zeffirelli baqir chieko toko segre premadasa zinedine mockingly katzman nairne kriss bouguereau parmenides kobus cellophane arabians kumano nocera mnt gelug invisibles tulio manolis chuk costuming abattoir yelverton ranson jihadi crossbows insa marron okara tadao scorchers naoya koval anciently hazaribagh beaujolais shon colina meydan charlize prerecorded trindade alternators mexica acquaviva bobbing estancia magoffin hami vayu broadwood fitna ambo winches rodan mke irredentist rania serio worsted erland padmasambhava straitjacket jati nevi taverner nyheter cerulean inanna disproving grenland russias tellus takashima tare discouragement sarda bunks azkaban vite resuscitate turbochargers reemerged floridian aquatica veronicas operación jodorowsky corsini ichthyologist auglaize araz rebounder poynton otterbein cyclamen tiu phospholipase disquiet irmgard naruse reconquer brownwood chantelle anglicisation bouldering spall aashto satake kaan elizabethton foraminifera bernheim hadoop piasecki nolasco mendiola kevorkian plumstead jayavarman rigger bjork leafed vilmos headteachers panamericana havas tajima cugat decriminalization liotta abhisit pyu digitalis catford pfo stoked unbundling mullets terrazzo osm rainham pyunik albornoz zygomatic smartly frommer riccarton dobell kollar greggs basinger hodgman observables 
cny tertius cully téllez boggling higuaín jcs graetz cholinergic jkr sundae enrol knutson popova robi winstead sodas paddies janna nigrescens chandrapur seidler demigods farmyard qualia underwritten sheard unyielding penistone jetsons gaeilge wojciechowski maclaurin bhonsle brengle barthelemy ducking kronberg portlaoise taizhou remparts gaudin trachtenberg jboss zimbalist lunette tilia bankruptcies fabricio boreas osp daffodils funaki endocarditis victoriaville iczn quaestor alvord hypertensive curvy tigrinya insurrections chastise zeeman deconstructing wxyz porterville méthode aumont satirizing radian neer medievalists thimble pranksters effusive aiaa chewy proposers backtrack athelstan whereof lemuria orienting debased brynner cheyney shapley backlinks photobook topham finck davitt skinning allam guzzi glenrothes jalapa buz theoreticians easygoing neame jacksonians akihabara lamothe yasutaka turgeon hisd sulphide yog nsync willan gaas ponferradina jaycees findable embellish portela rhos hobie retry spectrometers lightbody fraenkel paperclip vicuña brehm principes intercostal chaldeans sharpless candidly strabismus portales emmer kakadu cockatoos waley reber saxby danone yakub turnips burhanuddin cuda nantz johnsonville catflap canonised bousquet revlon conrado oooh althusser unmentioned authorial unobserved adiós brundtland swanston mamaroneck entranced lessens vce corker pluripotent absurdum miklos kohistan imhoff donnellan gouge allsopp durance acor sifakis humanly pgi iñaki chambord ascher mihara masquerades crystallizes wildfowl elision nasim bfs ccw unbelievers setiawan eons minimised robitaille estar wambach examen liguori jell dsk yakut harrach kadi bompard moorestown feign sundarbans mahavishnu chudleigh winamp velayat celeron opulence bori schönbrunn pstn marmoset slavoj miglia catamounts étoiles okapi homophone contravened fossae abha geshe bernina mesenteric mccaskill polonnaruwa outlasted vectoring herriman oona lazenby cyclocross orl camerino 
multitrack defrauded flam brownback yate incrimination duesseldorf lmc usedom apte peppy norseman campbellton watney lujan meic noni daigle nervously guichard lpa etherington hader grouch bracketing hosei teamster marmon vef siue radiographic grindhouse marck satirizes jemison kommer jeweled roundels linney shh wilby contemporáneo morand verticals garrow shorthanded hammonds regno pirlo warnes warrier miaoli afanasieff nazran dunin uecker orangemen lohr uzun hohenstein harmonizing stoa wharfedale kunsthaus junks mutu ssbn krog dalila wariner peo contorted maldini micrograph hermeneutic icar halacha apolonia raghava washingtonian shrikes narcolepsy workingmen gouging cari jtf quips fazenda photometry pugliese spined shelduck pamuk baccarat bekker picot dispatchers nariman amla elc laursen odetta economia nicolo también expectant jadeja aqui uch beggs trickier aref pleyel bobble sculling yawning séances pennock khadi tehama tts alissa delfin strumming purveyors ruma pdo anchovy tahar dreamcoat aoife timbres cudahy augsburger tepui sdb irgc crotty delillo wilfully lct aspirants walsham mopeds mountford gastroenteritis cutouts emarcy alcides gaffe colonise natalee malevich giovane abyan fibrin launay implicates swoon iet knebworth ippon mongolians virat idly ashour expending devika cist malpas radclyffe togoland diazepam foc gadolinium transceivers bigorre scrubber circo makkal davout meinhof toutes shilo miniaturist bateau gerlache tangshan vorticity ullevaal lacklustre worldcon meds popularise lesko jayanthi barnstorming holmen wajda jimma hafnium adenovirus horie mbarara animist scannell brenham evinced specht figuration enumerating joists opposer parquet pontianak baf zubair encuentro mckeen comodoro fischbach sakon autologous sternly shaper awacs azo kozlowski parishioner implores buddah enviable incredulous balli mugu degen bernardin allegiant coolio visualise decameron ballston lsts watership rublev shepton pfi cazenovia wayman mirjam soiled chaozhou jorgenson 
amala kawachi beswick apolis corpsman benko detainment palabras perpendicularly rund harran grieved beetlejuice thaliana mahasabha digitizing credentialing tuscumbia isadore jou cybill qub kenworthy porres chlorination aronofsky driggs polen micallef hierarchically yarkand renn kasha shul ventana calabash hokey ieuan dryad brax miskin nastiness eisley collate daye evol stallworth fairman daylights heraclitus sperber blockhouses anish googly sarangi acw thrombocytopenia drips replicator yoma watersports wivenhoe underwriter albian antigenic foliot tog shala huybrechts manizales mccusker accruing multiprocessor dotcom rundschau waterboys tyrwhitt cph hasselblad augments sixto loke mapes aun temeraire cremer awn rnb superba strecker rabban wsh shuji semicolons constraining nonsuch paolini benner gpm acolyte usurpers asg controversialist beatbox centralism blaisdell daad hyslop frill dufek deaneries arlanda squidward thant acg klavier upchurch fulling vidyarthi poyntz frogmen giolla moomin ongc prefab maybank desperados cymbeline nasution imperious jevons encumbered avan gadi meritocracy perugino alphen didot pizzarelli martes tela cabañas potvin mescalero krewe pech tatsunoko vasey craigavon microcredit rajko beis reignited lieven laski veljko bitterroot basketry sprat karlsen panza azide xinyi sapling meffert mediaset cesium shiu spandrel contractile kertész offloaded airtran wickersham wythenshawe ukraina girlie molinaro fsi polycarp paddlers barka devens whipp silvan nationales multilayer fetter yakutia mélisande warman dunford balam omelette saliba biddy vba bophuthatswana deasy roko sedbergh kahin allemand nisar gape lamour amiel srichaphan kreuger ogres altercations maji durden showpiece sutro rashidi yiwu welds blithe paulist carmella preoccupations overwork comerica kamat valla overburden cian woah tayside aynaoui fräulein acquaint iab maximin positing englishwoman moye thessalonians nlds stroop fonzie ncb cartes laf cityscapes rhum costar carissa joelle 
birdwood decrypted worldviews salic kempten truant softpedia sleaze swindler cygnet activex prester indrani tamblyn cvd paysage brodhead nnw mapquest northway ramy imagineering bureaucracies blackstock grube razzie backtracking fadden rohrer essences edw oliveros balkenende interconnects grindstone butz recklinghausen agriculturalists consonance reimann lindahl beelzebub ror gleneagles wholeness dray elec rehabilitative sfaxien windstorm cañete kleybanova jind basant soll bidvest getters tombeau roselli blesses imogene pestering alceste staat nepa royton ibrahima memnon odb wilmslow bidens metalurgs britto bsnl omnes limousines ledbury slessor magie cathars cephalon yunlin sociopath egyptair crom aneurysms trilogies kriya anga greenbank dhan bonhomme fluorite harboured catacomb thomaz yeading belanger dirck norreys ricin usar –– maesteg bullring delo mahmut guyton tikka emanates taussig sison denunciations appia levasseur goaded leavin mavor moai disunity pathum definitional persuasions mva omnivore franny izzie praetorius gargantuan hunky glazier videla holz lentil dattatreya baza patroclus penman toray paladino monnier catalase urbane oron canelones coffeehouses braverman boso lobotomy dantas elfriede cantaloupe pikmin coronavirus atomics keydets alondra sadao bocconi heiss menger horncastle kinsley inactivate ganging herdman suvarnabhumi figural kommando sese hairline driverless wyke fuenlabrada anticipatory mielke citrix skowhegan mizan squats stoicism piste foiling prostatic bairstow paca invariable agnihotri zawiya greenham djembe dirksen graver shamisen romilly kingmaker dania nahl zbyszko wfa stapleford tst dnd morgans rómulo eckerd skippered phinney marinella snobbery pacini quiescent rajneesh snowmobiles lwin callis divulged bédard eveline kravis anouilh drivin akka berntsen crossbench cryptographer inis spreckels chavis ontonagon jubilees powerboat redon arw messager manjula devours zelenay bootham gnk syndicat motian hassall cigs evin ranariddh cais 
devane gfp brummer feni transubstantiation tekle societas centenarian stoppers koalas philco enns maries cluniac broadens lipase jeannine sedated kcl groovin concessionaire kauffmann salvadori hyperthermia glantz savai imitator shavers thiepval hre zhili bima addu buka mitty bechet nachi kaepernick trion mcentee recs venezolana marsala presumptions dbu bowdon ishigaki panmure zinaida sanabria falcao magnússon ulcinj irreparably shadrach planer hopkirk yayo wendi skal buskers ihc presences eastport krakowski currituck usg ntra bluejays tenderly osawa installers clemmons microscopically cassock nyy hallman bga plucky lenta elbrus altadena barabanki lamontagne quadriceps villareal rousey grampians telemachus bjerre leen pevensey amati busca squint regicides kistler farallon gilbey gauntlets couleur extragalactic recollects bummer hartshorne porvenir schöneberg launder qemal topically contralateral chora bayda roughing speedo schlossberg merriweather jiva taya benes wildrose crit tabrizi nonconformity speckle sunbathing rvs mulhall hspa couches schoolwork gunilla ageless preuss qila kercher deputed ertegun masih utsav ethelbert naing désirée ajs faeries forsake zamir intentionality keefer myrmica fanboys dimples siachen hoste skilling khairul perma chocó keïta daza beevor déby debenhams palmares heartbeats claydon helo feigns bhavana junpei whimsy sorvino acis envelop floatplanes maggs gravedigger specular maines monopolized flug rationed mugging manduca paria parter elvish lecherous carbajal nanomaterials shippen subsequence harbord hilariously binks jacaranda incarnated francke reoccurring charis mudstones solons dilshan revolutionised simion winnemucca mumbo raje tamba demoiselle kirstie silberman dendy kildonan rosencrantz volcán neutrophil bareilles secretes mortlock droves propagandistic scorch tessin habra flicking tullahoma sibirica philpot escala rall gehenna kaushal housings dicussion hox salvini hartz norn shorn sym unspoiled ebm pawson clouseau haripur 
belmiro crécy uncooked kordofan psion longleaf abella nevus mailto mees schnee gagra troubridge welty paternalistic hyposmocoma wilburn lyttle empt loamy higdon ricco anticlockwise tev newsagents slandering cgm vires hsh baux noite afflictions synced transboundary taylorsville olten mettle logansport pottawatomie johannessen arsinoe techtv clematis costilla kurdi teke zon golani quagga hyperlinked beatifications sedley bulat afula swampland gopalan rhetorician abdali wardak horticulturists boulter hypothalamic greengrass wigtown cojuangco buffington teru erdos opcw leidy tallapoosa mafra dunia devers grotte hristov vignola impostors caryn telepath abscesses critchley mckeithen tribuna piven dregs multinationals olmo implantable schmeling jitendra plummeting ogee polito dissolute orginal tololo stingy rlm fêtes armbands unforgivable harajuku mastic holsworthy hypertrophic españoles dicey stennis confound clow ntp doral coeliac moree classen waterhole superstation pennywise hypotenuse radionuclides fishbone hypatia batlle dominatrix ail rerouting clerkship purer postcodes loras javid flatlands berm correll overrode supergroups webmasters ronk fasts friedländer tempestuous trillions kroc capelle shema certificated playfulness ystrad michela kalka zeid zook tindal transposing aisling peacemakers mahanadi pitchford floresta londoner rashtra mure challoner crimp northants lohse walley tuscola duddy interzonal basseterre sks neurotoxin lapwings shorted longshoremen heffron lustful adonai edgcumbe mcconville saadia asie commissaire sharpsburg marlboros turreted panagia segall krøyer cregan bathhouses mendieta fantasie palast hypothesised finzi lele magistracy décoratifs tickell buckled hussaini seba dimitrovgrad blk deeley bellwether precis celsus midlife huachuca sund weightlessness ghassan houck attard dainik emelianenko millau selectman nameplates graecia arbitrariness pathologic drizzle hps mete mpb bigots wolin lungfish pegula kirn escudé biomechanical metromedia 
overdubbing mathisen remainders maitre latvala stutz residenz burkhart inequities laconic machineguns tiana tapio kosice gasser dragonball schild palaestina relocations kamath hullabaloo hulman seedorf godart aerodynamically penitence ipecac mcclanahan tailing garam frito crevasse venerate jakobsson molton blackley expendables relient papermaking ensor payal sempill zana akal simulcasted kirkenes chapbooks hotshots moscone vlora clk mikio dalibor insolation drakes savino breedlove raksha beier hocus aurillac statics miele wakulla habbo mexborough carnoustie shamus stamm tbf tragédie stojan ruffled allot gula diddle cumulatively babaji burbridge alvalade raymonde petered bookbinder janowski campbellsville birdwatchers dialogs solovyov hexagons tourcoing mno footlights briers americorps shakeel kármán neophyte curbed chessington committal stunner ingleby duper mozzarella brainy petkov applets ziaur ushakov visite poma intransigent phair inexact bursar eusocial shamim jetix ocp shraddha bardsley olofsson ijaz garzón silty mangere crenellated hma kulwicki beckinsale efrem bikinis boulay sbn anecdotally tatras qubits suffern tweede strongbow godden supplanting thorstein murrieta ariosto kates stepper bozoljac dadar wess trayvon transpennine alloway misidentification thunderdome varanus brockport rhinitis ceb monagas evonne ier stanwix groats ltda tracie uckfield tatchell larkana yola ossa disinclined raynham coasting tinctures brigand kassim mclemore parvin boda fgs rosenbloom greenup sadist lrp kahuna fuc bogarde craik bighorns gund vaclav greenlit starhub uah raper botkin archrival goda dors copperplate archiver stealer skipjack easterners fathi splint hoists salience trinkets gbu remick telegraphed nde marussia fyodorovich beaudesert walston edta dicker vecchia cornmeal chery astroturfing esto vanuatuan tsao spilt depew eventuality busto chatwin dancesport graduations coaxed enso enhancers guarneri realign pharmacopoeia clovelly aspasia jocasta kravchenko kort lillee 
strangford esmeraldas whos tacking trevithick natl chitin escanaba impermissible tarija highlife harnack fenella rodale mesic winks igo improvable thumper wahlgren sirleaf sandinistas anticommunist matas licata pineal sulphuric trinh roleplay jiangling culberson ephron brabourne javits overlaying kotler sidelights daydreaming bonnets snappers pales delmer leytonstone punctures ghislain tulkarm cardew heres feldstein gada infineon colonizer spikers trintignant eol shootdown varius weiwei meni oatlands hayford baiano ewtn pampered bouverie khadija gyre cherubim bwh butlin mended inheritor knez ripened tenable waw injectable mccoys acoma sensitization bardwell cami rogowska chakrabarty precognition talismans scolari alh wolde ects otr presentable mainstreaming silverback beeman tylor rattled pekanbaru incognita mert bricker outgunned conradi taskmaster cathartic imhotep dobre bohannon dorgan bemba praiseworthy yasunori karlo bunched beaumarchais lanigan steles craps moscoso ibi toller holcroft hysterectomy kashrut shamal uct debre unfavourably cassian nyi blackall kyocera niccolo mup straker zinfandel europeana fsl sullivans barcodes svc conquistadores datetime tartans isra proscription skelmersdale factuality actuation reinvigorate kerrville oocytes divergences drunkenly ipu mochis cordata conjunctivitis brasseur twisty grangemouth carolan medlock phalke biracial kuroki grell ulceration lely chalus edens ruisdael preschoolers brotherton lauritzen huxtable baccalauréat whacked northcliffe parapan luas kokand blandings doddridge pornographers unadulterated taddeo allport cellier tinsel rhenium levellers monoliths irf agdam refracted conceives coppell grito ¡ medicated fernsehen chie logbook zhonghua vergne cantors salavat caribs deporting bipod bootlegged binaural hallucinatory jcpa beirne jaffer mcinnis btu serous sanguinea dirks andreae causey wheal felidae delineating asiana arlette airstream byun infiltrators jaynes frohman intestate mahia jellies naphtha torry vey 
wagener glamorganshire lampson insincere tikkun hunley cowra tremayne venoms pareles quoi pazz centavos lumix randomised livings pilger psychoanalytical cohesiveness latched afrobeat angelini apurímac edvin tmt purushottam incites impracticable candelabra acro roubles rózsa cmas basten telomerase zuber reposition lgb tracers epaulettes shays lykke towa carruth pustaka mallick conserves shulgin sloat blaikie kneller haren characterising paquet ramo herwig gusmão mulia wissen vindicate smirnova sanitized peafowl roose hyrule brdc ecclesial jemez mineralized rothes killin chams rothenburg deafening rcw arielle waterlooville elling decima périgueux sione khairpur burry ludger cinders homological reprimands mementos glitz tav lanna tsuda janke galung disassociate shinnosuke skyward potting guth maxey evra semiarid brinkmann ptr toothache machar retracts causeways hemet ephrata liefeld angioplasty kernan martinet beatus jazztimes unionization meriting efraín tobol racquets dominantly umbrian boondocks villahermosa volver clocktower keon rudel nightingales beshear sepang broxbourne dervishes senlis mesquita autodromo amant faridpur sippy mugged farrier mone cib fulmer jürgens kanin pournelle meditated minimax goethals pahs cudworth sheepshead backline decennial polokwane hagiographic shyamalan frum thionville marrs boltz kazem corts rashes strobel chickpeas hemsley tooley woodcraft booing fark malmedy haddonfield johnsbury linville flatley pittsylvania tehrik skydive chariton jeh unna averroes endometriosis easterbrook scoville alcântara baan gavriil hoekstra nájera tubas houlton machetes hendy kegs directness senthil krav kuniyoshi machiavellian garçon gummer manzanares mand deidre courrier paniculata fulvia oti bluey joyfully heidelberger foaming pressburger rhimes whdh tahlequah wia phillis twirling naturism landauer antaeus regula froese koot desimone filmer epifanio morges uncasville muss staffan saida biblia miceli bloxham ridicules uridine underhand maryport 
streicher khatam galvão chivalrous schein guadalquivir teething lutte feverish blewett disequilibrium bonnyrigg mahala ipf agnetha slavin nickleby opensource zafer sool olympiques bagels avidly marias baike nostrum villainy buckhead interlocutor gyroscopic shipmates ttu vapid orridge tricyclic isolationism aborigine courteously vashon nayef halving antagonizing netley avalos mikuni kutztown avner tibbets mamo cesari pittodrie tsarina rcr savas mensheviks allstar claris virtanen peranakan yamal cracknell judiciously tanners rothe nso koz vaa railgun bootlegger canio tarbell frankenheimer riddance labeouf smudge tessie panetta muscled msrp blackberries pleasants acn nearness vint microbiota hardcopy foreboding negombo dme yavne tvm bornstein kleve heckel nowicki meio garg emulsions mondeo mifsud juncker matheny headbangers fashioning methil anthropologie ncap fui unidentifiable stansbury riband somethings spéciale bli copps interlocked oakwell sdo arraignment arseny clyro climaxes moyet shaq inkster gondolas cubit robillard cryptologic montacute libertà piscium hagler kaeo lochte someren hoteliers naipaul petry butchery slrs rooming springwood loewen montalto mhl zoster tubridy procyon swail fricker sagamihara bluebonnet abut tono hattersley pursat wende ferryman vav nms consol thw galling kaguya fomenting gaan yanbian shoppe yunis bentsen dosa otp transamerica funfair puppis misión quahog availed shanker diatribes meudon serravalle cotentin mourner platen berbatov dagblad kmc centralizing hinault upliftment paratroop bledisloe haruo californication senter hopp camelopardalis jts rashmi digesting acceptors waren euskal unencrypted spiracles espana tangipahoa yaa triannual gaviria hatcheries tercentenary projectionist mcbean thereabouts stylistics mozarabic schooler ecclesiology coverages metalurg boreholes osterman catheterization medievalist persica revie curtailing vtr limped gruner shantaram councilmen paddocks howstuffworks gudang borlase expressionists basford 
aric greensleeves giggles montagnards mckenney lilongwe cataldo hülkenberg yichang parkville soars gerakan midbrain repetitively rufina tellico brokering subdues dement influencers westernization dagan deist bogard kernow ishibashi webmail brane monkhouse bouquets resonating plainclothes rustaveli widowhood ruts profiteering hil kittanning ranald gurdwaras manuka purna spigot wath incunabula jilani neidhart centipedes fibromyalgia subwoofer playland mervin dahlem nitroglycerin hozumi outflanked gioconda hirota rhi egor afer switchback kth albéniz coupee hydrofoil pslv landen bakelite boivin caprices orrery tocco annas matting scatters iorio silang util terroir gastro rubel occurence uncivilized whately floorboards cheaters frailty phonons yorick ramil nescopeck alsina apollinaris dreadlocks harbach corneliu bazan risers coextensive gailey compactly calogero saye flavonoids invalidity goalies canaletto pasqua lá junto anibal actus earshot marshalltown unrivalled natascha chretien duelling profuse partiality overproduction joppa scarbrough tago suso annihilating upf robinhood westerville wlw barbel obits dugmore stroked songshan unwind sweetman wege billikens uks chickadees towanda spanos telefunken utzon fairplay bucolic giorgia fozzie calarts palauan milenko sudeep chemo unconquered fyodorov misuses miserables quandary klink inbev tamra badshah cultivates diário sundin gardes costantino benzoate serban mcmanaman clitoral instigators eragon bilan sherriff ivc ditka crumbles bermejo pivoted mahlon essie bestia bedminster dowden irib abang rybak hillclimb toye zovko beh erring diebold scher todman milica peau banishes eti lyla hostetler lepton rowboat whittled whitening ghanem internationalists sedaris palácio budde staind semele ludolf llanfihangel vexed mannie unprocessed foreknowledge cresta bosc avocet nbi dismember aleks robustly turm diaconate respirator fulci crepuscular diz intertwining munsee wahdat pacaembu thorney traitorous etisalat wiccans polychaete 
parthasarathy reidar eightfold pinatubo inoculated pasty swales hlinka namib atiyah soapstone arap taipa delmark orbach bureaux meninga hermès derosa sequester bhaktapur chalkboard suhr khama munsell robinho arirang cortázar khang backwardness chirk perryman hachinohe incisor queensferry cremorne barragán falke kempf barys gruelling atu belting westen tallgrass joginder trece apparantly peacemaking employability velox anolis sammartino fera mase rosier wilkison dawei caprivi backbreaker lyster fawkner havn littlebigplanet mamata prostheses cafepress gtk metafictional grundman wiktor pedagogues lagerfeld slonim naxalite beckenbauer juliano mécanique constanza confessors zippo rainn klaw abedin inkerman lindeman boye paternalism nenagh winglets hillyard chelating catanduanes cagefighting unflinching doberman miguelito downstate adjara mechanicsville tatton ruka miserere willits bayfront respighi baniyas ampa helgeland irvan shipowners headscarf ceramist inducible textural sucha poonch aylesford conned speirs hernych dharamsala offscreen taner peris vung fabrica sleepiness poplars sheffer nish caminos samithi tunas meccan denn substantia negrete gulfs davidovich sandpaper quintessentially bjarni yongin flyway elmina roces corsicana bangoura susskind embarrassingly royaume pokrovsky chivers maubeuge maddin scriptwriters vindicator circularly entrails monteux candlesticks marechal humongous gav kiryu yod lamson tamales mcgavin sabot legalise burscough nyro inauspicious bootstrapping kadett yongzheng afshin homefront dailymotion boxoffice ludmilla capuano manzarek cabuyao treacle kirchen retransmission dutifully nikopol panegyric torquato cosma narrowness chandpur refinancing thorogood repossessed capsize boliviano montaño geier geriatrics lavington digitize exley wingham academicals désert oxegen finnair moldovans anneke crestview ironmaster fortuyn longbridge keshet manni söderberg utz mahaffey maudsley hawkshaw avgas tatort upadhyay fiesole blixen encapsulating knelt 
kae chatman munchkin callbacks camm puro supt balta cycleway buyouts bertini venustiano jubilation dwivedi cookin badrinath fulcher dingy sansone serangoon bryozoans nidhi frankincense llanelly bariloche uintah davydov mcburney odon charlot mfp ferriby eases prise daphnia smacked amusingly madley andreea ingleside barnhill devastate katan bira josey demetriou marinko marler gildersleeve gondoliers periwinkle tannadice lamotte mergea wister storr tripper menaced omniscience neutralise truax juego lahey haliday seeb frenchy dawley barri margam natali hynek surman maho schauer joysticks mijares anim callen bunka meadowlark gaf cau irradiance zildjian jovovich peromyscus eskdale ddu hartsfield dineen usfs yearns andreassen sagal streaky talpa roskill melnyk investigational spandrels dunker bakerloo elazar entrusts granja polson studiocanal freaked muswell fns beynon tushar bena gane reisinger belding wattage margaux meshuggah resiliency noblest tichenor ucp newsman chisinau overpopulated flieger blushing kassai disaffection léa kariba heth buloh hrd zarah nupedia humbucker longhi ecn scrubland wizz prestatyn gynaecologists darned catron paperless felines harter outcroppings slacks tange hathorn pem usoc ziarat amortization madalena soldado jascha tanenbaum sabena shedd porphyria caray kamenev rehn arguello silvestris axton hilmi haqq lexa nmb boric dampened popo hada artillerymen mundell gobies stina elwell wikepedia muc schley ventilating ciano isms lunge losada secondaries dint rummy cadel mattoon sproule candyman dorin safes dbm uluru reassemble bugler holdover cleef keilor gaits tarantella serov willenhall blackmailer godefroy shadi dollard fenchurch yeahs optare alr poema qays eschews tavener fermin namsos girault simmer orontes valette ergenekon sleepwalker unacknowledged deeb faucet sabatier yoshihiko polyandry gehringer omura landrace razia hille franquin canards mua johannis hutchence alek blayney krish oom effy tapu talmage kudzu pelléas deluise krumlov 
simplifications lamba sangiovese hasbrouck segues wimmera bioenergy pinkham hungaria mattsson ceci hayyim preben bonobos korngold marinho consanguinity prana eliana ancora tira chhetri mccorkle conder kiang lewisohn courageously milliyet neumark nitpicky highline triplex gluttony shamanistic hendriks ribéry shg gertler entrenchment nonproliferation ahearn domenech leroi carvel fenians cabinetmaker siltation acetaminophen organelle karyotype agnese vierge thomasson oujda tradename haddam tvo andalusians underlings vollenhoven bernarda cardiologists cherenkov bankes gasparyan neocortex buzzy miel emotionless maat cftc muscarinic zammit ptfe excites busia defusing bujold lacerda srbije hmnzs pauwels furth hoodie irredentism mishmash mdl ochi unmaintained masterminds laddie tankian imlay sws barend broek retz curiel satmar colonnades extremis montalban urinal hagerty germanica overusing merkur danda pandits yashin besse desantis edgeley moviegoers crivelli gotch gunnell tonsure akito daiki ioana snowmobiling carondelet gcn spacek southard abrogation röntgen birtwistle coutances abnett heon otolaryngology enlivened anthropomorphism myerson salai hastie spong lanner atanas almaden jakov deadlift southwood bebek rendel minicomputer waterfield pliable travelcard pawlenty breakneck mclendon microorganism hellcats waites detest gagliano bakhtin rosendo tolworth skilfully brooches logar nowshera jakobson mouser nivelles cuddesdon accordions abaya mishandled apsara lachey vedette purkinje brière ebbets slatkin viib biotite nivedita cadences pwm dowse vociferously karthikeyan revueltas guria tachometer trocadero ashwell beautify ruadh marcio pfp paupers sickened perlmutter dunc ccb skidded swash ralphs weidner snooping botti anghel shuo tetracycline sulley ucce smithtown biopsies tramcar postdoc diab mindlessly flushes criswell baileys scrums mdina indents citybus sahir slammers evensong guideway puffins oosthuizen calexico haridas thune yw freemium ssf cagliostro unfashionable 
sarika balham chimed mclay rosing keystroke homesteading dulgheru wigtownshire hartebeest goggle webbe chthonic mattos gsd sybase sankaran beeton tika constitutionalism homesickness yusupov entrusting remigius imperiled reassessments infantino hardens hartt amn amita martorell ecker skywest xiaolin homey mazari romanum lfo trilling voiceovers scholtz bridgette hurls laureano rakim masanobu frolov mgp tongva ysidro drapes russert sulfite petrarca maestri wheatstone griffen jonesy minimums diene dalal castrato replenishing grauman televise segel yok satterfield navigations latches youngman plagiarizing headmistresses baath impounds autopista kingsmen goffredo anr ilagan artistas crooke scf suncoast blowfish ramazan stockpiled odorata yanagisawa miroir cerrone michelet solem actualization palindrome corder prodigies fetters tarquinius mmt vitti jabari laguerre brawler tír stencils stephani nonce shumway alvan gein glamis uneconomical giani donostia tey lurch ank crud muybridge begining squalid handers bula interbreed dameron lindelof braddon katzenberg innovating basco danio forefather flaunt skittles pictograms sfi washroom fovea orions herrin spheroidal encyclicals boombox moorthy townend peddle adas garbutt incorrigible whitmire lider semmelweis dimmed hoogstraten canet facedown pyatigorsk aventine unselfish medians jayalalithaa hunte bromo synthesizes smite imu quapaw chik gouldman hellenism spurge yuli dere wdf mimes engulf jamahiriya mistranslation tamimi guidon henneberg cucuteni credentialed neuropathic koni mounir nage cromford asiaticus unico pff encrypting bradlee pch smugmug kamera gossage annexations melua embalmed igp aakash reasonings endothermic sassi paned retitle oolong noggin exudes démocratie altieri radiometer beri entrainment tuckahoe segni purépecha holo staggers companys zillions oñate nst shotts jardines histon pugwash rudnik mastership democritus precipitously leese bsh osteopathy ducasse fabry okaloosa hiei bazhenov danuta cleaving frisky 
bha giusto byam nerf criminalize antonym macewan mosconi telomere grizzled ottley axn abstractly immunosuppressive dramaturgy gnats ayame davin ballack evergreens palus ivaylo dramma northwoods reestablishing heavies sate mansouri hajjaj matsuzaka taster stalowa deas sanjaya slop bolshaya rachis sidebars zehlendorf thay scoresby jetfire pontecorvo heliports veggie watercress tce resourced hershiser gussie ellerbe bertel shahpur bestial electroplating reshuffled meas pulliam kalani comus ripcord cliffside aipac jangle zaynab sangue ratatouille christianson kewell godine wassen prioritization bonnett playas epitaphs lossiemouth djiboutian sikora archosaurs gramont bovary jacco citadels substantiating aton lutenist sufjan schweizerische cricklewood citys orch fairford grenache anova macgill robledo fasano cubano whizzer telenet axelsson grignard clearview varsha sewed fractious janya pokerstars sexo hilarity teres wallen giornale chakrabarti hominin cubesat udomchoke nsukka integrations indiscretions mizutani forelegs husein fabi aae olm moulting scalped gummi oaklands conciseness lifters naturopathic perdomo crayford soltau conceptualize porbandar truckin asiad watchmaking bhave consumables transonic benois digambar sissi psychotherapists posteriori gomera wole lhd farra geppetto hyderabadi plaistow goading groh limavady mohapatra lmu pullback elswick packwood taishi humacao infliction mccarver unconformity fuzes patmos doda tripoint paull rrs rachmaninov guarini jaswant resnais mosse bookbinding disallows jayhawk ephedra fante chums eastmain distributional zippers beechey kampmann maravich abir repurpose bgp lanthanide freesat arbogast majd casella pester bugg profs salamat cupido pentti cela ferrar infusing showgirls lecomte albertsons naved appropiate leaderless rahilly waccamaw koki baier telomeres arnon moratuwa trew gries hoceima absurdities jasna dibdin salzgitter dardenne cyprien jund matai laemmle cloris luczak travails amoco ruthenians fancier flv dibbs 
pris beefed obliging lindenberg borgir kuzbass motala esopus loggerheads shahriar alita pompei tanda outdo serpa haycock agitations uwc plamen hvidovre melodia kuzma nullarbor hmo vinokourov asnières flowerpecker liisa patt cravings jalaluddin uncorrected signification pastora sunway eph purl copier santillana steere overstepped voila akhmatova colan westheimer jarama linzer phalarope thunderer pugsley asil postgraduates seducer scienza manav powerade institutionalised kors broadsword villavicencio ludgate moel champollion lorsch postgame qasimi faraz diplomatique mutinous stearman yager groveland shinsuke deviants evs moje delibes mangle interspace leoncavallo trainspotting cannell dentures drunks fiera hoaxer ows sneddon bagatelle mondes goodie andree referrer garba indiscretion jorgen scad hallucinating outstandingly printz olea canticles wolfie gallico cobweb winther monarchism kory crna planta ler kuttner lantis ponderous tsuburaya breslov splendidly impersonators círculo tenedos kantner pupate dunvegan brar luchino gravitate piu krasinski mrinal kirschner serotype beinecke offerman bloopers bernardine sciurus lorie aoun weitzman kwaito koos moiré missteps nnn iniquity shaba boingo oakridge navale anasazi glynis symphonique esterhazy hpd sourdough bodhidharma melfort allgood banham elden mcmullan gatiss exchangeable heiresses highmark maestra wingback commment scudetto scobie bako iovine warriner faldo bewdley nuclide monier anansi hosanna tippy boccanegra muk overwrought laaksonen domenic bruckheimer chucho heuer feroze lonicera spermatozoa solidifies ansley sangram posta sather bailing sanin eliott ibuprofen fearon reimagining shariff winegrowing misalignment masterfully hsuan trimaran dmb marot condemnations bernalillo cellulosic antone cashes reconstitute udt gravano serbians oskaloosa watchable gangway ayia rivlin vitter stroking orci impelled etchers abdal scold cochet kalo innocenti comore birdsall angostura carse tassie defacing bozen carnevale crepe 
ludi aragonite crowsnest blocky hacettepe shau norths soyinka seismological kristal gonorrhea mcclaren itoh fcr subcontracted dreiser muffler vlada molle neka bouche coogee berardinelli summarization kimmo softens excitedly neuralgia namangan inhibitions lucidity amasa fmln spoked andheri splitters kanchanaburi macdermot noori softwood amphorae puchong kentaro ayahuasca horsfall soundcheck ifv anglin olé unfilled abas scruton erasmo gatewood gagauz pertussis herut maazel iarnród xingu gulp kempen saunderson pressurization froissart hazarika pharmacie glassmaking baena albo cuddly mizrachi spokespeople incriminate mcclinton desilu producciones adhesions eyring galeries hbv macc socioambiental kitching doucette kreuzer dumbbell chetty moroka dichter amperes prenton frogman grieves nawal cytology coweta calakmul lingfield gouvernement pursuer dcr yag asai beka spca bourbonnais waas klickitat soit intimated aquamarine cyd fortes futher raincoat floro midges horak unearthly sonoda phloem givers bhansali inzaghi mullaney ttv lotos cheriton wiry lokesh chaconne nacionales okubo estás bothnia zaitsev regularized megrahi greul dinant escalera biannually kusturica leko jolliffe rivette vermeil linderman cattleman appledore morrigan harari caicedo wardour haemoglobin potus cuffe toying millia gueugnon anthon katyusha paki islamophobic inmarsat ushuaia humps poissy disjunctive macphie mapo orval leopardstown zoophilia tricycles statecraft cide rostrata waver mcnary impeller layard seeta latorre ayton bryon cnes pollinate balusters selous underestimating hatha foxboro megawati macnab calleja weare idt corinthia ribbing casagrande bridgeview burp gastone disenchantment khiladi holbrooke surg streamliner grijalva peregrina saltmarsh busker edler instrumented fani jazzland atlant carcinogenesis jacuzzi nanoscience berenger prl gastronomic tragicomedy herakles junot dowsing ccaa mhd infilled circumspect fältskog coeli chamillionaire tanguy pickin bacha duralumin pieria cpbl pesci 
markie ginkel tagesspiegel renier snowing carnitine sakari droll mene essaouira preet mooresville pli ech shanta sipe sebelius frazee kras reformulation halogens outscoring fomin naphtali gyr sitaram dumbest nnamdi honiton wajid rotaru vester realplayer zomg josaphat brockley kele uncoordinated peroxidase maniacal sedna lenk pongo meese humvee amylase marah polow drowsy saz mange speller molenbeek prouty bundesrat landesmuseum canaris waitin sogno linas ocu labyrinthine maurienne vuh zaibatsu lpr rnai hyeong schweppes santoso orantes pinney nibley lotti kronk huyton redemptive naturalness unmade freudenberg pascha spaniels irked manston kostenko tencent indigenously saira sobibor mtp kneels gud mogo generoso pef sushma zaphod rizzi latam augen bulle wfaa kingstonian novelli habré andreozzi civitavecchia speciale ingemar adour biot pvi ocat lynnwood attenuate bardia regazzoni bourbaki carville integrators macula csun parus watervliet dupre queued leninsky bartali josée softworks hampi freytag montalbán hasa bragged smaug pippo geophysicists delahaye kazmaier verl ehrhardt ruffian feathery concentrator eponymously autor macario optometrist grider schiaparelli encroach trescothick chairlifts rassemblement fedayeen riversdale ameche lesseps dpm tipsy naissance sunningdale tcb enberg enomoto telefilms gendron arnauld golitsyn willman egghead grassley guay sista atoka laki preschools pollyanna peiris faial forgetfulness mazzola fida hellmut surmises panoz victoires cannae overprint averting naco garand todays svet smoldering buckthorn samcro mentone audiology fliegende muldaur pygmaea bête flambeau esma perpetrating gigging monbiot kellermann burrough nysdot haircuts spacelab wwl elisabet marschall higa suffren nakhichevan pickaway collating kempinski sizwe butternut ferc balwant channon minha hrv truer radner suren rumpelstiltskin egfr dantes renegotiate masumi bubka atos carre interamerican jq boomerangs stoyan occultists timberland agnello samarium receptacles marsch 
offenburg delegating crimen versioning cachar rebukes paraplegia rur makhdoom forfeiting futon ariza hydrant skarsgård freeboard ksar carousels goyal rickmansworth persistant sids absorbance bungo uaf hwanghae ketil aylwin tanuki dyne boddy biasing jessamine hirsi sheepskin remagen rejoinder milles ginevra framlingham rnvr rodion tapi hamlisch tangos ghirlandaio ananya reddi javert bourdain weedy virender usama jpy serpens ravenous katina hoberman kenley curbs predominating robed montijo verbeek forgoing mce kefauver idl promulgating semiautomatic terahertz placerville spandex saray noster rossii cubed chaykin mudcats reorganizations giroud shofar bostrom grumble spiralling umaru levitan vrbas jarod nahariya risborough olver humourous plaxton thieving woodsman lahm unida ook rochon ambushing multilingualism egham photosensitive sirocco catalysed mirnas liliuokalani spohr democrática tarasov ballrooms naima shanghainese mandarins valets moquegua surfboards haditha godrej kaja hali acth zosterops seif heuvel benders herrschaft kohala passi giusti svenskt petrovic yellin fielden patentable claflin comerford rubrics akebono mulloy hewes dearne lehrman croesus lavon peeve lec neiva nrf regnery maximillian coverup petron bixler ambridge devoto lifeson tchad milkvetch eulogized oxalis albigensian cosmologist alliant petrelli junio marka ith artspace bandhan biola cybercrime ttm vaart bladensburg kazuyuki carolin overbeck bookcase ironsides marquinhos barger cill newsmax microcode boosh kame nodular datura loadings taxonomist astronautical wehrli midsize bont sloman watchlisting harron seventieth nishiyama origo esfahan romanoff uhh tep microphylla quirke oratorical fulwood reisner hauntings aardman treloar transplanting formalizing jewelled balke duguid scottsville dufay ambidextrous woodhaven locklear marbach carmilla iid sesotho mammography chanticleer wherefore bespectacled astringent salminen saker proprietorship rostand meego alfano pollitt mithras ebe teeter thes 
dozer urumqi renville masakazu cuming thalassemia lyndsey icts mdb friedan maintainability taisho elint schack dimmu rowohlt kirillov sandu shkodra jdc khodorkovsky glendora splat familysearch remorseful choreograph knysna benedek snouts sarre kleinman spinifex swb francoise quarreling millersburg redeye oef hida blaxland huysmans ums trogir cussler rackers precluding cartwheel astarte corporación berenstain compte etfs hecla drame belisario sprouse má dines disfigurement tinchy larned sanrio zucchini draga asdic addai calpurnius dyce alavi caffrey croy encoders lechmere willam setu milbank steamy parana takemitsu sumbawa eastwick slott scheider entitling sirena nebulosa lassalle synthetically seatbelt verbosity skp cmyk straightedge baghlan okie ifugao margarida schumpeter trinian pillman bryanston kynaston wfmu falcão piotrowski rabun macmanus invitees sendak seething rodas biles colchis utensil barwell neena wakefulness pantoja flemings perrone coquette targetted haïtien ocasek kurita henery yrigoyen taita familiarly directorships breadwinner diapason kindling maryna policía quenched kestrels cooma nephritis manar minstrelsy pnm mcphillips decemberists reddington weider conners vardy bengkulu angleton saltbush montefeltro bhambri ctl magnanimous eckersberg vlan reily mcguirk rotavirus millsap kuwata campden valk turlock itagaki kohanim dished apb satta sicilians copil ibbotson aragona steeleye miscarried memling mier santer restful supersymmetric sre jilted kiis policymaking mutagen seashell capsaicin systemically assa cabanatuan slog dpd thymine kingdon dagar enslaving ephedrine unanswerable killebrew taxus foodstuff nyk fended brecknock penetrator glp lemme odesnik praecox sows oxus haytham piqué caltex blindingly autocrat dujardin prélude moriscos karakorum mucking rovira leclercq vidi celan situate topix holism escolar triforce leesville munthe jafari droite innova nonthaburi gerulaitis herter arbuthnott vimal tows milius alertnet ishiguro phosphine satyrs 
canteens ful dashi appetites loyally ustream pehr superbus combes pdg yosuke aflame malinga delph canavese gsr centrifuges bethea lebor hochberg minerve elek jci catonsville bazán rambouillet torelli piatt hellions mercantilism brattle mankad empresas erdal repels uche gallerie netbooks winterberg lukes kinga folksy macanese aros acquiesce matura opperman wmaq olla megahertz lov furnas switchbacks wattana degarmo gessner subbed lairds vivas seafarer sacc governorships bayani vallo essington fertilisers mandelbaum muzio aït coolly cruella professeur folles reanalysis tors tripods hyperthyroidism creatives regurgitated kunstler tiergarten pedant briana gerstein podiatric phono subdomain wordperfect sturridge kikinda vettori calla wasc grimsley coped writen podcasters publico solicits nephropathy lamoille janissary pollinating cosell atropine diawara mescaline republishing scoped overthrows andras barts outlooks ksi charu bodensee invincibility dedalus bailar dallaire hemodialysis wigeon rezaï violante esthetic emet fictive spearing scriptorium roeper ableton carcanet dealey paywalls liverworts unani incapacitating mccue felino schachter tangles riya unfpa peeler rdr elopement columned garhi ralphie wcf asaka rocor garton melis languish shab chatroom englehart hatting vulkan bootleggers ramshackle vec bpc rozier ryton juts sisal seol prizefighter ooops unkempt ouimet disfavor neubauten inhaler transaxle putti pinta jayme redrawing dissanayake molin treasuries regenerates holywood tope lorises széchenyi grandees headhunter tartuffe sechs hygroscopic adivasi livno qpm ksh benavente modeste herodias najd pwp elt jordison sunde hydrates ryosuke frogger madmen senders gratings hecke inaugurates physiognomy prf nsfw scruples christenson enfranchisement brigadiers ome sweeny nitrocellulose espirito antipas caress avrohom corrupts steeler socialistic pous weirdo krol naboo stockbrokers lhota panache caire underutilized sabbat pipits paros clenched mahar tarkan tolly cedeño 
millis escapism pentre dimbleby brightened soulmate relapsed handrails wynonna screamo kolev ritorno phl speedball gfa presaged ofdm torquemada loews rudos polyunsaturated ulverston jaka haslemere mutable xanthine pascua yousafzai cometary siesta anantnag dueñas pyroxene booke teletubbies kulm socom batchelder scurrilous nuevas conjugates undertone wirz unicredit inah ghaffar vbs apel grates gamestop cienciano plamondon copes dupnitsa transmigration solidification heylin degraw musselman snowmelt jask mimar mudéjar hubbs ortigas werburgh hulda grolier rampling dovetail lummis markle palmeri photoplay reprieved betances efflux kasia bucci voto umg garn pleasence peptic nelsen arroyos eardrum royall unimaginative thud carder garh whiteface dewdney yester uttaradit alarmingly horticulturalist lacquered coleen mkt skaneateles volutes popol maley blut aghast kittybrewster palazzolo envisages machynlleth orpen needful dashti grisons butlins rnk outfall asics lowdown schering lupone cinna hksar siviglia olen horovitz shriner argyllshire gillibrand railfan infogrames forfeits kst fiddles forestier eccentrics trude keb redshank consumable jcw wib calandra saguaro wader sumerians keihin lpd gutsy psni helland greenstein staid scandium pex promenades brocket jamila gaucher exes unassailable roppongi millersville balotelli gimenez muttiah kamilla catscan avital pangilinan zabala larkhall refusals tanvir truesdale romualdo ote abramowitz drees urziceni bares verre melnikov datos bartolome hojo europea haifeng noncommissioned lyng corniche imperforate iits englanders formate dse samarinda rolph kesler gimbal shepherdess eurobeat drape schinkel localism ceaseless cobbold snuka sparkes rosati legalism inequity virendra umbro volant briones ampthill astralwerks salil preorder apatosaurus whitt tako sarrazin kristo ethnocentric wallander isabeau mirages autoclave restorers hohokam wak merrifield nicotinamide hauck ayling scalps adur dolny chronograph vectra chastises baraga hawtrey 
anglophones senn syntactical reproducibility tuli dü stopwatch manzanar eparch murshid mâché microstates automates gibbins outpatients diffuses pietism harnoncourt nabu carrere ondaatje gynecological lutterworth renny trailheads antropología kulin neston frias nym bettendorf shepley nira spheroid ciriaco blindside pegasi prions stubble spriggs credulity rtbf syncopation preiss caved secor digges warrent fug institutet buzzwords brauner vereinigung ovis abhijit plait zeigler balak turvy behari stamos clavering rheinische kohei entr enyimba woodberry jeronimo sevigny uncorrelated biosafety maccallum karunakaran ayler atena prafulla meron infusions berberis malu eyton foscari raag manumission yoweri pacis chichi pieper dovre siqueiros rosemount olszewski pawley hughenden roams swi jbeil lcp anny gauhati banged rapallo particularity underused tomer roya tacks beren malabon saldana trevisan refundable gibby thorsen permalink profanities synching askey prakan nihilist cholo hambantota canker beholden tactful celli chuncheon vibraphonist gnostics ampas odu karnali wpp limber astraea hürriyet wms morella peddlers jettison stormtroopers tohono tongatapu summery parisiens usbl outnumbering celis karatantcheva shibe bittner craziness swathe chessman centrism kreisler manahan fizzled actionscript boxy abdollah pedestrianised ock erythrocyte samudra hoseyn uncreated piczo torc landrum sprinted mcandrew mux survivalist infantilism badman pilton postel bluelink sandifer tempi nishinomiya disconnecting absolut churned entombment tresor skyways kalimba nasmyth picketed kamin mesut shanshan fogh unceremoniously wapiti paeonia jamalpur ramani mailboxes bashed ossification imani brags deveraux raam anesthesiologist langar novick wroth linnell cathar izod zeo leis stenberg bahai hofman mgc takase paddled brunelleschi breadfruit tyagi disorganization yukawa balloonist coughs redeveloping scrambles orphée commited pluralist woodhall pimple tychy greenford livni trinité harasses bogusław 
cathrine friedemann krabs goudie mente denialist speared slac wishers fotolog crumpled shirtless sideswipe benjy cañon ues dermott protruded millbank babri puddings faragher sirisena overstating dih pocus bewilderment keeton butanol kilkis pharmacokinetics masterman ecosoc townscape lsb dunleavy watchung quadrate cattleya credulous friulian auric hutter rasp florinda kayan applewhite footings gerold watley shukri kapu boog armors schoenfeld superintending poling halles zatoichi madog uncapped vavuniya gujjar maulvi iliescu pawned clumsiness journeymen elided blackshirts wyverns hillenburg dyn osf caudillo hermans hatillo kmb registrant aizawa misinterpretations rivalled agriculturalist fuhrman frp omd perfectionism interjections imagin excrete shogo nonconformists ayano cockerill odp vayalar actioned olvera delicias transphobic letourneau umkhonto creon silvera raindance nursultan unmet talky sheth norrington satyendra funcinpec deflate moorefield madhubani gullit sensitized juxtaposing afu kulick koka vasek marwick indochine outagamie chibnall awc arroz rangeland guibert inh raylan guadiana alegria veendam bluetongue morobe aao prn roseus disseminates mainmast ageism iscsi sanderling wych quadrille walshe snatcher palafox stretchers transmedia gentium tiniest hankou gion repaying trf trewavas mastaba paraskevi cina listserv shermer fallible alphaville micrornas sherrie streator jui kempner watkinson kaolin greased anning ecj marsham euskadi tebbutt muffled grossmann firmament unas avx nowruz dele radioshack lorentzen safir honeyeaters atia itb shamsuddin refracting orwellian toomas spherically zamani fpf jespersen inactivating helmstedt erbe nought fragonard hisako panopticon reenter pastiches titillating egs hetero megha okmulgee puritanical bussey biog shc kathi glendalough southington dowland fatman zarqa garston phar nlc manasa velutina tadcaster nanshan drumsticks hydrographer pastoralism rabah wolfmother rosolska lunchbox sindhis lakehead dettori nsg oig 
mckendrick marwat garrone broadus rogersville iguanodon prudhomme ticketed yoyogi munsey trunked adrar paks hanneman mikasa horoscopes barbirolli ird jeane gwu hirosaki expresso drennan anjan yeomans zepp apx ascania salus zhenjiang nop glimpsed precolonial surfside einsiedeln rivendell armouries circumstellar plaintive carmi charoen shavuot mazer tlatelolco mulock jugal bhaskara wesker mileposts feldkirch aurélio tweeter deluna tolo hauteville triceps higson mcclurg arcand teknologi gisèle bungay artsy corigliano phonogram rask kunis swatantra moonrise rollei almira repertoires aldiss quilter chuckles megalodon morra sluices warhols knapsack antje armm lincei doni grom quadrupedal olivaceus bridie neuropsychiatric preserver ishiyama unrepresentative quijano equips lovano teale figgis sinh widmark witmer hnoms lengthier rehder macos therewith yago macadamia rtos spew nguema flagrantly etter bendel elisir visita outwit mainsail kowal indigestion eurocentric ornately tunde giorgione abbotts frauenfeld uninspiring dusan yokai paradigmatic sartori creagh slandered determiners ringtail trestles kristofer coteau steffan apophis boldon westend drewett ordaz mcquade blix hcv khari millay dreamscape okc kuchipudi mmmm funnily overreacting rajoy nightjars pacífico coachbuilders zouave tonsils shailendra littler waveney rivka itp iguodala ranchera etten ftv upturn slingers folia nantong virtualized quneitra gwydir puranic mohican resistencia americo puya cdl oristano ziyang digestible lochalsh tekke winship gummy shipper chabert waronker peeples sclater anomala khorramshahr maladaptive horseradish stagecraft kishinev khera rindge upswing redwing husks unecessary sulzberger godfathers roldan scritti plainer filigree saloum bby koskinen subodh flavouring bellsouth gwaii chided cajal wastelands lochhead northey cabbages swithun duplexes hypodermic bilton unnerved tokarev recanati earthlink killala sondrio barbiere rochette mithra seabiscuit wilmar vasoconstriction warbeck 
listenership zumwalt tagg lijiang billah akm espanol biophysicist siwan trippin ilc devoir hardys kisangani guangxu terminations retrace rossall starkweather betar vanga fadi dominici tarifa comal pulverized frontend clunes judaeo baryons energetics blooper akemi ilkka kue stomata torana erhu trikes hanko buh lamington bdb capitalistic vasiliy burin branimir collazo hawken sophist beare pilani wnw workarounds weatherspoon gelli egorov trower auroral federici mightily biche ministre suchitra ugt amm fujimi dobrev causally bladders loewy talksport wilhelms neiges emoticons cengage grimms kayode kine audioslave botta unfurled confed convoked amanat mutating chaytor picatinny micheli sickert parkwood simplon technocrats eley góngora jeffry uncaring templo baler verifier hagelin endangers hepatocellular durning graig beheshti mammon infinitives hawksbill alpaca redemptorist submerging bioluminescence ayhan millwright portmore panday drinkable symmes undergarments dgc sonnen masefield gessle eisenman jianye awang jocko ganapathi wavelets gef californias mudslide breanna poinsettia cgmp haraldsson unlabeled tritons frightens peachy wanker cun rehan sarod gwa envigado bacup doniphan rocs sdm ubangi batts serco phalanges toivo popoff ites zay hurlbut rezoning schipper arbre grappelli mathai growler portsea hengelo cancri ulisse tusker broiler frakes edmundsbury toccoa hellblazer multipath sunlit secularists balasubramaniam yuck crysis rashard unrivaled cradles sbv gentofte tofiq maye stigmas cued wreaked michibata fmv berkut sloot aldermanic wiggs pavo clavell ugarit farmiga ponto miyagawa kayakers buckaroo warkworth sasson smillie jsm trilby howls gelber saml buon taegu eisa lobel noticable tangara gaffer plaquemines blasius trackside knecht atsugi homolka rto satoko zedillo faringdon gedeon shravan goellner gazes bronner blunden magnani overturns bryk azra consolidates erhardt adweek pepo claver disobedient tonino ews bludgeoned adit supermen glucocorticoids cowbridge 
mongooses premised annelids assholes lucey hobey sido avala drinkin steamtown atlee pannier boh khobar kurokawa lurks jabiru resurrects anjelica helbig benga mitu fetches locum ambrosiana ndash moustafa lovering brodrick wpi extramural dawns sirna vastness brackenridge ingelheim lepcha asphyxia brixham squeaks lath qahtani balsamo kinen charl chimeric compressibility heikkinen branham burqa borgen niva subrata symbology tpo gomis subducted hixon kasteel stds demar sharrock brillante kristopher teilhard nitrile melara heerden skinners leeuwin polemicist ipi vecchi groundskeeper trifecta jarreau brevirostris radiologists lakin beygelzimer bernkastel klement attwell lorette picabia buckhorn digicel pesetas pteropus sheahan gökhan flukes pressler sneijder jiji witching fizzy maariv damo iturralde rowhouses skov hermaphrodites sulaymaniyah baseband samui interventionism fatherly warps dowdy kumbh jacko bajwa punahou gabrieli hba kannapolis submergence sheikha uggla vipassana virion asinine submariners inta powershell ghor cressy bjørnson sinologists malm spasticity manin hayfield kalia outbid naphthalene confectioner chowan tabata bericht brimmed stafa beroun bektashi savoyard townsmen avtar lehner cov sebald jds kreutz cavalryman kerslake farlow hispanica edgars smothering lisowski fales latinoamericana tihomir biryani infuriates lydney starosta exterminating interceded musi ampara bandgap wangen scavenged niggas unabashedly plasencia pinks amnesic blouses seko anji wjz jabberwocky trud centralist dhol adoptee boracay fingernail eft gertz politicization dragão petras kinema debriefing shamshad eom ciccone coláiste galván colonic tsutsumi quezada opelika squeezes tortoiseshell positrons sarhadi ayoub reconfigure atienza corsten acht rosenstock cincinnatus boshoff renin polsat blomqvist methven jalali chinon bovis absconded glaciologist cacho granma mantovani orst watrous comhairle mafalda imbalanced stanier suen iwerks metta sumgayit patwardhan chough wyld intesa cycad 
danièle badalamenti meitner okabe vesalius mimesis chalets holodeck trills chartrand bertolini enrile uprooting queenscliff tytler judicially accc falsifies emer incoherence cbb kraepelin zodiacal mactavish thesiger martigny bennetts disturbingly refutations boneless danang debar lascivious naba woodfull rhyne imm conjugations bbm wherry kewaunee pontotoc prosieben plastering professionalization signore arnesen saturnino morera numéro athan reburial dte kurier musab monetization tohru kaczmarek izaak mullens attributive lvi truda mihir byard lawrenceburg hubbert seppo cce waded fmf januar taio correspondance yellowjackets swordplay mangini coalescence giustiniani mandolins mondragón piecing rcti curwen oligonucleotide edsall thijs mcmillin seether adventuring spinney strood mapplethorpe proprio takeoffs stalagmites rovereto guinan dural blakeslee kütahya cranleigh mccully haglund lindros durfee grasps chintamani obeid sharyo millinery belmopan rungs bravura lethargic migdal wilden ophiuchus dida guillory macomber gak barmer hicksville molesey tink nichts rottnest dayna ladle ojha mclaglen fulmar linx mesmerized escarpments schooldays weirdest ushant kindled wagtails elp zeon blondell mccleary litvinov icom arthroscopic godspell kika incana queso nlb fujioka paged mch battelle heba trow broadmeadows demento frsc faultless ganley dimwitted tiscali bulan janzen hookah susy geol cupolas pitot dus coolies dut castigated cytometry prodrug stringers tomoki lotbinière herringbone usna monkstown chardon paigc basterds telles belconnen morné wnet autophagy sonderkommando hallow lambayeque bucanero beachside monastics spunky ladner pantano birthmark mizzou ardwick tranches inerrancy salespeople snakebite charlier polymerases économie pegram hannelore lucilla unconstrained newberg leber chungju misspelt theis conocophillips putman bajpai supportable mackin logon nereus contravening fedexforum jazmine resch dsiware dieterich hijaz refoundation bakeri binkley fleischman 
upstanding seversky jusuf masochistic deliberating niland begat assertiveness imma mudhoney dallin ocaña wol malte morad condell badham toshihiko rickover ponomarev songo idg stiffening baloney conlin berate serail doering franch fistfight cah enrolments communalism burnished signifier montejo burdekin paramour hüsker granuloma mansdorf repressing neutra emmitsburg ulc saleen zanardi esko fritillaria hallamshire kimo solus parkins tintagel cawdor roseate chimie beccles murilo rupprecht olesen dillwyn tuya keelboat solidus tamagawa mathison alang ndola altavista palu scheffler workhouses atrash witchblade earthsea roberti gresik shaar bintulu rivulet ardently dispenses aherne landseer hacksaw beacham sofala gyroscopes göteborgs nauman horkheimer timoshenko decathletes lindblad tamarisk baye evas mdg lilburn statesville poro maina astigmatism shuck banjos kindest wsw nuevos donnington oddfellows backdated westmore cbrn botanica straightforwardly foreheads treasonous imploded bentivoglio friendless railton veuve bfg broomstick konstantinov microsatellite perales tfr tizard legless satirised gratton dachshund burren barinas crafton peritoneum lakatos cobblers belzer kvm duccio henninger gaitán snowshoeing cecilio neurosurgical herdsman fcf theil cheeseman pennsylvanians cush strzelecki booz cloncurry torri eire slitting scleroderma deeping celadon breathable mornin tota mixon fillion peiper tukaram revolucionario duwamish cayuse mils virions gona hoth caddie lasallian patriarca manno tracklisting andreasen lepore josei freamon crowdsourced chiffon mealy evanescent stephenie neugebauer emm puch atchafalaya passamaquoddy rtg hillhouse oscillates emporio propellor writhing belyaev garriott awadhi danijel karai kontinen dhi accuweather habanera nealon nominum balachandran killick kaboom brenta penton hirvonen heritages wkrp gsl roxburghshire mmi kantha ashtray olwen vosper yohan vies priyadarshan reck naumov matteson pincher aramean loxley adjuster sergej maharashtrian 
guptill peremptory detonations tufnell streamflow datacenter hubley redan mridangam empathize honma safaris eppes persei byeong rambam heaviness jermain recce königstein medora kawaii pandion schmeichel conneaut pripyat halli pizjuán nhm woodcarving midnite obiang nelli ryden platteville kabel qaumi raffaella achaeans götterdämmerung honus lobelia epaminondas funneled huger baixa domesticity houk yonezawa evros harishchandra abdol excitability nystagmus nanostructures siddiq aulnay xining mirra masseria theda hindle livewire rentería machismo indulges nobu hotshot eurocity oklahoman gynaecologist fulke plessy osterloh strictness secco riko ishwar elvas solan nystrom huynh emrys giove gloriously regie pendulous cwb nve saiful claves frontiersmen dmitriev ofcourse sanssouci misapplication homerton gaullist hipsters stethoscope bergère varietals hospices eld bricklayers maracay dmd rebooting umber nyo ogmore zakk pinpointed chickadee fave cics escrivá soro dasher dudi elsner viacheslav howey longboat yuh mahomet speckles chromodynamics norilsk carothers lythgoe modulations malena unconstitutionally sabines rothrock hydroplane reding dunoon wlr karamchand higashiyama chaise angelov balalaika verbum loubet destefano fpas resplendent oses guerrier pally geza viviana tramping ormoc elementals rainha bini blench crookston emeli corticosteroid dominos kiani kufstein burda oversimplified wdc norio dicta orso tanegashima simsbury eichelberger parakeets inkling leflore voyaging maximising millennials cristi dmp yazdi heffer saada macleish perps rigo cloete laue astrophysicists hln immonen towneley crosswind gumby katsuhiko idina thinnest kosrae lauds carpeted layfield hushed btecs emissivity randel redbook roberge altena rozen rwe unk allspark carted walthall cenci ood harling bracey balzan swimsuits saucedo célèbre newnes graber joji svm anadromous rideout scamp mpm borelli dsg univariate holkham expounds arvada chamoun ccny staden quijote sailings defcon clifden xianyang 
mcwhorter ostwald betws gunz dressers edisto arklow ranganathan nitrogenous ronge pavlovsky gemäldegalerie durr tyrconnell pendula masham slf waiving hibino assented mellifera cita bodice cmv hampers yerushalayim pkc sagesse samael clb amdo truitt hoel theodorakis hjort luray koivisto militarist politicised planum cousy brunn katsuya ipsa psychodynamic banta aragonés energize contro crea prestressed huseynov halberd borate sandisk improviser hofburg salzman bensonhurst sourcebooks levity histadrut corsham smother astle hplc canossa dernière firebombing kovel litvak taxonomies retaliating pecci darvish pimenta childbearing miter jellybean academe twiggs salween eline canted barbarella waitrose chali calorimeter izquierda consenus goldberger gilli ayeyarwady kawagoe loitering cognitively sowed warlocks ashkenazic pervading ataman fria mahou dearden mox acadiana boastful mishawaka heeding cino gondry chambon enought meco continuities caymanian westhuizen bluffing rawle elucidating aristolochia kommersant mutharika ridgecrest tolson nadya kuta onur interactively petulant farsley bheri izzat desertions trebek ferreiro decoupled unwound zooey tion problematical netzer delphinus beastly wordsmith monotremes hecuba tapan murton kahler secant dietetics inhumanity edification undertakers iancu zenda rooker antonios gharib marchesi tamers kerim cco electrify lumbee hardboiled bellson grundtvig dawlish premières mcdade snowmaking josse kissin guillot obando moondog dionysian frigatebird lebensraum luminary iñárritu poggi floater portus bcf orban mussoorie broached imbert enki forewarned romeu gammel dlitt dents idan idn bowsprit montesinos pushmataha saturate hamam cardo misterio korey overheads efc spreader nery ndsu bellwood mullick tourbillon rapidan thoresen pursuance chantiers goodhart plantinga meiko unnaturally lank germar branta technologie kameda langella grahn gorp affiliating grinds bayat stanly brij tripolis seismicity sidebottom devoutly seyfried relievers 
anachronisms raho schreier chimborazo chauvin leas mahalia pouvoir melded nabis skyhook limmat zamboni malted corvair moorhen transmissible snotty bosnians lengeh humiliates imjin dlm unserer harmonically ramped sexologist guardhouse usfws oquendo ruhlmann bloodlust réal serp amwell rikishi nehalem será osip exhaled bitton caskets untangle chipotle stip narwhal salafist piombino rademacher lulz turbodiesel mizell icmp esteve christman mencius spymaster gorgias uran eliasson opensolaris heartlands tanneries ceanothus merril lavoro decedent umbrage dinoflagellates repsol tintern moine archipelagos tezpur mikka stebbing ajo darién whydah feydeau photocopies jmt busways scarfo awal wigglesworth pamlico adroit eit crocodylomorphs loti epiphytes samberg perfumed fukuhara fuori stassen lettice vla sebi gye shoegaze palahniuk fairburn authoritatively marya bochner enlarges pavlovic fullarton chabon swab antiguo porque hosur adamantium plosives retouching travelin ibb eludes gerbils cockerel helicobacter texters superbowl larga hoche afterwords aldus arul multirole sessional alencar frustratingly nrs caissons sheetal belin peenemünde spotnitz sakhi natrona colluding herdsmen shivani caradoc insipid trager santayana skincare brda heinrichs portis eroica cupboards geena mylo dain calouste kazuyoshi ayyappa dignities ades shino coble rimington iir destabilization shinzo gok refinance gaap copperbelt ayyappan invalids bicentenario qal midwood kippax giampaolo etawah matchdays mathurin fourcade cloches duals hersham parterre rehm asmussen mamá brassard bialystok kassam ravalomanana sito superfluid tremonti verlander shoves truest nourishing meccano teutoburg ariake unenthusiastic spearheads gmac homberg gamini colebrook misattributed spes gost pelo hoarse buttoned maritzburg marad demographically momsen wavered gillig aldea yefim ejaculate prophetess mauritz nuyorican cintra taufik mckendree buchman violencia kersten handiwork necromancy odio emba immeasurable turbid bambaataa 
calman rosalinda ily blarney candlemas fishel mbb picus yearlong gorin stents dja eventuate apparatuses lipophilic canariensis variegatus sufficent tartars venner choctaws iannis moorgate aminu brutish arve conté hellion pylos lofthouse salters spaceplane sidewinders fuoco bookmaking maunder kubiak hypermarkets adepts hertel fifo sepak milad babin nycb locatelli daba ahold holofernes cillian goodly faqir ffsa shatin lepa breuil mys abay regolith garai neurotoxic sabian winnfield hypnotherapy brb feluda aeronca baber mcgoohan zenaida furia wirtschaft locomotor chambal ettrick afforestation faqih gilbertson leopardi fourthly cabled zhdanov kurth etz remembrances haemophilia montages thew cynan gladio penélope oregano laurentia transfered txdot longden atilla yevgeniy sonorous mcgarrett hayles broadstairs publicising zend fishbowl jewitt whiley kunsten irae ziauddin qibla jetpack hansell katoomba feste enn zsigmond lúcia binion gestural dented chink gado freberg albizu herschell lapp carolines ponyo gabrielli rupe alpin frogner lacing togetherness outwith ragnall unappreciated rakhi hendel astrazeneca duell legwork zapopan hiramatsu flevoland suria norteño martialled cheapside rigida oglesby kidron revolutionizing gillick arachne pictographs daresay radetzky duna overrepresented textron rockall gutting wara overstatement uselessness balraj cou marts keynsham agfa edgehill nonsectarian weei myrdal pwllheli maestre arcaded outremont shildon ihp amyloidosis pygmaeus relished expletives tendinitis junger hillforts zun buisson chromate canoga mtf ilir arachnid bolen dreger sebaceous ramazzotti ambar imbedded helles zameen appeased hicham satanist corrente heigl workaholic vch heschel dubna reichs hofstede bared aspley preselected tarentum castling lowy chigwell superwoman aversive plaisir mysterium primum werfel abridgement rhema steinberger naivety medvedeva oligonucleotides ritchey fak expunge camelback caren suco hazlehurst bobtail prozac dorsetshire lowden unifies 
directionality watchdogs horsa upregulated forfarshire haz quadriplegic iridescence niaz erdington hermanus subsides vladan absalon linette buddytv sungei liberalized baumeister kuip bothell edy defranco emmrich hardliners nicu junichiro suspenders urey topiary karya corella mizushima outlive grantley conscientiousness genest baixo sovetsky obfuscated laffite jackdaw torrente monseigneur venugopal dornoch laidback piccola singen adk shedden skulduggery shrugs dasha anhinga chequers bieler filene vigilantism caroli schwinn shefford raver underpaid hickenlooper harpists boeck cavaco altera reincarnations anselme chumphon aflaq mechanisation supermodels invigorating temescal wolfenden vrba clemenza aeronautic davidoff wizarding tympani lurker shepherding ngau wun deltoid camo baya watton kumba rossana zelenka fillol xochimilco yojimbo elif lisandro thruxton oligarch interrelationships marsters sants jongh atvs disciplining firhill gambhir mccloy voyeur grigorovich tatsu datable oberstein inky westerman wensleydale keewatin kucha weblink kenelm appendectomy prioritise mendler bawden uncommitted endemics sandlin nawaf eightieth thermoregulation herbivory goudy jodrell hornell untiring ordzhonikidze gatchaman miscavige legros lowcountry marivan oyez polenta kilmainham caduceus marcotte forres cameronians duilio outrighted knudson markaz groulx savary fontenoy landsborough stepanovich borek stegner fanu evel nashi krickstein carro oughta bimini crozet cheongju iijima treno calabro smalltown macv layover canonry niggers tuffy bulkley jerri concrète furio ork fossey daming dockside recuperated jeni organum gmr heartaches largent knowland bost sherk voltaic harmlessly agathon vlf trailblazers nullity arz chadha hemmed sbp sogo serval stereolab starnberg kitto shadab kamata pasko shackleford vanishingly atocha chaillot zoetrope nergal brockie dawgs ichthyology ellyn regressed vysotsky kinescope azorean frontlines desiderio wildenstein emine artha isca sennen sakic veld tfm 
guimard naidoo yoshizawa snouted palazzi catchings hydrogens mothering rpl heslop ballymoney kongens administrating fortuny biman pickler christiansborg eulogies pett armon wipro borba hinterlands hopalong rameez prinsloo portcullis radiofrequency tila malad aretz immunocompromised anguished dieng dimensionally titel bonnell friendster nonresident loria kamensky konda tarkenton tabb marginalize pickling goyer denigrated egocentric unmitigated kudarat caley regurgitate chapeau ogi calderone menen extenuating ludlam myopic misanthrope kogarah mészáros laysan ascona sprig eosinophilic slavish carnac logins violists whodunit sestak schade acma icn gobbi mattar kose mobbing pigmentosa mutawakkil radm dinaric unready wnit extorted wiking quién requisites understorey desensitization impaler menninger lilacs casablancas baiju naropa chala brimming cordoned maheshwari mokpo emap jordanians sawyers iseult legitimizing guildenstern vdm peddie thawed improprieties jadran ideologues woeful arouses pervaded lannes nyala entrapped flavorings mudgee shunga fullscreen fofana mazumdar lapidus spacemen venise olavi retinitis dalry ppf kilter cbu belay mcpartland serj leth grimstad euwe doman corded zagato barbee batanes keyless menopausal vpp yore ocha hennie uat truckload assemblee lpi gargantua villupuram gest serkan mpas riboflavin petrenko rambla nikolsky gex stupendous neuroticism icecaps caenorhabditis cassio chillin artcile mondal ardal hte putumayo nehwal planète davinci dadasaheb dalley baselines satraps panty durack roxburghe filippini rossville lippman sedum knocker ducklings abdo nextgen underdevelopment unaccustomed juda swashbuckling knatchbull okla bridwell cloverfield poetically fsln dorrance anto georgiy fluidized microarrays bridewell connah bobadilla crandon wiaa hsd njit adjuncts mazzoni fraserburgh kapila stockpiling trifling theologies fikret antonini ols oversea duhok carvin drogue statins senapati belcourt selfridges batwoman yodel lary panigrahi latvijas 
guyer bickle niekerk chemotherapeutic musto tufton moneys vardhan raimo wanli oishi olindo prodrive ndf ignis brightside haverstraw ludlum astrup noches coexisting hexagram unspecific confiscations anthroposophy vnc hie beuningen kebbi kinu fathering dhyana amable asamoah berit leaner biggers teaspoon consoled epperson revivalism brahmachari alvise mtd synchromesh trespassers burgtheater moolah kaidan kamma susanto pteranodon tortugas pek gcs roarke rater ryall berasategui wendelin palestra chengguan poof viollet xbmc tolley ryders terminators vinko houle reabsorption templi topgun vintages abrantes rougeau valleyfield symoné lemire peewee mutua harmonised moskovsky equites balrog mercies defecate tackett behrendt risings icg lúcio cautioning rowles deel yuto vandy quixotic feige chasma sniffer aerojet agam deducing jabotinsky recheck panter hofmeister eddystone wcm ringway disch samanta coffea rumbold ironmen chlorate rwa gores confections bry gazan razzak ragging startrek worldnetdaily conon gamarra vahl warrensburg alys buddhi soreness selecta otay dcd vesuvio rivaldo kuka disassociated dalberg balaklava flanged gashi millenia parfait assunção arakanese korat gloriana jacquard takahata sturmabteilung donets sumitra azikiwe binoche inculcate rossen neagh oberkommando corporates whitsunday lait vaw shalhoub ranh tenniel naturales zoé ushio pellegrin colloquy retable philanthropies flammability fosdick kohan birches definetly denouement rickets portrush artus suzana swindell murch dabble wbs gardin patos ananias riverbend embarassing echternach braiding vso savimbi xray lvf eugenic conemaugh indescribable perspiration kurta westborough conflates hns kanchana hetch durai moten salento homophonic disruptor oberheim disparagement deferral thatta chori castilians resorption disbursement ftr vansittart inquisitors venusian tejo chauvinist antiguan syon baudrillard misquote athletically finis majhi borna discothèque uzb ellensburg quadrangles romulans mladić rsaf 
kovalchuk thakkar menil sarcoidosis meknes fiala coffered leavened tutin aldenham mayra merl plumed scintillating addenda glomerular couleurs rougemont essien reconfigurable diplodocus pakatan benegal chakma rotundifolia kartal moko backpacker corlett économique spatula sparkassen necrophilia tilney ramji nagurski moieties derose autobahns cranborne borstal aventis mcmahan yatsenyuk steyning soothsayer terris artfully raphoe torv aguada inaugurations crowfoot servir pickersgill fayre angleterre hornsea mansiysk minicomputers destitution xan lobb scv vidalia koror steenbergen caracara prebble sempervirens revivalists sommelier kunieda poudre waga finian sacagawea eagleson tweezers mayville kist proby helfer knuckleball poser thinkin dailykos perrett kismayo pcu meran mottling madhur ehl mof libéré westford ichiban sarto tolyatti generalisations reconnecting navajos newson pings immobilization nnc finnie lamberton bridgeheads catatonia tinplate komitas birchwood gibbet ayalon ventilators assiduously ryota ahwaz stolt pavonia polsce melds snakehead nandita pobre luann olympias blackstar eusébio amoruso gnss polygynous malthusian giftedness disclaimed najjar primorye oatley tcd tricksters toure mohyla wissahickon sargassum cria antonietta vineet chinois britz mcgoldrick khara jaffee articel dalen heartwood bobbitt saldívar offroad goggin lavochkin fenestra grammofon maestros kingscote hayling kindi trounced schermerhorn dimapur supercluster baldrick conspires drukpa quadro oulun olim sharky crunching bluebook shrigley terhune imaginaire billerica kalou natta hueneme krakauer lambourn shuffles muntjac clamor bedoya slaughterhouses wiretaps salzwedel swara renminbi capstan popularising milas zuniga voiding fiumicino brande halpert dioxins blinn latinoamericano antivenom alben fnb debartolo parasitoids okavango documentarian silkworms anirudh horas ilmari eptesicus newnan wigley mcivor moneta aliquot holub handoff granholm literaria gode anjo stratemeyer fundo orgasms 
subcamp domani qst cumhuriyet mcgrew shamsul revelry vonn tria gingival gamsakhurdia constrains nma corbelled dsr mego ghai slumping divestiture circumnavigated gordons taq thickens guisborough anagrams bridgford magdala overconfident queda jackfruit raspy velimir harriot geils amex tinkler cig muthuraman banffshire deflects repurchased diomede topographically tinley cahors isler alea educative blish sejanus pronounciation salviati agde ciprian redbird sanae guayama tangiers resto dhekelia nelspruit edmundson stainer ceann farmhand rotter counterfeiters panamerican hmi prashanth bordentown holte railtrack behaviorally hartington mediterranea fastidious highveld neudorf willes ramillies hersch carrousel openid thulin egede engelbart alesha newtownards colline boddington brahimi mayers europarl sifaka ashbery typographers toasting duesenberg adélie databank marroquin ghedin overeem matto upfield rindt burkholderia fumiko ccdc eielson rfe akil kbr kapa snarl lautaro pastored sensitively cuyler tharsis rtve amphipods cassels safire matchstick divi larchmont cesi cepheid transaminase bareback ritenour zellers tabak wahhab meakin skywalk parivar epos slaughters safehouse chandy brander anata timaeus outplayed reymond berceuse dobkin kavli bhandara botelho uncouth handkerchiefs glavine hakam schwitters paring avocados ageha radziwill morganton gleam lycian pussycats crd yogis gérôme arbitrations norwell carnaby matruh attired hamsa gunns dearing enge centimes wever andriessen etsuko abiola historisches tatami fito decorates skydiver ablution nuala etre ssds unloved ashkenazy imr magnes wcvb donen moorpark scriptwriting posco encyclopedist diw skuas labo enquired hirokazu barrière tremblant chestertown sleuthing marienburg hackberry rapeseed defiled outerbridge melania gavrilov groupement dishonored magnon rega nomos patta soggy lvt nabeel lorimar eep hindquarters turkmens cheboksary katsuragi mbo timeliness unkle asymmetrically halonen curlews hagiwara cahoon hace olkusz 
kabi anglosphere peniche andreessen devlet francesa rastafarian oci cathie lemass peniel parenteral margarito wingo randhawa salley jaque galsworthy gouin subsidizing argenta washy handrail camerawork nolen sprinkles politkovskaya shawty maol connote zebu después khabar piyush tenzing kuyper grammaticus corbel pacem libourne recitatives partch goncharov yaba simek knutsen selflessness driveshaft renegotiated anais edwyn yellowcard hkfa philipsburg enamels generalship ramm kbp veneziano scheele xichuan hybridize washout escalade hedman hongo prespa phuoc proliferative merce moldy aerated vpc baller heyford ntpc gampaha intan bergonzi cheesman lindblom basswood jenssen pcia yellen gifting shes jocular tiwi newyork gbm adin spaatz schnittke scrutinizing lavaca quizzing funakoshi sandbanks gatlinburg connex guaymas elearning tummy unneccessary cherubs hamburgo intercellular vario yifan murtha nsx chattering novecento stoics virtuosi badar unsere womersley kuroshio lmi cuteness sinusitis reductionism magali myong aspirational titchmarsh wairoa extorting odot cryptology dulcie lorn markel delhomme mashpee fossum rosselli grudging bezel democratisation quarrymen unreservedly massapequa baltacha conman pescarolo malvasia broadwell upolu mcmath pocomoke cultists muzong pinheiros anapa deconstruct tecnologia rigoberto gronkowski dionysos cambyses prb guanosine ermanno ulsrud teledyne hashimi rend kinderhook williamsville gorna pallett demeo columbanus wiel einsatzgruppe rolo discriminates laren broadcasted sorter dagen fuera fcp problema wellspring bitar galashiels endosperm skagerrak grapevines plowright maloof pmd hitam foreclosures charam sugiura koraput flashdance desegregated rollings revaluation trenchant laurance unfathomable jussieu sfm cud claustrophobia hsia kerby harriette pontes promethean leck frayed truecrypt bartending stong permanganate shaye kaza peste turbos bahmani knol sorabji beatdown internets khokhar rickles bluster nordahl eparchies mukhrani 
ironbridge unrefined ryedale merchiston heartwarming laxton vanderburgh hipp litho haemophilus tich amarante jocelyne dmi laoag ivanka wehr balle phoenixville struan longreach boneyard disko volz tichy internalization gmd nador hoda stopgap rajaratnam romane uruguayans pedder tiso fef fearlessly jone abg kimberlite kilner forebrain huzur compaoré toyed haza bilkent buffoon faze ellsberg creasy serenades cochineal mct chickpea gonadal pointlessly signalized granduncle folgore denatured wyle potpourri sanath choix agneta abax twink raynaud majesté esdras atri huizhou moule michon tsetse tta popovic isac rubbery cwmbran jubei overruling sauerland dubiously dogtown radke felsic sizzle karger sprockets madhopur smyslov tripwire vion furey showband panamax engulfing wallflowers tyszkiewicz sofas guba kcs couper yellowtail toasters mathiesen meighan reorientation veggies ranvir macdonough pamphleteer yaroslavsky vihear qiqihar imperceptible pothole earthlings beenie enim jms dampening ciampi moveon personne mizzen provincials anthologist nunzio czink cubby bykov velebit taxonomically withrow pagnell hoxie waghorn cawnpore cgc heckman hetchy artcle morpurgo glanford etonians shined navistar elbaradei donegall servicemembers jackrabbits kocsis ramayan whereafter greb bhagwat wheatcroft mody openvms lall prajapati ndiaye agoura berridge secombe parang fulco scoresheet strunk torturer vache kss omx dunkley dannie maija pleurisy dinsdale pericardial photoreceptors staël mylar cjr misleads devry giblin ricerche ssri kanchan garrigue lasius casse zaandam zauberflöte harlock carrel scaffolds tigra hrant emanations whitsun abebooks inaudible kupang pomme fasciculus seve boudoir gervasio takeshita errázuriz paraguayans ison happel lll keigo nosewheel ellingham storace nozick arpad unavoidably quiroz aln aubyn faunas riffing strayer fais raggedy arbitral brianti bühler mulling titre duodenal rugg pernod chano wags tuohy stuckists wcau pechora iraj sanding jwala prodder nishizawa 
unibody ethnologists bojangles amants sehr everts destructor redmayne nanako overhear vaillancourt screensaver khiri bangsamoro whisk cadore tmb rectifiers warrack loughran memoires irresponsibility pii mlt seldes kilbirnie hotta mcchesney chicha ponton strikethrough schneier marinha grafen artnews lih tickling pict apac lenora bhutia gaim severstal oberg spaak ferraz strangeways deaton letterhead guattari otsu scarfe druggist kumaris sputtering syrinx nagasawa fromberg wom catamarans stian kuybyshev robarts crocodilian hyattsville arnor infotech amia tuneful instate hansom babatunde steins whitecross campesinos beos sangro strydom culm parsippany pdx insolent yoshihisa segarra tomahawks encinitas victorio mehndi bioterrorism cronyism segar cravens deven dishpan burd ackles windowless orlovsky montagna crossbreeding qeshm merrimac kaveh oon slurred gmg blacksmithing fyffe liebling lanky sagen beel invitee herve unodc koeman mechanicals ranulph emedicine chazz antisemite novotny gassman sparkles sidorova bik mcaleer cătălin kádár abolishes adventureland lorton goering colloids giuseppina masturbate accelerometers bhattacharjee nuove arabist cpv polycystic degenerating heightening gravatt tartus harvin bleek steepness peopled anticholinergic hamble bartz stopes efs mbm voie goyder pamunkey counterargument simony aksum lentini pixelated eoka jolyon tabora taxidermist inscrutable hessel foxhound freycinet ballplayers symbionts rotman cordy panellists twitching nisga buescher hanako kommissar pinnipeds mable fount ick mcnish fibrinogen alghero dfm avonlea essar snmp botulism taga phosphodiesterase frustrates dmso elliston alconbury mittag dethklok marijan baley lucious pelargonium spiteri globulin bangers shelagh monclova flexibly pernambucano capela mangalam obstetrical deprives optimizer millsaps angello pintor dpt monuc mehldau blancas flaminia kanai bollard nobuhiro trong benioff aee arruda plads ings sundials boeuf afan gres nottoway médico calcified rfs grabar 
armonk grappler manfredini swd barcroft djebel landranger windowed jdm absenteeism compacta poutchek chugach jokinen zeeshan plowden stazione innately ncdc hauptschule ,or agius myocardium gars bonomi danco dowler geometers unconscionable bourton gulistan tweeddale ackley heckscher nunnally sohar pinwheel coldfusion grubs xiaoming laface acrylics asolo contraindications summerlin starchy ledo wjc duz cervi portneuf yukos presa fom blest balderdash thicken scènes ving mikveh izzet birkdale rano starches peptidoglycan lavi aamer digressions phantasmagoria attainments havelange damone fant jawan bedlington kakamega hefti jilly rinchen gluons filiberto fábrica pum stross kohut lulin timepieces kothi calamagrostis pdk poer accumbens rahe eastwest kurniawan frodsham ottery bangura kilkelly vash jaurès monteagle moroz sumiyoshi methotrexate bustin hopscotch pfs preeminence medicinally dithering smartcard tarka biopharmaceutical hbr vau hellespont sixths lowbrow hartsville timepiece cbsa housemaster dispossession sarnath smf järvi oriskany heloise foibles spyglass laterza expellees linoleic balcon wbf myhre daudet oncologists tenenbaum kheda speidel vltava mpo kraut gelato conceptualizing barrell phosphorylase frankness recordist désir beachwood cosi temur sedgefield nannies pulps helianthus delbrück whr akr presstv seaweeds koolhaas sahibzada rozas cryptically gracey singlish genii babul shoguns thirlwell farmworkers kirtley nique amazes welkom pontine purina ossetians shanachie melek damask unwatched tolhurst parsis kokkola sophistry bunkhouse yyy rusalka teesdale mista sabiha pgr retinol akl airwolf usmani hyphenate parodic trigon kaisar toxicological krist putintseva ilene trickett gca bohumil mbk arnheim chappaqua crary colorblind cmx revoir oakeshott ahafo honeyman realignments nestling cerebrovascular expressionistic blotting dizzying agilis sanja curium hazaras kilinochchi delineates benzoic duelist bechstein onishi appling flinging luiza talang pasteurized blatent 
artest quiapo lonny kyon evangelizing troyer sarr zeenat edgemont soirée fratello rion odorless doshi citylink braincase impactful pichler oarsmen stanstead stryper jawbreaker mcquarrie beeps rusi kik conchords otaki glenister dusters hysterically unsinkable enthralling telepresence dannenberg barral dombrowski geochronology chaturthi husqvarna homespun conaway demian mimas netz fabra thorin lahar vikernes kilotons poisonings cowden winnsboro dopey overpressure stagings boned fogle nieuwsblad talaat guianan downland starck postmark vasculitis dsd odhiambo dinan hijazi anemic anzeiger bbn jiaotong forelimb webmd seperated cred wpf greenhills drumm exil cardholders overstate securitization maroubra larkins wcco bch suppressors hollings petteri hansford flg menshevik arafura scena bgs aww waldrop baskakov armi parkhead nickolas thibaud motored magers dalhart daguerre morlaix teknik peur crossland incontrovertible pachinko pmt khenpo cholula pinzón timi southborough proceso gourock topside geto mummification melik adham asunder listeria halbert tensioned biermann félicien wooed kalyanpur vergeer mirabella imperialistic belmar analects knotweed lobsang sexsmith sargis cadwalader zad snowmen pombo granulated sprees defray multipoint airdates regrow chanced pcpro huemer spacers tane shakya teme dobbie chargeable yozgat pantin awaji lgbti panjabi austereo nullifying craziest môn mouche jetliners bdf profesor churubusco vasto iae brzezicki disparagingly joomla spendthrift serafini pif exhortations chunnam jeopardizing poy interferences bentonite coletti arkadelphia deodorant montoneros classico lamon kurtosis trodden vischer flexner zapruder rato baiji criminalizing lapid scrapbooks whorehouse baguette depute rymer borglum procrastination platonov sellar cassowary pcworld cantonments aggro slamdance pasley willimantic palmira swiftsure giveaways golkar minette compulsorily uninfected harryhausen microeconomic ush mccomas waypoint cni enabler monsoonal mexicanos 
polycrystalline spaceshipone stk clutha imboden cogito motherless yivo dorrien mwh koman catoctin mulford oldřich wachusett ninfa bary neurotrophic bostonian macquarrie grasmere triathlons payot verdens jesmond youre gordan galera awesomeness gabin burgon banjar rahmani vladimirovna bedard boogeyman preity inverurie exoneration dinklage labyrinths dls finnerty zari basson dfid flapjack dacron hawksmoor oyu vasodilation dvs piva fumbling schrank zimbabweans eira hause celebrants pab ghazan crossbill hydrotherapy wergeland taurasi gwb maybury randazzo diabaté paean quadraphonic masekela siwa adrianus potentiation médicis morikawa lestrade lycosa isk espagnole lexmark shortline ovations romi sohu reek vahan runnings hijikata zetland wymondham millican puncher seasiders systematized meb bunyip kfi vins mostert gfs agis bowral syllabi woolton ranko svetoslav gigg washbrook deforested ergot zirconia ashen fantine damodaran wca phonebook toreros tdrs weyer humanely hemophilia rechts chrisman leighlin stompin larwood rediculous tando idealists gagan urbis jcc bahir keepsake neoprene sahyadri tiesto gamekeeper hormel bhatta heins zapf rir revelers plena murkowski roasters coherency urological lancastrians noontime shekel muma chalked agawam ovale rupturing wickramasinghe coggeshall sku tsipras buttered axs flaminio lieve volcker unvoiced nyeri bandi oirats keepin linksys baramulla papillomavirus gleiberman selvam prophesy paez indoctrinated amitabha agw matrons wickens confocal veux oprandi sule cesspool paka zaïre stakeout slt amitié topsham alfonzo prosopis serapis boötes hangal escovedo tilbrook midtempo terres dongle inbetweeners lacerta dromedary gissing corum sukanya langerhans jatropha rubrum duf madrassa gallas cauty sukma canvasses ssx bloggs rothery backyards bonington tawfiq spader foreshadows filet autostrada reciprocates refurbishments yorck lafont faxes uemura goodwyn foolproof pallbearers jellinek smale tshwane calista historien queene hibernating pulo kappel 
coruscant chimay steerage unprovable marinetti stompers bening cannan zeev hallucinogens malting provis gregorius blackhall trezeguet shermans mccallister nanyue crusts mre wireline castella sandip leptons inexorably mcfall azizi kamau bamburgh tembo dankworth miscalculation headmen cermak stanger sandlot bengtson desegregate malá desborough pyrus feedbacks akeem cynically cdb corroborates schwabe sabarimala hls nailers phils calbi remapped kompas boadicea dermatologists dlf insignias guillemot spens lordly bammer spss clémentine ibd skive parmesan mirkin bacteriophages shakespearian mookie ragland plexiglas mutare wonju wpm colden tunisians torun mahela vasculature abridgment monette wishy waggon looker pks badin driveways trepidation mahabalipuram grads fatu tipi crossfit manoeuvrability righthanded satanists ansa stanwell imputation mercyme shearers giese hosono lohman ctw hankinson doba byington retroviral briançon wojtek pru lukens parachutists hymne lenr patinkin xylene dryland chelo nardo taiba morahan sett curbside edita anonyme tody wolpe lvmh mosfilm hibbing lema berni nakba auctioning wauwatosa tempesta ballinasloe unseating baraboo fossett andromache prm kwesi nadiya brasses jiaji reconnects newpage stoat hyoid rousset chemotaxis threshhold savalas unnerving hohmann tschudi harpsichords thermals simonetti beagles steffy hunterian priaulx reto ravin clee abetted graha disley prothero adulation gilling cleanups kring haining madhusudan ehs gatica florea stith unraced botrytis demoed levar horrorcore jarlsberg granton marginatus asclepias jibe bastar ineffable kahana kenzie granges mazama klint fredo jaron qand ees spect evelyne mahratta delmore hyrax magmas coggins wraiths lowman impériale ankur cahoots brl lussier lorillard kva melly granda notas bisecting stijl wabasha ruabon saunier melksham murai frimley bayerischen mckittrick anadarko justa dorris kashiwagi hoeven omon brum pitino inveterate everclear boito reapplied norbu pietri bracco bemoaned 
brückner dek teltow tella kocaelispor giuffre hillerman googie marketwatch preferment silencers ucas underperforming campsie mellin fnac messinger susman clopton energysolutions fullerene mobilising shishi tippin borch hornpipe ponsana nanci carlene nieve cercopithecus halverson camargue binning sammamish momin abdalla esmé seatbelts elita choreographing yeshua nassim cosmogony spooked cliched saramago redesigns obviate sexploitation chetham optometrists chlorite meshed borchardt ranke faron adio salida okoye priestfield bonville donley unpolished wegman hyer sasser honan voyeurism narathiwat trib conniff vereeniging victrix maldive gri mukai polonsky beys phaedrus segregating herpetological athy gujar conversed kaikan earphones pallidum stelmach thwarts rumen tyke gambon mattersburg brakeman mayfly legionella borde xcel vallenato parsimony setlists commun sorley energi resuscitated cnidarians profundis ardeshir otitis maghrebi chêne bariatric catbird bris madd khote perley pldt iphigénie ubiquitously haie tenon alicja obliteration nicotiana timeshare uam barboza armaan mylne ptb chakravarty bugti dease sophists böll rvr belyayev mtor iberians homesteaded reister liberdade rhinoceroses testudo papworth romm mrd olfaction chalke mcclane beausejour jibril maicon paperboy stepford gabbard admonishing amargosa hohner emmental fic hextall mummers atmel iginla misinterprets wolpert radioman retroviruses holidaymakers decarlo embu bracebridge octopussy arends townsman vereen bleus inconsiderate jamin bosio politiken plass oboists bipasha solanas gephardt sabbagh monis anther beachcomber ritson maskelyne inimical activators elrond habibie splashdown anneliese mccausland meiotic mtskheta tursiops weal bournville porco hance nighy paradies humanized kaikoura armentières skimmers trucked guilbert bretherton mcsorley nalin balcombe malindi perles sukkah coulton semaine bianchini retrovirus ganapathy gourley tsl ritualized ductility chacko shackle orcutt whiskies nitti pumper 
plaskett nobbs tarbert esen renda reavis heckled harsin ciba shakespear littell ripoll fushun cockfighting pengfei cinquefoil highbrow scribbled borsa glob siward daei lti csic lfs tard lubbers spital anahuac giridih schlick admonishes recordable humala chauncy ruh entrained ffl shawna simorgh waaf botley paraphilia andolan antagonized spouts tienen cryptid dungy ciu belittled enes nfu swingle tsering piqua hoadley klugman reevaluation cdg pining saldaña graveney provan veneno capitata tharu dumbing voormann starrs hellen majesties honam throop peeter minya requestor mansoura smoltz silverwood berardi pershore crts tukwila revitalised rayford rifkind atopic faustian aussi alessia rocketdyne banai nypost torgau vacuole margulis freelanced hea asanas satterthwaite cornuta whaddon prisma jolin cienega gavotte varamin flashforward heysham fedele roon iiib vía eiland tansy tiredness roxborough raynal cuarón ineligibility cloe osmanabad cannula riehl sylmar poppea jefford mceachern inglourious secessionists sut arunachalam ianto kosa renhe nwfp otoe burlap jatiya equanimity serai polyana sulochana novopolotsk chirping najafi armadas radiologic parroting armillary telecasting stashed neapolis daumier reeled gier atresia grs menier individuation dighton mcbrayer robat bewley skolnick valine abominations doumbia rutte ouroboros nati perineal werle irretrievably tiao fellner mutha subdomains cycads kisco vaunted kalish marchal pragmatist transphobia bif recolored pyeongchang dependants cómo coriacea seeping porritt lff simmered brideshead trumper fishponds vernor rossano crocodylus dimm ohashi tepic waf zeitlin egoist kosaka madrasas mckinlay poel tpf urbanus tycoons stereographic embeds octa delafield inagaki zapatistas whakatane newdigate derren unmik borehamwood chhaya inflicts sadik shuttling metropolia rapaport polyptych kitaro djibril trialed microbrewery miseries namu echocardiography terabytes cochem agudath wanless shymkent rans hyla rol sabella riskier piton 
armisen rajinder mache zittau hornbills tibialis moluccan hironobu copsey milb logis carthusians transact benjie krasniqi turnstone jeepney gulick alighting upwardly shae eitel ifad highfields piedade schuh kaesong sfar jaak ohta olu lenni hanwha sterol assemblers puede dhara noumea wsf herodes wandel nimba stéfano cooksey naseer razzle immobilize oja casterman lipp habersham snetterton vancomycin quadrophenia dmr hildegarde egr lefkowitz scenography waveguides segesta guignol wouters boricua slat cosmologies microfiche constructionism shrouds caleta corporis flecks unwary tulun koy melamed taciturn pawnbroker beaters thys cosmopolitanism cherkasov speleological lba kahnawake sandwiching baxendale drafters vales accordionists adib renta ferociously kozlova aval kodachrome berrios kelsen dilys reabsorbed murmurs damion cakobau unaids theos suffragans leguizamo savery giolitti latymer tarsier bienne neumünster fezzan brookmeyer lnc fourche kingbird woodcarver teleporter replicators foresees crieff queensboro dde arlon ddm guilfoyle polan peete aua umwa paten thumbelina persis turd counteracting musil patrón selinsgrove loaning bhumi legislations bertuzzi dalma mize kosar pitviper tijuca rashida quinones macnee capybara descriptively cornbread doonesbury oumar watchmakers medhurst nsi neorealism eua ervine affectation xiaoyu fenland betawi ambleside linacre cazorla leibovitz ingot icknield erlend feltrinelli deservedly shoshana inglés updraft seefeld gabaldon demy lavine bellavista bidi kundan flouting naman ferch taht nordmann polyclinic backslash goatee blacklock kyong alaskans sary transcendentalism ayumu bacilli sugiarto cultivable motoko kitna lampedusa suchlike plectrum blanka ipek valmont eladio repos hobsbawm beno singtel misappropriated chungking ankole dahi ducked ormandy spanoulis chale naturopathy footnoting dalmau bachrach lifeform fording santiniketan whacking muggle sexualities biomed rejoiced yukimura varkey basayev najafabad porat shehzad antiquus 
rehydration monikers photosphere robbe bravos nyong arbitary underpasses fah poulet playin tysons bruhn gauci bpg freewebs yellowing trotman sordo chronometers otaru telecine bayram coevolution lamentable herbaria catalpa zohra kindler reinert baxley sockeye evened nhn helme suerte dân jaycee wilayat andina ceremoniously baio bennion larrea guillou ambrosini schippers carvey neenah divertissement ebr isomerization morwell neuengamme pinecrest abidine dunsmuir mcelderry bhagavathy azarov hollingshead squatted rivard kalmyks yukihiro mcgarrigle marien verilog bhandarkar bramante storck daylesford salmons bailie milwaukie chabrier jaffar esmonde tillinghast discusion deutch sarees czarist marcell crosslinking penedès viorel stromboli mikaela scalise birlinn muenster veres theyworkforyou yepes bembridge kenedy smd anchorages heidemann donatella lippo fatwas pedraza mihaela prokom gago toting strato wasserstein christmastime tomba killzone odawa cartersville parented kharitonov pellucida tomboyish cang chorionic goodchild mld playability culturing iselin abend garforth orn dovey tzedek pained wittering enderlein infinities slaver jetliner goggins indurain metaphoric hooda tranquilizer stumpy nihat azzopardi oxycodone housewares grecia yali krystle pvv restrictor hrvoje sarina coire seimei ozaukee unalaska overbridge aviano flanigan sawamura universum sandton denard sideburns fanciers manavgat aav sedin inuvik kinnick matsue moke cutolo irp warthogs badulla charybdis standouts kusum moonbeam microsite danielsen petridis podolsky cristián blissfully genies jobber cimon fertilised rsr starwars veidt recuerdos treecreeper numismatists birthed ukc jadida tutsis radome hometowns arosa peadar putatively pilotage vacaville tradeoffs survivals lauding opensuse flesch katamari mclagan unescorted nomenclatural cirillo mikulski burntisland havard cantú boilermaker rucka pawnshop neuroanatomy rhian minivans cotillard michaelsen disincorporated brot hih bironas hapgood clairmont 
colleton shawangunk coloureds verisign barmouth swett burling capsizing poway ricotta sofi wastage scalzi aparecida wast tuckerman kulthum schlatter waltman tetrad haugland trefusis saturnalia weathervane frears quarrelsome hashtags ibe hexane belshazzar temptress kodaira arpeggio dering pinnata armiger zopp butted koussevitzky headlam underscoring fearn skardu vieri liaise craw norvell hwaseong fron ainge marketability rlds rosenstein petco magasin baldev joad neoclassic derr solidaridad earnestness laurette otherness ireton evangel subducting knickerbockers mirsky teymuraz buarque pepito hacer dlg teennick incorruptible respirators interlace phaethon trollhättan raggett moccasins alpharetta barfleur urmston mambazo manichaean feigenbaum einstürzende fluviatilis vibrators bessborough hdfc alaina calpe catherwood counterbalanced vaporize worklist hafen yáñez vizcaíno embarrasses levinas maltings sreedharan kayah socha ystad félicité kaziranga gladius graven piggyback gaffes populates barraclough ocoee ruffini hoffnung anderen chicot monroy financials diyarbakir waterpolo cinemascore pinelands nsn bogeyman distemper ohel signup reorganising cantankerous huskey findhorn outlast unfazed hypothesizes transpiration plodding shull arri limeira maxton zimmern larionov blaenavon apop ungulate rhythmical jango rusch cardigans gota sherrington whiteford ruffians marson mahim reclaims ayad crashers koruna archways zumba prostaglandins demean intercounty authorizations rippling denikin intimates adelante unsaid vap sixfields brushwork rumination makem fumi ctm olefins pbp pinerolo lascaris pawling agosta maite timotheus veikko adjourn tunny instrumentality yamani geomorphological wrightsville comayagua matane panofsky joiners interferometric gazala havers narcissist eich cytotoxicity invierno catabolism giunta lantos balinsky bijan alexanderplatz notational newspeak lethem temne qanat etchmiadzin cozens trnc pineau rendall brambilla reinfeldt essam namo eeoc shada abramov hany 
phiri matinée nixie bomar tazz casiraghi cozi proteinase decolonisation ritt aloisi jhapa windle coman finalization phosgene geneve fabrik hillard seceding sunoco hahaha vika strega wolfsbane nanao kishor morkel decommission behead kushida unica dissections aspa credito raimon cupar euronews shila reconstructs rottweil minhas nkt teimuraz commodification oberhof alcaide armfield lectin wgp socs tinashe melk stacker adducts guesting unamuno langone sekou valby hamner enriches dbc agger legalese weisse perfumery chumbawamba lukasz seidman horiuchi monoplanes biosecurity cabbie kabataan tranz lipkin wicketless micrometre transparencies britannicus zoller ypf catkins asea engstrom meringue utopias seshadri longton vibrates cadwallader bierstadt extensibility extraversion chaperones mortenson bournonville behe mailings selon schlock guarnieri stouffer kakar pitstop codenames tty reem sheh hoedown furr simpleton armstead ronn deckhouse majeure gossiping yamane autodidact tarry powerlifter reining dhaliwal tattnall maytag whisked flyovers perspex technocratic classica medscape vilification charisse kyzyl battiato towle rexroth abeille falta reutimann rajani azaleas keady loathsome scrubbers sekondi fenster tannehill phoning melora nubians sansthan bernardini rtu meretz nisou sagittarii startle farish mesta denon letterbox apostolo hipaa zhuk maclagan matthiessen cheah marosi xchange liniers penetrative honourably ramsbury shavings dickin sisler sabe bharadwaj estévez necrotizing aigner beag sputum klinghoffer boutin senhor yobe kjeld candidiasis canzoni escalona sigmar dangle amgen amasis oland lipka américaine armendáriz cuca schuck ebd bdnf curtea plumaged laeta daur najm pluribus reorder washrooms wescott mismanaged bridled smithereens mackellar siraha waddesdon noncompliance kitchenware sibel kuskokwim milivoje europop acknowledgments disassemble stigler caecilia harumi fawley somare romanova tyreese parotid rsf drafter vrai nylander dewas craftspeople stephanopoulos 
blemishes goscinny singspiel daxter atan wwor burson wragg hitt healthful smirke musei vaquero canidae bouteflika farfisa tcas greyson yuasa greenest trunking huambo mft godina keatley tnc chekov multicolor mandara overnights dimond landmasses ernani ccu opcode basslines geocaching thale fundus yinchuan diatom salicylic tefillin insufficent electroconvulsive mischaracterization trouncing stepanakert ndlovu palaeontological grimmett maramureş imploring hamblen rham radwan cleator ramli guardrail lucrecia singstar zamorano kanawa wami blancos castellammare ufology jws bernasconi helichrysum billard fernwood conegliano pileggi tpl nig sopoaga salvos westerlies schoolnet roopa frogmore sloss handmaid equalities caye gourmand cohoes verticillata igs commensal hks herstal garbarek hellyer borchers faintest nabataean groupon dolerite sithole ognjen suribachi unsupportable nakhodka elda steelmaking chittaranjan wildness pfitzner choroid tétouan compensator bernstorff afore dowdeswell overdraft vacances pullin catiline freebie marmots azeem vanu dolman russie takanori relapsing wowow expedia seismologist erl korolyov underpins dually kralj gaita hitching morphologies poppet kidsgrove enrollees certosa goochland shoulda inadvisable randstad althorp corben walder waker saj unl kamila epe salusbury gretton afe maillet reckitt mandola stickleback medalla fadil asse lozenges baul blackthorne yellowfin japonicum korb ahalya hezb glaringly herries sobek nismo pleasance burress catechetical gilardino vcrs sfn maqsood royster kabbalist discernable gandolfini inmaculada obscurely mobbed rusby borlänge oxidizes moïse gruenwald comoro impatiens neonates tambora glassboro gulati signorelli gsh genis lucarelli santin thither usair pharaonic corris toke vcc cloven ballester photocopied ebury tarzana hijacks fearlessness schemata underwrite elza phew biggleswade wallaroo corruptions breakwaters rahimi shenoy konietzko westlaw conveyors jss underfloor sceaux bashful byars prefixing 
beddington undercroft opteron amada ximenes karori sgx groat teb corbis vpi annus dross touché payphone appin louboutin ferrie shekar shaler spineless impressionable okano otten madagascan remiss mehsud boliviana teco odda handicapping electromagnets icefall enol leme sarfraz shatt löw ircam uia mcgeorge hory slaney legault kilbey toymaker langlade spectres stroessner meon toraja bucyrus clydach dys weh zein toastmasters lynam beauport ordem jinju luckman shohei nitze sirk berenberg flossie gondi endocrinologist woodring choro muhammadiyah chanukah quotable splayed lafrance täby mff paty mpf athletica shoveler birtles doolin leishmaniasis guitare blitzer saraiva qayyum andress denarius pacman dazzled balkar sanpaolo arens ldc darge quatrefoil parigi ejercito moxley retakes finery nhi modibo stockland buechner rigi rba kitazawa whittlesea ancher tedium shorbagy mondragon nambour baldacci karbi flamed jarmo demoralizing iasi milson finca pegging skidoo guliyev accoutrements amoled wends danai shm metallo denki ufm saigal jiminy nup paal vasundhara abelson talkshow caregiving juglans ollivier arcgis unb mickaël gyo coleshill campese lakenheath mercians apopka leatherback dosanjh westervelt brauchitsch harried sweetland ebanks privet rova kiir integrand expositor marzipan louverture engadine pashas strugglers mugi kloten doga benfield styron ajayi srg jeppesen winxp tewodros walmer masti tinbergen cooter leishman microkernel diadema aska mezzogiorno boka erodes mullion walkable draftee aaaaa civilizing entropic impac zombified reproached anastomosis boycie channelling entreprise shiri throughs ruxton lochiel rotonda pugachev rugova pilling montemayor tindale dagestani nobodies barford gassen tafoya ifas susu androgenic llandeilo hazelnuts nazareno penicuik eckhard seletar contraltos nefesh lucci brasher hadžić elefant molinos cartmel jeph crofting industri cannavaro resnik biogenic nkomo stabled ordaining beaudry tarkington stabile pembrey isenberg wurst chaban forlán 
trebuchet turko backflip aaltonen advisement cuna boheme umarov orquestra zale demuth inputting ethicists beausoleil adrenalin smilodon cira mccurry chiseled dariush examinees keyshia wauchope albu backplane børge zepeda volkmann coppinger tord danks gadsby algar pagers erythropoietin hgh bwr barista shashank hillocks aog dreamcatcher kasher hfs almaviva biran ganged pietsch erector bertoni buzzers mtvu blancs sooke intently becks kumbia yokkaichi interfraternity aurigae deliciously mylene kow kavitha humanitarianism gesner neuner raas clansmen survivable vandersloot djamel ldf blackheart misfire marleen hsdpa retellings mardle justiciary ilyin atle initiations radiophonic gins sgh mêlée elka inductively getxo sociobiology µg rems gustavsson jacketed adornments perun caudex pentonville jailing maks cabe desierto munck serrations klos iwamoto okean farrand wolfhound kundera subtropics latus trd hohenheim gibbard cushioned scada lovitz yakutat ngozi karu centralisation rúa aocs antinous caricaturists warzone carves stortinget sandf kuyt wem zima tamazight unread merlino nagao hospitalizations mandelstam sihanoukville hilux ruz bekele deakins ligon stapf saqr friesian randhir rdi gilmartin banality weissmuller brise dyers unravels wainscoting krzyzewski masini weblinks zwart vendettas basanta maebashi opl tristán middleborough basij roundtree retinoic atli miyata marshalsea koss thylacine racha darkens weiden goy beatlemania alija cakewalk lacerations borghi garvan longmeadow anticlerical whistleblowing spurts yakumo steck suburbanization iznik stabiliser chandrakant lagardère analyser mafeking chorales juul yenching codifying cayey dinis maiduguri djerba feasibly molest sumida deface chalker bolte bramber pensiero revitalise dzerzhinsk maudlin nichi penkridge perverts secretaria blakeman jacquie haack shabbos boldin radcliff jumblatt odbc carbonite condensers fq moneda renouf hiva disrespected mandler mesmer shubha wagging shechem prechter overpriced landholding 
trottier baseballs magid asrani taoists distros mamun naryn clytemnestra sabata kaski zawraa jabber excusing clowning momoyama monographic mismatches dinefwr thornburgh chachi tiresias altham juiz techs florrie rga pennyworth sanju palla bizzy augustyn flustered liuzzi immelmann sanilac shindig efx impinge malouf panchami mordant gomi zampa vassily rowen fallows influenzae kamenica grandy mossberg karthika banneker gcsi mairead timpanogos annadurai majed torrie teniers diyas mediolanum gonabad spiritualists mumia vattenfall sien henie wedekind alcester disbelieve gloating luxuriant bothe slackers drewe hagfish lefferts okita varenne colorings passione deoband seabee northville yamanote groping metamaterial satirize weyerhaeuser ondes sutch sahi ikaros norco quintilian dolley aubervilliers riise formwork megson armbrust bakula markley dryburgh theologica kus dwa politifact hpi unfortunatly anemonefish huddinge wheeldon radiations bolas citic orgies blitzen immobility snapple barraza vintners hiddink eterno feehan mahir incubates varig silhouetted davar communitarian nub miasma aladin tulin yoho aaja kitimat sculthorpe lilburne prati nims skat kuchar almqvist viewings gamete uncropped resinous pincers camellias koha khumalo sidibé gerontius summerfest acholi trésor thermionic hongqiao klausner isg oreal gavrilo swingarm sealant vardan foulds harrop histocompatibility utp skorzeny faller pauw turnouts imber speeded hailes bsm holdridge fallbrook denominators pittwater trudi hova odf willmar pistone simonides rafique helwan condamine recycles oor electrolux somone galadriel ossified germline halili guruji kitchin snowstorms advantaged tamoxifen weariness hamgyong ashkenazim maritz nega dieback ziyi jingwei hoggard aktiengesellschaft fillets grif ncf olyphant mugello jako seamer mocs papilla leclaire comparability schmuck fomento gza biodynamic vence beso departement kinh imrie vibrancy regalado shizuku blaydon manzo vaya groupware pelion downsize farías terpsichore 
sören selenide swedbank compo wadley niedermayer threshers listenable squeal oras alberic altas tyrian zet merapi eberhart vallon animistic torchlight kibble opals emmis mcaulay rightness corrib nedd bunche roitman salak oldroyd erkan spotland stokely cohabiting funston whec neukölln ananth dostum ridiculousness spurn untruth organizationally reais chm rexford mandl methylphenidate lunny interconnectedness lemhi ftn ruffalo elwin didymus deezer adra plaquemine lalanne conditionals antihistamines coffees yousif ambato monáe fellers clincher particulary frantisek daschle fenestration nordstrand saunas yulaev mandla gesell perineum yuppie ratzenberger hkt nataliya kluger turnabout bussell subservience gandhiji belaz headshots ezine familiaris nobuhiko internationalisation peccary knokke flatbread eml lankford ocbc sleater maghrib massino ponsford plaaf voltigeurs morose lerche grindelwald hannam eraserhead chromite remount formulary chiarelli gehring rolt rutile musha unclosed collines brych labile hurdling journée polícia enno efate laus gies spotlighted tawau hoylake posible barlas swardt tombigbee aute almon tato virk stamen brede riyadi carrboro oceanfront assed ahtisaari hamdani dichotomous albon untv lipan erisa collado bratty cimmerian blagoveshchensk greenmount colosimo jemaah prospering anticyclone berkshires ncsa dessie leitmotif anholt stradling mclachlin ballan chehel debo skank somma slonimsky binational prest chima amite parise heera bromeliads infirmity rayment antechamber judgemental honeypot sellafield uscf cerati manicured benvenuti falafel colluded granata harveys igwe sabonis morenz toolbars riccio hakkari rashly mandamus luby doldrums lassus trente sahm monfort betrayer valueless loughs thabit continentals itl padrón ebon tdk krab vignes habakkuk byford macphee kiser bellotti salukis perls hillage robbin gwich flocka fale enniscorthy scaup comminges zeolite furstenberg tongji margarets esmail mangin brielle snyman akif embrun mickle revelatory 
medusae jmp brewhouse ikari henner cerne miharu hersheypark misadventure oakhurst iwao kiddy cfcs kaf holten objets jyrki belang oberman hdnet seidl pigeonhole tooke habyarimana infantil unpledged risd srbija inbuilt penge agronomists wewak ellora zahle fuster squelch chattels santoshi ciego liposuction mavi friedrichs kayaker yello mutuals kyeong noninvasive trane carolinians beary remotes zhangzhou universite climaxed zurita eprint venis lightner moberg mohe ashtar selway cerebrum stoneleigh bracho berryville benedicto chicory mcgarvey limca statist oloron sunstone sabal tsukamoto sargsian damrosch planescape stomps volkan agito serkis hargitay uproot jabloteh crosser larter navratri tsarskoye fennec concoct photocopying turhan lolicon elida lingle kaibab mactan fatter fentress quiller degrasse bld macnaughton armstrongs pavese zarif planitia aul portrayer seguro appelbaum doesburg kuykendall rayed kitwe trembles calea deflector kempthorne gainey debutants laubach stellate podcaster vorenus dargis ertl hirschberg wurmser coan understaffed stilton sidorenko viktoriya tomokazu juxtaposes mediations widerstand sjöström labrie derailleur karishma gerónimo pinedo atto generación quilted sugihara fiorenzo haire valladares gokul oxidised enlistments populaires torrejón clasped agapi kera yzerman mith chanteuse nobre tinfoil kautsky galston coomaraswamy adebayo gnawing cronica ayaan nuking rohl inkwell biblically decalogue stocksbridge jacki iup huddled leftwing iroc drugging veyron cotillion pathogenicity ajami rottweiler endnotes bafana charsadda fireboat rømer impales lehto hideouts theatricality ornella afv leaphorn beppu mediabase arfa woodcutter julienne englander kjartan kusa pintado jonatan muttahida milka awakenings jewison gesang civitella ciccio hulking ernests muzaffarabad pascack timecode physiotherapists yusa shags luyendyk plainsboro scherrer spidey grumbling tagil calverton wintry bebel manteca fathead wyd mony bygones satay canela beckton vechten casady 
schistosomiasis denigration strathallan joneses bugles capulet shorting nessie benicio nigella bailouts undercutting ells kalenjin geva paju shawkat sifu casted fatigues stealthily gunton jutra costata progresso manders inan panettiere pallavicini schoonmaker correo hardesty lampreys monkland institutionally siegler caldeira somthing urogenital ludus funabashi calley multifunction clothe iad hepatocytes krung eik shorewood scammell hydrofluoric sirs avoir occurance chinle uneasiness buggery zahavi foodborne normals leasable correlative moondance geddy maksym pekar foursomes diamandis valmy behrman muons lfl chadwell gallimore sigmoid estill captivate emanu safeco outdoorsman stirrups savelli ichthyosaur waxahachie sithu mutualistic quique hanky comprehended huntelaar imperii uvb latouche knighthawks kinne robidoux hebburn assheton donoso kake sarris kaeding meck giada fortepiano renter smathers clubland caernarvonshire deters keepmoat krait fabless thain psychogenic umkc exhorting resubmitting ivers galvan traumatised spinel typifies roney kristinn notarized swabians osl macneice rosslare irem météo lavished superhighway zaremba irredeemable gorshkov peruviana newsted definitly spearfish messa repurposing bluecoat malpensa coushatta király nflpa panikkar rabinovich llnl vectored caetani fdot mwai meghann avast quibbling chanderpaul glauber parshall barest hideously unbeknown farge canonsburg mbira enlargements peavy wortham teleprinter obo mendham dalli aqap dsw dupes burston sunkist draganja erben vincenti shahram jaunty suitland lamer raygun tyron kiplinger ecd shod filatov ratnapura douma westerfield clintons purohit takizawa portimão masseur médoc wadebridge sherrard othmar dogfights saurischia ttf tanta peja pharr oria lanvin hotz myasthenia curable wiest barnesville brigadoon frise bullfinch sahay submersion guodong sharat leafless ehrenreich moghul vsb capos stairwells cppcc escott fraley bostick odets replanting balenciaga botts westerlund floundering 
constantijn swanepoel pof aelia nakedness treetops maqbool downsview poliovirus margravine komar mandar alums scoot ucayali virden copolymers slapp gevork rameshwar telefe aban metrically hagrid chapleau overbroad ochiai effectors cotterell lifford idealization cuadra folksinger vanitas beatson yaquina maybelle rylance mccaffery harikrishna hanworth boucicault itar meynell introducer liquidators marzio cultivator muzak miedo sclera brownstein shaiman sunland independentist mogren kiffin cpf contemptible rockwall disqualifications yeadon favouritism volatiles sensationalized queenslander kästner cengiz niyazov bartman hima throb stace lewington jürg courser harvestmen mashups nenê oakmont reinvested pursuivant intertribal lotfi jungfrau turia furniss mulino pluperfect clairefontaine languid rcahms disheveled greylock cookham anderssen rendus truncatus tcn borlaug crucifixes aaronson furuya braised meanest belaya demoralised kieren ulfa foreseeing valium brookvale mattis reinstalling chelation buona lale nuthin ungraded doughboy voinovich highbridge goodland austerities pemex mkb rescinding mbps sorana schaaf souq colonoscopy pagination kydd kisa banes toothpick elastica onofre jcm cruisin englefield lanai lewitt baedeker gyalpo cornyn ncu ridha svs pavane frou fairtax readmission adulterated makki terminological stateline martynov sprott woodridge triglyceride leuchars barghouti odlum lockley entailing agh corbijn automatics tajuddin pavlovsk doulton neumeier firstgroup bildungsroman gaudeamus perchance traore lagunes stempel cordons eluding flings dyad ariston ashurbanipal südtirol goldner schicchi overy capitalising aspatria pettus revo patani dunant matthes unheeded metalworkers thomastown phages wormald omc kreviazuk villi hanbali lahoud awnings dukagjini dbt misbehave brianne basotho panto interneurons romário bolle januarius iker alvi bondar loblaw pasch mercato photogrammetry prévert golightly jamel dimona azer megachurch alcamo napkins glenside eterna 
purifiers micheál aneurin birnie pasteurization backspace copulate umpteen nofollow mornay fugees unwisely janam mcgough huangdi transliterate acls subtler liri maxon avl cobre guimaras daco middelkoop cordis lateness angier messianism roscrea gayton doerr garçons grimani séguin wilko fruitland comden llorona rame pelikan darnall mutes familiars sherpas trample loredana tiomkin kott paan panavia splints gogebic terminologies kittitas pedlar deckard samer zfs bottler airbrush personalizing tenterfield powerlessness rickert rossdale compere kaila roncalli thermite manzanera susquehannock pendergrass lwow olot govindan hossam margao transsexualism rosato userid sonority arcus instinctual sidelight cleats swinney destabilized maidservant galifianakis modeller bowne dashboards nerdcore griscom goldhagen mcgrory remonstrance josselin rydal smelted nbp mohn boudinot vargo thwaite cycladic hardwired calibres dongen jinshan indes saraf herniation sarm crosshairs novia matarazzo paperboard talysh excelsis roks anandpur franceschini anadyr skillz iska hadlee ifill brushless saotome lanata kataeb kudrow motegi echostar rcf kester ohka kinloss sanjar postulating displayport rhc yechiel kanada boller triadic barnato boob magnavox utca routt acetylcholinesterase freethinkers extrême geos fuso tektronix stis outstripped fibrils hoechst venturini pulsation stingaree squeamish quirós jupiters dfx slither dhr gwendolen mayflies adulteration minard diaw familiarized clarrie legende doting gtd unr gallinules shied michalak thermae sidemen surma sezer hillhead syms elkington terming hebb hever bonnar punning komaki agoraphobia lithia gottesman icv ramaswami sower marea lansdown prisca gae prahlad hemolysis omn ethicist turncoat drouot abrar incredulity valency bagshaw brambles lais schemer kozan msb turbomeca bucuresti justyna whosoever folle biggles boorish kulturkampf wigston pyramus woodend sedat firoz donghae coolangatta shrieking popularizer alderton schweiger ranjitsinhji sjc 
logistically saram switchgear yasui botcon smap salmi hibiya tuatara mullane curfews soyer namaste preamp gavras christin simmel nemes tumen vergennes railwaymen kawahara lnr medel noo dubz percieved wint arrogantly ginuwine glendower gradations nanoseconds massacring juels beaudine qadeer wuhu copolymer corbucci killingworth pusa adlington intravascular runoffs bongaigaon moodle bloodstained colbie rootkit elance chillout athénée teles kankan fud tmg trichloride misapprehension michalski spurr aziza balewa micrantha cashmore multiuse knitwear marfan finster macdougal unencumbered gerritsen renz lorenza succulents greenport tush sarastro ebla kissel mgo francisci sikeston adie lopburi churchtown mcf beddoes boulding masaka landreth awg fechner resveratrol insinuates schönefeld dolf rhodesians siphons matériel lemonde clericalism jonesville walkerville wilczek fussed soudan koopa shapers hra bhalla fasted colma mattox lecoq giacomelli encaustic racewalker humours kfa bote mercersburg ionisation gitano eln utne pencilled alick mdgs piru tanisha endeavouring vialli tonsured lightsource tattersalls wyden awk clouding idée castells griffons bienvenido cootamundra scoreboards dishonestly sokoloff partir aventuras schweinsteiger webserver madara witzel wieman ravinia prided violino rcb sugarman phosphors katwijk naftan derain confirmatory mispronounced deshayes corton lebar everingham scaccia odenkirk poms steinhoff ronsard beaverhead anchieta arteta smallholder terrae trono goldy welders thorfinn fpgas westall warrender milito donaueschingen dfe gajendra pring atanasio adebayor georgen belied tendentiousness dunsmore kiloton cabela flan joby underoath doen pangborn opto mrx busquets injunctive ledgers bikeway prophesies extol erj bhagwati dslrs edinboro lamberti alis croll callejón kila iqra earmarks baruah gwinn wais ariki autoroutes kerfuffle tunja hanifa senato cliveden spindler mellish pelley rudis punchestown succop windswept krio dugas sewickley szasz plumpton 
mariska riedl brushwood lemar meizhou herculean twr isaias headdresses runaround recouped homogenization disinterred visio rushville zagat crista pinscher slm lovebird organo vashti ryker chutzpah yonatan neopets wern wellingtons brembo patentability lagu walbrook brookdale targetting hantavirus mipt tobymac teagarden komnene reisz sobering erlandson biskra nanotech guarino untouchability josué uncontentious zapu koba nunneries scheiner boettcher logica lookups jass rowbotham ehret discontinuance inattention busselton bogut paroxysmal kary fragmenting swinford harnell seid jünger ayin covadonga malts gebel decelerate theist ipsec baftas brickley mochrie denazification yungas klansmen kruk kentwood benzoyl tinh wiman reiff knickers ashgrove kleiber depowered hanscom meerkats ellroy countercultural miming dhaulagiri hasani craves takenaka wbca yueh exhalation lycett atglen milked linnea grana mononucleosis hirschhorn clémence wendland amelioration signac farts sargasso reboots pieta bisbal bundi barbeque beena astonish veronique brás vmc messel nguni joerg mondi xaverian fowls nymphaea thornberry cuero clavichord lightman cackle potentiality grigoryan pnv perthes kasten reselling chupacabra cardle splm egoyan groene rufin coatesville manzini coppersmith hargeisa duta yersinia faceoff vacca topoisomerase flir oribe spafford serin poniewozik polyamory cacapon shotwell tramlink macdonagh underling maal crise dopo shantanu gordimer alkylating shanmugam torpor hasson krenek sacrosanct balmy bardeen questa buta coppery etcheverry ticknor aromatase sedatives sey demba gerrie amagasaki eyvind allawi aquatint lorch noriyuki lostwithiel fortissimo helgason grantor untie kilbourne creo chosin cobby vectis ugs ciskei ohlsson addon bolivians cockcroft camil eddi geithner salò jesolo exh fonteyn sasol gmu heytesbury malabsorption cisc medicago heriberto jumpy barres corvids farnell smarty lorenzetti sadeq lubavitcher bovey shimshon shew romande tremain matthijs lassa izvestia 
reticulate admonitions safdar aleksa swirls ipsilateral indraprastha zick aromatics rosalba busty chey exorcise harping hinz gaudi peine iyad counterparty steeples maret baughman keli bricusse morogoro wlm havergal taji quantifies nuthatches nandy fiefdoms malvo overexposure ortolani carm ancelotti vix approvingly shaadi declamation terkel crkva mrr diarmaid dancy sead boatyard dgs secularisation iste aiton carnet holmdel radiographs polizia catalyzing angèle mabinogion rainfalls quintal sûreté swish ferroelectric adhemar weeded peveril glioblastoma jesi rastislav shuhei anacortes mashing stoudamire treacher bente dodie laclede palanquin laundries distillate biggins mbl hudak johannsen salehi thb prothom brontosaurus luga anl nuclease eldoret quillen mojica earthling risch sylph laberge muth clarinda tidally deification smo melnick sterner scoter tatooine korczak jux philipse winchesters cachoeira caminho tuchman toadfish bandopadhyay garstang ladytron nabors mollison irrepressible chenier adw gaîté polychlorinated theatricals mannering crematory thyagaraja paniagua weenie murdo fenrir dorner glacialis frothy gentileschi adblock hifi kiddies rasen legislating rybinsk saith furnishes essentialism flds vaas kanti taobao aynsley neuromancer hearns vibert hassam triggerfish allain tourmaline anuj donkin tempio zugdidi barmen pataudi derm dilettante innapropriate varberg underperformed brama hamp hedonic sheikhupura moly fite maîtres ibra tater askia twu economica trippy terras mutombo yefremov toohey murtaugh bargh kobalt glitters gmm pixley bahrami nujoma fetterman kirana zahorchak edythe playgroup holdup enquiring ( maryanne edyta eldad mohmand takhti actaeon reintegrated wcg chambermaid klaatu piniella wetherill wabi gagosian stuns gml trooping broadley batam rationalise waitemata lingala spunk treadway pignatelli edington wigton waveland hazan mcdaid dongdaemun tabulating witkowski betcha uden duthie forté vpa taiwo vianna ghamdi hight recombine brogden achi tdr 
bratsk caumont groundswell noman dispelling furse airventure faircloth rothschilds moralist dragoljub bely tarnowski tristano fuat timmerman teichmann starchild strickler maimon disperses ecevit rubini civica yandel andar ehsaan hadrosaur letta hatice raconteur pounced noemi mulhern zedek courteney outmatched savants muli chickenpox montagnard jette garver armalite manhua sholem counselled grinham webisode tarin siong quon sigerson dá arona delvin outflows sandre binz bearkats circleville chb weathermen nitzsche publix eysseric dassin transpacific rft newtownabbey codigo musicum erythromycin ortona kanika genentech alfredsson reallocation melnik hibi oomph pavone sff apraxia canley serafino aera subsisted fitts foreplay bethlem fenson reidel gewandhaus rocketeer quaking symbiont decry sitapur jerking consultancies healthiest gallovits prurient multicore wassenaar brecknockshire reshammiya kristol sharona clink aminotransferase azumi ayana teleki lockup balmaceda socialising redbacks ayatollahs cubicles jujitsu labadie buchner wann sensorimotor cricklade volkskammer naloxone saikia rowlett scammers karasev cjc laboriously altea romantica exacerbates recusing bomp elgon brenan exaggerates stooks amblin exude killifish takestan bearskin khouri tecate bettini hansraj beguiling ruanda ajab manang waqt sigurdsson sextets kalashnikova balian elephantine túpac neustrelitz deadbeat fariña tose freudenthal bluestar fining dugard slaf montepulciano dimarco ioane sanibel syst kazimir htt cannington cereus hbk giardini nemec stewarton patios alburquerque cookeville herreshoff darkchild beniamino séverine liminal gnutella lassies polychaetes dharamshala ungodly trucial andrius semolina heribert clase booneville moans kingsnake neoconservatives secondment rezaei boran szekeres hilla blacklists candu yakama abhijeet aliquippa sonnambula bookish gergiev glia mcgonagall lincs newsboy patrizio dunking jordana sinecure rescheduling playroom inappropriateness grantville weert wistar 
megalith jernigan arnsberg vare cqc shockwaves deaver baoji jhajjar refits dodi lotr salling ranasinghe icq milnor pantages reeducation cheo ibt waif deseo vasantha submersibles mundelein edutainment chakravarti ferragamo bednarik noth deobandi bivins gelnhausen ballooned azucena etl snort bonsall estcourt merson wiens hiroyoshi entrenching gossipy yambol bonetti peláez mckillop hushovd abashidze montell tigard colomb khanda zubov cabos climatologist okura reculver edgier volcanos kearsley speedskating constanze cronquist tasos menhir mazeppa pasado pulpits counterrevolutionary cofer vili lasagna ctd oppressor puny ymer ambergris belinsky frutti nabonidus tybee matriarchy gyanendra kents basia lengyel tethers ominously meriam kpk liberación shifu wenden sabarmati maira goans mrap janša harbottle atac cloquet tomkinson samal hanauer wahpeton komori ratti kays intraspecific verwoerd idolizes symbolics subpopulation settimana palomares scapegoats biloba kpn rarefied combate adamstown nuptials hanazono agana meiner altach gildemeister reductionist taieri marshawn lycanthropy erman entreaties hummus kourtney taipans gimpel lindi monolayer neutered senso rikard tuguegarao aacs gaskets classless motorcoach tardieu dysfunctions academica yeng canyonlands wagered adjudicating barun bitsy trippers apocalyptica iwanami tribalism nickson leskov mcclary provocatively arasu chauvinistic orndorff hollweg obbligato seagrave caravel dhimmi iiic derailments sympathiser lifelines homonyms dadaist sherard qari sry kufuor diggins salif fantasyland gymnosperms stupidest puzo sapient acqui chisato casters everlast sumatera seasonality pettiford keppler gertrudis oxenford cardington pollstar contoured steidl nibs skewer vannevar drumbeat thecla evernden bronko asom meridional kirkyard gowanus dagny eby staro deforming cowbird buckhurst liken inspiron pendulums ravn virile eady transilvania musette annett lhb moench chell charenton zainuddin schütte igc ronkonkoma kiku duquette frm 
daladier bernabé petticoats kahneman longridge jubilate multiprocessing depositor tribesman cavil bfm bettelheim charleson centeno rears trackless defintion hehir solstices suhail doges roundhead corrode metropol pleats florenz phila frisby aitor tanzimat sini strollers moonwalk shivraj karns gosden bandeirante tulisa regi bida tamid urmas durell rienzi sols turnitin elation montfaucon superpowered contusion wallow lysergic panatta matan wampum ffr gabo abebe eutaw figura sunfire yoshiharu tabuk richler lundell fluoxetine crowninshield eboli holness misbah stas caecilian khal vaudevillian kro pornstar caltanissetta modernistic panta higo shellfire anarkali nunchaku pacifying bouffes sokolniki ombudsmen fatherless leipheimer miscellanies hanga qasem scharff embo gately najdorf korneev giron kerley marquard disengaging mahalo shorrock nikoloz climatological ferlinghetti croaker reverent carpool uvm unami sisulu recreationally backstop kube psychomotor tongariro hawksworth karadeniz parashar neches salette ferrigno bridegrooms fugard foreleg wahhabism solidarité kingly denture dissidence redoing corba bungled paquita ainsi napp senora frivolity ikhwan nadira polyphenols ruscha takraw salander turris cynics walia euromoney chipewyan khushab raka faker bessa integrally kittiwake capercaillie entendres desplat philadelphus lindfors wctu brownson smithies ravenel pikas radikal garageband entrada pogba poitevin heartthrob pentangle marni pora jiujiang blg emusic bradycardia jano snowballs mounsey qaem ferdi maned kansan flasher truancy gizzard guiness neutralised shiller rosell predispose marunouchi microcephaly barkat eclair rozhdestvensky breeden betfair varnished visigoth brouhaha ssangyong padam trowel embossing chisels clune snickers kimba marwood phare patellar spanier dayr inferential hennings himiko quade bahari hardison unsolvable petrushka mapinduzi nationalize bocanegra understrength lovinescu mattioli hibberd barada curci manichaeism alimentary sajan tyrese 
interweaving xul detrick lancets gaitskell irascible chinua pnn anabasis hanka frugality monsey darton louvered sachar lambrecht mouthpieces bloodgood khirbat abela moishe lumb tetsuro messmer cobwebs rzeczpospolita spagna cathodes moyà athar bookshelves snuffy grauer musikhochschule timp beckerman lexicographical handspring virunga charnley oceanographers tewa visscher paden nxp satria vilain mälaren ranchero wolman abracadabra schill changping riccardi westjet gtl assia blanford esha wirksworth contextualized fostoria banville kangana anaesthetics fairbrother jazze vandellas zorzi riyad muhamad yermak papoose ayew elana egovernment sistani othon vermicelli demerit nodaway reinterpret niggaz medeski begrudgingly evicting pelz outwood liquidating arnell apatite clumping reflexively unrevealed kub hardouin lule histrionic zy dabul governador sarpy acquit bruny horlick croup ossi wordnet lazily imslp performa nonna obliterating condensates vertu switchable crackling autoimmunity phlogiston tuamotu hippel recombined umag cdot tressel gleaning moorfields englert loincloth télécom rws balloch raabe fyrom syngman reines dnevnik wavre culshaw lumezzane epf metformin debonair converses nkosi manion loewenstein untraceable kouros lansford razr omnisports rothchild borchert otar quickfire vsi beutler kaposi banas toshiya timea franchot chaleur impulsivity barbossa decompositions liem antrobus subhuman arley redback rumney dongguk overreacted kammerer novalis golder leadbetter hanratty tetroxide ablest seedless burkitt geiss melin cripples montevallo trudel whew hawksley schliemann disown buenas gompa alava shekhawat sarva relavent tatu mariella lemonheads peltz outdone gjergj melichar aphis pone tokiwa clasping woollahra fullerenes chaliapin retrenchment gimhae dellinger nanping silico faysal phuong ricciardi bamenda mikes albufeira germani resupplied megalosaurus phenomenally cladistics crosthwaite disaffiliated suid torin galang strouse granulation ielts flattens lawrance 
nicoletta laleh bonis motueka nwp dila tevis merrett maximilians gour scop tidende chamberlayne rockfield kaylee tanizaki tweedsmuir baldo ndt taurine seminarian fioravanti ombra klatt hungate kups ferus youssou eidetic prearranged wachter beneficent youl axemen trempealeau isas jeremih rwc kbl ventus kormoran latinate actium jacked marichal nexon quitte wrenches agustina faecal mannheimer dipietro reelin imbue mcgreevey taksin soekarno softwares rohtas oficina musburger horden vlaicu galicians finepix camembert hanlin césaire kartika guion stigwood belas hatten cronos excommunicate nicklin kristjan ncsu ebsen yesh dati cesc edificio premolar lampooning inexorable borris roaches mallah jorden polygamist posthumus feminization boudreaux mcroberts lulworth ooi snooty enshrine kiera snippy brazing transgendered saget ojsc aermacchi fuming preformed kameyama sturge barnette jauch damocles bhl nonpoint massages yaki jujube whiptail martlets tpr bdu jaci fann schützen stroman mailly eskridge sachi ramping blunkett peachey briefest stelling toque plaut jato jantar wafd capriles grifter bondo peraza ibsa nandigram speedup leros newey saidi familar demarcate petacchi sifted thanasis interpolating cranbury melnikova soubise hirofumi kilts rainsford keifer walney screamers harlin roshni insectoid favelas stubb fairways swaffham behl orlandi aleksandrs parses pterodactyl anjum fdf mastin stevo petőfi blueshirts gds hev pocketing multiplatform kollywood sterilize fenerbahce netta rackard mercadante hochschild testaverde sexing froom penry scapes gipps brines weatherhead lugard hilldale pkm speicher macisaac breakdancing binda headwear cristero duchenne mandali caffe athanase humfrey nipa scrubby taif milman albumen isparta ngong barnardo piron morini khanom kozo overactive omarion barbey leeroy inui triremes libertyville depressant ballplayer rannoch yeboah ferny excreta kosugi fujio entrench deniliquin andamanese masaccio stowers korine cpj meaghan fede edney colonnaded qatif 
newlin overclocking fundació merri mesilla costed obviousness rumpus advocaat accrual abrahamson kaal klahn luana litigious yuncheng pergolesi bryans faxon lwb weaved bagua pimping mdx heliotrope karnad tego historico bordes vehemence yagan muggleton sparkasse macleans flockhart durnford mossel mcmurry silverio goldstar fatso pantyhose pearling sameness refocusing potito urinals jebediah nomo shachtman requena uniondale rhamnus touts latchford fleurus vasan tricolore dibba extroverted chincha wnyw gtb unraveled camerons colonias rci arimathea samburu hottie hitches darlan prg dunkel sanpete wickenburg macfarland klay eulogio piglets wigginton nason adentro sandgren liselotte rainbands heartburn swerved fogelberg aureliano motorboats neoproterozoic warbling shawshank saviano abrogate parikh pallo vakil huskisson caillou jemaine mushy molk lifton dittrich coonan malbec renamo alesia pentagons guadarrama curacao laptev lbt ekkehard stukeley communicat liberates pirs verón aristarchus mathes giggling gascogne herbig elephas overage owego sreten proteome leat hennigan pradip sere anastas hedgerow doers nicolls millner tyers lusitano alphas gaskill telefon penduline slicks suiting allez oldie noces greenbacks milward doritos physiol pizzi iqs onder bronzy tawang reassign jrc addiscombe amortized kodo lamin ariadna gipson feldberg aseptic raut deepavali witley lostock roxboro bantams vampyre shera steine dnepropetrovsk freshers empson meloy kokusai msas rosebuds vass mumba portugese pettitte vvt douche unaffordable darabont swaroop colonsay buemi hankins coitus kof hillery turrell apprised rahu owatonna reliving guillard gwin cuc tingley arutz oakfield exploitable bove inelegant diba bomberg nagase bergey dookie tented haystacks mahapatra boleros kostova harrassed whitehawk implanting mastoid internationalized holopainen lemont kno athina aminoacyl marcano bluebells ginóbili volturno bentworth divadlo hieratic defenceless hasanov barbero sakes cuddle lamjung rishta baze 
yimou foxwoods mki borscht cableway homeostatic tantalizing blacklight smm crudup lobdell multisport neveu covell paracel leached rollicking postgate radstock frandsen gordini arkadiusz hudsons alibis hyon peluso hansberry oingo samak stealers précis moçambique superchargers footbridges cose buendia oleo pigalle naniwa pled mainspring niekro rancor tribble electioneering bures pohjola meknès breadalbane hanse virility hootenanny rebutting deferens rubia perpetuum prying uhlmann warmblood tof fpm mollet ceylan berens ginastera musikverein jovanovich banya cuse sigi jagjit oleic apulian issara mullaitivu proserpina monnow tutuila trabant oxalic cuboid rema moba asselin vintner chernov winky bowhead kowsar moneypenny skyteam elysées cisplatin manchin culross sapien planus nasrullah linkletter febrile scornful epik falkenburg krisztina ibar mayawati pahor teide demining baserunner ruge gath bogo deliciousness burkes filomena akhdar coolmore ania dualshock worthen castanets pilibhit zabul igi galeano helmi honfleur exorcisms spazio solicitations scu norland neuroses dredger quadriga swakopmund bta bhawani matrimonio manzikert cosponsored giacchino halse cnh royally buzzed morozova akha quaife schukin tolle extractions blackfish lapels crossan stans phon onu lollipops sikri danilov anglicization anamika jayalalitha middlefield chandrashekhar reticence divinorum eylau parrotfish asche sabang kaskade lefranc huit dmm awi noncontroversial colibri waldstein thongs rhinoplasty vinciguerra woodfield fibronectin rehashed shenouda asal freiburger magness gsg perini jtag alleyways brunell koke euphemistic streetlights tirupathi ontogeny ludicrously chinmoy lathlain emoticon camarena corfield grandniece sacheverell lakoff elul transcranial terwilliger doofenshmirtz nasarawa eskil matisyahu savonlinna cronyn headspace immunologists hootie nonchalant gouged lehtonen roble spoor bobb heskey irak recitalist qinhuangdao janaka schwedt sordi kamba pulkovo gopalpur musters huangshan 
dodwell adaptors pompton lachmann mushi eventuated gamow letterer interstices eosinophils geoid weale minkus girlz chanelle pitchforks raimond allens eckford godber chongming ezer schecter muhamed burrill ducharme vliegen cheraw bordighera rubino blasi whippet lakeman bonino ingels atque plantin canti microglia misanthropic kazakov milon giteau bhaduri homeboy medius tarring naveed tift slammer emanuelle leeuwenhoek cardholder gmb coppens tsereteli mahood apn livescience goblets plutonic alleghenies nazarian levitating complementation babai circulars potier brandreth burnand kurama rathfarnham vedra samothrace magomed barne orbiters maneater deathless szombathelyi cottam suci sinton casca legitimation pseudoephedrine carboni ungern montalbano beller baysox eucalypts paulownia medfield jitterbug conjunctiva unwinding grosmont iger watsons litigator egotism rjr vsa neigh kvapil rodes bislett linie iterating limewire lostprophets forshaw aridity pappenheim mbti gloriosa lobkowicz roskosmos hallucinogen popjustice laverty pesch letov baccara haugh villeurbanne plasmon conboy gallina cascaded fertig garfinkel candied internist jamz stellan snowboards dooku geon subah tillakaratne hien yallop lambertville lappin porro tibbetts teed negron avante maspero obrist mitani samen fyn tapirs darkling orlova bete terma arnstein northerner johannesen servings pegler ild aaya anodized mortician spck wsp rigney kwaku sircar longstaff pcg colebrooke snobby lidocaine becquerel lexicographic foxtail ekström saugerties dyskinesia oligarchic górecki popery ktvu agadez fuerzas erath weinman putera quero biologics trillian nansemond corne tobermory crockford ossipee sharron flaunting thinktank phlegm miniaturization weserstadion joonas verba molehill antiope tánaiste sdhc ilr kero nyasa prioritised garlick remotest gjon marriageable rumyantsev codewords glaswegian perna sibilants tsavo yaddo dolgellau rabih borj caciques barranco jaishankar servetus lysistrata malfoy expressen endgames 
samra pigtails longerons kallen malaika cema bakht anik rockett orderings scullion winking siân fishin interlink nitrox spongiform triskelion alresford quinoa mothe religionists croissant warrens tetraploid redcoats crawshay inductions bise clang zevi toasts lianne laserjet nutritionists baddeck legitimise ductus minka varicose fabiani tekin codas hellmann impactor douches freckled soichiro rosenberger sekar boettger heptathlete sfp hallock mosely perilously pincushion recency princip nishino reevaluated miserly bankroll hierarchs atef gewehr schober harmonise emes krispy afterburning schulich hartzell mums festering punctuate burgeon mikal icebreaking pacifier centra mixmag grishin islamiyah froman kamiński bottlers overburdened palomo lvo synthesise chromed saina minera penmanship markland protoplanetary comorbid yastrzemski bawang sheiks rale hardenne glues cammy grethe libation interchangeability lunettes kovac willemse lpfp pendolino rajpal gondal lifeforce menos reroute smethurst heuss votre freakshow viso rutherfurd messaged sarsaparilla catahoula caz meanie ampat machan ripens nakasone zhoushan heu libeskind fengshen picos masint boubacar erat keong superintendence affirmatively harth toiletries kedleston lyford warfighting callanan cida lothringen enterbrain yaris binford redtube tronic stowmarket pucca miyajima rossby pfm tartaglia ruo toulousain sipa poconos salla hydrants counteracted mutch multiflora clutterbuck appetizers endear nonparametric henriksson ibrd stymie bezerra geb overstepping flatwoods ashwood fassi kirat kuli funès shrove crony transcoding vomited tujia standen lente auvers foles freenet sigue appraisers mohiuddin geen reuses veneers shugden bundesrepublik tahoma rukia brillant spambot ndu lifeblood guaraldi sociability accredit gortat flinch luss tbr nakao headon eveleigh waseem carillo sallis recollected batistuta louvin jmc vanunu berga hammons lettermen insolence pervak locational gowran kaleb borst mccaul olpc obwalden praed 
woolfolk cosentino ranade archdale sketchbooks translocated monoculture hisao uft mused koski mido loukas shuvalov koja rolly westeros griffioen ritwik devolving thompkins anjar parente tinta spk devastates acworth castelló saltcoats svyatoslav wellsboro bejarano yngve unplug luu kaha kunta adlon jannat mavs colinas kinji hewitts crockery bootloader comforter grimme palpita horney shekels transpositions vizzini impermanence médard yoshimasa pangs dunks goonies tuol tagawa abcnews modlin laurenti vnukovo sinti atsumi dilley greenways stradlin vinge talabani amateurism meatwad fazekas hoby borgias shinnecock flatfish tuas astore shorey flowchart khyentse pondweed wimpole dieterle jambo debenham whitlow aymar puneet tenis drumhead bouygues elford harrap uhura unabomber vermes opencl kalayaan hoodwinked signposts boroughbridge raba migliore ironpigs opticians ospedale ashrae corrida makah masc feted vallier mcwilliam paktia copulatory socialise khu zvonko remscheid botox twiztid localizes fuer antineoplastic molting uncool valeriano corporatist mangione grabby anselmi swiping arditi paramedical marcelinho glossing pearland southlake mononoke songstress obrador miccosukee widjaja liebherr sunnybrook dorham seel arrernte inviolable calwell jiaxing faubus shap telic higbee parroquia situs riken recursos monserrate starlin aila stapledon mogador snags blessington andrejs pé burgo bullfight furze evren optik hesitantly warrego bekaa serifs cuccinelli grimly pieterse sholay rubaiyat timbering longlist jassy matia scrawled preproduction blithely papadakis gorki scram taggers firebug adora scavo axially racialist cundy lovro carlaw aleksi tormentor letelier rumania nkb minghella nowotny niantic meusburger kananaskis voelker maltose rencontre neuroblastoma hissar helpfulness mattison barthélémy benard creases disillusion laburnum ebv orenstein natanz topspin sadra croxton eschweiler bassoonist serialism stomped cante mlcs pellerin stockard caret ringwald jowell arete recco 
sandeman ohg lauridsen airside animalistic ghadar harlington boga weeklong bowstring holtby sinise dobb usada caroll rael bellboy agitprop weifang krater discourteous chillán nokomis yasna ribonuclease mahdia showbusiness katey frenchie milenio fluffernutter pictet frederiksborg iolo psychokinesis hanina vdl rantzau ferghana priština varnum resonable cortisone kember webcasts bellay leonesa vitra unwrapped tafel atala convicting arpaio esculenta zeynep policyholders comings zdenek carducci moncrief mudslinging hadlow cemil esiason gorrie wreaks conceptualised neven calixto reiteration wildhearts tenterden titu greenhorn mostel zamudio turonian tno ahc stephanos nacer flyable castillon kamer amphitheatres deleo sholto eulenspiegel mouthing grüner aquae rosco gaja thins ottilie mdd reflexivity peacocke lamarque frosting ullswater disinfectants ector dendrochronology apollonian monoecious fidalgo lakeport rhetoricians avtomobilist deconsecrated everyones cepa availible dimetrodon poza mercers dixons brenden copperheads zeca airlink chaika spielmann kyriakos mottoes crerar tirupur versfeld herpetologists stormtrooper henriquez bbd alexandrescu raluca kumaratunga unproblematic technet stuckist eggen yemenis strudwick lindfield markgraf weirder nagaraj dethrone hanh moncur fausta bloomers qutub verrier collor greeves elem aryabhata guybrush localizing laufer tickled larder qamishli sartain mortes panky thornburg evgeniy regin sabbatini wds aristotelis poulson inconveniences metu tellingly wank juergen sider marielle ryun nesterov prinzessin berlitz casso kakutani rossy felicitated foel wellens dornan brasov aftra tugela allemande pectin slickers intrudes clif abbate gentili stillbirth caccini nof bleasdale huehuetenango calverley beheld svaneti gillon pulsations mastodons raval jeannot gfk tucano raila mudvayne chisago braehead joensen rahsaan fazer comicon gagik comodo hdt subclavian boursin estabrook charnel backscatter dinucleotide pdvsa poolside llantrisant dpc 
bondarchuk orest prievidza ibañez incirlik chiclayo buttonquail maun antimalarial lavell untruths soja ronen lith krapp bnet spielman tishomingo phou overshadowing ecoboost irredeemably officialdom gynecologists ozon embargoes dwp nagra lamo musar laughably nags yegorov detuned dessalines dumbfounded eneco tantum fruitvale recut diviner unaccountable arnaiz igorevich panchali campa razon fard sanctis botch jaggard rady harefield joggers laskaris popsicle sokolsky savini finalise barbiturate cobbe palanga frons wysocki morrowind timlin royan madhavrao niceties laub fedotov asymptote debney fredi stijn loureiro brujo morna alcor panadura resoundingly veggietales vlatko lychee rushdi mayur ritesh karri gazpacho vajiravudh tijd kuda ashta servilia kcet fpp hyperglycemia amra suey pancasila adjoin rowlandson hodes iskenderun rewilding sfg blumenau epoque ftd pragmatically wetted kelvins shhh intercalary abin abdomens sarria kach petronilla carrère torg wytheville soccernet stromness sadomasochism remasters missourians stefansson rdm washakie dhanmondi decisis governo krak prynne bexleyheath gymnasia gouraud troost micheaux bridleway benyon inexpensively medicus bwi batz ifan maric winterland recognizably caernarvon nerc thiemo reexamination churchwardens hiran resettling brightening rashleigh jeremie beilstein griesbach landman samaná marmorata wavertree laurinaitis longyan changin alzira bakhramov hogar personable mccowan hackworth pepita husum natureserve rnd donnas buhler fasces hastening pechersk goldwasser marcellino oligomers hammonton foxcroft gesu kosovska wuc valledupar huhne insurrectionary guyane heche eum carotenoid liya tetouan anf uren malamud chiese drome lilting dursley medgar demoting demoiselles baqi jencks maddening longleat tarahumara hartenstein ardsley dedicatee buckaroos nien republicana headedness frédérique sisco bofill heterodyne casamance rabari unravelling fertilizing biharis gobel satirically ashrams zwick sadowski tommasi sprengel menomonee 
marshmallows sheil osteen truthiness karasu dispassionately psycholinguistics buro staterooms prefatory avakian baddest perpetrate sias fearne gluteal proclivity hecho doone gurr cowart sawhney lunas combermere punctuality moras muttalib zwicky wombwell akamai drood architectonic oesophagus ahmanson midvale retitling zerbe uki philbrick dahlonega gujral sorell detracting machiko brach awu neira impound opd wellstone internalize parched dobbyn eupatorium heintz continente kidston festina balogun ferrata ilaria rednecks flagships philandering monona cardoza wellfleet bjelica baillieu estas valores cosford laviolette klima delp yolks inverts dinny undeserving melodically hasso sesia wessely smitha trelawney davidian xilinx aycock mwa yupanqui iden ganna boggles malala ande alfio fage mundine excretory factfinder kamei aleichem achill hudspeth vikrant laxity delgadillo jacobsson rtgs shergill mccalla paume conker swoboda brs epitope shiawassee accredits urologist foldable horsetail genna pollan wadis bipin audran derricks takami khj winnowing ellin faruqi startlingly moslems unforced psychometrics adq bielski ploughman campobello oikawa gunfighters mowatt auburndale bozizé kilmaurs glaciology dilorenzo coppermine acceding ruiter jacquelyn tunisie inflaming hagiographies livestream caseload xcode aniello keratosis rhomboid linge sadhus marrufo aesa fellas graveside caul contrivance backhoe apodaca zelman rossel villaverde crato kanab holdfast wrightwood chev humain knittel grinspoon pfk baram sprach chk systèmes nihilo mitochondrion interlacing tinned relf egp biofilms trypanosoma forbearance clovers endometrium lyda skan braver upping transporte multilayered apollos howse bnd gede megatons aurélien yanked pullout pnb gemina habitants unconsolidated huebner quebecois caudate parlament fahm schwalm stann kaffir thaman seafield homero dissents talyllyn hackathon ovadia willcocks engrave wahine remounted gaber rossland sluts idrisi sympathizing abdellah quandt chews lacour 
rainhill bagdasarian maritimus redshirting romolo kurram foxworthy shipka yoshie negligently arctos cgd radice condado burrage knick swooping eyepatch pamphili aspin afif jannie bramham hiraki fatou stipa imperfecta godrich vasilevsky clastic chd centring manara konaté lovey flashbulb energizing onstad prang jandek charvet mantri maskell optimizes landgraf disdained sohan disorienting greenspace leeman wyborcza bluesman jaromir achard schama arbors karni devaraj shahada mittens comv battleaxe npv kalem swatara malou discontents fredrickson stepsons visualised gluteus cedaw coyoacán legarda tramline elbegdorj sappy zerubbabel longfield petrarchan manliness castors jingzhou readjustment harborne woolpack hovhaness vedran collegeville hiya frings gasparini fotheringham freude dmf crossmaglen rainie plessey shirer dummett thika neisseria battens rehearing ingolf zelig fds buttle pyongan burrowes prin churchville aleman sjs trae battlegroup humanos goretti lussac countervailing gaw sweeten wetherell baronesses ader lenski dusit tavis ngf schüler balint breau fifer catchpole bullfrogs disharmony optimistically shonda kaler seck nelligan tenafly ascott amble severa assignee kater leyen tadd sorte hoodlums cibao macken indictable impingement humberstone blondin sushant mcgruder asas reversibly picasa monopoles carme envied worldcom aqeel snooky vlasta lilja afflict lipari cramming berean keay serotonergic psychopharmacology keim morphos ngata yamakawa copping dharan transcultural throttled sloughs eberswalde keightley pandolfo unrestored lnk roeg fragrans massine kluivert hesitating segev dehra musiq gilkey antihistamine hetzel tambour jonsin pennsauken celi solen disrespecting heirlooms kufra laureus camelford feijenoord fredman nikah vlastimil submariner toop crackdowns toman kipke bansko karanth fuhrer vishisht accreted servite womans gehlen woodvale deists edlund khaleda vigne sirota manjrekar latching tomson henniker longhouses retinoblastoma airfoils paychecks 
pinelli brewpub agroforestry brahmos amelio solin biga brust riled saputo atrioventricular recedes kafue tuma heidenreich kody kennerley volans hds dermatologic schembri trousdale probiotics transnet tsien hamman chieftainship webos fillon tjm irrationally balducci uniate bushings cantref mandrill abida catley elata cahan alcoyano hearers seaways nordby peleg celentano tjader bogert harav gandia meloni sobol flaccid vda lattre enum aivazovsky bto picardo fivethirtyeight oglio olivos upregulation windlass overreaching narc drummoyne disconnects briarwood porthcawl apj kada columban tevita keohane kurchatov intensifier gambles rebelo attis consensuses xiaomi overclocked sepa yarnell wilms phichit asmat pouilly hailstorm kastles dtu yukie pupfish carpentras dickman lynley broadfield hurghada stableford moustached nayan jovanovic macky egnatia wickman entravision willibrord lto subramanyam nipped vinu boughs verga zicheng preceeding caramanica lemelson ebo égalité contrôle kstp sauli rationalised knockdowns rastriya craxi crouched boyds namah connexions hyuga steeds atreyu betwixt santacruz lankester writting louden clutters jacknife coraline magomedov edzard uncheck cusa archetypical marquet ictr mixe chatterbox ruslana nozomu crissy opencast obd pretension mohana dongan coif accumulators saxes enero futur babblers chappel paran monorails chesnut rivett lannan manship narasimhan interleaving blin longstocking número fiord brooking acrostic tawney ashmole herath laffer katsushika uku cureton danis rivalling mazo verum beechworth bekasi gallants kalamarias leeks foden portici renfield boudewijn swellings whithorn eridanus powerlifters playgirl axing dumond tramadol czernowitz lamongan bogdanovic holliston spofforth cfls gpcr bijoy spiraled toral mayon ferreri dail kilmister memorializing braudel gunsight florists lilas tamiami amazigh hirono slpp varicella normalisation routemaster sangat irmatov reminisced suara tomoyo gazza gembloux bettany schutte virtualbox feliu 
humint mouthwash incorporeal retried ofra kintore rvm dolo repulsing kotov forsey tareq gnassingbé alcon dhow jayewardene gasnier anodes instrumentally aspern prohibitionist wattenberg whelk fairuz indiscreet calusa taipan excimer runyan lasorda buttler hdv volya scow nuristan cribbs drubbing iisc jarnac birkeland ybp resa briant bodh whewell louvres kneecap kalama kodesh tras wereld floundered poorhouse mahr rainford bilodeau peterburg totton puebloan fifield auras vernadsky puncturing retraced yuxiang petrolia roloff ncse rangefinders rêves chondrites weer calvillo juelz thompsons automobili tamia mantles viney detections altagracia kaloyan mediterraneo roding industrielle mhk cleantech enormity arnoldi craigieburn naseby strafe kuper escoffier vlt eeprom huggett vocs inheritances nahuel checkerspot antenatal peppard draughtsmen wakabayashi anglaise terminalis tonle driss baserunners ingeniously hustling ikf rls dukat psychosexual genio graffin ngu ferron parola unchangeable fuhr journ typologies jubilo aspca homoeopathic pupin unspectacular grunewald colangelo kafir attell neurotoxicity simonov balram pneuma simonton baptizing meiklejohn tarpley shakopee damayanti ottone stiefel pieri marryat santaolalla ikegami eul sirikit chadron maces damar eiichi vaticana burrus noisily smolin garretson riverboats stomper rashomon brigitta hessians osteomyelitis floodlight ganon coad oscilloscopes dogme puffball degerfors purr salin emax phukan bayar barreiros biondo speakership gráinne efrain gurgel sharq zha collectif subcortical rockcliffe errington festspiele wasi bpf multiplexer handhelds ulysse inukai laertes vinita kca trajkovski tunceli milady capehart cohasset windscreens abstruse hiscock incapacitation ifield tresham cheerios belitung superclass sadeghi feeley ammanford hames clete gwydion schwantz drooling gigantism zoonotic guillemin jaysh branston ferreyra fuselages takamine mcmartin cerny ruhuna kooky lahaye bechdel duffel diorite shipwrights irk isai 
biogeographical enbridge glistening twill incompatibilities nclb shireen torturers galleri nyanga manby hebner vaidyanathan interbedded germiston liuzhou reshma koresh knepper blackcap carisbrook criminologists thia diminutives undiano wichmann ottorino oropesa saket danaher waymarked phagwara schleyer vibratory hindrances mxr jarrad hachi bovell medleys clownfish lya ubayd mustin tarento catastrophically ducale bagheera ripoff tykes gance maecenas bionics tedford karmal lactating nhlpa tobar relapses morong maalouf freemantle mhic ryotaro neckline voorst moorcroft sluggers eldora cathodic castine ramgoolam popplewell pyrene birdies aldama massad fabrizi sayadaw wpvi plac malakoff tinkle igy barelvi deplore goodes weygand botsford microbiome roques canham wefaq upnp linné plt nemtsov transocean zulueta doxorubicin metonic fleuve syrups revises yayasan terrazas carlinhos ardor parral effervescent entführung ayah harrass woodchuck daqing beaubien holies baluchi idee ordeals seah havemeyer haast myall zarate vith coleco hhv eircom cenerentola slsa reflagged tarpaulin ouanna vitiligo northshore snarling trappe shelve redick bracks mcmillian markell vuuren ornl mth castiglioni sveinn baston villagra hcn coolpix winograd roofless wachtmeister preppy basrah alar montilla larraín kanae chapa jezero kahar depositary felsted wrenn lindsley fors djk oranienburg sii bequeath panin sarong michaeli ranney rappelling interject parsimonious csonka subtitling mager flaking questlove enon demjanjuk lefkada shaik lightwave fawning unsavoury walkden bulnes chieftaincy norby batas pitjantjatjara repechages villaraigosa rouses drabble sarazen gpb tongass voskresensk kadish tankersley wozzeck lemans techie sipowicz crofters moussaoui nondestructive loveridge franklinton mitigates zonda hominins hallberg tölz kashin fridolin trekked deacetylase licked bisphenol webinars braasch muttering boag gnomish subrahmanyam veep typesetter painesville shandling eurasians engström celal khola kieft 
kleinschmidt khaliq petrillo perkiomen meulen toti ditzy gullickson leaderboards bayrak abductors satcom kabale refilling manns saeko bahl roundhay turbot demilitarization azeglio ewbank littles ralli meirelles activations sanches llanes cotopaxi sekiguchi caddies ohn poignancy boruch oddest conveyancing kasahara rapido fomenko lacordaire trai pernell trolle norell vavilov swaine scalping haun bookable nungesser strongpoint kempsey dyan erewash friso nadeshiko skee sterns luckey nack mixteca grega sarl gisbergen delvecchio sonnenfeld franti rinconada balart mishna rubescens raphe sakti zonder alltel massi breaded conshohocken unreality blytheville estrin rhuddlan lepsius carbs marples iud lautenberg kundu pirès rege smilin nitya beaman primorac nobuko fenny butterley banquo rosine anfa timmer roter smallbones hoti armijo delerue delphinium bassetlaw harmison bonfim oudtshoorn collocation fretting aase calipari sakakibara aerolíneas absolutes catz essenes yerushalmi winick philharmoniker plouffe humani asphodel toiling garibay duplicitous pravin concessionary sennar molony harmar rax naira panam cpac gratified villepin jugendstil affray blasko ahp encomium charades walch shampoos coachbuilder jospin fabrique fests philadelphians seguridad colyer drancy osco cuadrado oculomotor atenas respawn spacesuits zampieri boras houde penso cepheus neoconservatism tafari cinematograph mwm idomeneo addled cruachan petruchio cognizance kriti unheated succesful eyeing avca arnos harting ionize somdev putrid rummenigge rearwards feild zazen horley laminates neutropenia striding salzburger foord olazábal piscator yodeling glendinning machon miletić gojira ceva bushmeat ppo villaret breakpoint kalk ifaf vasudev sodus bothy ferryboat patrolmen romanovich introversion felts duvernay redeems obaid mollo borrowdale poch oskarshamn serville exocytosis sanatan bewegung fujikawa instantiation lounging berryhill fenty huisman duggar copter bayles reidy throwbacks hairstylist hurrell 
islamofascism � ioannes humoral lifesaver sungkyunkwan axtell madhyamaka ariete mosport mikhalkov moyse mpu gromyko ottmar uag vasilievich hettie pufferfish smallman heanor cung ozan fridman patry ostapenko virginica rumer crowborough woodworkers staatskapelle haret precariously foxhounds kizuna vaseline godfried ents centum shigatse lahontan rexall hickling billund nibble erkin phalaenopsis blam haumea unmixed myatt desmarais bowdler usra sukh disfranchised manipulators ovenden mulcair retails orry undercurrents burbidge schneiderman aroha survivorship statment sterilisation disarms afrc okolie kitkat prefs crufts lathyrus warmup mégane reconditioned tapscott hpp dearie williamston arnulfo myoglobin norwest longe squib olap derwin jarret bulut nspcc hatchling snagged wristbands janka propagator gualberto expeditious verdigris trapattoni envies askmen festiva bhattarai guerreiro relaunching nesn ljung bukharan retentive jemmy recirculation gavril mcvay observateur sectionals aldington irelands filmic souda vandiver dubuc schwager poisoner hmb elaborations rusconi beatie knotty tdd edifying madelyn hix guar hossa payee nanobots chorizo surcharges panteleimon pouliot suds olympe cahuenga archies mclintock nasha giordani leinart lote matanza kalash informality iiit ewert nonfunctional verulam bowmanville billon pimpin philoctetes berdan donte washburne biba snowshoes handelsblad dillman banki creusot fatalism dermatological glatch cantinflas tibesti calzaghe multimillionaire aauw arneson meinrad mujahedin poodles kima hamo sacrilegious iem salvadorans goldring knbc schillaci sarna sitamarhi unspoilt falken fauvism celestina iannucci mikis shivam proffer linhas rifai syphon ists ivr scions kadhim fundacion saccharine mistranslated boudica campina gnarls trimotor rinsed kef leotard haudenosaunee charlies marrone canneries gallstones magherafelt vln holmwood cabane believin burka gibraltarians gallet sinc micki lefort astrometry presuppose sleat simples parla wildes 
nonexistence vouched necdet denker oosterbaan aristo luzia alamodome transbaikal zulkifli chausson wsr yelawolf transbay toliver tnr saen shepherdstown hassani glaciations brained frescobaldi unipolar uel gamecock rothermere pithoragarh delcourt mohler dysregulation caza swinnerton bevy colorados bekir siris mits scrofa duceppe tdf gruel deloach vardi kear ledley goldcrest meireles raso barebones sart angharad veerappan reti frazetta bitterfeld kokoschka robbo holdouts conciliar aronoff vandegrift dilutes agno variorum sangar palaeontologists overabundance smeltz bundang corre olpe bigsby kelkar dogo sonatine macalpine maintenon zep campfires lactamase winnipesaukee exelon bryde bajrang racal winsome mnemosyne querido samedi bogalusa bucknor roadsters fireboats sidoarjo trod arbitrated fost nummer tapani dws outselling loots melanocytes socials modica shalott steinhaus tubize cairngorms radionuclide elbit strawbridge paediatrician blanketed myat lamacq litani daintree kampi adminstrators clayey simonetta friberg supermax xna jmu dohuk kolozsvár kiener jasmina fct libo jennison uptime snd tirah volte electronegative valdis bowley conscientiously oberto druck ranaldo stoute telia finmeccanica mckell shively deliverables wapa micheletti shinedown momen laufen bibiana attestations tegra nber maurin bruyne ipe ventoux gorgas ayyad rayalaseema glutton chickasha inria luciani rollason trie apure ofa yoshinaga bedelia elodie gorsky mcparland luqa henrici woodmen garryowen sukhdev greystones supermini lsat numata moralizing apostoli latécoère annelid ventriloquism duffey superstring crace mentha tht shaina briley healdsburg idylls drudgery bogner noro riss tamika galassi tischendorf zuiderzee bacteriological dropsy lyngstad ebrd homebase chevallier dhyan ahe hackneyed bommel barkin slumberland transferrin karami bory conewago kubler bandied lauriston flq vts doroshenko tobit chato humanae breitenbach almeda snowe jerrod mendrisio pokorny claros mosso yauco honore beady 
schoch cdd bilgi abravanel protestor automatism wexner lafon byker giugiaro cortot signoria dhul huberman thyroiditis atd beeing abelia kautz pbi puedo ulli ramrod morphogenetic orito crawfords schaap iorwerth taheri malaka herniated kabak ichkeria gualtieri beatboxing palaeozoic dunraven jungen eastview merrow ironmonger cruelties massaging tabard monocular labore ellenborough limnology hassi subdeacon theismann gernot littman rowse incongruity haarmann lytic drachma elkie holbeach ull newsmakers roncesvalles unpersuasive samora czarny izawa bhattacharyya angelopoulos savill hinshaw magnifico contextualize slimmed tandja luzzatto sneyd wagr metamorphose bmr chioggia liedtke shean teichert manhattanville retardants wjb orthostatic inconstant rushall corollas komet mard backboard medibank iacocca depalma skalica zyklon braunau sancto oceangoing oana punctual pandemics colom signori mrn belsky sweatshops misprint cator misbehavin caffery laporta adders meyerhold mclarty engberg zhuzhou hydroponics roelof silkscreen rabo mayorship mnlf bludgeoning shriram domachowska evidencing etcs salwa pela trifluoride hoochie digha pipette objet quaresma keerthi sédar sadegh horikawa defalco lluvia konzerthaus reexamine brimley campen jeffords preteen awwal berkoff wtmj lovisa flatworms duren chian fashionistas hilty aphra touchline bawn decanter asaro kenning wynberg torrio kentigern cels lapine fortuño katonah hunsdon reaffirmation kansa elmham npn célestin gandolfo salop floorplan söhne marana sócrates ahhh stossel régine lucho volkova nedlands ebadi thermophilic deneb playout tashan irondequoit ataturk europium lazcano harpoons tirico dirtiest troopships polgar magnin galpin aulis meleagris vibeke tipler comptrollers coalescing dependability santha nagumo greenacre hessler blad claussen fragaria ballycastle crosier artilce janco trigo goldenthal laurinburg nmt jarryd dunshaughlin tdma kittyhawk saltoun girlhood overdosed incidently rhinestone leonida sinsheim laypersons 
climategate mcwhirter nerses whitelock uckermark terman glaxo drl tebbit aure shihan chobham tines disavow advertizing erba morneau stroller refn beetroot azor belsize threadgill milia hongzhi hegemon nappa faery dimitrova siodmak doogie keychain sanader monicelli auks sasse mundt xpt odilon earless emley perrins crossbred gpx waifs terzo unexceptional mdna londinium prydz repackaging sanne parys grünewald gitmo laraine keni latinas rett goldsmid shoeless kubitschek flavonoid grievously fasciitis bopper unpack impregnate basri arborescens fassa mlrs blinder lundström westmead retracing jfs oozing edb bébé silage lukacs kazu tournoi gyulai sence clinique pulman stagecoaches hunziker sipping scammer cashiers donk ferne barnstone consciences gry cilantro parkhouse erden mede nili alannah zetterberg partai nightcap unviable scallions thornaby proselytize curds greyscale joëlle downgrades stomatitis nextstep gelling heemskerk canid antoon jda megaptera shoring narai morceaux politicking bettman entree intérieur pinkston croak rationalists rajoelina catto guster hostesses kitesurfing reconnection bitchy viotti ewha thum gei hollar torta ubiquitination kafi niarchos deferential mortara cernan scargill bolting longyearbyen flaxman guttmann ucm emanuela radia torlonia frankenthal dsn riu sibanda chondrite wiwa morano gorée porthos ozeki palatino uppal rjukan domon lewontin mirvish bassman mandragora thich izabela fabricator dressel mutlaq biletnikoff rigas gdb ronettes chilwell printout lydekker lgd deformable peronism ilka pewsey ifi lovegrove lingga inquisitorial illuminatus schwetzingen fotis evapotranspiration kejriwal ferhat rijkaard pgd buehrle mehrabad kili bookmarked warspite vaticano sundquist kgf offishall siphoning christianshavn calisto hitched wata laceration bookselling lomakin choudhry olan chilkoot ramiz tujunga caymmi tethering chilies freethinker dreamz groombridge saxatilis streetlight garcetti fancher madrazo pettitt annalisa springbrook elrington 
kingsmead koura toenails koffler tavira conjoint chiro sturdier schwinger lebreton matignon quickstep lpl afflicting kisii mieux cnmi yoakum nti openshaw redi castilleja russi phifer romper ragdoll braked illumina amas berenices burchfield backbenchers paku christe osd hosein fougères mandu ggg keown mulde findon bhel lionesses tardif vivica septiembre dukas mcclusky belligerence pythia paye unfailing windsurfer aeons brocard lushington unobtainable patens tbk alekseyev peleus phelim gojko psas nutini umami nics craved undersigned mayhill ttr foon sonik epidemiologic filippov gershman mystikal shorne cinémathèque chorea kazushi loughnane cranking teltscher evgeniya honking aspic gillam plataea midstream filey chg barked maass participative hendler kleberg begbie bechara ceux qingyuan unbelief dalip dogmatism leaver gameloft panesar lycee hindlimbs ameliorated altmark cumulonimbus royds sabz modis gimpo lajpat goe aggrandizing yokoi kilborn coops subsisting ginter twat roget oaklawn socked clayoquot cevdet pequeno elapse pauri vocalized nambla börse vif oppressing refile zapper filaret mortaza coralie kua sdh whibley ncw jacmel kellys dunnigan demographer fussell twinkling forgers grigol oxted calamari gaspari tamera fonthill reexamined eius hoffmeister rangarajan cobblestones vanja panji weinrich westcliff draperies barackobama taglioni sandcastle ebeling dexys yilmaz aplomb chatted berghof ensconced electromagnetics atavistic hauler fatales halkett bombe bottomland modifiable sugata ellerman warbird alvey lysosomes mortise kreme ior riffles labelmates panaji tatsuma ophiuchi caernarfonshire julliard tournon bourguignon halvor droning cordage hazem climatologists yawkey forbin frannie róisín eeyore mirny varas effluents occluded chamomile biter boccherini awed caiaphas summited ronit bajan mearsheimer haldia fée baiyun spangenberg exf caden vhp flagstone customise girlguiding lamination anatolyevich bati wgi ngam cisl jinshi sinar matzo transmilenio retrained 
giganteum ekiden notarial macmaster dihydrate majo interning disfavored yixing cayes bispham broonzy fass blastocyst mirella balbir siirt keshavarz asplund verisimilitude pfeil mitta wombles circ dürrenmatt tradable televising bouffe fiel contini graciosa burchett spiridon manicaland riveter debarked hypochondriac wingert majidi glorifies majorcan pasquier zschech surreptitious yixian blogpost innards rosson transdisciplinary fécamp brahmans steenkamp pavlik slocombe laffey pernilla ibbetson doli barby talcahuano usj ganguli coase kantele fonovisa lymphedema skewered miyabi despots giora kunigunde cockade sedgewick yuzu polkas pinel ueki cooperativa becke reedited oink peligro lavenham mé nørrebro castellane vestas ostracised deputized godric hej proximus cnts misrule lafollette reicher golaghat amazônia waterworld jbs hurn holmfirth devolves equitably driftless rbm islah palps veruca borrelia specialisms morriston pereda oben prokhorov walpurgis restive cadwaladr ewi enke graney slipcase lothians jaja fulkerson atas lyttleton cinchona oystercatchers cubists sbf andantino menara serafina fahmy sahelian ahamed suhl kilauea adamu deuel siberry creswick nall flattop reconfiguring blackledge aplin intertropical globalstar mangling aste narconon tonner guidestar gramps jamba clathrate unwashed sanusi marías disproves residentiary sisk kefalonia kmp curti angustifolium janeane homeschool illusive cadastre grosseteste outerwear extravagantly letdown sailcloth rosenfield capitalizes sampa bolsters skm lensman manya bartercard coulon smoove opatija armillaria sushmita inhofe conagra bearsden stationer braes munford ljudski cata authenticating aircrash ewca wwdc qrs britomart snorting scheff arête invigorated yemi revis ethnographical independente shallots jawless henstridge milngavie mukherji kupfer dexia vien quartermain yasuhiko newswatch murari noblesville farran waterskiing slicker nmu curson usia sizzling noncompliant lifshitz depressants improvisers oller clinches 
orphic macculloch innuendos zoomorphic fanling egill emme kiner jerash rededication depopulate balbi equiano muramatsu danka taca bobblehead cheveley territorially lygon garan mayi alfonse industriales lusts ptcl hearer ijtihad mashantucket impoverishment charo ledford teigen auricular vampirella bebb hadamar rostelecom honora rosendahl polona jeffro grez hayim armpits korff koshien seato fcd rejewski constantinos kollwitz bisi biodegradation ketu anaesthetist dacosta pruritus typewritten caniff meritless willstrop randleman caledonians elmont ballmer poshteh friedrichshain odai megabus idwal majella tasers folkwang greil bordon trippe starlets recoletos womenswear wickford hispana odiham michoacan macaskill oracular carlitos vincents soukous unlit cazares schnitzel bezos guillemots riposte singhbhum kincheloe veltman photochemistry saez beriev torinese hoarded guarulhos vilhena sharpay boley moston custodes winnington coupés crowhurst psychotherapeutic lof kurian talha kubera jabot mgl loups taverna gentilly sledging headfirst jordanhill padraic fashionably rinsing cortege fav fakhruddin blinders theiss fontanne sevnica vermouth dhubri coola ruffles zichy grappled shenstone dirhams uwm someway kyles vgs erf rebuff definiteness primm nullius kadar enjoin cascavel utan eleanora principi sneek tracings poros spermatogenesis naze recrystallization foxworth nazarenes almir fiamme wraxall ammunitions cistus dissociates sorters kesavan checa adelle liebert neuroprotective pietersburg bampfylde guse whiny wapo minimi bimodal hobe ntm vpt teleprompter nakamoto coello morbidly cashin obrera vinje pujas chanler réserve leeper silves pardew steedman extemporaneous krylya fna hsus freewheeling grigson zaytsev reactionaries stimulator attractively halfa koc exiling postmenopausal flopping smoothies hedged occhi hartfield beighton asya weatherall daltons euroseries almirola zoi caballé empiricist autonoma newydd sequent rashtrapati baucus kaldor hult menaces libbey delicata mif 
levada copleston switchers ltm fazlul orthodontist jaleel lertcheewakarn humphrys fantom mallenco merlini kapo komotini capodimonte slicer kouyaté subtribes philae tanfield dotting derham cornhusker vrain esparta flohr snh usba chacabuco mcnichols gràcia wasit mladenov flabbergasted oncle kajaani ebrahimi fsp homologated qinetiq jabar enshrines ihi peruzzi aneta unfailingly sprenger phx bulkier faizal detracted tomasson prinze clefts frontwoman graziella creat enrages liebenberg benveniste rubalcaba catalinas aurélie ctvglobemedia altra anstalt meyerson lagash woolrich josy tle blitzstein fpi braunton tainter pergamum stac wallenda hämäläinen hidatsa konar jdk timba jamaluddin kopi pense liesl pipework gopichand takeout naoyuki crybaby karunaratne gde gentlewoman grapeshot flirty longmire heaving fuge koppelman ihara disfiguring attleborough epitomised avma zaka hfcs llantwit dwelled namier majerus overextended nôtre pagnol purnia unbundled voit plaxico serrate saltbox pedroso mcfee foxhole guppies altyn speiser shchedrin seawolf slovenians maund phua ijaw pasturage hib tamarine runny kasimir eyles harnden penniman torbert franchetti chondroitin agard denotation rezoned faslane garrod neuropeptide boke sledges speier yurchenko olc mayas yonah igen jta lrs rekindling elefante vmm dul wco mrl busily appallingly loes crome chainsaws bullinger braunstein callington swindled tubercular wanger ilario dayz gerona gormanston enfilade bour endowing kawamoto acheulean carrero nangang jfl katsuhiro sarangani yik meckel levering hestia widianto nuisances finbar falmer collinwood gni crossbreed killington xizang roundish meucci glascock validator umb koevermans ceramicist sericulture unexplainable shearsmith acappella ghadir patmore spiner dalmas cosmologists arreola bajau golfe xiaowen tokunaga cootie treanor witkin parcs desorption godse hincks bergner dillen warfighter serginho mineralogists crazies duplin pesca whitetail dayang skein kelo kunihiko galusha diam sawgrass 
carnations basileus scooping rafer nuchal viereck irun binger boinc bosa doubters stokers tastings geographia hmh arikara assai guilhem mycobacteria baidoa tacuba rayen aldersgate jayanta watermarking littlemore gastritis shigella waske blowhole sonal provincias narcosis ceda liebmann cifuentes rampton loudonville fadel mesmerism aisi rentschler emeterio markova kardashians merville samina sherburn solipsism cavill dramatizes groans napping cisterna humors lomborg polkinghorne deejays macbrayne technopark ecsc unheralded inflates selin lonelygirl honorarium overtakes fuku renfroe houllier battin seconding patrouille tenax tashiro gillmore lingard verismo susann medicis sbr foxglove prewitt lavatories ekrem soupy burmah musca kivi pathi volcan kocharyan orava peterloo epb kittson kultury floriano jareth antiparticle kep maag chinooks quinze gom gharial ois rosebank morarji beeb chiptune waxwing flim lenglen lecithin untruthful anneli brockwell tiltrotor twentynine frictions vallely amarte matosinhos panyu dfds mcnicoll vij shahabuddin besi baykal bilinguals prothonotary modine revelator sudeley cronje vivax blumer equivalences perfumer alster syngas buswell motorcar katsuura westfalia whiteboards caeruleus branly tandberg pils alcyone margalit rationalizing normann hiu flytrap havlicek argerich cimarosa orcinus tungabhadra channelized serotypes bushveld hapag pringles kett disodium lorien choreographies calmodulin bismillah ballew yasha tamarins eliel modelers rylan smokie nativism oggy ionel clipperton papaver jonesborough nasher regnault norum kasdan czk bellicose cuno fortas collateralized caufield hercegovina jalapeño messaoud carob shakyamuni miralles mumble shapeshift chicagoans riskin suresnes moustapha culottes sansovino libi pigman riverdance enga floe inauthentic magrath amey sonnenschein lacandon jiuquan aukerman duhalde cointelpro kimbell toefl landfalls pouille zeist teti kumagai parkgate zanni spicules mágica taron adán epitomizes tisbury helion 
tornadic vidkun ishpeming benét rugeley alienates talgo dubinsky poiret ebersole cynwyd kogyo duende zanskar colombiano kennecott monforte oakie flayed geis boutwell dito undecorated dendrite oris latoya chertoff beveled pseudonymity henric novoa dataflow photorealistic drin transdermal sanstha nikolic froggatt irizarry striven ronni unsporting hcs tawil grabber axelsen undercoat liverwort lincecum wentzel repossession marshman aldborough theming monteleone unsorted impalement mandingo flyin recusant upstage yashwant kuhlmann destabilise truckloads goalkicking twiss tridents chemokines noradrenaline brw presuppositions pene quaye kirkstall bismark deboer plettenberg dilmun sommerville borut halling giancana dingell taine leavell menaka groupies tumbles waseca splendida esat saleswoman emiko broadsheets uis jey jolivet plainville shadowland aubusson leos fpa crikey flannigan rulon kroes nasirabad deepti gryffindor candlewick lathan bersih taytay puis southfork romanelli lagomorphs obv behnam rigsby haidian grimmer jaruzelski housecat gilby jra chives legalities allegri jastrow sashimi sprains regimented serialisation feuchtwanger rappel tideway bucknall intermarry introvert montand litanies microbreweries boardgame circulator matjaž cetra layden vreme kupka ballyshannon apted matawan beatitudes ansons steelman tramroad rocketed jut palpitations cassirer osório demurred fethiye figgins nmp ballymun kinley encasing hermoso cowpens maceration shikha bewitching ercolano reggaetón volare fibber zile sigsworth salvesen dkp harrodsburg denuded sesi banos mutantes landlines perimeters covergirl clouzot bolinas mcnichol mirabal packhorse thorndon sturla olympio joye perico golos domina riverland clutton recuerdo mamou behaviorist bessy sepulchral achmed serpentis cypresses hanser rosewell bulleid kurobe orgone gsb disher bandidos shebib suntrust gorn shamash libels galarraga prien seip voe unarmored blacc khm mountie dieux wrathful tomtom motorhome cretans qaradawi goor 
jovem cascio ravishing lenzi attractors dragana skydivers sinding hendren mahama thyroxine demerits honig studbook teetering doxycycline widgeon dishonour tatters hairstyling creekside poum sammons wiss ratanakiri firdaus deedee hita kibi gic takedowns malmo hütte lapidary mercat majus dematteis curonian karmas athenæum morar shaves uec boldest pernille endorsers chalabi groundlings antiphons addio mannitol mistreating silvertone yixin malathi greenidge bychkova sciatic broadford arlequin biedermann pahl practicum sleeman streatfeild myler melson yor burlesques ifm levitsky weinan contrarily stellaris pasos doctoring jerwood righetti vermelho junctional piat puls bogside mattawa spamalot ardrey dyspnea arni yanomami narang huahine ibragimov rettig trautman scholastics jameer raftery janek fauvel sugg sharpie rockhill kwekwe organon hizbullah fokine didion inheritable woodmere outputting tempts mizzi bonhams reste congeners allerdale zhivkov managements borderless lel rahbani crocs comiket implode ptosis wristband kirch jaxon martire sulmona dignitas bubi nanosecond readying prorogued alvares grm braved consolidations tkachuk meres timoteo caraga redder sabatino roasts gastein koffi aniruddha hofmeyr knutsson pseudogenes rakoff equestrianism magnetometers obolensky aqil arnica angraecum rolin zacarias kttv tfp fugate lawmaking barthold novelas vlog pindi spu ulitsa khanpur couscous workspaces tartary umphrey kisser kempston tanveer lipsius mbas ipfw maclehose flyleaf kampar gurewitz sifre levitin milby servatius aley likelyhood jammeh stressor bagapsh audun esn nort coercing granodiorite schwarzenbach honoria neurotransmission scilla misaligned granderson goree spiritualized steves kashani okan richmondshire lohia gundlach schirra melvoin anthocyanins copepod masakatsu saqib glycerin koza ncd peled theists gilberts susanville mosasaurs tonton takaaki lanford wikström penitents daran secada arbeitsgemeinschaft chromis tihar diatchenko courtaulds anssi gazillion 
beales unpatrolled alarmist moriori militaris korhonen tsiolkovsky eyespots navami berliners dehli dilatation gentis mercutio mikhaylov kingz naral inattentive hessle benavidez alewife bagchi reloads expeditiously seac pagel simulans automatons bloodworth menges conseco foreshadow jugnauth eski treo transat suprema gessler bejeweled intermingling kilcullen trawls ldh loins botton rightwing lilienfeld onomatopoeic oaktree dusen yadin coutu ehrmann lydiard ravelo veron bookies chiharu rbl tororo daikon jacka ragwort tuamotus avena macero chav recta merrit shefa dorsolateral mascis finglas willowdale crutcher longfin zahedi mahalla loing associés gangotri scruff tenge garrincha kaleidoscopic sorum curle beartooth rosacea sphenoid standardising ryne aikin espina roskam pedicle puttalam sumathi shepherded workweek wonga chilterns leixlip merriment abhorred parcours dosso xlp preux corrina gavriel gyles tivat cosmetically counterstrike yeley buzzi dumbed interpolate gie perren episcopi conspecifics eirene udd songlines brizzi kalwaria prepped efrat leve fraktur tabo garon grigoriev arumugam quia marcum lepper finneran telavi barabbas noonday mckibben ansah grundig nigro foxfire dbi carterton benigni genbank bushwhackers ashraful zonta shamsi roxx samsonov spake airbnb dreamliner glioma festivus floodwater gbl cecchi kornheiser silverdome fehmarn stepanova ftth treptow disruptors stably lalique hayama farrugia megs kartel cramping drac skf tarragon sondhi hemorrhoids nicos kathe sncb natori mappa hermogenes crisfield lodestar nrdc coleg edibility asantehene aqualung rra sacchetti guanche blackham novoselic highwood saisons spools gijon wpk meijin afrocentrism contentid aimar oshii nucleated citizenships gynaecological yonhap canford jacking lge referents tilled bota kunene prud locher shikibu navarone filmfest kapisa hawkwood stottlemeyer glamor ringlet testi katerini kelvingrove interlocutors spid phosphide handcock stammer baringo seekonk kostov pallister bostonians 
gamescom callinan wilkinsburg staré birthdates stoopid walwyn cubical groundsman maadi levens edman nanowires rustler consequentially shb gloved parsnip ybarra reali kuji loveable bushrod achelous fone emv dorinda infidelities denner beyg substantiates bulli salicylate ambassadorship vds inarticulate kirkdale xbla bontemps mikkola mccarey dybbuk canonisation hogwash oxidants kilobyte wettstein speers ezo hillebrand balladeer humbling bocce beauford anagni klingenberg foust flecha keye quebecor ⅛ pdu playthings lpp durward hollenbeck gossips cini siba rafflesia jaimie rountree bilk gasperi bhupinder harapan eyez bowtie stabbings familias snu inco potala scheidt benge franceschi osmund toothbrushes lomu himanshu moyano nationalgalerie gainful florentin khare kadu smn lenhart fahmi appeasing brochu maremma coeducation bodoni pib ewers brünnhilde sandell dnv trivalent abbreviating shellharbour airtrain pimpri blaga trou fukunaga gittins carmaker mcateer unlined jags palestina mulkey omics ruhl plop pennebaker harle chapuis salvatierra latitudinal balazs rhapsodies tuncay soddy bankside arcanum morshead scheibe généraux meninas annu yamaoka beaucaire redbone germinated daher exa selex tiwanaku electroencephalography giustino robbinsville ridwan dimitry thilo ubaid ureña trin javelinas antigo farrel bollards cravath vlaardingen asala vinca tadeo lutetium cici sodhi greenstreet exoskeletons asaba nghe dehumanizing subdural komeito shoehorn revitalisation carcinogenicity keta tansey brummell brookland chaperon aways photomultiplier charta clayson equivocation aureum callin annotating baotou janki atanasov bartell pornographer sablon phebe endeavoring signoret nightwatch usbc erh braeden neurotoxins chippy stokke ingi witwicky lettie ladainian charmian truncating wnc nambi terza khandwa efg usted communards palmeira wso constructional rajni dooly koca ashbee diffrent curial familiarization johannesson vanilli paclitaxel boloria hexes mcvicar mnd naren edelmann dilli 
bickerstaff abie reentering sharer democrático sundazed ebl zucchero kiswahili duratec foer gavi nch antitoxin pigskin furay klis tracheotomy pinjarra kinsler olha macoupin chicanos justen maginnis malevolence torero voynich heping cloyd reig laeken bierzo pegaso lelio grr andrae dubrow optimists mandriva rennet nickle heggie maca ghanaians laci grise multicoloured reawakening wieck slouch internationaux leathernecks melanoleuca deceives tanagers hemme dittmar wanaka rns ciné migros burritos crabapple manch marañón guv absheron dasari hild microfilms tinny middleham fretwork subash ifo appraise rmf oao erosive devol prejean mbna amenia jiong kuribayashi tings silvered towcester rli shukhov laba vastu malco waterless locka vermaak atis mcbroom baynton pey carrum andreja aldobrandini raus lumens mendi dellums howerd biller sevierville shorthorn turun jabara regnier amidah bunnymen sweats accidentals entreprises tomaszewski debentures idioma alarcon sarakhs napanee kornbluth raveendran nannie roadless surcouf intervertebral hunton writable saberhagen buttock bjarnason skewers galloped nordberg hinch rovi wakelin higuera pll stokoe jimmi incalculable menjou quartetto casselman commerical aerofoil celio lippard maroochydore gymraeg pleydell tamira chaudhari balme slovenske pictor progestin tats delaroche cravat lampley soylent winkie mullions agilent serik mengelberg klebsiella verboten beiping edad riverdogs indiscipline redeploy maddocks inequitable fattening nago majdal littorio commendatory flaco hedren kamikazes etim sialic immeasurably semiconducting golems flix raincoats givewell stigmatization etzel centcom lobbed jaintia doti impugned brind constrict jnu merrin moustaches tennille hache aspe bioremediation liaising decapitates rotherfield klosterneuburg nyima sameera infuses corsaire lapaglia botting vogelsang hangouts piranesi takakura arjunan aritcle abounding gamesmanship windspeeds vavasour philipson keeney riazor vilvoorde foolhardy ratites wryly jeered 
mieke ekin sternal urbanist imprudent literalism feith kufic gulbuddin waldmann gird icedogs dorothée perverting synchronously dwain chattooga moman spektrum cavs wharncliffe mbda bergoglio jedward redistributing pravec bolaño supers awaz sneer barranca shifa gorrell myp lpm irix schickel isostatic interfaced yaqoob quattrocento baad guto harian menuet syosset cenk danses pallone dunce chastising religieuse corbridge wilcock neng sardo ahriman sidek hinrichs gyros louella bashford baretta narrowband percheron skiles ramberg reinecke agrostis potsherds orangerie konopka wertz whitethroat altamaha dco qube redfish droga viterbi ennui afanasyev exim deangelis pleura kazunori sujit impey laar thorbjørn barek muhajirs dreamweaver bindon batwing yearned mcglinchey mcgivern phobic alassane leota ermengarde roope vire bosé thro linkable mszp raved eales hardcourt flit fluor efm unequally carisbrooke roentgen mallya destry idk northerns schwan derangement soumitra stormer vivier bosmans boubou onr hitotsubashi sadek scirocco reinado eib patronize leavy mühldorf porticos dixson agronomic chromatographic redid zurbarán garissa niggling contraptions katter enuff wjw camby sensationalistic huffpo haslett quantile impounding zucco portocarrero fruticosa calender amw asem damer paktika pish cahaba thinness mashal jigar ointments warpaint nunthorpe martinis kabwe kranji flos parlours akhmetov unblemished jonty colledge rimless baumbach brocklehurst calma vadym voinea infiltrator deana ziguinchor malim weatherwax predestined unaligned twyla jetsam bourdelle seasick willison schulberg thelwall spinosaurus coelacanth harbert streetball ians indefinately rapturous beran vab woll greenall taraki brittas suning multistage voortrekker ianni eoc fitri utpal himal yassine dyersburg télévisions gorani dnssec subverts kimbolton barson spillage ltr scandalized mondello hbcu asger krell pollsters allsop choudary laundered eukaryote fucks takeaways awka vokes placoderms polydactyly ravensworth 
plassey flog woofer incase habilis matings rawtenstall brasco rangeley barresi fags nejmeh karmel hurlingham greystoke shadbolt paese alcobendas marone ltl hackettstown instyle bayamon dvt hengist multics croome dionysios tuppence startin torchy sagawa exigencies gobain awt padgham verandas mazara ozymandias pme schriever lollobrigida emts zuri calamitous oikos sandridge stonestreet sunapee hamels crucify sophy marsel collymore saoirse tenpin aughrim subspecialty negroni rayfield dutilleux mccullagh noye qadisiyah trutnov velour prettyman kgs vib boesch guercino prestonpans bookmobile pomeranz harddrive rickety rajkumari pretexts imposters oncogenes raph edr penrod misstep biogeochemical douze professionnel shockey wantagh walruses savannakhet penndot newsweekly praslin idu lading xplosion sherbourne griz sevastova sigüenza mazen prambanan tarmo soldati shishkin branes dreadfully irreverence sieves siqueira aradhana amell habituation gené kirribilli linna weigand mirrorless souled fakhri mazurkas bolduc espinal teamtennis ruminant nadzab fuyu swanee savan splattered zille adrià rycroft mantas smacking brakhage mccants tenri schilt semipalatinsk girling functionalized sportsline nassa belson rahner btm clercq welborn bakehouse cowdray leçons handelsblatt blacktop sequins venecia saputra orignal motorcars inscribe pedaling tob forsman banhart gupte endnote tunicates reversibility dahlman centrifugation himara collodion circlet heilig beneficence eclampsia pipo janjua fiorella framerate chatrooms rauparaha carine beautician troia chaudhury melati dickins doddington olb prca decimate arcangel treacherously newbern kantar erichsen gramatica evora sanlúcar ngee metroliner asit bergens lewton traunstein neasden sask winkelman apne mady elmslie gaiters hütter mccullers chihuly luong headnote mapfre schillings caedmon enchantments florus schuur panting shomron automagically magners alces partlow notturno eclogues skoff healesville mauvais fleisch opv aine twirl auberjonois 
hajo knowable haredim bradburn kurla goupil chengchi knapweed imminently , quickened hypersensitive delson widowmaker genta rieu wring nisshin cyclically kurti taiz banqiao wonderwall whoo klim ornithischian prednisone disobeys jnf truett ichthyologists siku gingold muggeridge amphibole concussive catid technic uppers spiess ombres actives besnard basketballer calheta accessions berl flypast barkan zeidler fromme björklund bamiyan entrap aerogel cnf casadesus sergeev pcn cauvin untried superego parlay savate phthalate goodkind benayoun arsed divisiveness morphou nnr pepperell emetic kincardineshire scissorhands souleymane blainey wapello prete thewlis bick rohm mullard thoughtfulness reintegrate splintering adx whitepaper kardar iestyn entremont beza laminae militaria songdo morville pottawattamie lampman yaz montgolfier coso azeez thz mandrel anarchistic peristyle awaaz azari naif cdos istiqlal ddi ddl ioannou toshiki taru fizzle trb turman starlite transposon arnolfini astronomically rosalina bandh bombast cassiano dialogic tyrus demus yv monceau poète osmonds damara sedgman leino boulle latgale heizer cosmin rawkus bobbins enmore manz platten greenman ryles minutia supervalu nurtures bernards blockages atalante clansman shuns tamora kusama figlio hookworm existences ibaf chrystal tsf undesirables sneezes minahasa kelemen marica rotund reh mwr ductal valbuena conjurer barfly grossmont hochuli puta priyadarshini jingu sidewall animates rch heatherton dunstall shillelagh donis playset cheez titlist solferino singalong luckiest kossoff crespin willets sotiris shames akina hup bluray forsyte mcmullin ffv parfum gothamist mallaig zapping frauenkirche braswell copan stier dextrose whiteout troves calatayud sakina tissier pertinax rosada foxley neglectful purposive rifaat lakebed biotechnological broomhill freier gleaves priester chevrier textually rambunctious iccpr walle mvm bluenose sprightly reprogram stull asparuhov cwo ishaan overspill diabelli esco damselfish 
janae issachar drilon giardia centos nacac corll ritsuko stucky marivaux puncak debakey bernama discotheque caudill gpt meshing krishnam chucks solex dood massari hamadi aricia prestbury hillview audibly anthropoid ftw niqab wean rawling jbc misspell francona jammy subacute protist gkn sirt skoll fot kellar farinelli humdrum nobly clockmakers bupropion alders coproduction hazeltine lymphoblastic boger breslow ehealth itemized cappelletti olentangy guilderland kushboo agn kamu mult tesh zukerman foca methode vahdat brewton hailstone killy moretz burros pelagia lockridge llan szpilman chex mackenzies oth xylose farouq ocracoke nutritionally izabella parisot insan touchwiz donisthorpe freischütz reification poliakoff milanov bth rech kitahara skiba chasen granier rijal stereos ocmulgee rehberg finra limbed grins tomos tamasha greipel unceasing rescission presentational reffed bartering downdraft petersberg ghosting shatrughan weismann nightwatchman tonia dehaven irna fogs qun immunosuppression simkins abrasives peth turnberry polarizer siya fyne zeven desiccated lifehacker rajini juntos oury cytomegalovirus abertillery haves mockingbirds hepatica christiansburg brith grandjean flicked bettenhausen fedak connaughton lithographers fasci prakriti dilate aycliffe auda ragnvald shirelles nerissa isaksson redheads brumfield seemann nantahala sustrans asko leberecht ccds jining nrcs motwani vsm taille horsehair severson ressler mutilating cagan vandermeer bagatelles kennebunk otv langport huffpost rabbids caecilians akasha folland overdosing barclaycard hollowell yoro wouk jovana cassville microsecond makeovers argot berita hildy petronia niguel luda kotecha venturers korba decamp merope whigham particulier acela sudeikis burzum quietest khakassia scorcher chomp nuvo bronislaw alessandrini hiddleston neuendorf genuineness graal holabird manalapan softshell culbert kayin uart triveni uzès prieur bulwell ralls agenesis kaden clackmannan heraldo upstarts snedden akamatsu 
agreeableness bryony depletes realpolitik magnier commandeer malic oie landcare compensations yaniv nung marsyas maquoketa fruited chandana griddle beihai hofmannsthal gradisca eyelash accardo frayn mcalpin colcord misako courvoisier seismically dehner unrealistically reignite comely omr bolam practicalities jamelia crispo litigate sauve leadon kislovodsk propoganda weatherill unbanning unionize standley chump journaling messin failsafe dunlin aluf takapuna jagran henwood denizen reinsertion jhon urbi triangulated olov execs broadfoot arbat iif assiduous boulet cosh halston tailpiece gazipur xuchang tischler quy harjo storyboarded amann roser hairdo baoshan schmaltz onodera ingold crooning kentuckians ejections rambutan pembleton exemple mento resh pressly lynbrook muddied saji fierstein shortridge ysabel planetoid napoleone ladislas hillenbrand meus tamanrasset trialling gustin westray lienz viens nft jouy kunitz motivic mispronunciation gasping arsi picken copiously ajinomoto lfg serpico shediac pesto taeko taar moff ncube pulque restrains kittatinny khmelnitsky cheekbone globosa géricault dbd melodeon shankland bellemare angstrom dudbridge brownrigg inzamam partials horeb figureheads northlands condes kpm musselwhite coalbrookdale rutkowski veloce tallent arene lauretta proselytism agim davidii reincarnate arcas daal juristic jaana singa goins wannabes briskly hudd funicello dilawar expressible poinciana yaks payet toiled cannonade seiki levodopa platina vhdl beppo iguchi qar newsasia huizenga shak svengali depatie shurtleff pumpkinhead apace natas claremore cwi snecma wollaton teofilo mcmenamin busied bugey sympathised yurok leboeuf catty tambura reidsville kerrie asymmetries mcnairy gelasius winchcombe uccello checkup hesperides majordomo baris arrl ninette relaid umer tamilnet consumerist thekla armide daubert congratulation beaucoup dhafra corto giantess avo transcanada orphaning khanewal penetrations fuseli yearn kuczynski jordanaires kehr tahara gayoom 
nabf bynes divulging osteology gomti lofted jonquière toch woodrat taddei morath munnar serzh cartouches fomented mbabane saionji ureter raindrop grodin forager zio tobler holier exactness continence bjk murph johari mehsana tyreke dukinfield probiotic marleau tevye sealink kurowski appropriates josefine jitters eldr cornes millbrae desultory gimignano tripitaka zihuatanejo billowing berbera lyotard newburn sharecropper soltani puzzlement jasa herminia gladness cramm troodon chayanne calendrical meckiff muerta salamon riebeeck hunstanton poitras lodha nedney shadowlands corvina jurat maen birbal ravinder bimetallic marney peeking lovemaking ciliate monarchic amick michelotti tomek shesh cowritten vaibhav berling grabow flicka berberian derozan pompeian ricca vaso miffed leaven sangrur btk salchow aosdána taveras lather longstreth zogby daiwa zafra solum muhly horstmann mcmann kochanski kibera akaroa krynica bigeye radials iccf stargell cezary bumi tomica venancio keadilan viscoelastic beanpot onet thyra obregon waxwings vava abrasions fontenot montecatini yorkists seru learmonth szilard salou gunslingers whalebone ronco dansville blay misheard takayoshi sidecars gignac deanne lessees airto massasoit wellhead catalyse rosemead cergy xbrl vfc wasilewski counterarguments showrunners crackpots xanthus napper pendennis pupillary fluvanna madhesi timisoara chel frenchwoman newmont erratics severini mullica macari vágner cirilo neilston hollerith dualist collegehumor togashi ravidas clelia wacko garcinia concordant kft rgm mirwais podiatry bridgit morishima khujand mugham linhares equis ufw barla schrute hdc slights bayonetta mcilwraith noboa weaponized yoshii defecating urc tannic laidley tufail stettler faneuil chetumal marshalled ∆ electrocardiogram berchtold mammalogy kelpie murgatroyd spingarn zuluaga boven oligosaccharides kile moins crv fakenham tener convertibility ballas curr gulbrandsen toren hulett chuka hrazdan dessel tamarix budgerigar gendai transience 
brecher remigio sostenuto mccotter semih dubreuil ellul clavel selinger pomus overconfidence weka ishita pardoning pinstripe vigneron canetti outwash flatmate geertz fledglings streamlines fintech slavishly dámaso turville clasico peloponnesus centauro aion overexpressed gilwell kharlamov marstrand assemblymember demerged sexta mccaslin pestis husted cloke trumble feta motorhead krc amidar fendi manlio larbi dadt brucellosis undergarment valda shevchuk dexamethasone urethane uncluttered stapled nyah rpms achillea avaliable rinn farting reformulate nueve giti lamda bandwidths toughened musiques gainsboro diagrammatic nosebleed desafio exocrine joselito pilsener koivu nonaka haseltine shigenobu djm ganas resolvable imatinib chil katelyn orbicularis regla chaumière gensler unscom houellebecq balaj frenzel olf gurnard dandi kentville breckland piratical freudenstadt boehmer bellatrix lelia echevarría defrocked builth bayous wallechinsky boies noticiero eleuthera jingo disclaim maritain colourist ouessant sarn gargamel sangallo wyrd fildes lamento luken outstation womad windup maravilla pollio mazzoleni discoverable kitzmiller saintonge epitopes homesteader panzers krabbe uninstalled riffle srd rff bilo hernias swanley flickers motorman paraphilias champasak anthropos ajdabiya proofed neurosurgeons zamzam neurodegeneration crytek paddler yasu wurz avb mcenery brennaman alloyed cardano killjoy shahak bookends campagnolo minns osoyoos duino levay meteoroid sahin kretschmann zeena scarry convalescing baladi breadcrumbs carnwath earthing dingdong tinsukia sessa charro corrin heidler toastmaster mastro okon toreador grackle wetsuit weirdos thala isoroku biratnagar munari partakes itam tarlton antenne neurofibromatosis silenus istar soth cannibalized chalmette actuate swithin melcombe cnp rotenberg irrigable prepubescent hurrying cathepsin honoka pellow simonyi behzad bloomed kolya fujimura ibooks glatz alber puccio relaxant boxall upgradable charissa uele intertwine 
stormbringer dushku shiners tudeh oranjestad schonberg headways tup argyros oodles selzer lidl shikarpur broadstone daiko behnke horning giampiero abductees tropicbird gosset doggedly mmf medaglia scouse hsg yasmina wolfhounds forklifts willetts srk rikers piscis neurobiological overhill exotics jizan talamanca dappy stardock kyser alpinist undulations wacha kinkade fincantieri arevalo nelsons oncogenic juxtapositions kyodo millcreek stiffen manohla brehon girt ransack fiorenza tapley quora falah itd komsomolskaya dappled tsuruta euroscepticism reciter pneumococcal impugn ocn blackshear campari novin hazlett yahtzee creeley harbi bescot umpteenth ktv montazeri jocelin rothbart tailwind maros nardini galaga kelaniya downingtown arava cheatin mutters culham benzo barpeta ingall keays broncs walcot sundiata cattermole pingat cryopreservation abboud confiding blackdown kushi jobbers mehrauli cevallos auggie misch guestbook scientism berkel aarón morrone androgyny suppositions henslow obsessing spion hatakeyama outhouses bierman unpatriotic compas kraal zahl oogie citrine hoopes climie narodowy fuertes wotherspoon rezende munt tyrannosaur goll lynskey widdrington ritch quirinal eed alexie duse torvill cess zosia deactivating taunggyi klaudia bruisers cck floes possessors kazuaki nauseating soldiering vallone pronger strohm hortensia mbale worksheet kaspiysk despard trabuco mehl azharuddin observatorio lincolnton idh coiner drechsler vizard wollen stuntmen ahlquist juhan kinghorn kodai mabon zahara steadicam juhu ikan akko eberly papists hamdard meineke wiglaf javon plumlee emcees frowns caperton questioners capshaw mutilate sohag voltmeter vomits hasek kraig zel dugongs hubie bansi shanter lacaze phosphorescent gueye millon redlich begich parp zibo superstructures mulu gilg muzyka tetyana rhymer hinter cioffi salaf riaan caister goober baras orbe hilfe kouvola gansevoort cherney rameses doerner antonieta siring hosny kixx beadwork eurocontrol nesher skipwith beccaria 
samata scatological kie ripeness lowrider economía esler elitexc stringently madhubala pictogram evangelized kettlewell bermudo quoll dunderdale hii fmd grachev schadenfreude reme gluconate gunga islamisation wheatsheaf tronto snf xiaobo fiemme bodnar leisler deodoro finasteride preps wurster primeau hamley weihai frictionless persecutors whas willner miti gombrich dissensions heidt nibbles maccormick denarii canosa sorbitol cramond balki melman akizuki iacob decapitating greenlaw changde dongshan dallenbach nathalia bagshot bodyweight langeland sunitha wizzard personifies aravalli stayner ahidjo thermos saca henningsen kanban fonsi shii multidrug corradi cephas kitajima uhr reinet smersh bumiputera sunbelt polonius submillimeter clevenger mester tuxpan maden calibrating moradi tfi negativland delvaux shiratori dulac sfor saifullah rpe idr scotians oppen sincil garabedian sophomoric vbr heman brightwell fleishman renegotiation hamiltons westway ochilview schaerbeek fairhurst orchestrates copyists karki kipnis rumson seigneurie plainchant vivere nanu hankin nanhai jokey chancey garlin decriminalized arabization lepe krane olatunji poupée psychically nunavik afdb resurveyed andro qsl dvp zzap compson macfadden tuckwell polidori zamfara sju shisha paswan ringgit pavitra neste pierrette barbecued demis shug pericardium thermostats lithophane samay seeped raymer iin floppies manasquan architraves syntagma transfection saltaire farriss mclaurin petronella winckelmann metzner panchkula maikel wkbw balled laudanum tirades blc cascia outgrow aeron mahalaxmi goalpara martis aqi parwan subcamps clic icam racoon repeatability hutten memoire norgay strep cheh romesh spellbinder bedwell budha lambada checkmark tifton tinney smbg kearse environ sealants bettye hasselbaink uncharged cetina olah perishing jeepneys dahir sadducees gournay extenders annet inness iag dalmeny clamoring alizadeh dettingen globin jèrriais tarboro amicale nodosa comscore tica osteoblasts langland 
philosophes befallen sneering trémoille biti surmounts gellert bonafide ecovillage hmmwv ufs seule moltisanti unperturbed moinuddin artmedia miyahara fronteras poko deriding merola guenter immunized ngor gasherbrum typists avoidant dowel detmer nancarrow plutarco museology lopo premodern ringworm ixa lindauer gesher tsangpo platitudes sawfish sekhmet taye ramification misshapen informix brightcove easters urundi tpt missus ferenczi angrier bsr raich griquas bronwen barot myka alur histologic sangin precident cge barrault ferg ludford caffeinated yaacov pions marieke ketan prudently atlanticus hoaxing licentious jola salameh bohs arboleda skirmished carnelian elwha narmer acct crenata pleurotus illicitly newburg sarti yajima zetec dorrell bakes campione lims darvill freeride anmol udai unquiet faulks serafim righted hoz kotto juggalos flailing masonite abdirahman varkala joongang coldharbour nemi inada dianthus torturous pulford errr shorthair cuddles kumo espenson anthurium lipps taraxacum badghis dratch flintridge reccomend paedophiles beedle minch gue zoro homeopaths yining ovalle tct ballinderry illich superga onkel spearmint eppley immunisation parapsychologist minne bakary groome pantani moorei tevaram rateable matsuzaki boijmans pietist samachar gubbins siecle electroclash iria alwan impious grothe wimberly consecrating bromma eglwys klin antisemites panamint parasitized rijswijk lakefield matangi drom touquet sublingual charice outperforming szell hakon henlopen micelles kisch liev cappy foltz harkleroad kahlil cathryn lubricate threnody bearsville tuthill iou spasmodic nietzschean hookup barthe prathap fantasias inflamatory minhaj barroom epiphyte marksville organophosphate sundby lydd kindertransport incommunicado sportsplex huguette ikuta lachance salvatori peterkin comair mco connersville slavonian trova materializes curiousity rla clf sabathia nahant glowworm tooled ethelred sandvik borwick dears mals tarascon larmor hodel gearhart klong klebold 
damascene amphitheaters malarkey mimed smouldering umstead bearman bulrush wigg denniston foos weibel govardhan vacuums kokanee rnp kensico pratima curassow corpuscles portables hedayat jasminum giraldi somer waistband corcovado chansonnier ersan jaymes uvs insufferable bevington bages klar aliyu bestival bloodhounds mercifully liftback striked stolypin takasago effectual louvred shandon megalomaniac hutus teleology hammy tehelka brazenly suess bennelong masten steiermark nasiriyah ligia quantock matildas embellishing uffington clicker chira neetu quartett issam dinozzo ecurie clued nederlander makonnen osterley wieczorek sisa eggshells fritters takla kumeyaay millhouse vasconcellos selvin plasterer mcfeely molesters wickenheiser bandmembers hirado laurents bodysuit cbssports dumaresq aronian volitional kinch clar jica groundnuts gasparilla balaram habanero galati colloquialisms fandorin gelugpa lyse methicillin wier siphoned genderqueer voilà madchester birdseye obstinacy sprawled weightman gouges commonweal apsa tereshchenko wolski nuvolari garis ikot sundew lalli broomsticks grahams awp wrangle whitelisting godhra justino effacing hoyts massimino oedema antheil sulejman tangi thermidor topal orbcomm volkskrant berthoud velden koryttseva eggar bartholdy qana topley sandbars pratama ostmark mprp madaba chahal demonize guiteau kathrine blissett subordinating sammon adamou leblond sisowath dorey groby vigneault novarro alsager coba grandville azarbaijan scrapper hartshorn krämer chéreau sugarfoot integrins keech inverkeithing titchfield rossman martos headhunting frontbench deregistered ectoplasm reassessing harems harten profaci arvydas fabris dollywood banega debutantes yildirim quilty tzur breastworks giverny arrestor decapitate rossouw strategical viertel psychodrama troitsky uladzimir kamphaeng ortmann ainley transmittance hyperdrive mckibbin ponchielli czolgosz stablemates sindi adresses lancey bagno lutetia solemnis wbal schiffman reflexed bundesbank jassi 
sni churchwarden rievaulx podesta eade kers boatwright massifs pushback colet portofino summerhouse existe triggs figg facepalm traven roelofs heiau gooey enger illmatic zhangjiakou adwords terephthalate partes sinclar dragonforce gravitating antihypertensive amerie shivalik kotter repents vassiliev beaufighters nastassja molins liberality aizen sondre nymphenburg fauves mcmorris koharu youkilis putz ashmead westergaard abulafia afrocentric hokum crematogaster haluk handcuff rambaldi mediumwave morlock pdd guen güell bohan peaceably seika payed tages kogen larkham grayskull gezi arkan lamé coldcut mindscape gorden settembre lillis bergamot marchisio mcilvaine colberg kushtia aquilino peco spirulina buffalos fosco viols ferrarese ragweed loudermilk jik bwp grazers jogo deri kachina corbières dicke mcginness flybe syke halogenated lawgiver breit cuss vesti defilement tinkers hornbeck airshows knowlege segreto gittings penha kelham dicom bathymetric llanrwst kutner hughley almo theodolite flavell diwa serpukhov zante ilmenite reprocessed hughson aptn christies cappello bienvenue stourport nipsey nein scalding orania reenactors akkerman malleson amchitka withlacoochee danke scottsbluff bellisario janda strongpoints demure ebru sagem nonconforming sml offertory biaggi climes avett crawdaddy externality witbank suraiya naturwissenschaften kohs critchlow grammatica kishin catoosa polluters éclair bronzino unequalled claymation poppers reavers marroquín demopolis mianyang plumtree orlan castelfranco oros prindle hassles florescu knoblauch bastl aengus weah kelmscott oozes anqing baling expectancies tercer shuttlesworth manzi ulva wench europ hilder basicly bolitho rinat doublets deride drakkar yreka scid ishizuka mapmaker tagliabue tognazzi mli palamu nadkarni remarque anyplace hyborian tribbles trundle louvers silloth technocrat bernardus brüder beautified stram petrine sips armond whiteway cinematics steuer carper maso hia paraclete reshot boselli realnetworks stim noce 
faceplate hatchbacks chamberlains bartowski shahzada madelyne evolutionist pondok kogo laskin schlechter saravia deers angerer mazzone homans gfx etu liveliness modbury mcgahee mowed chaput nonexistant contractility villarrica outsmart aquin congregating shakeup sippel lagonda paramahansa engulfs bilberry gripper sammie skansen leatherneck yawl sason poso disembarkation marchena parkhill sacredness tejeda redemptorists premji vecht leonia lourens bashkirs gratuity styler stagflation autoantibodies abderrahmane gadwall detests wegmann ksr tabernacles thubten indolent esmaili zalm generalise lyfe colclough idées megalomania gremelmayr hitlers pistachios loblolly mclarens petterson cheruiyot haematology saima wordiness aikens iee hoobastank disdainful arteriosclerosis panahi igboland healings estée hygienist homma avendaño bnt khs basma bazalgette bulba comper kadai unimpeachable dbp kiichi glf granulocyte sheeted sleeker doetinchem scrutinise reissuing salutations burian grigoriy exalt reciprocally publicis retooling colchagua miroslava minidoka wiske kingussie caskey blenheims narsingh shutdowns sidonie thio yoshihide deputised jillette handke tenaciously deary giambi porthole bouin seiken befuddle acrid judds sowa nogi armd maricel maidana irm saperstein gevorg microchips servlet wason conteh docx janatha uco titanate sanuki agrégation porton lovelorn pickaxe faridkot caplin decease ronchi indrajit flixton bergdorf kautilya selección sympathizes linesmen plainsong batey aver peñas streptococcal ilb dacko retractions mayerling singhania tarana troels garang shaef ardis laflamme welz pilsudski yakin serological benni needling ophüls quieted bort przewalski hailwood pépin glasson mohair cockles winckler lettow louisiane arng shaku lucidum dalea nanook boardwalks artform forgan gameiro dunnett lpf steg diamantidis salutary pipped mowag blueline pkb edmontosaurus girouard tournier cunnilingus krugersdorp crowing skarbek mumbling geoengineering samuele liquified kaling 
troponin henao yiyang visors triumf viersen gerusalemme netter piute palk nougat scaramanga sesil mcchrystal ramel smokestacks psps ackworth logit pujari invergordon elmet terracina failover chena yolngu prêt photographically bandshell popularisation whipper yoshiro blackball ajar ufp zarb stadthalle taneyev grünbaum grigoris rots bruk phin reattached datalink parseghian villarroel ramadhan lèse sequoias ewer morinaga rsx whitewood annoyances soloviev wootten governesses multifocal chachapoyas eave delavan sloppiness legitimised maisel svenson nais nicked boyett cyclorama gabino undressing sobs avey luzhou santería peeves anzani rhoden acra duffus yossarian southbury saam cuoco eilers reenacted superintendency guma charango treasonable latu mitsu achenbach tahan ryoji verkerk reproductively appliqué musicus obihiro meen shalva rtt doiron probyn indignity pricey chernenko audios bransfield isna tetbury mischaracterized thyssenkrupp mcindoe mimmo sarov hemorrhaging gerrold kestner fightback roff neurodevelopmental psac arques girdled guis addax defers bridgton breitner reoriented bybee extirpation inl potti tudur resumé gunda ladbrokes smarties classicists fantastica bostan rondon alexisonfire danity riyaz baggs cresting brandywell chrysanthemums suspender weirton sibugay ancoats vertov corno wbai saltworks kratz nccc sparred spencers winfred mykiss bilko absecon pinero heybridge evolutionists jse wakely timken speeder blacking thorley patco söderström molay crafter migrans avchd weddle mcn balas aliant bellenden lasser canario wmds canoas gva whymper ascites touma lovins skullcap bodyslamming wao platanus ilderton procumbens wuling shoppes verifications paulos rantoul jaitley gloversville radim fossati tupperware cecco florez squirts unicast brainwave nazari kilbourn kebabs tanguay homemakers leifer microalgae kubla xiaoli energizer spaeth wiedersehen owosso somatostatin rtn holidaying rucksack katherina adg boysen thoresby iihs tomorrows bangash najma kroos 
disbands voegelin higueras maestoso secularised maresca yoshitsugu polesitter duart lambo simbel finborough epidaurus balen zir vaganova fordson hotdog brezovica cooperations mirsad abashiri perishes runnerup sékou lichte fixx genn biedermeier fretilin ituc cordifolia magos coahoma fluoro vexing knitters bowater stata huichol absaroka preachy discredits mustonen mathematic peripheries holsinger kemba edexcel cpas pinon casque uefi edes aurita gogarty obst purism cfn hellerup wardley ridolfi soldiery inflect hkd mckiernan hoan dzhokhar witi eal chilo suat cesarean breconshire transpire loughner makerfield labelmate sylvio philadelphian bangaru purgatorio chamblee ganado lusophone procreate calor cantera naki segued vess spen maltz ernestina heckle riveros inchicore arteriovenous francisville pearcy misstatement fsr mintzer quercy avison wbt krikor larco yougov sokolac schmalz foulke derogation trampolines orangey mutis sanchai ekranas hospitalisation kalinowski aldeia baynard slavica nfr placers optoelectronics greaser farewells daina quigg navalny rumiko kouba lapponica octo clumped dynamited murrelet tehrani crosscut biala retell comradeship icicles archy bronc marcoux alstyne regale canelo bili shaddai phenytoin ethnobotany emon sexed kozma kasbah westerhof danyang muckraking divinyls hermitages betterton salafis byles sgm pten kamaishi consoling bondevik lavant jorginho zahoor kayleigh windass pileated cobar irritants demagogue ragazzi diaphragmatic leyburn cww raymund firebase dayananda whittall pierres speakman transgressed haggar aluko puttin sportsmanlike lymon katchi lemaitre iosco valedictory keflavik skiddaw magloire ulb prineville reynell mainzer withholds diplomate greasemonkey miscreants lieux jayaraman youssouf scoffed shoebox ngn hazar feiffer unhrc webdav nygren algren haymaker arwen agencia gledhill jigawa jaiswal durkee southdown yavin vrij dietitian dalyell tylenol casterton rhps anb sexto stereotactic elastomers liles mzm samity kotz ragamuffin 
revelle tini lunges reconsiders princedom tooker gimmicky monreale oatman refiners cooperage emigres huk jägermeister rehabilitations lehmer kubra iller manera reframing clorinda mouret phetchabun lastman gering hobbled bicyclist ghaznavi sigal fuhrmann nace padiham lavan frattini hoskyns hungária fahr computerization coffer appearence stribling chaikin martynas hesham sard leibnitz abhaya ferriday interdepartmental bettor campesino bishara socialisme contortionist limoux teesta vola okereke owais beltrame mocenigo interspecies wayzata setsuko cycleways modifieds rinko blackening buskirk impe repudiating labine chillers atreus hagop blued jaffray canadas dumbartonshire nsaid aken visp cholla stallard tuominen nonet bellick persuasively macdermott pantelis kela mahbub bilstein korkmaz keesler menas yorn skerritt tuckey ganden loansharking nandrolone sahra choquette bibel ahve dayana evliya bussed interferometers rosenkranz namaqualand atropos moncayo carbonation harkat kippenberger stranahan jfa venturebeat stob transoms qaa voortrekkers vernaculars nicaraguans monticelli xjr canas deucalion rappin berzerk hinchinbrook peterhof gaven ezln lamprecht kidjo hedi cumbre kavir drawl severny leavening boho zoologische sidibe soldaten intertitle storerooms jdbc bbe mwanawasa bondurant peramuna tuite dallam cft vapours cordner dli ethnics fugu reactivating phragmites rifat bowmen kulikov inara custodio azione akshaya hatchett shead clonakilty matlin tapeworms peyer thring hamma caylee benyamin queercore allamakee branscombe teifi wobblies bather siciliani kuwabara koester dimeo kudus klugh magor kaltenbrunner vandel venkatraman orderlies fuka abided hoak refills ceramide gregan smither obliquity levator krister leukotriene heightens drost antitumor sonography aktuell moreh zoho tadayoshi dola kakapo taymor xanga crêpe noss bronckhorst jiffy cretin tsongas kaline neneh valtteri moneyball balancer kosinski malki lyneham buzzsaw corazones hollaback newsround chatterji kirkaldy 
thorgerson squandering subha ipos clarinettist monetize zalewski kullervo altschul pudendal opportunistically charbonnier symbolical biss barbro shirky meteoritics farrant brosse seliger wiene hukou brading dieckmann stoking onlooker torques bue furuta reponse fnla tolmin leamy matapan ornithischia emploi cras decodes monaca kenesaw buckton vijayaraghavan zhanjiang zaharia fissured ginde gillow piera lateen cfda hameln bungling dismasted tpd rufford snowpack friedhelm dogfighting sierre tebaldi kennicott mahendran eram eban hdpe tuvia teviot neilsen lumbermen temenggong majesco mattachine zaharias modugno kulaks majumder diskin rebecchi linyi extrasensory diskette kasserine ysaÿe reproaches cosponsors vanderjagt hlaing ashcan pharmacokinetic aiyar unsuitability kemsley cardy shibli curcio lince ismaël mulvaney ascender opn tulipa culhane anterograde martlesham fontvieille mayank relinquishes eithne saharsa laterals flavorful questing keizai vmro tiangong domestics turbonegro inveraray campeau compagno panoply yol loneliest prostration rothbury shrieks scabs sgd douthit llanidloes stonor concomitantly mirco loth testamentary gladiolus strauch bribie sorna kaempfert pawlak quadrupeds isleta masser unction volar macleods follis esalen viognier interruptus ruk evolutionism bulova frit irregardless perham senckenberg imbecile kost spir tuffs hobos chimeras fassett kast toshihiro kolarov validus ellas quale gordana ijmuiden puru danseuse ottaviani arioso landslip khoa iolani globules blackmar atascadero mccurtain fuxing gams uruzgan lubna downshire cracroft margaritaville pecora yos shailesh realigning iio umbels highmore deliverable bunraku rodrick immanence borrelli wtop carbureted privé miscalculated recriminations campylobacter dripped zhob garia lakhan meningococcal weingartner byword westhoughton stevensville barong petzold educ svb apfa acushnet shinbun singularis shaeffer nce bopara odder eugenicist opryland bni azuki jajpur nationalbibliothek aleck rmm rubí 
pawlowski lazarenko catv shoichi coalinga endara mychal maskhadov keris zera elastomer broxburn bradt takemoto reschedule kolesnikov outmaneuvered falchuk fariz brenes newcome malkmus pavey kuncoro raintree litten hustings charleton behinds statesmanship pervaiz prioritizes biocontrol patek mattapan betas sajjan feelers apk morbidelli suppport seines arinc mulsanne triumphing arijit westcountry haslingden goanna ezell makhmalbaf cassiar hersholt programmatically fiche beckoning siple riou kulp demint pjak ditchling huat brumley lutfi mcdill jörn katzenbach laminin anuar suni beaujeu trespasser pisi sherinian hierapolis hermiston ichthyosis ryoo phai kwee busoga boksburg cordiale oceano greinke facetiously henriot rovio filibusters nurseryman udhampur rens metascore scarecrows sayyad grimley preseli seweryn receivables bonga magnesite toninho sadia bws wla ethnical atalaya garp rohani prickles sadko tnk mossadegh geissler nja trussell twaddle pagadian koreas corm skousen swipes materialization stanway kamakshi econometrica keratinocytes abhorrence preinstalled loper cernat lave martenot elderberry delfina lovie diabase quirin dobby lexicons cousens kleber bartos editorialized shinjiro jaret isaías hilburn frilled jochem patekar bodyshell cosco okami esca chalfant eystein slaveholder kayal kriens lotions hungaroring mattes microblogging zinnemann holwell slk kuusamo dorothee dahlan veiling colocation kruskal pnu matriculate kimsey hartberg findlater biophilia tedd hoopla khotang periodontitis feedstocks induct ferre controllability fayez multiscale cruce hamadryas mattea composited saudia demonization broadness lesa meca mirfield maderna jum humorless mapei technorati amil parian gdl enmeshed mitja slutsky fastlane totoro llanberis blaustein bwc tinos piñeiro railroaders charitably ehren araba spotlighting nasm barrino bojonegoro homerun jambalaya poyet otterburn quintuplets wino bolliger wheezing mythe tallahatchie mariela groener cangas drp scheler nithsdale dasan 
busboy rejuvenating sspx natsuko delectable carnforth cockrum jowitt captiva ytterbium abhor naroda klepp eastchester henslowe stalinists cele hoffs alexy bich robocup hypnos tokaj soderberg vivar twardowski estep laplante chambering nursemaid aticle fleiss garces porlock reinstates brucella chinnaswamy gasset hornos bagheri gip ontiveros piltdown caner tharoor sanneh beiteinu vividness skilton pilas bdt syah matveyev cressey curtailment sollentuna hominis surefire sulcata astrium bitmaps rejoices monoceros dysphagia irresistibly bruggen heini domingues zhenhai rodi enablers whishaw hrp meandered harrer hackenberg treng mbd ramamoorthy hoku civets biosystems bushing interpolates pricks underfoot aldergrove desalvo platers eraserheads lafourcade slob terzi shutt evandro oarsman lactase serendipitous rwb ridgeland turka ingleton terschelling prancing apayao asco elna saddlers thaïs summitt anping impertinent kolok fanged nto dominators defacement pagán mattheus idolatrous zlin coronas overeating messageboard panjshir tase reassembling mbit aldhelm crevasses racecourses qalandar stiers ingoldsby canonic hutto nyland preda sicko depersonalization byo willunga dynamometer thayil abuna poyser hattrick ribald zinder bandoneon zipf frf asheboro testy orifices brandan decriminalisation cityhood osma itai fuyang heliocentrism decrement rissa yeovilton moneymaker locksley luddite capac beeline unobservable offshoring foretell capi coenraad mouskouri noack korban lynes rockslide bretschneider basilique forbs bracha afscme fishmongers azi grane connivance abiodun autocomplete watermen cerca brug heere liaoyang gwenn kume pantelleria netbeans bessette bishnu fdb squillari btt trueblood stanislavsky hänsel pichichi gaylor ellman orinda freewheel midriff reinterpreting clery lionfish woolford marjoribanks larrañaga enunciation harakat strugatsky wiltz bebington dahab factfile karisma klien millom clearness ottesen biren cushioning wakeboarding tryggvason perrotta frahm mols 
mystifying katich zall vigevano benzie gaura mabinogi scythes lasha lamborn kingi haymakers zindabad griner luisi gilliard stivers stingless blowup celbridge maik scaredy miga exhorts traquair mcmenemy perfecta gugu nefertari aicc musker indolence saturnian florentines adelphia fronton fhl creem newschannel isen radishes mandvi jisc lithographed mynach gurun rondane regino ultrafast davidge tlb igra eyüp shalem torne caldo dietetic kazantzakis albertosaurus ladyland goyette abish blacktip sangita masterminding zhirinovsky agriculturists bagni relix devoy hoverflies medwin areola glint annualized semiramide keynotes forestalled nbk wroe bezuidenhout dula mcgibbon bonneau squalls kornberg lowder nissa koenigsegg bicton bonamassa tardiness marquesa sagitta dahlin mizu freebird foye blowtorch vsl sixt dunums charmingly ridsdale kangchenjunga carbamate carragher redacting abh nautica tinting timebomb tomoka reif exclamations kinghorne rhames sumlin accentuation gritstone khalifah bendy yagura zakho haruhiko stanshall mcclory understudied banerji marianus shinshu sukhumvit marmaris kabbah akhilesh apod pihl marineland underclassmen stt caldicot nasar strandberg baumholder tgif manobo dvm bleddyn capsular kintail rdna tinamous confidants petrovski plaited offord counterweights ugandans ibook sils chava ngoma shorncliffe siletz floaters stupak posthuman laryngitis carryover agila scheck rahl waypoints sandpoint amv basher whitefriars recirculating feis untrusted lausitz kumho instigates bensalem kimbra oilseeds harrower dodder tomioka duden zula lechuck leder souffle deshannon vulcano valcour grotesquely pinglu phillipa springburn perversity rilling shebang autechre debarred newsham oai holbeck anglerfish gandolfi sacristan geraci lollywood nazeer godwits loog givin gabbar cheick minu fabulously labov gweru disruptiveness belzec vannin tucumcari daim sieglinde shinkai kaley vukovich shawmut fica hydroponic influencer pindus cascina kübler tuckett ercan nonacademic mastitis 
manfredo mayte adani boumediene wakasa mml marabout stefánsson elsey betrayals plinths hurstpierpoint starnes torrentfreak blokes pnt synthetics eyjafjallajökull gamay screeds fatuous benmont jackhammer wellbore galvanizing kahal batterie ocellated redesdale forssell ayotte peeks pierrepoint diggin azz montoro exculpatory lallemand milkshakes phagocytes buggles caruthers daito strack malva acsi blo cvm niculae sabyasachi marlette sterry hügel simonon upgradeable träume bicolored traxx mcelwee moosehead peaty bellhop lukaku coccyx praag horbury snipped studdard roussos jantzen ctn pedernales yuzo changzhi rutles rbcs rootless americanos smil ursprung disha telmex magennis yunupingu farhi collisional barneys lami cudgel troggs syzygy gamblin frithjof girdles chup disincentive veach adventitious hoven buddle churchward horr reichsmarks laas sportscars scarabs nontoxic cll ramamurthy jale pierrefonds grupera skeat recurrences waterholes uveitis falzon dronfield subsidiarity tonny digitata unpredictably cimbalom anglophile kabbalists bukovsky anesthesiologists ghostwritten harby mapuches instants supergravity macnicol interline ngl beauséjour dedman aias bomis marciana prelim savak commercialised abbotsbury tunga flybys venkataraman hows microlensing perrysburg blech throught amuses winkleman ubb gamedaily totowa loto hanni ordon eubie redington norns kinesin wroughton nostromo strad posad divac nazmi matus orio sarit aviemore volkmar burpee edx quiere pku santas chelios zuloaga manja verfassungsschutz foremen fux valvular gidon muskrats ktp alleviates truelove interconference khalq lakhdar mures gynecologic yedioth unpacking gorbals erhan umana wulfhere bonum bearable zinger woad berna hosa foamy jobbing ensnared sheremetyevo reconnoiter whitehorn wht salmiya binny currants arten yasuyuki suka eavesdrop jina tomcats mvt circumnavigating barcarolle murli mccutchen butera vodkas mangotsfield prenuptial wurtz leftfield scramjet hiroto crumple photomontage smail 
cheektowaga officious kishen muckle ringsted riess dimasa shikhar sidelining remapping karima kozluk macewen stygian boons belch coas batres yuca pahrump inver luganda taplin concocts oles budimir plasminogen damu epigraphs pismo duncannon kristel cherif wira htp balmont thiocyanate shunji terrorizes ficken pjs borotra axworthy tortona crosswinds raiatea tunnell coatzacoalcos cristofori clarington kuranda anemometer minersville agyeman countback lacanian braley metonymy gedling uygur neurogenic karcher duduk drt thisbe anspach windshear sould hovell valliant saeid bredon extremly obstructs bayport fowke meglio immagine loveline murrays erv servos makhlouf mahato streptomycin mcmeel fairmile isengard kapor hellstrom periscopes clavijo mindstorms catherina chappie tsoi maunganui travailleurs oakeley sveinsson hyperspectral evince spofford shortt gebre fonterra dosimetry wiliam rathenau kalkaska barg nccaa aelita wqxr liveried colbeck borsellino liping steig grandstanding kaare jaren hnd ifni delacour stagnating maff boyan terenure komachi constanta coralline cantante nwda takata benedicta swt corde saiki blockheads hamadani compulsively debilitated pnd bichir ramnath espnews royden pasties ekg wasif atcham lij murtala sarton faustin kasturba krawczyk catecholamines giovinco betters beltsville exasperating hamidullah cipa cait lario bullish bele emeril reimers outlays amoebae wholesaling hyndburn navis oversights takuro stonechat wynd dhirubhai katinka asprilla schneller omap vandermark elektron plusieurs remembrancer papadimitriou swabs hta cintrón shvetsov kamelot seraglio vima kibo congenita holker shunts gilbreth powel stok occlusal loango reinier superdelegates bigler dollfuss allais constructionist binyon yawata bonnington goslin welshmen karis winterset kosei armbruster cyberman uighurs shoos manoeuvred drozdov eszter madia romola hammar gourami rosaline indicum yng pista retouch synched hailu sarracenia brautigan fmedsci kozelek bayo margriet currumbin pne 
hege darknet brahmananda avowedly papunya byatt stovepipe roba cedros fmg wmca ejiofor cellulitis narvaez markovic englisch kokrajhar bumgarner detestable lamotta sciencedaily kalari enchant youm kuriyama tootie vassall decoud readymade asadi damaris nadav danas embouchure doune teus masseuse horseless chiswell kaija bicornis metropolises gutteridge niihau chromaticism uth chouf agenor avium schreber outgassing carpetbaggers postmarks heimlich insectivore plebiscites roundheads hallinan administrate jedwabne oken vilsack battye ³ oppinion greenside achan hannett nawalparasi tingting dourif dlx legislatively nagraj gildo lagasse gedney nordmark gause selmon reoccupation dalgety gna foreclose favorita borohydride circuited karekin portly barrancas couse keyarena thalys cupping minbar safia floria fortner perennis halm doze crummy kaurna recieving oughton coattails venuti kanshi kuhlman pharmacologists raley prachi cachorro kete fishmonger juster relaible hsueh shimo olms weinbaum shirebrook lascaux busybody mutilations wonkery jorg momus ludhianvi brokerages bodhran granat sasktel mascherano mellie guesthouses prosocial spokesmodel pulcinella warlow perused garowe salgueiro richert kipner hilali fumo szekely maquette katmai wmap barataria proffitt starker divs birthplaces hannemann ashville dhoti broking nordhoff temnospondyl hengyang warbirds skolars diverticulum thielemans chenault kps kalos nettuno personals polyvalent ehrenfeld rhumba ranuccio victimised hoppin hackle yellowlegs cuscus arnprior patrimonial wharfs sistemas kazaa copal danila wuorinen bellbird riek jeroboam carfax debenture beckoned povey ygnacio pucks ioannidis multivariable habbaniya krofft aircel igual palam pinup duka siegal seme chanticleers emancipate ferriss minin ajoy simak shantz ditlev brémond xxxxxx gothia lampoons takhar measly ratsiraka paha wets famu aquaticus gilfillan enliven slbm bessin outhwaite proprioception kinner rasim toshima tsuruga surmount hogenkamp ducting saddens toilers 
caped redactor karski aph pht moone cyanogen tucuman martaban xenos ldv preprocessing cartilages lathom withdrawl tabori tamburini héloïse eknath saward beara amanuensis quinctius berney kernaghan avait aylesworth urologic hardtack curvatures doctrinaire kumkum iulian likert greatschools tishman perrott tambay essec varg mmb beeby bitte inpatients modafinil keila ultimus majnun kubik maras yoshitoshi infinitesimally fatt pakula scarpelli cartland disbursements pouched niblo basshunter bosra triplicate lammer pêche juab sonchat biru olio gogoi shareholdings herberger mmda cesario trx pensioned standaard kpix anastasi uglier southcott kelani reformism papist chernoff glottis manaf tveit rainger gerb flatworm batasuna photodiode deberry fmi supertall wainuiomata ceccarelli hydrofoils downloader nerio revolutionist buner spacings subchapter andoni respublika antimafia dollis planalto demas arnel carlucci kiyosaki eynsham procopio kidal ferrall dogpatch yancheng hawthornden nydia osmo outgrowths dowlais heward snowplow jenney breizh papules middlebrooks debus magnetics szigeti matchett mutineer thighed arizmendi patchett pery forsee mullett huot pathetically patchen brize gouri cinzia ponferrada perforce vivarium sunniest sonepur lawa hauppauge correctable cani shal zags kumaraswamy wttw alevis disowns glissando bokhari cuf sziget emperador attanasio krige tulse knx brienz purifier alacrity genotyping blab welsch woodcote zubaydah topmodel sulman lucchesi hisamitsu bbg stanwood ardian jacquin tarif lecuona butoh mlr jordans heydt immunologic ahistorical eremenko mccone genetta overreach mahalingam dsu dutoit silbermann cornets gorontalo shiroi statutorily tybalt solarium ragtag georgiadis gulam harmonicas rushcliffe michaëlle particularily siavash exploratorium sartorial vicariously trended nbt eldin elevens tormentors kuga stoma threadbare venturer myke maturana claverack offtopic ibp mmg wadding chingy bradlaugh schrodinger eleanore plaice pensées houari abegg 
gallerist circolo witchhunt adeel conseiller evaders ahumada villeroy cornetist filipacchi mittler kiyo mansbridge fagundes leonhart altro neccesary graving okello cuckfield addario sharath parrotbill calibur madding speedie tindouf santilli convocations snowfield whare eliason witless mcdavid eyeless ramdev bergamasco aql barberi baiser avenel lona faik bamboozle kebede nutritive wingnut spilsby gunfights mellowed suor hauk vegard rightward kapilvastu nogami penalizing juif histopathology kameez hider untainted sartor nereid gddr masina pannell xis figueras decoction kolstad emek kukri pedrito tangy corundum myddelton furrowed hartig newsline samuelsen unrecoverable aboyne bashevis laneway blak jinjiang jonna babyshambles balustraded doens winge surgut haneke gordonstoun icemen caracciola rmn grilli blakley cyclingnews abodes kangri mufulira cutesy henrys oistrakh jodo lightbox bullfights sealab shannen bonsu gnb gallucci hookups lango unocal pomar montañez surjit niente burrs praya sharyn leishmania riverstone thunderhead hostos savinykh lothario swettenham lsv micronutrients bdi oxana kapi earthman sarfaraz ranil mitrokhin apco akgul sweety bwlch dongfang rosman wenman leeton ruapehu bonucci petronio jagjaguwar área zipra luxton invisibly boxford groep suspensory sunnybank inquisitions turkington bosaso citadelle dingli halevy langi kinzie menasha myfanwy coronets zenker uncirculated wachowskis arcelormittal bozell hri bukka dehumanization delanoë aof shimabukuro ironies martirosyan prophethood rösler destructively bartholdi sabar merkava wynnewood haid brayden taufel marchesa raki amstetten tiptoe lunaire cabanas machakos eyadéma palmeiro pevensie marciniak manteo fondue perversions cornford childhoods turnoff labute eastville tarasova drozd dramatis durum shadyside gormenghast auroras kalil ailesbury phidias kalinić redundantly gambela cauda povo crèche shrivastava osservatore fortuitously gatesville inishowen veltins couped szymanski mmd yamoussoukro 
westerham laupheim evernham delpy warrenpoint kamina aldana sbe shallot herbalism indesign affan krishnas riesch dowries piatti rawley muret adwa daar invesco othe mallinson pulldown moesha chaurasia petree coleford pantheons jonnie trochanter aromatherapy sproles glatt quadrillion chiellini rosaura guti hualapai guston obsess bowerbird santry norquist xilai codey improbably francoeur urf prefigured polizzi kilmartin mitropoulos steinmann torstein axeman csh loveliest kahle malegaon tyc hypogonadism bide beker couzens maemo leofric onan sholes kindhearted shango collingswood ferrone alcina neuroplasticity mahru hyperventilation yushan plagarism akela hirschman willen fanatically automator yeom biljana kini aggrandizement sacasa efd hougaard halite nerina ballades queluz opi hawtin chandrababu freida briz borromini badness barolo milverton mckennitt rearmost shain penciling huli unashamedly binchy allusive gaffigan luteum muso dramatizing chihara schoellkopf saphir zanussi fringilla lutsenko warding remedying arteriosus golay sloppily scandia mccreight kensei urfa peseta plummets cydonia fugal riese kubu devis yoshihito schip mahanta lyonnaise kavkaz soloed toka roesler salvor zoffany djoser inducements chukar fitzgeralds bkv poot vsd mildest mullahs snobs anticoagulants persevering hbos unglazed rachana underestimates mexia tevin marinov encke annealed bedel cranham hyam principio jolanta dipak mayak meilen keqiang gastineau milligram magico medtronic ebonyi sorcha graffito verhofstadt frasca kaido lanfranco dolma archerfield dennys borie brettingham wkrg csepel constricting equitation jafri nozawa samana strutter beardless unbilled fullmer mainstreet shigeki blackcurrant wuwei brailsford tonneau dogan angelle duncanville keytar montanari génération shibboleth scaramouche eggleton cht wiscasset barbarity bhati yal superintended daishi teruo ilog naivete leaper lilyana nooks horii urlacher parkash detaches cavaradossi spex bandundu caris danieli infringers midpoints 
jauhar shlomi umwelt arsenije etan maguey ghauri tietjens hirobumi misano ndongo pancholi crg gurls corsaro jinsha ohata storytime christinna melchett bankim antiarrhythmic nips azriel cyclophosphamide shelah eleftheria jagr nelsonville combustor shuya arbore catchweight balor zdenko slanders agan cambium toshimitsu vonda godowsky xianfeng enac géant falangist lanl soochow inverell physeter spang argentinians propyl comps photocopier sancerre rockfall wtae mallam dahlen deansgate paves fxx gusher tiaan charism alula rossmore squarish sallies kfw fowle demonized aegypti cortijo shallowness romualdez arachnoid pocketful roadworks europeo ammu hetton coleby stricture sublet viraj maino morgellons befuddled twilley qods brutes counterfeits neun gasa lysaght chinh woffinden spx noires comencini heyworth oreja kyl dami barve threlfall kanyon benedetta adiabene scarps extremo rainstorms kanade siping foresta condesa inas thau unapologetically isopods rennert curios gauck cunnington corleonesi colonising verbier hutterite razorlight pangolins unsprung beget kallon kroenke bustards brynn repetitiveness shuker dunga caselli europol tirado chicxulub bridgeville acar debauched cheerfulness seigneurial peeress occurences eltz resubmission teno delors eprom chartists deliverer ilf wendl malinche wildcatters watercolourist kahlan aftertaste denialists goalpost shahryar coochie rathi disappoints megathrust similes daisley kaew galia formalise kaat sial heists ffp kinser ryoichi caretta shtick willebrand megillah dielectrics olvido tombaugh impregnating wilting sokolow mehrdad veii freon rhyd vimukthi hyraxes statto recompression unconditioned birders killdeer jpegs manthan lcf ovando translatable birgisson compuware backe southold norvegicus protagoras kidwelly wristwatches boehringer marinescu doughboys lef matchlock counsell brahmanbaria cik kyte asprey mancunian johncock andry athen helman edam yongkang miandad mcgillis juge dikshit widodo sugarhill rebello monohull prokuplje 
qnx kronborg portages harrasment peplum interposed salone pask ethier kingstone dring prizewinner wrottesley reconnoitre revving tolbooth sene waisted valdivieso ignacia mentawai mansilla chapeltown nikolskoye brainless floras senegambia blockhead lyndsay menn borkum malmstrom envenomation wheatgrass encrusting senda anitha rakers helly mahua hajer schistosoma metabolically cojocaru prospected pinpointing histrionics beseech subjugating fadiman gaud gorsedd torsos saade coppers evic chainmail mirta tiree lant schlemmer aaw blinks alik dalin baggett wui wenceslao beavan macrinus carreg pavers immunoglobulins grrr auchincloss dorma wringing outranked lege gillmor tharpe nassif betancur warhurst fingertip hilli pragati digweed trego picacho tablecloth juliá syncline laksa regionalised longtail kopa girardin mcdormand hysterics showmen diviners naldo homogenized inflammable felonious russells glencore dalman incinerators silverlake spondylitis levien mineshaft cuyamaca woolman denville warlpiri rautavaara jupiler overmars misdiagnosis yarlung garagiola culverhouse enet sildenafil kvarner gneisses unburned maguires lebo cremations scintilla tidworth uaz dupleix giorgini albom kassar troncoso kunde jobbik retributive toumani yeosu aitutaki gunplay hoskin fontanelle adebisi froehlich selectin carsharing borzage sununu gorleston natyam philipe samet halprin leuchter mangora xscape henbury algy iwona aana amilcar callistemon watari varnishes fayyad traun atmore macallister srr guercio funai parrilla kiyoko oof naivasha intricacy interlagos doel tarbosaurus minix basedow lowlife derbez communicants unashamed banjoist paret hilberg clapboards notícias pfd furnivall ayush mekas whorled lubyanka aldis footfall bancshares wychwood wehda egge dodos willmore kotani aquilina ramc dutcher dilwale chubu jöns rapamycin arieh cholmeley strainer wingard liberti regularities jiddu aribert ligi uvarov vleck yevtushenko pauvre markoff soucie monograms jael vilify muzzy muqtada lancie 
huizinga rubberized colney döblin reemergence holländer shenley voom novotel dago bacchae schoolmasters stasiak dramatize bronchi barkers pinetop capitulations fritzi gery pulping jankel domi radiodiffusion costantini ethridge berkelium marwa vishwakarma lambros esposizione foglio ruess perversely skikda pinho malinda grétry arlott chaminda vetri vea algis edah roaster maneri pantheistic gorb leatherman flandres biggio warre catechesis eyres trendsetter skidding nazia madariaga yared counteracts technicals particularities opencourseware reshuffling remeber multiuser debasement brousse tiedemann skews auroville pelee dominum itzik kirloskar urizen kanton analytes perutz bundesamt bustan entranceway fatio schlierenzauer henrie charmers asakawa subsidise hoult berlanga damita ramzy seiyu godsend starkiller furber thuja ronnies olley kostya moseby eddery asyut armrest marshak giraudoux harve handstand telo shenfield airbrushed treu pandy dessin talman unconstitutionality miscalculations gervasi flouted munakata dubuffet powerscourt mpac hawass makiko booger translocations rechecked artland deggendorf eanes gaku contax ashida reliquaries represses tiler leks bluestem furcula friant meridiana portinari sciencedirect qantara ratifies iao madinat spuyten hanoch vente campillo chapo spadefoot eleuterio despaired hernani naos negrito damião tattler quinten topsfield arae dewatering metromover vidocq elham driehaus ticha maternally ziba scuffles ornelas telehealth parlayed andel sankei pase fdj despres muz asgar wangan bulma imbibed thalassa arlovski telcel raub doubletree digitalized haleem mantz karlos ruppin hermie crackerjack zaim chichele amsa wisecracking balzer telefutura oley astyanax signatura trogons rautahat disentangle deputising norther hindon btp exorcists bachar illuminators stagnate maciver treeline shoeshine rng esty tsuchida guterres prweb battlemented powervr brucker radicalisation aeronaut signaller arla rodionov indecipherable nupe boks linchpin broughty 
replicant ogs banias oscoda malko gallivan oddi tamron milkmaid shalwar chartwell blips purus hypotonia kittin priceline bailes clarus dichromate gozzi ctg kennebunkport rexburg ellard tsukiji roscosmos gesso roadmaster bunching rickson ffh clouston verrazano lhermitte grubby snorkelling kunshan vna pistil goulden anau antidotes tapachula decimalisation explainable exciter paintwork codicil parag manana afo vra doina tunnicliffe talpur pomare fullers clipse ceratopsians mckuen tempranillo jaypee jow ofr sillier baldassarre libbed shiozaki outskirt judie martiniere brachiosaurus phonographs yoann brandenburger skadar piti montesano bever deve meinhardt reynier jukeboxes scart whenua baise mabuchi profligate ruba earwigs slitheen exum angelino rawnsley washi thiokol hsk agnolo ubl crutchley lopresti sedov crivoi swasey hitchhikers crimmins schnauzer millpond masafumi heiligenstadt ultrasparc metallurgists discolored kiwifruit infiniband bissonette vibo enma topkapi cambrensis bramah fittler hogmanay boling cedes hemenway viele thye spn transacted pianissimo giap afia teso nbdl casaubon kingswinford morgentaler perlin druidic goldthwait signees foh dimms ruddigore portilla oligopoly honi stargazing warin lightbulbs ntd duyvil lettera segeberg kail tonna tardive kaylie bordesley fitkin perdix oilseed dilke rivulets updf cored povenmire criminalizes silvius thile archipenko ranjha bertran moshannon cainta itaú rosana kachi phenobarbital verdot middlesboro handcart hafod muthanna gastroesophageal chesky cordeliers moeen guillain phetchaburi sindbad hannington dabbs munter rockhopper silkwood hadda christoff masdar mfl langauge paneer unremitting templemore fixe blockhaus dandruff gleanings reawakened arq otw ufford wieniawski ejaz basi lugansk jasen bfl brinks oyl corks bakiyev websters windshields wankhede pido gripes gamage arness garrisoning boreanaz bhasin trackball harken badali fbr virulently girlish ismailov happ kulka chohan bankston galerija gorgonio segun 
jeyaretnam windrush husn baía theriot davido caloplaca wallack salicifolia libreria gazer skylines scampton beverwijk presti nkurunziza dangereuses breakdance wilderspool spode impinging dukkha sods eset harpies jubba ahlberg thirtysomething onyango torridge mairi elysée amped burdock peltonen neshoba sirianni daves arzu tywyn uriarte harijan diffracted zhurnal neukirchen kinnoull adhi wailuku gazed popkin sawchuk tsuzuki beene risotto kropp outriggers chiquito nuo drever mudug radiograph belittles jhs scarpia ironical souring letzigrund timetabled ajk reportable caterers ercc alliot simonis meristem perrie obeisance tightest ceramica perforating hauptman euphony counterclaim regurgitating outsized menefee headwind spillman lahood rosneft sandbag deniece sajak hettinger tulum gesturing teela edwige stainforth gijs kdm windsurfers subadult mckinstry wbrc karuizawa lovel ichneumon nacre sechelt metalocalypse preciosa sanad lustron philbrook videotaping jaegers copas brownstown oulad vpl blankings choirboys coode yurika rivier wilburys acacias ascorbate roze excusable vestibules gumshoe timoleon petek moure ilsley biosensors marlee roquefort brookins pharisee ouma improbability ftt familles splay fono dogen igla syrie stenciled luxeuil bodenham luella eurosong kaweah tatanka morny alyce notrump flout esra ingratiate ebt kushal polestar blouin charolais micra gaceta blairsville mlf wth koreana ahle oes tiaret crpf satoh raspe cowgill perjorative blaring cowrote antennal lini misto sirah huckaby moala hongkou bisque swansong pilley chides macchio huaca schlessinger epoca iliffe gravest mycotoxins saja zahar hansbrough transcendentalist cervo haba centricity sawtelle userland quiche alberich kehna siffert courbevoie diagoras nordea cashback wolford pelzer chernin audre laman strongmen regals farndon arabesques jervois fanconi childline hadas belling dokic khem specificities vose hellwig petrovaradin chastel kutta overtimes ossip moiseenko anamorph dematha grapples marske 
emeline endino quarrelling kutai cawood kanwal devenish mcgeoch diabolic aguado inupiat arline tuath mccartan hyponatremia cordwainer engram liberians slotting walke barceloneta denyse arduin tormes maghera atik bors carreira subsiding smiler primatologist wahi remediate assesment troubleshoot langage familiarizing hallström usnews privatizing kigoma saham undercuts srisailam brocklebank prudencio surayud moorea rosal kohinoor schor sturdee saltburn overdoing stepsisters ruder szeto mdn wooler ducat thunderclap directo pwl puce huong bunhill tesch narodnaya extrapyramidal voyaged goar bangle mccoist lhuillier eurotunnel fryar brackenbury onsager immelman shavit keddie neckties mangos grimy canepa persuaders imprecision inviolability refines mitsuhiro quarrington panhandles foundering rickettsia comming friggin ormeau cherubino gayo armytage tartrate clarkia divining constitutionalists bloemendaal accentuates supercenter hoosick fairhope romanek communistic petters westmacott acrylonitrile contradistinction pnac patriarchates conciousness bandido capernaum arsdale unquote blauvelt mencia mattocks mystically cunningly fga jeanneret auberon touchback yohanan colaiuta ecos waterfronts walloons pbwiki compulsions rpk malfi anvils afk mauling selamat gunawardena bov complexo cluedo jare euv homeownership junya dicicco chantrey kardec lampkin kwasi trat groundhogs dagwood rvc bjelland darpan posto byfleet fruitcake downturns grotta kurile cratered bingöl mcfadyen uncrowned xichang raynes saviours beauly sebadoh calva omt biochem strawson loza imaginatively hellinger suge apfel befits muskeg halimi romanes waterstones bahria centromere unjustifiably borkowski chicas ridpath saugatuck auriemma koinonia csw tym walberg plantagenets chrystie comino rodel jfc neatness summum rosalynn tullow northview terfel vampyr becton pyromania muzaffer chilcott woos skjold kleenex beres preti chloramphenicol tremlett adenomas hopps lackeys albertans brokenhearted cecchetti ambystoma duby 
buckcherry lss scholem cottagers aborting phillinganes noviny oedipal brosius berchmans whimbrel ritalin polloi orlik arclight latona sallow ohp oaken hdds mindsets mitchison mushers chinas dinnerware cbca uncompahgre hartwall polysilicon tapiola reisterstown prothrombin tambourines nyayo borre aesculus mccarroll sequiturs morongo holistically messias bridgehampton geode waistline salomón essentialist fresnes keuka handcraft thieu ultranationalist zytek mcdonell carnie lacquerware duas polwarth disturbia brackman caballus gondii winooski yichun oia humiliations kerkorian ugk oakton tsutsui domenichino galion eventide bayoneted pois carjacking azania valkyria naysayers greentree thamesmead licia halvorson enshrining romanticised illu tdsb interlocutory hybridity primi krew leiper oriens fafner burstall backbones timidity puritani nary malouda crider bartolini simin alleppey sayeret cnsa kwaśniewski freakish lightyears shinrikyo embeddable fasi ibérica shinui hytner lorazepam applauds navaratri panik gauld baldrige busacca orgasmic nyr surest anthropocene missenden phillipines lavergne nmsu paraboloid melancon homies jaf halabi teenie nordschleife patroon seaborn hockenheimring wehner forevermore colomba akwesasne duis pittsboro yaguchi merula beyers turgid aculeatus kinzua dextromethorphan merode montella armato isom saggers tolmie kombi kusunoki slimming wenchang liukin purisima lgr kecil pmdb sokolova signboard saft katu catlins uwi broadmeadow pasarell sheepherders lamma andri transposons kimmeridge statism lyudmyla privations skowron darlaston furan soomro sadomasochistic khasan llŷn kelsall rawi spatter intersectional vautrin filson glazebrook mochtar nolin blurts jarrar diemer unità deveaux emmert hainanese hitech dacs cdrs utile akure baade elliman redfearn brimmer toofan insite lipoproteins shophouses hardi motoki sarafian altidore beleza roseanna frankton woodburne hausman sanded préludes declassification shaoqi stroup qubo vco duin bonaduce kida melpomene 
ridout palatina baldridge malti donaldsonville singur comfy leiston papago swaim yasunari wimpey ammount gehman catia otan livestrong tourers usami invalidation dearg jetson stavro lanfranchi mantar hend thuan botto atago thibodeau brotzman shange mauler storages comi underbrush canasta sty nichola dramaturg ignominious chordates lanta dispensations pkg falvey perebiynis ptolemies downpours herms wwj urinated residuary hatherley gree exoticism fyr dinesen hajnal evangelisation alwis blackbuck larin mclauchlan portail viadana hso heong beheadings nedim stalactite feg alyx hirabayashi eurobarometer venkateshwara kokugikan dotrice epicyclic aurat doremus birge weiskopf dawnay boasson kielty juncos vervain fluorescein pocketbook semon shafik hemodynamic alexeev exel birsa minora doy burwash antikythera illusionary misstatements grima wilshere menhirs vergel galbreath ual teats courte hiroo tribu banjara southcote bazargan frontages abyei lavern eildon miscible boosterism américain hippeastrum calmette sfsu slaving landstuhl drane terpenes drumstick surplice kricfalusi injun gbh fnl experian hauff sundaland reworkings hjorth kith axion oakman callistus quatermain amidala berowra sarabia bragi shrimpton teat kanner krw gress sowers slyly buckhannon stifler ilion phèdre dolin brid papio hebraica rpj traceroute nudism nebe vocab petherton mrcs vicent chhatra buggers ahmadou flabby sleipner rifa pivovarova milloy congressionally wachs phr milkha freshest juxtapose mixmaster libitum archos convolvulus abbruzzese osby mieko inconvenienced nayland holsters pontiffs palooka cloudburst maelor shenango repellents miru jcl matsuno clairton yogo jeepers roughshod rearrested saltpetre sqs skira rohn epmd fifita hyperreal norian swartzwelder abdy boattini circularity sheed hypotheticals stater terauchi cackling chalco dónal quashing tengiz flyte jaunt sociality novarum cades amul deniability prnewswire dolci dorland centerpoint rango dongola devar steenburgen mukta turtur clore 
achham ogoni follo frangipani radebe npi quarts rendón sooo swalec amund seabreeze overexploitation kunisada ofs rubina chickahominy levins balladry soave wayfaring busa madwoman swilly satiety woodinville koyukuk weigl caulker stylization imes heyes oei yücel quickies rouch casc barto milonga egeria nonreligious merreikh chopard recommender duffin surendran obfuscating xiangtan dvin riperton implosive distended cubesats breastfeed pinstripes nns gerstner etal merryman bedales tlatoani phylogenies itty tlaloc mwp ngurah bercow adewale sgn bethania vendome gulen ˚ statin pharmacologically rivest searchin criolla degaussing najwa deflationary cortices ptl zoff donnan wissam xishan shahjahan winnifred plasticine dufy beke thaxter aot resentments archaeoastronomy cibo boerne pinole corns prekindergarten myoclonus goral belbin khorasani alishan pervade penstock raku sinkings roseman talmudical simic tdcj ceni kma pittosporum rmg canopied deforms zootopia igt zeleny etp cecum seraing flecked lampi helgeson jayawardena jostein forefinger debbi pawtuxet beb billows idas lattes fumarate juscelino harvill oag balts plastron serviceability prowlers higley stiegler kreisky mixup brok occassions haggart frutos iohannis muffy saqi axford mindbender quins otieno antiproton hambly ermelo reimbursements gimlet insurances aporia mazi armley bottcher adès yeu legat rudnick bustillo jharsuguda tattle grooving apnic palak krishnaswamy brecel artificiality leatherwood kyoung leverton doubler roue trebišov shoprite araby windu lekha kedzie arber shariati rybka schou midem fre cnoc gentrified beauforts moutray alguersuari strongarm tros forewords mukilteo ritsumeikan djenné keshia scoggins draping katheryn naini meggie thomassen jodl koponen wamena vars shariat yakimova jub amneris britches ejb truism wince privation alterman osteogenesis freestyles ameobi davidians dzmm futian romanenko sargento salima guatemalans vandyke diakité colaba navales cornforth savor bellu huidobro esthetics 
ecclesfield dalliance categorises sinema sverdlov duritz stradella graebner webgl piazzale taitz conferment mcdermid zorin chatelaine keratitis odt exonerating trebbiano bartha snb greektown mirkwood rics naxalites bochco kewanee partula youghiogheny tetrafluoride hallucinates unwholesome exploder deferment kyan barf madrasahs opportunists mispelling ackerley turbofans ingenue wmp weatherboards scowcroft nasreddin hizbul lucier campton handmaiden foghat ferrum creede goads fluorophore kotwal golam slyke clouser hylas yamagishi bradtke westhampton multispectral monsta zeba miyashita subversives kazooie latics envelops premonitions tamanna voronov berardo raffael glycerine toyoko firestarter worsham okotoks noordwijk kotak epidemiologists argan cuckold cluck longstone bronzed santucci clm sesay cotati montblanc explicate oversubscribed wfla groce denyer fends mentzer trapezium longines rocketship immediatly katahdin odours guten gangopadhyay langat tandil breezeway amadis koby frutescens woodlice dite tilo silkk bottega storz bastin tongo junia alekos vasi timperley melkonian upstaged lopatin stangl ribcage alcona ilgwu haggling panas maconie undigested nafi useage arkestra cran nerval haslem beevers brûlé bhar mellows nihonbashi scorpionfish nasreen cordwell lilleshall zielinski minzu guarantors stuckism headpiece wakehurst tollgate sensenbrenner gelbart dewald altho afterimage roybal almondvale blenders monomakh theakston driessen kripal marvelettes pellicer congaree voo restates espers sealey claptrap iechyd gärtner foetal cinderford silvertown felician sovremennik goldsman comorbidity jarratt sofitel yoong cattrall yashpal purdah laury dietitians umist heliodorus smosh compactflash akeley solvang orley skiffs larios potluck amazonica rachman videocon rando céleste helsby sonars fenlon malmgren nayagarh codice creaky spewed disfranchisement stammering kalt mccovey peutingeriana anoeta lungi gherardo evanier larner tumorigenesis tityus plats havisham naiman rowett 
juggled seaworthiness telegrapher voyeuristic gadgil vittal anchorwoman garris dromaeosaurid biennials gravelle nacion piebald screener baco denni blethyn underappreciated benzi pacy binod celebre sanitize ladyhawke kulu enriqueta cambon negre laka mosier earnt staleys autumns glx shepstone kinard ttg dantan caos cley vyner harrowby goldbergs somerhalder essanay capsizes klick colorists confidentially claudiu maiming xvid marke knobby transhumance tiroler pulu epaulette dbx leff conveyances nibelungenlied jubail demoralize vtm norcia canadarm scherman cahier outflanking ankaraspor tailgating wiseguy punxsutawney caldicott vashi ivonne miéville deyrolle aplenty nebbiolo munden kuldeep palmate worton dissimilarity fanciulla chahine oconomowoc chérie huskie stoats desirée khadijah totter scrappers critz akel ranatunga barón onge vanek kany shimazaki tanti rofl ,but sanctify marmosets zq klett nuon niner irrfan beauté mcteague séverin satpura newscasters eaf monkman gatun soucy seismograph alpa aubrac propping bertelsen fabricators rini morial sarwan fowlkes autonome monn zehra squalus carhart kamali jamerson gcap insures horsford wieser bluefields padovani chesler carstensen overstretched severinsen berjaya ottaway albery ulmanis stagehand gelo hollen osian nexen unhesitatingly dignam batalha situates poulain skomina sportier refrigerants rodway grieco thermocouple esterase polders carbamazepine abadie klaxons gidding tsuneo brassiere hopedale conoco slimani pkd nuc fickett scio aloni marva fashionista gavilan antonucci tabern atmospherics fusil vcds contactmusic wach potentialities abhiyan multifamily typify brassens caquetá mayoress simpkin gachet sriracha heaphy tarquinia mallinckrodt battlers froebel swirled pessimist nampula territoriality universale drottningholm neutralizes champneys stecher therapeutically mugshots umu thura tardelli digoxin zaia mongers prorogation spieth nedbank readies putrefaction parmelee chinen bergdahl maske kiddush joux gorce genoveva 
rosengarten fanfares imx seiter watlington unus akranes pyatt myburgh ches sunstar klehr blechnum pozzuoli linha callison tikaram lsr efsa ncar disputations hallowe udom appell kinkel vocalisations grandsire costes driberg amyot tailspin wareing varices brister corcyra bushbaby backhanded vorkuta foliated paroxetine cors orientalia dandong probationer unbeliever daylighting bridson calque husson frigo ngoni yih deric ergs ohh harrie comtes symbolists bayon goschen pfennig reconvene woodsworth jerningham twinkie dahle pchs marre codfish sinew trifles labasa hightstown strutting millo temerity tensioning cévennes wearside stapes ophthalmological sodo annua hemorrhages homoeopathy compositionally chrism transiently denso santon plexiglass flink cavallari njn bentall frankenthaler fluence casar niwas quinteros nisbett ramanand ogni winant disquieting strick infocomm offing fagg griots enppi tfeu nafs houben melander neher eaker albinoni vernonia farne enamul koscheck decisionmaking realaudio viduka lasher cftr gerken gaulois lesch zedekiah capobianco birobidzhan prickle gxp bivariate asynchronously poleward pangbourne happend walkerton maars tanuja anglesea maney teige sensical pharcyde okeke jca encrypts ebcdic onoda metabolised voznesensky neshaminy nemuro plumages lassi antm aconitum chaska anglos voris verrocchio garni usborne nmm sequelae diciembre bhubaneshwar balthus moneylenders mayol naudé hoog deckhand waheeda xfc diederik lebed ximénez jeers lagunas nonmetal ksp nooksack blimey bukavu euphemistically sluis krem clews buyback maathai corpuz nizza suncor commitee wisps fictionally radiosurgery tamina loken nanton noes huf buis cigna bendtner shantha anchorite funneling grassroot ryohei ungureanu lippitt veber supercooled rethought barings usenix janel unintuitive actinium ownby yae waterberg tsuruoka sphinxes chenery cetshwayo gowon microns nozaki hindawi atherosclerotic stenger renamings strawn illy warders sindelfingen morganza gandini laxatives abyad linzi 
jex nasrin ruane synecdoche galette shamanov steinke slapshot redwater uummannaq cerise foxman horman halesworth résultats ductwork thomasina prostatectomy rendon synchronise qtv radburn bushwalking suppressant konig alzheimers glbtq giguère shutterbug todaro haneef archeologico anoint cousineau mandalas btg phev womaniser animo dichloromethane koslov duhon pikesville anri adon galax pacifique vogts anticonvulsants panzerfaust aeroméxico billingsgate wallinger okefenokee burghardt luanne osos mutabilis airbrakes planetesimals reassembly gnaw imaginings aplastic gitanjali pung ronalds buxom northwind sturbridge laem samian cairney tailend kochanowski swiftness attucks kinta haitink safwan preening lhotse winkles scheller leonetti duathlon trickling hatherton roundness choughs turntablist cers caucasoid cinecittà kimpton glis allatoona interrelationship zacapa tyro sova watchet unece extrovert kowalska bulstrode superposed ardingly chetwode militiaman tejan mizz yuletide shettleston incapacitates parenthetically screwy replaying bazillion zoila shmidt mccoury mcshea haralson ciborium toxoplasmosis peepal boor compania sayreville sanogo begoña catechin burgan papis dniprodzerzhynsk fragoso dickies nullah mengal overmyer immaculately nakatani utopianism jsi ghislaine olimpic postma assignation cardell jogger underestimation loosed russkaya utpa pantheist sobha bruxism vti outfitter justman inhabitation lipski absense sillitoe cito vjekoslav jingdezhen frégate encephalomyelitis ibérico mawashi bernabe infrasound ebor watercolorist grof rapson kautokeino cbx fagans uncategorised revillagigedo outgained bentos hajibeyov winterhalter backcourt azzurra alcmene omnipresence sherwani kratochvil repairable reenters erections zoominfo centinela headrests bollman buonarroti tillery solera kortchmar ruffing piddington idfa culley mellal tacony oommen jochum gasper samit nolde erbium deducting reedus fauzi tempura handwaving felids ingrassia poultney sashi encanto horrendously 
glasscock kassandra maassen rhabdomyolysis mobilizes hottentot arika chauvet waiheke valadares brenz iowan nilssen duch penson rodda akashic padmavati headey haf minehunters coanda brucei ikram mccarville rosenman sadleir virga binhai superhit snowballed granado brunettes kalay kittiwakes apprehensions cawthorne sterols embleton normie jennette caillebotte stronge knower butyrate airfare hartlaub shwegu lensky sparkly florentina ctt dyre bomer noxon kazanlak carrell manoeuvrable fayum jsut annen tangmere kostner pharmacologic kayenta sarkisian bresnan nevelson suji kuya cephalosporin souces lotuses thatcham burkert airwave paddlefish hattin holzmann rondebosch scafell wildlands moneylender eaglet chilvers upington romanovsky iguala autoexpress shinichiro républicain graefe fka usccb faizan bindra banska ziller blanchette herby morishita naff kiselyov tric couloir esol sibuyan vanhanen kentuckian mcelhinney dignify mbaye aircraftman barbi jedec ioffe kneading raben wallwork eadweard crédito blatty tearoom stoichkov gennadiy clattering tcw istc maryvale labiche masu atsdr puddling tightens farthings antawn vindex chocolatier burbot wilentz brows rebroadcasting kolff tompkinsville magnanimity bokeh abx stancu endoscope belek irrigating studentship nrb trygg kubuntu bioreactor microfluidic kensit coms navigability fiver bemoans chata superfight fishbein qader altissima bussard phoenicopterus mawlid succès hargis promotor ucv kelsang ubaidah dishwashers rincewind arrabal mimms croon surmountable ngok malipiero gyfun wincanton helfrich wayfarers shepperd hellmouth mistrusted leoncio blurt mado cyworld toilette menifee dewine knu ziemia palaeography oresteia wadhwa shorebird khayat mires sparsity employments danzón frayne marash voges iriarte mughalsarai beckons mulled hueco tumbleweeds longshoreman minifigures peonies ravena fiorentini cdx councilmembers petróleos manninen ravishankar gerbert commie fum garigliano mcginnity tormenta blackton kratie cumbres isco 
unmotivated oosterbeek lazzarini relishes celler multiphase lammas gedi informa msci miño dispirited rackspace kpl cephalosporins coni flagstad shivnarine mitts accion lunate malfatti garciaparra longyear payless bumiputra kasab solie gjilan pardoe fieri mytton tablighi hallet riggers karang presupposition blier mccree epictetus copt oec cattedrale chateaux generalists hila horsehead norham leachate undercooked tiberio facey ascendency kaist gemological plaything pickings nvo creepshow jackknife dooms hums echt ngorongoro bauza rusholme proclaimers brewis leghari jailbait buchalter titmus extravehicular karlskoga arifin ehe neeta kenneally clézio justgiving minshull narodni polack briceño longdon leavis supercells beddoe gulags rauner kobol bandhu guipúzcoa stookey anthill wensley wrotham lissitzky omnis davern chalcopyrite hoots oviduct dejesus chamal trekker farro cryptographically cranshaw polonica vestment igbt rira pinchbeck bmps lebow interparliamentary chanters dirham guardado metronidazole hpl brownlie amphitryon unclos hollandsche astronautica tauride pinkard catolica bayarena nfhs corddry disfavour vasilisa songbooks inadvertantly opposable repopulation gintaras eckman calahorra wedemeyer kozue carpinteria radka stringy meiktila azuay unwelcoming khural donaire cirkus barzun brazza hulot dizengoff bidston nahid wdcs alvensleben fumaroles sirindhorn belgacom feverishly stockmen eai powles businesswomen probity selectric mousasi coerces blairgowrie shambhu adjudicators asrar mckesson elektronik tyrie scullery aminah peabo chugging incongruent woz hmnb dislodging incorrupt dichotomies fireships pyi blag meanies refreshes rapacious aio unbearably lakki hejduk rioux tsvetkov clasts orisha imaginarium westra repackage diaby missolonghi etting repaved newish oge isolator divesting rudenko rosarito dehavilland stimulatory foliar skyrocketing mosson crystallographer catagories burnden stevedore vassili rantanen gothard diagne cliffhangers rooyen domenici 
interloper achleitner inslee murwillumbah freebase overloads fazlur ifsa maor indigestible obiter metrowest remarrying yashiro reconsecrated southerland affixing buel preamplifier meatloaf purchasable borth lastra kepi emended nafis pamina tactfully subframe hippopotamuses catchall bvg thirlmere whizz margi caprock pacifics septicemia transoceanic outvoted kashmere satis ifans exhume riobamba fedde bwana noureddine brachytherapy sahoo spoo marle phmc rheinhessen engelhard tamada lera ibaka pestered krahn melanesians lcn mcat hottinger chemise newsrooms blondeau cutoffs yasmeen levita servi rupel usca nagarajan reanimation vcf jairus ladue squashes bickers gabbana quillan ands brisebois izhar winkelmann dhanraj sultanov smithee lutter fov mazel petya reconverted ceilidh ufj misuzu bused catwalks shaaban kawano dressmaking ridgemont hammerton horniman bireh subpart amran tracksuit murty decapod reichmann carlock sapping thrombus icb garmon takamura camryn adumim weaverville pajaro udayan chown santerre polarize misanthropy eisele muh chisum narberth cytogenetics isopod inclosure anacleto ludden sapped metron itsy irksome stratfor straggled lugg holleman semiramis pleat boosie weakley apolipoprotein puerperal barga forni rusu ivies sardinians toco vora knotting dauer achaia excruciatingly sinfonica capponi nicoya nwf stankovic hintze commercializing sulfoxide dichroic pâté noctule skoglund bibliophiles uig welte christiani pajero demetria trumpington placidia steeves ruzicka mdk wilner kompakt lemba biagi timpson cintas nunchuk bottrop bbk durang postern maloy ahonen rafal larrousse regehr kaelin menchov dlt chrysis fli tinderbox allinson aboud orie showstopper graciano inspecteur whodunnit farol baule cgf froch inundating bicalutamide abit bizkaia dinitrogen genitourinary weizman incendiaries dugway rangamati cosigned waziri lolli drool millenarian naderi chembur josiane stannary setauket poltimore alberdi neuschwanstein dodsworth callard kubek astrée kardia vientos 
dubourg kalbi zukor subsume manzana hoque unsuspected pelorus slideshows holle uncoupled tuta halwa orcadian parktown flauto maipo ardour skovgaard refuelled millerton evensen berend safwat unweighted khurda rockit jaqueline mcclymont sinterklaas epitaxy trousseau lagers labbé huzhou refiner coss probs oblonga sayville desmoulins sauvages sarani azules boje golomb pneumatics clonard albrighton kaira petard arnall kilduff ista levison bielsa poncet hiley brigit glenroy sones vissarion fisica tangata asos urbaniak gillie laliberté trashcan beringia krissy dunked declensions morriss philomel dubno hovland quainton muga galeb grigorescu depailler kyros furyk boussac blazin gilstrap dredges missi ulp croes atum moscovici kark hartson insurgentes woosnam artemisinin europos takaki rotondo musicae kheng collierville devere gradin abdicates volmer powter pearman ecuadorians bluto polat sexting turmel eleison azd hartono melitopol panspermia ewf lesean anticoagulation monreal zigbee dampness zakrzewski firstenergy oxm owa nwe burkholder sauda alioto mouna behringer wilbanks snowmass jakobstad heder judkins gostivar gulley ghomeshi loathes donghai bedfellows haggai colloquia braybrooke herrnstein microenvironment kingsdown broda laugharne izabal ennerdale tragedia dengler prenzlauer chafing khandelwal weardale tombalbaye prissy kmph askham dhea maram mangla azeroth kexp devgn variazioni pomace kimmie sakhon semey balbus mimeographed ghanshyam brickhill montale cobs bresnahan kiriakou alls desrosiers redhat faucets maurie crapo rowson croucher kayne hopatcong granbury unprincipled azmat rinus dionysia boyles ideo stackable kegel pennyroyal kenitra pide menza retards lajas macrovision kononenko kloof masataka woodyatt oved spreewald guimet possiblity nominalism heming nearshore policia genevan pingu stylianos gurdas mutts abdulkadir hugill webex frangieh inet cruzado hunched enigmas hallvard laza schieffer levitz gorget bestwick stathis pedipalps schwalbe monosodium brockovich 
widmann soso tillers meninges actualidad habash receptivity gnocchi neuropsychiatry bantustans gildea fubar croswell chamunda royo novik raban blomkvist picon lcu terezín cowman parastatal cappa missoni rafn miniver letoya floodway yanofsky pontiacs zaher purples mcguffey balut vdi venne meurig buridan autorun cripplegate capstar baudry lacaille lugoj susteren ksm heilbron cavelier peyroux seebohm pecans redfoo lumo mangel lakmé copei bispo prototyped prasada bph arani mediacom pacoima haitham speedboats straka sciacca horology slevin ldn dullness deadlier paestum puti leta chaebol akino freedomworks saidu arseniy makedonski danu ovide barleycorn slashers agne björling zazie sunanda jocky fatehabad nwl vdot tsawwassen chug hebel reparative pauker zinner glostrup slimer nixed wetness pectoris fdm fumigation fti sabado shagari aquiles bahrani kasprzak spurling caesarian larroquette heathers feaf vsp reperfusion barnham rossella venere opinon tmx glenalmond osmar lifesize pagny leaderships sloper muldowney stelvio brumbaugh vano appends houlding ivanchuk townhomes belleview ashrafiyeh ceaselessly byrum bleiler wtic briony esrc gingerich agta misfired kerli infographic customisation snuggle gnarled rues zemlinsky pliska starkie puer viewtiful ivp frederickson evigan picciotto malcontents doshisha owlman heidelbergensis darwini terabyte stroker lakshya cartoony cardiomyocytes spallation warka curvaceous minda soos attachés mccardell heartstrings bbdo wadala oktyabr morata bohlin fikri kempes acquittals ipd tpv nesa bng hafs laparoscopy coonoor netgear aksoy sverker linolenic prenzlau grandly waqas bellin interlake deorbit niaid iwas modjeska etrog csaky unie drownings nuss sabourin folch astrometric foton hamidi estrogenic maure dismissively koln khajimba prepayment rotella coaxing redolent parel sabbah woodshed underlay boothbay musetta homebuilding zwinger neuberg bartonella foretelling revson sieveking dynasts dragna hidetoshi bikel lazaros rallis swarthy ornithopod 
bethan debden herington thorton tillett ficci fleadh distrusts leffingwell pleso darnton alph mecham engenders tonally concetta hanabusa helvellyn orients volpone pentacle chassagne tapajós pcv sentra otmar inquests lunds unexposed collies savva steinmeier victimisation wapakoneta earmark boscobel papiamento gaidar wjbk fruitlessly saddledome amraam painlevé tzipi kirkeby dombey nordica goofing lvl arkel crimped jonasson boase gits lockable trampolining giner piek gfw texturing loiter swoops pcmcia hausner kemmerer outwitted avenges kayoko rationalistic ilunga characterisations bonnot morayshire reinterpretations fomalhaut motier crip commendatore jaeckel sighing neet terrarium caac larocca vojtech wakemed swartwout fantasizes trumbauer fearnet mandira telegraphist spillways ropa sarsgaard mcniven occoquan kihara forethought uzma conleth asensio nbg isam rustication lansana nebi numazu toklas grendon janin jnk vasseur jrue steamrollers kikwete viridian inflexibility najeeb ducas bilad middlemarch walraven farrokh zooxanthellae roomy froelich lazzari jarle protuberance tempor steigerwald ntn kome scacchi dryopteris vogl mabee cauldrons guárico assateague pericarditis litigating bhosale lunden bluewave cranko borgmann salcombe basch gored haparanda yaobang treinta bulwarks centrists heldman mukunda brora rockhurst rums musson cowal stammers psmith proctors perfetti stayer oddo successional blackcomb caffey gatley serow neuritis propionate petz hydrocodone coachwork talitha boces fichtner zayat glickstein varick intimidates hempfield mouzon allopathic shehri tmr usr chindwin bittencourt gabbert pamirs nevel wandle dailly baril mattila husker horological violative edgartown alway friede bidwill quintette uria auv natus bdm dakini frontside hummock marzano factum sadak paulhan hominy billionth zuk rosses seared lindeberg tarso borstein pemmican trotted sarasate maurine teepee adamczak neptunus tuman azaz luh pmg cognitions kntv posses difford tsvetana schelde tova 
beriberi clauss bookcases mcgirr scobey alexanders railcard keyport parnes inducer lish ciampa cubas ironworkers madrona teran phrasings gmat srivastav mahaveer visto springbank hosier kneaded udder pelling vlogs boules sonisphere heinsohn seacroft uhs carolyne tilford zul grolsch unfertilized creutz bennani flamme obviating okawa jaba naarden mahy saturating tastefully bushra zeil avidan abidi pedroia ferritin borage sare lorica metsu itcz oladipo lunceford fredensborg certianly irwindale anm germinating personify akar dava silane capades bullfighters anachronistically estatal marquises erno dutkiewicz asar misbehaved herbalife rayan deflectors equipo bonspiel culturelle mehar adly hyden cantley jamborees mnp heeley mariann guillaumin powerlines franka rampa junipers moris nutcrackers helots throngs billi baranski ity quartile bathtubs marmite nahman inordinately essel aem bathymetry pulmonology forded ravalli henckel orosco splices abarca cpsc kahr rauh potton transgress immunizations thackray anahita moistened meja comicbook mccubbin kies gherman hajjah fleurieu mcconnel politis lirico lorax huling bajaur chait tampon keibler suliman dach gubin autolycus phantasia itoi zdenka eelgrass ganghwa lucjan doniger quietness wfld frieder westword tsvetaeva moire broxton shalev westwick dfi cocina persicaria methylmercury leadenhall zeitoun erding betham blurr stanbic keratoconus milbourne shouse hominems hotspurs transliterating biomimetic neutering tamwar caliphates talen charny dingman preciado ornamentals drea bulman eventhough minturn cabinetry sexier sebago bérard sirajuddin tresco ausa oaxacan bju pollet rosicrucians primitivo nicko riet dayo nishijima parun klopp hawkish kotal skyview qadim westwind antoniou leonine savaged knoppix chartiers echevarria azienda qufu meteo aril zoon santigold dispensationalism glassblowing mooi couillard fotografia zapote dongchuan callendar maltsev pandev colonus awo benbecula anshe pilbeam tugging isabell hmos stromatolites 
enablement fugit lilah seoane verts glycosylated delphin kilgallen durazno pariser amyl indexer yuzawa crystallised menteng seductress morientes vre froid dynevor stukas cañizares volokh slippy stephon tilsley pohlmann ruidoso eyrie firmino cristiane restaged primatology propitious belmonts oliviero oberndorf staughton repercussion soutar manpads munic ameritrade sitt bryars showplace parshuram neuraminidase claughton myelogenous wrack idd mixx poletti blather nimrud rahma dorotea superiores pawpaw histologically woodworker subluxation wyrley montello turnham corwen tartaric griffis appanoose transcriptome applejack todhunter predisposing gál qilin kustom neeley hallstein immunohistochemistry taraf photolithography mantises calcitonin castalia postmistress yeardley igniter mids machito gridded mosfets ressources portent darah groaning hatto fhs zukofsky fidget chapala showbox gambir makris accentuating dobzhansky dargan longuet bronski ozomatli jingshan boggle ketogenic pricked alvarenga fredericktown hinn reframe stevedores grittier dewolfe hdf playmakers embryological calshot rocio rypien ubykh carentan satanas monopolist woodyard kelty thatching sharecropping narborough rathlin kepa takanohana shizuo trebor foras mousehole reengineering flemyng ameri leveque streamy disassembling wholehearted taels wonk gunder esquires kanun crayton pallette gellman tarps selz sightseers papon greylag bikash dursun staunchest velas shoshoni methodologically everette medstar braam tlv ryrie tyus bathes phuc thorneycroft servia ebers kadoorie catrin wanderlei navasota lacock eugenicists kafkaesque goronwy dier cappie natron hermetically slovenly chenin riada putu robbia lockie gleeful savu kristoffersen shatila counterfeiter rocklin manokwari iwami bdr draycott cnv kiambu cchs hupp gleann pyxis shoeing alcoves boogaard namrata abseiling desjardin dissing gyp boldfaced aubergine kasei indexation waterland vejjajiva sophos wharfe huckle mansura boffin zawadzki vct rone polaroids 
impiety bosie fratricide akaka provocateurs gyges marchmont swanky comity spingold sawalha kallman dormice hesitancy birstall lianas dialled gentner elysia norml vaid sgu afonina ferdy ahrq shallowly yezhov kirstein birman cannone cesarini shipmate silents zipped thrale aasen consectetur shrikant pule radiocommunication ballyclare waldner gebauer alexeyev bogardus quiksilver piscataqua popcap priapus brummel gurnee inno cingular veris hotan manca levinsky jigger coville maladministration howlers noëlle sickest adol prodan kawana matri medianews kinberg vlogger alpacas sftp shub jadeite kilfenora prematurity jeopardise dallapiccola bemoaning aracaju merten axminster alfre puth bazi madhukar saguenéens cleaveland papo joyeux korbel burri kahului teich vmat whiten schirach tulla escudos apportion markovich zupan elco kerridge manabe lisse devadasi sibbald fowlers morante cism jittery dect katar dumbass implementers ghafoor aeneus tigo bosquet piscopo rowand canice thiols bii dokdo dyno conejos diseño ogallala podunk kemalist allora chiffchaff nucci paspalum cottonseed añejo drugstores ecologic vivacity blotter bnn inac yanga ajw zande gardez reginae toogood grunberg firecrest pisses aht smurfit nally circumvents sanquhar nailsea neuchatel experiance barmy pansa avers delmonico toontown klem whi toucans holycross ope hoovers blagdon phonic arhat mounier serrat mistyped newsworld iemma jdl parchman sakala nudging hypnotizes thioredoxin gingivitis pugnacious trachoma ducey schooley guber faurot mercantilist impatiently longmen meinen jihadism leprechauns huitfeldt metalhead nijhuis paragliders patni surkhet zugspitze yazawa prudish corbeau gtm ikeja rokeach charman narrandera micromanagement marsland danon greenspun heiser hangovers qualls choluteca bhan caff soundbites rougier yildiz megacity myrddin parameswaran bullington jinling federman blackbaud bankier fmn malebranche weinreich fantozzi gep poncelet villaggio saurav flotte blanshard rampages willink plevna hijinks 
troyens wipf wike toumba grunfeld sleman alyona cocoanut grushko ecotec nela encarnacion voile pue impale oldenbourg roundtables blakelock monrad boppard ricardian megane portpatrick odos chindits speculatively seafoods clampdown rapt croxley prange thd corney caistor massé birdwell gallienne machaut antinomian engendering standpoints tzar crumbly dess silliest shivdasani cystine decrypts hangu sodbury larrikin jiangmen vocalise sharlene dekmeijere tippi deadlocks mazurek grigelis courtright guevarra lydgate amphipod feuille superweapon bequeathing libman tapps buju bernet fascinate perodua secchi goncalves widor mhuire blindsided mauriac changle makovsky contemptuously steadfastness tredwell inculcated covens ermey ppu tarsiers reliablity reauthorized nuaimi grangetown transvestites cleanser pinhas seow crne martinière cals demyelinating kait brimfield dunmow tinga unthinking coblentz kutxa greyer soubrette biv galvanize audrina shoelaces icecap sandel wondolowski oberursel khodro hixson wunsch merrillville nanai incorporations warplane unhygienic gwangyang johnathon phaseout bolter synchronisms neurophysiological godstone vendrell orsa negar straube mummery huie petrino henryson newborough juna hayk yussuf hazari ogletree avinguda indwelling kahta iceni stevia sparx yushin tawfik demetris merryweather prohibitionists trongsa kharbanda imposible xanthos padula hpe tiradentes convivial pustules clapped informationweek anantha mcq appletalk toxicologist wispy scurlock revolutionists creche skepta dorough coelophysis snowdrop clarkin stastny drigo assignable agit mcdonogh indirection vitex dramaturge screwdrivers pomerantz autobus dinoflagellate paideia antipode concision poetess ruggeri cataño wykes jpa sutcliff stebe yare matata phaneuf rummage berard jumanji fannish kirsti senigallia aiu intime indaba hodkinson gordillo masroor grump sdg andon unreformed hamida durston boychoir protuberances micronutrient antipodean rioch tagliani wadewitz manege botanically 
fabritius kohner metropark bijli redskin konarak eyeglass ljuba marquetry gertner picquet porticoes snaith spiraea rybakov licey pedi nawi schroth presario judokas belzoni sanctimonious chevrolets believability disbarment multiforme shillington luchon proclivities junky tomoaki wrs gardiners gymnosperm roundhouses kassir kateri cmmi bluntness bodger orser chikatilo swapan vulliamy markis neurochemistry neapolitans packham irlam ampicillin zanna mitter attitudinal mailroom temporality serang kremen horseheads optimising tailpipe litas hazmi miniato lobachevsky grapplers chiasson soulwax navaho subzero diplock southaven fridley jessa jazirah topor chesil starlifter schroeter bernsen sello amalienborg badea chogm fitters smellie newtonmore disneysea wcca amlin clarets ramkumar stokley padmore meramec cairncross qaiser konstantine reischauer evershed warshaw solh sadeh snyders lamppost duds bednar macrocephalus ttd victorine teel combated epd riazuddin darmon carbonari remini kozin verhulst phalaropes bolla silex jaccard mamedov koertzen bisection simonet nsm jantjies tomball dfo slb cranley geldern ecx sparkplug romanies chailly kurien bertold aberdour keister donahoe salins alema anhydrase troubador cryptosporidium tydings thaxton philharmonique perioperative daines boccioni eskom kuepper pgl donell zade liverpudlian kyalami bulgari loners menhaden larosa autobiographic applescript zeppo chaffinch moroney azin macca hypergolic danesh fatto coenen boardinghouse buttermere haki blustery hochman xinhuanet monégasque nicoletti ganilau gasconade mtx gonzáles occitania maudling anklam coxsackie hadrosaurs torching galeotti gose issey provencher macavity nitrites leitz epee scrutineers coursers eichendorff olie kayu arolsen sebastiani annand olduvai blanchflower unio badgley départ wides pirkanmaa rabuka woodlark callicebus protégée plastids hasebe hozier sherpur chevette talese luhya quinney colorfully calzado lorcan ankita perouse scurfield brandished cumbrae noriaki seno 
kangwon casework woi mswati senju transcribes doubleclick immuno oiling kupe hooped vincenz billeting janjaweed sidetrack newsworthiness badcock erlandsson iws guentheri heathlands tox mountainsides resenting guff fumihiko iah weidmann durrow sublimity dengel fatalistic kelliher ond tatmadaw floodgate frimpong congee samaranch steinhauer churchgoers outfest garenne selflessly branchville edme victuallers fransen airhead briatore shamefully catelyn bargnani stridently delahanty devaughn shabelle elmbridge adamantine buffum gotovina kratt meili waylaid kyren jamb monfalcone delap grano abdulmutallab cowrie mediacityuk nomar klerksdorp meazza laxalt paramahamsa celestin tasc dubarry romantique weighton rott intermissions muharrem ellingson odeh klik sighet bersani opoku ctb ljubo messiahs sketchup orczy crankshafts apolinario mélange kingsmeadow sanitorium kiwanuka alesana roten sabini darting chitons mantels laundress mccambridge wilhelmsen myriads redefines sertraline bitzer ohe hunks kozlovsky deppe douse brammer snakeskin quakertown toric cendrillon vogels cementum mieczyslaw withe microclimates brekke faxed highflyer majka cortège oaa atys imca tomales steeping femtosecond croaking fola kihn avonside undermanned etats wyner hoyte denk salwar carbonyls birthrate khloé claymont grotesques fawad legon customisable moland smithwick metabolizing kornfeld iatrogenic westworld frenulum noga allée cannondale burse apoc melito heaping pappalardi darell naltrexone chaiya spindly sorbo gatting luau foxsports hode ixion galler jashari kapiolani dambulla arditti salafism hadow skamania tnm hiromu inspiral papuans plagiarize morone umara capitole gagging watercolorists lundi polystichum relinquishment demark baglan stransky deridder ragib customizations frigerio swedenborgian bhc dyal tarbuck singhal fanelli titano juran cudmore finiteness adventuress appert ravioli gustatory brinson maini weisser ngaio smithing arnaut broadhead pash sweelinck dongcheng myshkin guerrieri 
rosenbach fenelon passa covet mitsuharu ciprofloxacin convulsive beringer bohli kandiyohi trackpad rauscher xincheng dewees rakha linxia hench acrylamide harnick weinhold balog countdowns reshoot hellbent bagri salsoul yaf hogshead buzek beavercreek chandlers melisa allcock shaista diamanti parvovirus polychromatic quirinale ugalde thronged dawit knik commodious wra saleable yuanzhang palen apax cutthroats wecker biela dyckman pattee olesya herry rahel anodyne nrel tooheys grint maré huu charanga metamora reducer dimmock corzo chelmer regressions mylan khandan hundredweight thakin poppleton leumi espaillat marihuana jugan subcritical melanson korematsu colliculus forestland outeniqua agos xlib rathmann bainton wut pennzoil fignon corries responce quoque spon madiun siuslaw plushenko kerin undignified dragoman golde savannahs rumanian makuhari único swampscott edicion blackmoor arshavin godhood viegas taxol whys certainties blokhin pompe marbling josue transfrontier decelerating leptospermum tkm whirlpools guerlain trifonov misconceived berryessa gusting specialisations misperception diga hohenfels rausing immacolata krg oprea abdelhamid fryatt balks highclere indifferently ottinger dgm outpointed workroom ncm spinet rending deactivates stayin iordache exergy turina luskin fluorinated derrike keven rüsselsheim flippin nother markéta macroom bgb chini workgroups cavafy hedon gentilhomme polyposis wiremu severnaya syringa shoveling rco embargoed streptococci jincheng izaguirre damsels cellach trueba aviaries ceiriog benbrook ashar damato laxey lubango imanol shako monteagudo arrigoni footmen supercups ashkenaz wenhua emina truesdell vhsl bristlecone middleport gwon sayo uselessly bleriot vtc ruggedness algie givati rogerio teleplays shero homebound counterclaims ballynahinch hewer reamer humm dreads södermalm lenahan dishing bamar crossville helu commodo boch meanness ony takehiko rohwer suber slappy jurgensen paleoanthropology holbert dhananjay gallura jannings 
hematocrit lecca mazatlan rompuy paddick rastafarians perforate lrr volcom shish luks linthicum prydain lewa yaounde infantas nupur grunting tahr promotionally bizness poundstone tbo doru barbets gether fenugreek musher usfa renai addysg goldmember remco vme scoffs weimer downplays vegetatively dtr grotowski ducie biederman eugenides atiq applebee reckons borgne beaching sarkissian mangas gobbo yihan greenshank moder horch tinton ple cupids yochai hitchhiked hitchings littleborough jaen pranking sturgeons metreveli rmr kerins madrox animosities anthropometric schottenheimer kosma budworth hande gatehouses oxton hallucinate clicquot lenght townson partum leftward whitesell carné hqs redoubled overdo mcilhenny waterlow veale arkle nudged iram aeterna heptathletes daisaku jmi azzi lsk kws superheavy shushan amh madryn storyboarding semler hiiumaa ryley madar haseena bloggingheads cnl balco fitt bross toadstool ghori muting elcho pimm betacam mojtaba cottesmore sincerest mcsween outen miscast grytviken toku poutine jimeno kendell designee norinco grisha lobi stampe adze genco siddall brindabella trishna gafsa hjelm ramires sharett morrisey halabja irks khand gravenhurst flighty tofig tactless ofir mackinlay gerrans hurray freeh kena personifying faulds nedra tarbox maybole pavao knm holsten sidestepped lna hiba grieb bylines trobriand bunda orkla snore peikoff padden wookey putte postmodernity bergquist eab pecorino vowles ytb fantoni laurell tangerines manica archaean janardhan intraoperative foxconn bens 。 palden razz pugilist smitherman aimes scoff tutwiler candomblé cornstarch epsrc katamon sunstein taharqa makalu rothenstein parappa sambucus matvey jagex grinberg kaleem altobelli clambake eastcote reiman persaud kentridge metroparks gattuso luzhkov moho sjeng battlezone braybrook hornig talab photometer tyros yabba kenema helluva gle wooding ispahan vossloh dinning chavanel necromancers egitto lyonel texier calamaro baglioni wickedly saviola brickhouse nier 
chiemsee siebe kelling cherche habonim kleiman telangiectasia svi alpino globalism reheat winterstein haggerston vrede reaktion furtive culler icos dair arkadi fatin backtracked unshakable funck chital redbrick ziona soufrière francophile beauchemin republique thach noetic uniacke thore gyurcsány asheton calorific mariotti naturists torrealba shahnaz buttrey khou crealy boatload inchoate crossfield lenis beloff vanstone decimating beefsteak cavalcante jaggi weyr rhymesayers nankana stoddert satun schikaneder brownsea desormeaux carphone quercetin exacts lucker upt subhead chandrayaan informe modellers stonecutters listless equines dueled barad beter gynecomastia teodorescu márta vusi crac diyar yeatman gile niggles spurns brinda moschata reu aile ptu rehearses kaganovich cyberstalking piao gtld backstories costain bullosa dicken engelstad sways strahl crossway efp vitt recaptures delimiting gunnel cupped zamin plexippus easthampton polarities munni michaelangelo stuxnet unperformed dungarpur kinnunen torchbearer corb oauth diagramming defibrillators heynckes wwp fabbrica slaveholding tippet baudelaires wersching roquebrune traherne chinstrap bordoni combretum catherines farook hutterites everwood combust widebody savita poley kenwright hauntingly céu galligan feebly merrier gillray kramers ariella kurban burnes oligo anoa foret cinemark korangi uffe tolleson mesi augmentative chafe voiceprint devaluing neuropathology electrospray takai semin trivialities unserviceable measurably inra bgt rizk baria artichokes waitt crenna vertol irondale charlebois intercooled exacerbation drophead cpsl amidon hamara bereza jackpots vindictiveness itr sinless landsbanki determinedly kinneret tucks maclin yasemin lipopolysaccharide lidice canopic octogenarian ahti americal harket terrorised protestation vanillin problematically aksai skud jamyang pouget pgn marteau atonality folau microscale vitruvian okhrana sagres cadieux jianghu corridos corrido voeckler pennie deloria anvar fto 
catalytically bluhm minks blackmer ulyanov mnsu institutionalize hankook brights wats baskett superstores bulow comey tanglin retool caporetto savvas kangerlussuaq kromer purdon curtly gruyère sitek goondiwindi theodoor kaunitz pmk demographers bilitis landin gingras moxey htlv seybold lazytown harline chevening turchin malmberg pjm hydropneumatic angiogenic doornbos krasne interlinking doko spectrophotometer andaluz xstrata shahs annalee aak vilá hamstrung monserrat arlie ziggo urubamba chiti ionising emeraude crosswise krannert jansons sojourners hugin arthritic butterscotch diano assassinates cafiero torpedoing hearthstone exuded uschi alaba cozzens taimur longboard cbg hogwood silviculture hourani bacca kfir kanha maudie mennen paveway orosz montvale registro paiutes tavor stretchered giesbert dingbat copeman wideman chayim filariasis longline cedrus matamata cohl overachievers voidable kaili miyama micmac umrg euskirchen contiguity subcultural facemask lesar raconteurs hevesi wangchuk yili toño stephano finks sekine bangin payrolls desenvolvimento draeger uncg muirfield zat gadjah hinchliffe quadros mellis berlinetta crestline dannatt trireme serenissima zeeuw llanera ideologist mapusa callings gaudino reznik reuel broadleaved quast pluggable soiree docter darbyshire publick disbelieving burghfield snus shubenacadie jodha bardolph tenda longbottom sydnor breakaways crescenta nabbed anaglyph tsukada eleazer esquina confiance crisscrossed tàpies dkv harr murnane armero cryogenically coteaux wrey naaman maim alemany finnan newbiggin kezia poto elser nachos yushu kulak gunness shapely fraile zollverein wolper abschied freek chartism hussle codebook petrone liberatore borinquen carreon collinge kaolinite zubkov impregnation kaizen spams nester demyelination rearranges lagaan swensen faull missourian kooyong recommence clefs glaw ahrc utamaro finanza monto biphenyls seydou ouvrier echidnas expensively cranstoun fiero matagalpa headscarves absorptive inam bakari 
ismat mauk amorphophallus eae rosaries loosestrife boetticher aumann loanee soumya updater morier nyenrode vatan sophronia hutchens sulfonamide matteucci gcw undocked huaorani sensorineural underexposed negaunee tmn rsu cuhk brunello superjet contrite baggot foxhall safdie bratunac aidid beranek mercuric ulis braving forsaking latifi zacarías jianwen clickbank caws ldk caryatids xdrive janak mde yarrawonga mcstay zacharie glancy llong feuilles montalcino illegals armitstead israels defoliation ppk unai aceves supertram itzá lavelli warham lpb varlamov império ciani kringle thiry runcie eightball michna vlbi dahm taxonomical donnelley rebuking reddin haddadi austins hmd vipin fantasizing pallavicino acheampong rivularis mineralisation stieg relevence scallion loges iheu haloperidol edmiston stevenston suss westfalenstadion jomon arlberg mumma comsat belorussia yingying thrawn chablis novelistic lyari lüderitz flounders transavia videojug prattville junon somogyi elah yacine soju crasher lono carrs brena prell roseberry dico olivero ugyen mahboob bourdin luard vacationed autzen wrongness fredriksen romagnoli itakura maynor lungo luminal freakazoid leete finniss doku preclearance assayed unconverted groundsel lur cholerae hungarica fitment inec steevens anaphylactic vay eckardt sallah centavo ayios schnorr evenson ashtanga dymock retest brancaccio guapo semtex mvno tianshui puleston intangibles graveolens kito cachaça palika vivanco neander treadaway ouvrière schull legum expediting varians débat maitri adeyemi bielecki liat gerri olander tecnica heger apta kumu obviated soirées broomhall loayza ukranian unfixable papanikolaou elkridge babbit electrophoretic suphan henoch treorchy romanza knobloch roché brownhills horsforth karoly orakzai luol jme norville granola gaviota knechtel arbi makowski willingboro francistown ifsc whatsonstage beltane bilevel copia coliform heimann actra ishan cleverest hurtling drewes negrón shahnawaz astoundingly staph legalising 
legitimated abdurahman unarmoured czars catus lukyanenko mmk germersheim interrupter sangu pipkin antenor bintan chynoweth noobs boulting ncat hilden partida crailsheim lunel edan mosi alpinism polyomavirus shr satra psdb nonconsecutive hawkesworth pallot tancharoen maula fwb gunnedah baraat mankell upholder menstruating reasearch exuma creu veras hurum ngb portholes oropeza umr hebraic hems germanos contemporaneo wiggling contos ravenglass bellotto brooklin gruda hammerschmidt hagger chavasse raggi falters blatche cantillon huevos leser habiba grisaille villalpando cofidis peopling ladywood whet caddick studiously unimog lemmens surfed tisdall strassman beefy falconbridge knauf dalgliesh blocher marham mossop hagiographical abdala krauser honeys blairstown luangwa liming breitling bromance pegmatite odorant thur nikulin bogosian energyaustralia judenrat tvoz dyfi diante carcinoid ecosse ilmor guice hursley excising unpacked bln cristiani britains tofte upcountry avezzano bhullar kazys grigorenko pushrods cinch rockhounds fernhill clamour obstructionist steagall waster factcheck kimon sensationally dickensian raduga bantustan kames lenzie minehunter reca underdown koninck ruhe medard wmi selahattin succour zipcode neutralist abdin dewberry redrafted foulois zucchi lyrebird valerii slauson fedorova groundcover wenshan rainville janowitz munz contortions mutuality protos greengrocer hurra warhorse berline sheaffer edenbridge fernandinho lehn amphlett ergin ballerup cbh progeria bayrou tamerlan wtbs unenviable manam jacquot mactaggart angol freediving shapeless attia moscato lhote westbank bashan ghada anel codling zsuzsanna tarak cech guthlac younus perrey chaparro whyy hoje liff headlock fdg lofting mous porterhouse joely brosque sutan snowbirds heeren tereshkova marshaling hudgins everbank undiluted sarra radhi yaxley neumayer andamans yevgeniya gervasoni lederberg strathairn chakrapani cauchon dagg womanly flórez gaudreau innovia pilecki mullis gatien aubagne 
duncanson craver oesterreich nureddin dejavu swooped supershow autozone lengthens payan nrr paramotor apperson kleinfeld siskins udas hematologic pdn dishwashing staniforth caminha stockyard cyphers masvidal theus thm leschi wispa bugbear herber dolomitic remarries cymbidium powergen movember ucg sylver wakil hayhurst tetras mudgal atlantas contepomi wend feversham thure plainsman roenick interethnic lubomir magtf soffer gamgee marklund younge gowland mingyi inclán hassidic salivation reichelt harlon bullis isv povera rajabhat predella wemba trev grantchester photoshopping slants llave goofs levitated conformists diophantus barsuk trews aicpa isinglass arland architectura dexterous pecker queensbridge mahlangu oadby benedito psychophysiology entices dandelions eus coudersport suginami douwe hagberg piatigorsky detweiler hostiles ology pandurang imperishable hinks ryuki barremian mru emulsifier ensa leatherstocking dewulf lasith fyodorovna asifa ramlee mellons documentations platting pointillist nifc silivri bennis croly nugroho guglielmi casati mpe alecto nymphomaniac nasties carny tarar shilts zillertal bankrolled desean florestan zongo amoebic infringer vasser hagopian tomah lipe catamount mewes pout tervuren chowdhary myocarditis hungover mauchly jadranka mediatised portslade gonad westby noiret minear kinesthetic saraya tuks abstains smulders qandil winsted shepshed ravensbourne rorem strathaven denaby macgibbon icpc drachmas landesman overpaid redcoat butyric putten parata crans sparebank handguard vowell bhama millinocket wuerttemberg hinoki gutfeld bierut tenuously nenana brevoort tippmann lindman saltz href cypripedium fmo inclusionary tuggerah sukha mcginniss mytishchi batton tesol lightships lemp valkenburgh prayerbook bangalter neate rdt quotidian hashana kidwell vetoing paille kamber encina usvi grenz concannon gumbs batan bakhtiyar pietrangeli antis abedi selfishly honesdale pank scherr bertalan namsan stricklin lazaridis spilsbury undocking cadenzas 
calistoga cenote erms unrecognisable jurica tokoroa arisaka maroto warded quee kristan singel vrenna sinaia parcell quizzed wails waltheof pentito ridgeley feig laurus eku izbica lainie pascoal breydel ryans proba archdruid captaincies spahr iaquinta polygamists astudillo cribbins macguffin solá derk juwan streambed saimaa gerdes lleras sando tinkerbell caressing renne arnt kastrup esad baganda vaka cajón superbird endovascular automobil castagna nityananda vinicio anthropomorphized entergy majic blanchardstown installable cynthiana untergang karroubi thomasine expropriate geagea stepanek bellone consequentialism ghrelin hexum lucchini towton ucu reginaldo resound canonizations marye grounder geoffrion bousfield dunigan ghiberti unquestioning punya gedda ostermann witz ordinaire viau gigue tomino tropicale selander boeng kosmas kattan straggling ikue outpolling sinon xve bouma tifa bambu liuyang olymp glasper bys spruill serry carriere cya acda ipkf banzer evangelina meed spellbinding kookaburras mikita stanbury meimei olivas colucci volubilis swin narcissa skeen stomatal anansie wgsn typus andréa treadle kwanzaa bettors mdu adelia gogi ochsner svan kazue southee menna grenelle woodblocks musky caligiuri rvt assante bumstead khadka atla syngenta zviad kompany mckern lipietz ican ncua allgaier buts seocho pred hoofed surekha fota sleipnir siff festspielhaus railfans amanah casements arabized arnab kolk mukund megastore perkasa railyard bocuse broadland haddo bündchen pedants bundesverdienstkreuz chinoise gassner nems usap winchendon troubleshooter sanjo anaesthetists lomba huntersville mckelvie sureties insideout woori allayed vrml mopar milarepa rafinha toxics thelen knockoff dùn nibelung terrify gomm emendations blackstreet beca lynas mezcal kiplagat dorsch ardzinba simbirsk vinyls pretreatment witchfinder smallness enis tshabalala mctiernan roundworm likley boudet evolver lanesborough notate flirtations fujin evetts gwenda subroto goldthorpe idyllwild fantasticks 
badoer jouko buti jayenge methyltransferases dotterel mahamat renmark psittacosaurus wansbeck cossa hypertransport wingmen kalm lundby gaer reclassify phillipson hallahan invigorate binn intersting deltic kingswear casali interrelations salmonid luker krier arncliffe barnfield phthalates zhijun kds saccharin codebreaker karlis aberfoyle rhayader bouterse precognitive purchas dispositive timeframes lowville polyamorous rapidshare korkut bonnin herbalists squirting tlp cobleskill limericks mihaly infirmities ribisi swiped reut ganong calvaire cystitis sarney crr heworth westberg phaistos adjani cryogenics namak jaquet rajmahal yanagawa nollaig sirimavo swerving taplow curva gallops moscheles almer exc hexameters semiannual hubcaps yasuharu finalising yudkin dentsu polythene tartini bernau southtown dewalt seim masset lsf paga expiation monopoli opennet kirker behera sbm manel clod songster sonido obinna cooldown fiberboard pomaks signifigant nègre waiau farnon mofo gwladys philosophe postiga murga fasb farmar panjim firebombed leddy sukhwinder eichberg zhaoqing amputate fishpond ahed secretaryship teikoku antithrombin uechi pintos kotze doubtlessly astrophotography kolombangara danh congonhas hanning pezzi hejazi kilohertz religous pereiro caille brislington slumps cebus omnibuses dislocating tristesse patthar firbank virgili monie ccdi olancho recategorized internee rison vorst sorg grottos germanwings virginio avaricious abutilon algemeen lamest scotsmen ihh wenz ruttan montecarlo panizza maco cayton nudie toronado hyperparathyroidism avinor atypically phytochemicals anamur ewok carpaccio rollerball branwell gezira furqan clemmensen kingside umw reiffel makos scattershot fetisov oesophageal sardy gunk smirk reputational bisht markin sequenza cashews gambusia fairlawn gorny steeg upd marilou emami dhm bipedalism ires lötschberg psychos pemphigus lithe ioof danzan yashoda subsumes arsalan sixfold larijani vernalis rotatable banknorth soroti troisi srx avt shorenstein 
sarvodaya devey bolcom zsuzsa shaurya paulton candour reichenberg quicklime diesen jailers cocu unorganised williamtown ryusuke qadian percenters brookshire moet viken borane slagle unuseful cunego sautéed tlds zani balderas mixta gauley beu salaryman jasin trona aliona missive dfd alcindor sanko tole siragusa inamdar macchia wernick subwoofers magnox fras jsw reinstein archtop superimposing kraton bleibtreu peñalosa cleverley prideful aberffraw juvenilia etb criminalised temenos lieberson pravo chapron taenia sarhad wog microfossils procures endorphin kaus latinization hamama gorgeously crossen acmi wardrobes samphire plateosaurus margai gorenje dabba teterboro bicep drouet unimodal swick bickham kipps consistantly stilo eniro tene ostrow maspeth leydon ciardi debunks femen barbastro iguazu mishkin bultmann reconditioning evangelium imperioli dastan kallas schlöndorff groupers westhoff punakha savina veja semion oxidising blacken inklings egging babo sozopol qashqai tresses awas agin karanja lenthall inhambane sabmiller rpd cherrypicking defibrillation barakzai werrington starland mondavi sukhbir centralise snugly recordkeeping tomczak cedilla monstrance idrees loreen boalt bobak badran pasion evison cangzhou luling vohra jofre aggressions mantelpiece mccrimmon mahagonny wuzhou wagh bigham publishable bonynge klec oligodendrocytes hadash dharani engadin darkman doxford hottentots pátzcuaro lattimer marner amarjit einion leptotyphlops coeditor uja pitbulls tocopherol horrifically gamper redoute stalemated topshop banno mehrtens baddies kirino anusha pétanque tentacled goolwa earles withing lecouvreur partitas btl arcachon jeffress fnm burswood dornbirn shiprock tilzer megachurches axholme brochet cognoscenti attiyah quinoline sharansky perec jabo noblewomen windjammer thetan superimpose quelea oberammergau reanimate globalizing mhr tussaud cotai ezeiza eustachio kuai etsy böttcher susans ventress schrock hylan crisper glyfada huub shazia amott debnath chairmanships 
uros dhana vereinigte newsies jutes greenlawn anandan aphrodisias dollies poku debré slee mdpi haulers dimittis dragstrip jô arlesey müritz nagaraja manulife regionalisation leftism nasscom molucca bastiat cerone concordes rdb ondina ktn wellwood authorises taillight sandin balcells baildon hapa daire tavera refugia foma tarzi shoudl barthelme dayuan ruperto elea posy nazarabad irin cyanosis vetiver vivants emich downslope minurso blandy patteson climaxing bouman dónde ladera akdeniz preform gielen lunacharsky vrana zaloga lactea mirman traiana bevans acclimatization omv marcelin emei otterlo dromaeosaurids cardone slippin wunna freres carlino huayna brau azman ghi ngwenya cowens polyatomic timofeev metalled surer collectivisation ikebana marilla naegele lignum frensham viscose cratons handclapping tyee terbium ascham elopes fitzjohn burien rushd goffstown nightside graeber federalization pels provenza interposition mahonia extralegal indiantown lagat beanz zisis cept morticia hazelhurst babli upplands duveen lamarckian amandine kanouté attentively wayamba nonsteroidal pims muschamp mcgonagle netcom furthur kse cresap woyzeck solli lhamo ivybridge fernley wagenknecht shusaku norrell shoud rono swindling siegbert tauscher umbral sassnitz tufte christofer pumphouse overreact laboratorio aksyonov puyol kagel limburger energised jeering postponements zephyrhills subhumans puissance pilch croyle inconsolable thistlethwaite chrysostomos eruv kezar overcharging marigolds stfu abergele miamisburg riversharks magaddino goldener foodie blackshirt hershberger terrasse ragni cottonmouth sessler swifty propst plucks abaca siar qishan frie deanship amanpour parkinsonism randolf torquil cowed maika gypsys alinsky hauke unicom sahak knabe insubordinate anthro fulmars chesty metrocard enclos hirschsprung digna danehill dibaba ubm bonapartist dilfer dinard tortorella carwin rifampicin pieterszoon kyme conches doggerel externals orlin wolbachia poni mabe damai zem speciosus 
hyperlocal counterpoints sovereigntist synodical poldark hidayatullah uce lomb shopaholic mansingh batfish aftereffects candra bucked aranguren beltz sfv regev overrules hipódromo colmenares vpd ballerini vaccinia wanita berlinski pulpwood reconnoitered azara iiss manhunters weyler opsin hape parzival orli medang cloudiness feeny plu lehning moranis cueing gadchiroli sammut colavito khmers kishanganj kenjiro photoresist galliard tnp oxidizers kangan demps cockfield suhas outspent perceivable boyington barzan savernake duman lasch callies erris raskolnikov tamborine holdstock djarum telemedia saron slayings iese lydie paraparaumu gunhild gouna llandrindod jefferys barin ardnamurchan doit gazzara tbe vélo barberry rebalancing jassem hti worshiper yoshihara narda giz lucera bedwyn leinsdorf heimdal seismologists donoughmore lebon hinduja sciascia confluences bht buitrago ambling castaic recalculated towada mimis schoorel liberata gatty adjei constantina cammie feisal destabilising payola aramid beman ballclub parmigianino grimwood loredan icca arida dold mastication humped totemic rearview rüdesheim delio zoque laminitis shawm majoritarian backwell papelbon jacklin heythrop somchai rembert jobseekers filkins astronomic metonym gobelins wallner durrington bocking daulatabad tamaqua lrb ranjana wakeboard defenestration sparling burgled oseltamivir godliness cirri montagny marvelously teese manea broudie gringos chila prattle arriola evm sepulcher bighead oreilles gfi mtz beefing anzus denisa salmonids grantmaking ojukwu alliston stabenow maharlika diep posies duele holing retarding adey winmau pinault unmanaged abdicating bombards ruggieri btob roddam npm piscine gonne sherbini unsentimental plimer silicosis woodcocks karapetyan chango koeln primark taylorville jabr ehara torossian rfr dack brewin neckerchief backes yucaipa medjugorje wrentham zeelandia umut firas parmer nathu sny anoxia workbooks flues hamata discriminative merauke dracaena momoh chagres unicoi tampons 
sovetskaya ratty irresponsibly confederal litz hoose guardi hamby latifa thorning monne gigawatt wwd nachum banastre rdo bassée buckfast pegi dcg bactericidal moloko katydid quarterfinalists teste eurico guillet mrqe vlachos maxed intraparty pastis smuggles haak breccias derwentwater faddis beakers echinacea hmis ipsen balloonists bellerose lopata arter padel litherland elmhirst creve mgf residencia christison guntersville yurii scribbling atw macaronesia warnecke borton beguine carline ankylosaurus orana platanthera tullamarine nfcr sequestering saltville pivo wts lostpedia purefoy largesse gracenote allbritton seneviratne subsists belfour freundlich banchory forgivable fse keyworth gaikwad taxco aleko diaphragms maderno doob middendorf contrive dialogical methacrylate leask armless hoodies mucilage pelias dibs baciu gerty vigeland tarhan nullifies hotjobs icsid moutinho concessionaires safarov legroom snowbound vaslav svv bioethanol heyl iztok nolo eschborn webtv fromelles dalgleish mulrooney wacs kilifi celtica aberaman shabak chesters tungurahua dungiven brandl sinning jarrold krasnov pontardawe gradings rands smithton carnera kabeer cipollini puttnam sidharth semmering corradini mercantil dribbled breithaupt setton kalita embalse pierogi fabray kinglake sampedro fireproofing mentos kochan mussina cissie iconoclasts olmi généreux antonovich archundia oversteer anpp wainscot headrest mangles warnke pelin tawdry gimson pinetree daxing shoemaking waists bosun deverell superimposition efts kinane eloping salu tobacconist mittel maranello destructible florine inq aylett mccallion lidiya gaskins gundry pudgy minimises embury bredesen avron desta isandlwana cowlishaw coreopsis littlewoods bettino kagera widdowson ramot humanize spironolactone ampatuan overend penname rainsy solovetsky menge freiman emotionality deprogramming garro drumlin globetrotter abbiss aerostar wspa xingtai maghull enroute lefties grobler technopolis suquamish calumny fazli pennypacker steckel 
beobachter ahm matrox vronsky creaking caston izet foxholes botterill otic knauer wanchai sigfrid tiantian pescosolido viveka killiney quim cilmi hannum opinión medios localizations ballasts mussa glos redecoration duer vcp totteridge vuillard superfinal frontrunners atanasoff prophesying saidabad shyamal temporomandibular scaler yamashina suggestibility akh optoelectronic homophobe tongans illiberal rotisserie imari vitelli akinori paci pelaez cristin madeiran pepperoni phunk demaryius workfare concentrators sneakily yogendra throughly prolongs ballyhoo mccolgan uptick powderham wylam mixte khaleel sciarra safavi goodstein psoriatic shein conurbations krys toughen osteoclasts kamlesh bigard hyssop ertms moustakas toenail tenbury nessun ohchr brison legere alpaugh garey tabulate handshakes imagist volstead viren thursby beathard malanga ciaa tensed goehr ivermectin pni recline iurie rostow treetop montargis merak donelly pernfors gherardi buso manat arterials iaas cherrie minelli saski scammed esna deyo backhaul eaglesham mcelwain zimbardo fitzrovia adjusters jolfa tomographic trofimov offeror lamplighter thibodeaux fawaz zarin sahai cornflower struble nuk undoubtably hovis sissons commissario westcoast dume mkv sweetening borken jiggy tiffani shontelle oumarou scouter hildyard pickpockets bookend noora birdhouse massachusett ibogaine tuborg jrs dardanelle spatz refloat enteritis jibes haddix milledge mcgaw blueish fiac klausen thatcherism diffusers bolitoglossa disenfranchise yaquis decaro calvario barberis kall pronk darron riso sfpd kohout darra fodio aquanaut zender magique shafei omran arvizu nito slugged hemmer osric fokus bergenfield loulou soran tailfin mathare ddh lerch jidkova vci utt ndrc barthez barnoldswick forsteri earpiece eunomia freewill kujira peristalsis postmarked redubbed remar wakeling cassim materazzi henceforward pitlochry aflatoxin renaudot blakes hotch rearm skaro shuttlecock mahram untag maybeck effeminacy algorithmically forint valeriya 
lampposts recio feher compère hikmat kayhan csas naras unemotional purton uttaranchal takaful holliman pock guerres polysyllabic mapreduce xabi golmaal arbab heiland comite kaneto puzzler darvel jitney bowline fownes suckle hondurans sousaphone cramlington dessen seacliff gourcuff halima isos sheff bohinj chins bricklin cambie chlorpromazine salutatorian panayiotis gigawatts veces potw branton wjr jamshoro brünn adderall solfège grantsville ofori maisy britishers hidenori overemphasis isser miniskirt inclusivity yifu fura pescadero maralinga bdk sullavan pierlot ngari carouge yacob dymchurch kuznets ayes powe perfections herczeg playsets eritreans hyades geochemist mashad reverential prestwood nith gosper hiaasen coati laetare zawya moar blott helliwell onstar urayasu eberstadt sobule jiaqing fluorophores roys bolander cchr zhizn wabbit ruxandra pyros prowling piazzas fegan arabo sleuths exudate vesely challah xinyang stringham breno maldita staci embl edric strathnaver kanchenjunga wetumpka erard domestique bfbs magri madelaine sixten muny luiseño crewdson enewetak moru azurite externa prude ilulissat indepedent limiters varitek offley fatmir vfp kabushiki linq daryll vdsl isea andelman ascione hendaye occlusive budrys barite fukuzawa headbanger gallois uyo arteritis buu tulf carpetbagger brycheiniog inducting onca rifugio keleti caravels crecy icra icaza suwanee keldysh terps ahronoth silicic trexler feedforward swamping frn gangstas sulfurous premotor antonyms lutwyche madruga badgered fluoresce gerrards reenact swot wews kellum toliara bagenal touareg hantz rikkyo llanbadarn wappinger sproat bordallo bircham calo swaggering multifarious pallbearer amundson macondo renova triptychs thalassarche zenawi applecross manser houseboats lezcano kic goldthwaite pmu ahman ferner revanche rahab yone dunklin shaare pseudonymously cawthon xoxo maumoon leerdam seapower aom chattan ledes ferrin topa ramechhap shunde ehrenburg denominazione kupper bhabhi roath ogan coderre 
chernova raes unaccented habeeb fairwood poignantly horder aviles revellers tyla earplugs dehloran nml linfen eveleth mccaig anyones sokka eversley cupe grandaddy nsclc coevorden mcelhone yamen excercise udell tanwar hogging cjd dulverton jairam ubud bogong celyn sektor hongbo deportment inhabitable streetsville cherrytree salerni bottrell arati presupposed saguaros blyden fase talento azhari bookplate rogoff larkhill winbush giraffa trouts vlade evaporators jagmohan vakhsh namecalling compacting morigami hanza reeperbahn rossbach barmby quotidien distel monkwearmouth mindedly jobin unrolled tablespoon extraterritoriality lmb agnon alfredson consistorial minimisation dweezil superintend joran strock kpi overvalued baybears skipjacks crugnola memorise cesana speedcar deta morphic knitter funereal passerelle thrombotic vocalize priestman wordstar jarden khieu confiscates caouette anquan ishrat kriel sizer renae hubba euless moshoeshoe uchimura courtside aspirate cohens wächter picone uhlig estey orkin kopitar bordo bardin mugler cerutti navona lysosome algoa youri unexamined objectified eweek mamdouh pakeha brooded yellowcake zst schjelderup churton eprdf vetinari bucca carriacou lemus doakes tekla cleanroom npower housekeepers junipero cutest wks benefactions bianconi wycherley nna heintzelman aok kabakov manthey unpleasantly privée kappler hasely clifftop nichelle workup broxtowe nervi lysimachia carlyon bria jaidev karlsbad albán brey polet trypanosomiasis overspending wead serology laming richemont ekelund requisitioning rille incivilities moriches minagawa insulae vilcabamba kerrey tarcisio battlespace charalambos allready vampira extricated hanoun vickrey orlowski solitario cotler sobhan puertollano escargot taze unburied disinfect zenga egyptological afarensis panfilov corstorphine nobuyoshi laches curi lovestone baigent slivers metalworks chaiyaphum molalla lollar brandish enraging ambu kiyotaka arcady roughened carlie lavoe polamalu dorrington ascendance 
kracker fychan agarkar nesuhi balkhi anche leyes myaskovsky tucholsky tequesta drita quoin texoma depreciate vladas wreathed nnl severally filmy nuna hagenbeck wahhabis medawar vergina thiebaud braye vremya yabe nafees munchkins herri alexiou senorita situating smigel vardanyan prolly clik glenarm aardvarks atleti vishwanathan carita ségolène eboracum rarebit lecompte shrift jayasimha cándido centrica créole morrilton skittish auron surfeit dilettanti ruban cadwell glassford kaneshiro ukyo engvall grajeda buckenham chéri pember immunofluorescence grigoryev aleta queenslanders hawkhurst hundertwasser zeh tenanted turriff simonyan skg reedley wlb supernaturally uncommunicative broadhall kobayakawa kidwai walrond lynnfield nanos vae problèmes cica unreliably postpones issoufou chokehold abalos cuffed lititz sanaag lowton fairless hauenstein knap vitals senat choux tewari ballentine huguet zarco mecosta mischaracterizing moré superieure tunneled doles sweatt mersch rantings duesberg routings patey spick clondalkin iinet tahsin finegan mundos togethers containerized livesay mawes padania bartek neubert scocco lvg seafire ghoulish rectitude choros kikyo brinley elte dwarfing unthank misfolded gerrish aaronovitch fbb cyana scrimshaw pannu nabob laghman nodded heffley consumptive prologues aise shubin chedid treecreepers yakushima evasions shiflett carausius overbury stapley bombadil wooton secc putonghua rtk taiki comeuppance kayani nejat cattail pxe greenvale pigg holohan hurtigruten naag lodestone kailali goffe brynmawr peskin schemed musicor cottier bayu lijn relives mccalebb bingbing sealion kove foolin munde hpo urgh hackles ignoramus steelton ruyi fantasize kabal bordelais norgate lorene assicurazioni remixers lavar citicorp bercovici tweddle rowlf upv zoll morrice jailbreaking kissa küng gien dmytryk busbee gabapentin commack enr sugo huangfu fptp metalmark winningham decriminalised wingsuit jankovic aptitudes kelvinside bbf jaroslaw westgarth koechner ecologies 
taronga bramcote varadarajan definatly hiranandani dippers endotracheal morisot arledge kav feininger karratha krakatau prandelli hargus lloret eveready mimbres foots olo garching monotherapy fatemeh breakpoints cordele swig conc stanek linsey dyspepsia lysa wakhan arauca throttles symmons siebeck lagarto navdeep unresolvable virologists dolezal börje unterfranken uop semipalmated fridmann runrig etonian shumlin lashio jayakumar dcv titchener steelworker munim klaipeda crazier lordan forbath carbury akhara matthey squeaking hardenbergh ortner augmentations glycan boitano mamiit hinkler sukiyaki tashlin dimock uai baddiel baryonic platformers venevision schmidhuber ransacking wisher patuakhali calisthenics jessye batemans monga agoraphobic karon robertshaw meskwaki fattest bugliosi fishwick lebrecht funda chayefsky nicolini mahamadou plaisirs radyr virtuality feldt destructiveness pazzini vossen tandoori brotha huitzilopochtli vedat bestowal nagler muchalls kaustuv vias homosapien snooks enforceability proteas sapulpa despairs brenchley uninterruptible westerbork naic nissin aliza hochi mutism aspergers tawe tumbledown cavazos coveney tianshan treece isfahani mireya ndegeocello sindhuli lamarckism haldon wapi sunstroke brezno twitches verhagen vtt budai linkers klinefelter belaúnde idema evreux nalan virtuosos furries squish consolacion tims numinous noria ensley merlins krok silverstream brevik wassermann azabu latimore arnolds kennywood trium marathoner fischler mirka constantini stooped laning lembit sherilyn kneejerk %, inglefield outspokenness carting mends vep buca eil lambretta baun mcpeak shahidi wenonah ohad lalas wkyc reviewable blaker metroland dispositional javadi esquerra shapovalov mawhinney marquesan veith rustle wielders krapf kaleidoscopes compartmentalized zelle tontine sycorax wisse melvil buffeted pangu naat sags pgce inglese informatica uncircumcised wauters coryton clavis kazuhito austar ellingsen atha abascal absconding sardesai rvn ardoyne 
uncannily hospitalier godel descript footplate dadaism wecht polden povich bloo pecs pranayama ensnare frasers amies kartini penances ryon mcgeady trafficante preševo fincke glendive ramsdell babacar yakir polyneuropathy ewoks seres sudesh pagett kizer emoto twiki intellects soundview traverso almog steadiness coffeeshop mandaeans paperbark sambas serle mabey derbi erceg bottesford hocken silurians tokoro apelles dugger flemmi milnrow haarlemmermeer calpers uncounted quilliam bubbled brue literalist alphons berlinguer lipchitz inhouse flumes engdahl llorens flamenca woolfson escándalo filers koppen filarmonica zorg gobbledygook jarome microstate rackley stickman toph felin changjiang dusseldorf schrier heda katsuyuki abdon ilja pallotta cratchit kolker flaked lbp pendent placated wowed electrum barbato ayelet hymie belkacem ashridge sideboard vanderpool giulini eagar dickon peterlee horsburgh soloway clisson dynamix nimal minidisc bgn greef satirising huntsmen spadaro frilly ruffle rohnert rykov tbt spindletop comprehensibility balkhash karm jok nighter embroideries priti monopolizing accessable bushby blaugrana lescure nura jennet tessellated niyaz cinquecento hoghton mujtaba kojiro evildoers hanegev selver mckusick stoners neoplan renege valentim bonecrusher neorealist bracy confounds beachheads megamall ankylosaur hydrous glenna bambini providential esses hryhoriy pleshette laguerta casilda microraptor esben swieten jónsdóttir wgr mannington whirled seacole izmailov meaninglessness uderzo queneau scali reet humored maling lyly soundararajan carport edline beheads macdowall tetteh akhand geely tanel mediatek parmenter hugger clairsville aralia vsat hains muzaffargarh raker shiota yoshito narok cadi dayaks delannoy prière pamphilj madhepura carelli cumbrians prebends mallia navon uneaten staplehurst ulrica stratagems sitch saulius mkrtchyan luzzi goni jonglei leti millets corpsmen harlesden fattened khayelitsha caelum whoi skydrive edem norridgewock shutoff 
varejão soooo qft guacamole rajguru patman servet carian longmore hryvnia pavlenko aftermaths alí kneeland kiesling gemelli disparages figge confidences shebelle dosed metall cléo furcifer marable reenacting brockmann quiddity samant hiromitsu redrew munawar marber hakobyan ohanian pegah zanzibari wagnalls quabbin boj downcast mazie womble hitlist couvent gnatcatcher guangming narnian endesa sarfatti sylvius nirenberg aeronomy lobate faiyum cholinesterase abeer khadim rumbo serkin somerford nakul demarchi musty payerne manie papilloma clydeside treefrog hnc quivering kiat nodar amaechi previte stet jacquinot bilardo cullompton duplass mahad pervasiveness hafsa aldin hangmen suras noho bantock mirin kiew procureur chunder butner maskin cowbells dasilva consistancy entrees dinnington bhairab viju aborts schuurman westar quadruped oliverio rohs fritzsche winnicott gastroenterologist diphenyl randor collab golla hasard selvi scapegoating krt fcv hughey parmalat moonfruit velodromes tversky fiachra dialer withy cristianos hickel wetherall beeping rosaria peon lavezzi salvadore indri playthrough fundraise goddaughter lindt coasted pyloric cappielow annia hypnotised cantoni malviya meusel amalekites postulation birchfield albia audrain basmati nadin kitara profundity rri coh legno housley kerikeri dobyns daas brigstocke thali goiter incinerate seann defazio bobwhite dyskobolia briefcases magnetospheric silts readwriteweb khush crowland salkeld peyser wincott warby zhonghe qaim reaney horrell respectably wobbling brisket hitchhike anorexic rhe yanbu onno cashless doorknob salomons liberalizing cancionero watten nationalizing infinito anastasija grimaud yusheng fardc sexologists patoka foucher chasuble kiriyama inkers briquettes kcna bassem bombonera cupp valentyn dedi gammage hiromichi tormo danquah pecked mpx otb tinoco imeem zweifel bardet muong bossman promis solidarność glyde buddhadeb kofler babalola santillan manigault atascosa ,it uws codifies postmen wortman mvr 
zella soleimani carraro lindale pêcheurs jouni pigpen digiorgio velit nereo tinkered rotolo nitzan bundu grivas ebersol göncz peening ajah swill fathima ahsa inclining motorrad kammer appleman ersoy multani haberdasher shaqiri brahmacharya sirloin sahlins digester trumpeting maggy soloman massaged hallux brennen pictorials capdevila bolide newmar speckling feaver golovkin montresor araucana bucco fryeburg sunsari eike tiina kantipur guericke artilleryman sushila venal hairpins sportimes ealy desio concorso wardrop callup crafters bvt sanitizing flamel walldorf mangabey johnes langurs expansionary spanked sapwood razumovsky paragons reunify atash industrializing bachao baumer zizi meathead bonda blanck ablutions nikol serota soroka kodori weltanschauung khushboo chelonia meinert lesya fister vwf wtvj stumm nawada honeycreeper loche tasters oppo lacrimosa tenga atrophic bajram christianize navigazione happold gorgonzola daguerreotypes chiluba windstorms odysseas byelections westaway allauddin fauve tkach politécnica maclure sujet kiyomi vulpecula wattis marken orri gener peverell coben renzetti firefall blumen midgut carnets chouhan ardee deathwatch ansara loreena katari battey tendring arnar contextualization failsworth féile bouck bonehead donc allegorically franchisor uncontacted heitman arzt procurators turabi ukuleles riverkeeper placebos metabotropic tanit nahan lubeck hagee brogue dietl adalah botes kilrea morona crewmates vestris kieth niimi butterworths nanavati tills blanes olympos bachand cappelli gylfi brims cloakroom absolutly sigurðardóttir iconium allotrope leicht chipp rohana loucks dinanath alderete masochist thl francoism syariah wpxi kalanidhi basit ultan karle corrals schlieren dousing osney lwr hade tonhalle henney belie hadouken jephson blondy floriculture birkat interjected eisleben llangefni pitlane sysadmin friederich yokogawa rotimi tijani jsat superannuated senility ordoñez trinket soeurs eaglets morisset tgi clothiers fleer bumthang miev 
duhig naor maysles queretaro marama quaranta wuerffel sidley harbourside trym branka colinton keitaro lochan eberstein invoicing fincham shabba cannings tijerina matam sulkin fahri yapp waku raheja thespians rolan freies risaralda talpiot shanice akhter saarlouis hellberg buckden bodegas sanatoriums kanker pire succor ardalan siboney szentendre unst sibal luteinizing radchenko papaioannou minnis aliwal undiminished sycophantic stooping gryf zino timiş mwangi cyclopean roessler ibk chazal fractionally abbad gomme junkin breadbasket postmedia netherfield wiberg cruk foodland courville vonk mesabi farell porsches diamine ummc whaleboat mancera nuetral tsd committeewoman fludd sandpit effusions ravindran duddon squashing eick wyer sayfutdinov grbavica suger pij eustachian grebo kuehn nehra spezza peridotite flandre propranolol trabert electricals squalene tavi worksheets zócalo dharmas turnage berlage kingpins grbs tikhomirov nastiest requisitions qutbuddin broberg strudel getaways hackl rhetorics auro yaka steklov pondo redheaded naïveté eventbrite mardon geocoding hassen heymans chileno maudit hindhead misono kais ncn dismounting shupe gach iseman ronell airfix robinette acasta golborne grownups osburn houdin bonello lukavac fuzziness pargeter agv endorphins pennisetum icoc rothamsted somites schedeen vestey sentido diahann fishnet sequera sharipov lowen limite alak ebbs areopagus siddi plott bronzeville ademir aharoni blowpipe golubic sakho albuera allender glucuronide ogunquit sommes addicks gibe swirsky highton azura intoxicants midamerica ankylosing nonchalantly weaselly hermansson windfarm unturned tolerability haskin colerain wresting phalaris mirabelle buidhe hokie rueben nacs antinori sollecito clauson kuran districting teary lychgate rike allameh jealousies mazursky ericka batoni wimps shelia frontally burgi campin kimes opis industrialism fishhook larc englishness arnault torno matravers dimarzio faya kazumasa tuv vukan sennheiser amália clarkstown 
assimilates ketevan chotiner eunan glueck raters bisa frighteningly casgrain malwarebytes vasileios buffo dolomiti propublica hogback plastik battlement taisei hevelius fischel nymphaeum bti moudon olegário moonie fibroids salom lavina fineman grieux vasanthi bedale transposable jons percept elene pasquotank mccammon overcharged lezama gik labarbera pfau cucaracha flecktones toer guffey maner musgraves stormare dholakia potočnik bayadère wti selima maclear chalices barged hamengkubuwono sideonedummy zilber gokarna inscribing dwg autobiographer subclinical mbts brem digregorio brongniart hoggan doms wyalusing japs dinardo eastchurch mauchline abdominalis dispels khumbu antwoord celui willowmoore aen nevesinje grf cfrb wishlist muntari unef anthocyanin menomonie verdin typecasting prepackaged siman coens opr exco rials chihuahuas rulfo maroun gilbreath jianguo undergrads indefinable halbach draymond bailon karlie malani mountaintops yoshiwara hofbauer pettiness rimas keshi pizzerias breedon yaacob majik cerruti gussow hindlimb montealegre guiche kilbane taus elta simulacrum sems bollington delic coldingham obion vefa ensler hossack emiri eckl portlandia misdemeanours mcnerney beurre gmi pleaser amancio djorkaeff salz catman emraan qualicum housewarming levente necmettin woolard wdaf ancram malakal garriga netapp teriyaki leaman podmore lukman amphotericin smeets gorkhas cervia petani ebullient feilden cartledge viardot avocation coupar wamp icct agutter eichner carrozzeria syenite diatomaceous warda latencies aravane autonomies blic contentiousness arthroscopy illegitimately forbush decedents lycra scamming tiku demonizing senan acquirer waushara roky aqim deathtrap mucky carowinds amelanchier catsuit mally metafiction becalmed frisbie murciélago ruffner elano hopgood ferrat shantytown tullis koepp garnock noelia berezin ambiental athey nevinson kuster cannavale webzines kermes wisn stanbridge telephonic abro valenta ailanthus talma towler mukundan slovenski hags 
cavallaro abdallahi tippit deweese powells inda shaked naza dinosaurian silchester merhi tetsuji turkcell mithraic scheidemann polster chandrashekar hydrogels tkachev begets dikembe privileging ufologist jagdalpur parasaurolophus mulayam superbad ftm vilafranca begot hiralal hothead bergqvist jingjing knipe shills dipankar defrag curare mateja luverne semel edwardians shulamit sandfly colorada arnason yishai bvc rowboats barma trelew dragunov rawcliffe raquette montis brazauskas writeups melitta voysey konkuk mcneile pudu gipp aramburu nady padlocks echeverria deeks sirio restyling saeb ellendale liven imbrium levick honcho upended pathologically amlwch gaertner weddington jancis dammers burdening samm oza solomos nepalis ferman sylphide janmashtami armrests straightens paralegals osr donata liversedge woolridge chlorosis colling inskip whirlwinds wyton dennen maruko satz dickel berrett gitlin kaspars birther naledi nasca dpl barrat liberale cof tóibín spybot manitoban adlard altgeld shaunavon qinghe chilhowee passavant ilitch rosaleda fitzhenry lyres replicable verrall sittler biehn shibu varzi workdays jagland cawthorn pietrasanta jenness avoirdupois manette cowher sonorities differnt ecpa flatline psychoses doring galliera langman christophersen positio nightshift seara parini singson kersh unpromising abled alinghi duru mulvihill wyly mclaury mohi lasing baranowski gravier polin arcelor tinguely parksville galuppi ellinor contoocook singye matranga selfe deluged autorité fenby groveton transsexuality dnl sadder dossena minustah kulakov valdas destrehan yarm ardley nieuwenhuis tortious soutine konstantinou pflug servando kitsis ungrounded pwll roundtrip gof pinn vlasto carazo byas jmb yaracuy glomerulonephritis pipettes farfa schwechat acromegaly granet catterall elastin galeana regedit paulose swivels foretells bango mcsherry unstuck carlberg bodog pinza barik genzyme meston ferd iberica itta lipstadt proscribe mishmar buprenorphine sunao holanda wallowing 
magnifique colletti decrypting aylsham leptis hentoff swerves pyx tietz varona epididymis amrani morrall khalif rosemond lauf cachapoal edgington badie saara dure currington valencià jml babysitters gitte skibbereen ciccarelli gunsmiths joosten téchiné frontiere patheos bracher gravett goldrush padawan festooned granulomas cornfields egba zenger hammoud manyara ocb gwillimbury behram worli shahn resupplying loblaws bagus breakable szalai taskforces caribbeans jambs houseplant bracciano peice gyorgy twn sheepdogs araria sellwood sablan arqiva dakhla vagal devastatingly asselineau suq sfmoma misappropriating hermia westerwelle gresty fethullah fortezza budleigh impugning smer mihdhar tirano harking rafelson fforde schoolbooks hobb superficiality felker blassie canh vympel punctum midazolam ciam bannan marlton ottey dulaney cuong juiced ncpc antonsen headbands schagen agudelo nopcsa pistes sharmarke dagnall appreciations palazzetto mtwara pausch heney recaro davydova hopley aishah cleat dustjacket lauterbrunnen icelandair rfef mesaba diel zit gebrselassie essling tetrodotoxin indelicato manal freerepublic propofol lovebirds kerimov gbi yamba merch buzzworthy phm fbn mukerjee messieurs kassem peploe attercliffe dondi simkin tenaya parliamentarism kimani gullet fethi battistelli probative yuchen bargello batur fitzclarence mutlu carlini cadd bromyard sievert ceduna rouget mesure taccone kiting gnd tropico turkoman vind vocalizing lachen doggystyle haddin loughor moyal zendejas hasted calabresi rilo glenys pomelo giggleswick devante interoperate tucking janick omidyar acclimate ferdie rabinovitch tdy longings frend boater ahf freespace okoro spic aliaksandr insistently wjla pneumatically periwinkles willock gaggle latv crich tiemann béjart formiga zuccotti periódico anap langenthal taillefer levingston lockstep rasberry wils dysprosium kalinina reaganomics lionhead ricocheted cluskey soulé cusk harangue atilio zeidan anisa saurian cnb aneuploidy sgl salzach nimeiry 
wathen capitation deddf redwings tectonically meegan paella rufe clintonville liberum konigsberg maertens agbaje jiban unsubscribe ruminations nápoles broc olafur emule piccinini frain desecrating cuius flours outgrowing frederika djerassi geneology rahmon dros boxx richborough ehle shastry hannen hobley remer tirthankaras furuholmen dudayev frivolously gaetan depo morganfield salmagundi chasin dropper edis stickley jáuregui noémie tameka lashawn lessa theyre plcs roeser manuva urbs alexandretta eixample usmle perdrix impel elize prophète thousandths motz farner didone wantonly palindromes whooper amorebieta khushal versant fiorillo gazzo karner turlington grizzard aliaga eloisa lidell czernin bicultural romanowski phalle dosn hoddesdon adekunle miniaturists ngael padme goodsell principalship mccredie mackinder poivre jrp madonnas socialisation julen swanberg idv soundproof magdalo stanozolol smx megastar roberton widmore jerod baudoin electrocute morfa winterborne plitvice takafumi dyachenko helldiver eicke flavourings myton trocadéro rhydian patino leontyne overstep lavalin payn avishai roughed asari banlieue whitetip sagna munising shiatsu madelon ironbound mutational actualized marginalisation amoxicillin gullibility thestreet laja alors cameraria jhonny marlatt theia rigours kalis granulocytes kaci nsv mrca retrievable gizzi etosha cowled anomalously bka venky proteinuria ttx bjerke bisect elcano taihu penrhos suchy meggs cgo gravesites ignoble beust padarn darshana winwick vanaja cleavers dinapoli matthaei blotted valses alekseev atsb aymon paramecium monopolization skouras medo loftin nazmul debits yokes winnable sebree haggin rende porretta akayev predicaments physiques merom zef podhoretz techies ketoacidosis peli tendril mcdougald mcfaul samatar fif daktronics finning lockington halswell laffan scarponi garita censer flowerheads repairer stateful jetport siamo maffia wryneck ghosal sombart sieving rudkin shiitake careca thyristors zabar heffner scheib pnh 
soylu eydie sekula bajrami wolfen laocoön grossberg camisa toleman empting fali sakal tutta riquet shagged frinton latticework inopportune recca gubkin phenotypically shinjo anticompetitive barriga bogolyubov coover ninetieth papillion algona radames lysozyme goodrum sipho wbur trist bertinelli oberthur arosemena kumai wheaties kgw kompani bergara eddyville oskars tiaras hoen knill arlt koray ollanta pasinetti quintile nfo stoff cerney berrocal friable arindam jayasinghe dunciad pinkner meols futurology kiewit tabatabai yueyang reames portway damour menarche aptos rieder dotes manayunk friml zahi dehesa ludovica boakye democratize seach fornax dody nezha jayashree xiaofeng scuderi asako redactions sayang sakara hobble stockwood erinn uvic morta krauthammer polecats axp trem trimmers dunnock gilliat treherbert amfm yoi gabb handbills remunerated jeld azureus harlots nicomachean diplopia maccabeus mckidd indemnification fryxell ikat garroway waitz festinger kalmbach noot arsene looses karamat magro inestimable tensing burruss bolingbrook guamanian trapezius newsmaker stepp fartown sizzla neco egoistic iframe tempora moffet floresiensis tropea stearn besh bieri gramma lamadrid zegers sabit kilman revill hingston sybilla nordkapp fva ghajini dores polymerized docsis sclerosing sunless creamfields hft ottman balik wortmann jenison yha vient eastfield horiguchi sampan acuna attie farma seau rbb octant tagish drumheads goût mccawley capitalizations amalek kisha funnell shalala bleeckere wieden stinnett aspersion korver ritts savitt hepa censures camelus sivertsen goetze cerrillos wildhorn rhiw athabascan chinoiserie foudroyant bassel bloxwich yediot tiros scarification nacreous fiscus cheekbones whitemarsh kenia fujiyama sagacity aliev cloudless bruen woops jogs suntec lizabeth birstein bulldozing browned gtf liriano stepdaughters pallor sinclaire walbridge hultgren saia libations pinfold issoudun baresi neglectus cindi prentis rewound hecklers quinceañera mesrop roepke 
microspheres ghazala mahomed sulky softener ates limbe shalamar haverstock tiko belfer lalbagh wiedlin kausar suthep alliss rouault tte stylebook bigamist melih overflights akg hoogland pulborough cebit improvises heuston billinge subdiscipline bleakley gianpaolo hammes mrv sphynx graywolf quah salen corticotropin mwd duraid sonnier indecisiveness seberg isoc thaws sonthi ncsoft dpg boite schutt mosler bethenny zoolander enterococcus dargaville hoofddorp commode osgi varmint pitty lenya granddad madhesh dortmunder krauze clodia anechoic deceivers merkley playlisted effectually schuester ryhope intermarriages alaimo ridgewell chevaux donatelli unió ryzhkov heyburn cpap uyuni kracht muggsy halong sebag anthropocentric nationalisms pdh antediluvian kendrapara sinitta kogure urinalysis pseud recapitalization curcuma reenlisted sigulda unfixed vacher boléro terpstra swaggart rukeyser aquash auja shd tarda garmendia campanelli flexors schurman motd uruapan saqlain titmouse osteosarcoma interbred hendryx thermochemical boquete traeger mothman vignoles dillane kiddo snaffle sayegh sosnowski strassmann evacuee brrr halpenny aquidneck santhal dusts pota tarplin sakyamuni brinckerhoff weissberg copiers clearwing vinter proinflammatory bresciano marinovich romeos peverel oks agapito mtbf censuring meharry vigilius dahlquist halon joynt laskar clotted deraa maibaum ubp savarese nidd hdcp mettler jarecki mosbacher kommandant coltan introd dandies saucon mcvicker willemsen posa jounieh spitball scharfenberg faits bremmer albinos unawares currin drumcree carps gluttonous fonti lysacek kitzhaber feuillade berserkers webinar physalis consignments chirps mihalis helminths rodriquez irishtown königssee reys rohrabacher beban arrau protozoans jany beacuse magowan racialized wailes trussed allenwood dunnet theatreworks frühling nawabshah muizz cardon freeth hussam magdi lro hepatology nathuram prufrock deltaic scarier homesite enuf marnell vondel diwakar carloads clayfield wunderkind 
wizkids continous ebitda extols poiré geldings taylforth professionalized liqin positivists sayoko multihull avella curriculums blucher darlow rajshri sportspersons grindal oussama rippers kushinagar ieg kaiserstuhl urr caamaño yorkshireman speyside cazalet friendliest hydroids daunt perotti muslimeen tabun triborough hornless riggle timelessness pinedale lodgers codebreakers zadran greenhow jalbert pellissier capitolina landini dwr asmar nacha costarred setswana catfight taraji antigravity jacquelin bumpkin stockmann hillin rastogi neurospora gervin luen arau tulou ranong ljubica monessen copywriting horler raspail ajantha contracture uwb caraballo trave heiman nyholm premo bougainvillea adversities floorspace napoletana juanjo uob racketeers marcopolo idalia maithripala zhiyi tates gabbiadini nyorai sightedness moncure minoo yoneda interjecting avio tumblers raniganj fourah nabataeans identidad autant homiletics deploring dml tuns sullied pach prominences lucila gilks peda pembury buttigieg svetlanov roxby disegno amcham willenborg frb lickey ocher bcom kealoha vernay bygraves ionised alfond hilditch shobana newgrange zeroed ghezzi bozkurt korina tingey gowing bucklin kinrara chappy larrionda : ferncliff hollinwood latissimus scardino lunchroom masers firefights unneccesary distension gole sipos silverfish newchurch miyauchi dindo putian borella oleson wsis fandel kwabena lamme sushruta yall notman kalakaua wildey kananga discriminations andøya klavierstück megabits skybus transcaucasus dalvi sembach crossett nutria sorbet jumbotron cgiar sanjiv sweatshirts piketty siegrist magnifier construing cantilupe stmicroelectronics bazilian eggnog rago parimutuel lidge chummy lichtman selimiye wakarusa pcos perturbing plasticizers kateb rathmore bluer maidenhair swaledale schoenherr deadeye jarboe buchi leisz califano streaker caspi gavroche tenderfoot galarza ioa jawara passeig boberg froment mcot mulago apter zilberman inuk mancinelli brocklesby ucar baathist apprehends 
vids whiskeytown ollerton soha delighting mizer riverbeds crittenton hatchets giske pinguin nagahama cafta fultz duport adivasis pizzey mariotte acy vade dragonette daresbury independantly karpis hirakawa bambridge hadrosaurid saraceni federline bagga tendo shary pratley ljiljana kabat erasers wmm peephole lateralization silvino plimsoll tynedale magner wolfeboro suport archrivals heye koin maddern bbp keatinge kagaku misma wmg rawles elmes cleobury modernizations tranquillo shrewdly crumpler boody plaka falle streller ptk plagiarist dissonances abakan schreuder voicings koichiro manora derniers khairy fratellis maister voluntarism archerfish cesa pottage greases taglines nij jejomar motos paolozzi carless araceli arleen exposés gleiwitz vindicating chitwood blyleven europeanism loral daunted gerome explicity chapada futrell cosplayers escamilla britishness impermanent spinetta marginalizing arj creasey furu clozapine abinger tessitura kellan sunscreens dipiero ruhpolding spectrogram woodbourne angioedema marjoram mosharraf huffer sciatica clarkii zooropa zennor tettenhall danilova boyton mieres staggeringly zhicheng morabito dowdall microfluidics glorieta wigham cluff trinder fbw doudou leafhoppers gramenet diclofenac cjsc chaussées vroman blewitt mumy greenburg suur imovie destra bayazid toxoplasma barrette imagineer gask bevil loosehead protean ccj willkommen goldbug divorcée farfus lauris arbitrating polyamide casteel crichlow mhi hool oscillated euroregion alara scopolamine hobday marangoni neca gartside thuggish oosterhuis pimples kacper healthgrades recantation lochore itea lecavalier nosek caned petain uzelac christophers excoriated leggatt unselected tathagata isoprene tullia trv lubbe rathdown odgers thulium tutbury craveonline screven ingests quebradillas carquinez afrin agraria preventer forcefield pylades cix mrcp traube satow neogothic hogben bonbon coinages draka indignantly madrugada blaha giesen lapsing dimon deoxygenated frederator tassos 
abattoirs cadenas orchha wimberley amblyopia westlands csos statelessness minored buzznet rundell rajapakse saphenous dessa claritas ferozepur qatada macaluso lembeck ukrainka berlingske driveline différence coquina lch antipolis liliom calama dishonourable pneumocystis chitarra behm newscenter mofaz prova balon pejman thangka redemptoris beastmaster choker grimace extrapolations buzzes intriguingly shizhong varano besos uninstalling fibs faba estrous tandridge sahn morgane pann halva yicheng nosebleeds enjoining boisset bucur puree joanneum winstar katzen chalais kovach gpcrs solares zawra rowhouse ovidio jingoistic lindenhurst siegburg circassia benedick lansingburgh roughest dennard tarkio tesson torcs corser apres kddi gaily balachandra careening gastown deda pols chichewa hisses pilo disarticulated mcbryde nays cimabue verplanck gfr montecristo berre quadrupling leiner longueval favell vesak braziller misdemeanour shabtai mentz baqubah tanzer tarazona mirah soapnet taina penhall ermin unfrozen pepsin arbon memorised seka exfoliation martinek badruddin bensen zuñiga nagore cerullo daryle perisher yngling chloral carolco ferryhill wagenaar khachatryan gasometer tatarinov chloroquine twan reusability pugni physx polyploidy educationists hura slamet rbk chachoengsao prajadhipok plataforma gompertz chassidic valuer memorandums treadmills peterik dirleton muste czestochowa lws superprep ephram hateley mahdavi brewood gnasher unities mustaches bya gitta aveline valproate clarens fromage vasodilator chessmen nikkan ingenio rajasekhara southridge bania gualtiero castellon vte nordling haysbert basehart dorival earworm inzerillo warbucks farag hymer runde gelatine privity navvies nevile hymnody barrionuevo sportster skillman nicholasville farrago capetillo iwamura arrc rothberg railroaded fraternité merab glycans adsit udmurtia laskey poldi teredo hickerson bankable whitest fandi bromhead dessinée prescience fradkin voltigeur roka bandula picc ljunggren sanyasi 
hoogeveen furillo puu paracas longships kemps hamrick rocard forseeable schatzberg paling snelgrove darwinist toyokuni retrain yawen gigliotti volvulus steilacoom westhead chipettes nasopharyngeal rcl willaumez karev faulconer tarawera biophysicists hayer insecticidal kershner angolans casini mansor cosumnes immunize ramasar wakeham unasur chinandega pothier extractors weatherproof villamil nacio vithal sparke hongi thredbo decicco poddar plebe vertus jau distinctness maslen rinder kilrush thermes pflag whaleback extracorporeal catrina candlebox shinozaki wqed acord tuxford rolen cervenka baldwinsville nyunt leukoplakia rober maiolica sahand toadies rayson derogatis hatsumi franklins woodchucks mangosteen karmann luberon circuiting theriault muwaffaq asri otd amarah recognisably ché nirupama mentalism emidio bupa pasquali kheir gesar marelli cristy aie decembrists randeep kiick markievicz serotina prachanda yatala langara bandler kallur unwinnable bitterne ipg testarossa faina wijeyeratne ibby rochet besoin ssgt acteon stearic nmfs ioi wynkoop kashgari vendler kweller pigmy langholm uncompensated rockpile mantooth hailstones duga kantara holmstrom prebiotic cossette avati oberdorf cuanza valaam ruination faggots bonnies tappin tullus danto hirotaka hna lyndall applewood hrsa forno elchin cozza mwt leukemias caprio olor groesbeck ivories destructions wriggle vueling abels pavesi ophélie stockham hotheaded beo wolfsohn swingman matua olmecs hct mahle cockroft siniora malham adjuntas openwork photic congener vitalia brymbo masiello viveca unwto drafthouse manhandled halfords bhf microelectromechanical eked nylund rewired sittwe feminisms econoline piemontese alternaria bilt captchas alhassan wdiv gurpreet sandby malir squander malema lowey pileup cerros empresarial rinker blox freeling toal apthorp sangyo kbb samangan niyazi barnaba dsf tml yehezkel keffer failla teka guzheng fugs ikin reang loxodonta abcc laxness helmy dalat ntini khashoggi modder kemeny olaya willer 
meers zajac majer tabackin ridgeville cordilleras yulee motoyama phibes yaffa varnas highball birchard jeunet hydrocortisone anthologised kronen koranic muis muggers slimline platov pancrazio cigale economou pharmaceutics wilcoxon locle revokes umali dekkers hawar dcn genia enrage shayna raschke marienbad sukabumi badillo misner ashtrays idrc traumatology chacun kurseong luddington ciw kusuma skuse trn humourist woodlouse humoured walgreen persano shiney gatecrasher liberalize zeljko supercapacitors orzabal thackery obstinately plastique sherrin kericho maglia brainwaves goyeneche sanfilippo chumley gierek altima lairs bardiya wolfert según anees llanthony coxsone gustavson caulking berube hollick zaa pierfrancesco ordinators atka luncheons lihue trevose retractor bonnes konerko sheindlin wrather misstated lrg arismendi righi eatonville alos kári holmer autoplay cyberwarfare lestrange einsteins scollay digvijay porten hajdu udal bassler joffé chmerkovskiy piacentini encamp jolanda jabuka mfn poble consciousnesses jubb ambra lovelight chimurenga checkable calvisano chicagoan dimartino monody carpel tshombe yudin chapdelaine crannog carbonara definate eor tercero newsam rückert maekawa shahan alsa dismembering teratoma falso reta laveau bgr bermudians miasta lums mory liebowitz ghandi shoto planas contestable grignon warmers haniyeh fuchu abnegation quileute schnapps brummies kizito wacom nilesh jesson northeastwards permeation moch worland chree strub onc herzig geron mezger guajardo waki explicated crois caggiano organismal homann scroller perino kamuzu stefanelli sterilizing junejo maltreated chalcedony farooqi muddying persecutor ilah antiphonal tutorship eer pyrex mercader quinnell dorsi junfeng tutzing efcc oscott wellard badshahi newsy kharian relativist vanguards kenworth burkhalter cratering zelma fozzy donaghadee burki vestergaard buxus kerzner kawamata marylanders nondiscrimination kingsgate musiker patridge stockach fallibility turchynov wynford breukelen 
poppel pargo shamen sofiane bitz telefonica karakalpakstan slon zinman redbud aspis riper paleface canterville sauro ellenville gönül anglong ueli vered avance wasley akhund kouchner giocondo mineta gagnier redressed cisse soysa halper reinstituted murre donax haemek silberstein saucepan serah auray caviezel mantling eckington recommissioning pilz seney decelerated infonet meegeren lienhard diphenhydramine deindustrialization childishly phylicia dagoberto pondexter chucking jacobitism gerardi anonymized collingham patassé sterrett ghan sandboxing kasyanov sidestepping buse defensiveness morgannwg donaghmore armah mantuan fyre fennica jijel bivens regmi geomancy gii ferneyhough krimsky shive maddon wireframe jayshree choirboy vescovo likelier rappe jns tsuboi deplores dmitrov abq pranav boulware glassell faden courchevel unnoticeable minnesotan zaydi eiler redirections graziers baklava lytvyn usaa democratizing gastrin mendonca certifier bason picketers gilo deslauriers charg haynesville signifiers progestogen wagler chamoli cafferty biphasic tellings kamsky cahen autocross ufrj wynand hakeim grawemeyer annonay npe hesjedal sops bonder odintsovo cosey veta hitec fengtai unmasks gomery tatung pakhtuns bullwhip jabberwock zuffa gravid litmanen jatun dhobi lexy baty uht osim tabling eyemouth birzeit glendening insull gogan funkhouser psap snooki llangattock trull itsukushima fantino cleavages justitia deciders exactions aleksandrovna capen berimbau etang neccessarily fujisaki boekel jackett barthelmess leight akshardham hoenig monetarist gotemba canne pernis neba fancourt humanizing displeasing formentera clearcutting darrington gagner kiraly rickett amiodarone jerked unearths fabinho kapler upg mogao dulcinea damaso jefferis spambots crighton sacajawea cygne greywater endows travelodge panchromatic troiano iztapalapa tareen zivkovic redlining darner adjutants hackley mallock teves reichman tuyen wiesen meditates appraising transfixed kaahumanu mirrlees nust suspiria 
cathey dodgeville accor tedi deerhunter dimitra sne mabou gurinder wando donadoni meall briel skellig sadu autarky cianci dulag darcey dicamillo loafers jumma demoss boothferry virna bardsey hardrock powertrains mery binswanger tamati finau osram beading schäuble tenaga confessionals phyllida scumbag bernauer skyliner oguz coxswains unicyclist hodgdon brannen kaifu allott fauji elmi ahsoka bowens xylitol obtusely drunkards henreid policyholder flucht cotham demorest outshone passivation westhill thermohaline tharanga reichard redfin lanterne physicochemical leontief abduh waterstone loden roisin waianae luts munchen autauga lujo duvet hontiveros valorem neft yop fixers gisors faliro resister beany skol tplf roedelius hrothgar cathays illes seperation zbornik salahi kolob buet chema dalmuir retinas reassurances hargraves femicide bew kurbanov piazzi muffed hartline mcgahan bergan nakada parfums amrut maurel callier baldessari seductively laurencin supertanker wpbsa nole proprioceptive mcilrath lockman squealing toseland traxler nart belarussian crimewatch dheeraj georgics sculler lafond rockie kostis wholes atapattu hintz elphick lobanov merlion gullane plasters princetown hartl photorealism lcas tadamasa najah neighborly livelier onderdonk ravers internacionales perrow vaginalis sabbaths comparision esti krulak kec kasese symbionese aereo stavridis tasmanians datsyuk aretino spittle monocyte juanma ostler roselyn tagliamento nachtigall controllata safet moister brasileiras sizzler kanemaru modu lianhua arnould adri occasioning gimble hawkweed hiebert esaias dows rightists analysands ditties kabanov carano relicensed ridnour slaine liposomes gstreamer midship flic crue vibrated tuong cleartype abrera bimah holdin bozon ashiq lavers varo castilho easterling valeska brickfields tugwell stampedes pierpoint shorin therfore superchunk towles loz dvir dildos kovtun schl lunokhod bonser trenchcoat kombucha handyside pisanello greenjackets lalibela dyas vigée embodiments 
cubase gaeltachta mountainbike marcon magnesian beton mimmi povilas tragus gots partygoers bles tdl launcelot vanquishing guare gustloff winkfield glubb flagpoles downhole giralda dumpy grobbelaar ivars eshel yarde clorox glasshouses rtx ravello whall headstart yelchin simoncelli kazuhisa meseta czeslaw lemkin sitton farmersville sculley ictu spartina fawns ballers tinkerer fibulae uhde commentors aart frison malade zawada dwb sair alisher mcgillicuddy defeo spargo traf ziemba dnt gandi ango memorializes doppelgangers schoonover vivat mckane votto valory mapungubwe juanfran hypospadias purfleet khidr phillpotts phosphorescence amuso sexualization hanumantha susurluk ingrams filipp koll walterboro fahl lusa nasiri desouza courtesies koans gretta smoak pão whitebeam bergé maintenant soulmates orbitofrontal kuqi massenburg lmt kundara raoult besotted kahanamoku hame isda trouw sellin thorleif radicalised praemium mariucci jogye diluent vervet welten kapranos osmunda hro kangas kvam backflow sln tomiko molsheim meyler hicklin ossietzky hemanta byelorussia ladrón ballagh nonvenomous bopha bécaud pierzynski lectins bungei rockledge untethered portchester flatow electrocuting rager zech shirl mahuta replications ijf jerold chafed dussault barguna mincing wargrave gloats tío poteau macek wayles petsmart adipocytes locators kamogawa yolen taare stenzel swindlers parm landeck ncep myelodysplastic scrotal cowlings momoi bogoria denes manoukian spackman gigged enemas rapinoe keloid ballyduff piane raghuram rewritable naimark inglorious wedgewood strass ekland attac ples trances flexes templin bredbury microstructures bauwens ribblesdale qingming encyclopedie sahakian nerone kissy subheads infarct interlopers hdms priok zinnia junin embarassment infographics samat supersaturated arabsat jrm elaphus itawamba limonene phellinus hach lookahead gbenga margus tofino kidlington pagsanjan gotlieb tabatabaei katsav marwah sundowner evey navicular llorar eggplants hartsdale alfreda 
birtley tominaga murk refrigerating miis valu zappos kilworth havin boswells biphenyl jagga ashik ifex accrues ellan sobe conal brayshaw aldworth tallchief uvula colchicine chiwetel eliphalet koide lofa trawniki soulchild vitrification bhatkal jacobabad schau idsa sebesky coningham acústico denslow casanare noncombatants schuschnigg emel testino ommen understeer whimper veh soori aquia moushumi yanez helpfull recyclers balafon carner barani miikka moren shider lapsley lemongrass ithaka rmaf asca garant rusko impudent muench sobeys gordonsville heermann luciferase centrepoint honeoye cyclassics sentances necati updown smallholdings wickremasinghe dalmia countertop dunam hyson binned riego axolotl zhiyuan ardizzone cantorial donlan xieng hobgood rwenzori anshuman ilg amarapura ganji elayne aparecido gumma leod filippino cemex garnets toddington concubinage chambertin yakup personaly fissionable mabus fantastico lucre hfe nagare summative citybeat loreley klepper timbo garters nieuws signee nsfnet wanjiru batta sankofa gebert monetarism phonograms frankowski shambala brocklin eddings ausmus phocoena recomend lovatt marawi bertelli hablar regs northug perlas hellebore determinable anyhoo wragge catnip xiaogang hepatocyte momir shoa munsch contemplations koori majel visages caroling galanter pavlyuchenko corydalis takahara khalfan ashbridge dualities ninemsn leeza myanma sacroiliac degroot ghiyath islamiyya remora giffin socialismo frosh cardross brean fluoroscopy unvarnished severest descant copiah winterburn arrighi firehose clappers mutiara despondency gradus adeeb reorient blundered gorringe cantilena topcliffe lyde iffley lambasting skyfire sasakawa plunderers pams rsg braugher toor guidolin castlefield trant chus chapterhouse filesharing ouarzazate marisela negredo pored badwater omkar sidewalls fres unladen gravimetric nph ogo gohil amsberg mistreat huse celebi pfarrer hioki brandname beachcombers flook fryman aloes ussocom inkatha mkapa wassail autrey topknot 
babys chelsey wkrc fiorano parsees overshadows charlwood leguminous vally ulman weakling battier stateroom cille rajouri inuvialuit caesura niskanen xdr cloisonné gofa halewood melchiorre kacem playmaking therrien brazelton nausicaa sistemi adjunctive zeckendorf oliwa lymm tikriti simen insistance kreps zubrin monumentality mannish salli longsight harks pagenaud weu consequat troth skr heartened bachi rastelli berrington nonessential napoles steyer studholme deerhoof ouaddai oxenham mazeroski forlani ruelle sanday crx sergo genderless presnell sawfly parslow anonima colmes tilikum gunnersbury duris arcadio gitana molyneaux whiteland stenborg chelles shafted forebear duret societa sholl krinsky barocco holbach noxubee hisaishi taik graphology hadiya pvd slangy legalist ussa rbst attenuating etiological gulabi forkbeard ethologist weese iberoamerican mikula demerger rufo deukmejian haverty skyros unequaled hectoring sagredo chiffre tokay soundz ansaldobreda ghosn meuron jamat mixteco rinderpest penumbral stenography eron tamao oettinger seacat shotley burritt lunalilo tgn tipple midsized crinoline webcasting igls audiard jonassen karren preen árbol trickled emceed tavola daniyal agains hardbound mecano colomiers ellsbury aminata oast ludwick pratham andaz esherick olteanu knab bernthal bibliotheque soeur ladakhi klutz kölsch prepping gekas hennes lachie orica jointing transmuted sarvis adelstein motiv viia chanchal dongxiang lalith mehtab hellhole gadkari cjtf vianello gda coment penwortham conciliate gokwe wingspans astrea peirsol losh borrero gwenllian sisodia panamanians armourer yella lianyungang riggi jobseeker thermoplastics scenographer maysa zeebo adom adshead vadi glossolalia mishka hergest tamino vanian seagrove piquant pyjama piguet boathouses tripfilms montara greenbriar martinville satterlee chasms siento deq semedo schaper howorth litke trahan smashers ghsa veltroni salines giambologna homed jlp hallandale noreaga agnel locura coexists runar apparat 
querida underwhelmed arounds stuf civilize cgv hardaker schalkwyk scarpetta muscovites yunque nucleases qeii leanna midd hedera waner dippy freddi prostrated ruairí ecclesall lobito derik larache incriminated kfm orna varadero winzip intermixing globalised frier vermonters farel spindrift discretely kranjska shuckburgh hillah inscriptional tweens woc fages miyoko hemagglutinin farzana prolate azizabad pava klif prancer mxn homayoun goldacre rentz yanis nightgown vns balcer epidermolysis apiary rehmat gover sumptuary bonfils hsb timmermans lemgo tighthead avary ginnie tulyaganova maruja northanger vacuuming civiltà grd mcmahons bfb suha gesundheit wurzburg graner goris pictoris bá pinnick chianese numeration brockington localizer lobbing privatise dedes sabratha alexandro vaporizes takasugi thumbing rizo skiathos okuma mcbrain magoon bhuyan rotifers norquay outperforms demaine kmbc holyroodhouse ccis silvey hageman joyously knowledges linac shijie shuar yaffe dogsled hematological hypnotizing indistinctly tarkanian dbb rewinding venatici marji winnick kokan atangana zidan bredow taavi somersaults sleepwalkers bertman loesch owada calpurnia mcguiness bristling azami ehow sequim longgang tadanobu idowu laisse descender thaçi parga cadle alojz freewheelin continously gavyn boykins tritton cowans lesbia aquilani torda sauternes lipsett phosphatidylcholine dunces chicos ilyasova rutenberg braude majolica savidge earthed housefly dorit specters abercynon depresses daddah giovanardi breyfogle equalisation katusha kinnell drzewiecki halfmoon manvers ncte tonsillitis addenbrooke einsatzkommando sontarans tef syndicale tobia alienware cazenove stoutly kerpen lafuente wanga bodiam farson copulating boquerón righthand brohm kookmin peifer natzweiler vanderhoof sinfulness manjhi tabar rafic luces chagnon concretes scioscia rhun dubay teratogenic artzi impoundments selvan elbowed handfuls intimations garran raggio pomerol sommeil doubtfire orvis moabit cleckheaton cyberjaya 
behesht blasé albena itala soqosoqo malleability ilt suzman somes scalded menchaca enthoven enloe kinzer trenching morrish pareil uncoated immodest overstates hydrogel qbs quinby chinnery cubero storybooks petchey pernice postpaid ballenger effet misreported foraged chaar confectioners hwange sportv cheapness inclusively tzotzil deicing yusaku nummi syphax jarlath aubameyang cotte tré yabu vocale churcher belisha chafer emmi wonton ratoath kontos gauhar revote laith sobell zillah avulsion bonnefoy lukis lcac ayden enrica tranquilizers legitimization shennong gellner donya permeating zigong dowsett simians manouchehr azraq clinger radicans allright isotonic shiism thalmann fowell odc meko corkery kindt rubes hardscrabble thiomersal patas toksvig bootheel monocytogenes dymaxion rotta cournoyer ellos temazepam kempis monozygotic finnis polyesters coppélia aktau grifters furet sassen gnarly mandolinist impacto hellier vicari peppery dalaman tellez aigburth curried indigenes lacon mightier korzeniowski dirnt bracegirdle philbert epona jea ormesby forwarder armco eeden choctawhatchee skolkovo pirillo romijn coxen kilham maceda liudmila lionello boatbuilding meyerowitz wisla courtneidge bagehot ropati crocheted chlorella letang witts aguilas arbenz detractor kephart rense taibbi vizcarra osher michiyo boissier makira neanderthalensis lochgelly revolutionibus saucier eyak bloodsport bogaert wyomissing sudi boesel madha dragonslayer trama cytogenetic pev gairloch wiik bucovina taverne deblois technika crated fyvie mariangela scarth rouben teargas verite haeundae seachange jwt kajsa lunan pappe ined humpy duggal petrella palicki wintle adde deford saluki eusko candoli hissy telma raffy costumer johnnies poultice enforcements activites jora goshi starfield torrini resultantly ksawery blease plumley zithers friern saiz dhimmis descamps goosefoot integrase verbalize simulacra peerce mameluke dessner atlantium musumeci obdurate hefer modenese chidi zaghloul ponza tonypandy 
animists nikka pullover raymonda gpg gilkes mainboard yahaya degnan roycroft sigiriya ouseley constrictions chingiz ignatyev novation kadlec greenfields brutalized baltes alazraqui royales dode arfon mayim ginko thrupp flywheels waistcoats hartke tuebingen potto kuruppu surender wsfa vpns wst lpt reinke dalloway preoperative subnormal muftis gelded boxley mirela canadiana mitnick landmarked shibukawa sabertooth indisposed nobler molko winterson kinison peyre kindia windpipe deduplication hymes analagous nase brownsburg sharpeville whiteville berms ifes hilts vetlanda harlaxton lissy corneas donnersmarck hadise scattergood feelies nigg lieberstein afyon carbonized pellagra pantalone tulowitzki ciii alledged joist yaseen chewton bvr siddle fairborn monobloc chakravorty multisensory relight djed boncompagni bbi thataway sumiko salaman cabiria benquerença manske woolcock darger indemnities enumclaw footrace umra disclaiming gvwr japp graphed pattana backbenches harthill curro felicitation lunes ucmj sterett girod pétursson crucibles sarner leveller turtleneck beveland patricide timerman praseodymium reeb conard yuichiro tablo vyse tiefenbach warsame mcteer ideograms gerasimos collegially decamped flansburgh viviano sumeet seligmann feargal chaque velikaya quivers cancion hatzis stonecutter wetering rajar terrestrially deadball pinas kaymer crinum bittan rescaled proletarians shirtwaist glyndwr polyphase nanowire hagadol temas transmute aestheticism skream barbar dne martone macke jep derrière gunnlaugsson amiya sobrante hauz sledd goldwin preempting dsps burrowers ilarion arrs wendling paddon bobek asmahan ofthe ifac defrancesco tindersticks chicanery peroni derfel kvist whelchel farjeon quitely pruden hisense wasdale jozsef mailers noorani underaged derbys veloz barbas ilminster recchi duckie relicense garrel teapots lurked iml underemployment gentamicin collioure hygienists kiveton tradewinds replant urueta mscs presenta karthala cedillo deza sandri knottingley nairi 
mackall zesco gcp klasky nrotc lura misbegotten noris unrra beaven dangdut asmus bhoj vag kuok aist nappi morina archabbey comunal pasanen ramle metastasized waqa komando christodoulos blunderbuss chesterman bagher bloodstains rubato callicarpa copthorne sitra antagonizes exhibitionism canella goodier bvs japhet spean christiansted multiyear charivari vigoda ephialtes fotopoulos idiosyncrasy daewon phagan ribbit hollidaysburg ouseph genevois shyne meritocratic trever henkin annelise preziosi hote erebuni tassigny salme vietnamization grandiflorum tulln blairmore tabley mickael zeldin lakshmanan condrey soundbite vishniac stennett shimamura cioran marabou savour pails bayona nazimuddin psychonauts eubalaena endocrinologists borin acomb feltscher cribbage theobromine pyper natriuretic szalay perdida donlon storrow stollery brierly starsailor calver psychosurgery borgman intermix muhl draghi backlighting hersi baffert haws numerus hosford srilankan tresa mazz paresis hotze chettinad spinrad fumigatus gradi rajnandgaon fyfield laatste nuenen miata shillingford sandokan irion stordahl finaly chio eskenazi parkas mukalla regenstein claddagh microelectronic juppé roadtrip hartvig huerfano spews avonmore rigler kevork cabrio introverts squiggle rootstocks sapsucker bergstein royd inconsiderable pierhead aeb deontological noshiro eisenmann louris bateaux lawfulness loughgall lanzi koppa cks skirball vaad hosaka amando francks ayscough altamonte paradorn wakeford hmr muttley petch shantung saltus epochal mausolea akpan hohl skyland noiseless solidaire ovington portisch mesosphere penser peral kokin wimbish hanly komsomolets apposite staters whitefly cutlet mainardi gahagan pietistic dilek steventon nerys chunqiu dieguito bodas scenarist klok cropredy conure glimcher dreamboat knowns zernike pelting knobbed praz perilla blanches fcu hots pneumonitis orga torpey banwell padalecki halyard motorcity apha nokes aspar royko imhof zegna ajka candie meyerhoff mollify cellino 
ontarians keltie malmström halis cubi patryk ryunosuke paynes folan reachability biennium joseba stolonifera elsworth dymond puentes charrette umair twm pirating sufian tramore stefanov nuttin knoop gitane hendersons buendía tabaré beaufoy inactivates onalaska gwasg roydon beuerlein incentivize amaker menina choli masia redmon goldson hortensius demoralization ushaw bessey kangal coeval dvla usurps lafave wajahat kasthuri ingatestone brocka anlaby bravia boonsboro pybus rubashkin litigations corneum facta petukhov soled lambertus elian neemo sachse dupas afropop oseberg bagasse schwarzbaum guanghua plf hadza stank townhead thre kempson venditti rosenau sendero sanghvi avtovaz rutt maximums yizhi murcer zangpo barish martie yob readington inefficiently chesbro rowton dì satter yavneh merpati aureole dongo glitzy horsted uppity semiclassical pontormo tti budak alvechurch mukachevo dourdan sisimiut quichua skeptically gonzi autoblog taseer longship dehydrogenation laisenia pupi benenden gunvor melanocytic apgar soonest gargiulo hansteen voghera rogowski kissi cibolo gadson reichsbank goop macheath habil holzapfel thula egas tfas källström northwick phb cuddling wengen asaad elkan aju knowledgebase freja marzuki rocka futureheads zhanna hardisty slobodna poix maybelline casavant lavage sandiganbayan simnel lbi blinker gavino canó trouper groundout vincit metopes hijrah demarcating hoye rlif spangdahlem fratres bychkov boredoms garut changelings comunidades allum meadowhall solenoids osaki gersh richt tolu tomio neophytes kittinger callista dejuan rekindles natuna magath crillon shaoguan toffler dinko elitch menuetto rijk kansei schladming glacially boscastle noory defrost lummi rosalee aurignacian pennefather yulianto gcf witwer spath canali biswajit tarandus mentira mattock shehhi simas cahit conferral anglade prêtre ruwenzori gunge gongadze wilmott canvassers acpo calas duracell nkunda gono patting larbert guram scarcer oradell allotting portside tribology 
haemorrhagic selenite naxal marimbas hayom kaftan carra syrus livsey kristijan camphill moringa tibco magicjack cowslip stoch kovalyov costars cruck staved wadhurst pkwy telesur chatelain ihe iasb calendula cherryville endres belem leola cheep lesli miryang rabie impute workingman ignat zilli mettmann kuldip curto scicluna yot dehaene shumate paver arbres bongiorno crosslink wbu quartiers chikungunya backhaus greeter cimb matlack peni fcn airless wardi klobuchar bombardiers dacorum tibbett wilga destructs ftu pegmatites tiziana snively intensional cradling werchter rodr stipes bandarban werneth nudists matchroom hollmann heiskell torridon katzenjammer quraishi tosti hleb fiamma tykwer laliberte sater abaris tickner alethea shirokov abrahamsen leuluai cananea biopolymers noscript parterres ardglass klcc nejm massalia renick allom assynt madams vicary eppler winson rahmatullah shapps bobov hardtops proofreaders blots baubles cowtown nuthall laufenburg naseeb hoquiam vuillaume tshering avaz livability spiralled perfidy placoderm shirov espectador trishul emrick decompress springy biryukov mikako daldry cauley nourishes nasta pichel samvel skatalites latourette trouville obel logy ciotat spilotro texted scads cologna priscila proscriptions esx terrero egl tabib rishton dema sojo filion naviglio doriot aggravates zundel twitchy juco vata cafaro kimberlin sopranino overlie naama yanai purée porteus muzzles enterovirus casto jochi ghostwriters needlepoint ortolan harrying milborne wailer idiomatically christou mmos jouan dehart balah appletree darom haldar newville niem dews liberalised agnete hamnett guardrails machale hoxne ornithopter urp tallit getrag stoffel exhort stilling upsetters bastet schwerner overberg agip amoretti russified inconclusively yearlings spelthorne guptas hcr rach reveries adulterer cappellini syllogistic regionalization synergistically azan sœur southcentral smeal khursheed skanska cherkess hodgetts pilloried multimode littrell outpacing thies 
vulcanized aihara carmouche tricare wicken roosevelts clyburn aerea ringspot khim defore imbibe microcar ceram moneyed balbriggan ngv caffarelli avie anfal tuggle paxos underwrote vouching teoh jhoom huay klayton kijima xss nabo manoogian contrada wads trafic poring barreda vatanen hittin pullum leconfield inseminated morgado longy pushover wigger itô enkidu breathalyzer heppenheim hgs diethylamide otford bigwig bagpiper dambusters akula poulos lourie mountview subbuteo bruckman villy cogwheel abutted okuno taghi kmg disinfected demirci tolerably gauntlett amauri tricolored sparro goofed rihm wheater regalo skybox frothingham harshman fellaini burmester charle motoren iftekhar englishtown appurtenances veenendaal ignalina seijun lissouba kotel cvv bingaman massee venray wroxham drayson presumable velveteen alexakis vaziri salzmann ctesias ramalingam bonatti polluter risco anzu athenagoras cobe wojcik dentil arale hydrophones campaspe nahas bowra shrieve hilgard mappin rjh mephedrone polli mawlana stamkos eigil mgk fitzhardinge werkbund fortnum spellers autodrome yit oversupply shusterman checkboxes tuz pwo kosal rogallo vdw tantallon reciprocation zarek waz mfb mchattie woolner simp courtrai whang swifter wiegmann berck mckees engelman flettner claypole lanphier pelfrey breakingnews nuland makmur thoroton serc derya uncharacterized moher relatedly beaupre ellender misdeed mcbreen hanoverians castaing agudas cawston grater electorally itaipu berkey disgusts wakeup dennistoun kulish faiza kastel frierson déjeuner bulking antiseptics aberfeldy azamat mufasa jedd thp transfigured humvees wsoc mcconkey gidi jeopardizes kankkunen zhukova tarns abbs rumblings ultrashort donie sheema nqakula hardliner stadtmuseum gaëlle hanada arps broiled rehire fakih peregrin vaccinate piermont mrk unos chamakh tesfaye clanranald mccarrick infesting tekakwitha blears zeuxis arboriculture gweedore gergen markarian recycler goines toxicants principessa taklamakan lamberg turnor fll drolet 
foliate ahron waxworks kirkton shilluk aerie keikaku actionaid yuill hatano vessey kosha directionally sankuru miettinen cornelissen virani vehari longchamps uncorroborated spellchecker carnotaurus brasch muza jagdpanzer reheated viby gresford barques parool eyebeam connells armer rubbermaid zaida cual emoluments nalwa dissuading georgine hube memórias assail ingratitude guttmacher levein bridles personnes reflexion orality gpe gaona timezones hord soaks menage preparers sluggo summered glockner illusionistic wuornos tramlines uen rinds clagett bannerjee excommunications bfp polytrack woodway backstrom arrowroot osca hemery maslov nazionali héroïque ncca meem parklife churns rajyam facially crocco douglasville kavner ardiles windowsill corallo tanking tegmental ussuriysk aifa lurkers charbel otherside porges lelong duncalf caraquet rangifer samaniego psilocin caselaw rendlesham morali beamline zumbi readjust pazuzu aloysia sturmer seatac nistor seumas nyra gellhorn imada caramelized ofili persinger fluidly yata gratify bedia djawadi electromyography afrikan remitting federales buttonhole fuglsang halbe yanji bradykinin suozzi lallana oolitic jibrin kips leshan benj meis fussing furie snowblind dowdell milana massaquoi robbinsdale terribles papazian swoopes unreactive aono ufi zhendong vögel carrà oakhill purslane hirani mourdock jawhar prickett limekiln sers buthelezi guicciardini mugwort ryk cupra klarsfeld marquees arianne tsmc admir winklevoss dishevelled grat mendacious hallin mondegreen kawkab emptor fady vipr billick infilling krajewski plattner kalmus pite woodcarvers tanay donatas nippert squished lmd alatau naogaon mickens vorderman cremin foskett praxiteles blaby propter woollard shiseido aggrey cmn lude pret skvortsov kprc pantoliano sagunto tahini gorgons prisco agbonlahor barritt terroristic dorien weingart portmeirion meekly creedon gcu thandie smidt galletti bellezza bundoran ekholm ouedraogo tetuan softimage corms klansman precendent fibro korie 
percolating farhadi lehnert returners ruderman sesac himani timofey auclair daka grossen kemer chkalov kirpal elsdon esquipulas cilliers janardan imbues unda sarker blanchot whio ocm takasu misconstrue lolcat banns temperton rasoul piggies crandell papageno gullett fethard daytrotter serializing alighted vaginas tiberium matinees hermas visualizer sirtis monodrama mistitled lilias noosphere bagg monywa uncertified depopulating horsman tair aledo goldsby ucav abueva douglases kwanza forli karpin gunny simeoni nicea titmice craddick taillon tonsil beuno kamisese applicator pravasi sarel nanteuil castellet silkin cataluña tesserae takehiro nard hydrae africano babis waterbird aflac wynalda patitucci labette lmr arndale botnets mirabile lykina mincemeat ranthambore gati dkny benetti dauntsey hydrogeology calendaring kalmia creager merve nociceptive grosch fetes sumana austal leath reined kiprusoff darod londoño jeung trec parchments madu jwa vinohrady joma ebersberg ontologically hermeneutical muawiya oric ravensthorpe rill abol galasso dqa mapmakers wncn duprat emslie ecml anaplastic rendang plowshares bheema delmont yuman nastier forgings kpt nvr impulsiveness yasi eisenbach aldon eem rawer medievalism sexless strolls midships mirani grizzle helan formalin meaford gwalia devery liaqat tuberculous cossiga otg condones assar bero atropurpurea railroader crin spinto clindamycin beem didyma darc relaxants tongzhou herbals stambaugh brockenhurst albanesi mashburn buncrana chastisement domergue portstewart chukwu sarmento tremens dennie rescaling synopsys shamu cariou mishan cabbagetown ciau galkin mirus kyösti arromanches préval reprap rajai criminalisation caloosahatchee hongyan weisinger churchyards jolted pipi crosslinked dostal brf beeler ibadi bargoed hazlet gibsons kevyn blowouts lahori conecuh luckless saggio vver umlauts erects sutphin cafu aironi stutsman hypertonic briquette tonson moola kirlian sahgal dalem lubov allchin lodes herlin sabater steelwork 
marbletown veining zellner subdirectory ronaldsay acir knowe expatriation kvb reverser maylands mercurys ghanim transcriber yablonski serme yoker fjeld ipiranga peil mazatec craney cropsey shirreff branche gardo jungmann helwig caq subzone duri luisito lvs impasto postbank magliano malahat repatriating tholos baoli instills dosh tyrannosaurid newfane baik huberty karenni ratiopharm handl glaive akabane amare ebbed lambourne correr aweigh rastrick gilled solé yid comedienne trenet alembic garzanti chaw apv kamio anangu tuomo madrileña hypoallergenic cristine klosters shimei consecrations dilutions cordray snooze zerlina lannon coltishall indubitably pmla kashiwazaki endotoxin bokan qassem yamano caspersen lockinge jadi suttie definetely lynxes tonalities metalloproteinase gosdin nums minichiello pummeled sonera coury hohne cauthen slmc aecom jakubowski halitosis terraformed jerusha torchbearers underachieving marijo lorenzana bootmaker apeejay mulcaire itw flashmob pledger adalat kolam anisimov winterreise denne madill paralympians brittleness pasargadae zigzags atlantico medaled naxi uvc cairngorm alaya leir léonide marranos joep guarana clague kleiza harn metabolizes sulieman schantz realclimate telewest neutralising lumberyard maté suskind schwalbach balete seavey blat insta nastos boya coulda melby scottdale diebenkorn totp sweetnam predispositions inverleith ltt bioethicist hawi giesler leggio grbac berny alberoni stetter whitesides widmerpool aïda caddyshack lipset spruces quivira aiglon hurriyet muluzi arntzen junket commentate poeple blaize afgan prashad unravelled immobilizing pergamino polkadot silsbee epling zombo educationalists jockeying neonate hyping sugimura lazarsfeld mko greenlanders shwedagon altun corporative matadi backspin clubbers canajoharie euratom dassel solitudes ransford pugs dépôt hendley flappers pethidine onsets humic afx cornthwaite swordsmith sidorov kraushaar sisteron baor examinee primakov immobilised enraptured konsthall memetics 
ycl aars mishin konow wrp kirkcudbrightshire luckhurst sextuple ostracods fidelia eckard scally korsakoff licit victimless peeved perfidious idir cih pinkus sputtered thermopolis figueredo biton oly vova horrigan aduriz kuts effi corine bakst manderson preparative pindad slingerland monken owaisi toji zainul escp sculpts hudud ivette yma khusro impromptus halfaya farfán mafi higby reichenhall intrest buton bajada viñas neild anciennes matkal opg perin synchronizer pembe communautaire seydoux meader singed wheler caroni stogies majorette thunk spork tonics dismutase submitters tageszeitung floodlighting zich barbaros evashevski cuyp sativus morat terayama tigr osv sipri experimentalism shuttering felisa prothro debbarma rait fritzl alport overlander coade szegedi eser lunged imed kelmendi kahrizak jahanara hedo referenceable están terblanche asat asociacion staw tieng karmin manicure kushnir ligeia arlin caspases squinting lualua bienen audet dussehra basar iraheta jutarnji parth mimoun crewkerne residente frowning pakpattan camejo hunterston rne sarasin gevaert placide baitul demott kuber wigand fiddly trex rewiring elr acclimated klinikum undisguised breastfed hypercholesterolemia micros frolich neilan alprazolam chiklis kuttan carlgren brinkerhoff reuleaux wbtv canina eosinophil ellon farahan hepatotoxicity osvald jahra hoddinott divulges scrawny jotaro waze kripalu asturiana cozzi extern disbelievers getzlaf taihang gopnik immerses weyrich trillanes kuehl hambone yungay pacto zoellick jelling bhool vinho saloni esdaile luxuriously lalah mcmorrow rinna frolicking mawgan horwath ibert medlar bucephalus playhouses crystallisation neturei embryologist juez ranier mashimo wrubel geof unkel herminio temblor valin hydrophone cabras lynmouth ruoff episkopi munaf sabel título revaz benchers varallo jusepe prosecutes lante kaleh lph almudena westies kichwa nazif weinert thermography quemoy hunsaker wangler aloo carpooling pommes farnley krenn widney clohessy hally 
unfeeling rangelands wutai bres fagioli bhakkar tommasini passionist chasetown needleman sampla hydrosphere brer makinson pfäffikon tobolowsky awsat pcso odaiba puyang decentralize concreted hgm rituximab counterbalancing evacuates contrails svedberg bannen bonomo liveable sadlier maxence sensus crisscross grabe newpaper mcilwaine janni reik djakarta mcguckin kercheval rovs singletons irlande parasitologist kotagiri honington mihael mizner retta wows odalisque khela euc hapi welham hellgate peronists iacs metrosexual uddingston limnonectes janny gettys fabulist lapworth pishin yuzhny chuckwagon syncs scones yusif quango yabloko rozario furano gyration wheelies puncheon chaisson swindells amadi cgp expiratory scheyer shortsighted cowpox zohan entangle scamper maplestory krystian ultrafilter klavierstücke gats solaire stw runkle cuerda elvir rangeela breathtakingly widdecombe laux condie islamiah spatiotemporal conjugating ravitch bouchier tsewang icsc yous shoutcast griffo sajama shiraki alizee principale wastebasket freshfield reverdy silvani mentis hydroquinone jado bombshells nork jalabert shoplifter emscher glowed edgell exomars clayworth vladimirov doted gooder limey kuriles salihi azzurro innkeepers statments donath trotha meral earwax statt dorus thuram karney limbless isoniazid tropica forestal wellston noha sullenberger pugacheva celebrezze arpino richet fluffed matchbook bordin jazmin sige tasik saenger corsicans penury castronovo datar langworthy lanolin stolac acas dermody fondant bva alvim kert lutenists shimamoto ausonius smolny suvero deogarh maillol dysmorphic nordwind simko yohei kitayama contraire jsu aught cantilevers vanderlei logout drd pakal unobjectionable iorek hawkey idli sundeep ultramar adbrite abdellatif buckeridge jasim haylie sinopoli persichetti digbeth wallich ghettoization hollandse goitre dinaburg politi identifiably emmanouil thep beyaz kusi quispe westerplatte montesquiou sackets cherishes tangling valene fortman basak veasey 
krikorian labarthe brawny waterspout ceremonials difesa eilert rixon concoctions claudie mantia thyestes douchebag netivot steelyard damariscotta marveled ballona beamforming gunmetal labadze weitzel bossu acec rhapsodie hest necessaries chanterelle spaceshiptwo indemnify chernow frechette champagnat pigou owers rukavina gpf coqui ruto neef kaprow mirisch groo tona petrocelli fondren bookers corinto marvelman narrowboat dalriada stovl semetic gernika stipple mosaad botan sassone acco slapdash tajo mandula acinetobacter mangrum isocyanate pertamina teleconferencing kirmani cowsill mikee ziolkowski helikon jazzmen sebum aubers fosca cubbon hendre ravenscraig schallert ioe polishes motorbooks carbolic fabricates ameritech hoopoes elisabeta iriomote shayla beautifying magnan galangal reasserting reassigning siddhu alife jarnail flom yilin ringnes vicodin dynegy invidious brancato worksite accretionary gregoria enterobacter industrias solt goldhaber feminized arguedas uncultured stringency publicaffairs frogfish interschool osmanthus furrer cld maximised hotlines erni alaknanda roethke xanax telecommuting nauen chugoku berghaus waterbuck reits poveda adcs roswitha pountney nephrotic baci spach tombo anlong oehler stumbo vassell disinvestment darnay gorshin stagno spel criminological gslv kazunari cleisthenes alano provine kokhav azua torey wingrove rivonia fondling kosor piggybacking flurries vigier goulash tinkling cavallini ladykillers cataldi stowaways lorenzini qalam sgb pendarvis zugzwang xinxiang manetti hermeto chimbote dulhania nowakowski outshine fiddled hanner marsteller rhu gagaku carborundum goodtime oecs mochida sberbank fev barfoot murcian rienzo ladylike dafa slotback erice baard btb nobutaka charrière whan trinley porcelains cullin wagar idel prohaska isakson simplemente gern indelibly quartermasters afri muskox nonius writtle mignot gamson debretts grizz franchet kiwa blackhole corbitt millibars harshaw loafer turnarounds fonz mhra aquilegia stepchild 
satcher rietz aráoz telethons personam yushi allergenic intramurals certitude meryem kram madres qalat kickball althaus schizoaffective euthanised angkorian takayanagi interrelation mouflon nuristani isbister encase tosco intimation dji cilley webbs aking adell dynast howerton barrasso acheived tamari reoccupy involvment coonskin formule tansley icebound queenside raffia cockfosters cyclooxygenase audenshaw interactional pilkey histria kadare catapulting devilder tetrahydrocannabinol midcontinent neris zakhar pastorals shafqat tapti wheezy deprivations artemus tomonaga erec bookmarklet faheem buckfastleigh paek abramo wonderbra posterolateral cawthra elderslie seamans ribonucleotide yuxi motivators drian taban isolators warneke mileages bhavna longhua miccoli wilcots circulo bickersteth shakil stylites crafoord bgl gizmos abot reedbuck melick briles pirogov belva lovedale negley matewan eichel eyetoy wrongdoers dimitrij ticaret mown suppers brisker wedgie gasps gilruth unpressurized eita fortson odescalchi cdte reneging golia shammar sweetser claverhouse kameng kalinka sajida doublespeak koudelka collocated stagehands decavalcante schad tiye paratus fallingbostel kans moretto meche yotam delli unsalted haberman sharpener barât abdulhamid mabrouk cryptogram malvin tanimbar batesian wernham devyani acquis semaphores conceptualism rostron blundering geiser freeloader raita tearooms kaikohe maturely chowchilla colada menkes poissons verran apoyo pittoni granato letellier willowherb coltness charito soco makoni saut jollibee sousou weicker litterateur whippany ozerov serrana diversities adminstrative keston stamatis spinocerebellar reisch arenig bromwell iole puhl itx mattern westerland southpark mierlo keskin valcartier columbians deben chacao aliki kalymnos percentiles doner dulany ayoade virgenes revegetation vovk usfsa armé teign okk gruss mikki vasko snak neonatology belabor galitzine magas deseronto quartos karrie blackstaff ethz essayed chinquapin spinnerets 
asexuality epitaxial volcanologist furnival dransfield stapler squirm handprints eucom silcock alutiiq gherkin amorsolo merozoites pavillion hillwood nafa xtr hingley strapless peons zaks varios toupee lartigue lpn farron faur bulford cret joa capricornus polsky landscaper vaporizer rimrock caucasia ahmadzai stolpersteine levitra faulkes kims negrín gup christelle midc bealach networker aberaeron sangay kemptville absentees franciosa velvets goyen desales zandi slugfest wignall pyeongtaek spiritu intermingle rre ieng monzon wangari antwan mcanuff villamor welliver antiterrorism centerfield ical echa vacillated kapok cerveza gloat amadu gilardi penitentiaries gorseinon meursault uberto alya través miñoso offsides yonex wachau kowalewski bouwmeester tanf gyllene lamplugh manford vitrified inspite chiodos hmx conchos poulidor turre sweezy descanso raasay rossoneri pasuruan sazonov spaceballs swilling cradled corredor marmi jindabyne ragnarsson sfk kermani touba tallink bottomlands kinyarwanda bellus mmpi stopovers desrochers savana chunking cacciatore feridun blinkers ittefaq tailfins franking kokang senderos pointlessness grondin lumea boîte satam tottenville inkheart harvestman sorolla stoplight lisson crail buchtel pantaloons toffees nytorv tollcross bigley idrissa marcil ryuhei chadds realizable depositories quorums dembo honorio skylon nkf cenotaphs paddlewheel vipul mazurkiewicz barrowclough waterwheels mehrotra fluorosis gianfrancesco eytan leavey ladi forsook mischaracterize qca jio budnick burkart albe tragedian rondel kissling refract mentalities perrineau krld monsen iacobucci quadruplets bevo ecobank ripp leke molfetta ventrolateral vinzenz bracero alpines mckevitt monovalent parkhomenko embarassed cgh funtime iser tretiak brechlin keenum caspary erogenous beeville genesius kielder obert pfoa snagging arbel paulsboro imsi borriello repetto sayd ciliates tatsuta klock nrbq mikhael equateur salmaan clayburgh giardello kobrin obl baldelli datt bergheim liesel 
bekka martelly pyrrha elad milsom erda prabowo karem volksoper neuheisel keya gortari julin megaprojects canora looy sabas mccaskey sweetgum sepe deba jovica pupo baconian tearle sevenths suero kasprowicz tanigawa cohost carrée puspa kacha whitsett fogging boghossian churchland seamstresses glin loic contextualizing isidora miani jokela tickles quercia middlemiss bellomo peyo arben dilshad fsg arinze tobogganing woolfe akademija kasner describable witting halkbank streich vervoort yantar signorini weide veere mislabelled blandine aparri wyandanch thebe bordaberry campbelli arwa squeaker yesod omnicom weck luppi mequon bluesbreakers mauritshuis coseley crede langside blenkinsop bathrobe motzkin mudanjiang cucina willig tuku exorcised newegg mazin brantly ohioans bronk avermaet salvinorin bruschi airlangga briskin ditson tto wedd burlison nicolaou brainpower santelli inbal tenens polyploid masamichi aeroport heimbach hofheinz acclimatisation seadragon hassling parapsychologists juggles figari mulliken unplugging rationalizations maravillas lolich lelo kohoutek seasprite abdeslam teun anpr yangzi patisserie elucidates maternelle punnett laron photogravure balladur karmen lichter naomichi riitta tarvisio beber irsay aashish pathein passphrase shizhen novakovic romeyn investec samphan torte macer cisa mejuto bols chytridiomycosis daunte markman parkfield orsino slideshare wellborn titty bourret gilcrease takenouchi toff kersal stap colectivo nli silversun jovita wanes bautz risible lunging tyramine hatanaka fuengirola lustgarten revenger tanita hohenschönhausen perito mullican medwick piperazine aiman claesson mazzetti leporello heraclides zuhair copperas propensities nymex babil scarisbrick bjorklund nechayev trimark bitterest lenna vulvar moapa getcha usw backdraft matcham chemistries ambros kneed bruma gammons islwyn stirk unguja goutam stirrings bowa hlc misbehaves reddening branksome riemenschneider bmf acna biogeochemistry itas wrongheaded saturnia marinade 
kuttanad rrp ecclesiastically pecks instil logjam forró digga postulator enl bareli procida onedin filemon courson pulham pictographic azzarello toone osmolarity orji dubya insularity inconveniently arzamas wawona babine librarything nadp ocasio staller umemoto maxpreps alviso zervos magnifies ziani kiper trekkies dcns hipolito tamzin sthalekar discussants martinsen lilford piazzetta jivan dogmatically jiwa büchel labe balustrading woollcott rovinj pflueger tarnowska mortared jiayi trotwood sauteed aabb luckner contestation daruma heta neurobiologist lunney maois hoofs barkha excellencies mudbrick nandana davon diptychs trumpeted nishant scandinavica miroshnichenko socialised infoseek desailly kazuno raudabaugh yelizaveta inputted glorioso folkman cey ccas mitsuya elution hedingham irrelevancies rollovers blacky igoe leonardtown dorjee pawhuska panah viard marylin calathes laidler knits representativeness intermediation santoni ayrault guangyuan heddon ziering tonys hemangioma kulasekara reutter contrivances delahunt chadli disinfecting fulwell roussin chirnside bujang interborough panskura dirlewanger lindzen hayati ridgeback riles bedazzled nordgren rossio neuroma waechter castroville nmd orlen vandenbergh urtext hallier idolize voeux preorders dente mifepristone trawsfynydd intertemporal randles langlais probationers lamberts anding uxo candelabrum damad ploys tarentaise caulder gujaratis vob meinl chazy etchells soz akkar petroc summonses clavin glum zarra xabier bagnoli palwal chenda classifiable natufian bitdefender guastavino pratica tithonus ferit dracut peatlands udoh apic stachel gava khedira juancho racialism chinnor foreshortening subdivides dulli homestake warioware bergersen discourtesy teofil ataris wixom bâ lambing tannahill elworthy assayas glau voigtländer sukhi dispersants newbattle nonlocal dodin okajima tima fisticuffs lastfm catalani ekd nordics zaghawa wenvoe loli thickener kaupthing sojourns wappingers coquet awaking taranis suttle 
viciousness dysgenesis baartman heeney holmium hassocks smurfette leontine flamboyance hathcock kakkar sneinton hajek bösendorfer fieldston phung tuckerton neeme sidewise castleberry tullibardine mirandela obaidullah jadoo lacewing keiichiro harpa bichon bingu tanvi palminteri datang doneraile dhhs kroupa triest papst tresckow talis mohar hfa andantes abhainn rockmore hald necc thielen mahopac reverberations chissano digos prehospital marilu elbaz frolics jettisoning gulik kokko christl schaghticoke birdsville eei helberg rachna vaishno astier jupitus fetting runton wme beba consign farzad erdrich erevan mifare zare fufu goodacre orientals dilapidation ciudades laffoon closeups littlestown hees stabilisers hartpury debridement marcato tfsi betanzos nosey demaio kusch melian keoghan catmull encalada dars policarpo rawness cullom fieger glittery russky demobbed coomer ddot teliasonera kepner abusir estrellita böhler chernomyrdin berlinger brah sreesanth heckmondwike frederich depressus frea officina mesbah bushati pirsig mant furloughed valsecchi woode otpor malcomson miraz hise jazzmaster tovah suprise stremme doonan blogg solutrean cassill cvi florek summerson pazienza amahl lubetkin kassala horgen wades fex nakane dmrc helford centuria abeles kitschy watchword photocopiers lenina hohn volf scarman saralee gonzague lebowitz westphalen isaia ooooh addlestone tallboy faunce suttles dethroning tediously kreuk cryme bingle hidehiko headboard delphian bavo multihulls unallocated sprecher etzioni yauch giovinazzo mthatha sonybmg curmudgeon ebbing teleconference notchback bowsher perjured mirtha soulshock ignatiev nyonya terrorising tuberville revolucion coxwell tgwu jovin ecoles nóbrega isayev bafokeng temer tarini kuning immunoassay processus biochemically guineans fynn outros almanza desaparecidos cronan miracleman silje jsoc pilchard zada &# bricolage iftar noriyasu qingyang georgeson soundproofing dorotheus bordoloi unmaking bissonnette totes maybin mrinalini vanzant 
agbayani bouzid caloris panguni chos hummels comacchio eav naryshkin nlra fontan shovelnose infini illimani cony mdw chiding julep nidderdale flowerpot bachelot mersea pacesetter ichinose kiesler hig factotum minet cassi hahne desperadoes fritsche foosball bacteremia nubes ryutaro laïcité trawled bursitis dishonorably romare patar dehydrating arshile pronovost aircrafts poulains tsutaya dhamar retread transgressing asur fluting heartfield penthouses jodeci christoffersen laroque ongaro wardman skytte comillas switchgrass florizel becki itay chastened vitantonio heinen rusts fsd barnston gorst sudano aminul dohring immolated petabytes ruas underbody lusher altria giwa industrialize downtowns tekapo zehnder masoom desiccant yonan lepine moscoe nottage nashoba claspers bornemann jawf jizzakh chattaway fieldbus wachowski walkthroughs quoit zot shits alcubierre yuliana harkey jif maryan bioaccumulation skillset lyin marktplatz filemaker harasser bathyscaphe balbina rempel kaolack borbon kleinert ricasoli antagonise shuffleboard volleyed acconci shanda altin dominical reinoso roseboro weissmann beckenstein bawtry fysh brizuela improvisatory yyz cabezón ballett jori psalters liversidge crabby alsthom patchouli dissects sucka snicker simonian grandiosity aghdashloo liggins potentiometers quaver gruening galpharm sichuanese zealotry trehalose parihar bodes saux baranof lambro obstructionism usonian crosswalks gahr meti primitivist upul ule ecatepec ecclesiological shoaling corexit latinum urgings naden coudenhove cester heirarchy honorine gutzon contactor elitzur ancon negations kalasin neocon libeled lampre folco chhabra bifid nle auspice climent iguazú ripton whould mimsy gonthier greger vegemite wheats frydman unnavigable voxels apos loggias sideshows gorley clawing takemura gaelscoil cihan lapo htin rabbitte immokalee klöden mpw xicheng schach perplexity xist gugino tuddenham marianela wij shabir garganey figeac arkley gelati wookiee makis pwyll sann petiolaris 
crossbars mycologia calamine lopate exhilaration montélimar uniqlo kabri pras segawa fraulein schleich commerciales indignities tschumi saddlebred totaly monounsaturated waipahu barwa charme cappon grisi unpasteurized movsisyan neemuch tamai saraceno shaws mcgeehan fleche sporozoites kellock risca wisley kundra konovalov fairwater loonies aacc cayzer bandanna redhorse pyr milbanke europacorp beddgelert launderers ullapool laggan hasid pumphrey inocencio shrivastav hirschfelder suryavarman rangitoto falak trebles potentate morada ranta kilcoy etchingham romas swatting rosamunde subcontract pedroza kikongo ifrc manek relais comprehends sleighs verbania laurentide ouahab wanadoo aerodynamicist quickbooks schenn tractarian egen metalworker hughton wll pedregal belgrad creedmoor vierzon hht felber thind mozaffar baboo hrg pedalling baserunning ladas adaptively derksen genkai sungoliath uncoupling fritter prsa bovington exhaling roughead attias francese blart darwinia mcbrien hubner sumburgh agreeably chitnis clarkdale norovirus harbingers rodino hincapie evocations fossilization melor sacerdotal reil lacasse irri silvas wyant shifnal zedler jebsen lemminkäinen pyramide mittenwald rwandans mitla houldsworth simmental ancel egli haarp vraie brominated allspice alguien raskulinecz greaney isard sestri tarpan vaping pirzada makarenko kaylan dhalla unerring larina woolnough mühlberg talkeetna zie delt gabber hysteric bruit hainsworth garko queasy balsom shuey finkle maari deeney otunga innisfree invasiveness sulfonate consuela granulomatosis polytheists staverton roundworms tanimoto synchronizes gerow abertay petts bakir stifles heathcoat snoddy seidenberg ceph dest loures complet schlichting gowans atiku antiga armey jailor sensi snared cavin veldt gsz stradey scripter elsbeth kaliko carrott egea skint infront barretts aaib hanus svidler uncontaminated paralyzes pandolfi mcglone shifrin aftercare okocha allandale duchin tosun uremic wavers hamence dwc bial pomeranians 
mehring jamma euonymus arveladze blaire gwt daingerfield amniocentesis tetzlaff lfb washerwoman wknr tapis kmox aztlan aravinda faget boel lhakhang sambuca catabolic paintballs highwire barsky millerand jiawei acsa jula nte nethack somerfield dessler repartee woop guillon cpw rolpa streetwear lewicki ungarn grav caractacus romme crestone radiotelephone pilothouse floret oco naranja upali farfetched audiotape vittori thingies hemostasis cornetto hoerner deschampsia abuelo fazilka reticulation hertog jarzombek vues dardo kastrati abhors fernande fridges colorable karlson pledgers troodos forstmann kvn carwyn starshine saretta maragall rockumentary digo stewardesses zimin compartmentalization ademola maradi pianistic meskhetian socceroo reder communions dumars sparser strummed helal macrocosm stonecrop goeldi schönebeck santillán coale apollyon howat sherer rushford briens whitland coia habituated londons stripling munidopsis dispite courseware dogging greatbatch shoeburyness twyman proove hanamaki qura shepway admonishments wri rostrevor rowsell jeron montereau dawat saronic cruder jackalope rotoscoping luxo cholangitis tblisi klosterman wickremesinghe meinertzhagen carollo iof unbranded laulu kleptomania morphologic oxyura muette reye mazrui functionalization worzel girdwood tidier bhrt jata availing veeder yota silberberg sheller unswerving sabat mosaicism hano stell donachie munier vaida laius bischofshofen plattsburg cosmopolis brg hadham abdirashid laak kelurahan edip leatherette kavan saravanamuttu rulebooks uchaf teatime nies courtin compressions daubenton curbishley obr atallah grampound syston olufsen rottingdean photosensitivity panisse rustem stana trudie judgeships landraces xinyun mimeograph micke zyuganov wadleigh quilters closter varin sinopec faried gallos mosco albelda hermida geigy protoceratops malyshev onix maplin gses monumentally zittel proops yatton vizquel habermann vosloo dovercourt baynham ushkowitz hazra sorokina wertham jeanes chaudhri 
muad perowne woolloomooloo cortelyou riserva ifb sketchpad recertification hakusan prosciutto baitullah zoysa collates bernadine prigent undernourished meadowland palmitic adamek atrophied japantown rispoli almansa levelheaded despoiled hanham cdep sharam hoggart criqui cipm gondolier nicolaides coonhound adsorb padbury krenz bilawal callousness sarmad radlett dreck maring cnw eliphas massingham danseur bywaters profiteers crystallizing missioner clurman earthrise delict duplicator thira safflower mkp bahawalnagar shakoor tomfoolery arcis stankiewicz sowden illiquid bailee morakot gordano buttar ickenham relearn rantau amichai onias waldon helleborine mewat sansepolcro milion peshtigo bromides litle clodagh puppeteering girma gotoku abkco jakab opprobrium verbruggen dachstein beeld selfhood saci riopelle beckles toggling superstardom pilaf cruzes jarvik kellgren himba pinard uniq pieler overestimating chelle southlands menses nemea woozy ordsall jamo nishimoto tmk tobermore dharavi quelch skittle sanzo rosaleen fascinates loosens hapsburg opalescent fpt unexciting vacates apas tiaa optometric bentwaters devalues sputter jobert asharq itzehoe groner unassociated peñón trumpy sollers aesthete ishioka mangels arsal pestana lifesavers campanini ondas aldie microflora mortmain qvga wednesfield caddis nipping businesslike fallot syeda frederique coor extraverted unmil nzd acceptances bokova lockbourne staudt insipidus bpb dehaan mannino tailhook aptera moncef pointillism chametz dgh ngt sarcomas laogai tredinnick zeaxanthin thielemann dcms sheerin malibran phasers innerrhoden depressingly hurtubise gatow bastardo pullers blé archa tatlin sudduth castrati giocoso punctate puryear jerkin hertsmere mebane creutzfeldt moorside chanology evac weyman zaur portier esop microsurgery talton vore ruah militantly blighty peppe szabados tiangco tangaroa adbusters quello honjo tremelling paraskeva tawas keelback dorne baumgardner woodsmen doeschate lgs dpf redeemers collimator heuser 
canino galles mammuthus shajarian iuds siber septicaemia kishwaukee nakheel micrurus udrih hornberger thel disaccharide kawhi kpo sedes vanderslice peccaries wardlow clutched adhoc champenois fep forgone burntwood minnesotans hemochromatosis arsenite hypocritically gabreski hanin guenon reitan funderburk womanising balochis cabaye tabarin kapanen jablonsky angstroms stubbins iliya keary atoning cammarano reptilians paise decreeing vlastimir zunyi pingle bovo dauber itk toso mortgagee diné cccn shallowest mooch planed greensville ceryx interconnector dismantlement laze subbiah heswall groes ignominy krul berggruen reptar brygge foulger ganser danial achy impaction tahil beckers bhang mainieri blaser veoh pmm parratt unhelpfully chandhok godda plumeria klute hesler roughs kostelanetz hambletonian srw elleray siga hingle ays rantzen aymer penthesilea trekkie northborough peiffer mcgreevy montrachet kuwahara ello ncba deverill eells basim matzah deichmann soldiered duminy yefet bratwurst weinrib bidet killingly margrit iip tyche duany vacillating hinksey gnn suryanarayana accokeek unexperienced nalbari bottum angelita milankovitch golota médias ewc naila perceptually swains benckiser baral strandzha jargony termer subbarao noppawan instructables cholangiocarcinoma diwata hypnotics ptf veu citalopram sachsenring paleography ognenovski bignell neuroleptic pettifer mceuen quantic senden ryno argun wicb knebel contempo skyy chilpancingo salang pridham souvanna dowty attentiveness hiraoka spearfishing pdq unrealised apds kefir cudjoe coactivator kopenhagen kharaj hsan naiads rolette aparently cantero purloined hechler vautour maccarone shrewdness woodcutters ihop desmet shuisky garlock waiblingen nenshi mcisaac nyos majeur elmander tava chifeng custance brookshier snelson aveley leterme escamillo kjersti lycées tyrer lashings benwell gardée wdsu kozloff sexyback mels yehia gilmanton supaul yarkon promulgates purposeless rnib iwabuchi xueliang cecina murmuring brigandage 
garraty rodge brugger cyro amta saurin flannagan beckie latticed llandysul quelque bakhshi destructed jinny gordin scribbles baghi parodist buea yorkie pench nishihara rolleiflex lurleen draftsmen arh thromboembolism kerio neusiedl sathe kapell skyclad carneros countermanded whitham broaddus schoor owo burkard padar ecthr linhart flatman abar livecd sohmer stonewalled cribbed axilla kobuk uxmal emmott mcculley pbt gabriels robotically brangwyn voulgaris strebel mondino cutback vivante firaxis unapproachable ktvt heinonen rasping raychaudhuri goodbody yamanouchi oot arrester knibb lydford deroy dowagiac zelinsky icograda torgersen stefanovic shinwari midhat sfe dejong miff medtner calavera myrie hott blackfield besta helenius yamana disbursing changa pilato azulejo muneer naschy ramseur géraldine squishy jiaozuo dongping partis hansie cedarhurst unpolluted negritos uberti mlada datapoint kersee deckert nedda armchairs luxembourgeois replicants murgia meacher doyne entrepôt pflugerville colleyville joyless predication bayramov preferrably nvm epauletted palmata ketty mommie kollek interdicted paree positano nunley oyelowo transalpine baguley bellowing parabolas monalisa chirpy cambell tidus ungaro soltaniyeh redoubtable toubro hagemann sumrall nobuaki wellow ezcurra whateley spliff chorrillos asprin barash hamri ransdell evrard dotter reverently malha chittick chromeo mte kristaps winnetou marjolein emasculated medinah papá stinker arthus tolomei lissner bagneux hexavalent arborist wintu drabek ecuadorean chucked madrassas pyong holofcener margalla novaeangliae prioritising urry ottis mitteleuropa renteria pve ameliorating yglesias pattillo nakia ruen aniseed furner senne nnp gibert paranhos siegelman politicos sourire blauer glided phosphoenolpyruvate fortwo haxhi seleznyov barefooted cercla danaë berendt nurburgring chye larrys abdominis asphalted fightstar ronconi ohsu counterparties learnings grondona vandergrift aerospatiale benmore speach jacklyn madrone 
relevantly nonnative amboinensis grayslake horatia metatarsals supercharging thrum rushmoor boozman chauvelin pertinence warboys willacy pickton taboada narula flintham rapace titta bottlebrush vuong croupier hoogenband soroush strapline shirow ikuko kuusisto boonsak naci desensitized boingboing melanotan yeller derivate internalizing giganta putbus schoolkids ondra edenderry packington purifies selkie cios mkiii slytherin visualizes shortfin geoje caddell undeciphered pradera budan robertsbridge mowlem ismo pyogenes konchalovsky mirian trivialize sobhuza pitons monna riesel toltecs triaxial dossi moonbeams panathenaic bonariensis yagoda brugha spiderweb sorriso swaythling furrier monsell viglione critism pursey gazzaniga tomalin leafcutter handford succinic hoynes belliard qsi cerva muezzin emendation hoonah xiaolong righter wamu loughery lafortune elum victrola yack kaisers bunratty mccudden julissa schepisi vics taruskin coalisland gracing facchini zhengyi ennoblement maroulis riri bge anglogold huddart batali vesterbro lesabre cottee yoshifumi balsall siki peridot fisc lages gardyne averred rochat grímsson nauseous sposa cdv mainlanders posillipo panfilo cullowhee sorn cime rohatyn hlb thunderstruck clonazepam niskayuna tenko italicizing mazzei interdisciplinarity peyron petrina melter mirepoix shat bjs jrr hostetter aabenraa liveryman vilifying khaya castillian conforti bivouacked sogni amitai wariness blunts brasileiros figes charrington nukus mni goudreau goldings persuader lilt alvy grifo kerogen aybar withnail boral colomer ristorante helminth verrazzano rej imb nephrite contraversial distrusting chinense gattis bonnici stanislao agogo qaddafi burlingham gild muchachos kubin sunne saadeh axelle hostelry meixner marybeth alehouse tabaco herewith spetses maitra begay mofa sonitpur schiano lianna gaydar zamba variola smouha pirouette siôn gwi abla woodhill domesticate bondsman brimble sextuplets malene edmondo gardell biocompatible louviers tzitzit nomani 
tribbett addi mokoena unevenness gampel moppet dusable dovecot cockatiel peepers golic purines rakai sharking chmielewski kinkead lackner fluoroquinolones bairns cpk backroads tianping casl iroh chamdo cluelessness pierina beru industrials altec harl thé rondell fredette hoarders gooderham lancefield solidworks margolyes gasparri langstone vedomosti ecclesiasticus redistributions surfliner deshawn diaconescu utusan undersecretaries bajío huggy autoerotic sirotkina burnage syrupy crucian bullae lanman onley johnno gédéon lauber ifeanyi adhan gervaise tcherepnin woudl jika jetter kemmer consell transgenderism glatzer curlin söder puerco conigliaro leora chelly epix swynford lievens vermeille tansu miska shcherbakov bdl chahta touchwood poveri cuma marcks michiels shakespears labview cogburn leaseholders posehn stree patersons garett underemployed onorato leeke fust selenia nutbush olanzapine mkk convolutions halffter rintoul caméra pagès hendrich guarany bosson caltagirone jamrud mahovlich varon olhos parul pitchman minkoff callaloo mutuel utsumi geldenhuys greenacres acheive rosholt solier nonferrous synthesising ouzel squeegee ecomog presaging festing juneteenth bière frewer trifoliata whirligig astrud diosa staggs piv guldberg goner deeps eola gigaom timey finden jacquetta stender capacious ciders thermosphere bienes mcswain stickiness rainworth attenuates schwarzman halvard eidson presage trafigura federacion ostertag batajnica adjuvants chromic howze farnam passionfruit fmcg guille abjuration wiregrass quat stegmann bedfont qsm felina danel najar rasi unawareness sivaraman cuan decries dyker historial absolom barnby hwe kahuta danilovich bokhara viviers laune castleblayney highgrove frustum crickhowell devall airdropped vaporizing höfer decriminalize sullinger victoriously ladenburg yashar rummaging cootes pittock touhy poręba ashtead disassociation kishtwar ptas jasika transect bfw rudie dupatta crozer moralists permissable cassiterite shibh stanwick 
sabbatarian disconcerted fresca halloway kirsi shraga thymic khatoon heidari sublayer ehrlichman hypoglossal graniteville papert laager wroc rosolino perivale icefields decouple marcoussis natrix ethnos bookended moton stenerud kemah penev frizzle tunick japaridze lourd sicut sels kishon uncorrupted narine electrotherapy transposes bonnat significations vido cufflinks neuropathies fpso damasio mancos balestra kureishi tomassi ponchatoula downrange anorak breves wherewithal monolayers endothelin plr amash construcciones mustafi girdlestone saffar prx olivar gutt neils willies psychokinetic teradata fowlis kinetically nykänen maccormac espagnol undof darfield disbelieved senger robley dlna hamit mtsu kensett muffet dispersant porpora moundsville manxman wikiscanner cevert schnur sirdar holzhausen iberdrola faites ungerer rigoni werber rache dimmers googe meatless krank cregg sandstorms wsd taconite elbowing blackhearts galimberti salan keiller hematuria icebox babka homeboys sterk pequod mgv toybox wonderment lastovo akst móvil unip saclay pollini guill wplj criminalise schibsted hont isakov gerau fuehrer sendo nyle guerrino janicki hypothesizing sleepaway pillet grimble bazelon wubbzy druidism barramundi iwashita pich gerolstein panico grella flamin majerle vallées althought hilger lizette ngugi sensitize diii agre illiterates bpmn stubbe domecq premeditation frossard vasarely deloris categorisations phalange cornstalk milde cyanoacrylate shands shibayama niemand factitious mareth unadjusted bpel dhami fatos imv porcher jeana hugos rivage rastrelli stayers yippie bindweed kudlow kheyl laner iturbi morels agd tomiyama shankman quebecer económico amethi merstham finicky bundrick wicki landolt kulongoski uglich stowing kinahan jagdeep cullis castaldi noahide lazzara laboratorium viveros mekons neuromodulation ginther impérial friedgen filner undemanding farndale sitti foyers algeri voiture iddo gantries flik tibbits elastomeric claytor rentable orascom merli mutans 
killingsworth flikr micr bringers nomis truces consols giana gammer naqi crisostomo pkn kesa sanjoy tiegs bodrov treed yuke northcroft pdms crewing endel vedova masorti semplice cetti halk montoursville blanke khattar ownerships yippies mear richi limites glasse drypoint vaghela jewelery westling barcelo asteria mindell langhans optimo lojze nutters noncompetitive budgen valeo pantries yingkou tapsell aaup johnie vaginosis grot madhab hongik wordgirl ravenwood morcheeba skandia chaga freemans ifvs magome tarwater qum fabregas villacoublay fluoridated buchen wobbles motorhomes incoherently sternwheel stows whittingstall asahara lustration metanoia fichera chikako edgley emrah isozaki korol agco stokesley lazzeri kado diabate nanomedicine ivete civilizational wracking zodiacs hede lasley livens grene parfrey tropfest judiciaries handbell glossopharyngeal unzipped chaat heliosphere fulltime harpreet deflating conceits ciri henlow delfi horsetails blacktail blackhead attala applique vva kindermann neah ritualism rhagoletis bolena fenger johanns filmore paynesville diferent mccaleb challinor fibular acount inseparably naslund schultheiss dvl puits halcombe jetset avus cjb beatniks bowron baars mmog penology fortiori steelpan ungovernable hepple conchas yalla daydreamer arhats tusa generalising reinsdorf roundups sandile varroa zocalo pitying budhi ducos knanaya roscoff mccollough wsvn accu incat dipika soward bellmore techsters johal eig ainscough chunghwa carolee arterton severodvinsk orishas sicker henton bharatha taveta holmquist tost nezami conjunctival talu southesk shuter artibonite docents rumbles malins bmws karume bedbugs blowfly haik melisma heri markwell momi zwirner blackbox marnay nitz assouline fcps multipart desroches achmad intranets dunois weinreb gehrke damson annaud jered mongkol waxwork hocker lizardo polisher lisnard chitosan probables bredin cabalism polyrhythms hbl adulterers cinequest causse cederberg paea mias hekla sprayers ichijo cbj rodos 
academi decisiveness frede khater ashu mohnish tempeh kinescopes rowlatt proteomic fixations biocompatibility coring tía bullhorn oakleaf netsuke marijke cosatu goldston goalkickers islamonline guerriero regraded takahisa lucene byelaws tiant kcra dalembert rodolph hiren nahanni edelson turberville cleora enticement apocrine visco jumpsuits gigahertz bloodshy paleta chines saraj norling abai wilken durkan vigen fashanu emelie livadia bixi miscarries lenart desprez underland kirstin denisova venlafaxine moco srbs pricking katsuo greca mcgowen dirigibles northlake attwater gourdon krabbé sanlam vermiculite settee baugé toggles mosa carmon showboats werde sandworms pacifistic newfoundlanders oberholtzer tiebreaking flr umbridge bandelier niassa feiler ytl shevlin sestriere splotches buffets upwey uim integumentary souks bollaert femke supercede noten rahall graton mafa yaser tottington wiart fajar ottar zenyatta nondescripts kowtow seelye fehmi tribemates kjeldsen macrocyclic myelination wrns ramlila beilein emotionalism lithonia hubbub hedden maysan mukasey olerud streetview orthographically junked animales persimmons harrisons flec sinfonía doraville revitalising klingberg gabra spac redzone augury turbocharging squarepusher brawled anttila berkner ridgeview nastia destouches fanni mng dater tathiana casos sours imprisonments rohlfs akutan kondratiev overachiever aboutrika inundate intemperance hanbok superpipe arsons weekley neurocognitive rowallan officiant koru ladywell boschi butare voorhies cychwyn modise celie juvénal tolomeo valvano accreditors harleston uhc redbox collude doerksen sahlin mcgloin biomorphic paulk lazica torrado didio meia vasistha chantel miyawaki mccorquodale fuit outliving gangways klaassen xda hobgoblins shootin sherratt reconstituting trotskyite btf severna jagiello hesa phreaking bushcraft sarim kungfu kalbarri fdu salant todes aysha sephora vautier diatta soliloquies anes puchkova duffer lynagh petrick saccades velupillai esselen dodik 
fleetingly ausaid biasi concarneau hiten desmarest sinmun sentimentalism helming medavoy dissidia lette relit macrolide hobsons melbury breytenbach warryn lavanya anthracis zadora bahonar makurdi goodloe sarcelles ylang edmar koenen qbe shaham conlee barometers politti trashes godo grechko pbworks chandon antz branwen renna visnu birns mogis socar garnishes roj montpetit ehrich veronesi clubrooms duruflé boganda heyn aboutus georgio gallian fixit disavowing schoonhoven zeek loganville subcontracting noach debray nizhniy drystone shangqiu westerfeld dgse talaud healthsouth minuto nephila chumbley linne belka whitely bentz sutured ikuo wohlfarth shaoyang allograft bashiri hypercalcemia parolees carcharias eastmond certifiers lacerated tiz dovedale falabella genotoxic belfiore mocky mechel weatherboarded recieves fixin luddites waht margaritas proeski houdon sasan citgo vilmorin nathanial microvascular kimbo thongchai yateley ineffectively karkh bsw schone toggled bankole synephrine kelland vetere perloff mitsotakis doud ambigious mikvah chiki veysel shearson superheating casuistry nacc necking territorials nonuniform lavo loiseau farabundo folkes leykis arenac bzw danegeld drazen misquotes craniosynostosis holmby goos aqr jóhanna toprak succesfully husa segi oenone derides bumpass cwu rosenbergs sankrail duignan hamoud wbez ebf gummadi sitiveni ostentation jacor chilworth katee isotretinoin techo shelli ardens burgage papillons vgf ustrzyki padamsee peyman cowpea munenori argyria ludgershall wedell matsuki quixtar lekki lochmaben xfs hightown kujo unsweetened cabinetmakers yusufzai boeheim annulling amec jitka execrable alvorada wittenoom adecco deadlands chivu marasigan nuestros broglio mayme swiveling ultralights walkup clotaire contrail vastus crinkle villafuerte preservers rehana flagstones bellfield undset intial wicksell briars locklin michot boker chhu morbi borgata almy leonowens guadagnini polos colorization poquoson radan christiano chicoine beki kouichi 
hya kitzinger deadspin codnor exorcising assimilationist gallone unseaworthy spinsters dulas bpt arrol binfield seeburg eliahu bunky paatelainen pheu inapropriate sisterly muldrow baggies gombrowicz sidesteps sixsmith bryer paranal dunkle bouazizi ktvk ditmars maclise perlo rippy dulled sdv shyster epiglottis wurttemberg loyalism resourcing duikers titelman teuber ostad baramati girlicious guofeng gimmie firouz opilio kolesnik aguardiente kcop perreau reinking theorising ayas korth mbia publicans elizabeths fusible fdlr bahour zanjani reprimanding wainfleet enshrinement kofman masso homeopath bascomb fastballs goroka goannas diehards bloxam crooners speakerphone guiney tatsuno sonnabend earlobe ordinariates wbap lightens comunity scholefield schmelzer constancia endodontics lattin proact holling jarek ruses conze levassor dako machos voraciously unfruitful braindead wcu penglai uneasily pcrm atomistic crepes portugues almgren truby belter annihilates peonage seccombe tŷ laist wardroom arguta mnac rolandas laigh chittenango selfoss farquaad groupama eluting mefistofele nifong whitesboro mamer limbourg kanchanpur finan quwain roeland naches kivas geus tanzi shamba bhuvan ballhaus betton draven ardeer kempt sharifi hegan ontv prominant siegman hattusa verticality borey pieman piñon burgee satirises stretchy masher doubtfully garat bbt evr ketton rerecording terabithia kaycee outranks tennison mocho moskal chickweed gunjan bénédicte catala metacognition lnt carrickmacross nixdorf buffel czajkowski iatse edg priddis unidas redouble vaulx slammin pretences touchscreens pervious evernote timecop blankety pudge cabalistic yamini stewartstown coaxes armonia sellouts yongfu trembled bridgepoint lactones handal cfx meka jansky mickleover gutai commerson nitromethane smarmy bookham swigert ixtapa asobi cono heyns toome trumping liborio canan finfish maccabee sanur eilenberg urt eglantine gami solders alsea silverthorne soubirous burgmann malvolio batsu mataura rediscovers 
cyberia gewürztraminer farka outpace wenche oleta bocaue shalford recoded profeta mocambo deedes monklands oir parvanov shaya krop afflicts corcuera repolarization jamón nagpal matfield lillies dicarlo rodinia cogently balsillie gallitzin englishes rahmi castillejo checkmates ayatullah majalis termoli mgd alleman noseband dunluce laia adelaida dahrendorf reworks shant bobbed gallinari hammerbeam taldykorgan ultimatums caple maryon bungoma killybegs unimpaired blurriness heijden waaay stolper michalka chongjin bushmills gamely knaus manian sapientia kinsolving transgene steffensen shaara mannesmann microgram delas karakalpak sofya ciclosporin methvin infectivity irbm draves avari henlein javafx dauphins bolloré badon gravediggers leichter shunyi puntos boudou sawicki doke plenipotentiaries appétit untroubled donan turpitude yellowman deano jansher kinkladze jacksboro haryanto corollaries karz rhead roguish liran seawright morawski outlands barcoding caudillos babergh ijsselmeer marjane onenote confabulation pople kidapawan scrying sabry wsg ebonics sexiness delton upb duckweed chessie blitar tuxedos taveuni replicative kununurra spectrums oregonians lfr guymon presqu steny tainting jaarsveld scea raviv brigida semiahmoo deconstructive liquefy clouse bolinao butterfat jovanotti travelocity hackford ayuso meghnad marinaro inox odst weidler pelini aeropuertos kettleborough brownfields birra arantes kdfw gennari dollie lotter lynge nautique luhan lrn szechuan struthof heg ebell oxfords jahi schtick doradus murasame fihri billig passionless eliminators dolton janik zilina snm hoardings okinawans dovetailed wennington elz kiger inferiors chromatically rheinberger heyde nedelcheva reapplying gleick tarnow fedir cedartown moschino xiaodong drawbridges trabecular peeblesshire sloe lochinvar tejanos lyonne chiodo boit railfreight shapell trailblazing koral ajr picchi bagerhat hyp emdr ferrario minga remunerative saturno effaced adzes preternatural coeruleus nabulsi loveliness 
soltis wttg burgdorferi snips pangkor ilena desportes gardella roamer foale provera tempests wormley sconce rimer ryegrass carto schapelle amte abiko fgr dumm mudflap lethally fordingbridge kantei agentur developement mandie geoscientists piccirilli threlkeld tuu prados valtonen bruff fairlady sobo breckin greenhithe gearóid symetra olivers assal luuk salaria alcocer dausa katri yindi hockaday ochieng punchlines ashtiani lann winogradsky saddar bunder footholds balderston skai gowariker shetlands vermeule zanotti mistle logbooks ulladulla equalizers bently afpfl nagayama audiophiles toan chalybeate plenitude raiganj brabin titanosaur buffed haranguing snellville mangu taipower graininess stillingfleet khoei klages weymann hsf nones bioidentical nrcc edendale dorrego lauritsen indicting kymi inès babia kaida myelitis hlm funktion samueli miscreant capricho pasek fabares draconic väyrynen acutally pachauri restocking atelopus darel nghi adenomatous defeatist ndong shirted mosuo syreeta jewess dornblaser bouba inspectah muggles eyespot cuticular deodar duprez shehata shahed ypsilon aseem deayton heitz goncharova orko carma scitech oel myres mintoff llibre ryuzo uberoi photoshoots lable messala actinic baronne mullock hinsley dolphinarium polarising embolus biehl laie sativum sendmail bergere reselected harbourmaster teensy mattu bartolozzi vivos radovic fevre sfeir trossachs sachet anklets avic feh caseworker asadullah arguelles corriente penrice quain extempore hamdy undefinable summerton churchills oursler rusudan kere shumpert chiavari relator kimya bino nocturnals rse quitter czarina rotch garbett umaña shlaim adobo eddisbury aini muttered kronenberg skehan kosminsky bibo correio zahran cutlers nabila lyf beckingham leuschner raees rawsthorne subir desecrate corriveau meteoroids jaideep sweltering hijau nightlight bema craigs anubhav naunton asbestosis matsusaka pilson schendel faten earlswood aldona gujjars tygart proteoglycans rebuttable azizah laven pirbright 
nru jayce souverain futenma pame rathke aoshima interweaves temperment penley pontin piché indoctrinate tomblin protractor codebreaking pettman hobyo dule killjoys daca epeli kenda bransford translucency exhibitionist lasswell cooperman preethi polmont camac dimidiatus mizen forktail riccione chac zandra thomaskirche pingyang konarski angelicus rominger bedsit pharmacotherapy liaocheng verco aidoo przemysl maiwand longhand sheene leybourne amrum callery waybill bradleys manono hypoplastic aykut arencibia florham moneim periodontology adriane penola gnomic outlasting speedier leijonhufvud hsls cremate gushed kolin wensum bittaker solukhumbu babek piggly mantello vha campero weisskopf givi whistlers recoils overestimation rheged appearences paps getup strader misse appstore unimak hewison wiltord hypokalemia punctuating ekins hocks striatal clunie beaky chowdhry noia nextera samani mealing awana saldivar zephyrus dickstein ewg coppel sessue sirois eckel duerr goforth itg yupeng barnhouse crazily facetime hinchey orac pasdar ramprakash laurindo garrels zatara obscurantism bisham laudate demolishes allsup samaraweera toux carpeaux mikolaj namechecks neti endecott ccv hurtin serafinowicz towners bickerstaffe rantala divis theorbo lowball peroneal cerdan kerzhakov monas kiritimati foti rostenkowski sandblasting castrillo wireshark fawzia deterrents bended jamesville monemvasia pogge nadan flashover inal misperceptions amsterdamse theophylline preparer philanderer salzberg eska thromboxane halman klayman brade conversano pito filan purcellville deis boxee whe itworld coggan coche federale guralnick crawshaw petropoulos certs dld abductee askwith tokuda tenaglia samp bongani binah muley outweighing ermenegildo mambas ndaa boulos roditi synaesthesia yeston volkert abadia communicant lunatus hoodoos hartle nuru ecorse scaphoid scdc repossess medinet zookeys greiff oligomeric uttley nizwa fairbridge ddf komu armel natation lanzmann metlakatla laforest incomprehension depuy 
amamiya fatten mazzocchi intranasal antwi volleyballers gadgetry evenhanded enp taimanov maeght darwinii transformable tacy kolber nowlin champaigne romanée olafsson waksman birthe westernised trapt whereever meike serica antartica engelen giard penck anticapitalist disinhibition herbology winterfest lovells valenza wmur concessional armorer quong yasuaki sft kiana tragical cashiered lescott guben marthinus havasupai kmsp agonising steane gwas paschke ratri baah koskela politzer trudell wendie mirek waiouru polyrhythmic moffit meshach spellcheck shafto farnaby uthai tappers unalterable komala mris borers inhalant drivable nishad lampton aune fior polyandrous hypnotherapist pilat norbeck bte agonies ryazanov gittin capelin notaro makina fastenings kalra alfresco sichel makro lese sacca desiderata djan klump kulla simancas diamanda maleficarum restock haberdashery malakhov juggernauts chads seyss lycopene morsel schuld stratotankers makita zobrist hengshui atlantida stuhr carnavalet vatos fertitta shujaat ageist oastler jom aloisio jayna cariappa demountable eyeliner mwi hudaydah leawood modul natcher bobigny adia starmer dagenais cound henequen interferons uwp podujevo cargos frewen mehreen sellier tuberculin moerdijk pneumonic inos lyoto hcd peoplesoft seun unomig largish sleng clyst madaris sodexo underreported lovestruck matchings repays wudu eurocentrism germont desautels beland torito dekel willcock welsby uncompromisingly trichloroethylene lakha pohamba nich indec confucianist chipstead terrana tanqueray sanhe telematic portents seveso contostavlos politican hitchen tiberiu teneriffe ecri sews chesser boisvert sophiatown okonkwo volcanically harfleur bodman orthotics melhus vaswani samarai pipp kawakawa recessional lowie rosetown elvan kufr brushfire mehler floridsdorf bloodrayne punx nihang tutankhamen bayardo normandin puebloans colen waage cursors tugendhat lammy karagounis luchetti cottingley shunichi parce roxanna panaro pengcheng convoyed badarpur zafira 
eha zaventem verdana reser pagés zhangjiajie expressjet gerontological govindarajan rtmp zipp alroy doyon moonstruck tsou newlove kitch strongsville capitani berghe luteal gliomas zenimax propoggia cinar financière trivers presentment maldivians motril metasearch workshopped chiru norouzi ajla torme pennsboro wellknown atapuerca guage sukarnoputri iannone bober autotrader schreker gaydon thoburn makaay unarguably vicenzo hammondsport mirs evia automat pontins cloying lensed imaizumi momentos privada alwi wearisome valdo fierceness tatupu tahira crossways ashcombe dantzler probaly cineworld netra jhangvi ilker heister okkervil salbutamol comprimise shahdol minott almunia childlessness temco lawrenson nomenklatura reinvigorating misremembered assonance farington keenness entreat northstead logi lumby hyneman morganti worplesdon afcs monopolised atau reinvest pitiless dubuisson onlive diederich golenbock porcia convy montross madra tweedledee madson goleman zaun parminder govett wanderley oppressions italdesign madjer stuntwoman fulks abendroth painlessly despising tobie dalgarno grupos painswick harkens thiess paled yorgos beeper zarzuelas perfil teppo modernista lishui ganatra cohabit hotelling binzhou takato chetak vicks litem bleyer complainer trichoderma ritsema dieren faln malpractices wearmouth flamingoes telepictures dressen ppaca herschelle leitmotifs stoltzfus schieffelin noticia mbah schizophrenics guiraud dimmitt zippel puga fuxi ejaculatory alagna chafin salvadorian parkey kable moed leam clouet bahuguna experimentalist comores hypertrichosis beardslee harriss milks lubezki ravings ketoconazole makemake jenne serguei cnx farahani donzel flamand meurer gades kaige nishina bizzle tendler fusa mckerrow nakamori burmans manvel dorantes trippi jenkyns connellan aspel didon modarres shucks foisted schelotto whicher calibrations nyein bretz braaten namiki shpilband giao sindy thirlwall someting paraplegics scuppered paramor skoog koibito tamponade schaik 
allahyar slipways hareide bridleways beckert clandeboye upavon reallocate gàidhlig kratzer immensity tweeters reintroduces saghir vmas mcclennan fourfourtwo terrie nankin workmanlike kentwell gamel clynes suja darrah rottenberg haidari lattuada rangoli mahra neurophysiologist autore piloto veerappa baró strolled szubin serletic fow barnack titcomb lihua melanocyte nlg claeys clausnitzer bestor cleverer coloane moghaddam callejas ibirapuera mildren bylined nickey rejigged reacquainted aughton oppel tendancy borjomi dziga gladman darci burla bashung concussed geth gitai battlecry anari sodje atlassian sja warroad unde vakulenko triplette untelevised christs battledress tahiri baban annakin imagers hacktivist ideapad kruis gowerton brantingham randow aimo harpeth slutty sublimate florilegium osyth crowdy latynina gandon atrazine kibler tatian paranthropus optique soundclash jrb hideko pichardo accuracies kempff furosemide simcox magidson komma boroughmuir bouvard lindsborg waterlogging assuaged rookeries stigmatizing kleinberg eccleshill medha wismer lader bagnold devadas photodynamic fathy mariazell narrowcast mallrats vannier leniently minutae speakeasies upperclassman coontz grooveshark ishizaki uggs solaar elantra grabner danz jyotsna shrivenham kolwezi peppa nicotero cerrato nazri mrta rezko raisman bobi halmos putamen cimber lepidopterist ewr unreinforced beidou outstrip trailfinders sinews treponema normalise kanne garima fbf tellin baima bolly blassingame zoa holcim matthewson helgenberger bluing ldcs catchiest frontotemporal bunkie cintron pastrami rarick copyrighting obsessional chilmark raif siders craggs outran wakatipu ikey gey configures stubai fingerprinted beamon laslo guingona kutti doggone takfir uriburu keet gameboy hepzibah effete hyaluronic versteeg dearman usareur schelle verbeke mckinnell guerrini delis unfaithfulness slingshots tawton rabha risala namer franzese fresenius kupferberg erbakan bongiovanni carousing palynology pharmacia xanana kyne 
talo watsonians warte serendipitously coller cespedes gilmar hungnam globalist multistep rubis kingsolver deepsea byner samaya laima buisness orthographical signallers diagon dunrobin kenzi ocker wallenius paichadze teachout mckillip demagogues bodi boxster ilea roha pentridge farmworker peacefulness belon sadanand greenburgh tekoa disablement wayuu verdone marocco tarjei erwan noemí laks pince verdonk usumacinta yoan goz kafa hephzibah sigar brioche ojala ophiocordyceps italico snowfields mikkelson regge falardeau obliterates jowkar coolock longwinded ransoms overwrites propanol spaz competencia transhumanists lattanzi brecksville cleanstart colunga chasez yanko reshoots veys zaps yeronga trefor ellum huffy pagels forensically moynahan akimbo capitales dmax cowering taints seretse macbain ceferino arpeggiated kroto ringsend pokhran siver manzanilla overdriven downstage rubinson texeira madron gohain knope casanovas abz kahala genscher cabazon institutionalizing positronic shewan tubac heid levertov riv chandrasekaran jakobs mesrine sabia pebbly karama jole mispelled playaz newsflash kolleg barilla cloonan cuello undisputable phosphoinositide haacke iliff beggin nunhead bienvenu lecky milliliters bachelard amaretto longform syal løkke internode modano niceville imperiali lutton levchenko naqib greenan dharm brattain savall subbing autoweek impositions rattana gudauta manneh ketola bubb bovids kamani eichinger allers crocuta christophorus gripsholm clubb jostling matiz mcinally staithes zollner boscov irretrievable katif palmiro shugart souad iob hongqi iwase chone allana windblown ahlam knopp cilea keizo bujanovac donora bauch sabol saner nawrocki recapping schlichter lizz oisin flims incinerating flickinger tamarindo relabeled ampk melanistic platino nansha nitroglycerine açaí ellerton otake pasini nihad adamov hubel altough inayatullah lycidas drepung wrongdoer arthroplasty villedieu kovacevic monje rondine aymeric nordine herrman orelsan oguchi pushin leval 
hydrometeorological esade sibert otávio pigeonholed haagen spouted clines levuka lightweights chiusi wayde tetum enoki cottenham caver khur lithospheric akande primeros babbington entorhinal paparazzo nanas cassina dozy bolli gramme oroonoko fluidic khadra etiam deheart collignon campesina sacy rubberband dno nebraskan murderess deak cambay lile sheetmetal mundaka unitel chunyu mehri ceol bilis wertmüller zoli navenby dongjin extensa ceasefires rzewski negligable lasek spattered caballito myhill woggle larghetto sebastiaan bascombe shotter ekstra gillham beaujoire bronchus haron gyngell artas pennsville gayane nordfjord skybridge yusei fifes grenadine robic consequentialist bishr grafica roffey bruyères bucksport atre rehashes oddness emigdio poteat yutang vasaloppet baharestan curbelo carraway jolts köllerer vratislav lycan tarte panchito nopal stegemann ccbc tonkinese tantrik mavin mcconnachie bafut onaga oblomov ignorantly kuwaitis sibs asanga lown worryingly promusicae ranganath corpore jauhari printouts corra comparatives kuria nali reassumed jazzman brüggen nealy edonkey biotechnologies davalos alani kaushalya mockler kakuta sakar chiappe classism ericksen impracticality jetman glr kitti ujung warmington supplications rushent newsgathering kereta luiss bioavailable rmd jailbird drach ftf wadud shinyanga pucker fargas drams newsquest pepitone ahlert hornady tatlock lincolnville mindel falsifications dulal interop berthon ladybirds hajjar sandstrom superbikes mousquetaires kinman palmi despertar kinchen ptg isitt dronning fakery straton downturned surcharged commemoratives linington sensex neoteny asq snidely carluke summertown galkayo giorgetto liquide kuntar achingly obamas estenssoro wrenched sawley roesch negin dignan reproved regier pachi azathioprine problemas suturing synuclein urkel winne volkszeitung grumps mardones lippa beliefnet gruenberg meador inm commiting mufflers epitomize ischaemic ranft joannie chenoa kameny fangshan romanow lillington 
bricktown visualising spreaders castner gimmes giono farhang gabay tanzim questor demodulation fransson edfu mytilus bouch countach jeshurun nocioni mells twelvetrees rhinehart biram ankang abbeydale dalan ndn whin ibas tranent sarlat islamiya caimans miras andreou realest shahriyar cleto ohioan bulo elga wisborg floy mallaby rongji barchetta herseth aberthaw devildriver saval satrapies geodynamics stegman rearming bethke titillation faubert sorgen destructing bagshawe edgecumbe mckoy coulombe logarithmically shrader dinge conundrums aranha murmured aslo chooser sprawls misrepresentative nafplio recyclables piatek partage sanctifying errs tangen oxcart faregates scottrade wurman leadup sirico finless fastness dittmer nesters mpvs colella viljanen mountable kadisha copped goethite chiens bavay youthfulness gimblett teer aberfan coastwise greenwalt messerschmidt massu energomash hitcher yadollah breslauer kodjo dibb illogic erewhon ollier sidamo majorelle egi rennick aronowitz diokno mopped zakuani icsa bertsch blazey huapi ajmi guss vmo muzzled kucera unsubtle theisen seikan pharyngitis balashov akroyd biderman besler savitch lujack unreasonableness lutein exactitude highwater anns sjp khryapa baquedano santeria elpida straffan rheidol croshaw settipani sdd cambiasso wheelbases naj pressey rubbra embraceable gellért blatently merchandisers seussical ceausescu daido samah aylin yinka coucher merseybeat bloomsday cordyline odierno deselected shawbury gizmondo woodthorpe laurynas ferid karpenko henryi ballena abdiweli bairam albendazole sorong shahani coprolites eastin cheops breggin coracle garbajosa capitola frane chedworth ganze dubber sadaharu threadneedle osen marignane townsley jagannathan freakonomics hosta beatitude parthenogenetic shinhan ceja phrasebook wiranto irobot shrovetide rotich pedy convulsion concocting gerizim auris frumkin terezin dither lfn subsidizes karrer medemblik ludd castafiore sajna nickles novoye beitou levonorgestrel eddin freighting inne 
annelies ohlson embalmer khad chisolm mashriq fazeley soglo nabih rancorous kehinde eatontown swaddling rodenbach snedeker drey birdied misri sxc srdjan ferney gumede waitomo clearlake ress aristocats catie martland trainable affronted gintoki menounos routs powerset billson tonie gosfield interstitials moniuszko bilski ellett substantiality silwan teraflops alginate ruvuma birner ivanpah febres blaschke doodlebug yakuts musc neversoft reconnoitering crispian alphonsa antiprotons samye yoshitomo fluorouracil oaten ‚ meadowcroft kyprianou namtha jyske loudmouth flamini ibne barsha morinda lyndale downshift mikati ecolo kurányi bjc geismar braeburn pedretti tuum sowmya mckeehan boortz fanmail garcin gothel karmiel bradgate manzur sycophant sparknotes waard casilla métier pappano kovalam carini topolino pettibon maginn fike arye spera uniti bernardes nivel annville broodmares valuev verghese accompanists eddowes seydi azizul ladon siaka emsley tompson waggle parried officine buba measham zentner hatami bonefish scarboro piccolos ierapetra jego primigenius vitorino tylney giuly berson twingo tyo firmicutes oln mansergh ikonen skeete bleakness anesthetist monstrously offsprings temkin données buffing scba kumul shahjalal torrez breathers anticorruption ebensburg orbitz caines esten manuelle hesitations mortimore graun dary tomasso naproxen cwfc pinpoints acuerdo impalas polvo mehlman hisa lesmahagow rapanui honeymooned banky softest haiyang epigastric nzz episcopalianism schildkraut quemado bekenstein ranna baltz heartening barrot parrington kyoji kesel meux mandeep iren fraunces ditchley geren shortlists goopy cambia edvald edmonstone namibians vollmann tornatore brik branstad kunle houlgate bernay morato offen rockcastle rivolta blunstone mbf hayseed sugarbush hards knollwood campello pontifice cgl severly gillberg famke forestation kruglov egawa supercharge rassam prolix gheluvelt pequannock beko elfrida augurs mutatis bigamous crosfield halberg framer watusi fod acq 
evangelic bunkering balilla uttlesford csj marinello ballymote rockaways mattaponi heeling tck sandham endocannabinoid loonie olshansky tanked hydrolysed baldwins richelle alimentarius assyriology clingan handicapper amna fiyero bernardsville viloria bolten thunbergii ecko squiggly digitech bika rebecka ragstone transborder millepied tbg arkenstone navneet ikaria denaro shebaa decs pizzonia hertwig mulroy mccallie kingsholm froud gowanda micelle krasin goatskin iomega kochhar pletcher fourway béart meirion muncaster kunin eversole illusionists klingler corkhill fevered remorseless stodgy pinsker anandamide whitted davidsen cryptococcus sharpley mcmurphy icas maness iphoto darras zahira gyres coober laminations beutel pielke dioxane projets stacia interlocks jingling disingenuously ximen nonaligned intersport wreckless ruark coler davidov rhel leviathans jessee womenfolk tde penicillins avelar calapan ignatov dicing shinsei ezzat mudflat interweave deltona naiveté disinclination mangi muzorewa lucette hybridizes ardon novacek heterodoxy aunque symptomatology gundu fati marineris sorbara prance dundy zuzanna amitriptyline eligio bioreactors chorion azéma colonisers japonais phileas sadhna amabile jianhua cyfarthfa arboga malaco buttimer salitre seijo smucker excello montcada lrdg sabharwal venti pantene saheba acupressure rajahs unclothed koroleva ratigan bellport honaker ramkhamhaeng hachenburg ballachulish laprade reson freethinking espa baruchel morado payen ertel pramila montecchi justiniano lbo temirtau mutai essene charbagh zutons américains bashes asriel edinho wetsuits amoeboid dissed cofferdam bhaji menes panabaker trespassed infuriate étrangères nastya cooing bartenstein delorenzo yentl wibc skinless boonah krumholtz consulta evy hyperkalemia goji undulated gopperth bassani auricula elsass goodling flw flashcards stowey nutz svevo meaner hameau shigenori anhedonia gamache hayti persimilis kennaway kadiri yurakucho fringey syndicator gaslamp arizpe tiwary 
rancourt vitolo unfortified yazdani doctrove lightstone thermocouples groenewald sybaris hean sgpc gorenstein herbes millin mosty parejo keer speedskater ferrel donyell hirwaun theobroma deserto hrbek skil sirene shouguang confabulate khafre soucek shoalwater recognizance leadbelly hurlford feux boehme jamai frites commandeering tinicum roboticists hounsfield balanta hagai kombu loret morgenthaler vindolanda bussi tokuyama qena sibsagar primedia durrett coarseness kurhaus hasnain bringhurst sandwith lortie alaeddin huracan ddo hawala soini hiw cowshed ljubljanica rasche scandic brahminy moli hickock hefford fedeli bilas discoloured tougaloo shirogane chappe nehring sanmenxia ludwell krausz karplus verdú umehara sesa tamiment ario rigueur charita langtree yehudit hemer cassy tortuguero somo oshiro multicenter cormoran tamuz dcom hardboard ngk bayati ossington conchs bennettsville barona bennink blushes kahraman deadhead roughneck ghd partout sommet driskell shinchosha finola argall tsumura safka refractories krome warsop giustina caol somnolence kerkhove parkstone telle pummel dille aita mkm maton profundo xel bejar exudates brienza gtz achilleas relph muntasir wfmt ruffa lefthand throbs ambitiously craigmillar lasswade koshy salterton generaly udyog misprinted torstar idled nearsighted backstabbing balsamic osea victimhood mariza amschel conciliator blonsky pickthall tammar kimmage eufor khizr mablethorpe wdl boustany stepladder malzahn coalbed cerys kwansei gericke whalsay ademi eardrums amsterdams bletso ghostwriting ingratiating bacs apprenticing referable jovanka emaar myositis itri pitied chauffeurs huji baus cunts palko cseh fehrenbach joice sbh analytica pahlen slacking goodin banisadr schoenbaum matthys chimo roustabout orts netizen shoelace sfera tuqiri charmes shehan thsi pawlikowski cagiva peds startles sigatoka pacioli cavender obayashi rabkin verno bunts meiwa kimonos bristled monahans mcclay rallo urologists youde masakadza followill sirat pratas 
punative unshielded beemer unbending unlikelihood sward nilesat hoehne ogerman microfilmed icelander alvida mikro cheeseburgers steckler mercouri kouassi blare multistate acrimoniously belluschi livigno jetting tollefson palang unobtrusively rakovica prompter retrocession homoeroticism minuets returnee ventris zoomable signboards langhe pocky anklet qiong franceville boire rooy recirculated bladon elveden tradespeople chaudry ebow brewerton jiamusi birchgrove triangulations employable meroe khalis nial rothley contaminates kotra haccp endacott fidrych glovebox spacy flamboyantly seismometer yussef hughesville piombo alacranes momotaro activa bisharat spacewar tvbs yearnings dirtier matute kofoed antiparticles vandalization mbu compactor wagg arihant mcbeth truncheon bioethical thoroughgoing steppers blowjob tuilagi ccma remzi spdc lunghi verbinski berr brosseau avshalom esquiline baillieston mehbooba vidhu avas dajani euroncap scaglione troubleshooters nitrification zipporah filat femm monacan hagans crabbing rula ncpa raquin byt panufnik friendswood bergsma pachyderm kuchi hacohen casstevens aurorae mallen dzmitry wiseau asado skipworth yonghe doctrinally tast multilateralism cattails shahdad unconquerable sequined castmates heindel modak swanscombe huby ajp allardice coran oin macguire sarki gubaidulina nevo birchington geetanjali brumm tracee palomas bumpus bunim dira roselawn ezri mankowitz judt dokkum arks isse pomposity vpm demande plautdietsch kosuge nurnberg ibon kalindi keiretsu buser negril baldus ghoulies amii overcoats feldenkrais daymond flagellate uninterruptedly morsels dispersions pharmaceutica overhunting musl kptv electrocardiography shchukin westdale juliani harehills puteh rotundus abuela kynoch totley cizre malzberg parques xishuangbanna agitato cryptome evry pwe rockier hymnbook laterano nikhat tilehurst nginx mckeag anesthetized jangly wagoneer whippy kushwaha esata stothard lowveld subsector reformations conways arnd bleier kinnan zaya 
keylogger khalatbari synanon jennens crookham childishness plb cromlech salvages jarrard crispell disproof kinver asten farfan chamran kyrle kaio rwp agonized medem stratten lavagna serhat motioned hickie madhi bragdon hotak kasson tristeza majida wrasses lusting emlen nanocrystals mobilizations bonifacius prayerful mccreadie dreghorn shoehorned obviates ohira pbuh lcpl rebelliousness paolina baser backwash wilda dijo bolat lévêque platero lightwood christakis fortenberry misspoke tremper killinger mws orsola phenotyping goltzius tartt benaroya workability maille uhi sanayi reyno bitrates katla archaisms avnet edhi vembanad arris docosahexaenoic tersely zuzu nouadhibou undersheriff belpre zemlja langwith kalorama vidler defreitas sarginson housebuilding kusano soeharto lle trots ormes avira delahunty katzir enthusiasms upholsterer banamex kiem gulla benq preeclampsia kakko talbots hamstead bleating sunnier phenology alexandrians jais ovarense billfish haledon maenads mclelland tilston hanya prochnow wben amenorrhea prio nyren unroofed myofascial mailbag deuterated katainen monifieth najat allura cazadores blatz ceviche fortunei classing stolze noachian nesterenko standpipe ventilate giedroyc anoushka rela thuringiensis cheevers drachten turca babayev rosenstiel copycats mensae shohat tartikoff yanqing mcclair stereophile palmitate escogido lubao pinchuk melena aliu cicerone colorants kilmacolm thoda ciampino ponda hobeika cybersquatting sux mattick thill tudyk barata dch waterspouts eshleman whicker murguía filipa denice malapropisms whfs palapa expressivity ingrown klac ahava laurenz oberlander bica messersmith iliana taks rockettes hitsville syktyvkar witchery dipa jiangyin vot ruairi wegmans najam juta caldecote righteously tuberosum konotop oare toothcomb cilacap reichsmarschall oste radiogenic revocable gentility timonium physick brolly balma albar shanksville cansino reclined barnsdall vno hackner tiswas torpoint hutcheon rulli deinstitutionalization 
wolkenstein stoup osmose bourdeaux bradys ator handsomest bellah ustasha gaetz ekanayake silverbird gusman sistem henfield caradon ghotki blackhorse noko reeler summations rampaged baili upmanship stotz duba heiliger clenbuterol hofkirche bijl vfs rollinson pillion knole brookman collezione shophouse staghorn daoust zahrani crossbills fiorini murnaghan marcinkiewicz gebze beitbridge absolving fortuné lidded eyepieces coorong pmh adelino dvrs extention extremophiles tausch editorialising brittingham marcas feiglin wente drenica kelvedon bulbar muffat endodontic connon harner multipolar trimpe gabrielsson grabbers astraeus belal boatner empyrean rafted sampford starless keylock chlorofluorocarbons vorobyov procope pashley beliveau ifma cullinane medcalf landesbank pinturicchio louison furat floatation batshit kineton fullwood hogbin denness mikayla panchos entreated gulper knaves laybourn saparmurat magalhaes steelbacks morvern indice wahabi unforeseeable strieber threadlike galin cuvée evader midyat usepa mockups jhr licentiousness cunene jenova denature motomura grigorian amai bormio dortch deathstalker morwenna combatives sequin staiger risperidone electromagnetically ngawa prepress orginally ixia kilkeel bangali mustain gages guana fetid trespasses wattie rgc ropp bayit nailsworth verão unaccepted livi kirriemuir bavasi chiapa feedings materializing gobs quare tartakovsky batroun mendicants fallis motihari stites carmeli cranmore imagineers datchet acculturated ajai blasphemer uglow hummocks crawlspace hugon yorkin amarc erkel cantieri schruff flannan memex schelin itim manifeste jeno tangs schoolbook farmhands cnnsi microfabrication hotchner gdm atitlan quizmaster swisscom stilettos nauta thit independance eberl dumper noseworthy chynna nardelli bringin hawza achala elettra moratti faille rotationally disunion baggott benchmarked leatham montalembert sugoi gursky quartey tablecloths liaised creamed denaturing nonpareil gloaming leetch raffarin lauzun caffi 
provisionals tasmin virenque nylons tropopause outstripping kennerly arscott vanderveer blackboards barril paramotors hierocles tregaron postcolonialism nostalgie ceallaigh dorot crw ryr hagens ssid sturminster shac haier grannis lundeberg taihe renouard tristes azoulay hhmi fortuneteller multifactorial stoyanova scheid cheerio graffitti tessema digard diddly squanto tortelli desmarets zedtwitz boza kishoreganj putted haggle akunin wendler fastpass schotte mckendry khail drat koltsov janu tabatha sambat maxson marville tomonori operability lackadaisical saladino burgiel skerrett athi pavlovian randburg bettws malky taurog farmerville botica songhua goñi woche poyang kawakita reacquire picco berthelsen sicht shealy desson zuazo masqueraded eburones sunol handprint scapulae sechin solva steatosis tortura bossche hassoun taizé camoys tramiel cressman saariaho ronaldson braggart baso tanha sociologically sokurov febvre pubertal orco quilombo baryton inosine deshi kalergi nhan quiescence youngsville maudits dubravka peatland lacto bellingen delbarton artane kaua goudeau chirino jeanna niese mutagens htf veo langsdorff gondwanaland suli pookie corbière robinia haddow alcorta arowana wampler schottenstein timbral bronowski extracurriculars sillett northvale umrah jaspal zygotes aungier muspratt airdrops kuduro larf ohayon vieta nečas balneario lench beeny majura bhusan lbv pogson igloos raouf klip pérignon firelight bnb clytie tandems entangling adulteress gilham origliasso leidschendam smashwords spiffy pullar wmt belgard visicalc beesly raybould lieshout peconic taubes coquerel kourosh kulikova puddy redes abbi lirica chorleywood dagang meaningfulness bergesen wgrz iacono uglies homemaking giesecke spuds pardis stratocumulus dextroamphetamine morrisania erps biopharmaceuticals biase deberg babysits regularize porphyrins mikimoto quattroporte lefts deaflympics boxofficemojo kazmir ranas dccc gazebos crisanto flanery barnstormer flm waltraud netminder yachvili 
anticlimactic excl windspeed speare trilled siao expanders sriperumbudur dainton konk nakdong hoyerswerda saltley berd nekounam sylwia vikash fishtail ked garrigan ashan widom electricidad nimi antimicrobials counterfeited marvan lankershim konik cfz faludi administrates hooijdonk pioli gracy follet australiana bacteroides eccleshall inhalers buscando lavis chiho reso seagrasses pianola draa fluoroquinolone misael isaev talan poach crilly schneemann bacteriologists waffling disfellowshipped satkhira biochar heidecker polypodium yoichiro phlegmatic yevgenia aper norvo orthodontists mazzy sitdown jerico linstead mesivta kuningan knucklehead whalum qaqortoq contortion bistable bebeto swagman ridler predisposes solovki geathers anteroom turchi turnin plumbago aham abierta paramountcy evalyn hydrologist wif skd empath brightlingsea chipley kunga rwr dystrophin belching clerkships olivi nicolosi queensborough broaching finningley stati bfn udonis broadgate luper rockliff officiates vnu andermatt batara hustled taqwa mwene unmeasured cryptomeria jahnke tarah devaux uif mönch schlumpf rimac ilic tais kepulauan edensor dings malpass arpin subtractions clapperton gernatt grès mutawa flowerbeds begonias kanayama zuccari combinational troublemaking bäumer pcusa turkel magnitsky aegyptiacus lome whish lapus bonta ayodele baly ufuk ambrosiano alse chughtai disburse zsl prasanta goteborg nosocomial centigrade boondock zeni namgyel heterogenous styris xmb oaf salubrious nitto josten excerpta pcap affonso loams suhaimi mome bermudan owney senad bellino shirur pinkas kulan sanwa spacefaring financiero charros bayraktar minders erratum boning torremolinos visuospatial disproportional shefali washable boby skitch macroevolution aacr flyaway perine swinson filmgoers counterrevolution yepremian bagman mcdyess rollerblading vlaminck colyton fouke kerinci virally zoeller aasha cornbury pavitt snuffed gachot bage douaumont vinni kenway actualité mujtahid otec calculable bataclan vassa 
gilera questar vannucci queeg equalizes soner hewat gaula sunbeams pilotless risalpur butchart garrote paolucci rile nativ marpol baracoa ouchi multimap chemehuevi tambellini ouko inulin nces stouter cja pooper phalen cnnmoney turkomans dogar mikhailovsky grope serina nette sangalo vatnajökull diorio jerker alabi colostomy lounsbury lns byer glasvegas lifeway platja critisism echenique allwright qisas brk fmm toothpicks kaela meum tthe albritton nunda whittling temesvár timar katti transvestism heckmann dingos rodley stalagmite yoshimatsu crisply waag brütal rockstars orfeu darrang godey pistils moots lehua valadon mihaylov ornithischians interreg demaria torrevieja kjaer outmaneuver wilsey tortue dzagoev efimov yoseloff riboswitch pannalal christoper riprap boulais newbould moes firewater tsukahara ahadith paceman skopelos cantone citronella woodsen polycythemia mulund ify antinomianism khronos rectories fantasized penlee qualm clyfford numberless juxtapoz yaeger wiffen sigonella whatsover briere bathonian zaira yab pufnstuf seyfarth carona thors pusser lacalle rym solemnized mitar khrunichev marclay shandi buckfield seismometers boese talay homestay obuchi santur lluis marburger bochy sokolovsky cappel emberton gerner tafawa izzi looseness marsabit fiddlin lipuma dullea thole mtsensk porcello grazes eviscerated hatboro hurriyat tii liveliest seiffert mazumder aravena nejad motes windlesham ragione sylwester volksbank earthmoving froot donaggio sarkin rece millbury seyyid ryen icheon ruehl moliere weimann czarnecki koscielny daouda kuleshov iacp menz fabulosos makeni dacus qurna ishares metasequoia miffy confucians dufty pazos silveria wrappings shabani kehler marsal saffy cury clydesdales radomski mancilla howison pesonen horon latsis iturbe fixative sundridge compendia tyrannosaurs assortments feaster pettah preachings snowballing schrei emmie emond ginola kopel begone callously evenki clec nwankwo zapad mazzilli premenstrual clandon boruc jobling ostade cujo 
highspeed oxygene bny mavra singrauli gilberte seashores kctv gudmundsson koronadal maltais teare suprematism mdv otep raghubir sacp toshihide extrication bärbel scroggs bazil samish littleport fairland lamonte nigris aliyeva kaput porthleven nomex kinesis moina greider meltdowns woodcliff caldara stanchions saras salzkammergut kingsborough korf riverrun exigent pccw brightwood salvacion dinton langfang unquestioningly qusay manjit tzeltal westpark agnesi sobchak ergaster fungible dedo verrett kingsnorth palanquins theunis shudras tucana morino loganathan vanoise firestop lavallee fordism tharparkar vilches brinjal rambus faucher lungotevere reedbeds carso funnyordie kaptur gossau hayashibara berges rancheros mitzpe disaffiliation nicols eftpos feints etive subsuming toothfish kajima epiphanies worsfold honderich annfield suwaidi tepes orkestar stayman hemicycle quern mengs baju musavat burnishing bujor fash schori druckenmiller saag vosburgh whacks cockfight luntz foist tepeyac biped misfires laparotomy khazraji blase zhenya sauder freddo monisha mboya sacem kuralt apparant ipse gholson gillo telcordia amrapali wambaugh stults rowden fireweed barrón krla bortolo tach karch hunches tipis foreshortened belletti tgm onoe eggheads panniers cashflow murphree rosenquist wolke oen rizza partaken biretta umf lamarre orlandini umoja lomo leavesden unexpandable ragheb athas lano petn krasnoroutskaya petroff troutdale nabatean vagif infests garzon tuto souce peppercorns carnies korpi litchi danan voloshin sakr ulta riner orexin plisetskaya wone mcgrail monhegan blistered hichens noort dorton zambrotta nahj alexandrova zaentz thuot mihailov redeploying tallard psalmist khami bulged klimek enam kandari lsl tonghua chalan oddy strategize afaq anacapa olavo sisir ginty mallalieu kadosh tishchenko celanese leyser mgt mcculloh shardlow delaporte unproved samland oliveto wjac bogdani ciutadella clubfoot embolization syringae siim langstaff ambos renat johannson jaywalking 
occassionally mante kushite androstenedione saceur hematologist drumlanrig birdwatcher gassama noisier lete kalonji icecream takebe bref hedonist bronkhorst evenness hermanson warble drumkit kerbs gub fundemental wittstock moviefone josefin kretz machineries domitilla seraphine trampas soufflé thimerosal thordarson chimaltenango ssdi dimmick capgemini soloff golspie hypoglycemic braund thirlby pomerance weeper proces sniffles redbreast metromix backfiring forton dillmann perpetration tauron playbills kifissia armeno juneja hanalei annegret tandoor lmf darro nicci calientes paratha premixed hanami zloty gleave krupps colvig suffocates paraty kassner edinson karjakin ambrus batiatus barbarin concentus yellowhammer meckler khorezm cwg venuto ampoule girolami welcher leeland staphylococcal keds temerin nasd bemusement smidgen wanner carstensz karmakar patronise tasa alcl tripathy dealbata trilemma tinu pharsalia ockendon thiagarajan sousuke keziah goetghebuer albayrak retrofits deprecates marijana tamme hiroyasu derivates simpsonville doormat photobooks boarman cherepanov mutandis arzoo betina stargazers shox inteco yuzhang prospectively odebrecht ojinaga ilustrado egeland hanafin valdarno warrenville scheurer cedarburg bearsted jeffcoat dancey effervescence macvicar trucco khanjar ittehad kabba bumblefoot okemah possessiveness fearfully milija galbally plunk bluetones grimond schiavi nakamichi exterminators gavron midan oakbrook blueclaws kukla slaw woodcrest wenhui ceg menacingly faucon polyphenol deuk harpswell beerwah abzug harchester terius fml afars maoris atitlán volksbühne akf akam kongs muesli northcutt demond sharers hounsou kista gleaners presencia permatang luggers pittenger dockets ketel pudi tignes ranjani treffry digitalization reenie gijsbert undersurface amontillado uzan hudec serevi rainiest favorit atoned noar sirène dmsp longville bruel fernan boie porkchop kreuziger bellen narducci sismi desnoyers hhh folky penalise mckenny teetotaler gambardella 
containerization shying junctures frontière anoles aptana husni knobel piramal bakal pingo xinghua pmb bearss arlecchino cati mlakar nanostructured axia wbns wormer spitze randfontein vadhana pentaprism wetten emis xts lorant awww alpers currys trialing pathétique besieges harber mentmore inculcating securitas zager trimesters gange schuberth sabzi safecracker inoculate satpal jinni chinoy pantha malcontent lavilla hoefer exalting klaver rapporteurs leakes sirin pachamama bonners foli chubbuck institucional bourj pascali mahane llanishen dpw upperville nieuwendyk coronial crackles cheetos cottonwoods elsom redevelopments homunculi nantlle pully penan purgative mudras jindo barling podravka fifpro werent jalili lochee asafa semo nmo carvallo karakocan mitsukoshi batsuit droving laube trinitas cadwgan cabala shondells ngcobo ravidass thrombocytopenic islandia motormouth blackshaw ballinrobe hengchun winterbotham sorento cantat ohman cincpac dukie elzy forfeitures melanomas trav phyu sneath nazz javanshir suroor elaeagnus maleki degenkolb jenkintown hinshelwood maiya kfmb chappaquiddick ntare beatific nagina haydée olano nanotechnologies yunos amiata composted surv griet portada sunia noval commies pressurize lisburne merseytravel illustrato higashino demoing neisser tondu alexandrou dibango asbo twycross straightness pallice polityka writhe giacosa koff varnhagen juon reattach semitrailer vestryman crimping ktrk judaean smorgasbord canam tiba schwabach reijo tenace guren ypa feek hax yiannopoulos dihydrotestosterone angwin gremio keiner pittance zworykin ruffs cousine pigna whimsically bugaboo ajloun nezahualcoyotl terneuzen petrosal sharifah lightsabers rosca udm veon branting reim lesher objectivists beame oho compunction aubenas givet sucralose branner azem circumambulation leakages khosa elixirs berde macoun gatekeeping dorte duparc culme tammie andreoli kohr vollard ¯ scandale prepuce banfi beauvois conformant singleplayer squillace lits abyssinians sparkbrook 
zadie politicize afsar certaine lectureships ocotillo arceneaux audiofile cuter lubanga meilleure uzeyir mcghie msy excavates burek fengshan ghimire meades kley keyvan arcimboldo cuscuna jayma thot mandisa reneau meekness azn maidment shabaka yanking hitzfeld neben pragya lutcher bernstadt mosquitofish aerostat xlt molas breakouts bessell kiron tausend goyo ayresome individualists yoked gayer stotts kyokai biblis insulates trux subcomandante avetisyan hambrick davachi mccosh reider darda theophile reseeded ntia intrested beckel yamamah goian ulhas kadoma survivalism frindall hoveyda immunosuppressant commins promontories stroger masahito steatoda bakrie montee smartbook scurrying chaffetz seibold nauset crais ibara verdeans shrum redbull sharleen fforest imprecisely sawflies kahlenberg ballsbridge whelp nicastro donita reagle westleigh legarrette forestalling karola sodden musin besiktas huazhong fatcat kluane bondfield constantinou intension cotonsport jannis mouly huarte atrás matryoshka yac losin shockoe rhod slatter tragicomic rrt presidencial jogos interdependencies astrocytoma limacina poorman pointes manele seabra untaxed pathfinding kinnaman bendall laxer nanchong soaker valters donnellys sculpturing wordpad mahtab biographia kazushige terim michihiro poesy lissauer commerford brunelle preflight dakotan enlace birkhead insensible luciferin ginning vallejos niggle frontierland biodefense mammalogists zanella lochlann gabriola espindola olton pldm campau dollmaker mermoz leukoencephalopathy immortalize felted dolson darga yarmolenko nuhu delatour xvs aikins banteng comana omara supe ireneusz jaluit hammerlock xizhi thandi ilwaco solare garet cadair chlorhexidine bithorn hural manjimup jugo newb mutualist taradale goicoechea fulla nanometres borea heifers altis compsognathus paresthesia lakhani blanketing separateness santuzza demet hershman sebold portree jagatsinghpur hambling proj aima horno momofuku samarasinghe discontinues motlanthe gyrfalcon macungie 
demilitarised axbridge mulcaster sclerotic tomaselli boeke klöckner argonauta cavorting artel scathingly handymen shotover drumlins chitinous boak neuropsychopharmacology malavasi novelisations velorum monir philodendron dulin hurr fasching nissanka macromolecule sunrail jih astell hoopa unformed wci lookalikes okl kingsclere lufeng sqa modzelewski opo inflatables chane speedline gawthorpe batuta schönherr salvinia shestakov rospigliosi loboda anacondas kitchenette mavro endosulfan kyai duetted reial halki traineeship thiopental mainlines herma michalek escutcheons legrain mcmasters seelig minoans meshell skylarks rapha nakazato relavant scribed sakka kernal gdt tikki reiber housemartins hef orlean kirksey dacascos gonin bwb mccorvey orlok fábregas cassells dheere hyphy acipenser bloedel songkran ariyoshi degradable razo biodiverse yurts shohada lvds glenfiddich hunchbacked belchertown denigrates latinisation acupuncturist minc guaviare zits karpinski sout greisen freyre skykomish ghadames nashashibi ometepe aleuts mapperley prosiebensat strawmen sinoatrial azerrad nutcase portentous gubler cinquième noyer maligning infoshop cussing rotarian kissam wendat fredericka khvostov alow oraon asotin jalaun kilz temin duve scrollwork peche adamkus milanesi turbin sniffs rzeznik rosseau guli kasuri warid implausibility almario battagram castlewellan cofounders palhares contraventions lenghty perturb ardila germann ciénaga nubra longquan khagaria bennati vickerman potentates azolla sarp bents ainhoa rafiki bedknobs wagstaffe zopa stec nube chuzzlewit bradby htoo cepr hamnet liturgically coston kiesel creag rmas gaoler walkmen beardstown gapes benediktsson spectaculars tonalá repenting costel dega beranger detoured ufologists unseasonably theiler nordlund partenope kujalleq sleiman housecleaning golino khaw aspidistra zaslow picnickers popple tetiana accedes recognizability riverworld poots wearied benoy benten fionnuala kirkley overflight gentles blancmange rsh starched 
mesmerising kady metzenbaum recharges wheelbarrows mazzotta awesomely dresch birthmarks tsunoda tweedledum twila inderjit ianthe merna iwork iseo asimo montlake hamlett sodano munching radd hammamet brewmaster eschbach rutabaga nsrc hodler bifurcations earlobes ispr amuck junjie parisse trialists bromham terrifies mentee frese sriharikota wch rodden tvw lvovsky typhimurium dumbbells lewie fiddlehead eplf glycation quagliarella ondemand canuto ephrussi leemans windscale djanogly rosebush greenfinch dietzel bellot sizewell glienicke kiyohara auditionees bandito polperro pawsey shehab tomentosum holmqvist heydari duje intertwines munnings prostatitis cottons kashtan megamind cosmographia craic killingholme barzilai bierhoff kapel ryuta admitedly joell pourri rdd mujahedeen brinsmead brisa undershirt hoaxers kante soffit ranted riblets jemal olbrich nixa kold acea respublica jinxed jojoba urwin actc lindvall aecl vlasenica shawneetown shortbread rheum bogeyed shf hitmaker benighted penhaligon clinica georgieva dhingra demitra anax hennen canot gunports troodontid spuriously caché dtic galambos phen millport benefaction fuwa lestari crania brucie mugo comley squibs saberi logothetis permissibility coleslaw jape scriptable ackbar bould rys crz udorn iracema noisettes cadetship outlawry lovich gallion trankov clum mahabodhi steadied daryn kevon couching thrane ropers narelle bossard dilling coproduced roatán nccu antitrypsin windstar chirbury bwalya vasca hikurangi hubcap crowden kalypso wesh geotagging brewpubs koyuki fmx epilepticus kinko janklow melty xanthe vallés urm mahaffy schoolfriend djalma skeel stiffs awestruck gruevski pneumoconiosis concordances zuleta frear ariba namibe invovled zwaan timman zettel shayan bosques ennedi plym prostituted nafplion leopoldville bisignano molts gois cavaillon waives madou beiser mclynn merryfield experimentalists anuak somov wtvt irreversibility barrantes amylose charioteers cuvelier skolimowski mundra kamasutra timekeepers 
deadening quirkiness tappa negreanu razorbill petershill nctc doull estepona ditta ranee preliminarily weihe nohara rackmount balkrishna bortolotti sanski sherley hylands cok osk duchene mtbe vandervoort vaxholm readjusted ttb planked tahitians espero bottomline xinmin sudler gholamreza coalescent chanhassen haake rubell miyazato geim jolimont hoarau portelli dinga fazle hartranft babbs wou sporades lakis lahi cordle ivanishvili temir namdar xiwen gluckman søndergaard geschwind zambezia goodlatte tatsuro jirí stickam chisti crosson ehh zvenigorod titanosaurs dazu sherani coughed telemarketers luman tinsmith morzine archaism ferrule rousselot csce thermobaric gaden rúben consolations nafisi enciso maty rappa haçienda fingerpicking redressing zien unsystematic bcw wilm brons canalised subthalamic shiwen lowii scroggins sge bressan cibrian ansible castrate mitha szechwan loisel sailboard cotchery kahuku coureur rakan dalio gorged rimba sitchin akhmed konkona mosasaur marianist hypoventilation taborn manacles illi unrighteous traversable kisselgoff düül ausiello craneflies balderton marum pliant portaferry belisle leef fbt beamlines chanteur papaloukas letwin hasankeyf tfca azithromycin luli avani orti paines pcsos chambrun windley roedean canzona knowshon nalle vicens chani pulsipher sapperton jaafari arachnophobia nasib westlink presas skylink vollrath burstow antwon merrivale mobbs firle baddow meditator laforgue graying dahlstrom kurumba karawang vermonter kaltenbach checkups outstations mcla toffoli gonu topicality nonmetallic comesa lucano sokratis dispensable acrocanthosaurus ontong realisations rupo sonnenhof byres synthasite snakepit kutv phillie scherchen langstroth renyi petaflops ceawlin yohann gomberg franci mustards bibs inia reoccur fitzwarren leelee sprigs cavy radiochemistry jyri waldseemüller siegenthaler shipborne stubbington chesterville opaline phrenic sulking inishmore earphone screensavers sardonically stylo kovalevsky braylon jatoi shelden 
tibbitt plutocracy fanfan lanxess droege disenfranchising pixelation macrobert guignard delavayi lamaze baptise sahid jeanty fadhil accardi keynesianism contouring fatiha diabolo pymatuning marq differentiator sudip dickerman kleiser meredyth obara diggory berlanti adrs tenuta kamerlingh dibenedetto marchais macandrew debnam blaauw genetical subtests layed ellisville impresa binga aerovironment dildar insurrectionists torreya buckstone vaisakhi mosimann unluckily foundlings taberna kernighan fatf railwayman goatherd iops doorsteps anica moët proselytization sentamu unseld diarrheal snijders aboul reframed dubrovka microbiol sarpong commonwealths giornata glenridding tattershall peattie maraschino amnat changning commentates hutong bomblets kurdo krishnakumar microtransactions dohm detains plavi pannella barrons mohnke lwl gahanna beny zenas abeid yibin trittico coalmine overplayed vengerov montalva enercon tarjan urease patin inserm itanagar nonpayment malignaggi casandra skocpol retorting bargate bleh hirson abse embarrassments parmi melani tickhill hakama perrette pickerington weightings pizzetti antipersonnel planetariums forsett shio ebersohn trematodes mckale gunasekera caddisflies jopling kovin pimlott pigtail dunbarton kyp pequeños nbcc grayer bluesmen lohn fogelman swar liben macovei fixate sarjeant passman whinging pbf waterslide biscotti nasco colostrum herbison clearstream longtan yuanyuan meurice activ bonneted suz peten brockbank stimulators mcnicholas shamblin kieswetter longtown ludovici insurrectionist shirahama mutti fermentable tamaz streamliners forwarders colmore étrangers marcom nuccio erysimum dirceu swapnil coudert monofilament pitter tananarive cashion dockworkers leppert shiplake intonations gamecenter pedo transferability cathinone reusch pokot gooseneck blackspot individualised stiffly siza ryce woodcroft bita dalt governmentally byre amroth prams weslaco troxler ternan dilan senescent coum freckle panny agenzia enderlin matamba manari 
rahimov egbe leeann kursaal expecially minories imposture topples napoletano ununseptium hims pluses jdrf wagenen josu reggiani carrizal sulphurous tiang goodna lauck gresh saifi justiciable bargeboards olg katznelson deafblind remediated siac bassingbourn faccio riksbank cinc tvg explicable donatien felicitous disinherit epsa toyin erico cottony protoplasm obsequious refashioned winkworth zazou countertops durov ordonez volland testability crenulated dickov fishburn inchcape enciphered federigo windschuttle ystalyfera maschio swinden bahrainis nill thiong workchoices kanes maccoby claassen xoom darryn nwosu goumas colliders landsdowne seigner winchmore sbx zionsville masn sbarro ferencz adelie raworth serano globule rajib edmonston periosteum heys willette wcdma callable pamukkale krupskaya custodianship mazon electroporation leviev hershkovitz feore bernheimer sententiae pictograph ajram herston sotirios svindal austronesians gatemouth satch postville exide yeonpyeong atbara foros cubillas paleoclimate stonemasonry ruddington sundt calomel primadonna lichtenberger jeopardised experimentations sightlines tomm valore amate yahel yus eaw liberato emulsifiers calcaneus doled oww norie roseum arbella sebright warschauer felten curiouser pushbutton petach mamoulian swannanoa packager hefce williton maresme jonkheer minorly paraiba balcones witan chillingworth ngarrindjeri abridge bendit intimidators overshooting chaung caractère regulative hollington nordhaus protostar streetscapes holli oversimplifying mawer dobry monastero ocna mevlana kabalevsky lembo veneered bushkill yuhua basilone jobeth muhanna regressing dastur mroz pernet seawell kaneria searls creameries udmr razdan worldbeat unswept sternhell constructivists haddy biopharma lossing mashona jony kpp pfleger grune viñoly aetos haussler denia davida macgraw kreidler tzaneen stuarda maturities bjarke eklavya kenichiro riddoch slazenger wahidi trombino allensworth otolith boychuk niit pattan ieremia wgy 
quenneville randalls lovas babic unestablished payphones penas beezus teilo inarguably cabi otti mondlane brou boxleitner kuby ornithopods prefacing chitta outcompete unsa margerie ardoin syllogisms breakbeats adalgisa osteotomy massanutten gingham deerskin hillen tsvetan swallowtails outsell beerman hennequin babineaux agyemang rass bobrovsky duva choicest lammert imposts bassnectar giordana idlers antonsson seacombe ariya terraform beatboxer lintott schematically brushstroke adiga cedd ardito rheinpark dwyfor coninck bellydance lippy puting veatch hakki lyubimov acquisitive sahana ahlmann mycotoxin keepsakes minow reigniting balata fidh djordjevic riehle ahlers shamar tumaco abrahamian chorten kuehne korir elberton macshane camelids mendo dendera hermaphroditism rrf ménière bzp cityline nectarine opf maharajganj monetarily husbandman elv kieu abdoul julesburg vézelay hotness faxing barrosa hilma finnian chubs gribbin suddenlink fortuin ewelina feiner abdelrahman abortifacient muffs catafalque crescenzi nibiru glenny potbelly castanheira scrim imbeciles konstam kataev braeside fáilte grol raghib telekinetically walkouts voya fatemi caresses contractures babich odorrana shizuko gefen remanufactured mitroglou morgon gnawa chromosphere sbo tlalnepantla elektronika purring ambre schreder simien seborrheic catalin rosatom yosano filiz filippa malchow secu dorna regionalized sandiford liaises flunked mmps rajnath clotworthy permanency wyrick ceh rolaids darwyn borich ceta visioning ferriero plod suas cabel lebedeva neuroscientific neurofeedback harissa khaja neocolonialism coldham agrochemicals dworsky guapos bacchanal kuznetsk vakula ribavirin blon nanuet marylou mwana promethea knowhow eliseu canaima killough mebyon isinbayeva haedo bridgers sambalpuri tessera gittens schneid truther quemada hocine debugged necropsy tupa reverberated blowdown yamma wifredo disproportionally anirban afterparty sanatoria zolpidem hedegaard boesky buchheim gearshift oborne valetta 
equidae dighi enquires unexpurgated swarthout flitwick guillemette intresting smac machala fedden athis aylestone jaffé meiners parasols skender yonemoto whitmarsh culturales gismondi childwall stanleyville setts vassy chrysanthus disorient berenbaum szymański sanlih realclearpolitics unsaved weisel llew timonen vermeersch sulak kurnell witkiewicz sakigake phenylketonuria orthwein patenaude arison rebeck schine smileys cabrales parlamento eigg mallin moeder bakau zuleika barkham zipcar bourgois apiculata tauno dhp pagla leps gombert elongating shahpura ghesquière trefriw htb hatzfeld granero kvly mutability valon dysarthria jewsbury gaafar tremulous curig toné callister cutaways proslavery venkaiah guale zaken nwsa hacket banak moonless suggestively segers sallied contrives craster landside henig markson depressurization peress colantoni rubbings swatches wetzler tappets onomichi demange commer seferis atomization glenbard cauldwell twixt swineherd gsma mpsf antel tineo trethewey hilarie branyan sawbridgeworth elhanan dongbu mangue incorrectness wbcsd pownal sulk qanun cranny mehrab anissa clent filadelfia lrcp misteri malolactic sayce arcona videophone tidning dyche leksand trico rge takuji smoothest nodosum auke riverwood discerns broseley jaubert ribchester intimating baseboard senga voyevoda kalma cibin cannata barotrauma caresse uip mothball paisano mellberg topolski tosin tembe enjoyably giry chilkat fraport offputting milivoj mythili tgr gilels icare kelch semitransparent seances palmarès bahlul ipcs bunten saheli phyo tez senge sublett noncombatant patcham princeville mexès woud gorney camano goodwins bertucci tweenies pisans soldan mármol larnach bouder burkle verkhnyaya chehab knussen arbois meriweather jiaying vilniaus aldf bluejay mautner minotaurs zusammenarbeit emy foresti stehr churchgate propitiate lindenbaum fallas farad dacoits tuti ohlendorf belloni ostrea dapeng affordances aweil brabus lebus springfields laer schroer asarco harim 
paleoanthropologist flowerdew fatback copeia carella truthout likeliest schneeberger tanai ataka teta hha capurro churlish stylishly misspells pvsm supremacism highnote hwee meirionnydd whitebark bilingue chaptal plasticizer elg devotedly slowpoke guinot mersing corradino poydras refried unian regroups blagden impregnates michelob kalitta pury hrabal rucci exuding pulverised mres dfr severomorsk claas boisselle cernuda mercante dcnr feasted comunitat tinus ansermet pemberley reingold irrelevancy brainwashes ehrhart ppar boyaca khidmat goodpasture grindley wazoo servis ekeren violons csec gramatically clewer inrush manute freest novelized encantada clatter hamre vidhi ooc anniesland polites pirenne nimo arstechnica townhill eliud residencial whisenhunt nccn noke hibben afolabi daringly kirman douglaston asocial unive abolhassan kamble marcucci bumbershoot phouma ridenour copernicium ptak mohawke fadec régence almquist tunggal ionotropic ewings ganso magwitch caddesi sevi dpe accretions pechstein alycia saxbe kpfk hinging lauwers bletchingley essig wtp secularity lemak greenberger tointon storke takia barendrecht scourged swamplands congenitally snia prophetically yeun craiului cak chaki obsesses rahane consilium griping conisbrough charismatics langsford taggert milone romig bromborough watabe gerwig skiable thakore colvile prosumer lundie makani berthiaume kilcher roslindale lamitan kosten adlershof expressionless icsu newent imke kienzle emac launderette tiberi ruffy monsalve genotypic forefoot lordsburg scorns pinki bullous meep blythswood cognizable haldi travellin raptorial velocimetry louch knighting formalizes mulready mily stenting commandeers jedidiah broilers shalal glassed akishino hatra swearengen idolised pesantren everage frosch tyrosinase dimos bramson feckenham roundwood baselitz zimmerli lilavati jinotega gavazzi vallotton aiea naheed theatr soukup agers quiney millsboro amanti faders bishopbriggs jahad jeremi mccrum solbakken colten clumber léogâne 
mashina glanz clason hamal eeepc jdf sarver klea kirkhope bennison gosnold estyn letha ekpo talence kornet dersu nonbelievers leister chaud tepee slayed majali kingo trebled kheer sporobolus expropriations faile studier looby flori rezek pembridge shumaker vehbi sikka glashow sezgin jandal khafji natterer fechter outgames musidora gyroplane navara simus yanev sparkler derrek tamiflu bonnaire kumalo ucha ostentatiously crull coulouris samaha garen aspersa naef whcih diffusely arwel felici bovill koidu rooty azat conerly beckum knac merloni xai yuppies pauleta wajir nhon medoff gragg mushroomed cervélo raiford haymon porites abdiel sellon levet guanxi mayock foundationalism defeatism disunited recasts paternally cinnaminson placencia navesink anomalistic transnationalism rasc polhill ohi atlit pandur waaaay pardue chaine tweedle swannell pudovkin hitchins siller faurisson ahtna efford deet veenstra insurable tellegen zigzagging schlesser sannikov emboli weedman nandor rochas sainted ioanna wildwater gumpert dowels pluripotency attires oxendine electrocutes ceefax dogpile pouncing yachtsmen bocock piccirillo rason skłodowska agros kba seyni knapton imss straten quetzals maglie baikie bousman lati banani dewani laleham huaihai shored tradeshow kalim jianzhong clonidine dentoni konz tasered cassadaga builtin acetates xiaojing sadiku agah vold spangle prizewinners earthscan yokneam swofford horcrux waverton recoiling wilhem kellog kre abbiati misconstruing ravanelli lemper vicini ders sabermetrics wolvercote bearwood oversaturated weingut quex hoeppner maizière nosecone yuengling connel tarbat nitsch vicinities waru amadio kittur lloris epigraphist nxg nationalise swinburn deprez woodchurch winesburg bakos wilmshurst hackel liras ratto rakel systematize mouawad retransmitted kettleman xinxing tianwei pachycephalosaurus cachexia stosch cuppa nivkh cherrington transesterification moazzam parnaby mannin reclassifying zeri metaphysically elephanta cdw calment haslinger 
gustavsen merenda braunstone nordman ipil treschow southshore abord associational lineside ghaus ganim silverthorn balayan aala cruzi arcangeli rheem pardi babita bloemen pavie penedo sobrinho mahmudi carsey sherawat keaveney beccari fractionated ekurhuleni realness jsd ponciano toobin ramnaresh tamkang worldcup ezz squab cragin mooning boreman akindele senador inviolate conditionality hgf blaye promisingly katsuta mamani chaotically jeal womankind dollinger harmonists folkets numerological dka bamse yothu subsidising rogo bagi pirrie whisks arbury reawaken oddsson lsg pompon culter donskoi unicycling triga awqaf highcliffe peps trinet comverse bessone sgro postholes westpoint springboro oceanarium cona kriegel grantown indentures schlag urinetown piercer abms pandoras sundries théatre manou aica metastasize ardo osseous verica goossen nestles skadden restivo réalité hubby wormser lizbeth campanas aben aparo cryan windbreak ticinese bachus sportaccord derryberry tongyeong nepalnews woodstown kneipp pharmacopeia mendell horsell dairyman htr mudflows moville dotto ogino praetorians yune truthers montezemolo sherr victoriana criminalising aloísio halvorssen peterbilt tazawa millstream tschaikovsky nado narinder kastor gobrecht schnitger lokey portugalete cnooc letterforms karavan steff uspa peñafiel gransden zygotic edginton keowee larke superliner abass dakhil parceled brownhill lahij pede washingtonville regensberg indiggo haltemprice starliner sprayer protegé creggan toran rustico shellshock ramsbotham esai blust lorikeets tingwell diskettes danchenko hockett gagnoa ellerbee amerind dragone buzzell chatillon mckissick zacharia einziger sorce marchwood imploding sangwan kroy softline luxembourger kinnard courante jarbidge oddjob preferrable relicensing dirtbag xcor leuenberger seawalls keirn arny estamos phenomenons mccardle fadlallah nonplussed sabari vlm fulop rooth neot govert lawbreakers unburnt wonks renderers loreal holdovers redlight kanthi coachbuilding lgp 
spoelstra tijara jember terrifically struth monthon vojinovic pyromaniac quanto enza kerfoot esperia peca kharg alexe liquidambar thiophene memorialised kaifi wtam pulsatile hadlock warlingham clancey trivialization wakatsuki nudges schwenk netlist flappy codina casserly kyohei wenjie soiling gads polyurethanes hosie yba initializing harless jiyuan gasteyer tipaza gulberg luqman kyffin warf compellingly falconers salcido tendre omp borok blitzed boto lisco arghandab senecas sasu bastidas boumerdès jakaya odemwingie daugaard hurtle ratcheting glendenning vortec angiolini cannizzaro ult dreamscapes trevorrow myerscough uck godfred schaick gonaïves seagle magnetoresistance vergata wiggy rhib insein yocum yampolsky pwnage charline königswinter enitharmon aesculapius diaa bokor krentz yews northbourne karena flys nahm criccieth flyball fossiles uneventfully countersuit hbp grundmann televangelists sady wilmerding pradier zhulin ouzo screencast grabski hotwire brittania dwane geraniums deori microplate tearjerker nuka misnomers interdicting reichlen wrung tenofovir worldliness paladini tennet wondercon superunknown oped cryptologist yadier jevon nuva petrosino totale southfields ssrs hedgecock scac niddrie sadruddin sporn ixs rheological rocas virgos hafizullah ulundi immunogenic woolgar wda yokote removers annatto ciudadano kintner bacio dudamel kitsilano sapphic dundon allogeneic tikolo ralegh ufcw fishtank heffelfinger spqr seyrig hawkinsville stevedoring allem schweiker fespaco idot aldrete ndoye katsuji capuleti tothill baldacchino wojewoda mozer lindland keitai shafir deemster ellingwood dosimeter célèbres teter segregationists delen harmonising darkhan intoned erotically pariente poz countersued kaftanzoglio sparklehorse moldenhauer rootlets towboat jasbir ahmadov najimy avent dpo mcfc tomohisa internetworking zygons vermette dge relaxin troya athanasiadis surfs anerood yasa reorganisations blacklick overstuffed aprils helgesen girne heinola pluvial 
paleoconservative simplistically iwatani garraway meddler pontet gleditsch advisability dedrick chh baldauf buma ardi unenclosed newsbeat oropharyngeal preempts militarised mulatta unconstructed moonglow tihanyi deterministically flaunted showell assan rinku fornell llanbedr clínica cramton ryul blixa cartmell mouat stenmark célestins gyantse robofish uzair allways dangriga musan openview instep gld xcp assaying fahs dictaphone neese gyrodyne msha gillnets prosthodontics lafe biotopes maciek breland goodhew raffo wadesboro xinzhou heartedness politicisation alemanno multistory ferrini franson dippenaar fallingwater leehom bissel nbf rattenbury olas stigmatised thornes climbié krabby relegates beholding spay ktvi souviens fraternization funcom applegarth fulbert brining lorenc ribbe sidearms coté billman ganzel vainikolo hongshan ksbw smsc lapide mahn lello buettner chonan marut ery yorty turgay tws rubins ballsy michelis azulejos revascularization perrelli bodoland pullach erth zrt wolstencroft cuddington rebuts eyot ensminger ngum gallinger yerington manoranjan dallman twilights gouaches hilfenhaus gandara repudiates mastan haseen setouchi priv copulations walli saddlery akhavan garma swadlincote tadoussac hacken lakeridge shirking prantl contusions appy plesner matriarchs macrobiotic kolm hypervelocity concords lintz bho sharqiyah teachable pitiable yaro didia slackened hurford evertsen badenhorst botstein sviridov amplio chauffeured lefever okemos centrosaurus jades dahiya meldon hesder gillispie koekkoek kaleva garioch støre narges corbould nicaise woolson krasimir emigrates snorre boff discolouration rangell kwela committment leukodystrophy sanchar isay syf impish unicity zijlstra bucko dfp huddy nonselective splashy sonda icer ksb arvon lacquers catholicity laira mccahill icbc tenino ramal becali laurelton riemer suttons repugnance modded aksana wansdyke thomann sladek zecca sodomized bemberg narcos amadeu pumpkinseed tanami nordoff mcnicol marlen guoli 
beisbol moza recouderc unconventionally wagan morganwg höller haeng béal barging skb lasses delicatessens rody renkin competizione mcfarlin drydocked jackanory paolino raytown lamade vica derogatorily stahel ekstrom caulkins katzenstein hacha mississippians walts lucescu brydan jeng carolinum suydam implausibly lyssa laghi glackens lail sheriffdom telecasters didonato turkiye penalizes alexx syvret mcphatter wivb eisendrath beilin ladybugs quot headbutting veri sotnikova ponteland eddine westfields tubeless wibowo baine fetishistic dabur natasa kokorin bandmember sadiya losman ogburn lamed glaister ismene ecotypes gaols lansley ochil freshen franklinville medlocke haff tornabuoni semipro quoth schedulers welner humidifier brookgreen thamarai rhapsodic laughingstock brebner deneen rupesh planetside kaps andriana nyla dominium congos shahidullah ywam jic gunasekara hayloft suzannah gullion nomadism coulier gandak valdemoro ishtiaq turion deign incarcerate moriwaki pleasingly psat ginobili slive cojedes bleich wijewardene teplitz irg schwob jameela rigoberta curatorship penetrators coquelin ltz emyr jhaveri sunsplash tenser burdine striken varyag sestiere toton helaine mahanagar apiculture rongbuk anjir rouser vaguer nattier allori franchione tebbetts mavado loaiza rahan simvastatin boyana valdir butland massana iowans spitzbergen kerkhof thorndale picmg epileptics willich negrin funiculars spcs taneytown omelet salthouse daljit loughridge kundi javiera kens ahlstrom cognos sartaj trafton fedra eurocom dymphna niblett umari stumbleupon tenochtitlán cheska malec marang akinola impellitteri ellinger icsi opb dweck dorst voiculescu wurzelbacher schuette fortino danese lexeme ayukawa medecine babayan hanjin merfyn kango cytisus valeurs burgoon bachan cyberattacks stormbreaker suntan kahoolawe monia daglish ramar slone novelised buchmann sultani schemers ballu hectoliter graptolites stammerer corpuscle gwn chaturanga mecum alalakh alatriste brissot pelecanos czukay mcps 
sidor grajales clowney tgl olegs laicized sykesville panera troubetzkoy ciera pmcs videoclips lapita unglamorous blitzes uned carrasquel vgik bensley thorington mansuri julietta forwood glassnote grimwade najera borgwarner ngi kard burle fullam bakan comex pdv debat badung pomodoro vorn felsen toothy bloss wildstein roomful lydians darkhorse milfoil funtastic aapg coaldale mynd kabala ghazaleh briffa kiprop koehn argyris vesco niblock vinne zavadil stonegate huntingtower brickmaking likeminded bohl dgo kastle trimethylamine kenfig balck prosecco alderaan shawfield christmases avez valori bumpin tibby risser docetaxel plagne ysleta vend müstair arkangel neerja prosperi mizuta fallaci groggy tweedmouth shimkus brame antagonisms forschungszentrum titmuss flagellated granatstein cepheids rygge schank hierarchal wayfinding clobbered daddario dicentra angella michoacana oosten tranny orrego kpcc unflappable byculla transfused remota norheim brookhart wcpo loree spieler pepfar thornalley clenching hatreds guadaloupe jir poincare reutov herbin caci charkhi norcal oquirrh mubarek uceda mckitrick battel heilbronner salamone stieber brouwers actuating simonelli boorstin vishakhapatnam taua bodacious browz tushingham patterdale multibeam possessory wenjun choire aopa scawen choloma scie oduro griego verda norlin minnick tijana greenly rodchenko desjarlais thalheim sanden tinkoff harf sharpens spermatic nzru borei generalizable allhallows stuttered haskett cezanne rylstone medicate byronic sallinen dilruba imbuing acetylcysteine manrico groninger antichi berka nermin svitlana skillsusa bailiwicks azza hadean bisher dargie unethically civis röhl droog majles stolid backstairs frankau rawn lifu appetizing sheetz creedy kgaa carillion alexej oyston penetrant epicureanism pokrovka mongrels vänskä organi gaude unachievable agudo qadisiya stodgier consortiums slager dealbreaker khalilov keadby passu cavallero traumatizing multitouch marito razorblade hallein iaps reveling cedi borns 
lagrave szolkowy photostream canady loye jijiga alpo alsup heedless grails polarizers senates lappland jingyuan hbi keppoch upraised nesser postproduction inflames gnashing sanka neuropsychologist talybont magirus trammel aloofness contrastingly nucleosides despierta cicotte teertha peepshow lemmons snailfish wellings rafat gaonkar halfon gamu cyanuric cielito shitting tolkin gradualism kanellis rynek plagiarising alness ujjal weigle neé naafi kormos fulbeck smudges interna dénouement molenaar perfects whee ninds aping gelled huws goffs russophobia loga doji firkin kopassus calaway geppi faruq wptz chalons bigmouth mcgrane iara paulsson kawara chinery syncytial fretwell uiq shawarma capuzzo suran safonov betaine boyet southwind dupuytren buah developable triangulate montagnier tertre rajin dangles mursi mcginlay tolon toorop thenceforward djordje emunim easterner yasumasa ragone hardstaff nneka largescale glaciological gastelum renaults atic overdrawn dupplin grimethorpe implementer twinsburg duron venning chashma sibusiso tadmor todeschini rupnarayan miseducation chondrocytes askam brihanmumbai samsa gervinho yubari seogwipo hilltown adfa jltv auslander apostel trombley lgt fadeev tufo manekshaw zeroing ouidah fireclay coltman deray wankdorf delocalization coveralls transrapid maximalist cloninger alie kirwin bezels flyhalf kopaonik milada usdot vrubel valproic inosanto improver formartine pawcatuck dunstone pröll montalcini harring karuk salpingidis berrick palmistry grn lhr sunyani secteur petrides adis guadalupian kayama pixy paschalis psuv kalanga eckhoff maneesh breitman glassblower stylez trastuzumab tszyu brodovitch medicals churnet karakol sucessful showjumping straaten capas shaftsbury hambleden ambersons milosz werff lulled embley carthago panucci pushups hunn swm lione blz dsh wallichii sies mtpa roderigo unscr urbach pko elstow umberleigh yvel montesa semicircles sjoerd zenden lanie legard unbaptized lastras penders awasthi trepanation quarantines 
bérubé frameline cohousing mashrafe silberling paduan dooming greyed locution acog azrieli knb marriot indecently popocatépetl balearica moffo unwieldly rados muito glitchy hibernates quasicrystals jaggers dismally arat caber hevia argylls nairo milliliter radler esguerra uziel asra ikiru feckless shishir lahav chuckling reimposed inerrant winched feil jurvetson cyclosa flagon atack bidault signalmen tth disclaims isasi luzi unclassifiable doling macba glauco recliner mahin dubliner kanchelskis evenimentul fucus falkiner dorée kalaupapa eglington scelsi stidham emelia jallow triply zax phytochemical woodleigh tamburlaine klondyke unsullied zillow gigha nikole sanni liotard ashaari huka wistfully condover seymore rosiglitazone weerasinghe vimpelcom chevra brightens sulfonamides ledecky roes greenglass olynyk zanjeer momoa mislaid ctia kittle byerly claygate savolainen czvitkovics muckraker culicoides westonbirt seleção tuaregs dryads threadless wih thir musqueam cloacal jaspar pedler babalu sequoiadendron terrorise reveled coreligionists breadboard yucheng dauda madine thinkquest abortionist hinkel firehawk enlistees bitchin comfrey ruach durak fomina filastin faberge klasfeld curieux haseeb goodwillie recoding kwadwo sidis gomaa obes makonde discordance justis tarporley hamrin aulos fadhli patte neurochemical bloodstain adilson hacarmel jongen kulit stav bunte restructurings duverger massereene modin megalomaniacal sedia hamiltonians scrublands fallada stereoscope laimbeer villella belltown mafias kryptos neger grenouille smudged duyn marshaled kule joynson airlifting trivializing diverticulitis sibson lowth wilted paralysing benesch lasham brasier hucker weobley concentrically djite sheeler intercedes goater ryokan backfill deistic tioman tokiko gaus denko demoralising makossa transfield pallekele middlemore mdewakanton aimable misjudgment dangled endon schnitt ation chunlai celibidache zah bhumika maccracken sclafani higginbottom packy shaz trapster litel 
ponticelli sprewell padlocked ghita chailey santonio altschuler reposed abdelwahab dubus retinoid nonda extortionist tunnellers pewaukee deadhorse linezolid vogues bucha withstands gernon impinges efb bladet fondest frostbitten unfired turnings pintu mih greven parentis marabouts moralism muddies baruth virgina hosed sneers arashiyama amero sabahi hotseat doisneau mustached fattori tabara heis resonably saddening gnoll peery interconnectivity carrageenan postlude refiled quadripartite riccò assasination chambres inbounds roboto concertación thingie kenyah enskede wrona lul pigozzi cankers byc beida morandini ifds tarpaulins nystad orszag syk unbundle sicknesses silkie novar gronk llys kjartansson elyot antrum bivalent nuf highlighter sipah hegg salada schayes ghouse stoiber teutons emerich szdsz karakul lenehan sarkisyan hodie allanson grigorieva laettner commissaries heneghan simplot wiston vize illiana fuchida placentals ahola chike reinvents grizedale mercaz muji allocator lein bucarest overenthusiastic gatta probabilistically imst traversée mesfin collodi pribilof distresses nicd staffa ronning marloes unifier unhurried knaggs zerbo hary minish rengo aneesh zhihao jjl approriate tmo girado malines stewing hpr wagah corbu jaundiced undershot biding bryzgalov fews quella knapman taupe mertesacker cardenio pegge kredi beneficially sawar glassmaker ransoming balasko spaceframe lullabye centa sublimated scatterbrain boonesborough fosbury kli ardant kissell hirschi fratianno unitized westshore noson accomodation fairlee haart whannell adath loveman sres cowin neuhof taraborrelli jurupa panozzo liek mandaean rippled stik benefactress mucci wohlers bandhavgarh metalsmith bmb pitstops octuplets cernay yarnall ronaldsway valeant danilovsky durlacher levenshulme rubab duranty aslef logsdon fank xxy macivor agosti waun darnielle kangoo unfurling sweb shambu hendrickx luxembourgers nideffer richford lakshmibai labar nabe imperceptibly sallee geilo cachao kernewek peaky 
blanched closedown alist tozeur tyas alyth renea gites khlebnikov satou slx slaveowners eho pataca particularism hachioji schary prejudge adduce gigot weegee rootsy shey nikaia magis mclouth ldi lovelady pheochromocytoma gaffaney sandyford bloomquist cappagh mirer deister redhouse knockhill muffle rippin zowie heatter vostochny sensitizing perego giovannini varshavsky estragon wringer stavans lkab camaguey horáček tonioli polytech jordyn giandomenico bubbler heinberg rubicam toughs dumpsters koyanagi lambic tyseley joyal kaplinsky newsmen mcfadzean avilla ammerman chatel ortrud vishnevskaya granddaddy stapling nuada albergo colossi hudswell buni eeva beguiled zuffenhausen steratore renesas kenenisa chakan mindjet baclofen wahkiakum panchal germanophile koningin surfrider flubber salesmanship freestyler crenellations ruedi junos agroecology dayglo captivates demystify algerie blobby guthridge ecton brockdorff gambaccini hepper demodulator mellower suryakant rotan romine snowbank outrank malesherbes precedential debateable truls taxied lehel eades boere submersed cwp tegmark multiplexers teddybears bevins roslan gawd shubra rudbeck greedily katsouranis teseo thatcherite shiela whn crescenzo multivitamin folwell gerrymander wmal tazio allene crescendos paroisse baled mhaonaigh shaila isizulu dungey mittermeier consensually commonground kalinovik kailasa tfn bullpens sauropodomorph mörner pulli kutz shaa hofner sutor sheyenne cink miano ferrucci lottomatica mcharg holck dmo schweickart cannibalize riboswitches jawai fascistic excreting lodwick mozley motorable withey lundbeck dunja veldhuis counterattacking exhibitioner hoffner booksurge algimantas behooves xara mannus traurig materialists clachan carribean vpu tournay collectivized fréchette martiri commingled saltsburg cheverly bachelier openside bedwas inri selk chunhua benedicte hassled nastasia grannies ziege sinisa usmanov unpardonable ellena heimer helguera reffering adobes dudas tamen suprising ganay flagrante 
asatru  komische hercus schnier leventis scialfa frischmann khokhlov llopis gober thijssen ducato komai sholing cookhouse wachsman nightstick drydocks cbcs lasagne sacrement tearaway ustr kaiden aniket forbesii seasickness limmud nimes ivangorod sankoh indymac dhb astroparticle cuauhtemoc decon gwennap norgrove kelvinator kindlmann waba estee goulette rebekka oximetry sympatico jingoism dorky subhan böttger microsatellites pangalos amad quinet shergar berrow lucido democratico bdw kaypro dongyang onitsuka xiaoshan tuban namazi sandison kobelco maccormack paraneoplastic milke jurisprudential henrickson brutalities claggett sheeba countersigned nutella communiqués chiana riboud taur gingko cleome silverleaf hoefler curculio katsuki estève duplo lilya morcom canidate harrowden kaim napierville alama avadi handshaking valvoline asla dipaolo ziwei rore sibilla solf taedong electroluminescent kingfield beaminster gradel lavoy dufner ashlyn montolivo huebel megafon billiken vukovic golijov donavan arara handforth mckissack keret reinfeld pdsa absolon scruple revved duquesnoy wycoff stoxx rezvan highworth unspent stratfield nuwan pierrick larra cnpc breadwinners zorka aulia cedrick hkust nachshon finsen hamzeh inhales foglia balasubramanian scandinavium hustla lughnasa chupa creach caughey comptoir wojtyla malpighi coraopolis poya mrb zachry arkoff blomkamp checkbook looff vorm cruddas ultrasounds naciones mauris incongruously ibrahimi okina unoci tardigrades rezzonico peristaltic nafc catlow huell abrazo kemo croplands kealakekua woai abap vivero pengelly polyansky gmf heru ingela makhaya godmanchester kutschera ignasi ftb leprous hailstorms apheresis unexcavated kerrison supernature egle angelides morlot huaxia sht coralville pcx misandry bujar uzo labib gabrielsen infiltrations bartolucci fluconazole churston mortiz leotards cfpb fruita tallangatta ricarda lancôme pengilly merwede survivalists setif petrini puiu tasikmalaya zahm diori witzleben cke brooksbank mkr 
roissy millas ricos landrith hakimi dandekar cooperators avons aldredge rhm inital doggart labouchere jamun rutelli ucn rimantas mundesley gart cuidado lamoure huatulco sojourned temu tutela statuesque cvb formalisation bonifaz sadaat sirik scampi obuasi bloem blanning trista leisen garegin frej dahaneh distefano guesstimate illsley seeler markups torretta quavers pragmatists procambarus nawabzada locog igh kohrs mapmaking lipsey exploiters exclusiveness northover carrabba exacta matrona methow thackerville wjar assistantship blomstedt valgus onex tolgoi tmj ybbs tevatron icesave gualeguaychú codie mellifluous aceveda diablada bicuspid wimsatt embroiderer pittard westerveld bumba bestie reflexology silversmithing cotroceni freakout arsenale hertling suining alimentation thurs stottlemyre aui wellsburg hydromorphone gants millette gonorrhoeae takanobu clucas rockery calon plinio brunelli tidelands hotson ténèbres frenchay pansies tagetes poile overcooked kaukauna malaccan sazerac quarterbacked larriva arcaro caerau compendious simper markdown sindical gookin prinses londra srilanka petraglia eog shirehampton frothing wwn combiners legado priviledges nazer mukarram belvin talukdar ashihara cambered myre meins damac elmsall vugt derman eylandt rofe astar stefon fisting mocca sunroom reli shechter reactivates monstrosities pancoast zazi rebalance foad martella hyperhidrosis rockhouse mnj lingappa lijun eidgah rcog gordonia gaisford squidoo bellocchio badam glb pyres ginnifer rogel lacayo fonction cleveleys kuow marlinspike yongding flails wadlow puttkamer schizo savon polaco banatski whitnall wendorf fokin pastas vokoun gerrymandered biocon oxymoronic acclimation aspie orangi savickas thn graystone delarue meaker altshuler prestonsburg pequots turnagain noodling immunogenicity mariane ritualist plotnikov ivona antonioli gatland effa militarists venkataraghavan davidsson mahl orrick galbi pujo sangye galanos suominen gillum broadman respire rynn montanaro ultrasonics 
ncg rohner pittas burtin koçak isothiocyanate daira dietsch mynah uor aloma neergaard igawa rudolfo kaabi uds bamana carcase chryseobacterium haygood mulki saaz malter mcvitie veritate sycophants accumulative lygia gleich hawsawi diversa braunston reiniger hosoi pharmacol imei acri dinneen rancic edet iványi saho chinaglia rockridge taslima turkle raymon sgv latiff heun mouli jaroslava bryar esoterica rafo chizuko obscenely delf clementino turow hilding wiegert bromell wenig badra gloag electropositive snaking bordeleau murle trebuchets birling ramalinga rosendal wainstein garga tragi helou selsdon yakovenko kassovitz polycrates hénin fonzi marzouki backdating gaarder depor fortum locksmiths selkirkshire clausus reaps kehrer harkavy huimin jimerson poppycock guestrooms xterra rph torreon goodeve sanoma khruschev supermoto coachmen bolek dandan heffalump stinkhorn wilburton fritts beauman impudence nephrologist clendenin zabeel hornbuckle lactoferrin moissac wolinski mehtar inebriation toile sprue mechanoid hominum gedge marazzi waterkloof kaiulani termas laet eléonore aarushi leant supernovas joburg toula maglione futhermore saturates chesson gunna hitesh scimitars wallfisch ghimpu switchboards deamer righ returnable latonia khaira fatone kobarid proselytising perplex apperley ferlin schüssel mitsuyoshi diamantopoulos widder kaoma smalling belleisle riyals comedie imminence exosphere pierro amnesties walbeck ankylosaurs skokomish troutbeck filippos sloc leron leggat titova lifo fleener macaire wigman amenta kimmitt apoe multidirectional brixworth katzer sofiya follmer citywalk ferial leyendecker restatements nduka ozols weisbrot althoff javagal cottin librae zhongxing usrowing geomagnetism expungement lael sorbent worthlessness attunement plaisted mujahidin baldly mealworms totenberg phenolics larcombe culina tlaquepaque ladonna senghenydd percolate kaska protheroe agnar claudy wijesinghe luding compered sturdivant pattullo bcra luza vatika baraja hsfp witticisms 
marcu somethign twiddle corvino schmieder wicksteed pyeong minicamp purposing andrzejewski ohhh astc oxonian coved throaty brined bluford unlabelled pache strelka daskalakis assiut dabbles treichville inds fortrose cardi bakare eryngium townline spagnola mulching zentropa badel exr fremington workaholics pulpy eigo afterhours ventromedial merker dragonair warrented scrat ifj trotz gauna dekay sidr merrylands wenjing enduringly sayat cantharellus jdam greenview nivola easyshare benkler overproduced ocularis sarich khabibulin mathy flatware leadbitter comedown semporna guterman attinger guinee crosshair lidstrom abernant getafix duckbill heym knockers bucktown vivino mamuka goldsberry harambee xjs sahira onchan stablemaster usurpations ingenuous ebbesen leontes rkk sportsground ulusoy yeow recrimination seyoum bekim arous herkules midnighters picher boote gambara lyke rimshot getman agel himyar sadam romar cerqueira katunayake cappelle haxby pickpocketing schmied drenching shinned thiha kabin bagamoyo upholland siewert opara universita actualize phevs denucci katou castmate samatha viliame budig supplicant marki lynds prednisolone mignard chakar benifit ilizarov soundcard nannerl kimmins megabit dones szostak persky belliveau pontivy panayi colnbrook paulistano ordains dosari coniglio edlin avin chiccarelli bouley vitaliano turbinate israelita auchi basecamp gisella eurogroup portmarnock meise colesville teaparty viroqua bernacchi svga barlett huffing hassa dinnerstein benbella pria koyu watseka novica pseudolus gradkowski neelum duplantier palmgren grego labbe picha ranford anticipations aaps trarbach crampons gasimov edibles pronation iaith malpaso manika dahlias ruvo epizootic luisão popmart kurtág amfar panegyrics nazanin streiff msj hardstone timpanist baranovsky siamak noblet sesamoid rbe owston killion sibyls ironworker insincerity unceasingly seppala handi saguna avascular conjuror dulong paribus nafisa truc jogged hawser wrg dragway blackjacks epicure parda 
blockley hullett boyertown shud standardizes vitiello hogfather bloque enviroment yokel aminoglycoside neuropeptides nonimmigrant marris strobes techdirt touchstones walp formalising beedi anjaan abrego alman tonsillectomy lovette especialy mikaël descarga khorshid cruddy hyppolite sangma aphthous cardellini maniema versos reagon imbroglio malbaie pinotti kumquat tammet forsch rotblat gatson jonni hesburgh canonica instantiate secessionism tickers eim vlasic missoulian curtained mahjoub chatters samlesbury donagh sheek kokoity zueva pearle pieniny misdirect gouw catgut chatteris yelland unprintable andromaque rostered soldotna mxc castaños stinnes salomone banzuke livedoor dooling powderly keynesians gosha geli fukada rijke lanercost burrington easingwold hartnoll tgc foudre illusionism eling refortified smithkline guerriere tobyhanna ilich sumulong leninists gamez uncitral embolden mandera audette mozes soulsby bossing majestically creus brazzi mujuru sarao hekou groynes coope blegen wintersun agli geldart thersites alna axim schutztruppe defendable moens yassir chinami hgt hippa cing ksla hampe stonefish pashupati norbertine clawhammer darzi zaveri ootacamund psychotronic marmande stoss ovitz lambchop velda moncloa plaits daulton migi tese mangham cantillo bianchetti vasiliki gouws heliospheric fbe barwise awls mantaro bogeys maloja huhn midcourse inkblot gayfield erga nchc saadawi pecher lates regresses fryers chael picturehouse washbourne shope giffnock utb airburst lembah blackguard parreira kawan benso thery eboué snoops yaws bleeps malnourishment etranger microburst alrighty recombining icrp zairian transhipment luik samedov millage ducker acey empey somersby sternfeld westridge jiggle venters combles laster camerini dawgz tuiasosopo featherless lefortovo tallarico ahearne luneta savarin wiltern perranporth curdled tildesley cissokho akathisia poundland valencians rovelli arbib homebuilders risin prazeres kokopelli vittoriosa biwott avais antivari anjani 
vago runningmate valediction lgc slickly raddatz abre devananda multicomponent razali dubb discription mozarts swooning calotype swaby richartz headmastership adichie hardangervidda flaunts arbed rematches kingsale tottering narcissists complexions brunning lipo borings forenames zhiming hubo cornick dogleg pummeling schmeiser marketshare racecars yoy akn fondre beguines targhee ossorio tln bryher mewa turrialba keisler maccready neurotypical coneflower overcapacity senio yick sodi accreting mcnay crestfallen heuchera misericords neurovascular webstore hornbach arvesen glaeser haynesworth takita marree quidam jonbenét dominie stager lamona delamination mcternan perr lusso emiliana venerating moonshiners limonov flypaper hoppen sfh supress necesary nextwave soph faymann kingsoft recuperates wiht waart guanylate lewdness feldshuh besmirch wining schweigen lindos rouble furby pirveli fanwood trimethoprim botte tiebreaks scalene tacs dunkerton echoplex creusa driel gainsford quartus ringel appologize cryotherapy passarella kalsi phooey puissant turmoils nimruz inthe orris pagewanted revalued mahu mathworks lamarca obnoxiously avrich creditworthiness greatorex dandini pavlou vasp kandhamal rajang kathimerini abrikosov comdex harlaw alvor wabco kandeh bego ktunaxa kuniaki polypore juhasz euthanize undulation talgarth latos braben keesha krauth teaspoons spiderwick hellogoodbye velten undulate unhistorical buntline pogrebnyak delamare lungren hanway kelner gettier gunsan mtw brott unprivileged druckman dawlat funnelled sterilizations criner gasse phencyclidine beefeater mankiw gazet untalented siemaszko autónomo fallah boam blackrod amnestied camero lipodystrophy sankyo excommunicating cohere scandalised steams totentanz glyndon polyphosphate robbert mukasa amarcord sxt seismographs blf distillates valujet brutalism aaker perrotin tejera pakokku bengoechea naber ballyconnell menards torborg tomaz unclimbed hethersett zanon buzi thermographic simplicissimus rodica vieilles 
averil englands peeing nordsee nurit welchman yathrib hoisington mingora noz zambada asare slurring kalli brandies berrier mackays dunsfold dehydrate rippey vdb partite ewelme miep purkiss esslemont mccoughtry marter montenotte sujan ium glt mackendrick pabón hagg odorous castelar jedinak enjoins shukhevych moskvina keena outa daghestan zulema taiaroa katydids putts fazakerley hian holguin slumbering rimon rusal etiologies cavo bonventre maci uncommented reineke kwp torras robertas titley naess erminia busher hummed faivre disd unitech honeywood bemoan runup qaid bussing zurbriggen blattner gerth amstutz bruntsfield marchington meriel nonentity dramatisations rubidoux botwin stai baranova freep maxïmo alerta lovefilm manwaring iua spoonerism kgmb städel dawoud eyam crumley honghe coches grutter solal gosplan sculptress doar parahippocampal arsehole daytimes khambhat ceteris afflalo extruder ncsl carême bigtime kawy belyakov lambesis politicans iliamna edelbrock barazzutti wadhams twe routley burbach thresh remez uut forgoes phosphatidylserine lulla busdriver elizabethans greathouse chenggong daho smolarek teviotdale adap spohn debes bollea pozner heitkamp wiehe degenhardt cunninghams ewb reawakens localist rearden jackey cartage cassandre whirring seiran untangling temujin grenze sherkat jarosz nikoli maladie risha asao peredur bops vieuxtemps haise lifers ultraviolence takahito deerfoot jeer jtv inessa destatis diliberto doosra kostyantyn pontremoli cadaqués hink lakme halutz debarking issigonis birnam jóhannesson députés nuuanu cpcs originalism hancox osis manen onkar unchain odermatt uprightness michaele leggenda artan themba megu scudo rogozin mcmoran soken cleophas jaycen achondroplasia oronsay purnea waynes potente allbusiness ancrum wanchope yanovsky ghannouchi babilonia infraero sonn faim annemasse greivis fori ringen kneebone fratricidal crowthorne tatti storable booka breitenstein pucallpa federate mcneice kirschbaum abood envisaging mcaloon eridge hoki 
refinished gillot clingy eigner devraj eurabia pryderi tradenames turnbuckles bonzi baten hilf triforium damacy jolissaint wasikowska giat meranti westers tunnelled noteriety reimburses torphichen tolstoi geering enroth spinouts evangelia changement uttermost pirri sleepwalk eqs kamille sonck sinuiju woda nubar holmesdale donck tigrayan ejemplo sermo underfunding balboni houseplants heilprin personalising knockabout uio chokwe nyai deitz steitz phot rutka chenies werman yoseph toolmaker assim loane thg minford ggt holey mastrangelo laundrette jolliet whitened powershares scheidegger jowl rehhagel mazdaspeed pathophysiological ysa kimbrel prempeh blampied shok uranga girotti donot tager underslung unexploited sallam darrieux longbows neile kets risner laverdure dextran panwar olliver bourdeau reconfirm cardstock golisano duenas outsize baratta meyrin camilli rcra hobbie llanfyllin sihamoni lynnette nfdc puglisi barragan vanesa vaccinating schumaker bindery hummelstown benouza chequamegon presteigne optronics addicting mattera udayana eisuke univeristy myotonic pattenden albro imparcial peddled xyy vouliagmeni pamp ivel aje ercall futilely spragg ureters blotchy dilithium roederer grania akhlaq koukal poka loriot wilcke dadaists tomasa underperformance reemerge zorich iroda batyr suvi kilar bergholt satiate lockouts proxmire langtang ensslin lawe sabanci babuyan diko clangers matchmakers scarff huei zikr koker ajinkya tonnant choshi bedloe wpl diopter lucerna kacy natoli grandpré ballantrae auchan fgv mahat nrd lcbo leiser bassols edzell dorries dotti cbz kubicek whatmore varrick slindon anhang dragonfish micawber destabilised mannock deady ramer pontarlier pome tinseltown datas praline pittenweem lapi baucau neurasthenia divin hcb iberoamericano arrestees keaney exosome culford woodmont shatskikh adroitly aircon consorting caca patak thair daju reflow massospondylus iadb belhadj taffeta krysten miggy gless groll jarndyce oswal encinas croston lutze ogwen 
decontaminated wrightstown eisaku balibo potocka vomeronasal souder defuniak kathlyn irradiating tarra baaz clissold citri inveresk signups zema syntek effing hilaria expandability smooths hanisch friedlaender mccaughan recoiled christofias acocks sapelo leider asphyxiated malathion viktória acidified dixi quaintance nuran tokar formia roosa nirav fusi padrino valek caas raro leupold watchin florance epimetheus harmel alfonsi sirajganj barq kashf sampat koech tepuis leskinen hollowing totalisator osogbo hvidt sirnas heartgold bosham betchworth nicknaming nehlen ploshchad beil rotheram hudong dernbach frade declamatory plasterboard pahokee tacrolimus annamaria matina kfwb middles straitjackets koppers siddal corail nasda billo pimen yaphank cockell youll smartt holway tomasello chimaeras mccargo jutiapa allergan tomasini marchionni swerling curiosa usualy niyogi camaiore fetishists vincere mortalities nbu haddaway monteros federspiel kasauli metafilter vao prendre mandai mgi pablos ın skipp schaan sike aillet allia pritt minge vicenta curwood cinderblock bowersox modiano mullings untreatable twangy erian tanimura dahod ramaz mortazavi screes schlitt metabolomics miroslaw vilks booysen apicius interpose mariquita plausable cayetana kibet morihiro swashbucklers marve lobkowitz hearken fager psyches cavuto determinist compadre soem counihan accies thermus vartiainen shobu awns zadkine specifiers khorana hazeldine beason transantarctic campfield israr portamento whitewall dmitrij psoas avens panju mamdani bashmet muchmore ksf addey dawna nyassi lamoriello spiegler massingberd cameri arietta engorged lupica boyens meshwork borked boze fockers laplanche felicita crapp lomi alerte hatchard enchants unstudied tordenskjold algonquins batmen prosopagnosia bamboozled alaoui comixology korobov lambrechts scornfully farooqui ditchburn dominicano erasable metrobank urbanski duellists allwood qas fpb mebo footstep backrest glm dunmanway corbetta phalloides hyperpigmentation eskin 
guardino ilike michiana moderations cornishman evinces judes zhirkov lukowich bronchiolitis rokko irritations rexhepi shugborough galecki garron torstensson pruyn atomized hemed botello poteet surikov vladek chequer caballos awori regiomontanus ewhc yerma denitrification bohanon êtes advertorials stockades nudelman tarasenko siona fenter brener chiaramonte minghui malikov mauriello radjabov springtails terefe muqrin scullers butthead intec quach peko advantageously flatfoot marí oklahomans clemm sidle netweaver banyoles voima parmigiano tredici malakar eday dapple urbandale urinates bolex utaka wre treffers finntroll lindenau headrick expatriated mogilny portlethen faran bluebottle fourplay goog westtown layon abare dfff kolehmainen mowrer sisera tobira kosmo neocons fumarole iniquities michale suppan bisphosphonates asdf dapa pierino twosome shehbaz ffynnon andruw minicab shigeyuki miret mobileme zaide comptia staab nyarko satrapi philipstown lovitt lancy donnay bustani rosler ­ bellos piasa garnsey rotha borchard intercompany bloop pccs simoneau isachsen annetta brandts crossbreeds sunning movida maddened duetto swindles midp closs meningoencephalitis autogyros mida heatseeker hakuna piggery emancipatory minko lighty liveth dellacroce mcnee elastically ledezma haeju wasmuth updrafts nuruddin mafikeng xylophones carvell gottehrer ibraimi borley małopolska drainpipe nicollette verbascum lickin stard schuldt singler nabarro sagacious lashari catchier mronz wildernesses starmaker simonini armenteros naus daat ouali essl rhr hafer romay whitesburg eusa noughts lej candombe nonsence safronov smarta alphand xibalba vladyslav rasha vajiralongkorn vanbiesbrouck hatfields pilfered tiktaalik bledel reynes koppe osawatomie satpathy avraam soleri zebo skorpion daag brüno unsimulated evelio betuwe quek polycentric kellam gennevilliers buggin andreini cerveteri rootkits alptekin smugly seductions grocott tadros quila dewlap cleddau birdlike clepsydra highrises mkz priorat 
hetrick rembrandts pratten baquba rtw alvie yorath ptm chilis trysts cheilitis baccarin munte amans kahnweiler mjf jamu noer bynoe rougerie exploitive henpecked shahe abano ueberroth drage abberley pugmire yangsan shohreh perlite nijinska hiace sdlc photophobia benenson amagansett magon abbes palates ttk tripel imaan woolery madalyn meebo bhavesh bishopton domokos gimbels nasogastric pnw rajhi neak lwd nazione schleifer gummo charisteas surmising benon suwardi swen mycorrhiza zohn ksfo ledwidge rotoscoped boisbriand whiteread kostin gladney krauskopf manzai sted kowalik extrusions manimal brehmer pescado kagen matusow mallette ronchetti kovic bislama argillaceous cutcliffe helmore chador hypogeum regn fgb cherone buist blache mouldy wkaq dorrian climaco hudnut sahashi rouf aysel madeup kirovsk legatum grippers gerunds snowdrops agor roderich segmenting fontbonne staffords brr amorosa karonga oii nardiello vads avod cochon bonini econet maziar alicea aveo podiatrist velikiy preller dollman stoloff citylife margolies humpbacked costi kepala cryengine guérard maila donoho balat caprioli relishing awais coate talaq rlv freights mountainview congressperson nachbar jouffroy thermophilus trevis marilena gritti cheli schoenstatt pirjo mugar existentialists eeee haina crossers govia leptospirosis jundallah hewetson foday curettage soccerplex tsuchiura helfgott mlbpa inb whipsnade honeybourne admixtures hussy dystopic everthing tweener philosophizing schrade menne parri kuramoto mosqueda googol siarhei noren koseki barrandov daisey keffiyeh huyghe barrelhouse berkovich vanderlip moghadam kiah matancera hotwells rimet urgel evar papes braggs urdd lvr unmemorable yadegar dewing bovingdon toxicities bahian schneck russkoye mylor vrabel tymon schmeisser bookbinders myfc italicum leterrier karstens wilhoit collingsworth bawdsey susini chiriqui casspi koyo dayle zubaida alvernia berroa pothos niwot varis longwall africanized endresen seman gign ostin shintoism krays clobber attal 
highnesses rába savitsky lagman yips francesconi lafosse spener tahreek leashes oncolytic mccahon epinal tawan adelboden rgn infor matton debusschere wgst corofin coletta shielfield chegwin carwardine imer kleptomaniac ildebrando centros kyat izo satine ricoeur chavous exacerbations jaimes subregional mcbeath domenick hpu osbournes ebx azcárraga belizeans sillman dxf herrerasaurus thaddaeus fukumoto hemma usgbc yrsa rockton downunder asle djimon thornett hollering kime chatuchak bailyn syncom schiaffino radulov kreiner willesee murawski bowdlerized laforce caspe checco villalonga zakharova shearim saly samoyed hasbara olay azria challen lenexa natesan untranslatable judice icma northmoor odenbach cems gumy doñana lalaurie dabashi shostak aberconwy kajiado nimmer unpronounceable furcal corruptly giertych hochstein danell dalmore capece urashima irrigates limpid swu diaphanous wxia preimplantation ulv mclure mcqueary sizzlers externalizing schwer techsystems innerspace utrillo hanjour maraj henges rothen sykora durutti bgi moralising toposa redecorate castellari olaudah incomparably prunier lateline sicario belotti simeonov solondz rancour gemsbok rickwood guelleh schoendienst boodles heatstroke musawi satilla digestibility blamire choptank fhi tourmalet archbald unlogged gattaca tastemaker jonquil khris sniffed oetker heteronormative oteri proietti teff danil televisual bigleaf combusted shinko lechlade mezuzah drori joschka sleeplessness arcadium insole luminaire nonvolatile gaugamela burzynski jallieu kameron ackerson grou cairoli mowden aspirates esbjörn kreisau spenny kristiina intraepithelial tusked wittenham jagoda unhyphenated misremembering dimucci arlan montuno gez particuarly mazarine voyer inagua flaxen cuillin mamilla utilis novosel gurvitz bozzo cortines popularizers dukhan parochialism timoney libia choroidal jostled denílson cpy stanciu wedging editorialist costlier hectoliters winberg authenticator hasen lotman marazion sug mancow isiolo sogang 
bodenheimer zipping jlc quiera moulavi drobnjak wheen latasha bargo palantir biobank esson melandri greenhaven moussaieff crippa swit jests htun bracht hofland becchio oleksander shiekh radiographers spectating gobbler verulamium saulo shaposhnikov hearses mezquita ayyam grouting scrapbooking quietism markiewicz windhorst ogof replayability vanwyngarden cassaday debevoise octopi eiken shinta degla akre habanos vinick myostatin bummed steinheim protectionists yanase wraysbury bourdillon lacewings harassers lightroom rajskub kamaal veyette ritzy polke illescas emmets blackmailers easby nool nycha yatta garrulous siggi couceiro collishaw ponente syphilitic kandasamy hyperlinking ranville enunciate surkh gnawed sefi ajn talvin wftv schnellenberger stri attas ranjini bricklaying bedecked ghika almanzo jibs ergun kalanchoe arzhan dening ritualised pekingese gakushuin repartition parvis stardate phytosanitary tuer litigators ribonucleic delma swarts zdeno zohaib stojanovic tadano fulsome waterproofed arundinacea caestecker undistributed slatted ferentz bouteille neccesarily babini bareiro glessner takeya ankers countryfile shenker luffa haraguchi vybz incontro strop whisperers oncological moyra teuku siddig parer castillos thomlinson lulac doubleheaders margy tache cockspur morgenpost unlikable enstone skowronek polegate paba monolinguals eget lorence bermeo motonari metter hooten pelissier braamhaar shailene aamc tahi flippantly herstmonceux helipads neth balmoor apexes ninagawa icfi yehiel objectifying fletchers anally cambra sequestrated lmao asiri meze geisinger fishermans georgievski missin herberts winnow powley pyrites minner comparators intrathecal norddeutsche temporada deshaun zalgiris wdg farda kompong brettell tirona passaro vagana samwise cumnor thermic arensky polian kreisel inspektor deutschlandradio interspersing acho discothèques loewi phf ribatejo worde conformism screeners burtonwood blandness chloramine gelai zubizarreta grommet sodomites gratuities 
contraindication varlet onboarding insúa halawa rosellini lusi ferrybridge biosolids dynes septien gerasim apoplectic mirabello merest hovel ferdous fallsburg stickle gapped macrory pesqueira detoxify pavelka marda nemerov pedophilic nervy uthappa kurylenko donlin recidivist schwarzlose montană thelin bronchioles heaved haddenham baculum daubed roxon tokaji altuna dyrdek oompa klaxon overfished magrini carmencita trengove spj hardcovers mccorkell buttressing kdd extortionate eslami grandage fleshtones ncmp swags coline percieve mauceri missives alarmism haltwhistle honved distributorship dalling asencio mcelhatton halcon banpo brmb kohls schwabing maitreyi jewson linebaugh aracoeli cresskill laforet scantlebury nrma lambis sentier corpulent oher jessalyn kidson kuzmich mosk wenk thornborough prodigiously mastiffs pejoratives padley callout gretl journo khamsa deripaska laville changxing eatons samur ficker westchase siol jhabvala volle blindspot cristescu tsankov culcheth geremia bestiaries fluorocarbon maoming calado brocchi whippoorwill gurian cuticles heisey taler bramblett possibles debeers osteoclast gleed crossgates riegle oldboy nadra yasukawa yoffe yobo chateaugay rehoused nucleate iruma weigert tanar méridien ressel kaisei arrupe rfo slipshod withernsea castros hardcourts nare oligarchies slink satyanarayan acaba buczek ï vilkov praziquantel yutong asiaweek tortora brotman cordwood volage peguero rdu taketh monetized quattrocchi signis tinning virgile schwartzberg kadeer peover thiamin otowa kyphosis fpd ascertainable butterball hawkshead faros jpm lacertae rumpf ridgetop bargen difficultly driesell superspeed geach concertgoers cremaster opata arundale hurlock lemmer kirkkonummi ganea milen technicolour matheu weasely lbm maeno vanwall baddesley lespinasse lipmann pichu leeder haendel mitad dant pedicab gándara hallan paoletti imbrie khesar legitmate pigsty xetra sahuarita arundo deiner understate cholecystitis cadigan sebert glenmont moltmann overeager 
nmf vionnet adressing toxoid paravicini digue centralising biosensor dickhead ackman kheel scmp privatizations calmar repaving shasha freas procurer sundered atorvastatin vituperative certian reconversion muddling skidelsky yanshan mountrail zakarian propositioned hamstrings tropism giorgetti bmh coddling axelson slighting gravesham checchi phytoremediation fitzwater psuedo myleene rosenwinkel cnm macaco bardney cefalù ozias miniskirts kouyate tiggy rockefellers frühe kibworth benadir miscounted ovenbird strout inundations igad monheim moomins pachanga llyr cheeta dearmond sundowners goed nihilists okpara slouching ostra godín calcraft wombourne villela placita starlog tabassum jinji releasable psychotherapies latecomer pervis belleek johana termez odet quesne chancy prisa magix lakey activin rattner pavlina newsmedia allrounder basketballs kailin freel saveur dileo sadaf formalistic postulant kmpc minkow splenectomy lors kbd donovans towy inessential maharoof sugai getchell freebies paiement jois righty kafe empathise quetzalcoatlus stieb schiro kuruvilla stavisky leeves tmv oktyabrskaya leukotrienes karnac cosponsor descalzos jarett magnetar baljit whitebait savoyards namings sugimori buffeting fumagalli edrington meggan dozing rosten haston mema tvl tigar quacker fillip gaitan sanjuro edgerly piette festo ssnp lustiger unsubstantial wrgb pohlman cofield cobley bruckmann hardier providenciales bastianini dryly lombardozzi oea tueni skitter koushik karnik smiting scabrous syriana dowe fringy hemingford laca banyumas treebeard mosaddeq saren skyhorse bridgett resi professionalize sturmovik nans whu peachum rabinowicz titrated chicheley virta extinguishes casiano bodyboarding fcpa taicang transgressors galmudug piffle danta vermaelen parador forsythia kempster cubo pries imprudence belova babbo garsington remmel eop wittily zollinger applicators speake diffident scrounge avit torok gerstenberg shiwei goomba auletta lusby kipping antbirds ferm stanfill lugares 
steeplechaser blaik lavette zillur hustvedt misapplying laffin kerchief zeina akinnuoye murer arrochar rabidly kripalani johner buhr deseado tsuno nyhan kurek mushrooming compston firn jiefang changqing handwashing terpene schuylerville kyril cautley stene iesus silicones kjos vincristine urbanites cobie kuechly incompletion masucci greatrex bodner cartuja femto terrel dacheng qissa tieling basnet gyn dynamis gvt slicked soetoro kljestan hofstetter benecke becht tulpehocken grinderman radioiodine produits pohick chatrapati coolbaugh luminaires montreat acklam kgc eip piques corked landsburg sandes stever frayser dendrimers paia igd communitarianism attapeu wdjt baiul michetti plumbed abysmally zavaleta hawkstone antropov xinyu tornillo asman sursock xsara lannemezan kwtv pollin wlad lemi trichardt freestylers brixia sccs ethem fluellen sandersville braunschweiger reiten clopidogrel eulogised sdks fishback fraying ugarov eijk navjot ringrose daini coital mulugeta haaland teutul recategorised larabee pashupatinath grimal alhaj raymore bourlon mattino kanerva mtas mihama dulcamara luffenham jaqua suef kemmler ngm jackdaws oppermann hived meeke elish dargo unlikeable dafna sunninghill yamaji tennesseans atomizer marcelli teegarden trapezoids sammer homebuyers bensimon pacta washouts danso chassé plock solitarily bsba oceanica hijras kiyomizu haefliger bvn buntine psychopathological panja wchs villamizar hougue arses mandore raveonettes heiligenkreuz jokhang zarza chatelet pacepa attiya ivarsson northington thialf glenfinnan nemoto messam cicco karolos restlessly eppa biberman wallah pilchuck directe deuchar dimethyltryptamine bejan erekat abraded trackwork nagla noby wpri ammonoosuc tilefish candelas vizio hayashida bluesky peppering footlight delegitimize epoc arci vorpal unacquainted goyt paralyse lipponen hobbema alleanza ejidos gachibowli fructuoso faumuina moiseyev valarie saintfield sccc chaiken doubledays mergenthaler ihab senatore cyberinfrastructure dhd 
scaglia villian scald lycanthrope vural fondle bressler wiba bevill bifurcate guarisco akhmad bacco tautog hierophant sardinero photoacoustic irini rungkat berrie harkes cellblock akhurst hurtig sharratt thurmont causley monegan cleckley ansf crumpsall dimitriou giannakopoulos tweek hacke caspari miscanthus retuned kaluza gunby curdling decisional delamar minson fulminate begumpet robinet schizotypal healthily auditoria watty cussons miva devyn hession nchs negligibly sundering kosovars brodick photoperiod lius cardia weetabix grehan hho netrebko resko eurosceptics tenenbaums mhg sugared euthyphro herkomer mirando fitial casei biomaterial defintely pettengill munchies nambucca gilyard tanin sharpsville brehme vengaboys pazin videographers galal cording distichum ludic zilei clickers dirrty airgun eichenberg heaves wahiawa wyland sportscasting melech nuwas typesetters susser wlwt walenty nisra lehne hawiye itinéraire pachulia gabonensis aeo neumeister unshaven rapin kertesz galwegians karamoja santacroce mikulov caniggia goodheart radiographer hutz turu serenading chide ufd mahout nickie moku minuten kwale villus radmila tonda crossbencher abating punchers oger cornerhouse climatically litman borak megaplex fiordiligi hruska tanabata marcellina holdren emperatriz soopafly biffen cupples janiculum lillias arigato glimmerglass gladrags tossup decamps profilers pariwar cantin quieting barsi currying douri schuch knp sallisaw expends arbeloa somaiya idabel shinseki koteas alderdice ghoda refinanced expurgated zych ambulant borwell gruesomely procreative unitedhealth huckett areco barbourville maita cael cowlairs ispahani bulldoze buchberger bleiberg daubigny politicizing aicher abruzzese dabas rde wordie kensi rashied fanad skaff bollen indivisibility sirkeci tradescantia rachele fayyaz doidge ramallo meningitidis cheddi londesborough dangerbird osier marsico uachtaráin muntaner tucows hinesville rusin wvga hyltin hachem solor frightfully panicky ozuna hazes borowitz 
sunwoo krarup overstrand pisarev cspi slbms soundman islamized kamps dysplastic insuperable capozzi ajayan shibumi catastrophism mokri muchas urbanity ioannides gme flandern springside stegosaur toles colletta begrudge baquet badland kleinwort cbot wellen hidi poom buggs hda yanukovich fibrotic mcareavey skempton platner twittering vasilios checkin tamarac pignon potshots mulberries makua leane weininger zipline russolo brandishes clathrates snubs unsupportive kalter alyeska tpk herschend belnap kurahashi steinert tracs baff kopec nasaa saltergate pichot klc hasting slaked frailties bfr kumaritashvili zehr yeshayahu rayos sarde toils ellmann birkebeiner parrying anae meetinghouses ponni kakai niac zafarullah commercialise procaccini maxene frodingham condenet ghgs braman deyan museeuw madej langbein bicicleta wunderhorn molé adenocarcinomas aubade kiryas maximises cacia hydroplanes cemetary cafos phorm fauntroy straggler djam encana glocester barchester monumentum dewyze gonadotropins anderston vercoe rebrov disemboweled elrich benxi neeman reauthorize dennings windber ettlinger anila kuwari silone wilbarger northpark mcalinden peeper spikey saung ramshaw qazwini rossett corica txu mychael trussville souper retested jauregui unsettle tilles patkar ucode tuneup glisson karabagh kendrew rumina gadara grobe paramo fames guiomar epaulets encinal bandersnatch countrywoman okai digic piot hacky abbass tuuli limps zamperini aerocar barkatullah boones prelinger duranguense hynd cowgate guelmim hopital ilhéus bramshill asfaw heirens cabanillas bulatov solter drench lignano iod wineland spanker unsponsored cominform soundstages tomov shoham abkhazians lanfang rickardsson nationen chikan imdad mortgaging distler cymdeithas chivo shrugging haidt kimmelman echocardiogram dagfinn pokljuka ramanan marginality ridgley moutiers onel fixity anhalter ravenscourt reddened fiorino taimyr mielziner moharram vxworks molter unpatented fangled eltinge driskill lulls bayamo tenner giorgis 
friston villacorta cheang saddington schwebel seresin tpn enterprize shirvani personalisation chrisette itraconazole stemme cruzados keast delk warford monotypes liddiard fuxin liba moonshiner ferments lrl jahani correlational cucumis salyer stadholder acos falzone neuropathologist pacu sherritt sereni gidea toehold manpreet gpv avensis rauno bassmaster euskaltel haslar althouse balmes chatroulette stough chikamatsu samosa deery wahba detwiler herreweghe benburb mudcat bereket meachem ticky toback ysl nissar urate oberwesel widgery kex wannstedt obziler sickbed faultline aberdyfi amby bandari mulitple kutless blackballed foundas antacid tribespeople candelario pauillac beyeler boersma coalhouse yossef zakayev iacopo hubbards swappable weiz whitener elderkin kasr shourie viorica rewire vmd sodastream blackhill soemthing kernersville katas sverrisson consorzio jarhead thornfield anticlinal slik klepacki mudflow appassionato kaimur anneka kunstgewerbeschule multiphoton importante alcazaba altura kyler grackles neshat kovner pharaon molex jli cref matmos garum ghirardi chobe reald dwork kampo muren alecia paiwan virola shareable mirabelli mothballs swissôtel apostolov bobolink teitur elfed behaviourally impeaching bastiaan dannemora psychologies rescorla batha shahabi finckenstein namc zindel meder crk shalabi maral ekran nahon mountstuart afterburners sexualised enchaîné arney proneness antistatic jaff sorrenti quantcast lasd narimanov tidd rebid cerri buer bogyoke pardoner hargest kronfeld parricide lachiusa northen sulfonylureas leny rhenus cleugh regality sakazaki bundesanstalt freia fogler schluter kgosi vxr grovel wfd superfan subcutaneously wretches azcona yusoff lipoic thistledown buxbaum thumbprint oversimplify siregar embezzler dups csrt strömstad whoppers crapper scheidegg avalokiteshvara rockey azimi grigio bertorelli cassani flattish qinghua hewish zulma heusinger boozy tverskaya atem brochs warrensville labrecque cjm utterback mirallas tipoff 
combativeness yangs methanogens bahal sleepytime ohtsuka sceneries delhaize brunete craigellachie carri glovers hosoda obeidi anegada gyratory teamer sjogren laotians blonds jerram kjr nonoy cundall arribas laska helloworld ibad malise mobberley fulgida battlegroups messiness kamaluddin tusc disavows phoolan distributable zingiberaceae ogston remold pastern farha doula commonness effenberg manges vbi esfahani sterilised qtc kello foamed vanvitelli aalsmeer ichinoseki belcastro railsback kurin bergmans minimalists syllabuses shigeta mitchels liese munis metaplasia fesch wiltse hassanzadeh radiotelevision helmsdale northwesternmost auxilium gulli yiming fiercer malgorzata slama nefa waddon bedwetting sokolowski kichi rizhao deserta fnd moonves connoting bunney amoung precancerous repp drongos neera deno juddmonte cadzow koregaon tiptronic cwl polman matahari showstoppers mehul krispies grainer biomedcentral cresco propagators dosent vioxx siskind lgn diacetyl pulped cajole onlf danford weigall hudlin postdocs uncleared powerboats turbomachinery perpich eccrine coface bronchiectasis ramdin gulmarg kristyn oestrogen luini pailin keypads sego kemar lysy caulk bbv juskalian muamba monzonite biegel dracul witbooi pfund ambassadeurs chelm mereworth talcum frivilous soichi ketapang tantawi redraft schoon haizhu antifungals cropp armine twelves nazarenko musicares yangtse alric gelais escuelas wsmv temperamentally josslyn mithradates fanbases daleville mumin plasterers facchetti elmdale gerster whitwick swartberg vagner serlio annina elron turnour kente nikanor contento kleppe dishman mcgonigle cenotes wwiii redgate vignelli muan nvq jiangbei aschenbach mikoshi onchocerciasis romanced studding fishmeal speroni kitale urla berel jaswal kaempfer tollways terna macrozamia grinling funt guardroom meridiani surguja overtown imipramine paani syncrude dermer youcef veracruzana runtimes pasteurella riyal coppicing weepu prica eastcheap superstorm sois kaura asara kuffour sunnat zisk 
happyness qrf allitt ruhland mcguffin mandoline bothmer iparraguirre straitened coordinadora webelos hudler quaich tychowo meadowsweet medicea dushyant satchmo breaston mimpi scorm kemboi moosic tean anosmia oestrus dayman jarvie khir kirra readopted cimetidine mcshann cauchi mystification tidball scioli marquês prader saundra polesworth religiousness moakler sanitaire bircher fantasist nostalgically numbingly cire faryl meconium sampans corke seperating fadul waupun stringfield wegg soulsilver minky kirkup zhongyi cruncher ibma enquist rengifo lizana charring arreridj facio kizlyar mohseni schoolmistress knaben pulpo moncho devadasis ganson noya osiel luers smote stroboscopic defter ekonomi laetrile kxas internalised werf anokhi ayanna terrington markieff bbw birgunj alfi qif beaird kansans burketown lemche dilton depressor wildparkstadion saidov altimeters hannie glassmakers saudek rhodia dippel herff yeddyurappa pego kassab dunnington britian craftmanship motl menevia joeys dingbats fowleri remap spiceworld woetzel solovyeva oldschool cisg incandescence cultist rockline rajpura uksc khalilzad giannakis winging delineations gregorc gurbanguly tautou outsmarted wonderboy dvora backdoors karabekir sesimbra gonchar arrivederci moonwalker demonoid alfasi uggams ravil teals nibbler tadesse ronay unsurprised hsca olliff izaki interchangable ascertainment feltz ikechukwu serviceberry libbing croxteth latrell slogging shapcott yaakob gérin aartsen maniacally miramontes ravenstahl pluralities sundararajan knoedler investissement incontestable liatris friburgo masturbates appetit cannabidiol decc ballyfermot markovits sherin ghk tatev lamotrigine pitifully centr marvi andenes muktananda bloodily hommel cartogram qmc marai agnarsson lululemon juergens aicha panhandling toyne cherishing rooter radecki coes leandra thua unifem capetown enlightens nishat santisteban elion garlington vorstadt roubaud javaid fistulas bootstrapped hakko sigfried hatoum maytals millwork stanky 
collaterals scribblenauts capleton miac ifam dlo coppock roboticist kinked yapa roofer jayco mazzara sluicing chidgey shenk hbcus prospers cambo pbn macwilliam huys balt karanga frightfest ponomaryov quetiapine goikoetxea wlodzimierz lcross mcgrigor mufid ramez greenhead anac backstreets takimoto radiopharmaceuticals warlick eizo fiza manacor molave htut vivan kitgum teguh ternent mailloux yearsley filomeno unrolling misir warmongering buttonwood idles parni keolis culdrose organophosphates gaillac boyette almondsbury mariss batmanglij sureste benefield nidia deviancy zizzo summersville growlers rangjung cips moonroof mangabeira monopropellant pushdown brasstown andrieu vri borderlines ahlu oira pendry lzb howa cockerham defuses horsewoman hamworthy coalport eker bulked bertos reliefweb kienholz horlicks talentless evinrude keum sacconi mechanistically carrano viktors fishlock guzik schertz mullany situationists yanjing digitizer squeals tayshaun ergen indexical ekklesia penikett thinley brownstones warnborough sigismondi weever heol interorbital kalemegdan fabs snubbing estefania jajouka schwall hdx kway costarricense semiotician ffw apba stuber duncans tapton warth briganti tampers wistow hydroxybutyrate gallate mcmc mammograms sanzar lysenkoism mmmmm demidova gazans mokhtari imtech benedictions gargrave npas woio legba kiai xev indiscernible exter lodève qms pakse culiacan entrancing chansonniers vindaloo pardy neuhauser hmu guidotti carquest shahaf speedsters brousseau psychobiology poos kasongo sirri unquenchable shold fabians roomba heitinga gurgling godsey linkous ratnayake antagonised stirrer lividus herridge sethna wieners turnback changan caha trabelsi charlaine pepinster megastores nafziger plancha durres akobo pointon replenishes parolee evisceration communiques hoolahan dudman stickball deniable patricios reiley liesbeth maritimum demeaned barer aleida skindred ccel lpsn abscond ovaltine elizabeta barbat akhir nautor baraki dodaf chronobiology spoto 
disproportion linocuts palmisano fishbourne anamosa kcia rickenbach shavian papoulias jodoin hgv anwyl slapper nimbin goodge brookeborough kurfürstendamm mercaptan gnh joconde bhb manège jayesh schuon ofac paho interclub goizueta bufalino omarosa sandblasted hotbeds listers baryonyx barboursville penland sousveillance hambach seabourn mostow pvl zabid laham gote formatter burkman ytd norwid agonistes chiselled gunaratne verme snowberry garrotxa lovage gutnick flowcharts telescoped yoka chillingham soules lusail zervas saule lannion viewfinders cockscomb trappists andray tahoua jabbawockeez parras sundstrand grungy topsport macnair baccalieri fonder marojejy hybridizing holdaway ledet burness occurances phenylephrine hanania rebirths sarli acording curmudgeonly pearcey maham zeferino carlstadt tayport cacophonous nalco denervation neeleman lovegame scotswood ringa sumon lenda marantz devitto abani deiniol stelzer dunas yongzhou papageorgiou brusca slas immunohistochemical midkiff caia gwathmey weavings flatmates cammi imbibing sulfated xeroderma shanghaied swivelling adlib acceptible finkler namal newtyle thumps feit jeavons newbolt gazal thirkell graef extravaganzas nortec shukor hetland greatful mopan inducers ngp internews bulgur phoenixes tayfun zus souster organismic bradbery mandovi cwr hiroya glencross sahota sanitizer clamdiggers eido brignone atiya absolves garbed woolverton deniro schoemaker ypm flaying vassaras moldoveanu eikenberry schmutz firehouses happisburgh semak ieva difc cholestasis shipmaster inwa pcna bhatinda inglish vanoc timesheets fratta margetson omeprazole levina pablito sawah lfe mapleson liquigas phototherapy heena aboubacar koger duggie gushi htay beckon jackaroo vatuvei andrewartha bonchurch filibustered oyj latae rivieres godderz wahoos stigmatize psst malamute alvina cryonic meschede obscurities wellesbourne blueshift garvie reveres timberlands copello guider ivanek kassebaum bolckow porcel omentum interjects canarians invision 
jannus beauchamps braly ningning yampa odium rosengren subito disaggregated glynneath servicer worldwatch hopoate bucquoy felsenstein tugged wrinkling venkatesan carefull merde sunsilk andrianov laumann upwood chapelton unties alz chaunac hedger cspan shioya kabira proyas deepdene firebomb hamet egar broyhill tmw barasch bulaq mutiple ghafar kapolei iodized frewin dehmel snehal courmayeur helma zazu karli mesmeric aree bichel mellado leadsom juicer azzedine hyacinths dits tade quechuan macerich fingaz headbanging reptans leutze necochea strokeplay pallotti sarla barnea carbonneau ichthus busuttil maintainance mallo mcatee visconde adamec ramadani tamin seia detent hopeman saxa blinkhorn bement domke siedah gwe barnabe sudairi ramakant naysmith viesturs picadilly bacton buglers sareen eimer karter nesha sukuk tavia chavarria laurenson scansion khoren pequena monterosso lcb ayna unreason losa labuda maroof silao stojko trinidadians cuaderno dolliver nestmates decompressed henryka keithley meaden reichen losar stromer leahey camoranesi screentime thonet lookers pumpers norfleet collectorate rakhimov kunstsammlungen ccna trinny spremberg encantado sneh swardson thrid defiling valcourt dechen mcrobbie oxycontin ekatarina kolli dumaine bushong yujin sharrow crusted surkov keenen headwall strozier erawan fullstop afzaal batboy molesta longmuir bogging mapledurham neutralisation cando achakzai belugas dainius contentiously pressurizing cbcp maquettes nosing mohamoud russophile soay debrief martires kelm buckhaven projeto bendik aminoglycosides qinling nze streambanks wadih shambling lozier routemasters freeness swoosie avnery bastianich poperinge markowski sciuto capecchi rodenticide deliveryman umuc pindari glogovac fazlullah pellicano reran ajou vesteraalen hydroxyethyl desolated aama fluorspar falconieri westmark mahasaya gumm nuj colorant nejc klu shorthands luvin wertenbaker monja zaytuna xuande auxerrois mthethwa duprey savoye dethronement morente dantonio bhala 
outwits scovell lawdy caffyn talai tamam carabiner wurtzel peirson vilfredo minifigure manoeuvering laur anelli haislip waterview pescatore madgwick madri afters gpmg bechard jordanstown davoud sitatunga bankrupting annamacharya shouldering vinterberg nymphal murase hoving otilia motus calame manteau trumpler smirking hemraj xingfang tasca puffballs chep cwd mightn foodies mccraw doxey zhizhi wilbon dawu bedbug adcom wicking xiaoxia bootie multiregional bozak magilla jianying huallaga obcs unshakeable palad brychan meissonier tille brayford xrs oktibbeha caparo stalder curcumin barillas solimões thisday moas appointive unzip marathe gremillion humping cowbirds touristy mariyam esrange aiw bne swados karet efflorescence ovine swainsboro salzer sendler eynesbury compatibly rumbled cubital speciosum ecke dabbed strube kalugin fitzmorris franzi bonpensiero tinubu tahta fornix brosch sweatman engblom katju poff tems puttenham natoma nativists khoe stecker brymer sorsa classicizing knifed massart boonstra heatedly ballydoyle sulfa pinhal bongard marouane revolta stoically fassel ricardos apocalypto pharmacovigilance inextricable serdyukov glyptotek wookie micrographs niceness menchú ishim bashaw prigorodny dkr halsman cominco overprinting dorel clairette achterberg petrossian resurrections kabuli palpably brockhurst sawmilling kettner ekdahl mozgov trustable sadiki uhhh kowa sleevenotes mcnelly baisha shidao paracha armlock thurl ymcas priestland karlen narendran jetways stratas chami yermolov medicalization herland gusle southerton dimensioning thundercat woulda mcree nield elvia wattens ruysdael macfarlan harrachov consanguineous kopelman rautins sobhy overtopped fhc tillmann cbso kazin slowdowns insha matuidi khaitan deconstructs ouseburn pilaris shopian saltier joba ibla grammes lydbrook maoz vouet iptc dhari quinteto iliadis rentier berkely professionalisation luy lorri castaldo halcrow conservatorship celerity trashigang speciesism zeiger delly volcanologists 
ridenhour nunney dehumanized gongora blisworth blumlein meserve neckarsulm ikoyi avtandil panfish gida smets vkontakte rykiel triano auken zombieland schurr iyaz applesauce teepees cumings engelke caerlaverock newtongrange viagens wode downloaders bullingdon wentzville kaleo alderwood benzion jaffery tetons voegele fauzia colligan garanti dumdum chukyo overmatched wissmann xaml ediz ratdog pocantico knauss shameem mandalorian grabowsky harsanyi kaulitz hochtief emini talita dietician botanicals yanick heathkit semenya yatim physalus yelle aéroports dorabella protium offiah genesys footages ncss mcclurkin bosso fould tiryns ilgauskas pangani araiza transracial deathblow leering rheinberg hemlocks touchbacks evangelisti hebi sangria braggins lizzani lucite mingles vendel ecsa labianca worsnop wthr dalys beetz laymon napoleons wollman hinchingbrooke villaflor kessenich wainer sedlmayr gruffalo babylone sabarkantha chote mantler misi yadavs maladjusted deoxyribonucleic arko kiteboarding cellucci waterboy saeedi alemayehu idec prionailurus klopfer cornerman cauterize equiv spacial resor cordgrass grimbergen selecter knd hazim botella haunches royaux pharmacodynamics drik benaissa jerman nibbling enteropathy extrude huzzah saijo marlyn overdevelopment pfannenstiel semir aonghas momper stauber cordeaux olejniczak liphook shadowplay eichenlaub unenforced khadir darran fehn ahmeti hirshberg ganderbal pentz overtaxed koco feldkamp labra lewthwaite frr satirise bapat swink tilework idolaters waterjet bijoux mudford subotnick ghostlight mccombe aspies glaspell dyncorp mannings neomycin anothers onita lup dwd parisiennes copulas paskal edlington beetroots joscelyn narsimha gartrell moley ukase guerrouj dilophosaurus haykal autochrome mcx minaya krama subban rwf tappahannock biomimicry bachs waks mahina bolz marzotto prosky tourettes zox imageworks polari boitel serenaded pusch isreal cevat noordin dotan teds welches lundstrom meadowvale tokenization lubis nanay villiger 
agecroft pachacuti maiman lucus kozel simtek lense brodkorb leukemic nephrectomy danubius gangaram outrider misquotation moyale newmarch cocha poseur cityplace caersws charn distills nonie ishizaka exige expunging udny pauvres bickmore kpu rosenfels rumbek houla aetate fredricks monos cyn balkin weisgerber blindsight pierpaolo alling catherall stargates haavisto tuason tiss toothpastes baabda panizzi legnani kingham semillon amphoras rituparno wreford adelanto chopstick harrisonville foredeck housden cosin gurmeet telegraphers sepah desaturated heslington fye loquacious solara intouch shims mozdok tragödie portobelo burglarized shahadat budgerigars festoons netty sabita mitm chuter holsey empiric khary prepositioning shabran priore hilltowns aspyr prahalad xrd reville samb labled oversensitive berdyansk associati kawal cyma kozinski loera secretase monroes dietze yafo kaisi nochi sardinha ethinyl cribbing marcinko sackhoff larochelle reconquering subliminally oldtimers eshelman earland egocentrism moscatel sondergaard lieth rish freemont speechwriters comfortdelgro bertozzi yukihiko octoroon hypocalcemia festen srn tupaia dahlke weitere pfos foggiest premie davorin sigurdur schunk smolinski alysia abiel berlaymont antiemetic mccoo pentheus heitmann waud anastasiou horndean kozyrev gorecki blasingame cifra subhi affeldt foll cockshutt glaus brandauer mdiv hematomas jovis dres njoku superdog cyclodextrin retype kowroski hedd antúnez stupider warland rudaki aslanov lopilato dunnes ceasar nandalal pinetown madin scarrow saegusa mcgonigal tefft implementable verged decorous ponselle milewski kwo zielonka güiza lmo wolesi vogon sonus genovesi petherick deconcini corktown vibram sanai whitegate enzersdorf nonlethal provisos gner lauterbur mowlam articals ystradgynlais ambergate sdxc dashain chunga orduña berrabah taylan proteges laich diaoyu lamanna atomically tames venir bouna ryer edilberto hlf lakelands quino kolingba ostracon disavowal nanpa maberly tassajara 
costumers kci five berninger terregles andoh fedoruk trivialized lockton disgustingly santosham dombasle ultraconservative kryuchkov paranaguá flanner fecund sawston merlene documentable loquitur caliendo ashtown fewell nahdlatul letterboxed kinswoman tylo recchia arocha nonwoven cybulski ballykelly klossner erler poblete crenelated foresman ujan vilvens ruether azb unloads majoor dishonoured louganis spectrally jenga macneal wattisham perfringens pulwama parto ukrainy chinchillas garik marenghi dethick juticalpa tlx feeler rads pungency peces habibur repower minsi alvino leccia hassayampa capucine lifar dgb templon memmo ensuite incharge nostell durcan kerryn murfin labradors boggart bioweapons mecom petta kilty aliança storrington ulsterbus bishopville neurologically cacau hyperlipidemia gjokaj babrak wateraid alpher nanayakkara torvald lbk mcgeachie pupillage rosmer braco picoult doaba dragline hizbollah herzogenaurach valognes hondas poing sgurr mickel teacups barkston sweetgrass naadu gbk wagle chermayeff polonez dpms lordy dorthe hershfield aidc moonlights steinsaltz wisk qalqilya sighed coomes ndwandwe nyac zilkha yellowed miera lewknor counterforce mineko atomique hickories acpa goujon edmilson booga pustular pallenberg needwood weybourne soejima résumés goldington mplm spayed uncertainly cytological bastardized slutskaya sherine integrationist etops girlband russow wangpo demeans rothfuss gwasanaeth fashawn longjumeau ledovskikh echeverri instore ntf cdre webshots oenology sentani photorealist plebiscito insieme vlaar singes shafran spurrell apichatpong antidiuretic hirscher pirner eynon foulk calluses encumbrance probated anthonis bange schaerer effacement gothams duopolies vitz airbrushing coscarelli legatee kriz vrelo desisted censorious browbeat smelser janss miche gisenyi cholecystectomy hiskey parijs mekel kustodiev silverline hardeep lobban hribar clapboarded monavie peper cobban eisenberger beledweyne mancroft cockeyed puddin maged moocher 
giugliano windpower jaago ashvin thurloe larcom pmw lekman interahamwe nobi scalpels clothilde tallec mistreats vawter trudge doot sye hepp dufftown gauvin existentially ekpe mangler banken myalgia einojuhani pingyao tukur steinbock cerc oxhey igraine minimizer elapses winegrowers lebec alchemilla sulpicio golzar yashica whodini schrempf hamadeh forcier crocetta cattelan theissen politeia krastev esfandiari steeve gilbride gokey inexpressible blader sparrowhawks dominants empanadas dibernardo melchoir wisecracks isel yawar beaglehole mescal sveen realitatea caverna cynara deglaciation enbw blackland mcdivitt baldin unicornis polycom stenographers buttram monotreme inverlochy lemaster roure balland bandas policymaker emote castiglia committe gaar orzechowski mallender greenspring westlund nangle tounkara waxers upnor ganoderma hyperalgesia appr galvis vaizey borenstein merchandiser zinser durness sansoni glico parducci yeshurun merta transglutaminase coulston stepien beasties meningioma nektarios lalji timmermann mandylor gofer sysml reha luthuli ibru outerspace bellion tarkin fletton preordered deworming dewy propre feyzabad jatta slumbers piñatas ceto jero charcuterie dougray kimathi rossmann mudguards bareboat nolf catrine centrelink schouler greenlees gueule yonemura bilious rankled redway duelists claribel winspear michnik moonman hollo namaz sinnett holms dogville rastas andrian thawra ditte roffe bangerter albam submunitions adamczyk fiv ruthe hiland lowlanders rutting walczak pasang satar slosh hybridise salimi kepco salemi chamlong khunti chenu blandly gummidge rsb wolframalpha ranikhet invalidly oxygène rombach shipston tillmans moglia magliana berigan footrot konstantinidis anabela metrica gomal desnos fien conowingo mondy colonizes montebelluna hitmakers chevreul unadvertised dholes kalon sparklers schrager korem combattants denouncement nauseated kvirikashvili endean acceso tickler celik ljp kulgam hypocenter htwe megyn osg mccarten yaphet possil 
positronium nagasu ataque dongxing greathead rajo glenlivet sctp fascias girardet lbgt influxes linky keyshawn ansty mattathias marginals utrera diodati microphthalmia burping cruelest cbk coaker dehiwala extractable plurinational newco gyeongbokgung abdullaev sast celador carrico keelan ketley feghouli colori liska kingery ripstein xiaoxiang cadaval rackstraw longmoor underglaze kvass brulle politcal cornershop latinist rockfalls weisner vsg yukichi carriles hmhs alzate siop pirc persuasiveness randee bergues matcha demagoguery holts flapped kemmel moortown spendlove cybersex extensors tryed gravettian pinkertons ossuaries weinmann carls frisson khateeb sulton sconces irthlingborough carterville imperially asgari flatters assata pecoraro blindfolds aapa samovar prunty guimaraes collaboratory sapone yerger nanostructure plouay nault herp shinda conversationalist citoyens okrent klenner twisleton ireen gametap bilbray barh jannah albasini thumbed appassionata borle rushforth bende guzzo catchfly festoon cuon tsukishima hovde mureaux yabucoa zagar tienes banik terlingua oxidise whoriskey nagbe kollo aday cochiti homeplace taufa tjrc forthbank sternbach kallies louka ilva strongheart danijela scrapie kanehara dreidel macalpin pentastar peker ichabods pavon wadsley daikin brüggemann ratchets unsparing aramac haarde oce misprision calcitriol megaron draughtsmanship spitak tftp bassac stadtschloss zetter iben assumpta fld gga papyrology eustachy redundent salpeter dadeland cheder leid wilmut seagraves osteria brunssum glorietta bace bemrose houseboy bruhl gastronomique praful judgepedia buzan rochereau lempicka manoeuver artiodactyls tirzah noujaim ninotchka vaile carmena tinkham landru cattani deutschlandfunk reekie sonnino fingerhut hookes anyanwu dahal bergs kailahun qvale vhd arthington ideologists itochu ravished benirschke rovin qsr gosta ponomariov walesa villazón ledi protectiveness lucozade defrosting weeny radway ruediger ramchand fazel kalaf paletta elysa 
ledra cayne marketa climat growled harward fresnillo precipitations qabbani maraba exurban thiazide nachtmusik brana voisey burrowed biassed teetotaller cléber faggin loxahatchee mammen mcgaughey cegb mooting underclassman oleds rbg mazzoli unfurl okhla jpr korin chidiac mindbenders mascheroni maceachern rangana bocchi privata kikuko rehmann inglenook bongiovi waterlily runet causton shuk polybutadiene whirls kondylis binjai dalu buscaglia oratories epler expressively victuals waples farkle zenji vcm bentine agronomique anzor standon somerby triazole mersereau retarder salote cajoled oceanport gamil witsel bigpond potently alternans vocalisation aibo tetracyclines snettisham barcia stoppa punj sindelar digitalised bontempi misgav paaske shapwick belturbet crotchet chulanont cedrik ¨ leafing yezidi ahlgren mousy businesswire rhianna trussardi corticosterone clime audace askins karelin margareth kaikai timewasting camouflaging torrejon devendorf arpita cameleon yeasayer harrisson richmal dondero toyooka headend oppong ibr kewpie berkson focaccia amadori tues bkk flossmoor kombo racino neas margules southville understating groped fourvière sprit kupwara doleman flagman scahill wtnh hellweg lymphoproliferative boublil rovshan rossier dassler pzpn scuzz marikana sath vaknin farmall grainville teddie washtub solennelle manggarai yazaki szyk uniques awale assortative hscs asghari nadarajah incurables buckey dewes lechwe starkman refosco weida yunan klier slags alkatiri frocks oliynyk misjudgement jaman petropolis messaggero alatorre dunmurry gerace obliterans sackcloth spradlin erlendur roszak glares wolfensohn skirving oddone nolans stollen regularised pomigliano zaven kramden wights congolaise bartoszewski pellington desalinated brockhampton khairi dobrynin dredgers aicardi spanners gwc trackdown arlit petrolina paita khalifeh pachachi charalambous battell haole baddie shirane thereunder mandora ticklish enfeebled steinweg ntds staroffice leow immolations velopark 
fumosa noncommittal sciarrino maíz hibakusha rebeka extruding stirrers fetlock istiklal ewenny circumcisions whome yeas hornstein amerongen unilateralism countrywomen wassef kanth stra aoraki lifecycles trickey tamago sameh kellet yibo attractant gitega kemptown pree toom prejudicing worldvision futch folmar mosser harahan bardonecchia chrysanthos giganotosaurus ugetsu calcineurin casseroles catchiness selick boniek asilomar eynsford lho varlam zeisler wolfswinkel karpal lauderhill nepad gauger shailaja garnham chandipur harborside englebert wojtowicz lanson raylene denio vldl sonique keko wmar thaicom busi kovar metacognitive teeters goldreich rinella wahed sensorial fankhauser sheens hernanes lochnagar oculist turnblad shutterstock armande madrassah demarcates magal acellular niedermeyer compendiums canda kint hukam wenning greasers březina mahathera bibendum wittmer igda butchie rimu vartanian bigs incompetency leeser saxmundham bomarzo schulhoff marschner furled entrées shadia scabiosa elburz rhabdomyosarcoma durer imagem hypochondria mushfiqur studdert foula heilbrunn jmg soslan wooller jayantha höcker gendre anhanguera fle balde silkstone qtr drue disowning ayt croissants kariuki mekhi lyngdoh mousey surridge kopecks oleum castanet durnan tejedor aripiprazole frideswide vattimo kelleys paperweight durwood toshikazu mahdavikia kirkgate gasca jlb hickstead aggresive mccance pracha philles fgd nclc globals proffesional martucci razorfish jokester abdelmalek ostium martinican huggel outrunning perote habte fatness joga namecheck kadeem pesquera errett upl haselden kooten youku depredation chanin swangard cric thorncliffe jameh tartare casita ebina skelmorlie mcgarr ecmo granicus mucked socrate techrepublic zidovudine pleasley manawa gummow pitfour birlik benincasa rexach bertoli adeliza paycock repro aradan malhi shiren glanton buysse bashley gamliel arja iue irex copake atget foris carlstrom zerkalo borgetti nechells sevin scrimgeour biocides sinkiang 
diplomatist nanofibers carbineers eirini horsemeat junebug oad motorization ilwu franchini teac sobey zhigang davidovsky serenely kaminey cicilline osteopath kashdan unscholarly miyakojima yanda wassoulou pedigreed subscale avoriaz anwer hinode surco corroding franzini aéropostale welts cholon raun rawl zopiclone injera uremia khilafah electroencephalogram yulon contentless aluminized foxfield djebbour outokumpu butai før lowney linate vudu timbrell dery typographically munfordville herbstreit ichat kazbek appendiculata castelluccio dinur anki domu ratte pennsbury littbarski comiso clowne telemarketer tiferet procurements goelz girded snowcat alphabeat sulby macgrath caunter handbill sunline andreeva pilfering reynald chenard warg differnet gunbarrel bowlin biswa hailee iscb sandrock fincen abscam psia jairzinho vff jist demetre ketsana wobegon ketosis fustian ishara maiga shindler daine pomponio kurochkin collonges bruyere cilt eiu wilsonian nescafé firdous bhuiyan colbys ehn haskel demoniac zuccarelli pearlstein mauls abita primitively yasuoka newsfeed gréco rachlin poids budda kataria eht distention blustering mandarino nfca ssrc soame seales centenaire trattoria netheravon pontificating endroit gaubert malapropism glassfish gillom macrina northpoint tirkey véra nicam matricide soberly picosecond trentini meridien uncodified koronis aktiv bizerta solovyev ligaya carcosa shoplifters simoes bearish stephin avra solsona salicornia kildee lushly dosimeters serse luristan chiat mihashi shiyi vexation koffman factly quynh ulbrich ollantaytambo philippou voth jinlong aforethought tamping balsdon airmanship kamper duavata gregersen ugl inova trkb heckert diuresis perronet padron hevel ssat ufton bootylicious badara behoove armwrestling marmaray errorless bugbee wallhead findus flatboat bapuji atomium tuimavave mccleery solemnities mahfuz amax achour monashee mechnikov canonbury lucic astutely sumptuously nassir prepon mawla wilmarth homogeneously leckhampton komati 
wanyama hallgren antirrhinum michaelides sportpaleis moviemaker bockscar piddle humourless otari bandoneón duddingston ihnen hollers amezcua zaoui vaguest reshuffles whitemore panamera dki egner mechlin lauryl revering ifilm elmaleh clasica bibbo jop freshener acetabular lents voici manchego kuniko braggadocio castlerea bechtolsheim tatsuhiko arktika maharaji nevena rajagopalan bylsma geostrategic ornery colab truculent keyloggers bingyu dobrescu rubinek wikki playfirst silman goudhurst kurtwood jifu mousseau tiede bulganin crummey socioeconomically redwine inrockuptibles stammheim belene pbj schmo rampurhat tornquist czechoslovaks ghirardelli spondon monacelli eiichiro numantia bruhns pattabhi haycraft aleatoric spangles smarting zhelev fuzion akosombo draheim beachley castricum srei lovelier peerzada manyika middlewood encroaches petrochina miyatake minsheng saliers yeesh schachner steerforth bebbington pastoring acquits eyelets ruchi macassar pierluisi mccowen atsu microfiber scrollable golmud folliculitis taroudant motter hond torra rosia treg overlanders nayer hallard torkel ildar prateek serialize nalbandyan standells mckeith abrogating pawlett sumar mram milbury munificence petegem enzi viewport wriggling widescale paster pursell dottore fobs derrickson meraj cultivations baxters fischl tablespoons spaciousness geeson shapero pousada pennywhistle mccanns kossmann ansen hydrometer neeru hurworth wno cknw gerety photogrammetric greenie wrinkly berretti prostrations vincy waldwick flans mamy pinkel sunglass barno bleda choleric mpofu spoerri eidlitz puffleg workforces iberostar poliziano bjarte achen kirpan machpelah dorrie lockey interweb bidzina seena neuenschwander revi lacetti idil pagosa preconception shahrokh dollop budgett wackenhut stealin ploeg prefiguring fanie tomjanovich cherniss galit thunnus cancelation blantant hady systemwide jakks zhenhua harriton aset bystrov matterson bromont kochs capstick razaleigh schranz kuzmina gioeli minigolf grasshoff 
disaffiliate emh sonneborn trichophyton rijo sevgi amitri etcc monigo clayden memphremagog garri deora borana mustachioed learie doura manholes elcock amalfitano envoi eppie shoreview jpc nalepa allestree impro lilywhites mourant curs troglodyte adamas konstanze gasman trickles francescoli intercessory vredenburg langrishe explicitness lurcher jinzhong ohle tropa areias renk ftz mcgugan apley vlaeminck forslund ohkawa gustavia hironaka jocularly extravagances whipps cheesesteak npdes tarvin plourde sagnier poulsbo casma nevilles lawang rabel vistavision vaden thie otha kandersteg unconfined reas melanism francigena dissimulation kpis transco bevilaqua canaday anwarul dadlani mixco kryukov orbetello oophorectomy poltergeists enlight swinfen muirkirk swathed galtieri medlin pft dilara jarawa beic furloughs rossie gundi odwalla vieru acanthamoeba orsolya loosest sarine rorquals crellin westens piège foucauld stamey antiparasitic dîner vmt asters buffoonery bronislava alexandersson solario covetous admiringly gernreich aldonza pinniped angélil giddins treehugger malon salvatores combattimento mcloone prpic cudicini ambrosi dabbing pappalardo petroni liù bihan willaston happer peronne carlinville bers rootham overbite novaes ceqa korbut vandenbroucke satins disi radiobiology takaka zhongyu seaga reshef sarkies thakker mandra coverdell raavan rolles benzer kipsang bhupendra mptp brookner christenings itto directionless barbadians kibby conservancies percenter nagqu pantnagar durables eustaquio rayners manikin shw salvific tme marwijk stewartry kellyanne rupununi bernera amyas gantner helander wolfgramm casu theri endearingly pennoyer schlee jigga nonu dæmon marchione wyggeston castlewood keesing brisley weinfeld casitas perforator warmonger showdowns flippo polyrhythm midrand sheperd grainne nattrass kittlitz obis quondam wilkinsons hallas zorica novikova tarter flandreau potsie kelk kodar dawan beghe exene incentivized satinder toppenish foodways beatin trendle carice 
gamekeepers vitalij husin minorsky dejection kpbs lezak ratnakar sahabi ilda mullarkey wombs kosgei ocklawaha breezed schismatics gateposts arcobaleno almendra rippingtons nordenstam zions doppio florette jdp rhynie neverwhere cullerton retraces hanny tournant afis dandin camomile gammell carencro triclosan enslaves coel vlast spamhaus schwartzel congealed aquired kirchherr barrs matheran kurmanbek fifeshire madingley berquist carino meszaros brouillard overestimates microsdhc sohi matcher cazeneuve lifer samual dongs vitello olczyk shahir wingnuts bludgeons zachery unrelentingly prenatally asscher speedometers mongia pagode pranked mgn tremeloes latt asilah hias shimoni najee skywarrior kwch trata battened thaxted sejdiu delko vietminh feifei aamar troilo matalon persil kawanabe dapo ckx mishal cccs chamberland undecidability colescott shilla changshan oxidiser kitchell opper qurbani beeck palaver ligety papayas nerz stewardson reeser ilin fijo malverne bouldin aaahh iwu weisbrod redlynch volkspartei kalima freelander alveston duloxetine anguiano vrdoljak segismundo insurgence schwendinger livistona manicurist kidde asociados edkins lajeunesse wbir suvari keiki dflp batard khadem hallstrom krishnapur kahlon tupamaros underrepresentation sopher freelon illeana menocal thewrap ticehurst cathkin fregosi carloway kadee fagerbakke amberjack cgu adriaanse ramstad annable bogoliubov sportbike wakkanai wizo mahanama aidy ermington wennberg kafirs linslade carrozza freighted pussyfoot floridas upscaled chytrid tyrel stutters feldheim visoki nektar bulava mombassa choson anticlines vertonghen liebeslieder leocadia campan ugu yawing blackwelder peston misspent fring ansys charlbury steingrímur roaf ghandour mistimed platformed crimewave rodenstock ayoob piene roodt sawiris hammerson odn ladles stenning maska neustift vanport raef etx carnock txn soare fargeau bratu trepassey hypomania wjfk winford freni lithos fronta freston insets shannons ciolek gpd chiddy orofacial 
torosaurus eag anthracnose greenshirts pirmin chene gwillim dispossess lavik mouette richeson zaugg acces draftsmanship kajagoogoo colwick kleopatra tvweek rgr nulty huckerby apol loui bifurcating superordinate multivolume jumu fumed warhorses lcv geopolymer polunin rodenberg purtell glsen pakman infinitude lactis yeamans belser landownership homewrecker lme westdeutsche undistorted pyrethrum consuegra baburam padovan messerli whatman maughold pawlik subscales rubinoff nyrup sissinghurst soloveichik bowerbank tlm photokina faan kaap nese chinar tkr pollens uncw khordad haleh jalapenos larivière manageress sittang ichthys abramowicz stamatopoulos pasdaran völklingen braamfontein brierfield knop pastured parasitologists tangye mireia milow philomath malkiel misoprostol waterbeach heckerling fossen dehp vsr shirey outpolled pounamu overconsumption riegler statkraft llanymynech pllc noles akerlof zenn prairieville sanchita grumbles collingridge hindbrain chutneys seyi gpn stanage molder easdale jancker impassive ferrick sidedness adms paymasters janz cryosphere feza webbie mcgeough petrotrin swg kleinhans stutzman rathcoole taue bohrer domaines whiskeys sachie wylye koes ansted xiomara chenzhou furong restall eriskay papain boul acurate braf hautala rabiah harridge hillbrow laffy mcanulty meldahl kiehl rubery caunes agression dimpled keti morenci constrictive valorization shekinah eeny trinucleotide congar navnirman occassion gapping sonko kilvert canzonetta kwacha mbombela camelon aider durão gestel levack thinkprogress inocentes shopko usgp holzinger bettega frickin pregabalin javadov burbs rotenone altendorf mitsouko lanyards modan sotos masp kooy famished moluccensis rosand postino bleeped torro khedekar unmanageably cavallino klusener lyd junaluska dusek rogosin carmakers chng rottweilers suckled bioweapon hoodwink urijah chisora uib wolfer boyarin sentech cryptanalysts bludger tagua gesink gambrell ohsas klooster conneely mittweida mediapro dawber buntingford 
tabou tatura rhinestones micaceous fleckenstein essor hunwick saidy farney moisei artscape otey prehistorian magleby zusi crepsley dnes worsthorne rockband actuelle prefigures jungr nonpolitical titer braak bigalow erhart managable palach nampo negocios jianping unmapped sheikhdom cruttenden streetlife hosp sudre sobran redpoint homestand nonbinding quadriplegia firat ponor bokar romeoville mellan calix fetlar odorants berlins keralites orol crocket dories tongling galibier fmqb basauri cutlets lanc slawomir oyun geliebte mahnke aljubarrota stitcher retallack azinger cacace nalder honourary masirah sangkum dumais putto persad hanby blairs blickling haymond bobrowski serey freesia ferodo bourses scoles trubshaw cristopher sketty keralite waukon hekmat ciment pepoli jentsch pictoral commendably breathlessness scogin littauer theuns adamle damaraland turbidites emmel nilan neurol nre grimthorpe firebombs getulio durbridge dego thuong basicaly tamiko ballinamallard anabelle imput kostadin butty dinter fener foregut finocchiaro ribicoff murphysboro amritanandamayi coincidently benfleet adhikar backgrounder taining sebo nikica blankly globalize hillcoat stockbroking homophobes zna chalks clarey imputing abberton veazey goatfish propithecus insultingly acms campomanes piatkus derrell despaigne ukaea bargy lewi wankers adamovich preprogrammed manabat rbn pavin lings flummoxed jarocho bikur fuglesang erwinia youself syu chanoine wolken huayi acadien bilingually civitanova ilyumzhinov hikikomori belda crispness supercarrier wauconda lytchett tzedakah vijender humbler albertino illiniwek lomita exaggeratedly zillo jowzjan lache wunderland geelani majkowski transair anaesthesiology ignorable dhss schuessler unambitious ikar tooby tineretului beltways carmichaels leura paradine quercifolia clarithromycin saddlebags hvad oisans pearn nakadai rafales eccentrically wehrle mislabeling arkie inconspicuously rakia majeski tummel monotheist medardo dfat odenton youve ditmas forstner 
kition gamefish sundaes bandolier lettings dutroux dextrous sace bissinger britos netcong inappropiate doillon blotto mellat moghuls gadot dingolfing natapei abstinent holtzbrinck draggin rozenberg homefield purex metacarpals summerhayes hyaluronan americain philippson nayla coplan tvd eastertide montée cambs liyanage eastwind karstadt lafite spectroscope cumulated misclassified klippel kotv whinchat wallstreet oney myotonia headly mudder matviyenko atay frankenhausen pointblank debasing mimura taurino inversus zadek sigsbee guoqiang snowiest landslips ceftriaxone okoli acquitting preceeded multiplane undisputedly contibutions sitz tracon frankley madhusudhan folkerts manhoef plusnet xetv regner receptionists lewenivanua chamaeleo cocom diamorphine blakeway easterby kinburn espousal compressus casula sext marocain jumpman mischaracterizations parisyan schedler echegaray honker impulsion baudo immutability soudani komura tussocks boodle pprune shaddad coppiced rokita dully berget harroun penfolds nishani weighbridge mancino hedworth bateleur stoessel choosy lathers isoflavone miscues shige guirado buffoons manasse nisl chuzhou ilma traina etd dzbb randomize tpms bullivant fotsis sportfishing econo sterkfontein moncks groomer knie bifrost citaro flightline acgme gillings dyads disentangled ehp mendips enyart lamoni redistributive elfyn pioglitazone hagner ministrations preteens veilleux roadmaps bioprocess tamal plantlife aravis shimoyama urbanowicz frederikke villejuif viscardi cij bernières kokh lorente bourgeoise druten tuifly kdb accreditor armoire lupul gabrielson quarterman witchy thornwood karolyi celandine campany lehar stenz keizersgracht tomohiko dantewada johnst komiya chimei koganei nazli dockland superdelegate eyp entrepot slumming lount kosuth urushiol nurettin kilim fronsac dizdar leers taï zumaya parkchester clewiston changshu challenor grognard mallows drucilla clamorous laminating guadagno wolong jlt gardi rmk europcar zaharoff terawatt 
washingtonians schwegler janno beqiri faouzi garfagnana snagglepuss høeg gameplan sawrey nonconformism nookie atak bilino overarm snapp passarelli eumetsat hayyan ranbaxy intracerebral advocare frt smv raymi butley sieff ferres arthas fastow barnier wenninger pratik cherven lobar pitsea trasimeno chumakov xueqin willans protic amitav carsen tyrannicide avista krishnamacharya franzoni gumienny untraditional shotput accesible jeram uniroyal heidsieck chona passito yasuhito vincenza cotman bliley vassiliki numer dockweiler esas mutambara intercessor egleston wools katsunori dalbeattie gangrenous fairyhouse alrosa steinkopf corralled mzee ogwumike nasturtium kls bml makri zaillian psyched nawar stepin wabe cads trichotillomania bilyaletdinov lincolnwood deruyter typewriting naruhito wolmar kassis bradesco invernizzi podrinje roloson ranmore chypre temi genya silverlink shambolic bastow ozren vouches appo nvi sligh satyanand losch lutwidge penick gelert slavo metalheads kozintsev yandell balka delingpole wtxf cowhand champagnes galip spottswood reverberate coalmines sappington scrutinising cagey contemporânea northeasternmost limahl holyland befalls traficant crveni unrewarding tavy ottoline compostable doumergue korgis gtmo monstre plectranthus mourvèdre tharwa gebhart botz zaritsky bartholemew sassa arons weblogic sachets ormrod reverberating siyum distractor savidan roughton clamored aaah kravets brigman ambers istaf keeshan linichuk tommorow lankenau sekimoto padilha mncs reseach wojnarowicz thicknesse carone palmquist zellerbach yanjun fme epia cyo slussen curtice pressurisation leysin mandeb marzouk convo ebolavirus nimmons midyear haemolytic phoblacht bastable kipketer minni gushes manoharan lonzo songcraft birchenough parran fritjof colosio dieck annotator trendsetters grubbing diffa abramovitz fki bojinov ammer wakeful marles trompette refurnished sirenians delpech auria nevzat carland jaufre manchild basov mckey funland seleznev hivemind moshing eiki fröbe 
crorepati salik whirly tascam bleomycin bauble baying finstad acrophobia mandana yesha hieronymous zoraida supervolcano razzies delran jover perfluorinated cabri roediger cipd reciters cappiello awwa intussusception pribram cerveris fitchett egginton tandragee ricken markelov farabee faerber xisco gesine immel efremov petiot pól mabley vornado baystate handbells tingler serapio rande ispa olza visiongain bowering massow antiepileptic remonstrated pmos phentermine gracchi hulley appam waay stippling accs hcmc shomrim groenewegen developpement cgg dacapo krusen twt mikhailova postbox ibstock muntean laybourne mrtt schwein qinglong coagulant besame recodo waban apar abbemuseum nyota yoriko xplore reep dattilo banyuls nikiforos spermaceti cravo boffa unsheathed rkm freind goddamned kering makgadikgadi chertok sisti guite sandblast capris alimi abshire reedham colicchio chkheidze lyth locascio keansburg wynwood burditt balletmaster huckster muntu confit bukittinggi sportal emancipator ninnis mcelligott cbrne oyak idrive pazo chaand jowhar mameli crouches akra enzensberger ghyll bermel musiri ixe ruhleben vilanch gregynog derian beshara reveley jordie nanoelectronics thorman bossert pinback hiruma sucky impost blunting bathysphere creased standridge hoopers nerine allason zinni greenshields jihadis checkouts mirante baekeland netsky schongauer phuntsok clamouring potapov arkanoid harangued americom horween kakha neall inoculations meece denisof woodchips cyriac hundal dharmesh foretaste cookout chanh mccrone bundesländer dromgoole hunnam dormael sonika kayden viernheim madhok warbrick regrown novis oslin supercollider rudrapur mayson newswires plasmons snakeroot parisa oswaldtwistle calve comt costelloe leimert dahmen abim cowsills ashima neang ligertwood standart kubicki monpa mekki ousley tulia ellicottville kilpin kirovski koosman recapitulates ewin theatregoers valentiner fima yeagley bavand harrel repaints ashari expresscard practicability octagons syndicating 
mafraq spired hanak marani dathan neitzel camulodunum schoof inducts simitis resited gaudier dumbadze forbear khasavyurt heasley lutherville ifab frenette demby lubich ntk turksat folley weekenders disses rupnik perov extradimensional fayoum kindai skogen polymorphous ayuda wingecarribee locatable abramsky besigye methley jockstrap assefa blazek metrological enplanement maisky oser wurzels hengoed nocenti wyles misuari mcinnerny fenske caiazzo cockiness madiha spycatcher poniatowska prognostication yorkdale kavadi emulsified cokes schulenberg recoba söderlund fortinbras roadable sreenath ebihara haunter landfalling strahovski burdell artemyev gutknecht perata audiotapes fleshlight briest reyner brigden lorge botija qol oktar spoliation gjertsen tronc mouthfeel ciega magnifications kratts toeing denr lmn allagash dogtooth disfigure cellmates eissa diljit ombo dedeaux prarthana danker midianites nutjob burba pelinka glk singlehanded waddingham derisory deodorants innit armthorpe gautrain reformasi rossetto crennel aurland blowed dettmer speights buicks tranh wittrock frappier vitzthum altizer kmel joblo magat rozema mottes tigana villalta mbari thumped toshinori helling baviera televison drosselmeyer pforzheimer tavel pragyan kudrin expeditor shuten topline brijesh yarwood unimaginably airtours xolotl kerik amali jannik poolbeg uplifts obong unmoving rnav jde wtsp andorian lathbury armonico sated gaggenau roumanian ehmke bigeard scrovegni heterodontosaurus saria glencorse wideawake tolpuddle flesher binges stickel artech stefanova mickleham baulk covets capriccioso connectome malivai jaros metoyer wulsin succes unreceptive schonfeld kapela oceanview equifax chalking pyg damnable azaad cpx longan gravette garzelli deddington cddb takahama keurig blakesley stilgoe scrawl whillans holdsclaw consigning orit homel angmering goulds delauro lugubrious unfenced hurwicz pipal condescendingly bedd extinguishment ekblad tipster sterjovski tadjoura yennenga cortney karppinen 
chimu korla vyrnwy rumley pimped ornamentations protasov electrica jennys shafie downscaled melsheimer costacurta teignbridge intech gbowee lawnmowers millones breeam elián sentir farsighted desanto partied northcentral saucerful restalrig maisonettes ethnographies xuereb centris lacie azpeitia nvh ggp gornik tapout dahlmann leatrice bovines siria hartel mmol gilsig haviv ations pramana uihlein decongestant bobst quinolones hillgrove reeth fiammetta passato scats tabuchi riesman keates haiden racisme escom dragos unbalancing niemiec munns teshima harboe myla boeung definiton recouping coatis herrell redlasso tobaccos clelland bottisham mamat nicholai torralba solomona propitiation audiogram adkisson rochers glees eile garthwaite martinair thygesen experiencia teleri bisschop romila theobalds modernly snv rakish bevacizumab kuchin agot vinoo aaha hunny factbox oll wenda sesar fahnestock abagnale hodgen towline nonchalance marashi silom rbw opet glg kondrashin younan kroft mbaqanga linage wetherbee tribally navaratnam edwardson craine koolau ratatat jrf superyacht aasa demichelis vinalopó colver kernot kornman okechukwu nontechnical pindling sesc elephantiasis longbowmen brzeski omma kump adrenocorticotropic pedantically westham contactors metronomy wheaten sesterces comtois weakish forbears transferral warkentin subcontinental hideyo wfsb machrihanish geras shinwa sindoor cojo kpho levit keebler lasserre lanni rioter alary musco chiffons sakhr otlet gatch mpho komplex haraam sylphides schaeffler adea mitzvahs transcode uponor gunaratna hanashi plumper caban convalesce cochère rupali hawton senese sosnovsky penydarren zlatni ubach barabas pobjoy drinkard ansip marven stifel baroud eion boatlift skoal placa inescapably bheag sabalan wakering yuqing manero bami prefuse newz mckercher tuscania romped asier obituarist doorjamb kirst rehabbing eyman pemphigoid ovulate frontcourt borwein lemerre ppps tollhouse ancón cabaña sharipova kamae jamesburg sledgehammers monoblock 
bluest akyab itera mandiyú penteado bradner qadr banaba sachlichkeit reorganizes fellah mcdine stouts ninawa prizemoney marschallin trainmen charef cordyceps newick aahs lodgement zewail raisons jhumpa autio manganiello kwangju devorah thinners wpro frizzy photodetector ecstasies biogen rabbitbrush muhyiddin papastathopoulos rosko distin stemless pulsatilla koston shanked airbourne tranby pastores transversus tarcher zacher aena chessmaster eleftheriou amade vivitar cabourg ehlert salemme finkelman zaranj winblad bagnolet kickhams felicitate gebbie densification scrivens powick unquoted unmoderated hairlike shuba diena racicot jonjo eatonton dcis owasso pozzato tabe vietti duni saling ninan zehi starbreeze wallula sheck brogdon fioretti sheckler piontek mcgarrity zetterling leevi guertin nfi ykk milberg poisk miren geluk tureen seita stretchable smcs fuzzball terraplane dfh maketh simulink vasoactive levitas patz eastwell bronagh mcgurn finalisation garnaut nemorino restocked brigata elkind arnal pozos machair maheu ratzel wfor acquiesces cívica stuermer thameside paata alioune yel csps giveth wildbad protools zuberi lisch naveh legitimizes nuckolls generosa aastha kerlin silkair foxbat yanagida shoba schofields mcmeekin tefl cardarelli kingsman routier hubler sonoko battrick ludin mezz rappoport balerno wainman asja boan auma fitzgibbons namjoo scatterbrained nainggolan retzlaff shuvo yafan apapa chauve vernes reversi blinkered omanis describer spooling ledin dépôts mprs matey walkability mercs capek lugia briny bunetta picante borini pleasantness whickham professore sardana buzzin grunter zeinab lohas reheating aalam zagging metasedimentary lyden yunjin supercapacitor feedlot montone shahida volontè ledebur edgerrin ivans elvington epaper demartini wynona ewig unflagging lagunitas palmero pelsall answerer dairyland paterfamilias elgort whitall propertied dinakaran tiya gullwing mohonk uncivilised faas broin winokur ozai coffelt proficiently unalienable braziers 
bashers bloodworm mistrustful raizo sanzio sabur lockean bitam rottman wassmann husaini heavyset centrino atalay naxalbari verdade skakel oberstar arakelyan riseborough exultation breillat hulce snog hutchesons aglow stanes keckley sbir masterstroke donnel penk taiho catting livanos immersions genov tomme gymnasien abizaid tulalip sobriquets gallstone swoosh antoun spagnuolo findel emeriti bierko flunk ileus impossibilities mcmurtrie wiretapped winiarski horlogerie chataway lesvos tubed aspens reappoint evt merisi bantering vtech arregui monell unimportance erixon myopathies shahla athans polhemus pollokshields malakai tammam coagulated krehbiel escap emomali minoxidil eww persa disorganisation friedreich isaaq ederle koldo hermel goodfriend bwyd emancipating collon cholecystokinin sandbagging rices dynamique duffryn idon leonides becouse wny xiaoqing manza zucca sakiko dammann danzi castellini dhal ravitz pietre mojito sayyida killorglin underplayed ooga youporn jeromy dongting ispat jull igbos iittala dillards gallega guler mcqueens wcbo insomuch flaxseed wibf meanwood arūnas agglomerate absurdism goodwrench calaf asds trichomonas olhão mudassar demurrer visitante windsors mushin inapt caminiti kaysville mathe pasqualino kiowas chavira prawo karapiro canzoneri phosphorous kemin hellzapoppin clarett cadden cullercoats peggie markopoulos engagingly chicanes skytrax loewenberg shacklock lampshade becontree jimy indpendent whereon vizcaino dramani figments contrariwise validators skylit azoff tularemia lamorna wust antonova hewa coagulate brantwood bellmon casciano homebuilder françaix brauns gartland radtke mtech triiodothyronine albiol haloes haicheng myrin affric niecy carneal nilda noffke kanza luganville rattail colautti siemon jehlum peals nyce hogi welke hammerless hartridge mollified butanediol laterna thoas pastebin intercessions hasanuddin sealings filmstrip caudwell guenevere hostettler cavanna kerron outshot drewery smid ranker eelco playwrighting thian 
tepa naturalize boffo wansford steadier cristie snacking giacomini dulaimi nccpg reininger delfonics diament doust shipbreaking mariama scheiber gaydamak palmo cfrp baginda granberg epoxies acmc fairhead gribben gittleman compartmented overreliance airt alworth kopple crossdressing lemann schoeck ksdk transglobal larkfield carrig femurs farshad kabinett babbacombe manics sasai emulations subspecialties robuchon prizzi nordhagen xenomorph sparseness lastuvka derny pictorialism orsett salaspils chubais alane rainfed algarrobo gutu hornacek gangi kondapalli otehr speedman saor mountfield catalino fatai bidu böcker heidemarie ugi fuckers halyburton milquetoast alpilles galeton eastburn pukka immigrations cloacae geminata upyd steinunn coffield paschi thoughtlessly verina sofaer poststructuralist defund sproston slithering vapi samudrala crassostrea snakebites sgarbi twits demonte klinge antagonising gabri barangaroo butor mochan scor rosenburg dolcetto brinck cyberchase indesit futcher kirkegaard matras sparsh jagielka drumchapel cordiality precipices zanelli wibw gateman sannat menkauhor ulsterman dysphonia zitting oura markes muchos hereunder gwendraeth palaung clawless barentsburg deluding sekhon voller diaconu neuroanatomical callimachi goneril watkiss pascals opps wilkos doonican procedurals panafrican llanddewi austintown kittrell hooksett ameristar buxted barakah vlahos storys kyrillos bwindi turbat wingrave terex paraquat chickenfoot jabbarov enchong oyler machaca folkston hengshan guanggu curies reforested serralves qingshui zissou katsutoshi monogenic mbala alstott abanindranath louver leininger copel nocton rhizobia fabrikant stephie paleoclimatology mobilicity rigamonti patxi marandi adjudicates baize linscott gmanews nabhan walda noteboom feret quilla penzler buesa bittman foggo gestured jrd ebsworth liautaud skaf taganka philippos carhampton holliger kfs bholu champenoise sanderstead natomas transcendentalists noughties bazookas waidhofen streng darktown 
piquette memorex breer hastens fruitfully nasb sahakyan bonaly wavin overspeed shredders wachenheim harpal pedagogies gendebien lispector jurowski bessler frappe eastick accola berrian afrodisiac argens hyperstimulation vato roubini michalik maniche eyelet dysthymia glams sanon mfsb misson mnn saleability clw mardas dayspring overhung falsettos chapmans perone southdale alev portsoy yanes teissier heidrich monda giai finkbeiner techland quader parenteau drell dhok mbembe tsujimoto cooties nadiadwala poher dunnellon scherz bado sunrises mullein overcomplicated harnish bosma telecomunicaciones yardbird freestyling refillable microhabitats georgetowns jableh tasneem lintner schirripa mythologized mudi frangipane barye barner twic weera pressel naggar bonte dallaglio nighthorse wex adits nycta marondera tbms lemley dto koneru dysautonomia belchite agganis tomczyk möhne hoarder vcg studland featherbed selten elide poythress pretentiousness fraker headstocks sedating quinolone technotronic keyring fiddes intelectual etic sztuk carlesimo uninflected killip dtes wassmer venerates permethrin infrasonic vitiated terao greenscreen jodhaa jnc levelland luns nardone patou neds ronquillo bouwer kiyokawa arrestee badinter staphylococci staniford quoits kirstine freedonia chigasaki moyses bule melanocortin alibert moskalenko enas freixo valadez luxemburger hiltz glenrowan hackathons gofman servin liberte ingber midterms pitou retaliations sumption tumblin rainmakers dependently stoermer phyllostachys dissuades bouet philyaw résidence oopsy kidzania siegert marigot manosque antacids stettinius claverton mingas splendors ivanauskas godt abdulhadi celek toz glavin timchenko lyrica chefchaouen theorise calar gulou corroborative misstating coogler unshared rahba uhp debary aprs chenille fasth treue fagel candreva ledisi grownup stereotypic konso windstream liepaja qic gimcrack rebozo theunissen recurrently etymologist buraimi drr wychavon mjg itokawa motzfeldt sunshade rockliffe buan 
sice unscrewed berhane sonett kololo giardina vamped godchild tartak ulmann thuraya mellophone angelia corá sundell shipbreakers unprecedentedly cromie cobbling dulci macatee clé rosecroft crofter kestenbaum megacities breakeven pyin lerista kopek dreading velaro assez cheju bouza khuzdar digesters nctm garvagh lutin therm irreproachable ratho tefé tega hurly mascolo volman killyleagh dinorah eimear lumenick nazem arshi dahr disdains anthropocentrism commissione fukaya lingenfelter paperboys dittman mehlis beeding mackerels zebrowski tinyurl widdop jorquera axi kaffe takahiko gabar demaret ajose ashcraft barinholtz cartwheels avf odegard jottings kizu berbick patted satiation samhsa soes displease perast bmm nordqvist panga ravilious horsed threepence ldpe gaydos krysta atalaia loupe burti twerton dabei canin kever dunfield orphism lisas ladra chole epting pitfield maoi stefanik weee wiehl ateneum defries flydubai moak serbsky evenk mallikarjun parlin meriem dimitriadis decontaminate woolas lussi toucher alborada smallholding panem wydler sypniewski cossington huangshi guiting varone blystone andrias haniya rinjani ostende ruthwell maurois gautami debatably changhe murrey yeghishe ight hooter unplayed policewomen bleeder gansler nearside bartov cetane malladi cadaveric maters alpay fukami overemphasized rennison vincentians ransomware gryposaurus monusco liverani magsi mcgeary normanhurst alcopop aztek swallower laurentians blitzing gunson rashness wher enslin yoplait barricading dicen attfield banh lavrio stechford bellahouston loyo sifter praunheim lechler ylva abdelbaset luchs ineos hazanavicius fauvist latz shelmerdine twinkies storekeepers sloopy mandarina fransisco simeulue proscribes blome edgewise overstayed apportioning shackley epn mazor ethnomusicologists wedi dangerousness petersburgh pither castlemartin compos whiles sdot canalside suprisingly obfuscates suad galabank rusia laughingly faras downeaster aramark grokster peed denna homecare slowinski 
skyship sirion garbus randon nabeul appletrees fusillade richy dillahunt pangloss kerekes knoblock ravenal deknight keser chapati barbacoa baster metagenomics solvability stretham paleobotanist peb nhgri kalinda safaricom heade mauthner croitoru nieuwegein karnow secrest lucasian humen benanti fauns seemly szymborska borzoi landeta perton escapology busywork ormont poors ferina rosalinde thatgamecompany ebsa pooler satheesh lifschitz apts wissel hesselink purolator khil complexly stirchley benacerraf narin maytham handlin conine wilhelmsson leuzinger metes hige kuze marvão bartkowski feddersen bazeley aamodt papiers cosmonautics elmsford tamano enameling khudai garnacha jurby zom dander bonacci diethylene necklines manohara gelin veirs phosphite laureen robbi masahide krupski horsefly arief tideswell yaeko reibel chhina aldwyn aonuma lova sansui rammohan proudman steadying irreplacable signifcant niveau nonunion staar hakala vanderpoel nowinski caral evelin underreporting dysert kiessling keenest ette ghosted meddlesome bacc anastasiades harton stumptown piromalli postdates tahnoun turba snee eroi busfield sitarist licensors kandla linnane shechita taneja zachar opacities elyas brockenbrough balibar barriere krymsk kalach fluker exultant zircons telcom smocks iodate ncic bandiagara tzi mouri almont dangote ultrahigh colourings benali turcios costea bruyneel npy reinterprets molecularly reversers bootsie lowrance sesso scheinberg tyngsborough kieschnick zirkel mechner igal bmxer wittenburg sayward darion hypergiant toryism otsuki metzen samhan mazra disgorge margetts kitanoumi backplate doswell baaba schupp withdean prepay flatus lorrin transpower lenkiewicz thrashes cheena zunino pieroni kosti siaya burdisso kayvan woodworm jalouse frerichs djalili sohl ulvi lobanovsky develope ullal pollutes clipstone angele amoah roussanne muggy laderman papazoglou aromatica supercilious pakka mycobacterial zamil formularies aminuddin motorstorm sinama searl sadlers bostridge 
heartworm syndromic jairaj saturnine vanik bukharian mouchel norfork roday jalousie tramonto yalobusha kassian dinham retransmit hilmer pesters gainford heldt sjm jamea sautet autobahnen carrió prouvé nanopore foppish mesopotamians silmaril desulfurization minitel marans rigmarole galleried caramba toyotas minoring arisaig propounding tensioner wathiq gregerson alemu mender kanz kcnc arborfield harmonizes unbarred oap burrel confalonieri metso untestable whatton patang chicle lavalas sneezed northman rasika mundari docudramas areata tuktoyaktuk noatak spectrophotometry windsong amacuro loquat pried beinart partin thorens tway hireling ashkan sidnei daedelus estancias convergences homare ceesay samjhauta crimefighter wollemi cinematically hemerocallis rubbo impetigo gotto bof andalou stencilled miraglia mammut portora kilbeggan wiegel vollenweider overcall baglione wakeley oiwa terespol antiplatelet deese skims bahoo carquefou extraordinaires taproom narwhals decena chimenti minsters tideland gardasil rubeus ginormous bergier toughening streatley cascarino watmough passcode canastota dumba botchan tankred kcmo screamfest cejas roussy amorgos ryding satelite cassowaries torma klavan orgueil oligodendrocyte newspoll excretes whitestown mccarren glenmark turfgrass yorubas victimology dolmans debu norina demilitarisation ocelots dzehalevich emx killiecrankie hagin nelms farriers wanderson doff grether josefsson dwdm haunch lupins hoaxed schnebel kulon ngah isringhausen thielmann bogues kyoichi jalapeños niah kurtley atrociously veno gynes zhangke gillnet elitists porzio carmont mellors osteonecrosis labruce danno birdshot cardini piñeyro unsworn esche armatures mistra quijada scotton dubos aspergillosis profondo tempter angarsk aimi hatrick gooseberries strathern escoto lwa judentum hellbilly mcquay madon natwar majuli katv higinio pdgf ipas knyazev stereoscopy musoma woodbrook coronaria prevalently tarvaris pawning cancan mirpuri linnets isobutyl dibakar perlow sulis 
nappies iddesleigh unhampered fehling brynjar ketterer knapdale portella rowsley thuc mermelstein cazenave spiciness maquinna brambell manannan northenden pantaloon mazury sowetan hajib devotionals parkinsons véron sossamon akhenaton nippy kickflip pathmark moorilla krong mitti gett palito amicorum glynde atol mijas regurgitates cannet fesler harnisch mouloud vladimiro izrael gehl mulanje splattering keino nagatomo réaumur omondi lella luchadores poyer foulness patters públicos arev mamelukes biotec foriegn militarisation trichet gernsheim venturas supersound progres macallan adame steeplechases bodelwyddan dankert cuozzo firelord sunu rightsholder psychoacoustic hajong slimed selebi schut antireligious hebblethwaite dilatory schorsch whomsoever ashin dockrell selanne ngala stagioni brenne politicus oreos akhmedov eberts newsmagazines blasien dncg snoek foregate expectedly marcovici dissapointed eugenol blumenberg thundercloud mythologist salvoes grandkids plainness cendant empiricists paprocki soliloquize kadirgamar hormisdas inupiaq duchesnay shinnie boteach dimi outstrips necesito hooting aprc prizewinning esmerelda reum pegues kasra nafeez gearhead wayna stirton lessors titers radioactively skold seabass unmerited quolls nushi vaira thankfulness mikell witherby wtvf ltcm estos proles kapala janas ruinas dhows ivon belville bkr ocio ergonomically nscc aletsch broadlands wanborough liberati demystifying mikra immunosuppressants wedmore parlato brandenberg waldoboro deposes ogp reconnoitred steamworks mckew llr mitri wiffle biocentrism meddings bajar dobrogea lhotka dotel razu poite oldland mahabir crile gelinas metroline girgenti littleworth manucho rustad beakman kdaf waxhaw habel chook oizo sough parthenia triable guerard hoagie muncey segerstrom sagua sudbrook prestons munjal socalled parmanand nahshon dhahab moviemaking aleksandras hsts barrau waterbed bhanot marwell crenellate colindale brioni kustova botez mozzie maierhofer pingali kluwe matapedia mateer 
louanne fastcompany serranos krippner redburn pios mindi orangeman cower travesties tinsulanonda multicam gravelled socioeconomics styal escada grasstrack offit ctcs cerina dados nioc biddenden corporately lilit neyer candover gubby malchus saygun giguere dorsa kpelle nicoli kolan hartness atjeh hili cockett usmar tauba caj seicento sifang dictyostelium umansky ramayya neema roulet kinn kimhi worsbrough gouzenko karlan ungaretti lehighton tige videocassettes swerdlow efsf warfighters reeded holtville ventas issas rathburn casados helicina harnik bernardina gowin nabor yurovsky soberanes aeris fuelwood longneck elvises marchuk vendela arken irawan kensho pohlad mitridate pantazis frerotte lubomyr mystere dreadfuls higgin mck seelbach lincou niraj götzis unresponsiveness ollerenshaw mirto pastorini balquhidder gadea anyon aerobraking takushoku dachshunds aitch solorzano libet fouda moreni lalgarh gaullism ironmongers robida bandolero finegold virsa othellos raiola brainin cernavodă ptacek wroten cathedrale pistole renren hollin eszterhas hfm caridi thelwell favio reionization heppenstall outnumbers amah sya remak windowpane copco shriveled bpk aargh halifaxes warehoused inbar spagnoli salver laxminarayan mesmerised orotava tne hehehe cym transacting brambling regrading rugger ebey greers ammended toasty netease irabu yushun denari gwynfor presbyopia lawther weishan wist bayldon greenwashing reanimates dimopoulos milhous ameriprise gurrumul hoathly fruta leggo terashima blueback woodspring ggs novocaine dibnah muchnick roumi gudkov swafford kantrowitz reassertion stanner buckby manlove cumulate rupal sognefjord milita hazza flippy churkin imtiyaz princetonian perahia amperage wwt swedesboro julee stephany propolis punitively dipstick kelburn cabg prevarication lusted cardillo boxmeer supercedes jialing davyd tennapel enh covetousness cymraeg monton hollanders wicky campidoglio reefing poinar genzlinger ockenden tantalising breakages killerton jagadeesh pullinger 
tasburgh lasme goodpaster excises hardway troxell llu craiglockhart schnack opsec prajatantra ivanna argali depreciating shuhada weylandt velshi klimenko declawing pedis caging interprofessional impassible zoellner cisac hyves arthaus coatsworth bedini mellinger prising ipac gulak thighbone pramoedya hieu wws bordetella wolfit castrop omneya microcars nsps parcham hesmondhalgh badagry bogans gimondi rieko gitler omprakash donec muhajiroun gilden suppes voina gilders ypt chasselas obermaier fdma mineralised saraband pullan lunula tabarro francesi newfangled jasinski bresee esnault epitomises slepian atrac koldewey jimson gspc laslett bedrosian flecker hll idolizing saola ischaemia magdy nembe handlooms tachira noid eulex vulgarities dicko cheesemakers riklis iiid hirsute trach digressing krick deadheads humblest tresvant bwe torney olusola donavon moross seiberling hiranuma roadwork irmo wellpoint tlrs shadrack osteopaths unrepeatable lutefisk madonia intergraph gurl flaten noura garren gedde qasmi katzmann bazzi jalopy strongroom simmers chukwuma nosenko horseshit freeserve lissette pragma propper horseplay discriminator swalwell pianto magistrale assaad hispanos canvasback troell coolum privies upslope gahn patrício smartass tripa vittek tarapore hirsutism dushkin falseness turl vishesh deerhurst electrometer sonos amedee obon preheating springboards bonfield superzoom taizo hawkmoth hatful candlelit ampera dipendra belote proprietress malen spirale abdoun moyglare desch renegotiating izza buechler thibeault luxemburgo burnup shkoder qianjiang haroldo chauth cantrill amorrow orrville bowerbirds msgs margita ladyship aguero silverliner sorgi aphoristic brightline bjornson barkworth rebell boggess sheepish playtex darman rosevear ussc morganville persue presidentially oikonomou vibha ardross systemc philippic toxicant escheat guízar llanwern glycosaminoglycans guigou shadid northop dededo stratiform officemax siriraj binali qts hasdell plumeri psycholinguistic vlk 
sofar nosler mystras olausson trevena doesnot stegeman hornburg mujeeb esfandiar altimetry preheated cohesin zelter bové leites shortall goshawks dph polymaths haaga webobjects tunc prioritisation quelqu foppa buttercream senftenberg minuses fixating maninder democratica harmsen jagjivan tonder udoka gessen gbps matula floride kingsize kelle railbed etches steketee groundshare dockum scrutineering roughy urano xms garst plumly maleme orate mischievously cookes zientara kinna chiquimula velle poca yco moonies arrillaga tacita conveyancer ollila miika demel martines hbt seraphin uncharitable tuazon muktafi amorality clownish sixaxis redentor solich quires mccumber ghaleb improvers chalgrove melhem keßler firer unflinchingly bolivariana leeching adger inao footstool ndungane luanshya papà yasar sundews bambra gibbings filipo gayness inching lianhe duetting astill mignonette nephrons cader ravensburger ahorros shaowu oyamada aydar lizarazu bengel ixl cuffaro vieni kins melismatic conductorship nelda naep ebp delux cadoxton mcneilly fundament bonan careened nordmeyer woodchester yippee raphel aksakov goude pictionary fiorucci toptenreviews vegfr arguin bulling algunos sciortino fator coban prieuré chacín finell norddeich normalizes hamhung keshari manageability woodshop internationalis kumon whyteleafe shiffrin baluchis buana valdai unmonitored perim snipping wuerl gwlad silbert cockaigne yaari brookhouse chelsie askance shigeharu nyf unbuttoned rigali pantaleo stamboul personalise oxx anyi coneheads zhongnanhai kabbala anneal owasp relly nonito wemple goosey driv milicia cementitious umtata standford kammermusik lindum huds depasquale deivid isotopically blizzcon maharam sise merca kangding wakey abcb tamboura rocester evildoer watada dyscalculia predawn gaullists garfish barbagallo nasopharynx lugu cameco fuisz keturah loadmaster goeth steading tqm apotex tinti noem cerdà malingering goodlad wallows eudy gunthorpe drinan defecated permira riveters ukhta techne elx 
somercotes hania miyabe madlock incapability ukhov luti rawmarsh nccs schanze halaby elliptically ipcress imedi tremé floyds schwert backyardigans pesek autistics improvment upscaling charalampos malott tendonitis telcos soothsayers torabi mtbs monomoy kayano mandella burnbank karmazin comiccon pdci ggc samis ogam adleman montefalco gersten ghaffari camu sagittae gaffey llwynypia dadd candleford tauzin shali basser rebbie adiponectin mutairi hvm teruyuki boltons henckels hym calstock yanan pancham timestamped chalumeau defiles forough gaffar barch kucher nubile sportin y,z dribbler vinyasa socky wned unaccountably monkee cuckolded duroc microprobe fereydoun xwb samuda birol controvery dokey evanson maestranza egat transuranium kazantsev hymnus recke shergold rushkoff perceptibly langenhagen maccabaeus thern ancho youse tpu única irún gons schulten chirinos mauritians donella brainstormed authenticates nosema photoworks brezina hengst usamriid ironweed bundesnachrichtendienst ktar spratlys lundvall laframboise rinfret makena ief proceedure melisande crêpes fidelma rodia leapfrogging bozza clarkes badong chichén cyproterone dalbello hiscox donruss umezawa sagia hexamer liddel strongbox durley boix goulandris metodo dorfmeister thakor babbel beyle armfeldt facinelli ankrah cuv almaraz whitechurch pepperpot amerasian godby zakia qia liseberg halberds zhujiang untended zhenyuan baliga goba vish ballygawley disengages rzeszow shaefer yuda gamestar hbm peev frontieres thaad uep backless kolpak mercatus yanzhou helmick yassa izi solidary exalts pachycephalosaurs unipol whb gesamtkunstwerk peller euboean debriefed takeshima shortz mceachran brockes sweated pasteurised malians headbutts coveting shawan alyaksandr repertorio cuties simes iui militate naftogaz farnesina deductibles shibani kortright kozol cabooses kienast plautz gropper deregulate doorknobs maer aspheric dusa cossart caes yorga darma guarín yoshiyasu brugghen krotov rasheeda sarratt mughrabi sulev peckforton 
sainsburys praneeth abdulkarim cantalejo schoeller dkim songkick brode heighington mout fadhel ekstrand frighteners stanishev mccrady carax taibo matthiesen extranet partyka gornal hiroe honganji hgc haem philine ramsi silko dieringer magana ecfa edsac heighway transcutaneous rearguards unicycles brayan morishige pixton waibel mushing shambo foxrock yohai amoria llega jaji keychains kambli meare gunfleet tricorn westcliffe grindle bifurcates wissler raposa dæmons gret portbury bucaram harumafuji belchior ramshorn hairbrush ferryboats charla dullest shubik boisseau cascone oodnadatta creveld melquiades bronchopneumonia mcgreal slushy tullock goldendale hotties ibragim antwone voinescu sheraz mjp andreasson mertes loralai restaurante mbuti lighthill varnado soulbury cosel counterfactuals adz keiron guilbeau gabourey latehar chamara porosus elnora stodart sharone proskauer enriquillo ghazanfar ijk mesdag redshaw counterproposal regueiro immoderate corgis boulud hourigan minhang dantley masr shirota toscani devons xuhui ulstermen zonnebeke fubon gitxsan kliff rurutu boey allnutt alfetta mcclements lafaro sadrist bettys inanity sadasivan gangsterism osweiler erosions convulsed royse homarus palmanova blurted spodek supernal atrato marubeni xiping despain aquifolium puckering yazbek chaussee endogenously wayan wops lamey premade kufi lungless axé anonymizing longjing thornham sysco raghunathan terrelle carleen panmunjom manzar unanue povetkin vahagn wuss magellanicus unreached fawdon histologist tirunesh harben hakewill zembla tadalafil amstell takamado zookeepers hassinger marico lakra kibbe norr grasser glimmering devel klare moneymore auty complainers technobabble immunomodulatory suddenness tabulations savic bylot repairmen clunk omertà polwhele kipchoge stayton burakumin relocatable armie witherington abbreviates skellingthorpe rodnina ezaki wsv covenanting holidayed gigolos oasys bicoastal eccc metrix tremec dispersers vecsey scooch harra tierno quente coquilles 
pietrzak snooper despereaux calbuco bessbrook ellenbrook byut vilasrao medulloblastoma peskov winefride dyslexics kozuka escondida larmer legalizes samoset radomes radivoje fettered bellringer tepito liturgist palitana cunxin froyo knappertsbusch vesperia precio radics impinged mapi heiligendamm degassing unemployable fakers mulato saraki punkin infliximab patay motorcross pashmina pariahs spiderlings gurs luckie therin abstentionist beaudet chimneypiece revolutionise boudicca bainter canudos lewallen schlei enchiladas almagor pontarddulais gess sagarin consummating asato stapel orebody tanzanians mechem steinborn goram burjanadze gonesse qiantang kalhor beye loganair hankerson thrombolysis ninos mcraney rbmk wardwell viñales attah oresund redbourn pâtisserie babad dounreay workmates beurs wplg popi decosta fowling cascabel ruffino caroe certifiable circumcise vendramin honigmann galica löwy kirilov westerling patara fisherfolk admited sjöstedt monkseaton colerne niza hagiographers gatcombe scherzando gumption bershad svanberg gracin beenhakker muthulakshmi parlett boiko zitouni dhammika bioaccumulate shindell coworking ningde suprachiasmatic mccalman barkingside snowdrift clady taubert maximov mirzoyan broadmead amap dscs blackground rahr martinborough potboiler noul sumners trachomatis beaudin loebner risby huashan wapato weeting lystrosaurus castonguay hidekazu hilland dtn bumbry faute unbowed treyarch metaverse barnton macaya pazzo sourness goias sarona pepino mavers desireable maniatis renishaw sharafuddin savaii haythornthwaite conservatoires langner progestins eicosanoids maldah mcpheeters reggiano karakum frys newspapermen mawdsley lleol delicto ayon excell basílio recapped khot orba tordoff armagnacs krein kitani outwitting shimmin varty laurer saikyo nymf fastway burcham shewhart mencap earthlike nilton asenjo wiv cybc weissenberg sweatpants burkeville amalgams leoneans reviver endsley alfriston murghab lako spurning siega llerena arcadi ruxley harambe 
myelofibrosis doublethink probations snyderman personel wullie enews edon squealer sauchie feres horseland hewing hignett kimm horfield ardisson isahaya tallet vellacott arnage kröd restrictively apam arese iolanda disentis bishnoi arsht reseau jousts tracht tdg stenka kmw overacting grizabella korova compusa forefeet subsaharan paoletta radionics tarbet quaestors aghion befit crannies whsmith dgca dunsford haversham zahed coreana tawakoni jerling ariffin quintela yumoto pixma woerner bonsor dysphoric unmis thaiday già malocclusion quipping sarstedt mirkarimi sabrosa yoshiteru ankiel fago chazan assumedly négritude elysees kuzman conk retinoids marcinkus weissenberger kaita lrrp unserved portreath bermudas ladan embroidering pertz cannoli vont carll mougins sahrawis microblog ideale chlorpyrifos cyt pauh elzevir khuri gradualist chabat vdr valentijn wintney croteau tokara denneny uruma shinbone fixings extremophile rsta padalka blowgun backfilled rockferry majorettes tuomi detta nonprofessional vichada kenmure ufr birder lünen stourhead wate assy lide peacehaven lombo mikheyev gumley islamique giesbrecht lefler emberá inyokern concertant marlou markit counterespionage primas luder nightcrawlers kerrier vassilios rudbeckia kanaga bakayoko clanging preconditioning kitaj jec stillbirths reyburn coalmining arrhythmic laclau amirkabir brandin chlo pcas artz zhixiang screwtape ostpolitik greggii bluestreak srikkanth wtvr drupa agarwala weisberger nailatikau brayley derdiyok indhu busybox kiesewetter huddlestone janiszewski meditators gherardini faizi economize libidinal lidos campagnola greenfeld derakhshan mesylate pien cornelian sikharulidze akzo lemm lasdun qpi pouting heygate hurndall nijs nurullah dijck dabiri embarras misfolding setrakian exfiltration renold skolnik antonacci kearton moyna defarge tolkein ghil celecoxib sevareid atoc muktar jalebi eyzaguirre whacky pascall augustín adjourning makins coprinus freshford satiated infospace tasi legare unibank zandonai 
mcguane educause tatts ataullah plebes shyer fessler coye cercis noisette frottage frager reinders opic sose ringlets coex feilhaber gavilanes schneiter darvin falko armindo aghai salfit snellgrove nrx womp lambiel nighters ligabue occlude kumbum powassan bellmer ladson cqb tricot tumu centurytel goven gardenhire weakside mealey wassell juston camaros grybauskaitė sundaravej techstars lissie sponging kilogrammes teotihuacán lazarides spacehab bellaghy garw srdan bazille numminen quaritch hordley mojahedin melsungen ardern warily lombe melanopsin mpenza oviatt fantini zarabad betsie stuy neovascularization epinions devita samsam frankenweenie almendros swihart seulement giudecca gaudette gibs foxwell transmeta pistola cusd solaria truncheons blackwells procrustes belan absorbency lemarchand renseignements seagoville aeroelastic miscarry transferee segreti refaeli arenales santhi gorlin lgi maruf bayberry albertz roton caretaking reynders costarring bizri madore ceneri heimo balonne behrouz mclusky dannemann franju koenigsberg gávea gabilondo gipsies islamica equipoise kudai marulanda moacir abidal parenchymal teras midwater jajarkot kolkatta trailway squirted vallentine kronwall ratledge kavak domtar starvin khorkina pinstriped ejc crepuscule baksa asbl milliners predilections uhu hollinghurst schweich gwilt unenlightened bossiney nkc ambiorix conjoining railyards daloa satans lapan glimmers lotze gordeeva marada skintight geminiani ruland pard superblock iriver obersalzberg laveen schakowsky eriq satisfyingly yapi wesselmann mmv adelgid glomar avuncular choden grubber abts cuadros henhouse repressors tyurin ocf dxm shepherdson zeitun lummus dismantles caddisfly brechtian rhic overcharge venial quagmires jueves hasumi farago classwork neuenkirchen childrearing evf delie whetton handmaidens pake carsley maisch rayville careerbuilder duderstadt chenonceau parsee kallikrein karenga fudged bacci postion martellus coughton pelicula balsan fangraphs throgs grischuk 
thirlestane chuma saltiness conceptualisation motoo kenkichi samil candlepower sergii silverheels airlifts jook gerron ceps keiran fetchit bellard shirish corncob inched monett denstone shuddering fadeout racey prittlewell naná westermarck gratzer zwerg ustedes ruina weening brealey belfries barde courtown calleva dramaturgical blogtalkradio vasomotor pottle underwire unfilmed whiddon vona pennar kompressor porec devouard obermeyer lisbet shaddix kafes popat mchardy horrobin filipovic hewins appertaining yts dromaeosauridae shingler giral wua belfrage prefabrication richings blean naadam mauboussin breffni csulb plomer intercontemporain okoboji fernbank sympathising lainey institutionalisation nmma inscape catling farinacci manka colourfully hoka nanjiani tsuchimoto heldens conjuncture cassella veltliner ottavia poolesville coly escapologist mitchinson trohman jarabe biographically adderbury krouse carissimi fazed soquel mrad tierpark stfc metrocentre demystified zaidan itsekiri saed kasumigaseki snappier uee pinecastle kamari ranglin severiano croesor matyas dtg sarsen kiyonaga entertainingly avramov keinan kazuyo condemnatory wbff punditry llwyn twisp etro interdependency kisoro bendt kremnica bièvre metaphysician ruggedized centeredness navvy raca frontstretch schwadron kooser baverstock witteveen schueler dats lrh baith pommier outers calin scrapple ghanzi lumbreras cyanides akhmat samudio acronis clasper lobules darted ashfall americanisms kaung pectorals bezeq burrillville tenting kovr facelifts bisceglie briegel norrish gase florens rivaz fishtown batti katic mahabat submerges magu unama fithian fitbit amfissa oligarchical countertransference junglee brians lindstrand jaipal topacio ulas tuel peti chadd laziest moster panynj cairnes ebbers zhvania glenrock bradstock agbo sherm curteis cohutta moskow pomerleau sabeel kornel htet naut whirlaway viduthalai frontalot pigmentosum donné mwamba reja trr akala calland geu boardgames melanosomes tyrod serums 
dovetails hodnet ineke godparent protectable tutus papakonstantinou resistible sémillon twachtman florissants rajala cowhide photoionization forero tjx cdss subcomponents visegrad aioi kirwa ruritania perishables sollima giussani basir ultramontane certa hhi taroko watene vieng lochy nardin pederast udders durano interwebs ritonavir ruido canner yingling deionized mbatha gonen littmann rsw enskilda vilamoura goot venetta verbage puea khanji mcgilligan bulbus schlueter alterna supranuclear meiri denisovich sharland operationalized aerialist chaperoned rumbaugh eagels hooey sapsford bankrupts brookley thorkildsen ruefully monterroso xiaoxiao bakili rolandi dorie nho heavey shangla casterbridge tinner teslas sonogram lavista caat erythropoiesis frankmusik williamsburgh manjeet kotlikoff bartl gottman fromkin refounding hve pelamis agma unalakleet tiner hideto zinsou multon bomans ancholme foward ashrafi jörgensen terazije askegard kinte prh morency calcagno colombano wurzel disrespectfully amiably pichet cactuses viljo vonne springett nespresso vestel lauchlan myongji skateparks warneford millenarianism affectations tootoo ultime presages stubbings koslow shantytowns naeyc chinggis hexose diggity frari coya janvrin mompou mondsee romiley cunneen talboys hamartoma amodio gawn judaization earlston lww demotions afrl culotte downfalls importations sidesaddle glotzbach busman binley abdolhossein pih eio poppen interliga deanie teetotal cardonald alexina buzet reintegrating herengracht risan eicosapentaenoic tverskoy wilcher lotan thesen luminoso jianlian muise cheapskate nungambakkam medlicott armourers ciber royalle gabai sulfone balearics otolaryngologist gorseth ulanova ruffman centerpieces farentino akyol speeders sistership poage slimes tfx highley somerdale bardal slovik sussan martir coltart daifu quinns polychromed nagl tarnishes ardipithecus mililani kandar shortish geoffry cabby alverson domiciliary kormákur muelle pitchblende repopulating sarpa nidre pellerano 
tattooist galili hochhuth dutchtown monocrystalline condobolin windbreaks apisai sabinas frangible woodsville lambuth bicentennials wilhite syron breckman strug uher squeezebox octanol heeb cortesi liangqiao hme zizek junor contort swedien kennaugh cavada ozanne tuusula hypernova hayesville hardon carrigaline ague leechburg nikifor balaclavas mckain prabodh higman lewisite buruma anslinger unspeakably hyperdub fusebox dillons primos laikipia chicka ─ quehanna dyin activesync manabi bnfl npcc weipa arnalds jabulani contrade lannoo sólyom stilled overcurrent superhumanly josephina sapan detmers multisystem connoted buress ntb reedman histoplasmosis afterworld chiavenna itchiness measat cohon coti bander ghillie bickell bacteriostatic segantini monogrammed aweys strathpeffer cetto sugamo brooksby colander hypersexuality tomizawa waldschmidt ferring dujiangyan blencathra pizzazz glasswork ildefons crape winders kuenn darchinyan youd gelignite pharmacogenomics accomac ameren portner granqvist offramp accelerant motomachi scrushy katyń darvas castelao klingman spitefully ricaurte arzobispo sipple thackwell ponson lrm elibrary kasun gurcharan dbn cavefish collapsable trango degan myrtles ramadoss condello saverin ftir legendarily tindle tradescant pantagraph masing bubby lisvane prochaska invectives nevaeh cannibalizing bierbaum massar fayad gollan herberg candlewood bandish droppin mehmeti verni entablatures malmros dote rodong operalia mirande baerwald shapleigh margueritte barrancabermeja scabious sunrays shott ariodante mongabay merrymaking soulier perugini breakwell flossing aptheker cfos ralliart schanz houshmandzadeh feare chy gwaun brackins aerosolized kiawah didacticism dardel delude allos raisbeck abasi carretero fertilizes samplings bassenthwaite pantun leaseholder slavik millson qattara editorialize osibisa baldinger schmoll obligates heawood avida tyzack tcv wieber servicewomen candlepin coronis rationalising pallant couturiers gugg jauron berkus skellington 
ooxml bouffant karmanos warmups hanningfield fasttrack rodgau hassard xiaoyi gasifier showoff hagos veau lenzerheide aventinus luter anum calme breaky wainaina addabbo maricar venceremos gilet yarmuk natch gorkhaland wiu bhoys westminister surtsey siffre executively mangus campani fuqing hashima westwater shapour kenward pilita verbeeck rfra moktar przemyslaw elmasry morga saluja simsim procrastinate trophee cassetti gelabert interpenetration hvc killamarsh badley estremadura robeck apicella maleh hert weinke millares nachtigal kuharich kanemura lachs incurably teps outscore kaisen bahjat koufos schrenk engrained brayne economico policier lietz seamaster elnur lukic wheres charmbracelet benante iselle waked karros cohabited buyten balamand veneracion omohundro atoyac sursum uncollegial furley epes schabas programmability karega kluszewski alydar granberry outsides arbos olumide segodnya intu kottak kovachev thae suiter fert gurmukh ngouabi kirkhill kanaeva brookeville abbandonata faul gumble paulius oudry clarkesville etchison blowhard salvadorean fairvote zomer tarsem lysyl pollina vibrantly shabu kaliska bororo erkelenz mohite mammogram udin privatbank hchc caccioppoli reputability burgener hanaoka wittels meur pionniers fmcsa trimdon schaar cornflakes longport crutchlow unvisited smps wheelwrights morros hower woodlock marquart karpeles afspc tradewind bobin kight aquavit guavas priebus recomposed clearwell blackleg snowdrifts lalwani jelani edenvale sóller arsia anx wangmo bilgrami thermoregulatory clucking vendee emersons visy farole ypo daehan kidbrooke franschhoek ripert brummie werve loudi garstin meteosat shabaz mcvea kanamori doorly furkan monetizing fellside aldar emeth hasil jesty blaina parlow hesitance hollandaise pissy siddis harmston listeriosis abramovic biopics putina phthalic nzoia europeanization stacpoole dombrovskis frontin fitzcarraldo shadings cambronne xenophobe chomet gwalchmai widdicombe stuntz sharqiya geda gilfach mykel compagnoni 
nordjylland ingvald sanjiang citysearch lichenologist ziebach colwall capriciously fairouz pellatt hakuho sandfield hussin evn dartboard darst chope schlink khap bendor chiarello zalmay theorises galisteo biopsychosocial greasepaint gff apuzzo jesperson ovett gwanghwamun kalen wsn novack nightspot cumber nyam llera calafate friedberger heeds landgren redecorating taleh asmir adalia staite lapchick munira vija sperone entreaty databanks dropkin shariatmadari carlina mavinga barefield innaccurate malba orgreave industriousness handcuffing medhi supernaturalism kcp lavrenti inherantly trellises multigenerational grans macintoshes pilobolus daiva synthetases ruttledge larches erpingham kaut bedaux tashima revamps pityriasis overshoots reevaluating arkansans mappy allcroft jotted cialdini mastung kashyyyk lemington maraca kurkjian uvas longobardi kriging congolais péladeau virage darabi taverners samin wans carlisi tracheostomy acciona luhr sydor hippogriff atiu gaume itec mackichan pelaw manchesters wadkins godefroid brownjohn bossed laryngoscope sychev radegast forcella moroso falaknuma grein basiliensis keawe audemars adcox penderyn sogavare compadres haotian rodewald baleful phial sweetums hostelling ojc wesco titleist aleix sturmey renison chawton daraga kadowaki kingsburg solebury savion humaid testaccio lipolysis arulpragasam michelmore tizzy bethânia bartholomaeus promociones pineview mohammadreza azubuike snaring epas carapaces matthaeus esterson kabongo scient alade anerley ridesharing arvidson propinquity gyle coronagraph prothesis kotoka walis pastorelli tsca wilberg amsler rajamani nastasi unforgotten petanque chinhoyi intocable ivillage technol adenhart kirks weissensee tambunan nicoleta mirpurkhas pannal pawnbrokers woodlake ligula haradinaj ishant penni steakhouses kabaret flakey fewster claypoole kharif blandin govier effectuate gilot dentine keate ciss supernormal lancair suhana apy hussman padian lognormal organotin chrysostome thuggery ibrahimov 
mackrell nyrb bladderwort jankauskas outpourings ustashe kashmira youyi thalictrum gangwar pentobarbital mabye cspa evett rakowski rabello tpx usms brokenshire radamel gracefield rockmart confrérie obiora makhmudov wili heeger kitties abraha ashlin badaling elmley vardalos amazin somatoform bunnings carlill calcaterra unfused wheter había lynnhaven amantadine sticklebacks frae kuman publicizes budiansky sirola hypermobility caponigro nagila hybridus iwk cardinali boguslaw pateman armoring appg rameshwaram laclos dongles newdow scdp huntoon fcj exportable murdoc halkin uncreative luthra aristos flied njai floren woodborough obin ravenclaw butman airguns forouhar tektites benaki bragan unlearn sangaré debility imazu snazzy caldey gamin ifpri mixology ogm killearn nemeses loutraki gautreaux ricocheting issi koistinen kuffner harang unsurfaced prum maliciousness transmittal schroedinger sugano popsci adairsville cicinho binbrook chinglish kenting samardzija hoor zindler chadda edenfield boco iler inoke wvtm ephesian ambriz mait bedsheets monico resected fogelson marcelina lytell evicts apurva robstown jombang rfq motionplus diminuendo seeff tagliavini tailcoat gotama cosson dims bolender mainer brandao nams moneo lowkey cudgels alinda dearer lawan nahash annadale snavely dainichi adenoid expectorant ceremonious rdh reconciliations takiko enrols wakka reneges libelled eagleville buchanon mcgown sinkers willo exhall ghari thermoluminescence carmassi izon scours qurban koloman morsy merchantville hookham pyruvic underwings ceara fuiste munby cinquantenaire dewart askren unvaccinated coolants hemans nmea leadhills ionizes camba pontyclun shaner brandwood chiva netiquette opti recapitulate abobo imperil larrieu arrecife maccabe stabilises dudok ducret ecv ambert spillers okkas shuangliu mørch sheeley wango carajás antiserum embezzle kumin yakovleva onis sarraj paet jacot tenneco disconsolate pisciotta lazic rheas ostbahnhof sanming myk mariola krikalev splatters 
caruthersville bigland gaters acerenza mangat gosch icta harkis undelivered ysp feminista zlotys fintry alerce yarmuth herskovits kielbasa myelopathy cogen dagoba kellas chengzhi sabic egc mascall bertoia thode djia ippolita mosey jilava brunhilde earnie olalla datsuns bandaging goyard rhue chivenor chary devrient ndo scabbards mumbled gajewski heegaard sadok gieco moer topiramate beezy jolting federalsburg venkatachalam reinette kaladze roxo pohle tookie fickling pnina foeticide oversteps butrint millionairess kiprotich thruston leedham huddling frolinat bussière dissention daae fillan killanin nomenclatures anglicana vazirani arleta aderholt ivanoff bevelled mamontov sanaz eastover zullo scholey micanopy mellanby setubal torke creer computerisation malis heppell kirknewton acap chare gusset sheriffmuir bagnols nosworthy waukee mansoni twined gtt liberta fuddy electable sukanta hauch etnies marilynn papac singerman lampshades showtimes swissinfo osnabruck buggered tuthmosis gorkhali oyen bishvat vittel pinochle arann cadens kauto fauver plugger meatier madala blomquist kaffee blanchfield steinmeyer saltram jeanerette egrem thoren centrex handfull centrosomes toppo lucidly regas pimco bearn ferrate nathi pities hamms unscored seraphic sandcastles garcon shukan goodmans boscovich dofasco evington steenson roughcast fumikazu molefi tanmay hawdon interport luaka mitchellville injudicious oldknow asyr neeti elongates germinates radleys armilla okulaja titties diandra pufa flightpath mayle coopersmith loehr mukhabarat smcc artless gencer dolmayan marianao kessell nnpc haltern tuscans bodson qim harmen adolfas campoli craftwork misreporting dostana gerould nemacolin schoening bendable mlx memebers fennesz sawt cotesworth indika hippler overtopping eversion melco delattre brancepeth mawa indep parmley gypsophila viiis lichtsteiner dashan necco transtech poplawski cameronian yankel airspeeds kitted médiathèque xscale burgau tenix scutaro balfron griffeth capitulating 
springtown whoring zorra vavasseur lochside yohe giovine accosts bpe esotropia rooflines seraphs breker ourself couchette kaleri lofthus arithmetically fritton finnemore phrao wcp osterhaus hollandsworth blach eemian clitherow bungert mismanaging lokhandwala riall makana geochelone rosés cidra colloq kutuzova bestwood kirshenbaum crda loeser uttrakhand frais paediatricians ratchford jytte oremus hpf menlove ciguatera muhammadi treforest desheng dogue objectify zhaojun trautwein isic heigham middleweights argenis tujuh delica ontrack teairra adieux cotts jerai trea frugally hewed dulux kallan consoli standin lundeen hahndorf invitationals darroch rakic hiromasa massmutual nitinol natales kennison goldline huludao howdah brays acrolein magots uncaged uneditable tappet rosarian spatafore maures killilea mouseketeer waterpower avijit dovizioso darkie hifikepunye gpk poges horndon demetrious unitedhealthcare urna brusquely evoluzione superieur fauth naté echlin viets dagher lbd isobars milot carbonia cebr blackalicious unwounded singhvi rocklands pachuco amoo matese apears hirshfield houssaye boerum dullin tryp kickstarted isserlis exuberantly thimbles rothmann vedado bushism makor chazen intermarrying elegantissima tonko kambar baltoro latigo muckross documentid lunger valentinos pacwest hollandi smout merengues mcalmont fmu masin frati voyeurs minassian yakunin wacc gpw endive sirc siero ruutu gucht goofball mykines atiyeh hako moring milieus scaremongering hafid wightlink pipiens minetti scharpling longhai rossia shigar macgeorge bratcher wangford snt pandal extirpate repellant armitt molden tubo tabler novye cheesemaking iggulden defla mushu bailo colorations marsy caked ecuadoran brunvand blogsite strandings irrecoverable greengard kullman ugni walby ampoules inler kess starbird boulger cvh debreu meurs rauber recanting crotchety carnivalesque megalitres ecclesbourne serret hdw minetta overflying crotona volosozhar mamoun jerrard nefyn sirolimus mohideen cenacle 
wees visionaire facchinetti poels firedrake darbuka megève ethnobotanist ivig rindler nikkhah laack melker proinsias tumby masoli shelleys guede bevacqua dunchurch hiestand cellulase springhead zubeida jetlag brookie jemini weatherbee kuypers fromentin sbz balakovo civilising muraoka unrepaired otoliths deste thierno strategie siggy biplab tulk langsam giladi suel eavesdrops polya mulenga adamowicz noreste welspun wenxiu jacquette gnash quester dawda nessi materialises glenties kapaun mogra magdaleno ressam behler villalon besen almera chamorros castrating takacs roome headsail schaer borrel sangala cits wangjing frishberg tomin laloo pwned marolles interpublic sanitization newmans mcwherter keal meggett vinaigrette goolsby luckinbill tortelier anatomies developerworks strums medows diverticula individualization beseeching graus stanislava guinier pendine barberio karlberg trixi drivetrains creeggan meile bodde ecuyer wnep randallstown zzzz junod thse garve neethling darville weiller moix milord fpg chriss calcavecchia purdum rothkopf suppository rennard emig merryn culzean uwic zaret beauchief spencerian samms greying matuszak cesspit snork untying needell liberopoulos posener orderliness ayyash selvage balgonie remoter osmaston pako crianlarich lovemore laduke ozer stasio containerised handwoven aury wwb grean wolfs yayla neighboured icecube zemfira placating sirf castano depaolo croons pasmore pocius vernons cryptozoological boddie glassdoor skrela macafee curent insitu iccr rudloff berek pronin sociopaths netten flogger kameni gascón axiata shawky kotick trimingham messageboards blabbering haidinger schacter zavvi lendvai maccorkindale stronsay gibernau peyto charlap izuru deek sipes frugi stepbrothers waldir inoculating janeth vindicates denisse extricating frediano cocchi hypervisors dorai qusai cico nvs blahnik manesar greyling witchdoctor constricts shorties szczerbiak corringham downhearted chix ricochets gusinsky enology bisconti ymc perama lubang aryana 
asok ellenbogen tommies buol castalian scoular ingar zaslavsky riehen bonampak orilla havilah eavesdropper dmanisi sesh ragab betteridge guzzling braydon meigle bahati buil publicidad dokka kerrick csto issaquena pacitti dembowski panno ragaz grohmann woodlots afghanistani oubliette parachinar presense annaly puhi uncommercial bullmore sipp muskego blacher nantgarw compay twellman stampings lifebuoy blasphemies licklider antlered mstrkrft maranda villawood hiti hathersage jaguarundi gmv korsholm fantasmic cuti querulous accf calacanis untaet bezoar sheepishly clokey malecón konst jatte filderstadt textor marçal divertissements tergesen cammack digitalism penberthy manures kennis guruve lassila athe gerra likelihoods spiridonov keamy senshu permissiveness dogana snj iffi canyoning wahyu circumspection picchio saranda weda unwired forkner triebel raymondville homosassa kocian giovanelli arkush charmin bresler brimhall spondylus wittek kappe vasanta fairbrass bishopston inni fers gigo hanon rautenbach pimento chiqui monachorum sweepstake herse klapisch platitude gidney clower bioassay patarkatsishvili picart theydon abdelkrim sapin teymourian dhiraj carmike mhe munslow sooth sterzing fistball kachel gusted mcneish rosenow bajos yorkston adeola rosenbluth hartsock zumthor darina cusi contalmaison ikbal kamra saltspring ppis vasoconstrictor helia floch portwood malawians gojra gambari earmuffs surti rodford conair pronounceable depts tahirih dongsheng macrolides abseil etn chizik pouncey boocock sanitised geiringer yifei yonas pantothenic eikrem meeny asps grassington estimable karten botas tifereth gregoriana meersman chronographs unfabulous kizza bovec alland nanocrystalline richburg exbury yermo demitri wgf juby ouistreham kealey winglet hertzberger baskaran tiw septentrional taag knapping aeromexico oromos gutless dombi tilke opatówek cottington najim enfin hinwil fiducia barkow apfelbaum geebung makka yeary ignashevich qahtan dissimilarities passey anqi numerously 
odoratum schirn bradberry ochroleuca hockessin millicom gardners barzel gedera refunding firebaugh mutv lambley splott chocks hultman habgood souped fitzpaine trialware bozcaada slapton scarano dufrasne maynes jrg pannon cryptozoologist reille skelos hoodia banaue guillermina wiseguys fotheringhay vengsarkar hikma hulan grayed amz willin madai lyngen plemons ewyas theys barab kuso nitrosamines wlra balcom bourgault varkaus poochie abuzz stadtpark propound belga bottoming unionised copson soporific swanzey trenholm kulob sundhage hayslip vinces fatat targett jianzhi darksiders fiesco hampdens tpj campervan thorner cheechoo sabbaticals hendred baalbeck devrim nutraceuticals winlock marinara santora populariser chaan indissoluble petrakis surprize retton osser donaldo earvin monstro biscop orestiada blythburgh hemon chella kornblum enteral otology topnotch amemiya honeycombed relived gaetana krook ommanney wella jampa finnissy tassell ysbyty sergel fertilise seraphina papabile sherston pobol amonasro polyscope geibel duno ghp flunks ondi xidan fetishist tdh ersin vatsa nasz pepler shereen trancoso tuoi deusen conformable howman merkaz bbo deti lwanga photosynthesize hissène farncombe pakefield jimin artek chapare judaizing lizotte dcps jenko yanling tividale sveshnikov barkai cardiogenic dilger electrostatically olexander kwasniewski vanston wina pueri schechner raiz hinda monsun satwant liberalise intensifiers middleditch oake panguitch hensleigh spaying cerar washougal turangi meadmore highams sumaya gomen cianjur cuche stephensen telephoning contee sloshing convery hormigueros manuchehr yassen peppermints riski barnhardt interceding quarrie dermatomyositis rajnikanth unamended walburg gaynes xmrv sotir salboni euell jeppson countee valpolicella roewe ruia oubangui crewson twirler pandiani bushwalkers schopf maroua partypoker elmers wolowitz systemverilog queerness lonelyhearts vizsla peered bouc slurries makau ramotswe marzieh knightswood sorelle morwood chomping 
bwin teia mahaney glaces kitman mccamey vaccari hartville athleta hamano grosbeaks lence enterica frigidaire medicean herradura baseballer kyrill palade riestra acquisto ganeshan mcnicholl knutzen lucht kipruto antigovernment rawalakot gadahn castlemilk millrose streamside deerpark countertenors zinjibar elsen yattendon ireport metoclopramide dragsters carminati bradbourne durieux accorsi urca diederick contaminations entel kiselev bulimic xilin cockeysville cryptograms gento newsweeklies lout marna pvf sabermetric exhuming lyndonville deignan bottlenecked sabet birthers clearnet saloman niseko bombes bilboa unconsecrated evincing bardach hatteberg mallowan katai axler conspiracist rutstein manhattans goguen bondoukou abrial anticlimax nutri sarana talismanic coupeville concreteness chindit riccobono spagnolo marguerita offerer posi gammal marett dansby ethias handelman patkai akie scowl chargeback benke rosch terp verwood lmm meisels liangyu arlyn wakatobi kenly badhan dissociating canonize bodos shortbus agressively shubham fowlie flageolet schobert franey wolbers cherchez vitous petrovka kalogeropoulos dukan cione unpretty matts sulivan moginie hena maggette ccfc gwaltney lettieri coundon puel hatcham martinsson laton cloudesley torti kajaki anderegg nunu netsuite middlebrow arcor mitro crinan bkc wiersma wimple bioregional sifa faramarz lemna zambo bielawa fbm prole kemet adenoviruses traut chcf sooooo brington behavoir dilema lacera laghari chillida rasam cordula meadowview jostle heidrun supertest jamalullail khakis fffd shuna ilife constabularies photocell inmost notating ermitage perfetto waterkeeper yelm floresville straightforwardness greenspoint salatin jiayu alleycats kragh korten augh couverture wilman colophons brawner walsenburg signos dreja rafeeq refusenik amatory dextre regola reznikov pinda nuti frate jogjakarta creeped périchole areni longlines bennu coleville lonie tuukka linschoten nobels portuondo trieu yongxing portsdown esmée bastardy 
bleaches maayan marak scpa hovnanian shigeaki heavener exigency primerica kandaswamy adley mendte centromeres stanmer yucatecan walmley fiano berlyn walburn lennartsson deliriously overflew melikian paly oakbank gandil divined daou marmor mamaia ntg leasowe concealer resignalling lundmark zini lenham onyewu sandrich coments labastida vuoso amicitiae pilav ringens capeci skalski dubourdieu titch feyd taikai libin raghuveer solnit siberians ahanta thackley schmucker nields rammell janabi schwartze lety zagazig budgam marshyangdi walkaway grévin cicala belsham shengli broadcloth kulczyk birgir facia yangyang elya jitem ultraviolent warrendale vanina arminio lightheadedness panchmahal dunny kessels appleworks waitressing colsanitas streett lenience rolfing diari convoying apso presle entartete prostituting khandro kabar smal multilane quantz hookworms perfused lipper saltman favorability meere cothran rodier hoeffel annah hopland tias gaoyang poops gebran debunker oldfields hodo bequeaths etha antispyware rheticus refinishing lowson breastplates marib apon ourstage tionne gulaal steinhauser terreiro paranaque misti encumbrances croall hapmap quinquennial goytisolo wetterling borremans understudies mayuka ballaugh chavda castlehill mjc dominico factsheets succinctness receptiveness abbyy tsintsadze planchon cityhopper tajín akeelah bogdanoff unnecesary brosh ramco fienberg nicodemo yanjiao wghp sonatrach fallston weideman icepick xsi sercey majima sautoy oughtn wilsford stradale vuze frack smartrip thoman rozhestvensky bakun whimpering baren nykvist heka bathampton virg kwapis olivarez mabhouh beloeil epargne hawkinge bardolino marcescens bedstead agresti thaung souljah steffani ritc jérome frades braselton malfa medlen yañez lippold soliven welin nasif racette aija ruttenberg babeş matsukawa unmediated bgy bahnhofstrasse cootehill skullduggery batasan alaw propitiated balz norrmalm brasi giuntoli kinuyo festers nrps fuenmayor hmiel nunan framemaker reprobate wilpon 
blobel odysseys ghilzai zoonoses bridgeland dendrobatidis prego bbj billingshurst delite grassie moawad volaris raceways knxv psim formalists pyrrhotite trover vacillation brunzell lucasville andam solido sueno pensnett benihana hockings b,c heuvelmans meulenhoff hyacinthoides ecklund vidrio hyong tums techtarget isoflavones vohor beitrage nasseri playmobil turkmenbashi hoodbhoy leewards willams unkown mirasol houstonian refaced chetna mishler betfred railamerica siani docwra joyriding larmore teardown enuresis arey piedrahita direkt danfoss pateley monopolise libbie vpk toldo fullbright kunert americanize troikas arly gasperini falconio reapportioned piperno triomphant tudou singable freebooter wymore masella winny nghia erdemir grunion logix repayable ccar gladwyn olp roder berko darry boggo cittadini poststructuralism hartert restenosis shellenberger plzen macks mogok wego kinneil luquillo melky aggborough scherfig sgo usssa phun acesulfame tlf volodin tabish norc peekaboo cnemaspis smales xixi padshah kulash calello nonmembers hellner asham pretties giannopoulos risse changbai seeber brugmann broody siddharta rcg krumbach supertankers braunsberg adeleke ibv goalmouth jeckle eastridge mileva kelby thaipusam rigondeaux flattest entryways nyoman wheathampstead yarosh oropharynx evasiveness gáis shawqi heterosis austrinus transgenerational vrb aloi blatchley gitter tlg alibhai sarie buras diethelm ihrc fayssal aspell colhoun nonesense polycomb querns transferrable melillo brodbeck fatalis eurowings hawfinch lucke pacte diversey mignone fifers natsuo cholly haraldsen crutzen pedestrianized mekka alpujarras cmts allamah tamerlano summerlee thermostatic debited scherf jwst trerice aleksidze lütken andujar ishchenko rexona spacecrafts lindemans teele remengesau horticulturalists eurochart bluffer nazaryan siumut certo pinups boxgrove martorano accenting jitra fieldfare xmi mershon chadbourn nozuka dhumal adelard sekt electroless macchiato forbury witczak irreducibly 
mutasa nordan minocycline ugwu braising bitti straightway flatliners segui differentiations sirian fafsa hundi webkinz guiltless hepatomegaly prostacyclin pisana wolffe ssbns dutschke janai filipov bonnybridge lubec lihou rocketman kranenburg overheats glister catán nowt wehbe fizzles mirel mollies khawar kriv kohistani woodcarvings antiquorum tubeway peulh blackstrap januari appliqués brugmansia newsagency provincialism howsam medalled maytime guira aasif bizard tjuta venga iams heitler tuonela wgl loomer alers liminality gochujang tilos bovard stoilov makars wadge lochend frendo seckel bosal abondance originaly multiline courteeners uncontained sarason cokie ceas wenli rififi mezvinsky neki minutewomen understates poppier krakowskie polytheist agas fabel heesters ainsdale yefimov capybaras vastra glenburn lynchpin imageboard anglada impresarios jalloh luminosa yarnton yarnold bramer levinger kalaallisut plymstock vainqueur romantische conflictual lefrak ngwe lamido xingyi ropewalk templesmith disgorgement vikes nyoka volleying byzantinist shandwick dinata stinchcomb nintendogs mzuzu mariposas brugada mongkok gouldsboro wanzer poil snx biagini batsheva kazeem wbr afterthoughts yountville paxon riderless ekibastuz derval eberlein panhandlers shwayze powszechna mertensia villeda likeliness oystermouth blackbushe sidey micoud constand pastan autorotation revenged lumines alparslan warmoth paterlini plut devault synnott nikkel blye shawnna exciters sophistic pacificorp esquel ceder noorul bulley caremark turbeville orah stenhammar beninati balakian selinux montagnana esea coddle dereck peñasquitos chilham limpsfield haydee marketeer benching admixed espressivo moanalua orecchio domen tanui waithe spolsky mumme marring cains overal shagreen rascon hadramaut normalising finnbogason feneri xenobiotics minneola palmera boun cognita basters oldsmar farallones femsa tenny neptuno confuciusornis webchat olawale bondarev gauloises sadistically kuja sobhi homestretch idus 
adelin manaudou wesfarmers silverwater golab laryea bams urai srey ismar abhilash larimore wgcl boucheron nitesh bugner roady amoli jref highers mythago cushendall ntozake sturr sabzwari samsonite magnano marsicano renger onny incarnates riat volonté oatey absorbable zeytinburnu laran élites mfis perfidia lambertini saufley prend vmx befor neonicotinoids nonproprietary azadeh jayde navsea leggy bidart plett stoudt actis wojcicki victorinox meopham corkscrews luego exosomes confirmable behrami kongers lurching boltanski inexpert obligatorily requa kozy moakley walkington blomdahl trowell vampaneze quddus podiatrists ciaccia mccary kirchoff bildu serratia rayer huseyin watros hnb cheswick kotowski baadasssss bruces hett jcu lilien mhealth claessens tatsuki abdulkareem kazlauskas dressmakers mansha evangelise masculinization ahsanullah expedience ovie divot childbed unpitched salvaterra middleboro shair scotney barela micael everywoman manahawkin legris viridiflora negga frua undercount commingling kunashir mamerto quisenberry vnn andreadis intracity stemmer tapie necrophiliac feulner kuhle muizenberg rocke foregrounds kabbadi pawiak vitara funkier pedrera izakaya retinues dingwalls regenerations douve faca grindstones enclaved couchman gisa dipeptidyl brustein skyrockets takaharu bejaia debase gascons brownmiller ramstedt cajas herzlich yankey medicaments papermill hevea nanorods farb epw hydrops mistranslations recived sudd kisho schnepf lanco viverra elsy junning davoli juric woolens zoomer colwich trystan tamlyn durra kraddick varcoe temaru pasca renesse schie ueshima preordained iei nestos valise tanne parkinsonian sener piëch corbo yeganeh numbed ayinde rotundo cestius mctigue torontonians schweikert radion clusterfuck jtb ordet rudall duenna mclin ishizawa ramie batte multiday argueta achuthanandan trivializes kimbrell jazzed niton ccx sciorra hundt smartcards giebel attridge gheyn eurotrash tober pitka jagdeo jackhammers christan plai ritzman aati topolobampo 
trog hmmmmm migliaccio carena thusfar backround nocerino choteau wulong paroubek phlebotomy altavilla baher lindane larrazábal tesori scrapheap sportback sabara cadotte périphérique lft jrcc kuntsevo croquette mesdames welti wpd oegstgeest valdéz urbania shinar queyras pétionville timidly harakiri atole kizzy yizhou kronberger agog greengate otamendi kundry kondratyev osieck karpman asgiriya utian impecunious sarhan bjerregaard ferzan abtahi raynsford stobbs bitmapped unionpay bembe stelco holga starla awfulness cohler sublimely lopera nosov fionna dispiriting hollaender naimi mcfarlan seehofer scra carabaya rcds soothes appan ballybofey dervla vogelstein twinn homos limos petur verdure humanization demur exmor lasko yoki anglicisms earthwatch jugnot jasso changling buerger jumbos nairac devco messori kordell hanekom carping sariñana kauppinen lifelock mulaney gete accustom insley cannibalised bailhache criseyde obradovic redistributes badoo hykeham scrumptious elettronica cnhi komeda southie tarsia zirkle fukudome sonoita machover immunoassays gangly kartemquin imaginationland podsednik arakaki pienza skus kanzi runcible easterhouse weining maconochie berning gatlif vaster braço accrete lisk edun sammarco goelet izarra spicejet stigmatic feron halma appley cuadrilla idiocracy luban centura recollecting weizenbaum thrapston ssgn shengelia liuzzo towey poleg mudeford bakaj chandor gyalwang lrd udeid sortilèges tavárez nordquist anv obligor gerundive gangloff milvian moler kizomba auditore multiprotocol klans mühe cronkhite meningiomas cfpa cachi coloradas unredeemed cunniff kdv berhanu rillington shorthaired halsell sârbu capulets isobar flameout porajmos castigating autocrats zabbaleen villosum ruddiman arthaud lactobacilli amaliada sanan homar wkrn satsu badmouthing alshon straiton shanel jolles subaqueous diekmann rooij mitrovic palmier crépin teucrium mitred bancorporation maslak nzx miseria lasson anindya volek tadley congres flipnote alpinists phog creak 
crichel coonamble kevins castlederg bellowhead reproof walpurgisnacht antivirals tylden magilton gainfully ampney peeked servility toppan sparano ruminating neua overdetermined limulus steersman slaters latters debabrata kopacz arcuri movingly saltwell montandon plantsman fiats whoopie danette kunsthal milanés bragin senario loor cryptorchidism zareh chesshyre laurium terrien sobrino exsist aglionby enevoldsen stohl rousso cyrill jakobi kuramochi stratocasters connexin irrefutably migori iesa himmelfarb gohmert vdo hochul gidman asthenia hais olegario yentob nemmers lubick kobori aracely pensieri stanimir wirthlin kumarakom mccullen buntin aberlour kelabit lorbek faena boorer phas elwick oshie cullimore xiaomin omnisport premies wirtanen chislett mouza larrie isy siemer leunig hyperkinetic takoyaki squirming litwin jobcentre dorridge robinett tompkinson nitrobenzene neocatechumenal godbold martic getto dkc excelent maruo namikawa tausig dallimore grinde planarian geovanni ftaa slateford rambin prai moyola shuka safehouses gigabits fgcu drylands forints hundredfold saheed gerontologist golombek aulin espuelas vanderpump karoon ables butterfingers whith pyrolytic haayin magomayev magdelene easiness purwakarta weiguo noticably tibbles nyingchi hinzman webiste mooncakes parrs thyrotropin memantine genc radko meindl scrimmages divisi roehl seabright georgica postive triac kardos stannington meltham aardwolf jaanus shrimpers tugay housebound sotero vanderbeek garamendi saric forayed supervisorial cacc riddling gallager headcorn collagenase firebrick rocketplane pennon mekon shamsur sanaullah haikus premia akinyele deconstructionist couderc antifeminist yohn iracing conscripting buiten carriker fruin shapinsay henryetta hiatuses höch buza dellal cogger heale kieta alliluyeva tecuci helders wiklund standoffish windiest terrasson rosenior sudhanshu zigman dslam enco nvg tsum woodpile plonk jingyu dirges seeder grings willers pampulha quirinius babacan realis pallikal 
hindraf cdti yongping seipel bilgin slamball brelade cavalla zaca chinaski praagh fuzed thetans bairn xiannian wearhouse submarino dystrophies kulwant cial hulten chernyakhovsky skall macbean segan nathusius knifepoint superferry bjorkman alinea hosh gamkrelidze mawddach stickles scourging langille kpe uniqa pridgen nottawasaga barbells werkmeister buffetaut starves môr shahbandar sudetic hormonally tookey nazan egoists teske spains kesten jehuda starmedia coyly shua cerana mendacity daiquiri mikva pictorially memristor tricker odontology backlist daube rongai vestroia unavailing elrick addressees hinnen multitudinous ghostley waterworth pitton miltos legler dubner malenchenko trefoils nosedive vels reachout cazale summitted afip tietjen handfield patchily propellerhead gilts arboreum lauth tappy missie crispino helheim nelmes tippah vanniyar feldon screencasting bijlmer manningtree arria marich holzwarth schlender petrik banlieues seigel metallurgic dustman porsha kmov ozick gayda wijngaarden frights aceros morozevich abdulah enchilada kallie huntingtin oliseh jollies jegan lindenwold belligerency bouchon quadricycle gentlest stentorian tavon recalculate reckoner stratfordians hellhounds telegu deeg ambalangoda liburd vlasenko gogolak unwinds geoffery farsight heen julich rewinds softwoods wretchedness printworks ursini woldingham toxie lidle stilson odendaal chis imos satchwell disincentives gorelick wallman mckinnie misericordiae tecno kickham ashwani sharifa petralia kuhr keenlyside anastos aformentioned linnhe shawne parasailing armengol derec papps ár répétiteur safri davion beder durgin glatorian lomaiviti patchway hamberg faustini orok kapono duhan franglen barelli celata summerskill rachida stereolithography verifone suona constanzo homeplug hazelden evanovich crackin hipwell spherules trammps piersall leonardos margiela karttunen rodallega havanna zamor distil darrius mentees tostitos cutlasses leimbach drf sloatsburg scorelines scotoma ruhrgebiet vassil 
toplessness blofield allouez letham dinorwic saphira fulgence emburey thornber agaves occassional goehring unsan balado knoe beiersdorf perabo broths rixton hilliers ledwith dizygotic hangeul contextualised wiechert telesis spicher kolis mendizabal remanufacturing bensenville hippopotami dudleys sturman galeazzi nawfal grumbled neocortical tinman cheongsam byr behaviorists moonlite arlésienne cerion horrifies kagiso alver mcgivney maseko worlock chimamanda deringer badrul atiba quencher limuru wearables angelman southbourne dawsonville tomich harperone sabreliner idemitsu scholte pâte konkin stache gasthaus cobus usefullness viscerally sreedhar bachelder sahajanand luneburg attali barbin organochlorine cagoule fiorito pitz sprog dedecker ellijay tschichold hollman delalande dinnie haynsworth cointreau sutera jumilla lonborg glowingly elkann emollient rogov spaceborne ringle ebene tadworth golin pasodoble uchitel hertlein techfest cyanogenic reemerging hogtown khejuri senbei dpn nanocomposites explorative vostro fraticelli boyi matsutake carbapenem hertie francofolies bandsmen prehaps deronda syt yayi tramuntana stoychev dilhorne hereabouts lappalainen palatability percolator arpeggione rubha mdh meret uscc jaglom malodorous chambray yeap chromatograph leisha covenantal mishearing calmes callicoon auli clubber turiaf ystwyth charnin nondisclosure danticat ridgelines peruzzo steinhart blek horrorland chettri parastatals vertes hailemariam kayunga mendl scratcher isador boxsets cahuachi southview barbaresco nonindigenous bhakra pavy sabba bazile sparkford nmg ringback karie hoiles acquiescing sleepyhead versi fereydoon abendblatt hyperextension esposizioni queequeg furney fomc nonclassical camuy dhakal dwele dabi condat elvet berganza nimby enstrom galdikas nikonov cauterization bbsrc yuni sarabjit dundry satchell longland puleo pistilli kisen divison astarita centralizes patwari isetta aquiline araucania weidinger pinga insource bocchino enterocolitis bisse ratso 
dingles marjo umtali mljet efr robusto igb lydden borujerdi mulvane jaster intones softeners rationalizes dumbwaiter lehren pervasively acocella hfl savoring tijeras necator jabala vilhjalmur destabilisation nhler foccart dipeptide initialed cuchillo karasawa nkandla eske multijet hissed ehi belke passin boondoggle zengin jeta doyenne transaminases sunburned vlg caudle jaydee fumie cypria stingley timucuan somerleyton averts naude staving krewes yetta seigi karimun yingli foreshock cbms börner flimsiest poena weeze bellecour jerichow aames mazan trainload bitlocker okeanos southill communale jila expropriating mifid anytown texada battie treisman zacharek shamma wollesen cosham torticollis wrathall argov scheinman cutmore biomonitoring hardik kiddle thisara allopurinol alldis cardale widecombe delport mohel kasch abjure tbv spawners contis framboise wujek glemham orkneys yuge seke jollie paxil barcombe speir bloomingdales covo madhvani raan fiumara boxborough kizil tierce rhinovirus heiki neblett catsuits drw westmalle dysgraphia brinkworth rybczynski goddam hassane kudasai dungloe microbus outreaches nonliving anouar sakamaki tgp disalvo sneered omfg bharucha beah instillation havelis jks lintas mummer shabwah kirt lunardi fujima cauquenes monomania acsm chism wns goudge naiqama theun nappanee balbuena dré noynoy gamp demobilize ucluelet frontmen ratterman seegers liebes klapper coppelia fleeces neurotics cedeno wangyal kramarenko garreg bgv petkim outwork kaberle carolis zimba powrie glanmire gnjilane bavaro liet amadiya godsell stavrou gehlot retweeted xle brownout irritatingly cartlidge rutz jamarcus lofar gravediggaz ktf solicitude lehmbruck curfman batya suhaib unifi moszkowski ibnu antiphospholipid mcnall malefactors cushy vye coffy vzw rtls manufacturability raheen frombork nepeta delizia drimnagh picoseconds tripodi kihei gaetani tuh ransomes ccgt dymoke machell etoiles nikitas helminen demarcations soothed primroses rivesaltes vison perman okolo hazael 
readman icehockey hettiarachchi sulejmani unschooled thinnes tosches acdc greka naughtiness brodin gerges mosteller subsidization escalations pennal meye knigge bursey zahida linklaters montfermeil marketeers taieb martialis macfarquhar geoduck maryellen nuestras browbeating ivanovski mejlis mactier allal whut zaiko fairbourne mauvaise manase repairers geishas ecdc duco tzigane afsana anello winegar idealize andis veco gemalto abrasiveness faires singal pavoni voalavo recaptcha tattenham upholders berriedale deconstructivism zambians pathmanathan jingyi chilson unsg cleal gottfrid torcello scovel stoping classon oakengates timberg rudisha pizzorno doomwatch clippard zoghbi footway razov bondsmen slidin flitting ghuman vons mollah kirschenbaum affluents crippler asanuma manichean rohinton bremanger lívia ucits labarre wamsley boobie nipah azzaro quintillion nudd heav seery krekar grizel shunryu dreifuss mockridge spaceports ryba stoped maluma ealey medd satisfactions shandaken marram commotions immelt rukai googoosh rgp isec pumpernickel dallow alj shigematsu narcis footedness raghunandan luxuria ezy tressler yasnaya willens hoepner jianfeng ridolfo kiski wyncote topolánek kyc broadgreen intraday gollop sriramulu cotgrave obeng hamani previa vrooman tregony litte tsakhiagiin paksas gummed backtracks dujuan proenza intepretation musti diler heliopause preindustrial plaiting sugaya ishfaq diesendorf calli intersperse lindiwe carreiro schoolchild vejvoda amont jabi reyn warsh shamshir cleanness bbh recondite anastasis boskovic cfra studt soupe ebensee xinxin utri cavalese favola dravs agag whittell maati husam hypnotists toonerville okur paranoiac denfeld jpod fardeen lenn artifex haes vivianne pratts barbès chelli roumain natanson soumah latacunga guerry mecc bellon impastato rendova packie kostic pbg heren tucanae edell karyotyping cannady cict mhatre condi drawstring chudnovsky scroobius windsurf wrangles furriers sunexpress arnout awst shortlived abderrahim culin 
forsell pratti videoconference wimbush fictionalization sangharsh sambu savane unidiomatic augenblick saltires jym mugambi trofim overwintered onramp iasc gametime euphronios sokak nnenna horcruxes karlsplatz tyce shatz kilmuir dewit sorceresses aics sakia chagrined borko pennard autocourse diltz aqib toepfer oppositionists ofek reified halkirk pebworth raczkowski dikeman yukiya absented dixey fayet kampa vaitupu masaba yellowbeard ellenton orthodoxies sonae hockenberry sefolosha enderle irradiate uninfluenced interlayer goldwag nasally rozman mcgahern vanderbilts duranti mccrery signorile rogg fearghal stehekin blunk appealable derussy floorboard moxham roadhouses ciac bouillabaisse dropsonde ciat rebalanced leana priveleges bardez cozier montet katongo sautter begetting movieland pettingill netherwood guajiro albertoni mirrorball sociaux shinwell paliwal incompetently pilli larrieux mimran escs klipsch wopat kitaoka mcsheffrey steeltown interworking brilliants gardy defilippis gcos clapperboard glaum rybicki munificent fni irakere wirths coronati mccroskey pataky elvire needlefish panko biglia nataly sojka kirinyaga abridging icey nevern kalinago tillerman passbook yingzhou jodelle plaskitt gianelli kiet donator anthos googlemaps milstead jerónimos staker cabranes matschie likhachev coypu röhrl prettiness ballen proconsuls zanoni raffel besso slamannan pjsc camkii kret zol mitchelson reinstallation intercountry garling heathman damin horstman norgren ktxa uclan sharf namp vist garlasco entreats attari nabs zeisel drecker sixways positiva victimes roundway drek gionta shans kealy yati didim brancusi fischbacher pasturing porche napes mascaras wekesa wohlgemuth aout hessilhead willerby delek goswell entwine longhaired meanderings madaripur unsociable katsuhito reffered bilheimer domesticating burkill akhalkalaki chuseok asiavision yane laboe benezit kemco ostiense seafish balda zhiyong reihana muradov engles nisman kopje preponderant simular westerby hotten 
reanalyzed kummel palaeoecology loek crookers archimede csrc debilitation radionavigation soldierly samih ploughshare halphen lurssen baldwyn hotaling streetdance eeas frizzled tuteja sape sweitzer delaine moravcsik seafair skulking klr uncompromised herle delone turnbow billin slesinger bloons morrisseau loadout kibbeh calmac zuhri ardoch revelled shvut garderobe excelencia quadrimaculatus unmee reynoldsburg deanda wkn ustc diminishment balshaw breger barbecuing omelettes augers estacada seafires fistfights kurnaz benett klindworth wair chagossians kosloff ueg khatron jiguang ,so haseley hummocky rtaf strathkelvin wkbn syers ravenstone foschi carcharodon michelia whippings kday skot ruffe staysail tzion aota lesniak fudging semiprofessional lyria dgps dihydroxyacetone levie perlstein qawi astiz flamm aubier ronna schwede freeburg mantoux bramantyo maleness acle airland dissolutions wildsmith carnevali beanbag delgrosso candan tejaswini louk plec mapinfo kurdistani cmes descente cécilia felicien technologic europeanists nepalgunj klopfenstein aesch sonin unknot auel whinney peacockery neuroendocrinology expedients pterygium bulrushes xga godfree dongpo norv sgouros paulins inadmissibility steenberg palazuelos jank amathus copywriters maartje bushwackers ingleborough brusa karpen vonage maspalomas bergers auctioneering llanharan encases mcraven faherty pyrokinetic digitise ernsting myeloproliferative cajoling shakar jlo diopters fero monosyllables colaco freedberg penknife wwlp montois kalmunai hathway eireann reinisch distractors dellavedova scip rtnda ausubel rached liddon mallery jenaro kendallville meehl reeking egorova raffaelli lagrone guiliano pignut allofs marzouq westwego rastenburg gosney hichem khq schapira juca lohara syam flitch seatruck mrls antiquarks schwarzmann hexagrams hucksters khc heavitree critchfield ruxpin hudood autotune scotstown hafren billable mousley cheil lopa perella dondo forcings pankin furtively laths whoosh simcock sughra 
wetherspoons anemias eveland udrp houstonians yarden hermanns aukland spacebar proscribing zango comportment menstrie nucleobases subeditor crownsville hurvitz stanchfield sensorium claman accommodative roobarb godi becquerels taumalolo zinsser ruminate sherfield felip topinka postprandial neoguri pahad druon lonigan guantanamera heiskanen cleere sumio governement staf cruciani hornik klapa ayora airboat todds nesquik khodynka llantarnam fougasse doleful phlebitis okoh kozinn tigon dubonnet triumphalism wegerle sternhagen neurodiversity wates photosmart infn squeaked vexations raichel branchburg midsayap plese songaila balderstone mewn ingpen berneray marnier deras clampitt paranjpe morosi gossen breault multitask iaapa informatization progs ningaloo tacticians pillon schiavelli wongso gingers didar cndp ulaan quesiton overindulgence talisker fuda discographical hauritz petherbridge grillz irit twg scalawags lebas farshid luciferian wcr crimeans betamethasone polyolefin clairol francos ventilatory buxhoeveden holmesburg chenghua troche woodling houshang adai alejandrino nessim porchester novem braud meeropol dorigo vandervelde singhs becerril hauksson kuter besim macronutrients overclock oración ruga kullen asola mastrantonio burghoff caucused wachsmann flsa carrolls feshbach etape hopcroft flophouse westergren perfumers freeskiing unshaken orianthi azis colmer volsky asianweek coleus wernbloom gunzburg errigal ualbany pushto estell derwood grassby timescape spani ridzuan unratified bernoldi slawson understudying rld gosselaar gallwey bikeways tejpal koegel negrita spag chrysotile lucetta zanclean datz mcgettigan jurevicius otn policastro pettyfer othmer hongbin pelphrey lulav vilella cuckmere nierenberg delimits freital mystify fairholme zelnik decelerates swamis diran castigate merkerson bulis elswhere mazzeo amru ewins searsport tastemakers petare margrete fowkes jiaming equivilent mckeldin huseby vasilyeva wiedmann restyle anorthosite rangy nanog groover bantus 
qra cshl grymes jomar cordel kolka murthi phthisis gunwale wonderlic bionda araminta foaling awada strainers paish sles joannou hanbin junkman tercel uley neals bizzell lardy illumined whifflet weenies scifo leipold narnians workwear paetsch carris personnels gunji souphanouvong netroots hectolitres merga kyrylo corsie fesenko cuber süskind whited beanies christon snarkiness mypyramid mousepad malartic izabel silvretta halyna pottenger landale leonsis acclaims serino darl skar standardly eya cfw iglehart taylormade cantigny zajonc inamori pézenas lokman attn marioni payami aliye sauers osagie oxburgh epicureans rocketboom kameoka portel takashimaya toyopet goertzen yariv schaech yojiro ccleaner haylett mechatronic johanssen kononov ercp wesendonck brokop mcad botswanan terk shudders choto dissembling burghead wasti yeldham akinfenwa valensi lingvo izvestiya warung stanisic dumai eharmony unmetered georgiades memmi howled mankin merchantability freeney anonymizer namus burps economique jadeed sthree lucayan factchecker superstate shante forelock jaheim athanasiou bsee ezzard tamta amruta bookworld playpen dharmaraj metabolise rezo votorantim nolle choosen bunion fpmt jenji waterslides uncomplimentary lwp sikdar zic parkhotel guayanilla sokcho oah ondimba dilator lgl taymiyya signficance kiskunhalas egotist araripe nni vespri ishino barny tomohito tjc velzen transects gastronomical gameworld profligacy richwoods eckley lendale geochemists walsworth kuryakin beerbaum eisel skewering waldie boatloads essm statut wery gerbrand kiwami giff decipherable gastropub hesseman unexpressed jucker tejero detente workprint haughley canalis naem mccalmont oakington lifeways renauld dibella easson incomers itsm jacs kaprekar nwfa looter oakden suppiah intersperses foxit ecotopia terrifyingly doubloons recognizer glasspool tapo féret davidsons paranasal internetwork roulin agritourism midford pandav sodomite wurth vanel gyda sachdeva altesse fumetti kandiah szwed waldenbooks lambdin 
itvs pilchards neurokinin minnich multiphasic solipsist bmu idealizing eilidh vaselines grealish belinelli ashurnasirpal ahb carcieri lybrand heaslip sigg bottlings sawin sliwa gladding asashoryu enumerator orquera semenova taitt giedrius waha sneakin bhagwandas cange gardini viscid lepic jehad frognal redsox clastres salming laoshan polychromy burca vugar tingay amaurosis deshields steindl estherville unfermented desplechin gearoid xds mangusta abdillah yellower defrayed vandeveer wagman endura pickoff estriol dsch wormleighton stepanenko sharnbrook cadenet derana khane chovevei needlegrass padwa ghorbani chukwuemeka backchannel raunch mateship buraidah casamayor shalikashvili kainate astrolabes preiser abat sagged cygan tavarez diagrammed perillo remissions bradly seedbed lesly canfora baqer petróleo bacar kilberry mildness realestate coupole gaddy gmh ancillaries teyana sile spiaggia kwanten dourados synesthetic wunderbar feick baratz flosse musandam nyhavn loreta overprotected conto astwood fruto sölden tricorne haass schusterman keigwin khawr codifications kapali unfulfilling ogu halbrook samiullah yabby heshan lederhosen tehillim sipi charner fesa fromer bongbong stefanki southwesternmost mizelle commisso onrushing sejima pibor immured bussie khosravi swatted hatikvah panpipes melonie seraj ulo pelkey verbiest mccombie ingibjörg vecchione wagamama raxworthy russin tansman candis aynak bovino scalzo menter guttridge immunosorbent gallowgate rrv gialle chazelle broadnax kounellis airfreight xiaoling palimony fynes gorokhova jiggers micromanage cnor karmichael aapt quisqueya solicitous namita pachacamac neuquen pessimists talend greengrocers saryu uniao hendershot barbon markan sawtry bahaa yamcha givry unaipon inflammations stabber yubin provos intercommunication augé decio henrion aaryn guangfu sagada miral leguizamón legitimising crysler courtemanche hennesy cockington leibman conventus jodhi sheeps egidijus privett chicksands hudal rinke finestra borun 
aiping cesarewitch rainout faïence jagielski moisturizer trombetta gauzy holmewood wabush weerasethakul milholland sandall gargle skyla overthinking jeanson anglocentric ochocinco paperweights khammouane submissiveness westbeth markeaton newroz wmgm wymark driftin storen sectorial shecky kittles redrafting dalitz airglow groupwise rotonde hellfighters kinka gurmit kaser mondulkiri andrieux athaliah clatworthy pepes personnal conviviality vizcarrondo joeri makharadze pipino welly bovril sletten krylenko studiolo matis odem augustina alario felicidade harned wildt crisscrossing fibrinolysis lamplight ennals buczkowski carpentersville upe blavatnik prospectuses quested fissions muraki bundibugyo wiederkehr scourfield devastations jolicoeur kardon unvarying grl exfat unwatchable alshammar snarks cliffsnotes arngrim davtyan zurzach wrightington boice kostopoulos ostroff manezh multicasting ocotal ziemer ronee perianal tegid kumgang palena fischetti ijm wbi jsb spenders martyrdoms implacably unbelieving optio ejaculating soacha lawanda guiro clothworkers horehound beaubourg qattan psyllid mcgrain wittes maniaci jailbroken cegielski rtca alphege moonlighted vanves leinonen tzipora avaaz motijheel roura ferrofluid georgianna anthroposophic kunka noiseworks photorefractive etm typifying desir tiburtina slaved glaslyn siyi passel brailey prosector elgie mcpartlin sportsbook ghazl kellow patiya mispronounce wrangham farenthold cohesively duping hanshaw rrl inish maufe panya polus profiteer whinge almen woodfill clugston inhibin ruchill heese assem chaplaincies frazzled hensler memorising beens galizia gruver cuttle forthrightly tyack ndfb neubiberg kaptain csny oscarsson aristizábal doofus cosmides bomblet micaiah hammami rahmati wohlfahrt rickrolling wickett sparkled zadig lombardini denbeaux jordanville beauteous tishreen destabilizes kuchen rossiyskaya mitel liberton sranan spintronics garlanded bernero sellick jiwan agaisnt celant rayamajhi ameriquest kolles ingvarsson 
trenta touray coulby murieta asafoetida durling strathy servizi wyville moxibustion roundy playschool edgecomb somersworth altissimo ziana menteur qilian alao topographies mankiller foreclosing azarcon nordal misfeasance uninstaller médica zeiler ameba balang wrtv dhaba sehorn ludivine neighbourly wheelmen banesto antinuclear chiddingfold yansheng kbjr danceteria parlby splendidus incroyable severine kreiss darrick wailua skoko rovner goodenow yodeler wujiang mrazek freundel dangi squiggles aberlady diabolique mcspadden morgues harinder toadstools valiante unfortunates comac icho tihany princen onigiri infringment gratin circumscribing phimosis marvellously hardeeville subedi bohle riai mallaber tobita lariviere melan cullens reliabilty bonanni evins maneka rockley endoskeleton sumthin chargesheet florsheim trembley chignik kishu tariana jinhae ktvx benvenisti cottonmouths imec reichstein wholemeal favaro cierra conciergerie bakeware vasgersian iln sedano bluestein blitt dahshur phytoestrogens tannersville basden pterodactyls amed sagrera harptree firestorms birthrates gulnara soie sixpack delila karow ayen vereshchagin tieghem notifier hoatzin rahbar durio baisden vides castellucci massara didrikson icrisat hamor offner rothorn zieba niram reversionary snowplows ratmansky fundi kelloggs gapyeong chriqui tasse mannis shennan cozma rostova eeghen undershaft jinggoy destine mosaico synaspismos sumant colaiste kurve bhante teahouses impertinence clottey osteoblast uhry avontuur clon epicatechin hagupit ghaddar mürren sohel ough gosar benvolio democratized shakespeares floodings deividas helbling kmk unitar scorton sabs freilassing erstad brancaster keuning molybdenite leuser geula anythign custodia thrilla retrospection cappetta buki stefanini ails velleman pelleas toyshop tawana cleanses kouri johri kdc jabor turo dragones collaged gorai bohman oneasia outdid karpovich séraphine butterwick godar horchata andreani phalangists poppi druse itten ixus pade nipon 
salmonellosis cuy alting divisionism wallie guinta firmed jeannet bolkestein covenanted beccaloni dedan berdimuhamedow vilbel segueing cmcs hrubesch paysans hanwei masimo gushan kretzmer radzi kassie bacri usurer desisto shotcrete burs cherrywood medianoche traber chuvalo wilhelma imacs momtaz daykin trient arnhold munsan fahed booters acj danah pillinger kofa spandauer gjerde brighi semioticians gibber xserve corrigenda mullumbimby gho hillhurst jurats bergeson josi pompa natto millbay surt geran satriano mingay freeplay gaman gapp bresslaw coull crowson mcfetridge adebowale peppas egomaniac macedonio nars songa kuss poundage inorder morrin caythorpe shuyang interrogatories bronston oliveras ceren chigumbura tappi tingo tulley djo docility caam sevillian geddie ciudadana knead metherell rubida ncsc irven boardrooms hotell salinization kinabatangan sedo telep kalala enlarger danoff weathercaster maragos frappé aerotech tywi glorie dharmachakra adiemus hudsonville bobbye hellerman fpn erla minchinhampton genny salanter tsurumaki appelt zinzan litvinoff consistencies beauchesne obraz carluccio charwoman giacobbe vivaro calpain sapodilla musonda woakes dogstar arns nooijer alexio anai budiman fingerlings satnam wexham interchurch cmhc arrack brazell cottontails blinked laurey mottley hermsdorf qanbar pixeljunk cuby fourball zareen raybon rebhorn microwaving monkshood decentralizing benozzo skela glaziers nussle locoroco molano vimmerby dietrichson aquinnah guernseys nedeli counterpoise cannella clardy wrixon propylaea blatch frohlich orfila phonetician thommy buraku bornand pechanga herria aterciopelados pendelton propellors ashlag lixin ivanenko comancheros tschopp solarcity kollmann hiden synchronising noci supersize ecus demetra oelwein instrumentations proficiencies garrigus kyriacou classement pencoed cubanos crosscutting lueders vellai cassata shahul grupp bergsten zecharia metzingen palea duihua messers vicino saliency hobert armyworm etec mispronouncing 
sacheon altix syngnathidae doka jefferts csat leaseback greenlighted nikolova florica papagayo sloyan jba hotton feedlots recalde essi jelks evc linspire rahayu unconsummated warchild lellis verité chisso ketut calcot onslaughts sucia haemolymph requirments lapdog zingg toytown roader sesostris mirtazapine guideposts otterton ramorum navarin deinococcus alnmouth westampton qadam bonwit verducci antonopoulos opinons controversal thinkable hardwork feltman kukulkan contraflow dalliances sportacus labre stenographic biotransformation compactpci khaing golshan palmatum alperin rizzio kurung kdh blacula saltford jedermann hasim cyberport origanum naud yvr mutliple vandergriff desaguadero mckennon liván vocero babysat awlad mollohan hgp tiddy ebbert talegaon yiorgos redcat ellenor cosmopolitans legitamate embarq peotone malalai posthuma kajiyama marilynne scheuermann waskow contin dand traversi talmon asmik megaproject nevius cheves stallworthy goheen polwart elusiveness knayth dehri wakin esport bibbs callanish burkinabe panero doogan batterson peric flub ltn coarctation flindt garmsir punit iachr sanket einat illana merricks touran staithe andraé malaysiakini predjudice sumika anwr adeang buddhadev mattituck fluss mcclarnon feile matola candon mipi babyz turfway randomizing calayan bathans fadia midsouth lightkeeper crites ludens fikr gegard comand lerici krit glasco virtex lefrançois dunwoodie kukui amic mensing hfp lorella jabbing trazodone astori hbg kleen namdaemun przybilla bentleys michoud cleavon stirlings hahahaha weatherization joyo goller sandwick taja pyelonephritis hlp terrestrials gurira hegle fwc wobbegong olding elzinga eltingville minny bluffed toadflax nakhla luchini boorda caronia runout costessey penhallow dysrhythmia approver egberto dazs sayad vanita ovulatory squitieri eifion griffithii snick powderhorn asiata konner tremolos liptak reducers juiceman raffled yuyuan materi gustavian roundoff cnas cuu seborga budworm dispersals endell fontella 
hesser automaticity magglio asali ghad suckered whiti hogweed cynosure narcisa obediently rukhsana sentara consalvo tissington thorsteinsson mathon hazir recapitulated depaola funkytown shamkhal floridan flyswatter vanko hinrichsen varto frevo wrva preheat postale cnic mahara footbal layar annonymous ainsty albertazzi empyema expiate kharms hardtalk malaparte birkby aboubakar kawata bushfield xiangdong zaltzman sahr knipper sieh eagleman darkwood ballater sdio pfn pawlicki caldy pinzgauer baumol olek faeroe bonesteel postle sackey totino carnero rinka battiste graciousness particulière nulle mopani chadians doppleganger buday readymades bourland netherdale sonicbids bertola hfi klasen twemlow xfire paneth bunia htd suw toiler specsavers outfalls boneheaded prasetyo fcoe highcroft pinatar chondrules mazzarella maxwells honeyboy herra vigas orlo cheyrou educacional reteamed seps snarls stillington ebeneezer reconsolidation seiche drydocking footfalls balter hanney loenen rajavi huckins thymol teba levofloxacin matthewman mainali atwal beymer qcs jaswinder razaq jordache nacala tabacalera yvo avenal dobler donncha jklf bendiksen matza ejaculated unreconstructed tabet bellier tradeable hiru foroughi uslan cmac nygard tenley bothroyd reede zacchaeus alburgh muammer phippsburg nki utmb vernazza cartosat bulanov profaned navaja surjeet gabbay mennea allods keymer bullett biozentrum schoolies deianira forgas sextans hseng impractically gazit grennan eastbury hashoah psychodynamics piff ranocchia snarf olejnik uhler kingspan beeley anishinabe gumdrop pantaleone krasnow hasyim counterterrorist breanne nzrl freedland slbc cuil kurn fruitlands lecher oit drumbeats cenac voulkos boatbuilder arturs endpapers brinklow kenickie aneka reznicek gerets fadeaway nuneham cockatiels tadahiro lonrho decter lobanovskyi pebbled sighthill kurir speen agrochemical btd fcra sassoferrato daylily fapesp heyliger jiahu rielly mubariz türkan holste sufia minit prtc litening minev undernutrition 
initally derks sillars yaghi orchy recrossed gorog kunama arki manzella mcglade jayakody salata auchmuty rimando reedsport konosuke rjc bomford jonang atec markka polic cantelli anyting enumerators berty groyne benger swailes repatriates aethelred tryfan abud siamang alredy muin foard varta debrah afterburn giancola rainman pollards nagged tergat deason adventurism culliford yettaw bouschet birse snit nanyuki windjammers ukrinform gerti bommer yewtree rutherfordton stoyanovich apotheker teresópolis harbinson lorik kabo bowmore unfitness sugrue pogodin minahan demars bellissima embellishes phad aigis jefri impliedly albermarle merlotte raceday narm ajia akbulut atsuo brenly cosewic thaye peggotty freebooters toxicologists chrysantha gaghan jongleurs taraneh asef monsarrat microcosmic mankins kcd brueggemann lumbosacral taskin chemoattractant jellied thuban aldbourne markinch criado chitlin orgeron gvs earbuds impermissibly saomai buckminsterfullerene hédi epigrammatic njoroge munhall uncleanness berresford buste migliori ganci faught localise kolla ringneck okpo choling mapother penington vinalhaven mengel elting mistinguett komarno pleming megapolis colavita felli zeilinger massport escribano siriano vilest groenendaal goffey osmers nettelbeck apparels rbr hygeia soapboxes dechert almodovar airth weeki jouer squillaci yangpyeong vette shipway katimavik pipex avails twofour pomerium welbourn decriminalizing northfields ainger landsbergis blaen gasparotto mendax blabber wagners kumyk gdnf jesting notaras threateningly defeis birtle papy braunwald rimmington polyclinics kneeled spoony repast rocketing tmh trapdoors joash wicklund quatorze tamburello tonguing deceitfully maxted schiffner synesthetes nesler elysee tunay gemignani hockin naugle microwaved calì nimani berro giratina lavrenty presale kiriyenko shachar katsidis hairpiece rotondi brunskill haitai bhuta ibope diktat keloids jordanne coloboma winlaton deare kraak fape conflux mundel meel agunah noisemakers 
devesh kamenetz kirkburton nuxhall onagawa kornfield grimson recode nsaa heelan aggrandisement proses caborca circumnavigations ershov msconfig plimoth collipark labarge gotan silverpoint rustum hews hirohiko plomo dillsboro combatted felucca kussmaul nizkor ardan schriner lockroy keshavan ulua iafc dushi gaustad scotusblog tianmen eppard banuelos mazy sarakatsani widdows alyse nectarines goodbar slighter hopin aerodyne requiems nscs karada damnit brainiest samurais lennar splunk bassas sonthofen drooped yellowwood masius sealife kishwar hypnagogic kavalan virals joio bonadio hypothermic byeon aislinn isobutane believably eidelman mellett ardara bonbons jacinda bartolomei flatout acree derge noell downburst binu jannatabad unrehearsed afrasiab wargamers collusive bracci carnivale bitting eguren appologies colasanto thami shavelson cjp nale micaëla heah transeau oiticica cabrita synovitis baazigar earlene licari exisiting venugopala alemdar wbb xingjian minervois pestovo diamondhead biebrza decastro popovych kaysen jgi gudbrandsdalen jacobina carmageddon shchusev purda hellersdorf vaquita somsak maston stonham wuest halkyn smutty siano jedda macroalgae nasus nedo lopped sico multichoice bft kobler creatore micucci duston broster shouter regimentation anaerobically danowski galoshes biasca plucknett dhir furans diversely fitful houda gharbia stampley paccar bhowmik checo heatherwick liniment orbi ody hemangiomas wynnstay balkman mocker ziemann nollet arntz parcelled linkup asadollah coity fiendishly bleeth dingus athiest damia digitising financiera nyqvist messen bardfield liacouras volkswagens llcs marsanne quinlivan marvis ellerby nicco huntingtons chaabi abscission kgalema honeyed ruit edger mazowsze riegert meisters subida gediz jergens aufidius bajer acreages elsayed unmounted unspotted mosheh lavillenie solr oversampling hamerton amiruddin softley shutterfly menden readthrough challe hydronic stupar eoi deferments carbó caramoor finalizes mousinho aprista yazan 
zabi brandenstein selinunte illum brachypodium reagans southmead centurio buzaglo enslow ferlito móra quintas nonresidents sabai peregrines kdvr caldbeck gerbe kadhi dibner succisa eqt rivelin sabrewing kimberling dumitriu gonder toker christianizing cimic comorbidities bacalhau mckeague guta raburn artifical akzonobel pinfield llewellin genesi krasno zatlers cartaya deavere incertus listerine kensley bittu ingénue thoracotomy kostecki rebellin ohene derated siham bucholz gribbon lepel esteli widman clerico ethylbenzene embroider luminus wfo gurunath matveev cuffley krasheninnikov havo jammal jiazhen terzian camelo quietude pocheon lems nyali cnac spartel galu studds unmissable essent zanamivir witticism sangoma serero rondos spivakov volpato tannis philosphy sarafina rasher trimarans hartlebury wxrt fullington whiterock desford moumouni jambe sezs arimaa packards dragonetti afterimages shuzo ztv proliferator nanka rueful fayerweather smeg itep eitc herek saens lairg veerle nockels balser oreilly ardoz lovelife girons mccreesh adaa kalicharan dehydroepiandrosterone disgracefully ripsaw bellanger rafto sarc gwasanaethau csia jagath rajmata lepidopterists uwais disarmingly prerelease subercaseaux detling abk locutions icmi hruby boudiaf engorgement franki ccmp zhiqiang dhanda budeaux globs lamble meshulam kukkonen pampering shantel zilker djeparov nashiri novoselov balmaseda piezoelectricity nagamura marange gibril saum bidmead slatington lupercalia soce rapley uunet cerén bastone telework blading dbrs halderman ferreras malinke grayston iph lydiate airton rightwards curzio arular benzine phanatic foldout dewdrop valdepeñas townie ivankov dices tutton surtout bittle khazarian megi sulston samaritaine oreskes wainscott brard fbu sectarians tirrell kucharski auxiliadora sametime neuroradiology icis aonach winslade muskat overemphasize titterton soad floodwall delmon weinzierl tondi namas kyriakou cerulli kimera croze kammuri bnai mawby paymer thanawat babycenter hawl 
dementias ellam eyen spicebush breathlessly bashundhara snn ishimaru amaan hornist saltation uicc pithiviers baxandall hollinshead iten medroxyprogesterone tusken mawali kubilius spicier jumaa skeels polie bajazet lamivudine topman shikun feda coxcomb finitude govou ampon elaina etel quennell samten passingly chlor keedy gracechurch tinges angsty opendns grohe tarantini matsura spooktacular renker queally restituted vng chessboxing vause marfil sieved strasburger stamler airheads maehara synthonia instrumentalities mge sensurround acheloos mubadala filhos foreseeability bickered copano microfracture eeo haeften manier arrant newschool gaine hiplife karwan lorentsen wachee scorza pyra yanping delwar kameshwar mbw kirya mutator sscs alsdorf kumuls partanen eschscholzia desenzano maskey soundbox rathdrum khodadad incomprehensibility maryja denniss edmc tempa emissive holtzclaw charco szadek antonetti pean roelant sulfites audiologists costanera itinerants chayes liubov boumsong relenting spreadable kcpq taia courter otherwords króna rescinds gayan outranking isolationists wassaic benthall bandanas priego ragen aztecas noton bienstock iliopoulos gatecrash medicating gumtree proft benziger taqlid legging astri beachgoers wistrich saujana nease burdge spearritt halasz gingrey timbs lennoxtown darning grises ascom magnetizing motorpoint sqdn indels mascotte wackernagel unworldly overripe fugato dhf nixons inchbald peartree teitel tka baug racin nonwhite tiptoes birdsongs bosnich intimidatory soulages isim cristino belshaw mazal djibo panpsychism danzer pochin tads kyran etalk homeschoolers wilbury edgren horologist worts sinosauropteryx bedhead samand iniquitous moralities raincy doodling granovetter cifs johnsonburg franek vbl kimmons mulatu germanys relicts havertown perdues thundered cgil disintegrator forgione mealybugs dreambox unbleached fiskars khosro parramore fule frankenmuth bayanihan moidart kurultai zwilling malecki hopetown speziale insidiously kapalua pees 
lavandula dadullah haberland mesocyclone elenor airfrance kuhne astaxanthin commandante floristry brandler sycophancy merin teather javy bekmambetov copi pelada kundt mabhida hice stauch enfranchise caereinion wajdi legrottaglie disorientated larentowicz norlander rimal dorcus ccrc sivalingam gors hollenbach lubos rattay soarin sigrist bertarelli nolting vonnie amerson notifiable spacewalkers deigned gurgle narbona feustel croagh naville glitnir abdullatif jeshua groudle dionysiou menfolk farland sunriver saensomboonsuk sutphen arbol salido migas provencal rocastle gorsuch invermere syncopations mullaghmore filmland sluggishly sagala procrastinating laurita choosers sagd gosl zabol msimang charif acerra mcgeer harworth denting cmea ardeche abucay dogz galien senni strugnell levinthal quiambao infarcts mirae gorriti krash eame zingarelli chernyshev soroptimist traves hnp zhores dysmenorrhea belbo lolcats woks gerini fintona deci ravish iturup mondawmin decors delbridge transantiago slaloms mathcounts mabuza autocorrect riha willshire kombarov cairnryan schweigert innovatively akanksha hayflick sommerfeldt landhi spinella strahm sterritt formspring revol machiavellianism pwb unferth josceline clamber misidentifications debelius lobotomized ringu nasrollah stort teresia newsdesk paradises taiheiyo gobblers fusari provolone makarem kuzu alxa tapeless elzbieta rebodied inconveniencing frug trufant flextech autoclaves krstic repatriations multivalent chirped joropo sandis neumaier boshell arsace lellouche birkner cymmer uliana bronzo racon demagogic kapros utilitarians opheim quotidiano feinman sudek lewisporte mothered circumlocution sumati exurbs noomi swineshead chungnam csokas houtte cseries morry estima ereader reiver conradt duppy bipeds intarsia pendeen bungy bracamonte banane lurton musina kippa sesam dithered dionisi hoiberg souders brodo khazana abdeen haja personnages trichinosis lajja businesspersons csid ndis tamarkin almaguer chessell hattem cabrol brillo 
netcraft barz ossman growin afsa outspokenly scénic arafa meana schadt gastrectomy umhlanga shatterhand nowland adwick bahamians nynex linocut harebrained mutesa blemished macgruber deshaies kalaallit karsavina rondeaux flashcard thamrin oneal ponchos authorisations ostlund hellingly cosmi jerryd tailbone schamus shreck ottomar masturbatory calzone osteomalacia smeralda tamminen kotsay reuschel trezise vidagany punctuations hobs vallet ephebophilia reuteri generalizability soszynski eurocode enkhbayar decaffeinated spertus vogeler surve eja deerhound ophthalmia viguerie uly sanyu schneiders chahe safranbolu europride ipy temba blackbrook harap erasures recalibrated jockeyed chantler woodvine kaufusi htl petrification mamey sabates peregrino jooss brogues scherzi exclusionist sundog dogwoods teem colorblindness sanguinary jayakrishnan carros dren yasuhisa doute palaeontologica reanimating dixwell suai booktrust momchil gannaway rawail alverstone scuff cumbie dynamiting obus yangpu tularensis bahamontes forder assegai inhalants pronator liklihood amitava bici emmanuele switchman commandoes sundstrom mochudi rebuffing lipases batswana anousheh badboy sahadevan blosser poyle restormel haïm salafists fanuc littlepage cuori ganjabad manoff vertiginous togni kapos wimprine pehrsson fourty goubert stinkers sunjata codis setra gaisberg weatherley circulators eskandarian debaser ddn shibi colca termon huffaker belanova undercliff nonus kesang samawah poons saltwood lothbury saluzzi stojanovski barnsdale limekilns allbäck dimwit kartheiser toques ccafs armington disrobing ziyu tzintzuntzan mediabistro yitong konaré longrich classé evariste vallette preamplifiers guttering mably baishi jcd tellem germanies floggings larysa neesham strongside hhr approche krunoslav pagham países springwatch molitva rockbox tought hippest eskandar luzenac ardabili freema dogface monocultures pruvot tavan condensations successfull miyaji khom wastrel elkland scaleless salek somerhill brunhoff 
convertino mommies oase squarer sweetback jianzhou warzycha sheriden allchurch ismaning popgun pressmen clearwire baisakhi norac backpage mbasogo iiro parisii hartsburg malach iwe jianli donnacha satoyama tofts alpern primatologists waterhead digable baalu tyle limerence awen guzzanti lochwinnoch llandinam dolen lachenmann heery billu byk campoamor ludwin uncalibrated highnam brisas overath additonal allsorts trevone rivel mnu komissarov jerseyville pagliuca pursed vanpool pfft pinjore construes pscs truehd spanair bytham serta iziko usis popek nené demarchelier ndour kudoh gleek woofers hauger petrou uncatalogued autostrade xuecheng sociopathy solomonoff baisho boboli robichaux iaff xiaowei wasters dieters lightburn dgk bourbourg wenchuan pinera hoffmeyer sanguisorba alyssum rasel hooping lovren mayacamas tedisco genzano kipa leckey navair chinks llanover virgatum cardrona rosensweig teni observador casasola qomi peugeots malmqvist teagle intercoolers avionic naranjos fastbreak serg halfin mpn bafflement rebrands vacuity marting navasky toulalan zesty celsa muston morones kythera nespoli reenactor unnamable parbati aow skyride sachkhere penmaenmawr tumilty requesters soongsil pipping geerts zuckerkandl tulipifera backburner dusenberry ardua thiede marto immobilizes congers jonte igcc kaczorowski bakso ghurair imputations coreen gobsmacked glycolic securitized qingcheng chusid jorja portet lepeophtheirus rambaud calcifications ipro thurible unschooling therry tabac mudejar pergolas sverrir kneen quadroon perrys harnois tragedians sicl nival barzagli mcgirt ludvik urticating alibris pratim zhoukoudian mucositis degraff aurally reimbursing umran theale hfr léaud larrain exonerates naharin ipmi motorail rebelution kbytes disembowelment musallam izturis eulogizing switcheroo leibler constrictors readouts ryderstedt checotah acetylsalicylic salita ferras mammoliti brasseries godward jarana pirogue washwood guohua reichle gaprindashvili bifacial mudfish schochet ummmm 
shadegg zauner cejudo ewhurst kosan cravero hanle telstraclear bahagia milonakis dinkum fujiki chintz verplank kroeker boxted drais steinhagen phalangist loftis iwr oddballs megaw stanleys nuernberg explination fliess arsala moulson liling packin lionized seathwaite spittoon pietrangelo harjinder esophagitis cultish assessable naydenova osirak angriest packagers panayiotou nathans climo shechtman purvey bortolami kilovolt drakeford lanterman magistris porthcurno cosier clearway eliteness forenoon ctls gunbattle claudication erfan jamahl sheelagh polystar seang wadd tradecraft cummer computex transpiring extroversion espie tuten forestieri misinforming hipc calixa taso pinctada coonawarra florican bourgas stelly opsahl demelza defelice nikolaou prudy weichsel forewarning cayos loriod admont weltklasse tewaaraton roski vojta xerostomia gerolsteiner katon sobered fichter dodgem grafite cimt tecnologico mixi tugger wtrf grahm reavey kaimana truslow boeselager ploiesti gopalaswamy armscor swed vocalese anganwadi bordet gewirtz mlambo aedas rauff andrology ratcatcher mcgahey farlington huiyuan abejas furstenfeld knockoffs matadero wratten gilmor pentwyn barrenness decoteau bulga jerrell tuhoe narcissi gallopin servicemember rotators pasaje chatt dunkerley sadollah gandharan kust kamada zevallos casero repertories multivitamins quantel farnes kotorska graphix kinzinger bonal brondesbury latchkey lucking byward invasor fumigated wearying dunkard eaubonne cherri certains lobular pengam fatema kurihama aucilla cdcs flyboys reifer jankulovski kegworth proscriptive unconsidered forca esos transfering palco jenas ffn hispanidad bakool ayun buttercups postscripts strohmeyer valuers mumias rheon mowrey avallon norb latas suckley lykken cuttyhunk veryan fariñas katabatic hasni kiz octyl holobyte muscling manteno maike kema timeslip wordsley nocebo akuffo thaindian fanum margvelashvili gleadless liberalising bcis ngg negash vinger acclamations kakuei cassuto unconference maaike 
shangdu christal almazan argueing magaly alpbach zelinski pó hartshill winelands spatola caricaturing humby shardul dystonic clerides jenolan lizza waterski keran metr reverberates malpais eventer peterka neuropharmacology arnoult zhenwu stj neen ifri yahoos okonedo ribalta odumegwu rued blackbook satur tambien bunmi mccampbell slahi pnds yukimi ironhead septuagenarian lelant athole pingdingshan troccoli crossbeam prohm queralt davood dislocate filipiak detroiter grismer megafaunal magpul limu blagrave datacenters bergeret complexioned zemke digiovanni moulsecoomb ohrdruf narratively mallozzi misal otisville kerviel chaklala cantanhede yameen moony afroz reckord darent mipim belak lydeard falana milda digitel zakim nankang mcilwain kbit cannelton tricarico briavels cruickshanks mojca unresearched premedical berenyi pizzichini tamerton bads nige ebara jonn soderstrom gofal harston giambrone oppresses eurobank shrivel ipic nurdin fakhro nannes weyers inya oxygenate etemad englischer mentorships madad gtn thangkas obb unfreeze eurodisco bisulfate dolar ashigara seagren solinger milliarcseconds rossner gallini pacifici chesworth inzunza pearsons suffruticosa stoffer aulie weyand palooza haiqiang epu aoh eggermont unthanks iriondo refahiye vinyard roce chindia urasawa orense misjudging shelvey madjid blumenstein dlb ruelas wgbs trinkle cucchi xinghai tupe jabalia bloodborne clammy frisbees ivailo iatp renie lehder teofisto cargas tosser swiftwater etools doumanian battleline atchley teana folkish muscari hanken hamaker sudol aviatrix tristana dekle chueca airlifter njal haruyama schotten snakeheads majorino horribilis iveson radarsat klout arvanitis exercisable kadim rosenblat smelts waymark tropicália kiuchi sunseri tokitsukaze billen underachiever gorbea bipartisanship shaoshan sheepshanks oughtred menthe sevillano vinous soundless cheesecloth harassments verri rafalski conad gips sican kene intractability kilcoo qinglin lehmkuhl armellini mokulele lauricella mancina 
pomade towpaths glycyrrhiza zipes electrochemically kirui rolfes charlee tinariwen cosio nwsc tsegaye autosports bissix misprints savoca conflits westoe rhinemaidens checkoff traudl ktb hilaly mariga claghorn facist cahow izapa spotts lotsa kobs schweik gulotta kippah presson jentzsch wdrb maniwaki panerai glew sniped striggio cowton panela schrobenhausen komiyama neutralizer kaczor vicentina zimbra pigtailed sagor amulya vlahov hadl pickel teeling penkovsky misting isetan sticht johans gallaway kaliya casazza xeni lauterpacht endoscopes mhór lumpkins greiser animé cucurbits beristain jamet stampeded sermoneta jerboas wedo battiston merel sutherlin nontheless stansfeld geremi verstraete sxl spadafora péan policial plumelec gräfenberg lafd meniscal plasmapheresis chèvre lls cimex skor wodi indiecade calegari retrenched unchristian kindergartners bharara steinbrück cissp zilin chosing batallion raec katin sahagun dugi symi sendoff misconfigured kgtv jeopardising relaxations montanan doorjambs maradiaga senedd dinmore copus narla shimotsuki photobiology habibul modbus haxton escrick southpeak stz resealed gairy yusen sledmere breukink hamdallah dusseldorp reintroductions littlecote kshb longet kova dolemite wptv morando clingman malpica wenbin kogler muhr weeb devia eeckhout kaladan warleigh firdasari hanae unilingual tsat kassin connelley arauz folklorico timson headhunted falch hispanicbusiness charmouth supraspinatus rhymefest wilen gasolina wennerström ailton pakubuwono maxing concetto yohanna stenness skeaping bumbu flumazenil denardo rutnam triamcinolone tameness jianjun dellwood profanation nade lidz dieffenbach mahari mayrhofer katzrin mompesson pashtu fcx yunior betwen barooah khokhlova stanzione phife tseten corbetts rychlak wilker decanted shiso deaner tadese dargomyzhsky waipawa preshow togas hemis unstrung whiton mishaal martinu salguero houseflies bortz saalbach assoluta rahon slimey aurizon fishkin basepoint excavatum superdrug delucia nanortalik 
daftary mostro marise alexithymia mufon alanbrooke namara eggesford lazur cyclosporine minidv kcts penygraig tentacular longis pichai tussey lockland kitasato iashvili tecnicos zhangjiagang behbahani monarda pettinger adisa silvertip constipated nevio cellcom hatzidakis roseworthy buckham grisanti verdejo ferriter rockfest queensgate doubloon liliya autofill bêtes jva teeuwen arivaca kizhi responsable matejka sorek marouf calders fornari skywriting kaari mailey bioplastics guttuso routiers ayim cabotage bloggy baragwanath poxy panula marchetto pasubio agona crumpets holdenville marvelled latecomers foston kuhner happenin kaino extrapolates dramatises influencial spottiswood glengarnock keratinocyte nder ccms hurstwood pepsodent tilli feigen popeyes bollin vranitzky demes caruth kingsfield suleymanov elanor coffi gastroenterologists monkfish kaysone iela brinkmanship plattsmouth lezgins yenne pupating ssci foulon mossa gerdau mokbel malonga helenium maisonette alcee deepali huddie liebeck abdulwahab logística everdeen chlamydophila forresters marku nebojsa zichron amuru royalism streete winfree kendi bentgrass tsentr kozik modelica sabie mouseketeers sunderbans naohiro cineaste coiffure scappoose theoren godinez daimaru lympstone mongar astakhov moshood pizzolo warlordism flahive christofi hiam hutzler karyo giss uteri boetti vodianova vobis hillar refco prognoses cipla zagallo learnable anomalocaris protti taibu pinxton goudey varos blackhouse yie birkenshaw aissa sullom rúnar zebina songyuan kotwica zarautz hogland photothermal varmus categorie fxr werc welliton guge kinmont curently chave rumple sorpresa siran breading pandilla tripplehorn spritz régent schwind wynkyn fresheners hypacrosaurus wallenpaupack cassese agapanthus maniwa rothera watchfulness dovers hermagoras overworking weaf itzin standiford ruddell ivinghoe padfield orbea kimock canche lisztomania zitong shipload spectroscopically woude superpositions pami deslandes marchini hatfill moviegoer tennys 
yunshan stabroek photinus metabolisms rxs amatya mehserle beibei mikus deadbolt tipps kaikhosru dixville luchterhand lodo natc backa underachievers wavves fambrough portale aktiebolaget holloways delfs godown woomble subvention interservice sjoberg supermercados ,who lightsey helghast mednick rishworth tillicoultry bukh gaige greasewood langenfeld mulley deportee dfj atila harkers rigobert worldspace tafa wellford runamuck bureaucratically malinois timan benjani heever posluszny farnan agarwood moskos abello squawking pleopods siwei dinda laithwaite gravell yaara xaverians gacaca dameon alyss disembarks shingwauk conaty soursop yacouba snowville chirundu primatech ntrs lizardi cappuccini orka mucker losee gozzoli nussbaumer bilfinger willsher bario cocksure canedo reliables swetnam dron jagua mwalimu afifi donatoni richton zelen snarled kunihiro eisbach horseriding manchanda industrija shanon rabot milliman ascutney mottershead dynaflow penmon corneau fruehauf albrechtsen nalc crampon hlas fabuleux muslimin kinfolk wallström changji silin fertita tiefer aijaz andronico sukhbaatar hamidou recoloured yoku michala carga rambova finnart aftershave nzb laitinen kingstree ufe aracena glasner masacre pyrethroid levenstein quiros cuddyer dilshod antiheroes jamon preverbal nicktoon nouel popkov rathje tigerman uate exordium borup parwana disquisition elberta sauma friedheim uscs arac nrao mnh supraventricular codelco bhe tourneys silverwing astrocyte execration jaywick acom albana bezzina bope papered apli hallifax jnj kissane voicemails noctilucent hassania useing subthreshold chungbuk tgt stroe dresher buttafuoco makinen ovshinsky fenoglio hermelin tlw lissan disrespects anuja berriman shwartz magnetoencephalography naevus rebollo aull dirtbags structuralists vandervell neeld kabui matassa tdci willeke gerba cockrill tuts metten noisemaker cityflyer hornpipes perrigo magnette procon diomed lapoint broomall townies montbard ineptly frijoles clockworks dpv cammalleri 
karvonen sardina magnificens seidensticker leesa kindles communic cemaes arib investitures yafi coriolan bellecourt korma sharpshooting wildfell victoriei cohle vallas cherax humbleness jbg azizia dohnanyi olimar leapfrogged fineberg kirishitan khewra gosain brayer cambourne verlinden cail ortego bsaa groomsmen lycans amaar felstead hyperuricemia thematics chanu esterel staleness cordovan undiplomatic shtern radosh chemerinsky kieslowski stendahl minderbinder termly entrain schrott asexuals celestron mazak valade millmoor bencomo qnb elfi menning teleservices promptness petten figline beija assab osso premnath baixas tabeling caragh galanti flaca fragata veldkamp strangeland knightdale caretas tonegawa goodliffe uralsk lerer desconocido moccia chiloe becka terral commodified brassington macher ornans melleray tvf schnauss selvig kibbey turesson vorhaus lindsell yoshika inawashiro nitsche schaechter derring rodarte bursledon walkathon zele chasey annies towe honkala overbay fasch kazoos coluccio bandsman csikszentmihalyi esy puech ashly guianese steerer sarid kijang greenpower episodically russett nontheistic yis busson balletic craniotomy shortnose pheng perel mabelle skibo morgridge nijenhuis yelps sleepwear moodysson pedot butterly triband vinos sonangol phonak escandalo jeffes lysol baptizes atara shrewish bhide gohl lunchtimes torroella borovets upswept brasted slewing venaria zoonosis drowne leveille gazin turpis essilor indexers mingei parrado lobs utis fleshes carabiners joda murkier cerrejón edurne kosaraju neubau beha conclusory badilla hegedus ramadhin monkhood greenlands mcintee bosoms fheis mcgaugh birsay glassner filarial sics jaso polemicists jugnu intimidator proselytized porntip alleycat cheesemaker tuscia kiewa dachas pettie garbanzo neurofibrillary bics biosimilar kaysar staviski hillsville holburne landell corbell yogurts munera stanlow manandhar auchterarder nadon shitload islamorada aciclovir sabapathy haytor unasked malkani blan ejf kunzang 
klasa clemo mustafaa revit gingerly hollybush gamlingay motorplex pastori wallum walrath wamego stranorlar constanten itani karez gipper nashwan khata limmer triticale kentland spumante ziskin tryggvi asts stemple bourbeau pjd gentlewomen selja etoposide shoda urman qadhi multis sabbir grody finedon burgstaller niether mangino pinetum tsusho stettner exertional cipolletti wcac spaso eiti ashmun wisch benzedrine chandrasekar misclassification myrow ofgem brazile tappen osseointegration loades edifício tranquillitatis zeile leshem recluses llorca modesitt tayyib bnz hdtvs heylen beebo kersting sobolewski ketner raio kostitsyn tinelli binayak mondesir metaxa doughton ightham sacu antibacterials drummonds caylloma brûlée balcerowicz occupancies kamioka huessy kumail manker minchev trembler fiola sugarcrm sherf apposed kso pathirana pinchback cibeles ilcs levothyroxine aced liebel rodelinda vancamp sturdily mainstreamed jamhuri bangert jetpacks bruker biogs moonshot electricité stupefied infallibly pépé timbuctoo frary electrique bacigalupo jitu chiddingstone sarcosine mcbrearty nutr esquinas intoning slyde squyres weathersby palladianism schelp twentysomething preki refractions mearly trauth slusser coquettish dewhirst tague tocumen ohmi hayatullah domiciles rooi comgall yarber newnownext jinyu ghajar stratego eastvale balawi balko dinged mooreland larken doihara weixing meiring sifakas ravelli voltz mawkish tsvi egolf gastarbeiter aouzou pogany harmse vagos retronym bacchanalia keulen misinform uproarious felty hasham genito orphean gerlinde iroko rabiu persecutory gadon ajan overholser trên beachill overhyped cicc najmuddin tashman liljana efecto thaer bravehearts tmax calumnies shmita unscrew coalminer pegylated stampeding zanini mercey crumpacker macerated zarai assignees griffing bookforum koth mmbtu plunderer arents anticommunism huiwen mcclintic screeches habtoor vgm elsaesser vergès carload kilocalories jhong bery scaa ruppel difrancesco afellay achieng 
jephcott mytouch humpbacks hegner wallaces majdi guangshen genei crago qingshan khnl clx schygulla pulheim butetown wlf cims costinha intersted tagliapietra bollettieri lexden juggy nandlal rull caaf shoppingtown zaini bronchopulmonary saltalamacchia gbit farraday abil zambelli dihydrocodeine vilasini thumbtack miescher jalpa courttv binyan pangasius schlereth bijaya freiberga valer stenciling roundhill slaughterer limone klimova mayorov tommo façon cineteca hanzel cherono tweel memorialization immobilien arlindo evanna jessika nkhata razif faura spermatheca manini mocksville chemosensory polland jantz coiro kopechne cleansers maenad precht nominet jenckes chickie shonn horserace riska facc unfurnished serbelloni declercq koury groothuis duberman cofton penjing javeriana philiphaugh chiado robocalls actualities ilian hemostatic squeers loewner fabricant monin kunzel clovio cualquier bilin batching huayang hulsey wender makamba medicolegal discomforts matelot supercheap arnone unamid webshop jasem danau xtec acetonide boaventura boche daykundi icaew emry butti ejaculates subsite goldstream ulmo derogative zwan czapski waterflow serebryakov resolven hepner eiteljorg bronzini cunanan slaidburn macie villeret clwb verissimo fibich gustine frigging thanon illogan eumenides toeic completley eshoo oleanna goudelock clanfield dagr atambayev mangochi yseult yadira mejillones haward dippin cuddihy harrows barnabus campagnaro fluffing schuhmacher haveeru godlewski damani braddy constantines overfly sunpower cetin pinnell bensinger readin embroiderers chillington loda blowholes cybi erazo enshi serm gingin javelina marieta rancheras envisat unworked qeh ahwatukee akker umnak buckskins gabelli mosto atotonilco naufahu capitalisations corell unwerth atong mmogs electrochromic landel youngson multiplatinum pader oxshott hacen equestre aitmatov supersports hanish codsall pontis gholami interchangably scholder pinin motamed rufai totus stape horizontals ringland wse scheier tassin 
lamkin palffy zavaroni vetra menya hune trailside lissack boswellia trauner hyd deyn woodes meshal thalheimer grimacing stagliano skywards threave defreeze lundahl sattam kachinas pooped wetheral janetta dicussed deline kerkar vigia frohnmayer collister beltrão malov eisenhart rold puriri doozy chiseldon umdnj shaznay quogue pells spiv stovetop wrington abersoch headpieces dreadlock mycenaeans setchell marick typescripts driveshafts ssme tunel babydoll cottrill parenthetic muroto narina lhv brithdir saxitoxin douthat reznikoff silangan trouve wendall dungarees debunkers bifa yil outcries euxton koslowski tramples rathfriland quadricentennial avallone arconada ppq ncoa ingvild urth saffire paseka foshay shrimping auh nightclubbing strete ngunnawal turfed josselyn björgólfur netflow kinvara maracatu mcgroarty dilana mckayla apperance lanter perseveres itaituba grabeel bispebjerg shripad lactococcus lemelin amplexicaulis mcnew pantech chisos parries beldon rockslides yur fika somnambulist tianna croma masahisa klestil strank madiba dongbei daylong pcaob fgp keron bortnick kotb lithotripsy sakineh comoran ragù iex macina paradip andalas girata bizot wigwams boaden dirani escitalopram ctbto luebke afdc sundman reisfeld kfan beefs fakirs yarkovsky fasd christofferson reedsburg punctilious pinholes jaggar deandra speedweeks amerasia interbrand canvasser kutama rauhut morlacchi jbel towell belmarsh pannonica runk snsd investcorp degette streetwalkers chooks algodones swang ibos mediatheque knapped ingomar quetico srbi winked unagi isabey mravinsky noritake ghaly avera diann ✤ argentum ebrard esslinger mhairi jaylen michl montmorillonite edps lovechild objectif umbricht piskunov piros alltogether eavis dormont petley geostrategy fxs maalik johnstons lanterna hoos kochen greenbelts nassib acquainting sgf hegyi agranulocytosis waldrep ruddick sekka mcneeley guye hedgpeth ogyen dyersville ploughmen balgownie lejon stojanov trefoloni bloodsucker takehisa crémant doland saltsman 
habbaniyah rouvier hypocrisies obree andsnes datil mazlan hagon mastronardi perper singleness uncolored topas koul hardiest addingham restinga outriders diagnosable baucom standdown demont ambushers figurations willendorf reinold coun tonsillar allsburg exotique timoci wowereit samaa sinagra fouche creditably belgo allenhurst hongren mollard sarmatia pinewoods hnat superleggera ramsauer kammen abaroa porcini katib mieses bistros vetrivel shackling bremond physicals pourcel karunakara uncaught uams landgraaf kace saine vemork cpre britanny kisling deji bellmont grindon pratte loevy mingyuan wonkette papanicolaou borras shihabi dolours underachievement maulavi koor dérive jayasekera besnik verio eisenstaedt goodis deuba capus rotis pickover sedar satt dartnell penair cavenagh harshbarger solida knaak institue majestie chiaro luisita rewe hagstrom cardium zittrain eyecon pemulwuy cowperthwaite pletnev cheron tirunal hypertrophied patency jerad galesville stlouis pirotta declaimed gerst liffe cinematical lwc jahvid toolson tibaldi masuku castlerock rethinks muratov boisei sugarcreek bounder resul cothi beasant starlit osbourn dya sturdiness skechers ozona merican myn encephalopathies lews brocton penzias sagano tenex fmla cusworth sunwing pippy bordelon landsgemeinde elmfield haci geant charrier embrapa gigapixel overqualified sorín karpo theretofore alterio lengel audy castillejos ardgour frenemy miladin deyang langberg konko undercarriages jeschke freekick quintiles birdwatch petabyte miniaturised birker bonfiglio knaphill grzybowski reepham chengyu tremadog horni rcv jasser kjellgren arrowe colourised jagt dunloy zonk incompletions efas piquer loso earthship shochet deescalate rolodex priapism perkinson tankage rivkah suthers commision whihc waclaw guidoni hazon lucanian weightier kaslik kassi sleepin huilong emasculation trank rxr skiatook optronic sicher fishscale cabanes mytalk kroh burgeo cuper doughy cordials tsukioka campur caplet schnellmann broadheath 
galanthus crcl ccts uptodate meeran tadic kbw asiatics leïla otelo apéritif godrevy inchcolm savigar fackler solsbury woodchip quints metwally priolo panarin niggli breezewood abjured dcsf potanin hartzler enchanters jcpc speros gatot woroniecki trago castex khaibar eavan bolwell melones faes friesner pennisi cystatin cfsp difiore agami micko borowiec hodgenville nariko cascadian knowhere ferebee langhoff yagudin bith akinwande kapow temirkanov crackpottery illing nzoc fitzharris hydrilla tanura tasia uvo verson kaleta kalapana valcke reappraised dillin chinamen tanni grunsky chmp akwasi lisner loughrey qdos moldau nushagak deducts sylviane dwaine veiller cabriolets arle telindus archipelagic foxdale luders babesiosis electrik mammillary iseries patato lakhpat stutterer gmrs cropley mahakam emilion campamento kasane portraitists ramathibodi sugarbaker wirtschaftswunder acorah rankins bellissimo polypterus raida anicia voorschoten aberle pettway instanced júcar ghedi bleckner morazan preisner shapey dipu ognissanti anastacio gunj dhupia hurva lewinski spalletti kyrre blee hueber floccari farningham velayudhan cytopathology kotova aorangi orthoses minz agriprocessors mecu pyrethroids twix rfn azali vog corruptor olana haughtiness mayama convivium mroe kahekili barretta roborough saltdean bonderman tarbela murney springerville shorr nosh commish monotheists unimplemented momaday kickass workhorses nebulizer homozygosity regaled nols ciak tianyu stavropoulos napolean beere lofland jatinder mezzanines jazzfest servery marei longboats hoppity ngaire arensberg stensland gargling justyn jalopnik bongs glazkov bleary flender kahveci gayheart dutugemunu logrolling corvid chaohu boadilla gobin kolon nightrider circumlunar compañeros annise vlore charterer zuck yemenia schwenninger newitt navali luzuriaga kaboul cortini emsa sunnism blumenauer leuba slagging koce shena tanny mikie dongying caymans polyhymnia presburger dober cahal mutola candidatures maxtor konigsburg macbooks 
gossman janan agere shatkin recusals vvip wipha roenicke anghiari ajemian rutigliano terlizzi wolfers evictee saibal washingtons nici syringomyelia bezold dribbles eapc dismaying yagya fontanella bolds niam foxhill flits fores mimieux divorcees azfar hwacheon cazes eventualities szewczyk nortons treherne ogbu beanfield ference slotnick stiffkey orting unglued groop bocci anthing defensin chavs laisser clercs christianne tidiane alaniz tnl fawwaz kedia auswärtiges cipes emiratis spivack teabag hadal quashie nurserymen protegee rajpath barcino whithouse richardsons answerers ipec merteuil splenda avians watcha tucheng musashimaru arvs astounds mckeough boppin troller buckhalter dahlhaus lusha centerstage wirehaired lovington hayder sauchiehall familiarised shneiderman rusbridger antolín jassin shopfront bronchoscopy shuafat groundings conen baetic weder hydrologically inflecting cantini hypochlorous conventionalized hng liposome hadjidakis coastguards chinatrust zaydis smurthwaite somwhere usaha anticholinergics lincroft neuroleptics thujone baquero topel flybridge qaanaaq camplin graphische boothman pawa supercalifragilisticexpialidocious anxin belloy thumbtacks centime colourblind matrika bloodsucking schindel gateau pickaxes executory chimanimani foose kidane anisha irsa worthier shoichiro bavel repka niugini scj waken flashers beavertail stev rosel ollis thereunto muhannad campeon pakhomov damphousse tyer mosshart modernities unfitting ventosa externship crimestoppers nondiscriminatory mignini nastran machir cortright gyrating liou notar neurobehavioral roki asklepios phenanthrene mittie chayote carbonization últimos rakoczy sumps songkla sehat segregates aprica carraig dejah harborview istook throwaways vgt untimed tatty viliami rodding radde minkin buq gaillardia serveral almereyda blazkowicz cbos charalambides koppes gaetti wmtw bonfanti haibin salame coccia uscb prosauropods mammatus usip walberswick vota uncinate halia sortir luckier lederle calafat munny 
orgon fangirl firebreak ashim fulbourn maritim evoque hunsicker pippins lefthanded etymologists jarai landowski medin buerk medmenham neofelis habibollah bco ultramodern weatherston unequipped zettler heceta partee angiogram fastrak hindmarch incontrovertibly mdis dimino disassociating surena barbershops robotham endlich untutored gennifer hannoversche lanre suppositories kauppi suplee stackers rapiers whorwood harrad nasza broaches outraging fauria kiira voyce dunnell farang delftware audiologist incinerates herriott fattal congresspersons businessinsider gamemaker averett rendezvousing naspers avere marzan spfa gabrielino endz angulation mariz tomey kcsm abandonments mccunn merchandised maxcy daudi centex quiles rrb downend buth lucasta walger hengistbury rizla lefel vladeck konfrontasi bromby bernhoft luyt snogging borkman bloodier meloche flextronics galstyan confort norbit hudsucker stratis clach bertine ruhlman fahmida baudet bethge followups christopoulos parera counterrevolutionaries crowbars ncate kiprono spelunking rieser jerricho cabera skewes jianxin percolates barile castlegate trovoada nimbly schiraldi skypark mittleman tetela ionov wolstanton dziwisz parsnips psychostimulant udl herlitz neutrogena indicts agian transunion namir felbrigg donalds clevland pétrus valveless laugher aperitif evere kylee matteis mcandrews pigweed mervyns brügger garel kamprad casamento jerusalemite stamets romaric wetware raczynski unshackled ranitidine pattnaik yardy knoweth brimelow vimont preckwinkle coomans thaden cancio obrien kariwa contextualise beadles altomare nevitt naung musicmatch gisburn brocades kastler szczebrzeszyn gendler overholt alexandrite cpsa riffel kinzel pinet maxus petero jallikattu capirossi nagpra analysers flixster lulling karens duhaime alexian bigness pankratz reprocess landore surtax slimbridge loubser lcg montesi wahlen tessio televen oberweis bakhit vogons hinky inea quintos torto nypa stothert biocide sanghera zumbo intramuscularly 
niederhoffer nonstops dropoff pedlars tendu mones imformation inot cpuc mchedlidze rosalita desalegn ramb giampietro inma axxess bergler falluja scozzafava petrifying backstabber kaiya sory newminster warbles visagie skycar coalesces neuhoff admi harput transurban gronow urzúa halicki sirakov agramonte alighiero margarite zerhouni fidan butaritari sakonnet rustington jcaho bettoni reconstructionists toretto embsay myl osnes afriyie rnz dirrell schlamme kalemie mekonnen humidifiers sanlu shoshan gasson guzan strevens labriola hemswell herky hiob unlettered rdl slep ennobling gravida zacky purdey gavialis reabsorb blackpoll henrikson parboiled mooncake psychobabble zhengding wsaz scows loweswater crunches everardo hasu hdm bonamy deblanc remonstrances epidermidis häusler heskett mceveety amatuer kappes knoc kilcoyne qadis mikiko phalanxes bioblitz dael hinske homeruns signoff klinsky gitanos stolte ebright winnett pitsligo plomin carwash agolli quesadillas remolded ringstead griffes committeemen huella lco krenwinkel irland poplarville sherron breif ambrosetti eastry fininvest rampe sapote tanzanite percents seijas elenore creux govender licciardello edmon demurrage saraiya energizes canem mesko amlak anbari stelter csdp henricks chassidim camb trawick highroad afaf slackness keba yangjiang dawsons punchier redoctane naïf dresner glebova armlets tratt kff ibolya cockneys bolina carboplatin mathie fetcham florido heacock emanuelson myxomatosis postmodernists gormaz hattab sawers glandorf tyminski norihiro sciri krik milicevic naranjeros artw scates basting gigandet liyuan freedomland schier collingbourne skeggs jahanzeb sheldrick organogenesis shashanka orgaz colourists peppiatt promyelocytic ashden stotesbury tonners heraeus reovirus berriew zecchino roxio cywinski bluestocking aramon madyan veneziana aldate tfts lagares pfcs ranevskaya winster purgation bunkum brentnall thilan snorted disabuse neuroeconomics marcq peaslee cynopterus impeachable kilojoules rahaman 
yarrell virpi muggings chega hossaini epistemologies weisheit wallbridge giacalone wetherspoon mccooey rebennack finito ypc belarmino noja unmanagable zpm sabac pbd mfe headbutted reimpose freilich drays eatwell officialy emrich utkin ecpat dilhara koonce graybill dontrelle whitbourne lozoya wroblewski zaf crocheting smbc borsch cods lockard ophthalmoscope brioude skywatch ivindo treas tirole clausa lamizana goathland hemdale stoneking photinia chingachgook winawer eskew mcgillion bcv cartographical rebirthing marsoc dressup abdc harangues bellefeuille insupportable bulgaricus afterbirth mortgagor smilie outpointing ghotbi penseroso panchenko salsbury wgba kwek equilibrate obuya medzilaborce wildscreen hattaway smeeton mcwhinney putah porridges attash extortionists lemn scandalizing osako tsotsi heorhiy cockfights futtaim folklórico ostuni infarctions hfg navetta deathlike ondoy gwyther konchesky helleborus hakamada causer rdio floury hejian cringing dracena rastus responsiblity moul khloe zelimkhan toufic fontanes ingratiated redshirts shanwei tompa moretonhampstead photograms neofascist carnauba palis parainfluenza heinke amboseli fakin ffordd solyndra brickner houn phut hemorrhoid hillington nater sarwat irkut cndd atca capgras danakil avpr orbelian glentworth lennertz leiris bumppo crsp makela sillyness peremptorily latchmere fascinations fecteau juicing meddled cax ritva unscramble cycler nijo gulian extendible imas teijin megaphones tokenism grouted suppressions seigniorage weronika hampsten samandar lishman rissington oakum gynradd garlett chulabhorn pulex umunna ruis bandwith tarmizi soskin aigen doyles seamed achin pvh selles baptising tramaine arnside irsn gnomeo embarrasing lamberth mucopolysaccharidosis antispam molehills trewin colly nhrc rotundas grabban flannels tegin ammendment honeycreepers skymark trys citti naomie moso waye misreads roary duffle walentynowicz saadiyat cumani figl tursi infernalis tallant hys klsx apethorpe vecindad sensationalize 
meden pesenti urdang scheide partem shengnan landhaus leucadia nicotera norvig alimentos rids trifon encasement gianandrea hoehn swoope feferman viticulturist demotes toun icarian swop xiangfan sirona alfasud belyea decurtis ncta boyland quaaludes kettlebell cearbhall brustad mousing lovastatin terremoto korky ofakim baseload meteoritic vetsera madonnina secedes breadcrumb stcw bauzá osam ramaphosa bachpan barrak shahrastani bricket manafort sketcher zhizhong lengthiest solium midcentury brightstar abdelghani brignac bandirma jadwin prestonwood oxtail atheros fushi kiy hunkin peewit pressroom calixte remnick mcevedy schrag eurobonds bhm traje thsrc damico sames huszti randol aberdaron bugden astrogeology resuscitating candyland julija streptocarpus cortada lound dearmer actividades avold velociraptors serono propagandizing etuhu amihan gambits declassify kanamycin piltz schmoe niek beir eklutna seishiro schwalb benjelloun verpakovskis mcavennie measurer ursine bies rubirosa obeisances popsicles filch gavdos blashfield pulsifer steidle etap nazeri sagmeister eslava deryn aey entropia nesty ciotti thanou highlandtown daylami shoenfeld bradl gdot legree wcnc kytv pesh carparks lyes windthorst becase redemptions khandi seimone jorvik garlits balkany duvauchelle andrassy jenelle alson kozakiewicz danum balkanization derelicts wallgate widefield peiser taling coard nawang lazear roadshows irizar ohuruogu snitching chokri tecan shamoon powerpack combustibles sofoklis baltusrol buggie icds footsoldiers binnacle attenders gysi stagione repas amerikkka slaby yongli cordish okfuskee ineffectually shaher beardy pangnirtung hathorne mouches cordwainers myakka gaudencio kostal bullys atias riendeau dalmellington sapiro dius engebretsen welp bernbach léotard keever boylen wieslaw cornmarket mazloum sirajul murchie binner samedan tuckers polisi tunji souvlaki anastassia marywood ozette vevay letraset vishakha bargmann ataxic granturismo mahmudullah cammarata bhutta decesare 
barberie hypercapnia hancocks chinos sandiacre exline thayet sapience defames aatish ºc limeliters crosscurrents darkish croppers terraza mottet donelon rudloe taren carns hallucinated vange inversiones etau stobo verster takatoshi connoisseurship jamshedji inpe paraskevopoulos waldenström koyaanisqatsi transportations sudley spiderwebs esam pictorialist nichd punia penco grinter daham oerter madol unprimed verhaeren amoebas nispel musayev wkar marnham artc yuyao serowe hilborn garfinkle oxymorphone kxly wolfberry haverfield farkash demunn burgle countenanced barnstead marianists kleinrock eliassen hembree hypnotically intervenor mockus moeran komon zhongbo headwinds pauken poblenou souderton dreamless macheda blockaders lamos ubar rampersad mladic rajender kauhajoki activewear inskeep tagoe eliran jayadev clywedog kazinsky behrmann eyl signposting csea kretek mpge pollokshaws goertz minihan nasrat nives hilco conry racc voght pène jenrette carilion tumulty interrupters fnf wiegman bogdanova wesseling dewis vibra idrissi bullrich abouts feehily trustful resend splish wamc resounded reppert ganter gorder ukrop henno coachworks timetabling priskin divertimenti pabellon chiche miny chardonnet modou typhi teems vajira llandow cfh excepts folktronica ninjitsu clatskanie henni bragh sudar wilfork udoji verapamil effortful pharming loyer chabala hurtt cirino moglen mackley virtualisation itaipava oml croissy chatou lubrano kindergarteners sanchis mediana piw rosta hindmost broers outmanoeuvred lorian davyhulme prescriber pangelinan bushbury belmokhtar edwidge pulsate altheim immunologically polymetallic yelping alonge rosneath kiszko engman campora broude biersack cerra channell fsis haramain apaseo kosman scelidosaurus lyss bctv santorelli satre fifthly vicereine orabi cayla seosan cbre costo shahidul ancell raiment nrv eulogistic shivananda senko adwan lanchbery chepe garre turtleback thanassis worster skinwalkers latika witmeyer adeney bahan connan astorino tzemach 
attractants dastagir rifampin herlong tenebrous escapers limor copybook zav laxenburg shopgirl wissem obame múgica informaiton sopi buring masetti corentyne aquarist obasi matawin meggitt keathley gammond talalay loebs keevil dipturus neyra mcgiver tahquitz bartoletti larn geita syniverse degollado jarom bobrow umaid reincorporation biters ultramarathons atex brillstein occc seeberger vajna elisheva malc omers sherpao yahav oooo kimberlee idis pycroft acara rossant savelyev csxt tedrow suguri commiserate rothenberger deby gisbourne mcflynn norbulingka ulyett ximenez diatomite unindicted kareen daelim pgw espci luxair scoutmasters avidity corsage lindor carburetion cashell polyamides karrar ebinger renovator schijndel sickens willimon masculin misfiring blatnik augusts confidantes oberste popemobile xiaofei starlink camile wase eisinger queering reedsville sheldonian monochord levered rushi libelling ecosphere multiview tomate transall tifft helpe hotpot spruced wafs reinvestigation duddell interdigital harad mandic kulagin stanback yangshan thoughout scandalously holograph gagarina mcfaddin rawiri volmar rías doub nanometre cias hargett stinkin crumpet boiro fauziah minucci corrodes hollingbourne sicheng arborists sultanahmet maliks tobis demanda chastelain nusret tridacna bucklebury orteig moscovitch avea lineout hilles yuzhou domar xifeng livengood rouass longeron sportsworld arroyito corozzo lorak paillard satala hosler scrutinizes photodetectors stairsteps herget boehlert militello lollies okies infotainers bellusci meridor eloff sarsat duale sirhowy steacy moorfield metrotech loker annamma bédié hinnom beric lalich devaluations theatergoers browsable dasch endotoxins ohb woss kugluktuk galabru zarr bioacoustics sauropodomorphs saltney culbreath mauck pendency dejagah reguera spampinato drumroll sucumbíos solipsistic lindpere chorda explorable bunjaku closeout stokesay allegre callebaut haniel unthinkingly lahun risberg actaully haenel vassiliou leukopenia 
insulza fleuri transgressor remon moorehouse minotti couhig pedicure ridgers malbranque coronilla netherhall zakes strache votel mediterranee nahmias isport bilger hayseeds maindy richenthal virgine librescu ebbitt gramlich boonie lcts bonnor decaydance grecco coffeeshops spinello medco gasb plasson wisemen mahals edek chapell ohiopyle akara jorrit ivone loel mitsunobu starsmith metafile mantu sturzo gruchy oord revolutionising changfeng beheira directgov schyler isight doyley aberg ranters lobectomy workups phenomenom mccs sambava lehendakari khanal szeemann stickland mocatta lutoslawski aafia tarong homewares lantry radzinsky cefalo brokenness salini madidi troxel hueytown bentiu kahui cantalupo deciphers gobbledegook jimmer noisiest ndes marcora geers parthiv mustique siberut rushmere orahovac tuat aez conkers fricka rambled xacml risus tapert sheers hetauda chirwa kopylov szohr futagawa commments rtz apocalypses shamo angeliki topnews dessay chalford oesch thermogenesis microevolution nobuteru montrealer bridgeway epicenters refound colonially feal chère waitlist samosas yassi sampoerna victa pross eija nanhua kalihi explicating anamaria haymaking guoyu concertantes argentière fleta maxell tibetology bostons hfb farry dremel rmjm portago sainath casona philharmonics foxcatcher liden misplacement mefloquine tyndrum darbishire ktbc unalloyed bpn jabbour koehne minocqua efmd tazi velino collabo teschner concious juhn scorekeeper bulkington tarsalis nasality appendixes samsing krahulik chervil pleo aex shabbona murley maintainence flunitrazepam pegeen victimize mtbi arcam hyperrealism neea abassi yalding imco vgc maeby darleen williford wprost ivison spirochete barabati sibella apatzingán hashmat bewitch ajaib gelson amputating sads unibet interjet pppoe arrt geomorphologist clemen biswal edamaruku hairball fawcus guoliang terzis thyroidectomy berkovic calan rosiers coillte bugloss valleyfair delane oeser deckchair montjeu hambrecht excisions edris alfama 
qeqertarsuaq tracksuits bense vvn gâteau tetany castellitto stunners familiarisation ironmongery nexgen daquin explainer borghetti guandong xanthan kiichiro intented flatted zeuner blonder preformatted voorn edhec recants undyed deidrick venality rileys gimps rooibos glynnis rasco blechman churruca maxxi cellan katty harai béatrix nonmilitary lenine dke emili hiles hammell sungard jacobinism salur hallowes webshow facilitative sopel restructures includable aeds disfigurements okonomiyaki shimpo ringmer boogerd carneddau sommeliers nebraskans turay schopp neuroprotection nevs guesser durrus wvec vaculik lewellen mescaleros mcmackin cloward weaponization cortado tiahrt daker ortonville caborn darrelle polmar byland terumi semb ormen karrin galyani robertsdale nanney waipara bryggen rossford caputi unmeasurable airfares pajon redentore dauvergne altcar verbrechen bbtv bourzat nazarova demodex mondini seya progr gordievsky regrowing agba squashsite porath sitemap tommorrow manh rubbia farmacia hedstrom benrubi bleaker woth barmore jacir megastructure defar mnb kisnorbo truglio nothign hooman silvi mikeladze ceniza rysselberghe systematizing crusaded tasco estacion springborg eorl microporous barukh aseman sidiki fresu besterman pujara gleditsia denbo cebrián taittinger pachter ballymurphy filoni liquidations chsaa yongbyon conservativism hydrochlorothiazide jurca rexy namik thermoacoustic favero wykagyl chuckled dwindles hertzen pashayev winepress argc farzand nazneen petacci mukri oneidas achive hazle andreanof shabalin shef salpa mottingham testings multiband sharmin sinlaku nicot pestles lapthorne etgar castelvetrano radack nutfield afroman schoenfelder munmorah grandfathering keyt wbls electropunk kardam lumpectomy aflam damasco kashkari kazutaka sarlo lokmat lordstown chalupa lycanthropes frejus borrello mulls oresharski dlh hydrophobia kinosaki riam hambridge rhoose solca baecker vli planifolia incepted hadnt ahmedinejad pocho südtiroler gleams sumaila tautomer 
intestacy tabligh airbox mulas lujiazui rebs albertas kingwell remodels stampfer klammer aviat dubal dirtbombs moshiri erco rosinski ubt stoneage moonshadow cambiaso biologicals cleber pressbox berrio worldline mercola xolo hetti puetz dearlove responsability aliceville makiadi danns omai felgate culls enthrone truisms rothermel superclub gcg trumpton liljeblad equable grafitti seedpods trouillot hearkens arpi glasier paihia hallenstadion brams mishael salia cliona febuary boisson octopamine valco bedeviled feenstra chevènement stalbridge hilsea clèves bilirakis narrowboats gartz lurched rainton uei moonen koshland lobregat spectrometric peinado nrsc gladstonian verret reist pvm slover sumita nominalist grovelling kachemak quechee bozzi rhinecliff collations tarkwa myojin meily sherbert ricksen sabaneta rajen bootblack keenor godkin eclac rathkeale gbd pozas bertholf jso warsash athenia ambassade kopit gfh grouville manuele beholders makropulos capex mayos loginov miranshah techy cumby alishah leaker bensoussan hybla sjt tweezer unha ikuno upstroke presiden nikora pajares biopolymer neeskens defrance buchler bosetti albrechts kelsi superhard sasagawa petti stiffeners commis vrm luncheonette toulson bronchospasm cvetko befitted atención eustachius jicheng redstate careerist sambad sarychev dunson manakins relabel quiznos torosidis knighten ecis mortarboard phomvihane billotte piergiorgio ibew khairallah milers djoko ecolab ebtekar transpac caparisoned ecstatically tubthumping morter reasserts hamil rakiura kinkajou mcgarvie meritor zoia radok rightsholders swezey reformulating rathgeber pabo flossy vadnais ditchfield chally bitterman sirkus buttrick intimacies nxumalo casalesi nadesan ceilinged ledeen gomo frenais schutter etalon manias karsan dtw kadan ignominiously habesha phipson becque panoche irujo kernis pattens untypical rudner versoza vidigal ukai pedometer hamiltoni haldun sottsass nesse reichsleiter lukka rutkiewicz tuomioja gerbera methone tourniquets 
creese navratil carnesecca stithians supplants sepulchres spil webring parlous depriest kodu golb jazzwise puppini clientelism hangang porkpie benevolently shippingport xuemei tischer garnishment deinterlacing janisch bmv ewens ebusiness sperrin remonstrate valances fissioning odilia miomir buoniconti mauny jinghong kindl tiva hatikva awda hagibis tyvek shouf stepfamily mealtimes mzoli openmoko gungor hense strategizing nipomo litsa tvone joshu bowhill tyrannosauroid chingola yavlinsky kalifa chipko mulally fleetham yasue balvenie howzat poblano erotics shidler timeshifted phrenologist couserans gitelman verzasca synergism thadeus modelinia pedestrianisation amlodipine stemmons pincay bedsheet suchard kutlu bellhouse londen walkersville straughan evdokia watco russy gardon eking donnino fmb foce ruffins norstedts penteli chid boucherie berrys caselle passet sobat urdiales landsberger korrespondent heatly harangody supersoldier ummagumma shepherdsville flashier chiou morman sanghavi jumptv coxhead apella vsh cjs repulses effectuated pyman renzulli meliton blockings ptn hartsell natio igoogle chancay qhd aboot yego schans circumnavigates multa ybm tombak hauteur blanching digicams courneuve commentor kyriakides csca heumann avaiable hones hrl pgb soteria clayborn montcourt fedewa beggining satterwhite skerrit booher anencephaly chabukiani raunds hoofer soseki teasley fieldcraft strutted ksas undercliffe bikila hanen fundings issawi slifer gooneratne moneybookers damelin berezhnaya primorsk reker graceless glissandi bozek decongest verdienstorden niffenegger jiaozi halimah virgie mesmerize klingsor pashan chesterford karita pithead ramree haleiwa ibuka firtash krey gumboot carnedd annat schub motherlode realizability blasket blicher piestewa electus kuujjuaq alirio leisel musyoka ariely florals conkey nstc shabat pentagrams codorus werkstätte licciardi norin queenan teynham excipients praca possibilty marchesini vaishyas tineke rudderless randalstown lickers egov 
calderbank cryosurgery bernis dumbreck quazi vasodilators americanas fastrack viridiana clifty fanis braddan appt cenon shamshabad kawempe lykens baiana shoval miramare cinephile carbonera banim prevotella fuzzier saranga svanidze pigliucci torpid dressy astir cummerbund zib circumscribe hktdc nikolais girvin wickenden bentwood stefanidis panamsat ieu magnasco nhmrc maquiladoras frumpy screengrab tarapoto naela jeetu vldlr jrt bazzani undeliverable wcsc zonealarm schulthess coarsest ebaum niitsu clubroom odone greatham inattentional huband awdurdodau campisi breastmilk gharials borchgrave shouty clennell barran obliviousness flab tafalla riemsdyk hooe eisenhauer ccsu mechtild dzurinda quarmby pistoletto largeness buies wettig yodelling conceptualist jabri frauke upgradation melburnian maniago cinzano wurts screenname tianxiang infestans weeksville pelevin honley irbe extorts lieto dwarfish dixiecrat antecessor haixi vocm demonetized totland nicholaus overstaying nearsightedness kowt naegleria zaobao albas podell treille tebay olayinka rakoto muccino lampanelli smackover fooks cryopreserved latka rassa kilovolts supai souquet mentionable warchus unworthiness conveniens rezillos denkova slagter heshmat haaf cebeci devotchka ketubah frausto traeth mubashir kaller leaming artusi ferreting cocksucker abimbola hitwise caeser manava aerogels arag goldhawk saskpower morpholino grindrod coastwatch salmela scx geoghan pinckneyville programas negrini pinata heinle pyles swedberg millipore rostami bayad sanyang banaz fody elding schérer yongsheng karly supergene baste hoyles fendt darold osipova xiaoshuai bohannan phenological peza sfakianakis finny mondor bishopthorpe sahour transcom mbandaka metrobuses rongelap rasmuson yoshiya mentana aigars aluka aronov obscurantist uchtdorf mevagissey filby willowbank uakari sempione zotto modders shroyer klampenborg dargin balicki baltra jinfeng ddk eastlands lingonberry rooters zafir ndugu downhills charite zastrow handwork dallesandro 
booties alusi donwood interbrew blountville mangueira ficticious marchbanks kfox bkl shinny auri havey kinnison tetrazzini iwamatsu alquist almondbury kretser gamebryo compassionately unframed adjudicative gouais batucada fairhill defrancis bayed rubey utami dinty cioni hzds sinfonias avinu tangkak oreg tyn fetherston wighton dramatical yurchikhin likability dallington chusan bandt tamargo zafy vasilescu wano wendo mcguffie tawheed symmonds lability onawa hollioake silsden bowood torrecilla klown comittee westacott karygiannis ambe mansueto omarska kaspi wilhelmi kallenbach neslihan margene podolak webbook wilshaw choules shoebill fushan galanin cisf javale extortions hugel menand limitada cathouse patricroft teeing omero biriyani scribbler arcel nost heatherington undervalue desegregating pockmarked viviparity burgeoned kovats gpas alero chaddha illig ducktown clayborne asphyxiating xihe amod dtx lazarevic metop horlock toumanova mezey layng wanzhou macksville piozzi stekelenburg ridgedale bartomeu commissionaire kinnelon laboratoires rodrik flowerhead durex morcillo godfroy zargar melior kozloduy bonnel calke baughan cwmavon goessling petric mkx barnbrook straffen foxwood toddle dspd schmucks rumana sytycd absolutists ejaculations ravensdale ziegesar cyberdyne libidinous waldock bluntnose pachycephalosaur actos wzzm rodell zenón legace drinkware acclimatize warehouseman cryptosporidiosis panzhihua furling xiaopeng ninel overpayment archard firetruck mbokani coolen ickworth steffes davidstow kpnx iraola ommegang fergison vladimer storrie makefield gullick hotrod abrahamyan acron clunkers ziehl lynford splurge wilensky wahat mestel strehl trademarking unmanly bancomer demag laken kitri stansberry emmeloord perko standage chording nolensville reclassed uzala guergis kran shrake criers cliftonhill thondup adepoju lullingstone julita tonnages transvaginal enamelling alll jiwani weakland hairsplitting jilib antep fyffes stoen gajda tremolite fremd hugoton nion bojinka 
fing insectarium jigging partium imgs borgonovo greenbury bhavsar winnall gutenburg burnquist hanft ameera gunrunner superminis soubriquet monterotondo malkangiri bishopscourt ebrington obscuration undescended fanthorpe wilmont patinated lennix shalamov smudging lenska hydrologists discreditable laragh shimbashi remilitarization decertification atrophies downlands ndamukong bamforth schmiedeberg zaldy licker micklethwaite roone dalisay linalool flowerpots hirotsugu simplicio bogumil dirtee steamrolled alysha sengstacke wheeze gouled khwar goyim fruitfulness blant tixall bookscan newfoundlander aquaporin chigurh joblessness blackheads stracey uppercuts myklebust eurotech sluggishness zubar virajpet elvaston unselfishness mottaki electret grossest nieuwenhuizen magneti worle squam reassembles spurting mingxing overgrazed humidification cassone scrapings buchsbaum cospar deganwy acuteness mezzosoprano ious bulworth llanarth godmothers strelna ruritanian brdy assos oaky lammermuir craighill lengies eskandari malteser palmtop leyritz scms nescafe huntspill dellas fatha oaksey jsg jeanpierre woodrush systematised mahfood limas launius shemar menjivar clewlow baweja cevennes njoya maleate xuefeng shadyac bertinotti hoarseness lavazza bajracharya slimmest cyanotic boundries glave focker reaser baumel nicking castagneto africom tanweer machination fhp thugz interminably hammarskjold macroscale marzia monopolists cataloochee gornergrat bohemianism zelia benjarvus wohlforth balladares crisologo paloschi buelow domestiques vgl chudasama bsec sharlet dhimmitude hazelius popeil chemtrails gelfond sahwa tewes dhiman astors nemorosa duning andronic quackers bartolotta ambras rambis dunkelman bertino vemuri sitagliptin crazes lancret elasticities impels jamaa rabee cotulla romel leupp salinan palanan spungen rebated sherrilyn kalmaegi keycard nfci mokena lovera pullmantur sarson roven wrobel visaginas levoy yele billingslea weiming busemann sankha sambizanga downstroke vernissage 
rotarians hunding damak xingguo particularized stoler angi mirzayev mathies technip patriota tastier shrubsole priebke dismore plaint kerrod shann photochromic larreta swizzle psyllium pipefishes tymoshchuk rozin bauers durmaz anodizing walheim vandewalle pellizotti hengel adaptec ellerbrock unkindness thrombolytic mcgilvray baloy ghaith seadogs cerrig newes rushin iiif hunker gortyn barningham retallick sji lembke hongyuan gramophones zykov darlton sxs eten astronomico fastbacks baisers gilston lukšić velichko ptaszynski goldsmithing schultheis marial rastaman normington okakura longingly nauck ayse sigle sardari dupee handicappers yegorova ipca ludy klinker harvards gasped zingaro sirko aurelie brazi gravitates broomhead yamaki sweetmeats jinggangshan cuchulainn shootist duggars bandes tashilhunpo campano rossler taqueria shannyn hogge yunes foodbank delbene pascaline troiani abada darín mvula sanou contactable rudiger canessa combino idex cowls dyspraxia coluche esclaves sobyanin garos husna prust reinfection trestman weyoun yawa inkpen almonacid nighty lagana nsis kmiec nublu procuratorate tianyuan cenedlaethol akman browses sobhraj indemnified höfe streb koplik prade nalls amarin defaulters genitalium mergansers patka heti goodby klever oruzgan lubuk pendeford godeau callipers crucifixions kumbaya palafrugell beisel kentfield plyler ibama prejudging subin punja keratoplasty lisnaskea gignoux kuc nairu deibert detectability akos yamadayev fugro renascent sinnemahoning methylprednisolone locomotiv cydney magistrature eisteddfodau pegwell epigenome binaca rabbo lawers kuz chocola quibell baorong coko voros nooses macritchie labaki medtech daphna appen clementon suissa slanging maccarinelli aprox somniferum cyberknife alferov koreeda hanz krisna faizullah cardioverter trudgill matchsticks vietor aksa sahraoui sixpenny suqian spinka foetuses alboran unmentionable macnelly yonathan fuenzalida streetsboro kippen philippon evalueserve carvalhal silbury goudarzi zollo 
hemoglobinuria jotun douds malmsten aurantium chadema anglians diphu stabb sholapur ficarra ruedas kotok cutrone speedways roomies gallow paroxysm validations bassil goodger bizimungu arapaima fingerless nonnie callouts comports sedrick gaut motional portlock banus masamoto booboo kissena iribarren rockcorps bkm rll maghi fionnula rupavahini holberton cadential berneri soor sharpeners contr tianxia tiramisu jlg carlen hyndland pck noureddin dunkers maribeth eenie kopell dukw tuula ruckelshaus visentin prixs echolalia mingun algonkian brisbin madl stuffit greate grift wielkopolskie prevue lelièvre bratkowski piroska chieveley trocchi coloccini enlivening kloos fogged châtelard davidsonville kranjec eanna teratomas utube letterheads flechette cdfi lobert congregant wildhorse annasophia pandarus ricart fucheng tammaro hardage becaue megastructures tandi narong crps confindustria crosshill litang eiche pathetique rudest boof karadzic plotless pulcini sioned hyperplastic neemrana shanaya garmo tanton poujade marzel katims merlet chafford wassup garud rockpool richens tawni infirmaries gruene godawful hingst dunipace keyboarding kerse bindel mugford moate suzano diadora almar kilted benna chauvinists gordes legman charleswood abbett pbz byler muskoxen koukou marzorati porker espinar phalaborwa criscito wahbi flacq pottsgrove aquaponics apap pahala dalyan manantial trunklines galeras griebel pedestrianism whiteinch moenkopi iwp procainamide murrells reutilization ondansetron heliostat oxenberg heileman subtexts gernhardt turnley wiesenfeld wuyuan disentangling enap shuli oroya samon aconite apocalypticism gluckstein anston northleach horyn lfk hanban hewerdine triclinium nerijus takenaga quacquarelli gamston folgers bureacracy modele pedicabs cerebra audis wgal minyanim seana golondrina squally dyble vmix skues tesio specificially mabie schodt maciejewski bogas metropcs alza skitters yere genzken fugly frenk grameenphone rigal fereshteh swallowfield erdenet redcross 
scapegoated gilzean limina mihalik yasur numericable hannant asmaa aace fornos jipeng certaintly neola christoforou baugur somersham toppin tollett zocchi greason impellers sumeria franciska lashon heintze righton mrh shojaei fancifully wepener abdula tramontana comberton paua behemoths paumen wickland yalom glocal takeishi sceptres creal ovingdean gerke ntcc aitzaz orejuela paraffins pixellated rohingyas proxied mcgaha formato greasley haltom heit leshchenko marenco nechama garvaghy breite claque ratanpur peregrini kiyohiko chovanec gaida kilik aulas fermentations jakup counteraction dnfs prawa bootstraps mengzi hillwalking routhier backings soltero hanahan fossilize gamersgate felter macoris blachman wkyt franktown roehm yazhou utopians microbicides ghale masturbated vercauteren handballs tinne sunblock zanten burdi lewsey martley lgf leveaux helmshore rabil accouterments caram rucks clearbrook ruotolo geting mccraney mho shazza koroni jizo saitoti harville persicum jhc rühle disempowered revenu louds dehumidifier lingwood klop upbraided piret miyakawa summerbee externalized khovanshchina rediker velikov dilating pulmonaria redtop peyret drillbit galloper mjb hynix frenchkiss wastin msdos ortmayer trachsel revengeful monach croque artadi groundwood liimatainen guancha vsphere thebus alibek furuno cantine shust murua lanser collegeboard nalis granillo ketsbaia fastens cakaudrove unpreparedness reserach angarano artifices handsfree decompressing kiff nessman farrior tancock skyboxes joses hamamelis agoos stoehr vinegars leye ravaglia savignac muddles hogestyn freetime wastefulness klüft tullos komedia neato leonova deejaying teallach grunert leafe cicierega neele pentavalent debelle andreyevna azf telesat mmse druyan feticide séralini hokuetsu wrase pesi bankunited cavenaghi meq limbert carpani iclei unlistenable ponorogo trimley ascl meiers ayles lijst bardini hoberg trombay lovibond overvoltage visitacion coloso vmebus scalera burao gaard brockworth hegar 
quarrier fenichel nonfictional demyanenko dinkel norful praesidium grobman neediest boucek baniszewski allfrey iora scheidler polyuria chandrasekhara nthe froglet bijlani privatising ilango klossowski leuer taarabt chagaev andreyevsky reen ciy galilea prosequi libeling assef fossella prouder shivakumar jacobellis inle bujak mabs stodola opnion memminger daele acyclovir simlish mpemba bauerle mufc kamrani nanfang murakawa baksi inserter koob whisking millgate couve gassy welled ehrenstein ahsha worsdell frucht fiaf adomah mustaqbal dweeb baihe blagg incomprehensibly bettles mardis stoppin beacause fradkov sinewy detrimentally chhotu washerwomen pinko birdsboro protasiewicz fischerspooner nasc mailes rajkhowa caravanning gianmaria gauloise cumbo cxx beckey mahasweta nnsa calella stoppini skibbe flippancy oxwich fariborz leki breon relaunches sforzesco hohaia anif massimi montignac mouthwashes allonby hemifacial quinidine barcamp kaddour bilanz caygill fii bartons treharne coloniser ritualistically wagyu bonora kloves chandrakumar parrino warter mackovic marau astonishes knoydart handelsman baldhead teicher unrelieved hucks schipa goze cameroonians librería crunched moulvi pantiles goodlettsville hambros piccini wisnu kudryashov burciaga baltzell kiszka tabulator sequitor snir marneuli aushev transgresses gratefulness torroba untangled trethowan scuttlebutt acie mühlfeld tista pageboy bethuel yers egotistic ragg califon adrenalina cmaj mizanur gethard bidden yongfeng cumana nonsmokers schommer kossi chinamasa liedholm taleghani wareheim paintbox sylvana kurils linemates begain opnav riesgo cifaretto kleindienst kombinat ronchamp hyett infrequency kluk pentewan crepaldi aeroplan trun niru willhite metronews sharla gazoo bcbs atavism crazee densham nummelin daq steinkamp intracytoplasmic clac higer libertini invasively sintef grounders unpredicted sprowston crummock reimagine philarmonic tricorder hunniford hugli quinces okri jaray unhorsed unsent scaneagle catemaco 
morovis azel whiteflies navfac ouda mtns esxi weststar milicic miliary papillomatosis leysen gration tlt lewanika jast svod etoys wisdoms tintype jutted periodontics elchibey sharana bounteous friedeburg gambell olexandr mullery shanidar hauschild majorie variegation nairnshire rastro regata twinings jpii bellefleur peterhansel commencements mceldowney pracy enmities aitc arnison ithe caerdydd fairpoint gripp moika dolt hitzlsperger donnacona houches trillin renovo shellie sotiriou headwaiter loven bocek morshed schelklingen kille stiehm fermier arberry pistolero developped froberg khatibi palmed foregrounding bll tocchet piekarski richardville prefigure mcilveen accussed bolstad meinhold framestore vittles parnevik grh counternarcotics evah couteur batal tambopata krieghoff scarfs arry anisimova ivt bahen inertness dowbiggin wallers onera julians murrin cullberg inx olitski ouston iolanta shinola intervale tayyab porbeagle furi wheather chakiris gambas akamine mogote peevish papamichael keening shivas ostry fembot wishin emulsifying darwins gutheinz canapé marmol xpressmusic moallem boxtree lizarraga croquettes scharoun immunosuppressed intermittency fossi steber herblock plentifully thymosin psig kenwyne kanka kärntner armetta sellersville barnt pulverizing mancebo peppin panteg barcus aarnio vermeij viani trumpf karriere gouger shirato beur ijc secessions kongregate onair karges sokoudjou plantier servicers vocus katkov ngen jne kompass durazo flatrock mealtime chatchai kochin soller ugur aleatory juantorena huntingford rouxel canelli bulykin divestitures icmr hyperopia shelsley okhta lightwater sinnamon cise longlegs citrulline consorted paone jiewen zhihui walkovers fitow gcaa backstopped indya albarrán ocz wolle photosynthesizing microstrategy discontentment lanotte stillson sensini schoolkid gerta simpatico botín longside abos achuthan bequia zhaotong stanbrook fahrettin rossello sirkka ingleses bleeping trena stippled plotz corrugations hoeksema aspull 
euromillions swx vsop kamya jelley kresse schiehallion oduber trka edgemere shindand basey yufu standerton giuda atascocita navion reconvening liebezeit fletching pinery alikhan sasco sigit tipitina forss holdrege mughlai tix umeboshi enlow bobsleds litespeed verka rivertown tacconi rorippa skerryvore globulins tzec marilee awadi mozambicans banyamulenge berendsen learoyd atisha estudiante idole niccolini blaz miesque quitclaim foxford hiltzik krebbs hayhoe olvidados persistance balloted decourcy nscn blakk dwf brenin naïvely stoecker kamiizumi gvp hallmarking haematological kayapo mediagroup moute ulma nnaji cherrelle triodion ribet dialga tardebigge cssd lechero cosman newtownhamilton luttwak blondi clampetts carandiru zairean doughs genos zidek barosaurus albiston benegas alverstoke delima diceros incontinent evilness mariachis brodowski multiway euphausia refractometer pattammal uny chbosky danelo waterfoot ryogoku meselson burck orwig pooter ncrc landaulet mgy contino talmudists ogren dazzles chancer wfts acon sergiyev wordlessly kirichenko muela buble andreina masen eloísa galinsky tavakoli brighstone rheumatologist aabar pulgas arsenopyrite quenstedt freidel borum trenchtown unromantic edmée brisbois ahla cressing helguson peria mccreevy benglis cheminformatics freudenstein raggy tembisa jeyaraj nassi vieth powerbuilder uplinks soutter kaprun cacioppo stockier epifania honoraria ricciarelli pevney szeliga nightspots unkei liberto bassiouni malk ferndown budweis suppressants hildred perlich callowhill arseneault jupe bielema berkovits sammes venkman karamay biennales burtis tasci backsliding coln matip oppositionist lalime dubinin unseasoned detroiters consorcio potch havemann hänsch soulive catorce manulis assertively diapering margarethen vettriano lifejackets methylmalonic zeth awr reekers clingmans pitmen kathman gatz azimut sincelejo wofl asume bioenergetics militated langeais yankovsky heleen monkeypox koenigs killowen oestreich vallot belaying bejo 
donghua navels thakurta veness ellagic folliard malassezia shughart autorickshaw pinecone hosey spaceland raycroft pachora trolli pfe mutualists hetemaj vishnevsky cobija wian gotv tangentopoli achyut gerbrandy jabu windbag fanck ummayad natasja shoplift beachwear greis bovespa artiles hatf wigfall bisphosphonate llf anod verdine bjerge cardioversion achewood tubin mbv senechal mccrindle ramia virosa barce stuever hanigan taekema akj wolfsonian aiya clayman lakai selm bnot crug galerías sunward ather stavely wkd transexual hangup degreed sidus lanne cattanach kubel altenmarkt escherich klebnikov giralt cornflake jangling ncas dominy ragama lamlash timewatch blei rogalski compromiso miracolo gossow dawar kafiristan wtem sironi streambank yanar aloy moisten gagah summerstage aleu alcudia mamedyarov nilla koerber jowls pepco ohshima sugaring hilário hongdae cavalo krust suburbanites rscg kousa schwahn galison pocitos foregrounded ctk delphini cloudberry seafrance marjoe kharchenko bagayoko hartselle askhat hirschl abubakr kumarasamy pandals pounces birdmen juva obselete popenoe ebby flotta newpapers haematopoietic maracana remise cassey reynal jiles positif ruffell dioses kanzler fuh loftier halam legos overslept nwachukwu menc blong flye pandang bakoyannis kastellet thas comping rechecking nikbakht nataf leprae akifumi disrobe hannula anshu zirin boullion buchanans prescribers moonah luteolin slagel grantsburg lifestream anzalone marchibroda metronet barrero wierzbicki redesignating economidis diago nqf mudlark tiznit pasquino elaborative yakusho shorthouse meeus janica teletypewriter mudville malty guanacos hinnant treleaven scalpers vennegoor hasna greatcoat interisland dollhouses rhiwbina gannan kitezh jeudi karinska expediently sturgill thibetanus afta camerota neitz convers azambuja unpersuaded lgen kissers christenunie ettie beloki alesandro mountsorrel veneman kamboni seminara bakassi schow whittenburg milliwatt tagliaferro stancliffe preplanned tocopilla 
talisa boutelle mergen dumervil chelimo kokkinos macandrews delonte shuren roosmalen boogieman haveing fbos scofidio larnaka incising orgiastic shammari templiers gilma gutfreund kig crabbers bioprocessing petulance unenrolled partiers xperiment cottman squelched yasuni cabannes sparkhill longshanks mongla almazbek tredyffrin fariba rønneberg broyard northmore eckbo sgae acars jimmies ratting acps chalkley meseret mcilvanney vanquishes manolete bikeshare tristen krejci aysgarth carmelina covel garance kyriazis mccole sugaree llŷr instone guh vainglory trescott toiba tdu dulling khadafi galiana baltzer verrilli washo australopithecines whitesville centerfielder kébé skag metallization collomb kottar neila boyo mclibel minsmere pepperidge nemer scoffing introitus busco kabini cowon meliá bárcena koeberg culligan giasone tongren pedrad cernea reviser botches yeliseyev postdate gamberini sveum barbarez narron leire wittiest qujing mecs kindel mechanize seiber tudgay maby orangefield colleville biggy tagami mangosuthu norgaard mitag aulaqi uisge awwad lindhardt fehl breastbone hethel damiana naturalizing endler coutard suppleness wizened wearily necedah weist burgman hampus kemalism matrikonopc peschisolido gvk datacom gestating aczel shimmers eih breadon helpings riahi sabbatarianism sheilah gambiae salable menisci rizhskaya zago cossey woronov gottstein erlinda expansively owino fochabers mullaly bratislav refutable woolsack alperton seccion euphonious mckone radiothon dillion kaina unblockable damron wytv unsexed gulpilil cherise hotkey psychogeography monae ticad froilan mamonov alstroemeria cez bajofondo boyack vuna moua mondorf codependent worldnet symbiodinium paleoanthropologists remek polymyxin gallier stanchion meerschaum averagely bigazzi bordone excoriation harbutt pippard econometrician kipiani prostrating gateacre guerneville simmon ptw zhar boddam rimawi bioactivity tweaker particualr sportscene autobiographie piecework loooong kubus wukesong dobe boogers 
pincio mazars wormsley sharpham reko phillippi sissies apolonio stemi tullie deif bullit mungiu aldermanbury gallner merse votkinsk istructe uhmwpe katmandu bittering vinda movil occitane outsmarts sozzani postseasons hurban russophiles nakasongola rabanne gracen kiunga lochgilphead duelled itek interventionists siphiwe dunns waltzer swickard ilopango hadland abutbul rambow lealtad cityjet freelove mashreq xperience lambent coult veltheim locs altius hegemann dumbstruck fwv gyrate stojkov muge facer oufkir sueddeutsche pernía linsky nko lasry dionysiac störtebeker breidenbach garreau quiff baruta roatan wafaa dujail allenton daptone potshot cyberlaw radiometers orwin viguier begrudging jonathas cheekily inomata revver gtlds tyronne maligne eugeni nikolaeva bealey accomodating bougival sleestak sapkowski irh necid arnette recalcitrance shurmur kagurazaka dtour cysticercosis looong mungos binta paska gaizka sinharaja mowery haeger dockstader summiting vucetich wabo arsine vanover orengo hudon kokesh avara battenburg haylock anomic treemonisha kaminer naoshi residentially congresbury tombola permai baijiu griffel chelmno shortie cordingley simplexes ipscs brunetta gilde cihr perdanakusuma dahlak cherrybrook griseum jiali kastri frood cantare caniglia tuvans intoxicant siegl genden aniak sempra maniscalco erlkönig glibly sneezy ticotin bagla mtnl psaltis kedem globetrotting mangia unsearchable goron migliorini sesma cheapen diles shakhbut karakaya hybridised suasion nonsurgical osio hemmingsen adventuresome bellview sulgrave saltine calen nigina bermann cinthia rinzler splendours promesse wallcoverings cyrankiewicz brackeen wineglass maquina bavier adot cloudera microsurgical vadisi milieux onuki siddick reimbursable noorda eliécer shadwick outcompeted chatterer interbay biochemicals kittridge esala gortner glenohumeral etchegaray bierley delenda troisgros protais adebimpe reinvesting douses scadding kolbert rödl maynas mcferran dishforth mariappa commissaires 
mohammedia unkindly neque neurones shukra mathys soldatov mavens ilgar adnet evv washings shikata mainlander carandini syrett radiogram tatarsky pritikin boge bracamontes wobbler salvati spieker asms quiverfull baselessly rentoul eurekalert simao floreana muehl gorta methodic lastest micronesians turiddu vasiliu sanjit ackee wcd fachtna weasle holkins mtawarira rohatgi pyla superspy bigband glenavy deadend chilcot canalization melikyan gaspe naruko minasian twiddling strachur tolleshunt sej ftx barkov churchdown makovicky osterwald häagen gmhc hadary kilmory clwydian kinjo divvy ascencio crambe dobermann nonproductive serenitatis prosaically reforged oliech corish coopted cheezburger spads flobots stroudwater druggists melodist puppeteered nccam birbiglia deconstructivist koeppen symbioses genistein rampone distribuidora thulani bengston gethsemani standpipes motortrend overby valgeir leagrave amsoil isvs ringroad ledward aspd weepy zoloft gilmerton anzoategui fairtex maclaughlin longcroft zahawi mcmicken cottone uncongenial lugging wikicity jills leveler windchill ghormley launde bornheimer nantyglo mofokeng kampusch ingwersen worthily nhlers konte huayuan treviranus aleka reenacts prahl bazemore suzann creg shies cout coatlicue desilva broo aceval kikukawa sanyuan theaker svobodny outsprinted podded ryken hironori foronda thunderhawk arepa carys grun intiman vanquisher ruxin lukyanov mudcrutch booi pecari sadda markakis hankering blombos cupper visione safta lonegan picturesquely oldrich longnor regensburger madhumita huntingburg barnstorm loanhead checkland tesler pavano baguettes piggot ghasemi kavalier prys osias noggle asatiani olins skell collini chevin sloughing chass scheels thorntons noyori panh murt bluemont priviledge sandbrook jesco currach lenzen cboe shernoff pascrell knobbly maintenence lechery hanasi wnct naturopath hegang chyler quadruples pietrelcina shiming deely medivac worthwile irsp nagorny conrads nprc kragen earmarking ftg tretorn meltem 
oogway endy javel rattazzi longshaw sroka oximeter sweeties wboc bizz boldy dvorkin elkabetz ashti arkalyk frantzen downdrafts starcross oumou beamont itsu muhammet rafina remits stefen ibac welterweights zivko saffo dopp emar bodhgaya bourgueil redheugh plensa keese plainsmen himala carlon lape kreipe candee khandu beketov vlp zebroski devonte zhangjiang gallaghers nightstand breeching sunrooms gerben connotative crowed blandón strano bodart tavish dimitroff bertoncini kibeho konnie wahaha copernic praet roero cardamone petkoff scillonian visable rihan broiling mundum lenon belaid flouride goldikova dwango grev urusov northbank damps jenning ahhhh adeleye engulfment thrangu mcgauran spataro spaziani proanthocyanidins buchbinder batwa totesport ergometer smailes presby chenes zimmers sheshan sopore heublein swirly evdo nonni beholds janhunen salvatrucha bocquet bazinet cashen nssf connectu merrington snaggletooth odili latinoamerica marsili sist ardudwy kolari catullo keglevich witht tamasheq peguis pnin guileless cigi unrooted rayhan paulauskas temerarios degreasing smithsburg garaudy neulion brownings kauf huntin acacio shabbily omron süss yeezy niccol spek boombastic ragno papeles cherrypicked subrogation shortlisting boatright crofty biava björgvin ultracold durieu mccarley concupiscence gelt nyet glennville ariss snris fluoresces rondi marren jianghuai griess bilek eoraptor asid antidiabetic reygadas genset kipawa tardes whoah hunsinger sneider encausse graynor tottie chumps walshaw qiyuan parme arges malcolmson baulch iorg lre mixcoac rozan galeri iocs eulenberg snitzer ferulic josee turkman reasi flatirons sweetmeat goranson genographic volodya titina pennino orpha styl surette quillin crawfordville kaieteur xiaojun cocoanuts lindhout girardon fenham plitt shapland muris dougga bayoumi margareten dolia noordoostpolder ieb spruit barac sipadan lusardi ronayne feitelson arboretums vanegas housemaids penzo crcs flaxton superorganism proisy huba badda bárcenas 
plys ziegel rauschenbusch shittu abdelkarim pbdes arizonan otmoor markoe flot mosteiro nafas frühbeck exabytes microparticles meridiano slacken kameo kaegi godet cherrapunji employes zerbinetta babulal primarly zamparini edale crona maxthon kerney microdot incongruities kobzon schlieben pierrots jebus hoffberger lordkipanidze preposterously goodwick naganuma pasche wtm truncus mualla davidow diablerets sadurní seiberg diffract balkars jvs debashish falloon sclerotherapy westcarr sujin shoesmith llansantffraid julu hgvs kettledrum streetwalker spott nevisian resounds shifra valpo ambati kouno tamp aspi amylin toofer kirstenbosch corrinne barracked zayandeh yucky ccat hsas hlavaty lustbader fitzy lanese oocyst chargés knoebels chasewater difícil overspend shoman zhol yaoshi ldpr galvanising forzani wheelan kressley ignitor reely mindlin donda misalliance barnwood asile alker portnow aginst toschi bettison multilink ikaika cozad spacedev lubber maycock rotters penitente guntis proliferates butan aiha petke arzew offhandedly gunwales doumit signaal lams dextro bakka loincloths flutists calil lunched krukenberg lupis jysk belying acounts poulsson anastase ghostbuster fauci timmis qpo givhan ifg azeredo powerbase katyal xrf tritiya claustro strate scatology blokland upwell kmgh nsfc doutzen bunu xkr myoko henham helmke moammar ghareeb privado ivth chapuisat franssen arenys mellem dallan noseda borrowash uclick blueefficiency jannette unassembled weerakoon gitt freckleton overbuilt mercaptopurine budvar helmerich geysir reassortment ilab clarisa revelling transneft wolverley geddis derealization wette delmonte ridot allas chabi castmember bergendahl streetfighter heringsdorf cryptogams kopecky xaviera flinching oppressively coning berardino poliana geotextile screencasts zemaitis bugeja identikit puricelli botwood tierkreis feve macknight kadazandusun zahur shallcross kilvey kalalau piras wasg kopra kihlstedt pyrford gottschall expeed morrible amga calmy ikl motyka 
superglue antiinflammatory geezers schlecht boliden anani moorhens wjxt esky pingel woodsy mugan ylli merrells asellus kalona kebbell stranmillis tubau aures matache losi twite reinsurers dtb verry superfine lri arkhipova jibber rosoff naulakha fermenters stylophone shoney counterspy snakelike chanan zwigoff captopril mirow cleopas cronshaw tretiakov eichen bekas dockworker atossa kirsan nsel saleha ihg tvu rossellino rygel burgmeier ryazansky harir platypuses jimtown collamer cellulite vrdolyak hubbardton zahniser botnia reimagines guero intolerably buderus opendoc ranu laczkó thundersley fatullah airo propanediol bulent anjem peashooter postgres landguard hras conjurers nymark erek orleanians sindona forer gothick greeters ballykissangel rester pistolet kreitzer perogative eatin elvina kudelski stoev newsbreak chillingly aydan jmf actavis supernanny multifuel syerston silverdocs mikail galkina kerrin changé guai kingmakers markas playdate mssa comprehensives kenealy bushwhacker edginess vgp helbing toddla solovay levinsohn antoniuk baldursson jamelle hoback mollen blixt jmr taffarel gozitan bolometer braunlage mabiala borghini flandrin libe radicalize octreotide homogenize irrigators hanuka zeff meilan maluf agostinelli narsai lipitor nifedipine akhmim tripti guitart saavn xiaochun rotberg uckg erdf massamba cilgerran adeptly ryser everist moomintroll overwater talansky substitutable mousetraps gsusa knmi beville benicàssim fashola questi malmquist piche feltwell reattributed seiichiro tande psychoanalyse melniboné banzhaf bouveret stemmle mcnealy tude diethylstilbestrol rendre nippers inacio supercruise harpurhey herrenhausen lipoma winogrand riefkohl heslov petrilli micco harpham saturna rosemarkie isara maqdisi julyan abasolo shiftless disbelieves yilong manzie carcharodontosaurus brandolini flacso nolita edds flashforwards tunks baruto hotte waipoua rensch drippy contentedly whiteoak sidel christijan niello longniddry brcko diisocyanate lagrimas perenco 
thoughtcrime ryad venit walkabouts copses againt hto venusians mayron stainland monied mulisch eminences chartbuster pensford reoffending mbanza tentation pafford argentinosaurus grushecky incb kawalerowicz exagerated wxix bricmont distressingly camelina melawati femtocell authoress havili niehoff philtrum dowdle parrillas johne bathwick koret mockers klippan yalong unowned furo cking sirènes citycenter gjoa sieberi pavlidis kcci chamique ghalibaf boedo zanes pankov peapack erandio riordans absolutley kewl unprejudiced alwaleed eggan nethercutt russom ventry maquet spiritualistic karkare ance heglig engracia koeppel bychan kierszenbaum kluster homfray zwz iacc grossglockner mulet ehrlichiosis raices modafferi tedglobal seabus kuam xiaozhao kellee niesen kopeck cafo cotidianul eamont coody dalbir rozi hvb maben niessen harnham mitarbeiter morigaon saddling nazr entourages cailloux antoaneta stankov mphahlele obsequies ostern sicel yachats etxeberria sadists phreak blecher ceratosaur sinjin floorless manises bialy umizoomi déborah choonhavan hinchley geringer recapitalisation phyl aals decaux crashworthiness westbroek farivar shaftoe dairymen dundrod ergotamine brookstone wapner eyedea wagenbach espadrilles ryberg smike jiminez nebrija gerde pugilists bellmawr zore wtvd resells pust plaids cyberrays wek oborona esmer gerwyn nuan cheapo sealskin genin kinmel amien catechins gautreau kirit coolspotters konishiki essman voyles ohlmeyer xandra clementis hiep molluscum marrus raziq arvey borates qudsi biw hensman puffers decreasingly narwa sanj kawar spheniscus unisphere scodelario syncml gajdusek springman aapi santogold boughner ashhurst zaouia furuhashi santes selye jmj matsa lancelyn provenances sandee lonhro htx donté tincher medicinenet plateaued abaga mercuri lesieur recourses putas disklavier mortons countermand intersil cataplexy unsubsidized nahta letarte bozhidar freediver nityanand garabed baggers trivago cilegon atencio phantasms groundstrokes mdct gvm aalten 
langfield galder meraz fictious basharat yant heartbreaks dierker lempert muha scuds ceq arriagada ryb metacafe vaginally farsta bristolian osterhout hoppa sabca assir manorbier golinski briedis pasricha zonday ipms antispasmodic mistura khasis undrinkable athletico rikabi fröjdfeldt stefanski hollanda mercredi suder norteña regales amrinder shoshanna shoebridge lizárraga smss huaman isabellae ekmeleddin rogovin ercs opw heubach bellamkonda somani rashba gillanders koetter lanced weathercock mcgarty bandbox anitra rorqual cutsem quantal artcraft dtf dermoid pantycelyn somnambulism amasses pallonji laywoman electrospinning thirsting tritschler sarwari cogency ashover levisa raeford glori kleven diadems recommendable aeternam gerrick prosecutable bogar chiau iheanacho kaag tadiran linemate costabile lingan multiservice tsur presupposing toyohira meale abengoa sixer savusavu jeannin masatsugu tattling tamsyn monsef gulland pianka jignesh chanco bouie luath sope forsa energise mcleary enb baristas lovebox hangeland snet grammies bagian premenopausal seubert plages fiddy omnimedia cowing torger rwandese viraat orlistat kinnersley kamaruddin gottschalks klukowski chetwood barstool wuerzburg niavaran teasel wanhua errante karembeu gulda terreros trudging masco byck donaghey graphitic yablonsky occlusions craxton kovacic stampers ryobi banques volgodonsk cprs hagge sohal speccy sked consumerlab bhaumik ferko higly dardan errick retesting misbach magaliesberg kabbani vlieger belzberg hupmobile enculturation olimpiada compatable antman bagirov bagpipers goofiness fanchon antelme fuzzed idms dadaab micali wuttke tatenda whiggish seher reconciliatory moncoutié dramatising dibden dowman gurin xenotransplantation jkf unprinted cataglyphis gigantour positivistic kelpies mansudae thermometry gameshows livs penalising bibbed cuk transshipped minarik paleographer ifis feigin mururoa zyryanov laureles ruini mogae mraps nerazzurri pice kakul seidemann raivo maximino esterhuizen jacey 
hambali copé morford nagaya mellouli sinecures primorska bestest eyeballing macd overharvesting absconds antoniadis anesthetists teston ertz bushtucker exi maisuradze ambidexterity larking australopithecine workmate mcmillon hegley airds graveled millilitre eastbrook yake sentimentally outsmarting carjacked vorobiev thrillingly flutters tortas wickrematunge azita ,where maltzan wodaabe sccp itaewon bikie fusobacterium samadi navstar educacion polyus haribo conventionality rautiainen consolo hypomanic hsph shangkun winkelried guiry grrl payg audo chatmon maaa oomen mammalogist butenko saharawi donee yir gibralter canoers catts postum ajak unnacceptable layback vlcc rapf colwood minal sompting miyashiro ponch segways langenscheidt tempsford overrepresentation kvinesdal tiltman johaug bombeck nsls colsterworth buche toady blackburnian thickeners wenyuan deliberatly rosenhaus knoller kolodny poplin lustmord tamela siendo copiague eev unblinking derülo cargolux pates gwaith klinck dzhugashvili montagues glahn brox avms gresini wtoc jeoffrey fout halban cliente pkware henbane mothercare bellybutton earpieces videodrome seing amabel jhp cortazar freshfields boops mazzella boym kistner rodborough rathor hentrich neckwear majko waterjets sauri benzon autopilots platooned lewisboro iuu kisiel idealisation bonifay roxann greenline rbt normaly maroth bloco gillain horenstein adrie steege rewatch iweb catlike banri patriotically lewe scheuring dräger snowline dagobah plange okimoto affliated houndstooth rhon luxford goldtop arthralgia cesky hagey ayoreo daalder mongala penz rapu alladin isohunt essentia pokhrel cuing frenemies velocidad ogbonna linet kuhns kaiparowits brabec rothblatt grossinger nirad enwezor tooze vézère tayla bloatware swh rompe netherthorpe transalta silbo grenda sridharan spectacor crimefighting louima genoud millilitres incentivizing iweala geoglyphs malloys easynet meladze boldak baldon aghajari gorak qaitbay chuquicamata offically chocano melwood awsome 
waleska salicylates ambarawa biffi tyendinaga stotfold imclone minjok runabouts oltmanns abdelhak kebe conures poweredge rowans dyspareunia recolonized prezzo smallbore prinsengracht stanborough vandenburg whch camming standoffs lutèce luwak cortinas luxenberg duratorq degc bussan bouyeri manzon irbil ekotto bayman whippets balikatan paddleball niell wiggily adolfs ifoam perrino wlross toyz gaoled hydrangeas bougherra erap francey baute francy hegge reinterment plungers zivic zelin countrypolitan winkies ldm yukata slimness posti kolko mahorn checketts lhm bohbot braying tunisi sorella thorrington turds yangcheng mazzanti jcf danida demagogy tkvarcheli schlein bronchoconstriction falkow loping jeita osumi noddies manorville kxan nlos brodgar posers finos quangos shinbashi qai galdo caking terrey murman palmerton eury cystadenoma cucu dolbeau borjas goalby farzaneh greedo lauaki nariaki maneki chatri hurtgen hsupa nanofabrication dataquest eitam intially cabbies frump rebury gafni woodies sopp mobilink iloko plunders multivac metab chatwal centerfolds dusi mytravel shayler steinle ordinands mease ravencroft papademos aleesha derailleurs dessoff dosha glasbury talash namechecked purnama arganda calhoon argentaria némirovsky averitt afuera webroot braceros mekorot roschmann ravidassia wez darmody mkhize klecko fleeced cecere breyten tbu conficker nanjo stinchcombe choderlos pein unstinting focht kobal judee lezgin koleva dalilah kht westminsters zambry pialat perfectionists bylund amadei arrecifes sdsm safiya jacobsohn cybil ssis tevere ipaq bemerton barsa twirled vanille mellini suckering guebuza aquanauts spisak swiftest vinet greive goslings tigerlily rosnay mizos harlen wesolowski ceska batcheller ryazantsev thurlby hennell eleme fraiser irremediable polasek janeen alne yitzhaki jawlensky burgettstown affordably zewde nocht hulet icaro basdeo decklid zurek sabean adjourns maaouya overmuch nabel mekelle jiulong atran anthimos fettiplace salesgirl gibault lineas 
hatz pedrick campolo jid teatri grabau altruist heeler pennekamp omaar vtp wsfs toshiharu haberfeld wincer bulmers cennamo soirees abalo chabal bourneville chelwood adages aimard garwin koepcke trimalchio shinbo riddims gottardi badwan carvery leashed schaufuss ballybay pintar weena micrometeoroid papadopoulou proske humanlike villatoro gazers rescigno cocody bosher abolfazl manukian hannifin lamphere gamine coverley tippler screwvala demong netjets mue jethwa txiki citabria wakker rolvenden semidesert audlem saurischian paudie notetaking riddhi danley alberg maddeningly crudeness pagulayan acat mossie patillo saipem crabgrass nanosatellite defination bellefield conradie megna shaer matalan karia dharker seetharaman jathika oehme pyracantha hitti luckes invensys athelete rvi oxeye hirer quarterbacking minkler garnes seribu klaveness msgt gncc krizan disqus vigurs degussa petersons gumprecht riskiest podres gangeticus talarico poth taseko spolin wrightbus semmler methot mbenga pertile toubin shamong ncds asfour wellbutrin romanticizing lamis jashn lazarou assailing kamins guimond oleifera chiefland suprematist parthenocissus banaszak pharmacodynamic akila mshsl mannini tassotti steeplechasing honeypots péché gorefest reverberant sherie abena flighted pucklechurch lavonne montebourg conversazione wfmz influentially sonde requited creepiness ozcan posselt meadowlarks bunchy haught fluorescently tekwar burkinshaw pinschers tristam palfreyman strivings bresnick posch thatchers rilles dunstaffnage adjaye pacher smolenski schepers bushkin decongestants greenies miet smallworld cryptozoologists kielburger krishen behnken avv boretti vigilantly shakila petruzzi greentech wrockwardine lere verey macko xtm britvic scal thodoris matthies ventola zitouna polydactyl soapstar lamming pontesbury isat ptd polinsky tanen greacen mamah tubifex sucess gdps sibon eshun sunburnt wollert gwydyr xcaret pipistrel tolles loie elvy methemoglobinemia ginni proactiv fiberoptic downmarket 
escucha vandam metula nodo taiyi raptures cfca raiwind scalloway abjection markopoulo obligingly shoddily persbrandt archdaily chioma bellido pagliaro megajoules suggestible hinkson bivio kakei pancevo khaldoun artforms grodner hushing forstall linthorpe ousts ustaz ayot bloodroot microcirculation bigwigs acrs sidepods makinwa bumgardner kinson hanksville sancar salaya keth rushan sheheen unremittingly kupi pemra durani otm maberry nhulunbuy lingwu naturalia dager lichtenfeld pukki romanija haino umlauf bmac musicke sahelanthropus irishness supereva sekong wahlstrom sheepscot kohlhaas bloomin ultratech kretchmer intelius delagrange golfweek minnewaska isci faty valiquette tourne larusso guermantes dateable franglais zolotarev louche hurok rolands muscadet winscombe avvenire beantown chellaston tacchini unremarked buzby roomier piously despotovski vinerian brunk lordswood sörensen mezzotints tamesis francina vallarino supercarriers onthophagus detestation erzerum royces yanagihara lineaments hristos inbee yellowface egu regionalists ponding psyops zacks gunsberg psychotics hosken tailender chicama conjoin ngh nzf haweswater kedge intraventricular xers gunflint romps uliginosa surreys hlt prettily chako zenair asomugha kunzru borsos boes jarrahdale pamintuan matrixx lamentably kierra bonazzoli rowes skandinaviska mcburnie romanticize vuvuzela moisturizing circovirus castrogiovanni basilea lappe stinton ponceau wolframite adma contemporani giuffrida outshined foncier quos khaldi fenghua sasho malloum tulafono tibidabo roychoudhury televangelism falletta glucosinolates whacker roset novenas guga hyphenating subfolders malveaux chuffed jumel ansbacher botching droppers fritchie konaka bamfield raissa significances langeberg ogidi meting morritt maggin mawi whiner mondriaan ratter lerew torishima leinwand grigny imara torigoe rzayev rieth henrick kyrgiakos feffer velos rooz daff chadash kalkan giannetti shibley langeveldt zhongguancun saidaiji panaca schlanger 
reproaching kni rustie honkytonk maisha squanders discomforting hamachi freeholds schs lirio burpham isme droops toonz tanh freixa gomersal viewshed lasar zayani balancers puw malorie borgs erzulie merriwether ksaz diggi genette morss amenophis mcelrath clontibret froots keoni limner chuxiong lyburn phrma reche chidester amounderness paluch colbourne sariska indubitable tettey shuwa tyahnybok adicts harked fhfa arround gyrations ruaha discuses medland jpt gyron ght lafreniere lumidee gravities mercan biosocial assimilative parasitizing oilmen seadog ethelburga capitolio malariae biocidal gope bruccoli interquartile bassen producible venema funchess windbreaker dimaio selcuk laxmibai richters schroll perhpas orbus arthropathy stockinger floorpan kiszczak loughead araras estridge gorry wildcatter yedlin mappers guthy andrii vasospasm baraan indispensible tyrannies tolbiac landeros gunnislake freecycle jobi twinjet carras shorouk maschke perquisites shafroth agstar yusupova sticklers whatta nohl nijdam stoykov bootes juncosa jessamy sujoy defoliated mientkiewicz anorectic camarones aftertax copperman waterbrook achiote cqs waurika maybes psoralen langmead gaudete vernaccia ginge neuropsychiatrist gsas ednam yttling gilbertsville lamantia chodas spazz cavos microlights munif oestreicher cheatle heugh balamory armenio pcms dolgov kahrs esencia renfrow acklington paektu eschauer blackbelly hongkongers graczyk eeu ahca shbg rulin greenness sharonville miscue nyhus acclimatise aerosystems avrom gengo inconstancy hpm qassimi reconcilable vitek llandrillo kanie ghattas hysen lamoreaux langauges maseno multimodality superbrands donard worx flad shinju greetham harsch foreignness cagr baichung totty refuseniks adenuga mountlake krumme topflight propecia snorkelers davro muyu dobrica annaka submittal halba soulié rieck glickenhaus snitches whiteknights massengill aharonov stihl nake overdrafts chandrasiri chion daytrippers sollee talibon incautious bathos bleck finnessey 
bouazza westly babenco lundborg jungwirth wgms biggert parasitosis shambaugh mandich speightstown superlattice tolos solman oldpark aerovias disallowance bloodsuckers surfline meijers eyde totec gabbidon flexilis bournbrook etherton lairig fortinet zuckert rodanthe saawariya grimaces hurlbert afoa jacquire rowney brumidi chornovil civils acupuncturists trichomoniasis moneylending corboy tripler badder codpiece eifs eryk feraud manasieva thissen imperato etno portends lyashko phulbani mattinson hufflepuff puckered raloxifene iruña gbf figgy upreti marinades abecassis motts sundeck cryptonomicon mumblecore pietz gluzman springwell esqueda opin curae shafting nebot capodanno takeno joch strikebreaking dimity pempengco durational chalvey dalkowski correlli njs gwersyllt brandstätter sirica warmley gawk paschall mamola platooning inexistent fauteuil ninis overdiagnosis netzarim denters kutscher throughway wztv pawnshops furin thabet elvio moakes huasco mamay payap miralem umgeni warrents tanki hntb bodysnatchers mannucci undresses principato roerig gerada enio stoltzman shurman slory devilishly oneto nkosazana presciently bladecenter fukumura ranby mincher widad housebreaking vaporised bailer ooms bengie saussy shavar nervión dinovo rabaa asoke bellver thringstone duzer defrosted winternationals hyne akingbola boultbee underwrites echeveria unamerican mattin egemen palaeogeography lambsdorff argy unbinding brigante aptamer marchio bcbg flatbreads earlsfield shofner dunay ecstacy jiwon jaenisch udy javal potes nimis gerberding torslanda salves busboys waveriders houghtaling lehnhoff masuzoe conceicao kelechi matley verdicchio flatau asdrúbal larenz capewell mwb travelator guangcheng geotagged jabel gamlin blissfield darwinists malpeque besley eimert stamfordham transaero exploitations larena thall lukashenka fermentum chimoio purp bakala colegrove gugulethu xenocrates uninsulated laskowski theodoridou fsx prusik rgk werts astound kuhrt olema meditational amess algarotti 
wendkos ashrawi practicals doubleton viscosities tge spaceboy shoka wahls szenes dcaf bresch handgrip navteq shyly pamper shokat lasi shibetsu washford varias zuroff silkroad orh karmah namedropping macbeath benigna derrington bodor fairlington nishizaki letsie bolet explora eschede talgat qilla chitungwiza dubie viriathus akili cryptologists barbies howgill yanchev seversk rupinder gardnerville gleim mermet vezzali boseman myun gabions boyers overreached leoluca zwol levithan xiangning collagens kristinia leslau edtv wlp codi bielby togaf vawa wrda olev vicker johndoe hensen iov chuggington gardos otso teir dodworth hvp protostars morana nfln giric kigen spitter overscan harat walding shemtov meeta aepyornis mezei schayer fibroid yatsenko einsteinian hobden sprucing thow marinol meguid ruppe perjurers cepero koshiba béchamel aspden verny changeless jene elektroprivreda angelin matousek mortlach bobbled antitussive fluffer siddarth rehung xenograft overawed councilpersons incanto chipboard bernas dehumanize jibjab spermicide inhalational ahlin pigford saik gavage khatir mytholmroyd momchilgrad rori midea wega sorgue rosling wiveliscombe coruna contorno boffins aún bouwman yero awre blixseth hihifo milram arnarson superdrag nossiter cristofer riddlesworth xer wlox benini msic internation haury palpated scavenges footrest irpinia jcg hendriksen motonobu akyaka stroganoff alphin kollias manovich peterle carhartt kuhio tattletales trevisani gwendal pennings neachtain scomi grissell sharna papermakers tadej stargardt onecare scrounging grewe millender dhai craftily formateur merwan covelli sakkara bastianelli gerena rasm betley latrice colombus fdle blights bernabeu lilliputian pierangelo cossio lipscombe glycols deitrick malua qma huaqiao xinsheng boria rosental fiaich durgan bryag knowlesi contrasty dowagers yulong slighly slurpee kosheen dubov airconditioning europes therian dtra chartreux lrf tfu cordera funnyman atheletes minestrone yemassee freedy celera demant 
loquasto euregio vanderlinden clanking courteau scatting schily deiss kidan stiffelio rerunning wiuff strelitzia evron mistype micklegate kalnoky otunbayeva slowe alsatians leszczynski khreshchatyk tsitsikamma stensgaard gugliotta shakai baratashvili cimini yasawa motson pxi zolciak emmetsburg riascos eustice kimora newtownbutler miniclip deemphasized jungfraujoch mercuria tassoni brazda groveport phulbari edgaras firmani kubby limin blerim raschi haigler popsugar visitscotland tellis staffroom birkerts wraf darek mosko epton radiolocation oversite ombudsperson lambot dunville foudy tiffs cenelec moov cleminson hogsmeade amanatidis kobach spartiate alkire tenuissima patels pinger condy rolim neatest dirda hajdari beazer dande toones powerstation ilec jaybird dafeng medhin finklestein skywriter yary megale sipc rajdeep authoritarians prabal suya parros publio sulfamethoxazole zano encouragements estec gamon kintbury evered nonfatal godon bacau mpondo mobis romario delfont gorme demián glenluce liron werre mosop kakabadze talhah dossett galon ivorians kourouma seyval voiles eskisehir sejersted sheepwash thorius aspira berwanger onl intone winfrith sunniva mainers ayed mkg negritude leyba betim dermochelys crewmate calmest rossmoor volturi raniero heliostats bulteel brigette mangope starzl wildermuth gherardesca baitfish unboxed danesi tariel hildale morelle siat stefka follia agayev jonás steelcase photomask ishbel stachura tirmizi sunuwar buehlmann margelov kitv anote vorticist catfights wysong wideout muscadine gilchrest villians switchblades teutsch rdas restaging gallions kenter hitlerism cusson duisenberg gcms qcf trattou fitoussi itns birkenstock lisping alzado danniella girgis saleslady vukov chupacabras gunstock roulston retrogression salloum vassileva smerch siyabonga meisinger lahinch unice sperlonga hemlines rampaul giresse feser doorbells greenore starobin jambon bertrando tannoy ledoyen macmillian dobsonian pavicevic hollamby tocotrienol troian gft babji 
okalik escriva rajbanshi savoldelli preddy stethem shovelling fuzzies bffs storkyrkan silverhill bluetec maaskant angelillo olisa kabunsuan markinson saurel goyle interdit guidant trisakti nugegoda candaba capaccio paimio gizo arbeter furen shippagan nely animesh shrm wodiczko makram rohloff demare jiuzhaigou crmp muzi ordina falkingham omand suzlon siedle scaglietti piotrovsky pescheria bondies hopefulness smy durlston pharmd doormen gashes hexter unrewarded pentlands psychostimulants feudalist standifer taschner konate gillin pliego cacs hicken verdura condict dalecarlia acmd groseilliers waxcap urias jerel mylonas sakellariou wangs bigbury mustoe topete poled smirks llynfi mobin phab gundel misers declinations dallinger aerator hanneke footprinting seffner treaters marcianise burnish jamail robbs niñas cheston mzimba freakishly isokon doxy kxjb parvaz crosshaven lookingglass sambi paciano cockleshell hustad hotlist verschoor cifa carafe kobelev doney lumpen bellavia protazanov stahlschmidt herida mindtree ehrhoff asgeir reappearances gambrill inishbofin stadlen belonger marchionne nka ziq lychnis maeder pentreath nhr levanto yokoo spinothalamic pratto maquiladora tomlins maeva tumas noyd desikan petechiae kanine riteish helplines fefferman adelita liansheng yardarm degroote bezige jeebies setty medair jmw duvernoy biolay amaris vdt sanal laam miramonte livestation remen derkach impotency bruss melies konecny gurdeep preconfigured moriarity nystrand hyperinsulinism klyne pampers stratmann choisir muhsen yumin canelas benzalkonium murrysville bernes antipholus ezeh campout zhongxin maltster assche damaru dunnage bunds adastra tindemans goorjian jims lijo katten hotpoint becuse thomasian tidiness tmvp zhongjian sufferance zersenay baralt cioroianu sarkari chaitén guaraná illiam corless hassanein morys conesa llao hishammuddin shuichiro zarifi bauls attalla wwu ackerly ugrian gdst asadov jangmi hoong assents bja affandi emceeing usag genz mellers waah techint 
cheddleton nute spdr cokey ashburner fakty giannelli bührle gorlitz brodsworth nhf kazanjian teddi motovun koci amerks breithorn topcoat stohr callegari steiff kozar longenecker biers heldentenor caffin armelle coddled hazels dutchy damaliscus clavicles nephrops roey jansa bromate beechgrove yahyah stembridge veltri otton firebreaks adeane davari kinderman hayfork sabath geordies paharpur millhauser epigenetically kandia sundon foulsham wawarsing workrooms muckers strelley morayfield qvt graining piked tohill stratham smyril ecbc mclucas bisso routable antolin feedburner bindaas gromova suelo grassmarket checkmated denilson lancang huebsch metier googolplex fascinatingly lybia serendip norihiko grottaglie brodkin rabotnicki agbar grecians bowyers anchiornis qbert turab burgio tayback geschke sawhorse fleeshman exhumations perfectv middleville elsasser cranswick berdahl camco arreton resistence jacarepaguá werbeniuk weisgall misconducts militates codebooks njenga kadison glenbervie daigneault skunkworks intradermal anj qassab datastream sinisalo kvaerner walthers friedensreich wesham rageh perryton kidner kandra abercarn cavens dovi subra hender carsport furmint schjeldahl dannel landscapers pigmentary fbl würtz furbish sarcos stuc fenin remes edms memari locicero dorfmann hodapp teosinte watkinsville fountainhall seith inspirer jireh issuances oceanology rawk cogley victimizing measureable nnt kesri pastilles oseary wendron oleoresin greenspoon rouyer pencader hotung tonn klengel assertations stammler abiword autech heene kligman espcially bacevich afganistan entier jpf adomian wect deets winos zenko cyanohydrin jamshidi rajadhyaksha nockamixon rajmohan ajusco transection villagran gowling moultonborough sikma bahah samdrup qsa munnerlyn basciano wirkola pezza robaina particualrly prioleau hanspeter bitner knsd mbengue sheberghan flinty maon apob membury pockriss hussien deerstalker hendricken vindictively laurene chinedu someya odourless notter bluma heddy 
mcgeechan nabateans sokha hammack prognostics vishnevetsky progess oasi lacouture gwala razvan shirzad hugman vanette adresseavisen aleksanyan zeichner babas bews pasni bassanio fleshly ophthalmologic binkie dadrian cesenatico geduld shoen cipro aquatints sasin poelvoorde buzzanca orthotic mingulay thickset bbss dilson hazarat oswiecim goedel zandberg insted xla glamorized sheran reeta infosec unengaged muckrakers marianelli kippers buentello rybolovlev auvsi cesarani kotchman spirometry markfield sakoda constancio avellan planchette deaderick mope wheatly mixner gusenbauer cfac diwaniyah infosphere winnberg stiliyan appleford antich seydel varco bambenek downsville tulipe vabre geoeye adipic heebie glemp jaleo pomc klees crowberry workless rueter lapper rongcheng radiopharmaceutical nalgo ceratitis sensationalizing derma siput daljeet storebrand hizballah anonymised hids scee luzinski sandars carmello filice rotas huur asiya zanda duttine dixton niceto merrison konstantopoulos acetazolamide sincerly picou plaats uncompelling skycycle peatbog yutian piner dsos siegmann gkids timofei tompion jakobovits dempsie eryri impington beauval puchi salway jins cannibalization rivarol shawa mangaung godstow kellenberger schieber ofqual outloud talulah instream dolle dolgarrog piddock limi yuyama lustick shishapangma mascha furedi tropiques prefabs eisenhardt cropscience earlsferry tilth ibraheem parliment nephrogenic eaglescliffe jazzie ethe eastbank biggerstaff usurers lourenco bewes intercropping europeanist cymry szczesny knoke keyte roopesh honorato weepers mikaeel queijo ayanbadejo moonface limthongkul markethill zagunis lcos altig hondros steakley illah milou merrilee illarionov albeniz yetis shier thurmaston frak moritsugu rutted ikhlas hamdaoui barsotti unselfishly furlow bravi godwins cecconi antwaan unclearly changemakers byob kookie terseness southchurch yuhuan anycase reinga isolina churros moiseev cdsa pridemore stickies fbg shinners mcglinn echs fewkes symonette 
fasil mezza therme aggelos moonwalks orangetown yandle subparagraph jazzier cines ullico hitchman endsleigh trademe fernery fawsley manatt voge petrassi qurei suau rimma borracho stiner éclat drem forestdale axelos bobsledding ramfis diani baxi anchusa hyperventilating vougeot ostp villaume cames rajpipla kolpino licet morellet kettleby heugel sumatriptan demeure ansary teece batia delvina indoctrinating hemiplegic teko workaday pulitzers pyt lelie kiyani gemologist corymbosum darcel franked shiono lapuz rabeh wcmh viettel strykers cherryfield mournfully mosinee rozanski sohm compstat zhixin skeletonized normatively béarnaise nestler snodland addin kirkliston demick baturina wasner summerhays unreduced audiocassette tmcnet aucun bichette strine sulphides chainrings oonagh whiteclay pachman falcinelli abimael deckhands ctbt collura powertech explantion florrick hearkened seend skenderaj ulcerations tennen camaret pulvermacher fusaro customizer basca adik dumpers blace specifc deluce zgoda alsip smugness hublot bioterror pettyjohn geat grutas wtov montelongo budinger reshad sonneveld rmo xpert cihat reassuringly madikizela cantle calciopoli rezaee internalisation kurka keddy cuajimalpa displeases deliquescent sahani trommel jinchuan ratepayer beml wbcs chw labbadia wintershall abstergo overweening wahida ullage watchband manina clouthier atin relativly keyon toyosaki excretions germon warmongers narz speth jungk stonefly shoul bugno gigantor palamau disingenous seige behaviourist vaginoplasty guyan udalguri schyman bilthoven submunition goldcorp phyllo wagenseil luyindula streetside ivanschitz groaned mohammedans dayparts medunjanin gilkeson lechi welburn kagari valere ˘ mugabi pmoi smotherman tateishi anner chersky djourou palley sooden eleftherotypia chromophores airbridge utsira narsingdi amplifications chimerical mcmonagle staplers colborn kuck abrigo leadgate vanasse kumars abdolreza ezekial commissionership partys morrissette vocoders ooltewah woodturning drtv 
wdw chage launders charyn matalin enesco sightless roseline bipropellant hohensee kiken liferay khori mourneview henzell pierian scoones compiegne anglophobia viteri schortsanitis samsudin precipitators yanaka pinx freshet izhmash friman minnaar betrayers neurolinguistics endocannabinoids controle osse massiel bruinsma miragaia pejic hearkening alee hamud kotche leuthard myddleton gasmi vender ventersdorp landport ervand kingshill antimo gedman tiddlywinks maceachen moonglows subscapularis apurimac elkem bijon slota tarsila waen nasw punga businessworld crunchie shariatpur hufford breazeale eyesores helpmate bifidobacterium juventino thion bilimoria shoutout enochs bachianas usery stiassny resan schefter stoplights personajes corpi lzr micheldever sinaloan herscher natalizumab hlh froehling centralists rsis untarnished misgovernment anad rimmel seli britwell wetteland peetz putain anjema roccella cwrt emboldening labrang concealable cleverdon eynde showtunes mindfreak chintan sinor caseloads unionizing descision bater wanja tstf morningwood nicorette citronelle seaming katehi kkh lassitude toynton gamertag natra brijuni tolani vellu kogut cheddington augarten teza guitarrón stumpff fleecing anikulapo grottes toia jaimee strensall communing ests bouchette gansel pagliarulo filppula ateret mcts sturry bologoye weoley phazon surprized lieberthal dcvo cuzzoni dehua steir ribbeck izzedine elq larche anaemic garfias brackens kickstand cooperativo clementines wtb bandol morleys xavante gkp robens dreisbach ofws amelle lyndsy zarooni sphincters ouzounian buffoonish jayasena discrepant azizan johnette mcz janneke mastella lipizzan darrall kohlhase fefa nagakura kasimov delatte darenth reichart wormy tavano jordanstone lasater methamphetamines vedeno benzocaine protectively maerten nyiragongo zaccheroni anacaona kernville bibury goldenson reggia oeschger winthorpe thela fibreboard barnsbury wrd debolt kipkoech goproud straley duralde goldbloom aivar goosnargh piedimonte 
logvinenko haptics yasgur jafarov wgu superball turfs carthay ntaganda paing ballistically annihilators blacke harrys oracabessa forro shurley zuba thurleigh zydrunas magnums artworld roei giannino religare fundin innovates aizlewood semra kenwyn ouimette sangbad blonay moisturizers wonnacott gruhn eddleman kesteren polysomnography groundskeepers taffs illuminata quickstrike milcah autoantibody saumya cruciferous recker leandersson neustadter sestina harjit prestia jimoh mudde regnerus bockris utkarsh paramananda epatha communality varient nanson tarakanov reids sabis ciit cacus becs fumigant crimond inditex maheswaran emera proelite handclap aumonier phytosterols niederreiter tnbc lipsticks phonies ullin gentrifying bettmann osterberg evette grayback sarif alehouses dovetailing cluxton nuzi gooney saffah slorc huanghua tuch sultanas vimla armerina tomac abdulle seybert hust heacham haematologist lucidi mamaev fagor jsx nonviable backstrap kpax zisman allodynia kibosh olyroos smolan lappé capitalisme porphyra quarryman senia coiba parkington morimura oxidization reedbed katcher wafi replogle pencilling kriegler aliquam metus fairuza aioc cerp arbonne makenzie indiepop supergun homestyle huevo stonebraker lafell zeidman recuperative chrysothemis unlockables abdollahi dewatered eyler warnick laniado haeri palazzina braemer hosley dohan masuka bellerby zanja minnpost towyn deflator nikitina posin burrup stitchers dromey rá cacheu siyad lockney blit jiggling hirokawa giya massinissa sheilla chollima mularkey zuoren eridania xiaotian nccr addonizio saxman koumas pinarello baldomir clemmer ktbs inculcation marlan jehanne wiswall semiquavers ventanas speas appignanesi komoro nvqs kowalsky tetrahymena buczynski zindani katanec aldag beccy gradishar yerxa hogfish datebook grandmaison kayam foreshocks carreno branstetter kerameikos notturna cocozza josimar owlpen aguinaga chulym bersham storni muscularity tannhauser gangemi sentinelese summariser paravel vassos wdtn elop 
mishu meylan butin bunke pocking hdacs mahamane xinqiao klüver mobitel nanocomposite afognak stonecipher mossend rumtek gilderoy doppelgängers nissl suvorova zachodni fowlerville ssps sevres whippingham averoff organica gooders niteroi brasa ptes glasman saragosa breadline svf akaash grimaldo dilligence gbt wussy namby ugochukwu flagellants deneau pavelic gilhooly uhlich lebert mardini mxp ahmer wlky statman kkl temel helmes deputise hilgenberg quadrifoglio paydirt turteltaub pochettino anointment hcfc cheeger wretzky cunniffe grandmama sharn baraza mccreath salkin ruti eiriksson meriva horam sydsvenskan verdoorn contretemps homogenizing subleased glorieuses sobin engelberger overlarge khandekar tumorigenic lalami micrometeorites wely clubby mocidade feathertop allehanda rademaker thekkady carpegna ipab salum vendace quitters mallat bindura nectars wabtec cmbs financings hué antiship lighton fullfill procurve bourcier beji renseignement larges tacuma wembury piltch saher superhydrophobic ineson fondled condescend leithen atcha initialled shillito operationalize grolle larrazabal unmovic apologetically ineptness insana redial zver windfalls squeakquel nowrasteh vladi aardsma pneumophila ceriani homeware dazzlingly countrified fledgeling adesso neistat andia forden saughall gardar nacewa reorienting astrologist dowa hobbins crownpoint nagareyama roundell feck moelis qataris sural consolini depaiva couts vws ismaila ependymoma söderman teobaldo reems cafod santen hadag peltzer saddlebag fullard drut wdi ionut goodway launchcast nucleoli coscia connerly pisoni hilleman pascucci powless tomine tatge maxam vlogging cprf paur everland beitzel buriganga cyres columbaria nexter sakakawea koguryo muche awassa droeshout mcclenaghan tedy ccap cayeux lindskog oswell silicified usdp scalfaro nng kechiche gbx mawle mutuo shicheng lastminute adjudications selda prina teutenberg ehd roshini kapsabet nabavi hapton rafiqul classicising floorplans roquemore jpp celcom yohji kabanová 
securicor marstons kese newsnet nppl cervelli hudnall blackhat almus fidgeting economicus tangney perchlorates leibbrandt vvc invergarry stellarium nephrotoxicity vienen tregenza sinaga stadelmann jabra lindop carletti grandiloquent heterophyllus sbragia hakem hdh schiffrin overtop whitinsville ashis miangul harvington reverends donaghue fenfluramine olonga ulk turgeman shati appealingly crolla beauxis fsas petitclerc hollywoodland axiomatically agps seaquarium wilmerhale treva bickler melaine martyna cgb digitalglobe trovato holytown ayisha exhilarated boddingtons kandji roomette nitv callups apprise hyperammonemia zokora axumite huapango kahama biji truckstop afshan lmdc jenette litterbug thayne photomontages dauncey hample somebodies lovegood bahariya kcm rallings bolgatanga dfk carolwood fluorescens rovno yuming whizzing alyas unlearning bomberos goolies trochowski fertilising humaneness mahrt sipped npfl scowen hospitably adriel crittall chikane layo buckyballs grafing shuping camarata yakimenko bieksa sacko detraction tielman oche zdzislaw bodiless clivia suhayl hallaton catalon dominga opes kaimuki imagina integris arkia kasmin lexcen munsif whitbeck mcnasty accessibly rokkasho chegwidden toguri iub batth leston münchausen linganore kernell nodi ihle severall pravia adolphson kez harrar gayest aucklanders footlocker lippens israfil canlas rainless maytown snodgress andersonstown hanzhi kingda laitman divisadero thoss dooher alfredi wuhl prorogue jenever anterselva orus soheil svelte nnu theat daragh larussa paskin kakure masayo bibimbap radkersburg durno amodeo stepmom jiayuguan parvaneh richaud bowtell kleptocracy ferren umea avoda weaseling liwei kalispel unchurched tobor iteso japandroids delaughter nationalizations ajuga patong krosnick khutbah turano dhaheri palombo dongjing ruhm crosskeys baroe dishware sheetrock hamhuis reichl zaatari clumpy rigidus stobie mezrich totalizing pargas mccoubrey misjudge eastnor bullrun koziol borschberg zhenyu ortenberg 
larrocha gleniffer glenullin sapara pueblito voluble danjuma nordbank rezone ngawi alac haircare berden ertan ranum hristova bananaman cumene oldring gyor brynmor cockerels hidayah alstead immobilizer ortman knxt snood passalacqua edify abdelhafid cymer jutras cuccia sawaya kuebler archenemies galvano lisbie chinley sellaband barff bohler grubman hillcroft ixi greenbrae yuyan estaban garble lmx magin outré darch ecolabel otavio auw kerrii paderno karamojong wiliams seppelt herceptin sombreros rednal seago audibility wifes premachandran ellyse lunev talloires qasimov thorazine boumedienne languishes wibberley fonoti tophane kathoey wonda escambray impeachments tenancingo heliborne vasya obenshain alexandroupolis wijeratne anycast biglow whernside atika launderer aisc cfdt mirinda gunja bowlen bleszinski moulitsas versicherung seikaly ucca mischel tomasetti bellina bodices bisan humanum lübben scramblers stieff mimetics golob jinns ihram allibone yalda tristate sloughed jouissance deale achal oxgangs carrickmore spaceguard sunbathe rusland yuo cowpeas radici monokini mosedale ferree puft stashes kiep undertand mcinerny sertoma churchgoer consecrates wallboard schoene haleyville gassers movieline pardal torkham bognar frant olwyn mauritanians gustavs hemmingway hydes cambus instamatic hanim elfstedentocht documentarians taslim nimni kuser indeterminable cawkwell mumcu lakemont bombi huntford dorning machame caramels peleides kyjov ceridwen breville piquionne traversa shuguang dixiecrats giussano disgracing neelie shadravan copenhaver readyboost acess cerd unscented hanso stebbings themarker tomonobu hypophosphatasia dunlevy audrius cllrs conformers sbac dificult muamer beogradska shiffman maull dudesons blowflies majnoon avocations dampens witn houlder mujaheddin stigers rhees sereny sigwart misapprehensions porush hamanaka maurissa moratinos auklets blithfield hitan moskovitz wobbled quitted motivos trinitron arquitectonica raincoast zeeb fengcheng commonplaces 
knurled tibshelf imide gerdts scalawag filicide juban derwentside tangental catarrh verdoux buav smadar greensmith ephebe toreadors repa carolers jozy leonids yeang boquillas risher nunchucks misreadings palletized munnabhai peloquin cholesteryl marilia flatlined lamsdorf lenalidomide ckoi spose westendorf espo graca pasveer salignac showboating bandying langsdorf herjavec coene unlighted lagartos baracus weinzweig sazan emitt illogically harbury kirati fttp dramatise morillon aperçu yitzhar chippings yaogan obligating billesley doodlebops agrio coccineum scobbie accelleration jobsite prikhodko briggate allmand unparliamentary haemon pieczenik densho pimental ifw lixian newpark delgados komba mascoma sawahlunto natterjack catcalls ritola parnassian delavigne gravesen laryngology nagori waran adye seifer quantick thornleigh isobelle narasaki pmbok hyesan scorning rogala alberghetti compote wimer panella depoe czyz curveballs silvstedt salifou blyde phv etiwanda falfa pagent hypotensive aphrodisiacs ahas charnas jinbei capasso naptha maale tolonen misplacing choppin cocq tweddell scrooged lankarani dorwin cruzan fujianese alisdair goranov razzmatazz csra groundfish crystallises hypersexual europro avto scowling ssts fretz nzt kaarle insanitary internationalize convoke teferi formichetti perinton cayless oberle grael jellystone dunsworth emolument iraklion afriqiyah menthon mykelti schlachter airtricity nanograms dewolff rnt endplate middendorp berhan gerardine laâyoune caserio dewpoint farebox visma paramjit orlac raitz aqs jomsom hoilett tangalle noades proudlock waconia sauceda asthmatics longabaugh casselton customising scrumpy unruffled raffray batho ottobrunn wna lamberhurst foolery overplay moneymaking vitalic grimstone owuld navone sustainlane burnstein unloving pgms hetz rocinante eegs thousandfold moraleja idara ironfist accl basen ishimura pimpinella cronberg iford knowth exurb bodysuits perforatum temo shizue streeterville adaminaby paonia mimivirus thorat 
jaidee doigts hepcat chukka aeroponic ringley acaster satnav daktari sublimates irus pisgat verratti rinkeby cantore iwelumo gardot harmeet gutch joypad waterparks waggner häfner tigresses capizzi buildups sphygmomanometer colleran khooni diogu utting epicene mandane sucessfully accelerando carasso incarcerations cinedigm wud rxlist mutilates sandline nijman borchetta wieters nitwits gillott cedarwood adone chryst blaffer wyne summering iosefa adir frontex remerged brinegar superspeedways harmfulness ghazarian mauren nicety purebreds melungeons aberporth oldsmobiles payá jaisingh ciénega afh deepings sandag mercosul kamancheh ksat likey llanrhaeadr bucksbaum emaciation pentyrch ichihashi rehouse discomfiture glamourous hga softs hawkin flashpoints hebrang maltodextrin bohnett sigala bookfair crimeline hajri bgg proliant choueifat wasan sonhos crisman vocalised cataloguer ihar intex abac menchik recalibration luminato umberger gatra lemonia angouleme astuteness beik amurri gibbous varvatos adebola convulsing winker putland hatin stutterers engro baim sambrook kirbyi chisox wolfsschanze marcinelle frommers markovitz schiavon bartmann bobbit koivunen luttazzi passero strobilanthes alipate quander selvaggio defragmenter treharris trounce monthlies tnfa narragansetts diester cwru hillers maurici oab smoller ommission feudalistic bernath qimonda blunsdon scholfield silda cordus cohanim bodle exhales heupel fickleness nonlinearities dulcimers kalau overactivity deathrow tyntesfield bealls resurged hickmott crau slivovitz bunner bsat brontës felpham coracles bspa karslake ebit dukeries barrowlands arats leibrandt shahenshah wittwer makavejev heyuan leimen silberbauer holifield mongan schaden etwall debbouze buffini pecknold sensitisation perv degolyer kinya canel komisarjevsky lemonis gjelsvik katariina shawki oltp zeneca siau tecton xiaohui frasch ulcerated enock handman casartelli orchestrators lucks nazira earthers oakford subtlest loitzl ardell diar spiritwood apear 
ususally stanchev bechtold buddenbrooks mahnaz vpx gelida projectionists spawar gele stomachache aaberg froide ultralite micol nalli lobbe unsalaried enrollee hfo adolpho claires gangplank donofrio southpoint desmin bready chalayan booky chillar thorugh ltx fibrocartilage hibernaculum buya marula palia berried dieudonne streetly hatab fischman macara thurm prometric dorward holoprosencephaly shanan sosuke thede deafened kuchera piercey osia netzero stoppani rankers fiancés indigenization ornish phimister magothy bedar moueix metn aitkenhead vinum ikb pnnl cretinism heighton ermakov deroche ctos triballi tootles consommé tocsin chessa cassinelli gabol burkey chandio bompas sulayem rousers aesthetes tigercat koretz eurypterid vituperation skeins fridrich freakum tendresse pocketknife naydenov pamphleteering thyne uldis fidelman kezi berch guishan poligny afghanis sansho webmethods moorad stupefying lukla precocity ruberg prebisch perkasie quetelet hoogovens indovina transmuting mercados beco zeitler ivalo sukie keatings spurway geeti calfskin además yalo enterococci tihama ackers wpgc netherlanders arteche shinier flopsy outcaste abscisic ebeid parijat dedza betrayus forsmark munmu thibaudet collectivistic talebi tatneft groomers koat mcgrattan factus slurp biggies analogized pichette angor commoditization ameena hurstbourne ivm auerswald ettal nutraceutical feer fcca maultsby subjectivities spanks exora couvert voorhoeve irrationalism gregorie addictiveness amerykah transcaspian curral rozanne regus ayerst bolick cheoil foschini dtrace bhut gibberd leckwith ljungqvist okoth dvf pottu gingell abdolkarim fossile khadar conchata kenard pretto kharma acebo mejias assocham eshbach treuer kirkbymoorside buyoya ghazab opio massasauga fialho moremi smolt champéry hazed ocx fitfully slothful stephensi oozed tuffin annear treeton antiquaires gfci tomelty okcupid tenontosaurus keishi navantia khizar batstone heartsease brunhild flightdeck wakata laurice abily ataur miedema 
empathizing vetro chanterelles mammas percolated coppard fems franjic lucheng siwanoy hiawassee wairakei dwar kooistra sikand colugos archibugi biopower sgrena jalapeno baffler durdle semidouble patau ljungman peros fundoshi maniitsoq butterbur buske vocalism vpo kerbing folkingham lindenstrauss gudmundson mehitabel ldo downscaling hanis artistshare senneterre rizzotti spsl batf grap animadversions melismas interposing louisy laneways chalkhill psma tossers borri buljan oncken soic dbg rehbein ailred prehn yoba ppas nyas amalgamates maione silvering hemsedal adkin rhines maerdy huser kingstanding skerne hody interrobang lindu solntsevskaya goisern guadagnino fatiguing osheaga bleat motorama permenant westie polyvinylidene combet grunsven barbastelle thoughtworks nbpa omro crumpton alltime interpellation berlino lccs charmless fdt precalculus golm tinca eaglebank loosemore bhimbetka drippings nosair solley nemat nithyananda nashat fasel stonybrook poizner tredway maybellene emk compatability gamesa thnk utl hongwei juicers chiaia tenleytown konkola khayyat nadas serrato brez kreft tadatoshi slye hurdes szarkowski shangaan newth kias contriving benadryl parman wfuv underinsured comstar bidford mdz pmx biochip cltv tofan euromed wikianswers mrtv cornfeld goen beharry mcnuggets hulen ogiwara lipschutz toub soleá vodochody tonnelle woul softie scalextric grindlay acidophilus paonta sebrango opionion multiform kythnos xiaoyan romie jruby kirundi dinnertime defour shokai afw kleck chondrosarcoma forcefulness wyness wakame deceits tahj maritimo pizer croci hegira rutley rokocoko muliaina lurches crushingly phytologist disinhibited gmpte convenors monarca lynches serfaty pellucid lyness distillations herger dilates ofrenda abdelmajid ansaldi jonell detriments weatherson risinger fizzing matzke rora swillington giannoulias sarcomeres xingang pokharel sorich fraboni matano trehan senaki sayago swagga marginalise pridmore galanis kmtv marro dins barnicle boreray pagnotta 
beckner dudh erwartung icad gaillot cuthberts edghill tydeman princessa withnell bradish channer thrifts cardiel akbank heterotopic gevrey vacuously coslet screwup strathcarron crociata fylingdales whic pelagio leade monheit khazanah tabacco carrabassett gitomer dgf dasani rodo nasso kirchners dulé langshaw melius zaccardo waddles pellinore unidimensional hopei socko buckwalter bagert britannias cityville misk gamst dauphinoise epley sauze humph kenza derickson empanada levier rolton deforge isely aridjis downeast wonderfull ashwaubenon staver saintliness keybank auctor considerd mazibuko jezek nurturance marijn wischnewski netiv keepass gamm conflit junkanoo pinzon millhouses koumei andile danin haitao crinkled reapplication servier ocana wnyt westwell clendenon canari vcam kasota henck brawne oxman grisebach ggb ebeye fraid trackable aggi bandoleros retinyl boyajian reboarded archicad pedagogically klaveren lavasa wevers lattisaw saccule maringa walliser briercliffe eluard gangbang kafer aams kromowidjojo wum staghounds soulard battaglini esmaeil keagy peldon barreras ohanyan kelda metasploit medimmune timss bpv prödl choque linebarger vespasianus dunthorne frosties herranz tilal boola villepinte potatoe nigersaurus facenda moulted holleran culmore chany bobrova nomade woodhams sidbury scho gotz hardingham werblin holderman harkening outspread kosen dragonoid biffo baggie troussier velva pramanik mmtpa empa wijngaarde premer henneberger miang celcius wazirs aphc loton frayling consignee contadora japans reknown sorey lancastria bowmans gainor fonua chlorobenzene translocating sbcs prendes isere midlake dubble mamula ofb gayley choekyi recruitments sugerman kermorgant gpws sertich kalifornia magnetised zenia eggy xiangxiang novembers showaddywaddy rommedahl lithologies bocskai hongxing seeders crowdsource laingsburg transversality malthe sundre mahmoudi tetrachloroethylene acclimatized aurelien nationsbank solemnize khalife mesan glenshee preud holsteins 
sandboarding kencana redouté parolin katalyst davoren henries champêtre jellybeans olis organisationally baslow trophyless honeck borneman aynesworth linter wener landecker magnay ackoff sportage maiti cockenzie visualisations mccourty silsoe ninewells peders rabelo postoffice cesr olina hentemann arato hersant murzyn titon marrara weaste patia murderball roddey moresco avdeyev rapira rainin gispert zheleznogorsk angula straightener barbering warao supermac winteringham schmale minik cyndy schoop wiimote leedom tootle windless thornliebank deviled wakai osw parroted yakisoba gizenga njc cucine wieren keezer niggardly bishopstoke doxiadis tegla louderback lazzaretto overpainted chaigneau arah wellbeloved yammer bankcard fellay leweni gercke leitmotiv mitres bettin chida loof doos curatola lomma hornak dalva barrass wasfi younce brownshirts autolink kornelia brahan swartland malchin madone interst ayten uras haule mahyar sedwick elumelu lorman adderly capeman adrenoleukodystrophy gururaj meachum damilola khairat krupnik alasania forsakes franciacorta hanff facials mettre gabu wcax plucker colpa markert dulces lusinchi ifra dreamings cafritz rever exept qaisar wilkshire haemochromatosis paicv ludham phantasies scalloping sassaman gayoso demonically rikk campiglio bscs liepa osgodby marienplatz bromford plattekill imahara kery darndest sternin kohara neuadd traffics propounds girodias bronchodilator ozyegin twinge eluned gendreau ochamchire burbury balwinder indiscriminant tognoni leffe ellistown raymont krach corazza sizzles pecota crudo jwm colorize mrak thubron raymonds vansh instapundit tanat ledwinka affymetrix tabarly pfahler pureed vembu ziyuan sextette ochamchira photosynth expansionists southmost ballfield feburary neumeyer anichebe pvcs yakitori lomonaco tegner dunbeath zehava connetquot qpcr hoogendijk sdw deacetylases edhem ponytails remenham iodp nehi loussier tamilian honiball daydreamin lucyna lamone cnsc mccrane cdcr islamophobe emulsification caressed 
sfas solignac dystrophic wdtv clubhead elstob effin edra lehzen armands bootup peggle nihl gbn iniciativa disrobed mauzy achel strumpet jokic luchtvaart bocharov myalgic prudery munchkinland howatt fauchard imls fhd defunding doorstop xango unsynchronized atea annunciata amtran bhv fulu wifebeater kabylia mahones bercovitch bartonville forepaws retyping urbanizing dalloz chryston canwell phibbs ntca marfleet teagan unfancied sotherton cocorosie marli unigo biodegradability odr precisionist verra ramidus delisa precariousness tolstoyan masar beechman rainmaking inboxes belet bengawan fcas pehlivan friesinger chaa mulqueen unfitted losail mireles weifeng spigots gratifications abayomi rozzi casone ktt southway capulin bressay soci zobelle hisingen bedraggled mceachin fervid tooles pabuji synergetic demonised stepmothers unfortunatley delaplane bearzot kliper passthrough bikita portageville trotternish ehmann capoue durmitor kakati tataouine napf sorrowing greenisland charanjit bollworm sunter barbudan nationalising photoblog guyatt bayville roundstone bashkirov zahav stylin scrumhalf galatas vendola hooson uncontracted alsc bulsara abruptness granik eligable schrad lateraled teetotalism kalac poilu hins monologist quaglia borzov roquelaure beseler hobbles olympiahalle mashayekhi lucchino lasgo broederbond spectrographs jowers béguin soldner renck mawdesley berkhout vhi shekau novikoff graffham skywalkers aellen chorded hibbett klutzy bizzarri licinio outshines haub stonefield yuste karey ripia baffa booyah akinyemi volksfront kühl romanet bassford saadallah blackwill giriraj sidetracks grippo handhold songun avoth slackening vilallonga lunching arseneau claudet northlight linlin weprin flyways tunisair mccarney misapply knollenberg psychiko tauran morphosis calabozo bohart shadowboxing untraced marcegaglia afua goecke siouxland hesperornis shoard sigmon olia repêchage ruche baset butterfinger caponi ghaghra gobbled waad septuple autogenous alamin sundancer wenford 
lowri wermuth wauchula capitulates subclan edelmiro bmrb soss shreeve lemberger kintampo hamate hallettsville bigfork hostal earthshine succint jadoon doca arandas gundog levanon trullo wanis laber zealousness stoor shitamachi waissel chodron kheli lanthier leming aghios aegerter balestre termagant fukatsu aasu gawdy hipkins qatargas shoucheng furlanetto lumos pugilistic kanmon mensalão gpon zohreh polydipsia primare confectionary rgu smallbone chibana kalik theros jeay carryout hrb marittimo decanting multimatic wissing hawija masinga shleifer schindlers uwch yadlin salahaddin poovey noorderlicht gunsmithing kxtv brusatte hamr lhéritier grzyb attackman bardhan louna bettes preforms frosti proverbially puett conciliazione mansaray griefs guidence iason nonvoting pyinmana wainright measurability unacceptability leics lefevere luf gabii whiffenpoofs dibona precooked sommerlath fisken kifl rindal moping counce toupée meritt tavani rgl matiur lingotto drieu stamer ledonne sagot kjer sauza atwan bourdet hodnett maws satyamurthy lugged kulig sounion crie akhras trinca mcgruff troodontidae angliss nhadau raco lobov mnangagwa shortform unwrapping attachmate kandilli mtel sulaimaniyah alberici shishkov pined levander hennin allenstown zalayeta geopolitically fluorocarbons blameworthy guettel quicksands toughie blathering roughwood listservs riptides krenzel protezione pijesak shewell paragua ascherson lapolla epilepsies fany scabby matikainen uprise krqe wireimage nbbj mashallah mohri voicexml endwar whiners ilaya aebischer kollman stroppa rosukrenergo aristocracies northtown danya lustrum gourde nemr motsepe bossley tuncer mickleburgh colombière mincer kipyego nsca akhnaten coppins urbanos parnham jaal kanis lugt wykoff macunaíma luxon flipbook hopfner ruleville vólquez gleevec viser wendlandt ghannam vtx geimer surdu unappetizing arverne klarwein dipti vwd leist yarmulke pokeweed transcoder hillmer noctiluca actuall howry camouflages ferrovial maltzahn contiki unlikeliest 
curtails lefcourt pavlopoulos burmantofts countercoup sweetcorn chaiyya virji taxpaying inbhir oltmans piliyandala galex honkers winnersh verlagsgruppe peltola orren lacher mincho pblv volstad reatard hairdos longlands pretax terezinha gafford bussiness tesser ahadi hurriya springtail bonavena bronchodilators boylesports ilton collington gattai dubailand talagi saoud carrothers odelay anier polytonality agle clockers dannielle mahre vidale ronet jenet erjon dantis injil bakero modasa roselyne defragment yuting hellp trebilcock rungwe lollo odair laffit traphagen aceto gallay poseable jhunjhunwala dumisani yuzhong dorell basepaths joline reflets smush jannes templecombe clst tanongsak dussollier poolman shorebank lysate haslington corduff cianciarulo proffers woodrum wessler bunde scholium libdem fenning vinessa condren beddau cornhole columbite ibach fison manucharyan coppersmiths cardena vittorini wampanoags renia hijet ganton arrojo fetu iajuddin giddiness xandros aebi elmstead colomé bollmann equinix chichiri shackell paartalu sonobuoys wics wellies haresh schimpf kabara tdx girdling kozelsk waymouth sciencenow murden starstreak perenne obfuscatory ultrathin disproportionality weequahic newmilns schilsky magaluf clambered eggum brehaut kaplow sicne wackiness fidgety riffraff taillamps rinck rakshit moneygram donleavy yucai donators lorig twigger icps smola hauber roquetas munchie gasthof masamitsu jish bsds informacion toutain rehavia dmw virgilian crear occupationally bhamra unroll verheiden binzer mandatorily cinergy ovenbirds rocamadour khalida tsos oxpecker ottosson magli lotito endang leden zadan mischaracterizes humanness rollercoasters cantey dustan rizos aafes brightling goseong anisur drako sporran belligerently okwui leviton nenndorf califf flouts mcquoid dagsavisen deewan sexson hellbender posthaste chinati giberson ekster upgrader cales vssc revalidation judengasse stramonium mathi contrapunctus kefu agazzi bellette creevy owne rugge eglon rajaraman 
andreyeva aversions klumb cowle retrying gtin brathay underfed bichat cassida reinsurer contursi unsecure agnelo tollerton hollein reinertsen rifka ryal radovich gottheil yaqubi cornard remillard tylers undeservedly uphoff uysal hualong vipond buildable dropwort bowkett nulato spanaway ansdell indyk pedaled fantuzzi roji ayob bookkeepers sudani compa corita tourelles cichocki zangara ehart hoai dirie clases sparkie dunagan bilma boualem scei bickleigh railay kenwick colrain elliotts mouride silcott vallois gawande mushaima peplowski schoener mazuz rapacity decroux rollerskating bgh cauterets bortle rideshare chmiel jtd blumenschein amcc javaris yahud necromantic rukavytsya smithills colorno azab trus zura grommets yosi jahoda jiemin missen implimented finback agnone puppo russki dougy yantian kubilay synthe unrevised villere ratted joar singlets interiority wehlen ndikumana loux microenterprise formaggio scooper mjs steiners sanitizers mackle oversold foreward celgene sensually feldafing rockface overfilled rinde hofbräuhaus oschin eulália translucence perspectival pecha kaukab gaiam sadowitz haysi paganelli linos vassilev saadet bratman pjp makowsky faiveley wady alday stoeckel sidman ronis lerby overidentification donno kitanglad plandome karahan blackamoor journalese willfulness nager borren pmsa dwek nsit goetzman bonvin hoeck faves alkmund cherny choppa wholesomeness kutaragi neid paetec poisoners divinations tiffen nonylphenol ballybeg jarque haehnel uttaran geoint vitol bergenline orfèvres pmv rales ostro unpatched noshir erme benedicts zubieta splinting varadkar fournaise caspa chipinge marzena supplicants toneless motherships antiquarianism hoeft clambering rigidities rosbach lfv dayer marchés engelaar schmoke avita pachyrhinosaurus chere kandelaki holzner augustyniak upend kleban ragon agitates tylertown perriman góes ahus sackings kempter penkala zubeldia knipp janela potterton culbreth foxall yetholm sandling akter tehnika californie romping baic 
udzungwa deibler inadvertant baduy exploiter unprompted hakimullah josephy tollin nachle deveron anthropologically dowiyogo nfld lawngtlai uxb tahlia titlis companie samura komunyakaa uslu goodden mitrani faga hoveton yordanka mableton greasing budanov hillarys throstle espinel codjia mcclenahan gladstein wuzhong guzzler doinel grippe hanton lolol gruben taguba brina hawkinson hygrometer wible cigarini profi hungers stoffels choudhuri stachowski seein ruffling machavariani senreich leighs recondo nhlanhla documentarist jozi barmes spackle kamman procrastinated fredda bucy cowick bonvicini sundarban thurnham lassana butyrka buitenen dronten mamasapano bisto porc iconix kimbro matronly kilju qatil razza padda cudlitz finchampstead mondonville lison morgenstein zii roadworthy nanri syndicators mutal rezai miskovsky bookstall manguel venipuncture tributyltin lawfare adeniyi jehova atheroma explicates lavernock greenawalt beaney rollox usherette ahwahnee catspaw tensioners jaxson dmitriyev tchani delyn dolgin maryfield krens sinquefield wieler wilis duku jacome sheffey ythan kavin saige jaoui moorage dufault uvr fatalist prazak schwaiger ballons inured beese polyolefins larrick bonera kitada wiesinger spratling curdle bbci seegar domb inverkip nango nordenham carrascosa detre ambrym phormium uygun defoliants mbulu lafco fischli turncoats seasteading cannaregio ballingry rangasamy affligem glazers broecker darkon challange gibbus baschurch fraternizing coccidioidomycosis shetler skrall holburn susceptibilities polarise akvavit albats zefiro bogusky franchize bishopwearmouth casseurs goitom sleights aleksejs bouchut pulverize albopictus achatz kinokuniya colistin befalling kibar himan etanercept mastocytosis hamou sugarplum kazakhmys ranchland reassignments homotherium radoslaw zbc milingo kupperman koyi deregulating hymettus tussles androgyne losartan massai weatherboarding nusbaum hallworth pompilio oehlen challengeable gunnerside murciano koppu wahnfried chéret 
aerobically starline sinofsky kwashiorkor ranawat safenet rusli beya publik kampman bolthouse bufano caldarium sunchon lauret aswat telephonist latanya gayet driefontein yadong panek mirail grochowski mejri barloworld displayable sawbridge fedexfield neuroanatomist brightnesses immunobiology jeffersonians riak arek bmcc cinda fraîche artreview antitheses geiranger subassemblies watersmeet hovingham stockroom brennand musante bruer cidre nautico spreadbury ambuj harilal cucinotta powergrid schaber pucara moggach gollwitzer rerio hyosung masterless brinsworth rusha preclusion pehrson kleeman maie codependency mégret keagan ishani puning bogost dierdre radziner protoplasmic glackin rehousing sabhal clasen lekking granatelli semans nulli fleurie precursory anderer sarro unornamented khanabad reexamining pliva kalomira geib woore neubacher gwilliam waljama macaron pettijohn yuhas oginga aeroportuario freshening boin overlea internationalised ariff mozote squawks sijan lechuga smartness aptamers antiproliferative halfhearted mccreedy seehausen kabua snowmobilers kersley conjuction morcombe duerden vandoorne pertiwi swatter joson sweed palmy angelology budnik navarette denmead georgetti butterfish nitpicker caravane chugg envi brost dybdahl schroders tett hony offloads dubnyk refrigerate omerta verities chandlery crudity mastercraft bollier relearning esteros bangguo hammacher kandil violences fallers pouya tabone banko enthrall albach raber haggins articulators causae kindleberger scampered tikun karalius yawns congruency forestay koken balangay barhi saib frady berlingo makart traceback transito zeynel tessé heffler tytherington milea mison stanground indranil deemer steeplejack cansler perno kande mwale farinata hippe quilp okin swn rassinier rusnak lesperance chatta renos llanwrtyd mashaei barnie reinprecht prophesized lybster bbqs skydance mazower bralower aveva warnaco seamy adoum lochleven phylis kiersten zorc stibnite alysheba rimula gocha woulfe securom debacles 
deuteride rybnikov greenspon catalysing pursing pcts geg wurzach yaroshenko deflates parlane estacio pozdnyakov kreek profesionales dardis lillingston windsock additionality killie savagnin crescens oarfish changnyeong cirs fortnam optimizers peslier abcp quintain rudas helfand gadir vitrine daul salovey lawrenz adkison glengormley aspendos raanana arlidge threet voicework mathern monchengladbach akutsu attingham odam badil diatreme cjn mercuries tunecore taurean flavonols woolpit fafard makeda sanand kalaba zanger ayal croutons nnamani defecates gisli pocketwatch rozsa euractiv chaix murguia hammerman geodynamic busche millenniums fiorelli cossutta kesgrave scovill sekret charulata liferaft basketweave hemodynamics worcesters multiphonics harebell schechtman niuatoputapu konvict jetway badat bujsaim nuhiu schnoor horribles eriboll wika madlyn ubr alewives gavitt reeltime moinul moonstones abiquiu rictus idyl azpilicueta vatel machacek ninigret marinero ambro ginna enf konrads deprecatory klimowicz alrewas omh groper beatha shichinin dkm landreau tanuku rathborne ockbrook lykos teunis swo mispronounces pianura vwp gassan styli aurigny heijn commentaires sklyarov decontrol hils zaitseva burtons mulches overstimulation neuk infinis baccus bogazici waso kanesatake controvertial rollup leijer showmatch immolate serber mayardit olayan kermanshahi ullens stoc broaddrick gnm fulginiti ugbo debach avago qqq shipai urquell goldrick kyowa karvinen scagliotti manderley primitivists depressurized backspacer lightshow lvad carimi chhay arval pahan knoyle adjudge boening obrestad sandycove mismeasure millionths lillet proffit maisey jbt wakako boozing iuc hemley ascensions natgeo pashkov glaise healthwatch mapfumo dovecotes najem bouchez inflexion lmh solider machlin palamara readerships puo sparty renovates cervinia palkina biros qada audino kerckhove goodlett virginias bucherer kleeb thrombophilia dausset neocolonial fraisse caitriona atwt mastheads uzel lowing birostris 
digeorge bauld shuaib clode gubi handholds nutmegs marotte tetlow sabelli williamsbridge yanyan cunegonde clanger balink innuendoes ccoo succi courrèges sanshui rills vigdis trainability beauts vamping huntingdale kraybill augustino pashtunwali varah ltj chautard genel tractability mourvedre clods nilmar blodget liuna ranter teasingly diefenbach surveil abjectly keyham twirls synaptics dharmakirti grindlays moelwyn loughman guipuzcoa dreaper jinxing pandeli stielike ruwais lonardo grassmere andravida skenfrith epact meaulnes lahiya skarlatos adella prizm marimekko elgood mì yolu nehme behrooz hopen resurrectionist feverfew jacomb ortwin intar cryostat pointier dugarry tecom pauma rochestie javer sarf ommitted ptz smtv kirklington christien poynor bertolotti bakdash gibsland mănescu nonaggression probally infinita indigents designees parasitise doblin arcidiacono steelmaker unobtainium samari sephy ansu chacombe westerleigh kalp pulte yasufumi schlicht kraters wichman dwarika artimus ravenscrag zenati alternativo shikasta elmir xinfeng nikel janah kcrg merenptah zaniewska sunkara subsectors noao hagins borro komara hansung tangibly yousefi galliani artemesia nanobiotechnology needlestick websense cryosat procuress mamzer boogey jawaid skarn honiss arvor fennessy hydrogeological crvenkovski zwiesel willughby fanya lempel skubiszewski tongham suster wintory sauls desco wythall sterckx unrests musacchio satriale boneta luddy threshed kragthorpe saun brontes petunias okruashvili cichy widcombe zazzle jimny dielman abderrahman insh pestilential liskov miaow periorbital stimulations antibalas jhi corporatised knaths loseley vegar duflo etchemendy christens compactors mickley unsuccesful baldonnel zicree noak adiposity managment kolesov puter phyfe hinthada malaxa relativists westhaven dŵr sakhawat fordwich lindens wildy cartwrights ristic hapoalim mukhtarov generaciones moneyline maliha zambello aglianico fandral polywell ippc prucha plantronics luten odier laurino myford 
vrla joggins xiyang domnina brickfield superheroines llay ferriz tobacconists yiyuan hangleton lxr qadar koivuranta comforters otavalo minger osmin oconaluftee rannells sikat driedger lisak temnothorax borrás lambright bassets sacramone secondmarket weilerstein weeton prayas paracels bufalo thundersnow intervision dittmann symbiotically cervin qiaotou godbey porizkova newsvine potencies gotay mkandawire gyrocopter cirie readjusting meos hueys tidmarsh weiping forfait artos detmar albinia wiemer rieke milliwatts handovers teazer globalfoundries enjoyments wddm fvc babers corkill dilbeek heavylift grandmont roved shakeela elgart gevinson dreiberg brandel rasin nsec hoshikawa myhrvold hemnes vestibulum gatenby paolis audel amblecote sntv rozel hiders scrounged fás alcivar titas pasque meritage hibachi coliforms closeburn qaryat manically gambinos linzey aerostats bennekom romen usuhs packinghouse haysville ftas sunam nately godane chipchase explications aleksic shipps derlei manganelli rido ignatiy nahari keverne wqs timesheet ,then strallen attkisson wavendon kuffar forteviot jollity sharifs xiaolu fergy cheverny haysom mhor bashy boesen miyun carenza marketwire bosox peasley marathoners witheridge mazzacurati cutdown romanesco magaziner seront prodromal tőkés tuitions bosingwa kazee gerin nwn gedrick colindres gref necklaced windfarms roelandts sistina msms wtn somen woolcott spectrographic easthampstead vegh redeker portzamparc booy undecipherable arnotts goaled mcerlane engleman sympathises howett kullmann cablecard hoggs martinetti zierer gershwins klon nikken actualizing bobic greetland gurov pmln stets ranton glassberg reuser leibold sapsuckers maliszewski dunner bajnai fritos intercessors giftware mislav manot denkinger powerfull evangelising feierabend leatherjacket mcginest wangfujing maenan gorran zelikow guralnik hockeyroos oupa capstones rocklahoma bivouacs zirc chodorow fluttered krathong satyarthi thornby cabaniss parilla kozminski matilija gornell 
kancho emcc baginton tweeds reprograms yaqiong pacolli heimerdinger baitz zamarripa stirratt averin sperl newmains grisedale kancheli trebol hirschfield rase fishergate iffat windell hazlerigg angelenos pegboard sawano mansoori bdos prodrugs trashers earnock dunvant vélizy rabon liers bahamonde vallie busked plights jtm spgb palud nastase lezignan yosh chicharrón ellner cornacchia sharonov yamatai hinesburg shosholoza articulator muffling wxga yergin ccee sube orgasmatron tanno zweibel alleg janay xpc effrontery hanun kharas sureness okaka reinsch imh denge mcelhenny entenza wisler petillo husan romera fortey shneider ohia avowal chebbi paphitis qingpu bearly sewri hahnenkamm theatermania meekins farinha pashtunistan khudyakov northcoast sozzi leverty abersychan zaatar lagniappe bolar remineralization undefeatable issia bulku acerno wava kadafi rossport juntendo hdk clein chagford faah pseudacorus remelted kilgallon kapchorwa quarantining nobelist mohib xiaoxu gondjout malby zippered formidably bidegain pasc pressurise sofian seropositive onuoha ratier bergsland insoles wyatville preeya hayatou suunto bablake detoxifying mountfitchet scorekeeping egilsson habsi nazeing orizzonti kaneva xinyuan kavaguti emoting dubash mahallas burqas milnthorpe schulmann kassen courir købke anakinra supressed metheringham denyce aeronauts anxiolytics freedia extemporaneously clomipramine seiden rbma walb putout defne clericus sazanami attritional bifolia folgate indianhead dungu bussel scholer gumming aproach billow nawabi kilrain abasement vrp belongingness amwa choudry thornback pitroda bredahl transurethral paticular annin perretta staubli exfoliating kitteridge spem eitzen abdulai ribadu liftgate eastney rivergate stalky pluta hwp beldham liborius maestrale midis mahto snax nolet chacho barmston beneficiation agglomerated derwen evonik markyate cocido fleuriot sinsinawa shifang acklin fidele starcom retzer iglinsky harpooned heeswijk askern vosa événements majola isch 
parliamentarianism windex saunder fluoroscopic rivadeneira rochel imon airservices tudhope splicer misidentifying ragueneau karrimor doublers suranne magendie mishina hamoaze krotz letscher egami yusaf aerin hakin senselessly ussd marzullo sirpa tadataka nyanda lejla maschler saganowski superheroic shallenberger metamorphosen horrorshow chongzuo lauten iribe nonrenewable skok pariaman ondcp coul coldbrook triche karnowski tollefsen tretinoin poudel ncaas ogando googlebot sextants teressa charterers faren ridgewater advertized caic oreck faseb ubben fofs ghailani accost gerlich entirity hilleary sonetti trickiest tailpipes ostracize mohieddin naturedly hovels lampl dizaei radioland aquarela lozells flyting dispur vorontsova otherland wpo moulden karissa marimuthu cranie tuilaepa rateb tousignant esterbrook hashemian maladroit stupids hatherleigh darkin mamut fanfest upfronts haing dorfsman athanasia erythropoietic yaping gyasi fishbach scallon mjm mipcom dayyan muteness seldin cakebread natallia turkishness nonalcoholic managership wildwoods shrady breezing madadi matelica maximalism nermal sandbaggers unappreciative strenger lambeg snowglobe logistik knuckleheads tillingbourne cardiomyocyte hfq fellman haddiscoe giaquinto khouang bazza yamai krystina winnenden chavela annemie cordileone blea showjumper gaver riney lisfranc larmour purifoy ozell phyllosilicates hohman thinkfilm grevemberg negócios nucor okeover torday bacteroidetes danyel salekhard surveilled pettet dabao borowsky wallem propsed tilmann maddens travanti jankovich antonito muya brainpop kodaly hovsepian icat karnail obika strittmatter bruntingthorpe precisions theurer rebounders nuncius calvocoressi diack hochevar popocatepetl lacunar colvard cowers feras zonen ustica bijie moitessier backslide glais vortexes mohajer psychographic lizzi revocations abergwili kofta galvanise smas hatty tkachyov matucana mervis mirowski musavi frostproof sadowsky akte rohypnol puspita shoukry grevers harbormaster 
lymond archant infectiously highview beaubois invloved budgies biologique dietzen tribosphenic ouc icbn daska selita straightaways recinos vinoodh gemcitabine miraval tter wylfa reyhan besmirched zhilin klyuyev boroujerdi catergory chofu andr colie creekmore ahenakew janiak cadeaux poonia randian kuszczak dennington tanika abdulsalam pardey skomer unlikley antsy yurie kouroussa bizenjo bechmann langbaurgh akta yurevich arza mcwane brindled boomtowns fscs sheepherder karson acnes barsac uninviting lexile averbuch shults hallion ghencea medoc endarterectomy whatsit barbarina humbard rousay tuncel nucleare pyongtaek hetal sarles tabernae obliques passfield stori majar escapements corporeality yusri poortvliet minshew seeling gunawardene lording lutie meric moceanu backpocket flambards chaskalson brickmaker gateside verduzco prope pulao cronista edelen ruhengeri maiella lotario leinbach leil chelford rogate lupines whatevers mumby marinakis fastweb uniqua carsdirect spitznagel roso khine balladeers dementors lingyu italys scribbly mutley saky lichuan lightless triazolam hullo memorium acquaintanceship pptp gladwyne annigoni linnie artschwager fremm roncal kirsh acedia rabka paleis schreckengost laprise cbct angularity reflectometer goldfoot herpetic tarsis exclamatory skittering pepping sunport häring maidservants mcmordie limington kaffirs necrolysis crystallise dinastia maotai veerman flitter wazed ektachrome emotes rerated acrosome marveling pikine hasib pression germicidal boringly carefulness jubair campoy epirb kende bounceback louts ajello moita hollowness ledgard glinton multigene atrai tudorbethan allg pflaum pouillon aschenbrenner angelyne budds heureka sheepfold farokh unwelcomed morlan lightworks schrage zhiwei heptones naics humar millns scalby felmy outbidding faceplates basenji shukur ernö hindelang zerelda trusler unocha foscarini geosystems lefevour fegley schmaltzy isw zentai maximisation shaha ungentlemanly olguin epithermal spiegeltent cerner fengyi 
networkworld klann wuhai fery paygrade simyra ppos mcvaugh tinniswood blueskin rrh pahlavan backflips ngd mosquée irritans wongs odac tisane kivalina shamshuddin yoknapatawpha cinto younès ostracod grindin arsham tocopherols abertzale cinemateca lauric oppens deshon paloheimo môme hydatid opdyke pervin upsides frecklington corbiere bobe intikhab crisil tankards skandinavisk delahoussaye disqualifier kohle viramontes ginting abraaj aisleyne brabants kangol wavelike jamario harib arrhythmogenic basualdo meria wessington baobabs parvenu kohonen sfard chamitoff spelunker lownes greencard hakola rhome kilims diardi metsamor psychonomic colls parche hoshide gadabout ecumenist riffed progressiveness warehousemen geesink steuber chkalovsky saffold shangrila ghettoes minie pudil allesandro manneken aeroponics eenhoorn sexpot scrutton salsberg mugisha mallison bcsc herbers boghosian alico eichorn tref tessitore hertzfeld sfakia chevelles karystos mccurley anual ruili elze altice jeev lebaran garas gefilte qarni torm wonderous halldorson zulaikha stuk filleted stoeger mikuriya ettington stateville beekes implodes kiliwa mironescu barkeep eibner unpublicized reknowned strati kingma nakahira zbikowski lindholme caisses chrisp beign maccrimmon cosic fraizer instow parchin mpq vavoom taepodong murbach kehlmann plx marghera jennett awilda whitewell jiyun tropicalia absinthium hakman cockettes heveningham mufi katisha rimonabant erics baturin ramboll micropayment sokolnicheskaya anticolonial moeser centerport exame nouriel giba krishnendu argyl emboss kolp checkley scroby samois kuroneko wemp lbg qdr bashkirian madobe ahlawat civoniceva kernohan pennycook henricsson martien hotfix interlinks oversexed steyne bankrobber ilas archaelogical stansby tilahun sylphy milchan mussi steinbruck nahhas mistajam thermostatically delee trachycarpus zuiker polenz duberry boabdil matsuzawa shorthead pyon sophistical eisman teshigahara pugno ngcuka titchwell tetro ocps dachs tobs mcdougle 
whitemoor striper duntocher villasenor perras blatchington untung dispirito synn epicurious torriente pitera foyles moisander zambrana emos farnerud lotts romansch foong shirlington inquirers codswallop bythewood kaufhof gonson solanine streib claverie kisatchie trajano lipu impolitic geha capitain blowups sajda govtech magliozzi anoints lakanal contemporaneity lops prts fgi kirori ramius seuthes nottebohm kornienko reprieves siby pruss ashenden mision montevergine harkaway schaack rahil warga equipement clocker deepness maslenitsa kadhimiya jide hdcam largeau kpf laurine footwall masire aacn creepiest overexploited perihan guggenheimer merill luedtke breu coldblood emrs etg paysanne lindert launchings imbecility delforge sherzai mayet rashawn houblon innovativeness wtvy dgn visteon stocco savvis faucheux alittle karera crondall oee mariotto finigan kennish shouters conflictive bedroomed undramatic mayland advertisments malae predeal bakkal fanciulli fakhar rosenvinge universitaet djilas nihar fsma bolanos stegosaurs glucksman jaza chancellory delice intelligibly overachieving vorilhon sretensky delaram balchin roadchef rodat terrio harison mavic bilde palmore sintez kilobits catroux twillie okot firths nafion zoet bossie katigbak scuse gvhd waitaha hilling charouz witcomb keratectomy mehmud malbon wegelius lockshin minnillo policja usti dravet janosik assenting kosgey felica yiadom postindustrial tanaquil usurious schererville criticsm threating broza fritze mcphie cspc ludgrove nativité sevmash kywe biswajeet dunsinane francolini pepeng mustansiriya idei niceley mollica zatkoff apics defoliant servat hereof cossu oceanlab lythrum kratka dagerman gendun ciesla odera mangy syntel yorkshiremen laquelle wainganga wvia müllerin aronsson thabiso alandi jellyroll ghanam herrod signally tohan mwenezi shafaq willenberg picards hoed cyberterrorism immobilisation aulakh careen foleshill murrindindi astolfo fornes sublevel peverley stultz zagorakis karpf participators cdns 
hubspot wolmer wimmin nutjobs biodegrade darkplace subdermal hunkeler yongchuan schepens highfill fucile déluge parvo selchow sauerbrey funkee thise yacoubian rehabilitates tatin leymah laindon damerel embolic monterde reunidos dolgan convalesced chartham misstate chorney idealab uniter fullbrook duncraig moodiness jaspreet ipps zaldivar lavash rgh marinoni bubs overpeck disgruntlement homeports noue miscommunications abrash sangare koncz dubilier seyran terrail thrun overstay bardy desertec theodorou gmeiner wataniya makari fluvoxamine oberholser chatrier yaizu playtone acoustician holim arrate piara hutin werley ratched portioned afram iwpr baerga woosley launer hyperreality debarkation anglicism sstl outdoing tarle nzc wefts shunk roychowdhury taufua elph moctar leija shuggie martill zemer dimick dicuss declamations erwitt kunhardt vinings parameshwaran holan moussawi prizefighting kirkmichael ulceby ouds unexecuted umland coff subtotal literalists addit devisive gurwitch manzullo mudry dewei longbenton goldfinches thieriot mulumba josina laboy cavalries juande newscorp mihm tweedale abry diffrence rosenhan liferafts ballouchy threatt grabavoy tarique myrto divakar vaporizers pindell hols hyperacusis rabassa feest agness faap abisko intraconference ladell brandwein relyea kruczek polshek ayar frizell cizek thereza ardboe cleanrooms ruca brunetto controled obliviously cossío degema peopel watc moreaux lifespring veizer bolado yerwada rowbury neroli haselton frauenfelder idrissou freeston iwaya jumex ypi moggridge pykrete nimule melanic hollier gilsa buffay puzzlers landrover svenja ottershaw ragamuffins eefje edstrom tekel thei mcweeny irukandji lantirn huckabees folli lchs handycam hamsher cantil zouaoui interleukins vespro aragones gne beac radzinski seznec cyberattack soza bernabei neligan ceap craigmore amhurst kakan markranstädt vrouwe houchard harwinton reloadable pommy ezechiel etablissement pintails mirboo kirkoswald stickin easly laurentien gunten 
apotheke procede icrf floud unctuous vacationer layec gnadenhutten uclg stüler journos soldano hopefield okumu abras desaturation overreacts micklefield robiskie tarlochan montre recalculation docomomo kunyang frechen mubin mccartin dismounts borowy mojado julies stoel shoveled woodhenge severns ques perambulation tahuna gork diac guideways scandone antifa muzzi mispellings astrologically diomande allots kanbar jitender degradations vileness wetback rosmah sridhara gaugh norborne korach fradley ibri manzer ecom frasor umid wicketkeepers lancelin perriello strohl vassalboro teleshopping samuela easthope syring primack trimboli capucci hous cicogna tabucchi sates redouane kargbo simoncini milov opensocial varnell brockhoff pongolle methemoglobin preprinted sonnenburg zarakolu arien ecas triqui ruyan derangements simos fireblade synan fekadu kepis menwith kakodkar halama lahmeyer ammiano europan brabender berbere agonize paulien mathu audaciously alioramus parkyn hamis hidebound cavit tintwistle  sidway picardi unhatched murfree gashimov hermawan kores tarapaca buckboard koothrappali doz zonn vcat millercoors saqqaq snri calida kalanianaole mellody aneurism mercieca spello rosoboronexport malkowski shaffner qinghong eyeopener karnofsky ecgbert lelyveld milkmaids henmi lhi dyslipidemia arwenack tyerman innerleithen eichenwald rosaly regrade conflagrations giler nacro zathura youview ebk shehadeh strida kargar greenhoff pheomelanin wellmann cyberathlete seamers chapped szirtes picault evinger theofanis hoffert financer barbours reiger shepardson kabamba monotones asikainen hegazy mwinyi zoabi foodtown desparate colchian mccrystal hairiness hardwar therapeutical messagelabs rades boryeong suggestiveness encontrar zeese chavalit blasdell rubai rocamora cornbrook cultra rejectionist zherdev crusting mhn russos iadt gotts shela masochists ettridge tchaikowsky ciavarella betto kiddieland trilla balakot lubell randazza zzzzzz renon harpole aspland glushkov gyulafehérvár 
palimpsests puking eppolito kinlet lightsquared mynci reclines greenes celona calanus nowdays geragos riggin paraprofessionals pawsox abukar pullet lorusso shahba abdoh stölzl murni elsi sterlite knidos muratore beavon northiam minutest töpfer efw rededicate martinon duros hefeweizen gorgie revisted voiron levitates worldskills sansha pearlie thnks bayton ravelston lipsyte decouverte beitz coalpit valedictorians cudlipp hazira brashness lgw saughton ferriere granjon lipnicki gendercide ziggurats lolas idiosyncratically sensationalised cremins deductibility superyachts buscher modularized zes tecolutla salka sterba ozama mmmbop asuma prebuilt melbourn mahinmi satisified ambien rehavam laywomen wahren flaum evart maghoma verstegen undestand sigir phlomis bowthorpe scutt fetoprotein shahroudi corporacion monee mersad rockeries stehlin blago unstained liquify scourges demonstratively fullmoon süssmayr diamantes productiveness parreno ecru kolinda clinking adedeji rehov bookland cahora wendelstedt schuetz bohen solinas frasco sheli eijden zahraa redoubling tamr kentuck spymasters madaraka invergowrie kà ndv lugovoy fontoura kanas bartiromo transducing perner walland pentatone bénouville mungiki harberton infelicities govil glancey sherrer teranishi chesebrough ments afewerki ruley muzzleloading jackfield söze layin zlobin hypoglycaemia bridgeford wordier recommitted sightly ultrafine firkins gartree advanta kuntner rasselas unscrewing havenstein yorkey disinterment awar nizamani puskar kuangwei subparts soulfulness sketchers jaunts vahedi laris gayla vilakazi dessaix banisters ayda wreake hadra westvleteren schneiderlin lougher giardiasis annmarie kakuma bellowed unncessary unquantifiable pelmorex mayar feemster golddiggers flunky southmoor sikandra hennesey revett nakedly quasthoff murcott reinjured hindson molan lusia mogambo stope shene wipp brylcreem patroling chauviré germanotta hawksbills adee liposomal sorrowfully tankless zarka premix lavishing guangyi zeira 
plantlets cooperator recoverability witherell fusionist salgar pharyngula gilreath shallop steelmen obsessiveness lettable aughnacloy alread kishikawa infernos disasterous onychomycosis jorie diarmid nurminen fakhry mohegans catsup macmullen makerbot chukchansi masquerader charabanc diffidence diaboliques ndez stillie langenbrunner stavoren paraskevas wenyu calp obraztsova superdrive fleitz stoick duey pianeta osbeck ponsa toiletry eshaq togiak rereleases etorofu okino intellij jath isolda andratx tertulia bradenham ukas mediterraneans haathi kilmun tatia senft bucyk deindustrialisation bigman dehumidifiers scorsone makkonen turnhouse kuhak berghoff gashed pallini schlosberg barlaston freephone hentgen concurrences luner fiduciaries asphyxiate humera pauperism aubuchon aquos walpin cosmetologist sounes remanding lonza unmindful lopen icef pogatetz favalli gliosis oversimplifies deardorff accomodated rokach bussmann greeno hamat newbrough valorie zigmund tomiya cadaverous bedwyr dumpsite webbers baccalaureat hellings kerbside invitrogen briskman hogans aneth transmittable zelnick butyrskaya bpu corna dmitriyevsky makarevich apma vuckovic kampe boukman breakdancer bondra fraise parazynski aberdovey watersons sightline bedgebury maddest wastewaters sampit pingjiang slake btus riddel handfasting jungleland filipowicz scoon zhihua rapino asaoka adulatory maselli taimoor rehydrate concreting afanasiev uani qingyun myners ruijin gideons zilliacus aggrandize levamisole waltersdorf briarcliffe workes milkwood windproof fwy lobell neaga vinall trapps brainbox elvie raiko raevsky xlendi silvo pacemen tacurong walewska caorle christenberry dongya agey liveline conceptwave brucan guowei hartwich mayoralties dyak kecap poptones phenylbutazone mehrangarh schexnayder worboys niyi subalterns vallauris banorte mcglothlin stasiuk lvd quandaries thromboplastin gasco perroud zhijie aerotrain shopfronts mutualisms pierse fundraised verison kozai smic saabye stockmarket inseminate 
spurdog weichel gourmets amphioxus tulkarem bbrc gholston soufiane megakaryocytes rubem corrada kgi attachable smooch fulghum trepanning wickedest shincliffe cardamon glyder garendon singsong foxhunter duquet coattail rushe belisa subnotebook zantzinger bastyr delegitimization archigram chacewater thek templepatrick mackenroth smythson dxa taquet blunter stanka linthouse khutba radulescu moyar muszaphar paranjape forstchen wloclawek ashlynn medine cowriter bumiller knopper kondopoga stainback boerner vierra demonologist arnol joleen defectives graffitied cubbage lelis aircrewman ugali abengourou mousie mww gettleman monetta graine menez jetée notwist urenco complexing inwardness zussman passably hafencity armado anantara nunberg carno shenzen monetarists jahid jetro whined terceiro aleen deregistration kangnam armenta wordsmithing aunties mulgan fanmi greenhornes lescano griefing gtos tgb vilda kornbluh terracottas roho morwenstow hübbe tuckman thimpu stepps vietnams piercers khoza dubcek paulescu netherley larg tjapaltjarri brownes florianopolis underdrawing demacio robdal microcassette katsuaki nebulas blaupunkt chippendales dromedaries waaaaay cardine souillac dattani churchwell villone eow zabransky sendhil freshened lcss edey sabaya amrozi cmss squelching tittel uddhav jayaweera riesener correy chelton suncruz utai altium amirite backbreaking ohama hiyo amarjeet jindra passyunk laquidara engleheart teimour shanmuga carthew nhatky kinchla parlotones molinia creditanstalt kinfauns buspirone thena overstressed youngberg fjc treiber douthwaite clandestino ellens ishimatsu appetitive wongsawat downpayment costen wainhouse cahills sunderlands biabiany lattanzio rigell cinelu dalis babushka nethercott lorius kasprzyk letten pandigital obodo pearlescent jaison ferreiras sicken chatan unpleasing tomasulo cags isae phillipine ragozin hotwired idria lovebug bundt ghanbari geographics bergomi limberlost mutz oglebay necesarily dependance pirovano internasional liposarcoma 
rutshuru πr georgij wnv woolhouse timsbury koach explosiveness isho mayanja casquets scarily lawerence pilibaitis cardiovasc cinelli ostaig eaga nccic evanton bomani rumore craughwell throneberry coxes dustbins interrelatedness softphone dursunbey fleurier jonno superimposes kawin comedically subcontracts jtrs ozimek cignetti corsetti dziemianowicz kolby lorens ginowan michalowski asbel hemophiliac krant multisite venora isixhosa mackler sparsholt argenti exempla rwi pedram mossadeq thatto ccne lalchand nihill ferrill trendsetting hoffecker xiaoguang vims yve liow temmink calt twinbrook yahiro heylia padberg tumb kypros abdelilah tharman hungar wellock manassa smerconish weich exclusivist gusau dewinter ibexes sragow driers whitcher norley wenge leisa safrole immortalizing hafta milongas koenigswald fedorowicz jermy serviss jimmo kaiseki jolokia backbiting schiltz gaywood doulting kiga armwood hardacre aiders macmullan jauncey oblongs mythmaking granath glaswegians sck hepcidin struss ndk nymphet reince eilis rogie crouzet kog etkin sprayberry taormino sanco radiochemical sqi gislason atieno gridlocked vasanti toxocara stiffest rozendaal broodstock heuheu micklethwait ruza loranger nonexclusive ippr argoed syndrom sturdevant purlie shinfield slocumb menorahs gostiny pixi disabused zaib jakie zipzer nrega fafa fayt beachcombing iconically myeongdong dolling wce thebaud dianetic shaiba academicism leucistic behaviorial goldfine zeynab silvennoinen piggle lubovitch eddins airlocks preschooler rany agglomerates folta petralli videotron sepi shvets prebiotics drizzled adoptable obos profilin kerrs biang reclamations kemna dbk ardeatine devilbiss ehrler samaan tryweryn baisley preffered reenen baddrol depue theodolites odoratus incandescents crangle liebster roué odni geoscientist villares sheinberg mendeley xdcam busek kalua shimao thurland uncorked sandero barkun ghione wintergarden mycoskie jbm iasp boula zelienople speechwriting dettmann manipulable oncol vrg arugula 
salaheddine kgr ndma bedwellty boseong hatbox nly cuvilliés schwengel dinnerladies reccommend tareque jindrich jamul ourso quyen alican scotsport bürki jilt indescribably caracoles shanmuganathan sowe athwart harrick ahmat floaty seelos oluwole shrimper mithen penrhiwceiber chiseling mahale disapointed copings dishy egerszegi rezvani tsogo argentario dispassion laverde ayd broadstreet delev wafb cherubic sader ibec bezanson wassailing grishuk wedin cision grimaldis szczepaniak epitomizing kest leaseholds tuenti rubberised flexx trawden wehle kirka mashiko bheinn dispossessing patrushev pratfalls apostolou sweta newhey wyms microsites narcoleptic nahla supari forkball narus odle mackowiak zurcher dunstanburgh tetri grammatology colorism castlebay cmte noninfectious roitfeld biches wittelsbachs soltam buckram careerism burnouts currahee vasara uccs poller skokholm turbaned firnas oxberry pensionable shirlee daoists milbrett lilys codford rubiales nmci hoglund kartell rensselaerville juridically filipetti alw trailered telem pathbreaking mezze portswood deecke methylhexanamine enrc tcpa tarell vasectomies aplus yenagoa eij tsultrim lerida asrc candybar marista stobbart buchheit bioengineered pacifiers haapala mackeson elwy moelfre grymalska ziman nishiguchi nakumatt laraque yertle kanju woodmore planetout felch biomimetics drypool xerces roley hershkowitz okasha mizzy magnetopause doublings tardigrade beauticians postiglione zipfel ajmera lubricity mechtilde quadruplet laughin kyunggi valdano nauruans zaydan spri hominoid rahardjo reydon cincy interlachen auman sarne yankers roomer hallmarked nakhoda aprender huysegems tetraplegic speedmaster yima mimos gouty keitt micrometeorite tarquinio refluxing fuimaono abergel saxondale mallinder nadda batanga osor icdc jillson srikant subbotin kisspeptin matteau linson delme felson amcs payscale ncra steamin oberholzer futaleufú sagaponack travelmate abair macconnell shong kolaghat weltner hummers jonz nordlinger upstaging 
facebreaker embakasi varel prea hoegh michler glossopteris udrea sevil wlib mckown mismatching cavewoman jaggy yamagami sigge deko outboards brancacci bashkiria jiwei scarratt cebuanos urena kuttab rupie rickroll salhab cavaday rolvaag naoupu squabbled nahalal fractionating aldinger overlake hoverspeed pamby weschler klove hirings indosat bianculli maualuga ghiaurov dambach fleagle cogman nacido porlamar dijana savored mammone alving proferred esquerda bespeaks celtel dnevni alpinvest bclc zwerin milley pisey partos jerre sousaphones faythe reichenowi demonise interpenetrating guangya sunitinib metoprolol penhale retranslated pract lactulose dwn hayler petito breneman balers preppie lovelle whitting brakemen hasp wisbey olinger marielena arjo macramé parwez chameau spätzle fainaru loftiest cius bogdanor neustar berau fleder whitethorn bendlerblock lessac caran lenbachhaus beihong cogliano palombi disdainfully haycox popayan laviana nyathi overflown canovas blough inverary yiping rabab takanami bungles pantaleón supersemar imrei astringency cariocas tirtzu inukshuk veridian schweikart ansin turbie terryville frado angarita eichengreen canai passionflower thiet gasbags jutkiewicz helmar joseline ivany gudermes aright scalpay ellert lacerating fokina robicheaux cxl storeman shenkman thomassin bastardi ibekwe orosi permanents antidemocratic neshek bentzen okrika stridency piscitelli helis pankratov toniolo oline scheunemann edgemoor pinola reemerges containerisation triwizard surveyusa inchkeith mungai reseal counterdrug gaugler caffé kashkashian ence mantou desensitize acctually altmeyer newitz anosike ntuli dallek irsyad stahlberg fahan dalgaard rabu igate radloff rhodey eyestripe horatian bankrolling breman immobilise needled mccullin sinas chademo dormans novellist doraiswamy shahana experimenta sniffers shafiul maslowski pridie regularise leekpai canobbio tianyou mitsugu journolist kalaya eyad idy luchsinger maiken sensitiveness mokaba malvidin porté steelmakers 
paulistas jili cataratas depigmentation punctuates andamanica tecta guibord principlists coiners transparence plack darkies pavelec maraval sivori edite hechinger runnemede hirtle recolonize exs pliyev breamore macierewicz rehydrated crecente bigwood pinkins acsc doonbeg litvin anthelme conseillers reestablishes kapadze ectopia determing xiaoying operationalization shlomit bignone wymer suchan siyam varyingly quiron figgs deene bacigalupi oswalds croisette dalwood gunmaker osita chaddesden terentyev bagheria ejogo fallsview bordellos pianism consignations decaf ccfl vacheron zaba freezone slowey robshaw grinkov imray pompiers alchemic kanbara tritle peppi hightide forelli sportswriting israelian charbonnel kylián lanhydrock chilbolton nipe wellesz memeber kerbel orache marbut douville breugel vatopedi retana reconvert saprykin kief haval geminis crouton scharner metallers kornegay worldham idma zubeidi dsge karlene borota basker kangleipak intercuts carrilho demineralization santina huguenin coverly belacqua tchen packhorses rajapaksha funked joz namuth reignites raska afforested fakhravar maharastra brimson xmc pinegar antioco narela klark teddies madrileño pokka nunsense legorreta syle oxalates qumi haner heresay deqing pilati vaus bafin canigou zenos completive hauben shanzhai auditioner cortexiphan letson pierotti medjool iguatemi authoritive reucassel vernie martock palaia atarot kawas coromoto madelin mova bolme jopson stacee maltman unspecialized mamboundou outsources zavos sarni censorial tbwa taing khanaqin shatalov ransohoff vladmir specula buongiorno seronegative blandon massin bassuener lohra consuelos mensdorff pergament muen schang varya sarraute ruv carbapenems flourens samruk oidium gashouse salako landingham rokus icus whitepapers verweij speakes ehab atrios anarchical wctv roiling kines tootin halat speediest ungoverned souray antivenin koplowitz rustamov impune lathrup echeverry opon onnagata meintjes belot olman caprea ratajczak savable 
proffering idelson weli vermontville ctrs longnan mvl lippett bobick abloy nikolski estreicher ginsu brutto targetable novator resona comahue inishmaan ascs superabundance equivalencies ruweisat shrooms thrombi architeuthis zaruba dinho pokernews tzvetan tumin turlin excoriating molefe scriveners ackermans logans crpc reefers iannuzzi baquerizo whay specialis kingsthorpe hajra egomaniacal grunsfeld vasovagal swidler resorbed pharo cosmopolite julier shuffler fathia popples disneynature leili jussie bollani spiezio clal scalper patapon mounded econômica wafiq numi belardi prasenjit sadow dominations lapps ravera torsades dargai dauger autopsied gullberg waft glw estephan bernales wachmann medad orangina aerate mantain citied lovellette kasztner orioli licensable ragley horizontality hypomagnesemia imaz dkba benhamou registan polier gwaelod hensarling exsanguination mirghani garavaglia mcnees propagandize margiotta sobranie ahds spoilsport norridge veldman mennie blundy vainglorious celedón kelcey bunthorne olodum hardberger thate precociously gibbering twirlers oramo thiselton magnanti sombat zbornak cammaerts sippin metamorphosing ellipticals brahem thrasymachus italicizes evangelio lempiras succinylcholine semimonthly ventriloquists helenio dcfs ixnay djamila weimaraner fotini leasure shuford heatwaves focolare sambil kessock coomber hanikra diegan thumba bowmaker roelf morrey nocentelli pankey mazzuca sandvig tatshenshini evraz atrix soyka ebsary malino edito khanjian kraljevica napravnik wintery cannas wycheck labradoodle schuco indelicate mhlanga wickert dustbowl alderbrook prydie inebriate studbooks reuland ladwig shinsegae darwitz shanower copthall nametag pyfrom domoic labrocca familiarising highbush cherryvale riegger amdocs halak rebekkah krankl geman malyan sasikala aubisque kyauk cza baldoni aerostructures bunted duplantis trousered interserve ivd meiwes wycliff jolle mixu brms lalic ruggerio papercuts cordura kaplowitz rendina mistley wockhardt cawsand 
plutocratic boxfish uko karelis backlights bevs katoh pacom praktica torode catsimatidis bareheaded goulder communityamerica schalcken amoudi khaemwaset capanna nafe postales chrismukkah charlayne wilgus runneth otman mitic cardiorespiratory timko curreri aliadière eraclea hieronymi phonecall pekao sichem melchester gosht morever geissman andalucian océ twd pitsford mettawee varosha superheat chernikov rylander rabson kamarudin factious schrieffer arresters trigorin steinbauer heatsinks splodge uneca nakamatsu destremau acklins desmodus kawika casselberry abéché nyad gabbiano bilyk theatric himmelreich uhse saloniki boppers joorabchian mrj maybrook holocausts getaria iaconelli kesselman venditte bobbleheads spooners elkader biskind nobilo rafel sensitizer timebase domracheva echavarria pilferage reilley intersexual wadl tincknell dheri linse ppcs mushkil ponchartrain paraprofessional irresolvable barez yars diedre isaa sofyan lazebnik batrachochytrium antigona aveda tatas intrawest reiserfs riady aaqib barnburners sypher toners pitesti farhana pilanesberg olatunde hedsor miked waihopai cobourne trevathan kelderman kltv gines hackforth wsls houri bankrate khdeir desalinization milanesa wilborn unpractical fujikura vavau gandhar uemoa bennewitz hardley wishman daboll marjon orefice dadford thik zeder pallion augst guman vigouroux kanev reoccurs kidpower posas wichard masoe schmelz gader hmn hmph moxi swinnen keersmaeker upending ukti geesthacht otterbourne empaneled dcaa supersized lahmar lightsail lipica americanised felindre garnishing giarrusso clennon mutrux badiane horris vindice maní deolali schundler minquan lessin graziadio klinkhammer nuyens barú tollund ciccolini sangen occhipinti sitars webre vitrectomy amethysts thackston lohri ballardini mcgeeney telecare husting accotink cedia lamblia chittering ssns riach liveris normandale baloi efface ecfr jejuni strs margarette nembo lameck setoguchi irglová hetzer astellas rumps schar radel lakoba eoe fraiche 
tessalit apigenin oldmeldrum ampt routray ladwp cheapens longhope counterbalances menkin lize eliasberg kashag walsch sharmaine roraback yesteryears marean tuulikki fuerst baggini zhucheng leintwardine aeromagnetic lipshitz jurman soapland tisci systembolaget itin assistent bioassays millivolts comalapa micromanaging mingchao borloo sscc safai lamott éminence laubrock duffner suchinda abade ollivierre penalva flaim humaira irrelavent boneham outfoxed delabole houndsditch idrac murro strobing imagistic dalessandro wolfley neurogenetics myelomonocytic cabrito adls jihads mangudadatu teamers fidra overgeneralization joique uncrewed lifejacket llaman kyron crèches reissig ynysybwl aspers koenders prescriptively shiley snorts woolhampton dook ponnelle ajm pseudoscorpions aldersley djankov cohosh hultin cornrows mesaieed roccaforte precourt negoro beldame warramunga epcc stangeland bolotin abets suicidality microsystem falchi downshifting slea pagos kerplunk swarf kolata chri seawards plainest diplome nusseibeh heronry nsta zishan wyndam jurnee meah meadors vadims maddren haversack bamidele hahm chanos blackcaps cagni winnowed blaak gilette abdelhadi bikfaya diametric soundproofed zonker organizaciones texter bwt jerebko aborn floetry sensitised hindlip badmouth plooy masci allopregnanolone mascaro sulloway sww aristolochic dilettantes rowdiness mixologist feijoada anacapri casby hacktivism bauke pussies soussan celmins marzocco mrcc gamero itchycoo stoles wrigglesworth payaso peacocking ruocco promethazine rlh adelberg murcutt revanchist mestral rhosllannerchrugog trainloads ayso yasbeck faizul oberpfaffenhofen anseong oakworth scis sidelong adeje colback karti svec upbringings amsouth onj fujitani wsbt corie connexus departements ehrs alhurra cloudscape huxleyi garavito refloating prospera retweet hepler fabela guoqing mytown voluntariness otx killefer mcelwaine fridrik lawnswood gsxr petcare kval ristow bustles wizzy cushnie darche savov rangiroa apparatchik straggly 
hoerni ruak kotin firenza riyo calicivirus jousse eurosystem dimuro mindshare monopsony ricciotti wkmg granoff chipps raim wähler sylvette kempin duwayne leick pollino garfein starikov dowds swissotel chairmans jelsa stael tomatometer aubergines partakers mugwumps borhan mjj shiping mohannad individualize apgujeong virtualtourist airbaltic mvi sunseeker tachograph lillibridge andric huaiwen ruku bahcall hellmer shital lumme yelkouan convergys pettine nimai mirch encored craniopagus manseng kincora bloodworms brohi eareckson wnuk norrman alkartasuna anuses gainers chalupny tazza leatherwork mughniyeh dukurs mazagon pccc togan kligerman foshee phillippa amac rotaries hvorostovsky auvs hattestad claybrook becomming monsour weaponize calcs youki portend eys thoe wandell gracefulness inmobiliaria collagist viceversa apian giovannoni zevulun magnifiers shellard disapprovingly limnos luebeck pummels ioda suenaga otterness ikenberry linbury kolakowski hagenauer marchiano conventionalism thst olarte ijv demobilizing jingsheng enoh pajhwok lansburgh lowick balliett mementoes prudes secher eijiro schuss inglett tuvaluans rubie additon berkswell roaders chopwell ,that dosunmu nordlys reformatories chopo modish sherrell cluses chicky hamoudi parx trumpkin deify polachek babbino clubmen smilla aganist kordestan belches greaseman folkard maneuverings umenyiora superfluids tobel belland dxo matrixes parioli unredacted gorie rault tocci craske demre rieff madhouses xico inattentiveness tenuto pillager tweedbank delderfield noirish bassy barlborough pretentions heintzman grazioli drx breadsall inorganics kabukicho stewarding sharkboy hargan trixter aesthetician antolini harinath kroemer phranc groeschel hotbird worgan shikumen iphigenie cadiou mystifies bacnet alian sawer wanek groveling ducote bontecou gafoor bruchac usfda persistency rusesabagina mahouts seersucker intrinsics samini dissapear cafs conspiracists soboba unwarrented glitterati anexo sfumato frigidarium ghanashyam 
keyzer seff gadomski chicco deliberates ceco wers earith specchi leitchfield cerralvo callar westlawn haukeland enghelab toroweap actioning woodfin maramba orchiectomy fletc staus dijkgraaf interrelate kulina henchwoman ewloe bolívares kesi journaled josephe dystopias unshaded cooped mehigan hypothesises dunalley wamala castan lelli stewartville shaggs spradley hanaway carpena nras bouli wudunn sellinger dusko jbara eliav freerunning mutuelle puggy tuymans shavlik fizer viendra krukow shoyu geoffrensis schiemann vytas palest buttiglione stompanato manhandling cordingly arizonans mank sussed yuja inkblots angsana starn ghettoized vezo vladika legatees crazyhorse kogod kamalabadi mazouz dassow turckheim gardere packrat llanvihangel bloomingburg biazon bouge pinette baymen figuera sproxton privilages wafted hindfoot comon demurs maksoud ponsot piehl noonu naziism sacrococcygeal polytomy bascially maroš zaccaro digipen howgate misstates rebun generacion harmlessness etfe bilborough costis hirosue triptans polovtsian westernisation folarin dazzy faton perenchio emran handbasket mittelstand suhler delvoye chast shiancoe gainesway jackfish toolboxes bunzl pureness chlorogenic walpurga razmak baghban filezilla aldecoa fulin sapcote exsultet bledlow abusiveness rlx jlr sulfuryl bittercress eassie flamant conda merica brutale hasnat toine dahms fivemiletown golondrinas kld checkpost organoleptic symbion planetoids busst boyang camira basescu woudenberg congruous chouchou cvetkovic segu meiser shigeto chebet carabello fernet lambertz rackheath bibliometric pidcock paghman hitcham ventresca hatchell safm terron penzer ordinariness vicarages masnadieri froemming isserman consolidator akano unedo sexwale thorsons klimke unltd degn dgl suellen sidetracking kashem buttonwillow harbourview nyungwe allbright nydam rastafarianism blockbusting aswa freshwaters dallon amoebiasis herzenberg livingroom wolsingham banavie malielegaoi boshier poire whoville iwakuma intellipedia hdfs loyals 
orrock strangulated inshallah macal whther trentin zhongwei kwwl khwai oxmoor velux exobiology cintia crumpling helprin arren pajitnov bohunice usally renunciations noirmont oio jonkers carabiniere brek cepu landesberg siroco kleitman gollin niwano prejudgment lowdham dilkes antifolk behrang baggaley ollman tourish birdseed sidecut omalu siefert formanek hashid haad csrs oae clippy collarless jaroff ogling chryslers schwenke fayne chaak munib bkt brockholes döpfner mondschein fratto brignoles roffman mindfreedom pachouri mishchenko rushey georgeta dobtcheff baldia cablelabs wahle kenana priddle discoverability albigensians seculars remmert ostomy rothenbaum hogansville carven arced sportscast manipuris unconvincingly thouroughly kitz holstered fêted ganglioside jll becta hadspen kornacki taniela wintel tugade shili conferees plopped lcra stolk sodha arslanian devilfish bravas miedler lazaroff saim makary shamanist ollé vails bastardization sayeda adfc bilyeu sopka thomases metolius jugoslavia chopan adja korsmo mirvis lavonia prescod westonzoyland eagleswood istockphoto ormolu haralds shrieked inchinnan sayeeda witsch jorunn elephunk norrath watersport snodin mstp americanairlines maldwyn shriekers myasthenic tgd ennstal begur demko ffas umh odejayi amrullah batre duckham disneyworld chanty atopy sinani kopernikus cutbush becaus hypermodern biyani nguoi gainst dioula kreskin zamaneh dreamily drebin sevim mza slabbert ifremer rahima burbano eugenijus folker elsberry pekoe degale adubato filumena mundie raelian toolshed giusy overambitious internalise wargo adrastea affronts gotabhaya tvam customizes unchaste shlain logiudice delobel maratea stuhlbarg paradoxum letterboxes hollerin stallholders adium mahmudiyah marlos chaderton bodedern soundwaves ohler ivanko sylphs donalda pinhoe callon kritsky bahgat ilfc ralfe hullin kaufhaus marszalek rinky cihangir etam falanga standi midgette malecon kenis izotov ciganlija venza kirkness linderoth kreitman tomberlin lamping 
clotheslines zellman binkowski morsbach kabaeva underarms petruzelli borgström autostereoscopic steenis pertinently kandis trinkaus neowiz nesil scandella boab gonnet sinikka ballerino abeele klezmatics highpoints kexin straubel jacquez nextmedia demised widford dabbagh theanine sleepovers johannesberg asay sequesters moralez cecal ranck glidepath helixes bervie bialowieza tshisekedi flatulent sluman warsteiner debauche hamady torrontés yangshuo clts bryants kertzer miyasaka handscroll reformational jye silsby sabkha rafati régimes schorn föhn gyeltsen buice hootkins weisenburger zabludowicz peke dannen emira cccu morín rigourous boulahrouz mysa glanders romanticists smartmoney lovefool balliet trease glenfarclas sugartown niane nahalat murison darpariaethau cederschiöld johnsrud harlee humorului hereon shuberts fcic sirkin kominsky forgues bolivianos blomgren paleckis autosuggestion maani daven rhodiola hepatorenal tatana itria palop pillowman tagaq unfaltering utahns suntech sumsion jawbreakers nmdc microgeneration delistings myrtus feuermann roylance monache eakes donadel networkers kazadi reorg pegel ketaki shiploads ccdev traille sidna suranga tippie fette ardie waddling stinkwood tintner smokejumpers honeymooning comunist houchin meperidine biaggio hogsback skutnik dygert epcor crang auspiciousness kimberlites springfest maggard tabea gaitonde flickered unarguable broadswords mealamu mischka requiescat weaklings agonizingly nibelungs pinizzotto marcondes nagalingam mallinger ashkar dagmara weidong disapprobation laju że kidby dinenage altran honkin jamesport loaner shoni ramgopal obsessives teetzel kohlman gravidarum sportsnight artman ywain aand zzzzz farmborough mermin aymerich knaster paoay belaire laufenberg eifert suschitzky barrowby kallat irek woltz watter bakhtawar kaarel ayanda hucklebuck gatherum shinmachi holne doggerland filgate jacc tark whitta meritus rushbrook hydrox ecocide misick trimmel miniter bobola arabists sivivatu kosin dogmatix lanos 
ipea mutrie punishers platzer mehan pantaleoni onely stampfel unpacks ayliffe tianfu hullah eulis sovan televises dieldrin aliah gretz cowherds sweetin falstad lydell kodjoe dialers neuregulin capicola benjamina cki heaver goliah karacan predations hornes inscribes yerby londis elledge lookbook grindell poje pactual katselas fechteler grinzane atteridgeville piddling umeki cooky kightly planty pentothal coupet flewelling rechnitz arsan vilseck patronages memorisation larroque murless lamacchia reclosed hrach shawlands unbridgeable myrta labidi yangmingshan lell pracht holtzberg liveleak dahari kosem tallil eaps beadie jerko ellenberg grapefruits charton wallbank tweakers edirisinghe vocationally yusufu quen jatinegara boughey tullett omark lendu eneko telehouse ettv cracraft dengie ají villumsen begala reasonability jihan thandwe allergist beacom larive mohanad isaca hoeing gure chandramouli bordley molaison externalization fontán schanberg nutrilite dacoity weart kavaja manati dble narramore recomendation brazillian langguth spycraft barreling kubelik postol cobell rompetrol kawakubo chiappa brittin knibbs frazar lowliest bonao ooma peasy directoral lhf fanzone calamar montreaux virdi dorks lamsweerde emadi bullrings olimb careys aflatoxins williamses zarine bottlerocket indomethacin suppliant allderdice pumpido cigala levchin rachubka tinatin saju valvo hudis dalmarnock cubbins georgopoulos terina stoneground calibra impressiveness soundchecks balanda preconditioned poky carthon protamine gourevitch giovanny poupon bakhita igli inspectorates cervino bunz tokin traditionals vly adeptness morcilla nganga thiaroye jentz selichot soppy kraits blincoe waisale makoun kecoughtan ducommun zahler kamuela abridgements dougans représentant lopping kopan gottingen despatching einfeld coppertone delegado funches superheavyweight tajine kuusankoski spooled flaxley instonians beseeches downwash jauss peated firstmerit disgree musulin arblaster mabberley lukan mansford northend 
touchpoints gulping breskens nalley spastics saila vadas khalek nuth knoche athari ketv dogcatcher retama spireites enigmatically goldwell rines herricks senekal brouillette pollione hairier chethan butson evos powershift discords staphorst handwerker ardleigh ungated mujahadeen jewkes blumstein yafai obx bacalao sexinfo turnesa kulatunga nowacki pocketbooks unops connived erucic lozowick pbgc repack pactum lanciani deerslayer ottilia croisade krogstad carulla mench iava youngers rodders wetz panderichthys relabeling franchiser multilateration palookaville shadoe braai mwape stylee disinheriting onfield saiqa pirouettes gafa enx kleeberg nufc hillsman microanalysis albertsen colosseo cecs nicholsons schueller askerov bisky dinizio bogland atenolol silvy mushir castigates carteris tiravanija quarterbridge cyffredinol bouy bellringers douc dainis hanvey nouman daviot mahli javari reclusion atai thrybergh nyuk marjah newsbusters linpeng rivetted craigleith zisa bedevilled saletan synetic mcalary mccloughan tasleem ramphele shrouding themsleves shekh backlands paisan sagtikos zarzycki tanioka beaudouin entryism valadier churro imison folman gobelin extensiveness oedekerk smidge skicross multispecialty rabbie razik supermicro dannon hongli nzrfu morans mangku lakeway euromonitor paralyses lurdes ekaterine similarily fbx zakheim queenfish xuanhua toploader datacasting mediaite kappen hechi potiskum ruscoe prees softee ismaeel javie bgf sufa varella neels zooniverse douar marimow astrocytomas piggins cnnfn yuille douillet khalilullah mashaal rapel nicolaï middeck daubney fagles hadia gadaffi apeman mnouchkine bartlow abey rockstone broe tomoo smbs benchtop argyropoulos fsia pigeonholing aviram ayari brundidge morasca struggler herrig mayoría sivertson rubenfeld thrillist mfume anw jakin questers strathblane tieto yps aiz schenkenberg kazakova marfin durenberger warrener caland cew maruca sobbed caravella makeups kristos madni obt galahs kasasbeh swingler instable edv 
carreer abotu fidi baltia canners zadkovich tullibody baumannii similary elsley windrose slann boonyaratglin dithers kulusuk asel allures cruisecritic phulwari trotskyites parakh losangeles comins futter velveeta marylander henein catatumbo microstock hurtles mihailovich individuated puterbaugh eqa novitski schnurr educare hemmat muleteers failaka manuchar lomachenko sfjazz preciousness narkiewicz haidara eruzione uncured shuga watsa delaria cohiba jinhai ruqaiya glassblowers scas thebaine karazin nerja guity mitrofanov palila aloke esmark pechtold sittig annualised fiocchi nibbled wrightsman afridis doley buraka tranzalpine fauldhouse morreale epidemiol transplantations kazanka addtion tumescence moulsford klunk flairs incompetents britnell obstreperous vaupel pendens warberg joing asael kaws amobi comtec hyndford ghezali brandie unflagged heckles anabella vyne podger duroy thomsonii polarizes pisanu aldemir handcarts akuila imeche oberhauser masetto heimberg spinna dehghan blurton charvis megacolon overcautious compl bagnasco iljin sacker hollingdale ironware psychoville hebes glucksmann dlpfc chokin drh devender modrow acedemic broomfields amee jadrolinija afcc newser hibernacula ixtapan slenderness scyld compal loanees dolwyddelan pencaitland hamshahri kinara cuningham grafs tindley ciena blando kendalls shemaroo mcfedries topock waterkant hyoscyamine chits snel duellman zounds jackling akol nfts sfos bruneteau videre jeptoo korzun mahna zhilong ecca naeto weihenstephan torrico earlsdon falungong friess urfi jeantet rathman gladioli esrd apro ariary mortimers raphaelson protus bonallack castledawson piskor glockenspiels bayham dritan quickstart mmw sakyo aquellos dogberry cablecom coagh pontcysyllte rumbelow africo blackplanet dmap paretsky walfrid habashi windies clamming combusting ghf misjudgments elior boler kravica garity neupane superteam kovacevich calanques erromango islandwide calcutt nhlbi reappropriation hofmanova himley barehanded lochinver brashly 
puddletown digswell fortini coxiella whataya mileham congresspeople gromek articulately rateau eyerly agresta flahaut moniaive sarcopenia chinguetti bulgogi aletti abta councilperson firstenberg lequel isikoff bosca hagolan fordice hasti leftwards demming breightmet kcom dahe vacuumed ooda intersexuality degryse hosn roofers gerad freebo probasco summerdale kashk pand radif copplestone shammas buscot guney aftershow trischka ibsley slickness cosac wisa baulked ibcs ktlk balsams kobie chouette crem latke grappenhall kneecaps nyoni monder backworth cherimoya diamé serration pestilent cookoff oneok bianconeri helmingham promesa duplexing willmon mountebank ebus falloff olliffe xivth balash vallat rampantly dainippon hoffe achmat streamwood periodicities resetarits nastily burnable maderas benishek quilici entrapping greggory goldsworth wiederaufbau bignall crummett biafrans xns comolli poyi iddon eavesdropped jaud triffid reshooting minick goldens zadro gangte woodroof wittke spruyt temozolomide incubi yizhong brownswood asae optum carboline bukar charkaoui nightstalker bakley silksworth everbody platforma semaan seidlin raddon ceus sorbie fieldsman dyfan veliz kiyota rothemund teunissen wmas supachai archaeologies corrigendum ganis insignificantly lakshminarayan frud waitsfield rimkus religionist quoddy pipher myrtis cgrp immunogenetics chevys plomley crivitz toolmaking dfas commensals esplin wfb scuttles pamphleteers croughton sorrentine ncv minimovies wheelman weyrauch latapy bacchi cloudcroft wheelersburg drechsel palladin egberts dejiang sagnol albaugh flyout scotish delauter overwing pollari henske stehle submited anie birke hyperemesis moulsham lanett regrette notw burfict recombines halmi telescopio elkhound anielewicz uttal minges caroms swishing neier urushadze stackridge aetiological nalty odoyo tymoczko przybyszewski sherlyn meininger hedychium farooque metalized noever fenley monocled abene naspa chiasma knavesmire brewsters heesen sanikidze inonu fato 
geomechanics ledroit clinkscales hookahs palanivel hamu porumboiu copertino velilla payatas gungho hargreave huseklepp merja perrée lonmin naoum custards barki tuchel kabocha saronni carnmoney hanmin mcso wtih cholodenko mirk intangibility faintness jiahui doctora fingerling insall polishers ibargüen tadakuni ceel heff meowing revuelta waipio dooks patrica unforseen benthamiana trau otk chynn gorleben popsy zvika tzena gilhooley bacsik transcon fillory auther mehamn mcaslan smyers dinna lazzaroni fuw dakotans hipperholme lauterbourg beatlesque joralemon zuleikha apil liljedahl scentless industrialising langkow lekker friending rassel motahari brantas pharmacogenetics nationalbank rikyu tompall fallaway bacchanalian brocks shosha eurus guidepost cirl bated preoccupy coequal hilltopper komor pianiste besetting getta blayne darklord korus laudi cambusnethan cordesman brettler meerow bethencourt darrien alko ingrate agonizes vandevelde shoreside grod gulches playskool zigeunerweisen boepple creditability worl chondritic sunzha hobbesian lorean epcs namirembe hummert roggio ménilmontant mcguckian joeli nightshirt akabusi gurfein brockmeyer garz lmct hindenberg bolzoni stormfury wackiest iecc toos azahari letcombe hardbody shenar zscaler thean remarketing jetted ilkin demographia mwenda theoklitos crosspiece gurgenidze parmigiana showreel vontobel irishwoman vulgarism tumnus devonish bushisms udb stokenchurch altadis polunsky blastoff megaloceros seion kewa najarian genpact dafen bachelorettes moretta strategem bayana atomoxetine rathe bailin dramatico sorrels inglehart goey reillys hanceville boldre bernadett gobbling sophisticate interministerial ziekenhuis chronis chambourcin sheats jazzbo tianjing llanfaes mandanda daguerrotype adenoids apostolates elastography posession jelli sharafi fuegos popinjay kingsdale turowski losse immunizing bejewelled allured stockett redenbacher fungicidal minett agroindustrial phyto rottier weedsport garsdale bucklew kinlochleven samme 
zigler psammetichus widgetbox jixian sarongs sanny shenae traffik agulla johhny hdn sankaranarayanan petkova marije performable naturopaths sheron opare danielli euzkadi wynns reguarding jco hayner viniculture cheffins hyodo finnieston ssas reprobation strothers fumiaki winterized pabon gherkins opvs hulshof guiton kirtlington briem palladia edil iied fusilli tizer ryler hedgepeth mosholu itsi leblon stondon torrisi vimto mastroeni pwrs asinara cremains montanez cleasby ujjwal ebele boschetti sayar teramoto scognamiglio nbaa mallord snibston usamah progenies ekati undies everone skulled untrammeled samiya softy moily iliopsoas machimura shipibo richardt quinquefolia dragoncon buile searson cerin matsch burnhope mayaguana pillagers synchronises cile nadder controversey corendon leskovec keratinous benevides succeded ceglie rtus moisson sirtuin shoehorning mckearney ttxgp teleperformance lignan hookey showpieces maltbie brkic zaripov ikramov oppurtunity bekaert cluemaster perusahaan airmax aurélia kalpakkam angley harit burham sassine sikha originalist vanderburg phenacetin sexxx gingivalis hollimon bushwacker tsuruya gengenbach anonymization nahrawan astrit legitimating hummable gröner armit buhera magasins weerstandsbeweging karlsberg diyan rofecoxib recommit langworth tashjian dionisia lifesciences domperidone keshishian fiveways gartmore gipping mckennan tolong shash danielpour lanskaya sebat egton mutahir macdonogh kieser imperilled kptm sampietro muramoto ulin taxidermied taymouth deddy derin malefactor crann endears tattletale merveille arem tackley lauffer korowai newquist mulry seigfried criminalises cystoscopy bomhard upperton sterman tenoch appart laboulaye chiweshe candlish chalifoux paslay linett stabilities ecks karar ekofisk arritt coprolite crathes mithali landzaat sherwell chuppah hypercritical shives delinda accp eclectica qru harelik leistner pirg sukhum vasilievna kompania dugal iryani serlin poelman herskovitz sparber directa secularize 
precipitator noncombat boyde mcgiffin lochcarron scuffed sauté maglio wasantha toktogul rollcage durso khorsandi tonis kouilou quaintly sanmarinese liel impoverish nextdoor kappos hereto lomana prates issl pradel wfi pooping thornely skau mumbly prospection reddaway shortcrust amvets milage enniskerry kösen qurashi uncoloured lezard otas rethymnon anush marena skelley pruner holdich djerma woodpigeon unconvicted aronin drollinger okudaira laubscher viveiros lumbly levox shefer chlumsky grrrr damxung longtemps scrs bulter prutton bohner swavesey bodwell shavei friedenberg taverham menomena monnin esdp melching wlbt utsjoki rerecord senser premierleague islamo merrymeeting norplant konec huili milioti problemo saraland canzonas peñate werne ifpa graveline damman gracq ornithomimids magaw butterick gaian southerndown sisamouth wpbf pettrey hillclimbs arsov surowiecki maximiano fatullayev payner encrustations northline arrestable kurkov robing hinderance peranakans collecta belaga bawag draperstown aeroespacial impactors whisler unpromoted poderosa agoutis lehenga menabilly slackline lomnica pdus iconia kayley stunna tostan cojones witcombe worrack chewable ghafur canst eyecatching khazan antitoxins situationally tostado geremek steineckert juang popworld tagge multipartite rubbles kockott fxb bednarski ledum adipiscing sobhani wisecrack vathy slating bramhope defrocking kasparian meiggs filippenko ivoirienne virtualize whirligigs inarguable kepier unserious capezio tigerland turndown szczepanik fiss dinello rotos hadrien languorous snowsports aspr corncrake daviau onuma sarducci mihok epistemologically owensby sydsvenska nolot doster sherryl ixworth caucusing jevan pahars shijian planetology cravats imar sabbiadoro rahmah lindenmayer lineberger tenho mcclatchey culpo wittingly burgoo bearpark vear pleasureland lagunita ejectors mmabatho gionfriddo misdirecting elkana carabelli subframes jonbenet valleywag afinogenov cippenham coud deveronvale wainwrights liuba 
ballfields feasable cirigliano blythewood menaker unselective igfa haroche hodgkiss jurys tcks reincarnates latendresse liddington vincenzi cockbain micropayments weisenberg arav vinnicombe hoola dique kissidougou naumoski bopping rejig pilsley duplicators rivieras petignat melhuish pinheads naisbitt coffeemaker socolow kolomenskoye verburg masculinized unimagined trrs poésy mayaki pixilation clomifene kalnins scotter houchen cinsault shahrzad roskell sosin endevour gastrostomy conceptualise chytilová ecureuil lowriders bladnoch santell reattachment ambassadorships tysoe amardeep reinstitution enigk unwrap scuffled lubinsky verheijen leogang frieser intercoms bellic soundest mcelhenney indialantic dogtanian khda garbarino encourager lambrusco uptakes blastomycosis kostyuk mercopress ariela schoff dhokha kontinent ukrayiny fractus espenak homestays bontrager perspicacity atlixco brightwater ochres pibe sulks davenham memorizes gaxiola canson leibovich guscott inventively hrab zaretsky yapese overstone soldini nonoxynol koelle soffits bouras tropp paleopathology emarat breiner scenting rataj ratcheted taffe vizion sailele coskun profili pulite roest kotch interglacials castelbajac saywell alvaston jansport huanuco dysregulated corinium orakpo danze backplanes surra hanushek dustpan mahonri vtl twinz leece bgan shukr wuh capitaland moteur okayed dykema rollright sigalas musen oenologist carderock intuited hypertriglyceridemia suhag tamudo entreating sheherazade sottile neumayr volcanological mbakwe liscence vaudevillians niedermeier streakers schol sonapur sedlacek candelabras havill waikoloa moluccans parbold retching whistlestop melnichenko vitellia jujubes merinos burruchaga virag jamukha courtot pallares koodo myne jackley brazel moladi annalynne borza alwayson bolikhamsai peci whiteadder guseinov jannati passivhaus boedeker menchu philipon norriton yasen impetuosity monohulls cherrill disanto temanggung expiatory gulliksen grabowska holtet unscreened rehiring 
mulesing buchwalter kohlmeyer shinozuka veran doval glossier steininger dadda footpads irwan eradicates zanatta iiasa cisek zarko capsa tugnutt delphinidin veiw dinges sicoli shadai agler strogatz swansboro ziga flodin adalimumab gearan fenosa saied gething dadong hartshead knobler infibulation bahadar ealham vanderkaay bouchey schwannoma yose rubdown rajabi smarten perra chenega tognetti plaît gabrial kurtenbach nuremburg dictations wappler geds drafty banisteriopsis margaritis ryann zerka dottin misconstrues randt eafe baseboards manvendra jetters betony conehead symone duchemin tasiilaq sidlow graybar prudom ilac ntaryamira todate basura npws felger prig likly ngola seropian eurogames secondigliano kylemore funfairs saffery riverwoods venkov echaurren grainge sandtoft hengrove suwan tonkatsu indispensables joanikije datapath moules fwp narender ezzor clevelander nakfa sundaresan pretest sonobuoy baboquivari economiques hutsell normandeau munguia uhlman telegraphing kornati shulin chcs wilkey chervenkov fiords mpssaa misterton imambargah cespitosa bustier freeloaders toodle kinetica keiler outdoes licheng cannistraro carseldine nemiroff ciar internationalizing crosshouse denationalization stepparents munce cdli williamsons sendek yellowbird motability radamès jawal bithell sisyphean phal norworth arini ajaz nahb oday chimbonda selmani kresh catino movimientos palan faizon connoquenessing usme burkesville karytaina landbank miyao unfertilised deann alexandrines savides koryaks luse ouyahia scha vollmar minjur burnitz nymphomania ashchurch pipeworks velayati rinses jti filley rossin touchard baiter joycelyn etheredge subfloor minskoff kebbel wdca amortize irasburg roussopoulos gazillions kulpa deschain hilman minniti norick allwine untucked laugesen dervis lambrou recomposition turista mijikenda diakite andreychuk vxi ayuthaya kukan gunmakers pmq barthé pyxidis honea mathijsen enfranchising eurich wadhwani surachai sportifs dauner augean haulover provençale ueb 
sifford galsi merabishvili solel zoie romantiques boverton boissieu siia garbs snco mctague carcases duggins clemenson pilarz sulfonylurea kitu triboro weinbach usov reincorporate torqued camlough theocracies jania omegna belleair tuffey togs blasphemers knuckled pittville tonyrefail jerde classist tatoosh hotfoot matern coupla kawarau erdei klopper reinis shafique glomerulosclerosis keas mikulak gruwell andacollo gerima goerke florestal nrh opta qoute defi songsters edley zaroff kaeser noti complicite upperhand umai brader araf smolder jahreszeiten dryosaurus cruet dadoo uzzell romilda koshin healthline jooste nabr estopped campogalliani ictj geodes dayro khoshaim fastly mousavian bleedin nutcases scribblings nemetz metapan mosaica rodeheaver psychologic syz swetman arkadiy unkrich zhisheng nagasato gilvan westendorp qcc erlotinib knezevic armina chieu cluzet schinas medani bullheads cassocks lapstone imporant phyllosilicate saloonkeeper moodley khanan sammis haughtily capc wikus bredero andersens globalpost silvis campodónico mosen rakeysh disquisitions harut toskala donepezil wakui pinkman kreizberg conveners jaleh orlock attemps robart antonian henllys hostelries heckington oliviers fluide amiesh imagi xfp voxx globecast nht tuckasegee brumberg rogachev svenning phurba forvik zappala adherance olari ferdinandi ambulation jetsunma corsano draddy doret anonymize gastaldello burum aih negativism friedewald sharpies membe zamacona skilbeck rafiuddin debark ruke entwisle bressingham raghuvanshi jackasses jabbed madresfield keitany midlevel yongxin disturber finjan spicing tolsma huyen meetin ascertains cutz muralidharan symms villarosa rumpled boulby niksic monchique klenk andora codron freesheet triers hopis mcconchie hilker roseway yaohua digressive hadrill rosengart blimpie lawbook rollerblade ohev dcgs lanzoni speta matatu lowis imperturbable engert tarmacked gaboriau hornyak lawyerly wilmoth habis mccarry dobs professionalised schoeneweis pasillas obua jagdale 
patrulla decimates titillate hippolito rundel kirkleatham grasu shoshi lauca efros montavista shvarts schmerling correns lookstein claramunt bienert portville motera kiyonari frays huckleberries salari lgbts snorkels gaup tartous leiby cowburn kafando wiele bayston jerrie cliffie paintbrushes kosonen syse kyson toblerone wilpert rimm respekt lagravenese leazes churrasco hengqin conmen mrozek masazumi agyness brassier crammer cassopolis dismissiveness alwa superweapons gurry rhody bno renfree microstation cocooned tuller wuv stagefright secca thromboembolic traineeships cossor grynberg atsuro beaubrun seperates yna slaithwaite pangburn allocution pyrethrins maneely hillstrand cosens umes azahar jbj gayles reveiz avercamp pellston porphyrios yfz ardolino huatai wbk weezy pielmeier churnalism wilce qutbi personation bandsaw satele feirstein sotalia kantilal ermione banchi srtp transgenes finckel wyndmoor hofstad horizontalis giblets gunsaulus zukin towery rheault disjunctions depósitos merhige rathen mcquarters sidhwa simiyu dierkes backswing joxer atletica assessement melone grann degner rawda laayoune hilti sweatbox garota toja undercovers vondie lochearnhead vett apnoea disinheritance incommensurate aduba nepotistic zorbas deviously pierside covance scareware johannah crumm taoufik sharry clearpath lucchetti barasa lacosta museion vanin salvors leira aladar decliner selfsame ekho salyers emcdda bottari choreographs swiftlets deines cotard remédios grinstein archey pallin possamai jurmala spogli haverton bertam iino rollerskate kanjorski isenhour airhogs maharero faja leyshon skewen palling jariwala auvinen vershinin lulay ilze perms lowfield enforcable luedecke screenprinting apprently thecodontosaurus fishhooks caapi talamantes squeezer sachio karamjit carbis polyculture zardoz tugg laszewski vedad brimington dranesville lineback imta defunded touger puxi amiability matinicus carisma guthro candeias ottewell reprod martirano galwey camarero ginocchio dadon shelvin 
osterville ludik tothe cqd tittabawassee loompas blaspheme börne fedotova laocoon sarbaz deeny savident zbb bunshaft faine fkm haddie gibbens diaspore lucine annuls nutts bourelly impoliteness lituus barias hendron amoros fosun voudrais coggs endress ulugbek neurorehabilitation zappia trentonian brico lockbox torpy ponyboy yizhar equinoxe kyonggi diasporan wonderfalls cadabra beyene cheatgrass ranny intrusiveness fergusons gianotti longjiang gmx leja mattawamkeag badsey porcari cormick retuning salzgeber brabbins economise bondholder stracke cyf isss deitsch kharafi cohabitating dolgoruky derngate ezarik shochat mskcc murville creuset ottolenghi handstands ccps itemize cvma xdsl yizhak faour brisby shewanella nizhni nuyen korek woollens godowns delfouneso microbursts covidien treuhand unrecovered edwy grimberg clothespin spectatorship amim sangs acromioclavicular penrhyndeudraeth supernumeraries unbidden disdaining glisten macsween poxvirus nclr larita torksey bodell nardis koelsch monsoor chalana zucchetto thaker palyama pregerson periaqueductal kinng ppw swilley bodyboard wesc riggleman martinot sileby mifi subversions srec smallthorne motorshow apmc cirone swinley rcgp herran essense barentu frecce maynardville flyertalk maiale durran doest repulsor pinte lprp kinniburgh codner condylar dakka desforges deele voelkel beinin varmints republishes xma heavenward aroon azahara clachnacuddin lhakpa amiram carrieri pamplemousse fayer vasicek isakhel pechiney nhd weligama marucci lensbaby cédras wearstler folowing changez myaungmya fairydean dorthea arvelo tsujii horev lauch vasiliou poopy glassport prestissimo mccranie halldor bolls scieszka mengniu waine artspeak scrappage mickelsen eble trifunovic wva  duneland kunen recolonization bigfix racquel konare diskerud endellion magnetize picu urbanite duniv philosophize irremediably approachability brownbill imesh posibility maggert potternewton stephans vijitha penclawdd vof longboards cassillis cruzat rvu wharmby 
unsocial lavishness pralines lefanu peppermill vietri blacon amiloride freidrich espe recertified hosing baros glenurquhart skaarup superjail dhg iwd nazih brocaded stickier malyn teakwood longerich moxifloxacin tnz ushahidi ganache grumiaux yuquan modernisations brandley choji unbanked fauzan sambac vukmir bulerías elderfield peszek orthochromatic neaves salination lorina fosa shaohong gowadia megginson compensable wisener meconopsis chmura zöggeler stanic unbroadcast ogogo zinat plyushch rashaun glba exor levetiracetam rafta dayron kjærsgaard ciona feleti dunelm pubes jasms notizia maraniss prorated veljohnson russwurm uncc buback venceslau scratchcard introspect bulker galer scooba sumus ponant grimonprez mapple dandie dialectically classicus gundula naseri warmhearted sioc seedhill markward mckinzie frenzies paikiasothy grannie panajachel scarwid ngoy beddow gaborik boulmer polylactic gwynneth kolodin jagang pitsunda anthracycline cariani dohme zarem prenger slavitt prak carjacker sakhir childfund sencer mamady tioté xay biltong zendesk dgac jelloun masterfile slavov ladybower bupivacaine drools belridge armouring yodok tumacacori unluckiest juvie masley pomonella shawal mdoc thinkgeek drysuit learnin psephologist matityahu wisam hatosy benami soubry bacino regasification tebessa polyglots nagell almadén materialising deselect monotonously microelectrodes brutalizing incorruptibility constructal schumm downscale cude mahasi schönhauser diamantis schriver akopian rougham nordhaug rootstown tribunale webware hoffstetter hostler knw tolba zadi bunmei flubbed whaddya shigetoshi wenches messolonghi vasilij covello osmington kimmell citp raytracing wherries müntefering cleeves touriga keflezighi coloradoan shearmur flamands borree megacorporation ibbs garvald bulliet halime iav sytem hlophe pacal physarum nyree jorasses cheok unlawfulness forthrightness hrn palevsky chondrocyte morani piazzola metas couesnon coxall coathanger blizard troupers oxidises shahad afzali 
florie incentivise chlordane cascata itzkoff agna mastandrea bustelo guillotines grabiner bobbidi tongi wiertz mardale aquire famosi resalat sportske quiktrip munther bombproof yaren davino rieussec woodseats tgvs mullineux montagnola pomposa indoles freemind hepplewhite modwen menageries vaitkus marez sunrun chuch ursell dilts assiette epon flâneur tabatinga huttner vmg pramuka kaab manhire sukhorukov montagnani isobutanol roizen exterminations tadzio zaragosa friulano macroglobulinemia leaches eppinger pokies gerhartsreiter beyblades sauvegarde jtac spirograph ihes brozman wawrzyniak guestroom maresfield smithland liheap quillian calday nontron dehydrates poorva huddles sobczak golosov przybylski pekkala gwithian tebaldo ndereba shahzaib kelsay aldrick lemel exb awasa unitaria autoregulation saundersfoot volcanogenic tshepo brachman brandram barath rudston malachowski yongchun stocken wardington tatnall transeuropa aktar haggarty kainos langway didem triola cleage weihan trojanowski gccf aptidon lichty mumo cesspools dmcc semiprecious augher mallan luarca kashfi watzmann xclusive cmcc edner slipperiness nightcaps norwin feighan sheetrit grese hhp felicitations shreeves utseya editoral stromatolite meself moez yemin moondust marczewski luminol coffeeville stupnitsky foolscap duumvirate beechy sween markou crbc mouchette tiven gutowski capellan beiber wus cheaney stephentown sinde modchips boudier deeble tundras audsley clotheslined blastomeres acheivements dommage sonnenfeldt airmont falleni wabara raewyn indisposition geck fitzpatricks allègre yaletown bouffard cavehill nezavisimaya pelchat jianming marshwood spearmon enoksen macgowran fuzing madian tiida mottahedeh hyu dahesh rgi glissandos penketh fieser aultman embera biztalk goldenseal festered dyane nivison ichigaya krygier saiyed templ coatham gorgonia synchrotrons cheuse libere enad crimps malediction ceaser shettar otaiba catalanotto guemes kamunting uncac nadig dyukov prolapsed trapero ncri sheeva 
gastroparesis wpbt tonye eggington anathemas twinkles shampooing blub cource dobrinsky rotifer cwmaman mailonline bilali fogleman hartikainen bvl tufano dablam veba wledig saimon dopes dorsiflexion pelmeni possesion vibraphones emsdetten manipulatives strathaird bayandor hamsterley dromoland somin chups bilney parasitically jabour dirtnap tripwires heatherly blackinton bunnett konjac fahrni knacker chiaromonte otherhand seiches qadiani shick banyard manigat lrad witchell meprobamate cantele altruistically hooverville lamrock duncum moloto cayucos cambiar raraku dislodges thembu stäfa archeologically eagleburger recapitalized blandina piernas savitha fancast alkyd tulliallan layabout pipkins crudest fingerpost pakistans sawkins juans marteen eggo cerha napolitan eilon lizhi walen pevs yated namche backsides sukkar tiens proportioning arnet tonsley nalp serpell wangdi deepcut tonin tarheel nannette hegenberger brueckner suspensive pandera fleetcenter llaw kikwit cownose hydroplaning canfranc ishige rotoscope epitomise weichai kiryienka chouraqui fradique alarie stalteri diarios cairnhill reticulocyte mettenberger vorticism sklodowska klause nevirapine unseal ruppersberger shoukri vibiana vacherie isleham mkiv keppie jbi neus bienvenidos verkhovsky sadir hunga tokage dubbin haussman zinny crociere yobs auryn deya venkateswaran neog reepalu knerr holnicote gervacio glamorize sportingbet davol catacamas marrack foiles skillings bokhary ladys singletrack eyeworks badir seldomly bunkley iuss lounger lenczowski theophanous dynastes bozidar weixler bahawal disinterestedness moshpit coccolithophores pultusk ajita scantling sibur cuvee hatam lleshi dariel ribadesella keyah konon eyestrain bosville joner sirus singkil wunderman jäätteenmäki fuest bawling papagiannis adec ayalew amfortas weichert wawer diq dorrans aguaruna amberol regretably elbel pasqualini irta dogsbody ignitions menik serigraphs aerially materie ekantipur greenstock rrd nesi dainelli chabanais monjo achraf 
katsuyoshi diversityinc unmovable shipworm gimmickry hammerklavier eriocnemis balzani balbirnie epode takuo bombie mortadella aumf ziebart bowmer obm bonduelle dockrill sharbat adisucipto chojnacki chaifetz anglezarke lancellotti amatil oldfather desso sugarfree fuwei reshapes kassidy frette markee incarcerating xinping kligler amaré bellan peices akinfeev golog lukash tressa maajid mabrey linkins characterful allanton drakenstein alcano glosters overpayments jinda escravos myricks tasnim jehane tricoloured hewell ghir globovisión moranbong stocksfield agonisingly keji pleno sorrells emnes footie uralvagonzavod pirtle shallal cirm havner plews meuser quepos peruga writeable diesmos quicke voronoff grosgrain ermua glaven plagiarists schmitter voguing mcjob donka oberdorfer anemometers nevadan bugun tammo rujano babou yanic bolinder keiley jumbles – honeychurch mutaween akhundzada brumback hofs radwell binnig piccione soient muckleshoot bransholme krushchev teichman kneeing rggi arzo shonibare iceplex scurr bougon farahnaz shumsky tiznow allthough nikam ogba slipcased rachmat afsaneh arrowwood soneji trezza severnside perseids eisenbeis unibanco ascani remaindered aspo gorongosa hehn dipropionate arabinda rizer viano ssem authoritativeness wardian delena ploeger fujimaki sappi unprofessionally brynamman audiological clodfelter centrowitz siwiec coray dharamraj landenberg mutato yulan lundkvist biser corruptible marcolino bordón tempur bruening officiants mssr caídos pondy padanian shagging keens synovium cornflour udraw bumiputras kalavati sofrito denzinger intensives bullcrap tanada jamsetji bashas vauvenargues rythmes mceveley turkmani ginsburgh makaha tsvetanov niehs emmc rvp haloid aaargh jesualdo hansens snaresbrook orihara simelane pharmakon habita digitalb hayal exadata snøhetta turino petrodollar barklage congregates bukola abron furhter aymen nonsexual marom schierholtz chrissa cranworth assou durland yout hillmen montsalvatge bailamos yusefi deleón 
montiglio llanstephan hassanali subtile zene ribfest shamelessness selawik makula jibberish vonlanthen weepies feindouno widemouth kreiger tinners dyserth rugen mojos umoh ernster mawnan raynard romelu nustar brownley lisahally rupak blehr oligos styraciflua saire abruption elberon santy huguely biddles radiodurans charke brookmyre bonsoir arnow junaidi flowerdale pingpong sveinung zillman helali seminoma refah beyrle alho yeaman easterday nettleham weberman detraining daems sarj mantoloking pollença buzzie genachowski kydland tarusa drager aharonot spensley duato carmacks pinhel nacd stephensons kobiashvili vildagliptin garaventa tokidoki piketon dueted worner crevier oshodi caporale josefov schanche cytarabine cheapened masuko galluzzi aguiluz mosese clavo chinwag saffran aygo saxl ptsa belic illertissen ncell chiney ixchel bainsford overtraining goheung astillero vcv olympism koppenberg firminy sinfin refracts skinnier trebanos bethalto dence deftness obba gressoney lyuba retroactivity straubenzee mitsunori kearley pontnewydd killens swensson gallé bhata peepli clamav shiek narasingha novitsky iaq bendixen digirolamo camogli cremers tamped marinating diagnostically lanker sadad gritton fakt geopolitik kupchak dulaim semiannually adibi hogle zuger maravi doctores mohajerani neener nunns worke clubcard pmdd slughorn perejil gabbiani diegues mosasaurus whiteleaf derivitive myria throgmorton solemnization debroy serps daks hfpa poorhouses colfe rockness selbourne burgard shadle cliffwood rosaiah minorites swayamvar abramtsevo dlj birkinshaw svara konkret untrimmed sfrs plibersek chiltington wkc cadenhead eii seaperch vassalo aloa dahlbeck holborow soufriere casarsa halda cmtc mcgrillen avants geurts santanu benison blancornelas rorer ninilchik arop mattoo penmachno francella umit teufelsberg coys redeposited hriday acampo contextualizes heima redha windoze decoupage chenghai mjo rosière cholecalciferol nbcu digon liscard samarrai pixilated cvf yandi roundball 
kosteniuk sysplex fayzabad dechant emilija ozouf saidin patrington overclaiming blissed kalonzo healthiness sefrou jerson cobit inape gozer rocheleau misfiled dembele pursglove houngan costings nevadas burhop agribusinesses pierron microfiltration sabam chillwave kalfin salkey morda disentanglement papercutz blacktips inadmissable thermoforming giros athanassios wenzhong friskies rospa reaccredited tolkan delorey chugiak finaghy papering witchhunts chocolatiers stupendously blenkiron schenkkan folz dromio acemoglu brookmans imag prostar glowacki grimsay asselborn duncansby diametre cursorily hotchpotch fgw heligan chatwood salade basili marzocchi grossers andreia jubeir grua rabaut fadiga moblin parkton morres tamms qaq maynards pompom salsas naughtie nyahururu gadiel searight roadsigns loutherbourg balby starer barkhan niney remyelination susette olubunmi acpe petrona robertsons freewheelers chellie ventham meticulousness reitzell choirgirl ederson rianna badakshan overeat stagedoor nanocrystal igl turkified kendrix tweetie extraditing lakka maladjustment eastley jusoh kharja cicchetti discounters alisia scepters wepper diarrhoeal ardiente despues disestablish importantes sots volesky perona bryntirion vinal cretins sackboy availble dayside meline balmori laetoli hfea episiotomy deboo shopworn taproots mcskimming ramola muxloe quickens broden loebsack baiocchi jicks divans biodynamics jassm alliedsignal mcevilly tableaus kwiatkowska brutha moloi chiredzi holdem ksee zazula municipals basinski piolet sakib pelser bohjalian kdi gushiken kralik scotchman qmul castellina fermina dhiyab undischarged greenlighting rosuvastatin tuono niagra kastelli salvager notamment kreeger dateless somaly aviad rusthall bramford thrillseekers gurg maquisards eastmont ethell pizzuti misidentify bennigan rosher ramonet dreijer decouples whomp amelung commanche norkus ultranationalists anadol myoblasts metzl shorto derrinstown mothballing kinsfolk leibel yoshikuni joura odah hakkarainen 
navs zankou neuromas genx randerson oafish confusedly boydston protv papadopulos mudguard trifasciatus aequorea spermatogonia blashford lysiak demaree roula unfortuantely erjavec akinde zahorski barwin klesko zilpah collyhurst viewsonic tamsen mitchie postfinance reelzchannel wartski westfeldt devic allaerts marysia mazariegos dinuba roade galperin achievments stridor safc ashkali hafan botolan desbois talam mandeans dhiren strabag sideview dragas segredo percée tossa aedpa paneuropean mmrs stiffy clacks klumps clairvoyants wallraff richetti koefoed livaneli sandfields jubelirer minou bagai erakor situtation kininmonth stoy cheesecakes barkleys glasshoughton jarius vernard zemel berewa swallowers boutte kundig krosa lenhardt lindhome jellison eldeen cimmyt bomet firmenich barzin laholm viles orientate siss ametek goshorn honister valastro israelsson englebright connal hungrier feresten visclosky ivoirian hsw killgore asika chevrontexaco tenere giacomin yavorsky isua pancuronium burano gastroenterological güines tamez clatterbridge ahmedi philliskirk creegan wowt jcn goalline porchlight schu raubenheimer slovakians maasvlakte chongo saeco largess hudepohl autoworld taynuilt dhondup rodnik premarket fraz tricolours imee shabaks periodico albyn gordis dalmat microbicide heiti akeel caldon ocado hennacy vsnl fongshan bredius backhoes aristegui plotnikoff tabárez moonshining ovp badaruddin petted dimdim educa mirvs qamdo gearless canadia exequiel tomotaka yess grayrigg wistrom golliwog fiapf buika pterodactyloid veneranda seckler hafun jorda diridon quarless zarafa lessines caundle rautaruukki playbacks tomassini knacks pootie aslr custodiet mullenweg copacetic maurstad zmuda mohns patashnik trosper cabochon tambach khiam asociado beden yuksel machat zfp youle groseclose galipeau sandflies quie couey ghobadi wftc ermera yakou tulay falsities greenberry vecchioni chomhairle pertusa schager swats pluriel pollinates sheyla cadder snitterfield daaé chive aghdam 
underinvestment mahamuni hgi legay mawlawi tadgh martinov srini sibolga hengdian grassic helson sideway kumana bouncin sidiq politecnica tripple rogi sunedison saisiyat classmen contamines cybernet longues unmusical sopho sixgill blinken orthopedist sorber pagonis leenane monaldi flareup spraining adrain reffing ecocity cameoed incisional lipsitz hbu thamesdown penina sambourne fonejacker galvao clonus fevronia zerby richebourg otterson blueness filmstrips djuric loas kernodle mycophenolate frontis optim wafting bonas equuleus yamamori merrymakers psds poilievre fraidy turnt lovick unseasonable tastiest kouadio unchallenging koita sambueza mandt lyssavirus nucular kyriacos cronauer havnt magilligan standees toshimasa urmson emett langewiesche jillani nadirah ssafa bilocation loroupe refashioning chaldon coxsackievirus pencier unaged selectees wolfinger tabulates imputes landres tequilas zorita foppe sillah nabbing sugarbeet capay raffetto pueden matmata searchability touvier falles celestis araneda buey diriamba voler electroluminescence pushpakumara mattawoman privalov calmo trivialising unhook ganey hmmwvs boada daric preemptions halwai pzev naskar lenah mastracchio anuta sittenfeld califone llangwm anec trejos binging cansfield aurukun federating oxandrolone donnarumma immodesty spindel biobased struwwelpeter pillot mabius ulay lumbago heluva instal knowl salopek roeber kootz riling féraud lubchenco borka mersham auditable bolney radcot montsouris xiaoyun porcelli gadzooks leaton slovic semley averbukh heikal joanes demetriades aerators mascota shakhriyar badbury jackboots sperms hobbiton badalucco nizer pny wheatstraw ferma doubter iglauer radich mcgillin jaragua handelsbanken bhisham childfree perran whinstone luq dietrick nonny nanon egholm lawder seargent googleplex naxalism broadneck brassieres ethologists auvray justs tamarama haikal palter halligen gawky mcaliskey flatscreen squints anovulation iapa ccsp agonised golina marasco burnier profe jizhong 
epperly nyserda ediscovery reorders melvindale keetley shoreward newcrest pagliarini waldinger chokers shrubberies tchenguiz takie pentiti brownouts uza radic wincing shayk ablate alejandrina feillu proffessional qaasim verrone wickerwork olkiluoto decebal garozzo phadnis nicta mediobanca belives matchboxes shabib bramshott spaun oceaneering rescan scheibner landow fanz illion wreg safarzadeh krest freels dentons wallaroos deschenes tuneless unterberg ircs gasland jdub nln bifocal kooner kouakou svartholm lurgi bottin powerbroker woodgrove delbanco barredo osheroff follansbee finnell pettini yfc taikang triumphalist therizinosaurs mbtu terbush crucifying pshe rawest nepenthe tifosi mortuis ieso abae toge pucher orona meggiorini zesh ciff palce mbalax hoboes misguidedly decission scavullo dubas equivocate armendariz backboards sublease lachrymose edir tise lancha conaie inktomi lecesne craghead logmein sugarless leedsichthys sidan mthembu varnishing cubin azamara itkin cogdill ytn stutchbury swarcliffe bathtime anshel bogglingly vtsiom wive graeae liley hadaway fiocco daish notario tosha sutyagin smiddy vignerons creaming gardezi ileostomy himars naughtiest falfurrias jega karpinsky gompert yueng ryabov fakri ipilimumab fleischner blagrove hemse aspc spindled amerada barabara bednarz faiella jagodzinski optix wateringbury mcgegan hully katiyar gendarmenmarkt shrivelled islams megamouth shurat lacen guruge tajan hende yaverland catani shiran chevreux hillson ransone silcox matzen shoveller sebastiao mansbach alila racier alizai diagnostician ifil vavra nevski benedictis urumaya peles coplay ryke bfgoodrich pasternack mcdonaldland dref clia gyrls illovo equestrienne basest rasti dyskeratosis jarad orgun haidee detectorists stassi diepkloof molests koecher voke crowl lavishes jnl penpal purling boeckner jent subhani mugnano fahima gisevius konstmuseum parlier godshill knux creedal elvino prestigous conservations dorléac avgn douris snaked foglesong jezierski glanbia 
kalvin lawnside nemet egads hodak moonwatch pommels metronomes barkman duvel effusively minicar buddenbrock faryal rollerblades droney blackhurst phit dicenzo ringgenberg divorcement osmena gelmini milans isomura yiwen fatoumata attahiru kopper cofa creigiau washery aronia bolkonsky chrys yanna makgoba bartenieff funn shanken yongala kvoa hyaenas numberplate rdw klimas andee burghart kenansville otai murashige lodore shoigu donoughue lojack ysgolion boulé mediaweek poultices bidets iyogi lbb roamers storagetek cellnet mabyn yuanqing alaka kranepool impoverishing canonico cetuximab pssst zajec folsey aasm mitie khambatta couder bidco bahk voriconazole bearse reinvigoration pibulsonggram mpower smallcap bortezomib djebar idfc aido bromelain hassey salterns gentz thermogenic weisburd dydd terrine ollin etuc pedobear annita myrlie bhoy strøget easels smokefree reaming pavelski wato persecutes eryn taskings rionda bundanoon dhere casualities beckworth thakali myeloperoxidase festively jablonka hassenfeld lissoni pollarded guirgis hazelbaker mcmakin baidya rhoderick jonel horwill quisque gestes ssac museumsquartier reservas blagoy banharn dolder filmaker harvestable sandbagged fouchet ghostland pizano maver duncton borislow stinginess casentino wakley priora loates infoway harmine haraszti assocation kippel nelken chinchorro aecc dyment ballards laudehr früh rachell attendence noury cognisance scaramella palmachim pozniak eisold traveline alamar malzieu fictionalizes manit sobh halters bacopa kasdorf akot roomettes bioethicists dandyism aafa ponnusamy ignorants tamrat tastee wilnis sotogrande iccrom incredulously cavero timewarp siden yardie twohy yquem archs rockafeller muñeco retrogressive plinking lobotomies wildmon adlabs schrab gutkind carf shaposhnikova loffreda indiscrete glanusk crescenzio baleshwar bombora overfeeding casy anglicise sravanthi ipswitch alwoodley appro truste bastienne nurgaliev duble nonflammable adeniran kinglist shouyang nbcolympics setterfield 
mugdock sayasone kurosu poventud monchegorsk robonaut paramours ielemia biscardi seppänen lewan serait gaffs nafti emplace gogel stratégique comida moodys loutish varshney avy unerringly prefering onie schacher himmelman quater shadd spyrou prah wigdor tskitishvili sharable clcs kbh hiwa aleady hrdlicka nkole motyl ucpd croaks gazela noddle soysal cyberarts moshinsky corporatization premarin milgate holdco dnestr hurel diat gardened jetton soref fangzhou maguari muno inforce yeaton stant gesser lenardo pigbag bikey symbolisms hashizume superwasp dampener megatrends kapllani nitazoxanide matsson looky meeuws poseurs prendiville incriminates bidlack layby spady muscio highborn tongaat gibside alhat mulyani roughage raffa talkradio domizzi cpic pnrs eringer maslanka serviette timoner moseneke funmi carrels aadb dierking tein terracycle kardan paffett cautery vaccinology gubb yesayan guarenas crathorne abetment gellatly kiros dagga shearn droogs hooi rooper diagouraga dälek jalani walchensee monita warith wigfield wice kikka abera yonosuke roese choummaly tafolla qgp lautier ringelblum novakovich kamalesh perriand simpletons skimboarding pujehun egot wheedle warshawski evanshen abdelnour haiman rscm cichon rodenticides mattering volvic linebacking xianwen padano chikyu deaden milic siems yellville brinn stockist mckinna pdca granelli imipenem rahme britcar sakashita wium ramelow massam ffion pbxs greber widger shoreland biogenetic wytch calio tolzien yemm cusumano atliens mccague forgy chertkov artesunate siaosi strathtay dills precuneus valenzano southcom perahera sandfish causas bourriaud burtonsville bannocks stoppable gudjonsson saxilby ataqatigiit sommariva commutations pöttering anestis isted mugica hru prejudged boti lecroy whitter bonine gorging rexer pinchers kebo tradeston somby wailed presbyteral hekmati wiebke hardan fkl penshaw nkala gomersall marcolini hinkins cfk hibbins multifaith rieckhoff khanfar delwyn chickenhawk hylda kreiz kibuye chiclet alwatan 
chegg tirrenia khashuri geniez lindoro sentimentalist haseman wilderspin mckowen feerick reforest dropzone vanger shillinglaw satc alguacil huntercombe gilovich gbg basks jaakkola lorine mcglew grilo revanchism tral bessi pacaya aptx louann johnna hadjar dashiki magisters goldi burnat etemadi baric gunnarson nonis stepparent lentin mccluer jaguaribe tumbo gerris demol bonnen stockstill worcs hoggett gilpatric prym mancham smallbrook associados olasky hetzler manfully macas hirschbeck dubaku flunking cambuskenneth calcifying vinas staniszewski discussant barnetta eichman lousma boreyko mazzi feir hamouda parkar feddans peregrinations banchan gephart njord hanagan sugiuchi misplayed kelang lamichael chaddock sauerbraten bonior kashia sutin worchester bochenski bilberries karlstrom boukhari hochstadt sekulow carfin tienda metacom midson aloneftis beechen balnagown naringenin leamer eaddy keyline ockrent yanbin marsay schiel airforces shiff kaltag girlfight redrock harimaya slingo gonorrhoea slayden amsterdammer lechon holmlund graysmith kondratyuk dienstbier amrabat dunand agwu paternostro keuchel oshun boxscores daingean knighthawk perre prozorov gosia adoc chele coronella dykeman kitaen gazzaev mcnown songer mincu riseholme hippocampi idit mawei inon personalizes ständchen tobgay iliotibial gonski federov piia cardinham girtin jayashri mcleese kvue bimbos roston ponche ronel clavulanic cyclamate fordell scrutineer bonfante conchiglia decanters gyang commanderie marambio novruzov accessorized elemento prew latroy courchesne soret petrological cherington ciancimino wassim zifu hanie drys timberwork essentialy portu pousadas backache sedlar rippe nonspecialist wickstrom ethylamine yohimbine brv norfolkline paeans wasko asiedu itron ladu maravilha infelicitous iets yanwei shenkar gouldian navorro fayaz lmk knk mosis unsighted krukowski wyc zoubi duchesneau klebb nabiyev rainsborough badelj demerge jongkhar alvington delana velits qaidam grunebaum selbie superjam peplow 
sikelel maala shmuley sonification casm maasdam chidlow hunmanby rumaila gew annamarie ikemoto gummies ukba woldenberg askrigg kamut yehoram metion gesellen romaní flatback fluticasone affability tollington cnam dipasquale spatulas ankan karimloo houts harger zhuangzhuang battan ulyana unpublishable demostrate rlr chateauneuf chadi wakerley artistiques wsyx bodysurfing kamewa autoexposure hvd enosburg excitingly elten fritzie uists salha hessa amref gyrator techblog nordegg heuliez zenna bozar greensward risques morva riggio unbalances coplon mepham warnapura muter hymans parshin mindreading benkenstein tunng bullseyes washbasin derating kapinos recapitulating purkey morace janjalani czisny disick henllan feichtinger libano mflops kendle slotkin kokal motorcraft hankyoreh micombero mikardo hypnotise cohrs sayako punctually faceoffs usian mukhin wadler circumscribes renay tiley euskera dnata obtuseness loncraine buddie bullshitting mazzello pattersons wolyniec niddk odel auki uthr decertified krejza gubernator pflüger shenhar margallo weidenbach korniyenko roundedness hersee fango rspo stellwagen chande gittes rovell enthuse sanel banalities rotton geissinger vanderlyn cyberlink apostolis induration kindertotenlieder egalité puda ignatia krass mauves worming abysses bilqis walderslade howieson expansiveness badstuber dettmar dzama wifey lundon decubitus mellowing textualism bahla chiusano getachew plinko clicky kudisch moskvy ambassadress auten riversides kavinsky seasonable chengdong tafer roelants samh suplicy slub extrajudicially dumlao evilly sonnett pudwill posca nadella postpile celestini attore signon oravec demob dystocia gansey ppmv dargent paananen terrasses octone kosolapov mgarr williard dickau sauser southon niquet cheslyn nancie touchingly waddock mcteigue unobvious kuah masseurs msba ords langney eddleston waldhaus ashill comision haberkorn equivocating dabhol dowlen trysting lagerbäck earby malave barzman nursling woodhey grabovski aiono flacks mydd 
ijtema loueke razek junkets amidi chrebet maghery bereit yokels boatbuilders raese sfca darine steinbrueck pomes gorbanevskaya jeck sobti turque unremembered gloop effron kindergarden ceyhun dyett skirvin boeri sibneft yaqoub blendon kritzer reyat inquisitiveness valmir snickering ellenson yudi wendla premalignant vallehermoso arner decimalization suuri makie buerkle decoster hurtled dealtry reddest xochitl mhlongo delanie palsson pelu hoedspruit humam karponosov brinke bicycled kennie spiering killay takac bbcso lentivirus finishings pouter metalware overdid chloromethane uahc zimerman yinan overdeveloped unauthenticated skeoch auchtermuchty sifts aghia sanlitun ocoa discused scupper danet unburdened schwedler circumferences carolynne anouska pipedream efavirenz macroeconomy abib peaker stanczak mitoxantrone grantmakers misbranded plink innoventions ibms benington pocari rishab unconcealed staelens hetar evgen mandrills zuno microdeletion jirka torchon havea allesley unusal hardebeck pead simplement teetered anonymising voudouris daerden baldivieso porrello kuzin minea marketization challanged reichling mercereau covino tonry blute overemphasizing beever superhighways ifixit huestis ikramullah bakradze moussavi nikolskaya strakhov velonews axcess phibsborough paradisus nimeiri vashistha hyperbilirubinemia vrv daara rahnasto abacavir sulphite tranquilized reseeding respirable wrighton fenzl amonte cordery rreed sculli canoed hussen quess etem teruhiko bendus experince hegemons schnauzers nassiri jurrjens papé bagworth cuckolds deisler niederhauser fuquan brissenden hoho broadhaven makley sibthorpe thumpers speraw chibber dzau galimov deleveraging rmw mudang nonscientific exculpate jussy aikau perelli globis nienstedt irredentists brul mcconnellsburg aascu hencke paltalk qianmen kidzone arborvitae mgg jakubowicz loxford silton mcsharry masika niedecker zucchetti fimmel kolton kontakte gundling valaya dobles citro cantaloupes llanrumney fiap stebonheath olkaria 
itches lawhorn flatlander ruari yousry koskie enjo nsai zylberberg tião hotpants detainer panettone marou wjec arrindell lkp madacy scanzano studen cataclysms sharlto stwc yambo lurd conason baliani letton anderman amper fruitmarket securitisation phenoms armoy euobserver overuses lohnes tainio turse ouf cardiotoxicity ominami tichina podrazik phn bosu funaro schuerholz macaroon hilferty schoolrooms perttu softail embajada akehurst nimura verdell abourezk iwg aiona unipart maggies iken aecs crisafulli savara wezel kauder midgett buttolph markree nkhotakota leats laydown harkless tesuque mphasis proably achuar crosiers pregunta plishka lineham talkshows phytotherapy schlage artprice brandenberger brassaï eviscerate profet omes ensberg garreth goair intraparietal impregilo cesg bisutti grimsbury consumptions whetter jeanmaire larian cambi lsds glaspie radermacher carulli ambx goil harmoni carndonagh iolas coloradans heisel murofushi nikesh youqing lamorisse ragout ischigualasto zutty pennywort churl tvardovsky unprofessionalism mapletoft gianpiero kimia cleanings likeability kilu rorion bellbrook metelli castletownbere barcellos quadriplegics madugalle dalham sembcorp fiascos eket cosmosphere craftsperson caseworkers baseness entasis famiglietti beniwal manches uncrowded calis goutham liquefying brocq lamonts edwinstowe lamplough beston furtiva bleicher clunker ameet tanney bunnag kolodziej kondi puttock wiehle suketu schweers teneues lnu takahira disempowerment cby garad minyard markopolos midcap nerlinger anshutz autopolis travaglio puréed exl rezaul junquera ishola thermostable sudarsono sxi vaginismus jovani qik disingenuousness despoiling grayden politicise illegalities reconfigurations carabajal romanchuk puteoli whistlin muriatic scalers stroppy shoudn tuija campain mirena debeljak laulala saphire pepín koroi castoreum birecik cutman faleomavaega papilledema lannie chizhov hinny brefi rabbe brunansky bajas kaleida yuschenko murin knuckleballer uncommanded 
tetchy pastorally lfw keetch keteyian swearwords dayrit dreesen limpieza lisanti kremers adrenalize ibg carius cisr responsibilty cetirizine suspenstories panduro tharthar nazare muray barbarito wenr dreisam qiming panitch intubated storycorps kalbfleisch tabbing wowwee cutlip selston kantonalbank spol hirudo unmarketable hasley gallinae ashkin kanamaru lovullo dimitrijevic coppelius retransmitting acnielsen extroverts yough kenon sohaib fuson meskill goltv trautwig sedgeford mcj refuelings jaks xex banegas grauniad tianya schwartzwalder grooverider marmer timeshares imation hapka psyd laurila muncher estin steamroll ccba lorenzino sepideh decoursey padoa coned dapp newbigging tribbiani dehumanising sententious pogrebin ingenieria kerne greenroom scanavino malkhaz rosete anthills amdur leidseplein sheerwater pointman lovelies streat nussey pasteles iber mahlasela schnapp segalen farveez chudinov stepashin blaspheming uproariously climacteric umcp franich eighe witta sè fearns ghulja bheja airspaces adolfi shankle pilfer shanmugaratnam denesh nobunari santero capodichino mofatteh probenecid honeymoons entender snowboardcross goudstikker sharar yasuhide poundbury lisiecki forseen kunkle cackles sharik cropwell juntunen lueck ganti mutterings semington friargate gamze crymych ireson spectroradiometer buzkashi busyness bioplastic flightplan tenaha kely moaned idriz mitcheldean ferozeshah epoetin rokas akora keltic stewarded innervisions eleri queenwood willmer dibromide bréguet pickelhaube solaiman portanova casley janita reincarnating steppling bresso transferor llandough woerth pkf escwa batenburg watmore drissa mayeda baab xao papantoniou ringler gelernter cervone playdom mashiah puttick lobão asperity mrbm ukeles continuosly chewin solidarnosc collegues mallmann kohlsaat redcurrant immunodeficiencies dymally multiport isaach manacled inculcates constitue boggled wite terrytown macki renney mohon nopd vavrinec drillings aftereffect masino shortsightedness zhambyl 
dindane gfe wayburn gwot acclimatised rarified shimoji stalemates virts wena pilocarpine keyspan homebrewers atilano keiths maggart mastopexy orane turcan rhisiart heatons taqa bobbies defoliate shibam cdfa bankwest ncbs kindliness ashlie dromaeosaur wranglings hjalmarsson dharug boyson quintupled adjudicatory warrawee prefilled openair candlenut masqueraders ultrastar almac koromo sopris robak hispasat livecycle pheonix lièvremont marlis trela ostrosky pflimlin xiaochuan elgoibar calypsos marcó andale wretchedly dunera formans albio rinaudo sirigu reiners overhaulin caravello movieguide briese poidevin suryadi grewcock amiina cathi moiz taubenberger bossio obraniak priviliges upholsterers permanant davoudi brokencyde mouses mythologically ignatios vardenafil ratlines holtkamp azran recirculate benussi simonsbath sibutramine abarbanel tabulators takeup disrupter bobber warwicks ladykiller thupten serac visored fresnos videira charland kowald dhalia knowledgeably damarcus sniffy appoggiatura herrion unprogrammed radiocommunications rockview retrievals seabeds paramilitarism redhills pevear aphasic hauliers clawdy spirt parnitha eyraud freinds cosgriff cyra ryes panetti anyday phw inroad chelsom campanis lifecasting ruhle snuffleupagus adlan mordden sasamoto delagrave dauphinois completists farai szanto keogan piga glared goyescas ngon torahs nelio hendrikse digicam baburen srbijagas vanak ramsland stanningley austine belsay micronuclei vespignani chukhrai hellers calipatria asherson gotbaum shangdong oshio obernai mérieux ujc eede firming indycars culverwell birindelli goodey djou norbreck razorbills mccartneys konrath petroski billes aronica taubin timegate serratos feiz encryptions esterline labey corestates embrey chantay midamerican siedlecki elyashiv afros zaxby ghassemi hexed karoui masae neidpath napped morgantina cliquish boystown cowhands utopic rosenbauer ghezzal diterlizzi pawing urbanists mcenaney shawon reassigns amphenol ponticum lambrate langhorn oanh 
bisby lindheim chakdara stefaan froxfield abati koner labus frankenburg dihydrochloride orzel mccarthyite tjeerd chomps investigacion sifuentes trenor joycean smilers frax aerolineas assadi corleto zaslaw mansally geovanny govanhill hooiveld ifaw agyekum wote clamoured counterintuitively acidifying sudin nerka slatin milmo rossif shamberg skloot filderman wenhao recategorisation frech hospitalist salko yaohan uncomprehending wirawan mottisfont strikas meschery woodlanders shian ollivant klowns wingen calved bours rasmusen wenjin omnitrax helotes europass matchpoint marziale weizhou gtech sudipta bockarie mynarski beheer lubricates bitterling thisis toothill dohrmann tianwen pikus follini sweetbriar weinger xinhe sahebganj graser lerach suffragio spinnakers shirked anvik neria halangahu balasingham wassan pengrowth busload frequenter eckes macrumors turanga médicos raemon punchdrunk tuca yealmpton undy birdsell bradway kerckhoff prisk kimitsu felicis inactions mihos ucan mariages lekota runscorer mukham keresley molsky wanlockhead unreasoning inhumanly otterloo jamies ulliott surhoff cityside chutki chukwudi toosey viss himma differents northlink vinoy cuni snaky tkc nolly taliya metioned buder ekos deviousness overblowing mordente nouble gazzard lachemann tigua nsubuga okitsu quickplay bakas findability abdulali burpo kowtowing vidro chambery giorgadze reinman ndia colstrip discomfited ledgerwood dagomys hattfjelldal colruyt jammie silkmoth bonza reattaching kibria peyronie mahsud datteln semliki jiaxin ingos helius cogwheels khizanishvili yefimova remm sentimentales lihir heartsong hitchcockian interrail awni jabarin criddle berwin rhinorrhea investable nurlan leithauser hoorah naburn lengthly mizque tottel vouchsafed ciriello howver curis toribiong faiq kingson isaza mudhar earlies cohee rombauer felisha diversidad pazcoguin yeend chateauroux zaton abubaker miking procrastinator abdulsalami cordaro cliffy terrano beefeaters retrials harleigh bluelight meddles 
rockmond ignoramuses stahler vaporous monye nissinen kathrada borrett ruhul kuperberg biglari sarty oueddei catenaccio pambula saltimbanco cncs succintly strollo eulogize littlerock somersaulted petruzzelli dimmest freestylegames haspiel wamo scte sajani centereach epcam brisingr nanoprobes swetland orthodontia underexposure hcca devient arkivmusic turnipseed tamaddon catchup silkscreened lohmeyer noncitizens vidra publicmind calpine emigh esenin redcastle parasitical cansei ragu castmembers bureaucratization varady cosmically raffinose harvell yoxall shatterproof vorel allografts denshaw machesney papaverine liebestod potapenko delancy dubina austrade unprepossessing rigters tafur ghelderode duelfer rendcomb competion overexcited titfield jiving hustles claysburg kessie faultlines bogale lanthimos sudamerica runnicles wildebeests reposado butternuts serba jacumba compartmentalize iols appiano beiteddine naxals linnemann cossins scholastically cahalan westons wambui scheu nidec casabella girdhari sakra creativeness dilks speckman galy everychild badsworth mimika ahwazi keswani oxyacetylene askeri arendse mcelveen micronized porcupinefish overburdening magnanimously ellagitannins datastore woodhorn hairdryer beersheva piq rapine gleamed rigshospitalet jenners sidereus boritt monoglot ornithorhynchus lawrences ormand nonrandom dscc illest vpe oxtoby plascencia banyon nayon carryduff ufg aligners aafp nonsensically wragby bamir bico mankowski holgorsen informaton hne gasparino cilfynydd pbe torgerson winchburgh orya jajuan loredo canevari uthayan biorefinery mirebalais bogata adss dogfighter pancytopenia improvident conservativehome carlotto dioner astrobiologist handorf riposo cleadon leyda moquin sickler roozbeh teevee oneiric feoktistov jasmeet busybodies photis unceremonious repetiteur datel biguine giacconi minneriya multiprocessors mwyn gateley shortchanged graul aggrandising carolini misguide strickson swagg gilliatt mauga kurten raeside perseveration lavori 
leandre cloakrooms excipient kytx eniola buic masferrer pessina sbobet ninny matsunaka aidi discounter jepkosgei wingett koong triodos panayotis lavy miggs bataar wbng zinna placentas melekeok traceur xrays urokinase yekhanurov minooka gastel arisan mohen dederer claybourne jazan mccarl sekera rhigos proprietorships iott gaiser función muson driesen nutsy pellentesque verbalized operacion seeiso unheroic finberg xiaomei germanischer ziti nechung sarft rera warhola meshaal frykman abakaliki liveness avana satbir luetkemeyer binghampton sexualisation dreamchild zwerling whnt dalmally arashiro coaltown tenía boushey gylfason fincas geiberger tohmatsu sharrer kazyna bolasie oildale enlivens wicketkeeping superstrong aselsan countrysides elastics kojic mycoides plagiarise affiliative karyl jinghui bioengineer gogar tenconi micachu oex taean gunville rejean paltiel cipp cdms kirgiz huifang shuld somiedo offerton ballarin freidman palpitation grouches vant ragdale sortation wystan ruhama baizley kaboodle missiroli afpa vennard dwts babajan evon sizzlin guanylyl unhide tubbataha triphosphates paperwhite verveer baiza krivine massel polzin harpending irreverently mmtc pasteboard bryanne blanquette ramadhani arbatov firmest movado votevets sjo grazalema stert dishonoring morays amortizing clintonians chunked fmh nykredit schermer chaloupka sporza beitia eickhoff counterproliferation norcliffe longcross diltiazem boatyards brandings barlinnie treaded orum kohnstamm ringwraiths clipless cragside epri sinatras yazbeck nilofar cajori landu béhar strikemaster shefki riccia deehan unvalidated tulear eremurus chafetz osteogenic plyometrics barnas giacinta rubbished dpmo fereday earswick firebrands wpec tual gailes budged knettishall unnameable honorata istabraq hadopi kremerata hermantown hollender waitstaff udeur radinsky taibi whipworm wijetunga burster hunchun maybee gymru wunderkammer tressell ohly fluorescents antifascists budka javnosti transafrica raskob hsy hirak sauver 
gangitano tafuna telmarines gaboon hibler perseid castlecary ensnaring montse ontake brookstein garua seders scotched dionicio basavanagudi kloppers rathnew esac dougald littleover gadzhi exaudi merchandises roid lumpenproletariat couzinet swarn bobos herze behrs gajendran kompa damnably wiling sros stepdad yeoville nerida untermann queenswood ghionea volatilities queensryche flvs slingbox abrosimova bedrich billown plently shitrit dunkeswell pursel kahla turves gunnera krumping bozic gillaspie shiyu queller fairall bodek strongwoman whataburger salivating ingrao molted serat geter rungius bartu fopp hadler blackduck benoa vph pianosa sikua radamisto cwalina clearways familymart goverdhan recommission knaap lattitude sealions aswin emani pesar glennis seepages gullette kesling gonia ayto behenna spab lakshmipathy pablum goodmayes bahaman murren frogging magaldi karolyn hareb crinolines creuzot tachtsidis unpatentable cissbury aquarama gulps ormsbee taboga sorlie liljefors osmel microtech urney urpo viglink worra sonequa nehgs clarabell reschke semshov tushnet cofie stambler sayeh gatr vidia morrazo talty collingtree sewel parkesburg hiway microloans dorsomedial vacationland cicle flambé barle nyers tononi rucksacks kurlander becames enunciating hifu vibia missle jihua fedorchuk bokros matsuya irinotecan audiovisuals bjcc vecchiarelli rehberger demartino mobileye mascia mcgarity rjb nimrods swindlehurst kalogridis ethnocide sancocho merevale inbreds baharna igbts holtorf luczo thermophiles eulas estan fragging shebeen arii foodism acip haematoma makdisi rocknrolla haffenden efstratios bugbears rueff grazyna moignan santarem tinku gardone bbbb nstar undesirability fouch slaveowner koonin zhenzhen waylay veillette gimje merrilees zut kujtim zondi edgers cosmogirl prady gronlund dieticians mazowe taggle sawy marsalforn vrf arrogated garven hourlong rungis haemostasis soliz hkex margaretville kinlaw gajjar unlf registrable blankfein harkema irfb bakersville wrvs limbers 
lke condron sibrel pokphand footrests midnapur chimneypieces whittles nnrtis wytham seferihisar plastilina sawad inotes gorgoni ensnares zehir nobukazu attender nuseibeh tylorstown heldenleben caergwrle eliades tarlo nhek friendfinder delrio atriums msft hyperprolactinemia gjm jarel defeater evangelised melva bouchardon colubris tuson kennedale qanuni chamula appology wienand uncrossed nedved unmistakeable patzcuaro calverts zhijian lizars hecatomb capillas recoinage ruyton genecards snuffbox minuti sanclemente filmstar lawgivers pqs nadelman complutensian holmbury egizio cinephiles bloxom hoani comitted microscopii gunalan gosa bandslam bradbeer firey lisanne omundson galadari resizable obrigado ceratotherium pohjonen herskowitz ramezani disadvantaging feebleness riskiness imparato analeigh canonicorum hmie elswit beeen hardenhuish granito kandler altor cindrich aaiun cfif industriously osbaston arifi retreads trumka sharpsteen gawne ruders hurtwood jilbab fredricksburg thiaw uchikawa baozi tarlow karran calbert mitac pomezia puhua henrichsen naparstek borovoy persuing ifrah astrum volny conceptualising muslera hindalco colegate hidemasa nyamweya clayesmore tailbacks isbe wainthropp cacciari doorne oxiana yasujiro xeroxed detc overspent teradyne bachtiar zaleplon gipton merita kitesurfers shirasawa boilly negress szot airtanker ankama lignocellulosic thouless acai kosmala zaabi morchard fairhall shtetls dicynodon bredenkamp gormless highdown moncler postdated marquita herrada entenmann sharmas emori edyth tounge westernizing breakdancers tashard colico louisette shakh roudebush tabin xeriscaping posterous unhidden valy mccluster koppell mindo allegretti dullard haith nooruddin torchia arpan vaste qss smailholm carvill meenan tanko gopalapuram albanel tudjman eesa knr darijo outmaneuvering alviano lekima trancas waddams bhagirath rafaello lendrum crocketts calstrs tirelli shaps schifcofske porgie cmdb clappison tryna tunnicliff estampas magrathea buncefield jash 
twizy greasby airconditioned fesco orgill aprn goolsbee stashing wyatts interboro hould hoffbauer mariem barugh yuanchao davidtz sywell stratify rudisill hirola morstan strew qayum manolev peralada barmak hemline schiemer swimbladder seatpost smalleye maheen karhan shteyngart gno trakas luxx winching hachiko bosshard ignarro fricking mourilyan gonpo fastidiously rosenlund srijan tresidder beerhouse dabke hourglasses gouffran tigerdirect casolaro hadeed circhetta stellen donadio euroclear lichaj spinosi hevs refus millstein zanele rivara mcnabs kalmadi gradwell bezant morrisson kuijpers upshall warrell gurnemanz tarabay mucks indianness esho sileo yasunobu dismukes derogate hubail ffii chelated bvu eisenbud wric macwhirter tomoji informatively kaestner choppiness fairstein adulterant atlantia teall cabinetmaking chillen eastpointe homesites sercan ruttmann musharaf paasilinna schliersee tenderer alexandar usml jawdat drolma parikka swingate rockette jeryl sweetpea uplinked belchers runje joannette caru usfk amyris habituate bullers delsea exult stokey massiveness nelthorpe bladel hovenkamp aveni vraiment waagner intrade indahouse chichijima bofi geare mchunu peredelkino tehmina hesford pcj intoxicate popularist balsley vcjd andrina asterisked hcw xto montrichard naukluft greenedge osteopenia aarif pigovian clavero rauti igas tarpey millepora pratz kaufer seacom bozzetto bartter tariku ismayilova semmel reanne giolito rhf rakestraw deaker oologah eeek bandipora gefitinib ccamlr clybourn modularis mcgimpsey churchs branagan nijholt pough alexandrino flh antek slotbacks minium muser thabane kilday meskimen efua gordhan trepashkin candiotti inxile bothma qaly windemere yagur mayerson ellerker saqer osteoporotic rolley neowin goler konger sawka kondrat pogorelov schmemann goodhead whitner offiong arachchige fardon siggins humungous bruschetta amoc cerge yachimovich techiman darty wadworth folse gaviotas chadburn rtrs barreiras bacrot pcca wendlingen rugh heared truswell 
rosero kafeel hooding nold markina adme uofl midnights airtrack skyscape brookshaw onida greeklish blmc chugs misskelley willford mukaber tidbinbilla pclinuxos ninefold telecomm marginalising gerstenfeld bluecoats cabibbo ibizan misogynists rahter shikotan lentiviral nantclwyd eitingon decrepitude bedclothes trimm luzerner benninger niman mezin metzgar nansi burrower lavaughn jiko kmex kudelka kalsu deemphasize freeloading aacap homering mmtv writerly tevi waterbug trapiche guedj pavía slithers snobbishness munnelly quartiles bisou mccasland jdu segares spoonfed brassed blackfly liquefies vertica loevinger sukhanov mishkenot langtoft adegoke skoch hidcote reflagging hummin fregate mahamoud minyanville censo shuckers blankenhorn grich wxtv lunder cursos bassinet kaban bestway akamaru netbank foege gîte kazmierczak loduca bvh putaruru afrah brandstrup polymyositis narayama harendra pitas trewhella pefc saveourseas rostraver macklowe entin matachewan zoromski buccellati kupol boenisch wallonian limond estro iland newcomerstown angeleno stevas nevenka kresa chiatura weve footsie sorab itfc behrends olsberg milovanovic sovereigntists pomander fibrinolytic luva murvin malaki yanacocha akarit wagonload pdcs prostor carotenes dimasi gustibus gean ampelmännchen tresh premis wibsey jeanny nonagenarian cutaia hilbre tywardreath veira tacheles collegedale saadé wolo godforsaken pietschmann epigenomics batool trypanosome tchibo niea superstrings cieca bubblers colectomy ravetch realworld kazel tollner perdre lyo tortellini arbia grinned wiskott winget tishrin feuillère ballasalla hornbacher veiel toughbook scrivner mbare chantrell hennis banket dhalgren holum malerba opentv mehlville ynyshir imedia suneet kenken enkel skycam vernita ingelow meeteetse awali frigidity gnaws woolacombe blankley trowse zaozhuang lafeber thoughtlessness hexamine combusts apparatchiks gschwendtner odometers moaz bompastor nonhumans almos gemstar nocturnally thursley reclosing sublethal swaddled 
kilsby guedioura lcy nande breezer lsst agoa asshat cardelli vulgarly chanakyapuri kros swoons tocks priyamvada homeboyz kidulthood mashaba hospes flowerbed sany pluscarden jhan kayle mounter caramoan palaniappan individu golcar cfoa crrc habas janga trathen netsch chavagnes krief mobipocket fagbenle baida animality doomy bioproducts kaaren tongogara hotez mckinleyville photocells trone falkowski staib shirreffs phocomelia berghain sumber khd alio mansfields manhunts fiscales wijnants aeri axions vassilieva bonasera iort grillwork porteños morely lamalfa papplewick mangahas scire disputatious kgun merley guthman assiduity advertize coursey oosterbroek pinchos rasikh clarifiers vouvray ionides grigorov howkins maddi exceedance loll bordelaise swiney akinbiyi israilov wengert rocketeers cataloger qualifed paunch trovador hyperpolarized shuddered carucci fisons buthe trita lonewolf isenheim nicetown chudley coarsening scurried evocatively vignali qmi dassanayake kivel bassaleg lodro bottini scoglio blurting kleon agema matv pirog gravitron hadramawt kvamme millian marline foisting cudd kuras radheshyam uos galani fluoropolymers bleakly emmaline pennywell bluffdale adhikary anglicize hwo konary duinen apoligize conahan kundun séguéla wintz snuggles peuvent dahua monopolising bowcott siemering darier ndadaye tychsen nmai tgh ravoux tayto maarouf meji frango radicalizing geechee roughan baglio bothel cherkasova broadwick linguini buluan neshin ilim kaii budaj lvh righto elstead reunifying battipaglia piang bachelorhood tcho americares dobriansky furoate tewson rocinha zolotaryov kreon tokmak chiotis jacarandas capitalia funnelling contorting apor tiplady dejar elean pontyberem zulqarnain unwaveringly mogale priebe yuanjiang leoz epistemologist birnbach barrowford nonreactive godding bbu kisel qibao grittiness thernstrom glooms sorimachi overthrust natixis cennen squishing waesche ztl soku opos zolli khagrachari taybeh thiazides asayish karole yudina mumok maculinea wissa 
teabagging rambos hilter granov dragun azzolini caddington actully bugiri finbow bechir brentsville weaverham bahais ccsvi allans craighall koussevitsky owenton readmissions vidaurre pancetta stantonbury forner shimshal tuyet fireeye lautenschlager ribero blewbury gache malinauskas sealyham elat boyen dragger hovercrafts dziuba recordholder senff defroster rensing rehires scotlands lahej jcdecaux demario totipotent regularisation zineb rachunek yonath kanakaredes hysterectomies monoprints burgermeister karanganyar gatski incidentals westy somatotropin corsetry maceoin intravitreal hultberg hastiness gergel avik zarei allí caskie brimful adva verdian lokendra lawfirm charreada fasken greuel queux paccione fukasawa hectors preventers xconomy norsworthy cliniques wachtler saiten undulates regaling reemployment eastmoor reacquiring calgon yra beneteau vekselberg vinification wulan earful taci berezutski talkington boao garzone musbury iisd fakhreddine hadzic faithlessness cubley flightaware okd gubar ritchson veneziani echard midtable figglehorn brout wireman fictitiously hellerstein untuned isono ribadeo shahrani zeig deshazer haxhiu alette larrison detoxified woodsia lôn cuchulain issimo sual pagis supercop lics ewaso rmbs stada namaskar nanomaterial crowmarsh efv muath aeolia vilcek ropley loizou kupferman canutillo espin raggatt thater maydan contas csbs deeyah drls doua hince britisher postherpetic wjrt astatke billis bonghwa cebula benskin reneé elvidge harleysville inverdale rythm naohito ewerthon flyboy levington dumble liapis strausbaugh mierda snowplough somera autobiographically regrows ihome disfigures fruitcakes eglish rhain potentiating anodised paleness mabbutt nickelodeons classier ferrán diuca morlet wdbo igdir lipomas sadbhavana schelte berdichevsky aarya schlitterbahn kyre adada scimeca middleby makkum leetspeak legislates elkus tromso craigton listenings azaan champfleury carcillo parbandhak goonetilleke jpj menelaos hoteles orfalea backoffice 
dissuasion kouao brunker antiquarium soskice bryner muscleman lacework gafcon unviewable amarone terc bechtle stanbery greenspaces dibutyl cubbies kretzmann wutaishan panchagarh kirkstead telramund criminelle kolonel kingshighway lazareff taiana tranquilizing rianne bleakest queenly citrinin viton leeanne teaff kastellorizo jamy hampl whyman stema genro samaira fillans discotheques illarramendi nonfat taillevent keiffer groins firecat larocco beattock dannielynn loibl decoux pramac nahmad ballman fardell naias roofe gravey supercharges microglobulin pundt ffolkes hecks paralyzer mysto zizou tiering xindu prusiner felman euthanizing elrose wallgren sanitaria salsify yela mcanespie babolat hydroxamic shanly yadel facp hamaoka bytyqi citril ennoble macniven jingdong portglenone hijli pemon exminster heideman bandeau touati bedevil hervieu formigoni holahan keola munstead larrinaga deports dimmable prei enikő shiralee degang crozon lautenbach lazor toberman wormed altmire mcguff petascale midwesterners hlw fabasoft lacinia klindt vogtle rht swiderski urwah damonte silverwork melée croo circumambulate cpos trounson douching dettwiler lesk llanddona garrets semlin ueland badry malaprop metronomic abour keilberth vempati megatrend buildwas lushness odean pendergrast kilroe ceramicists sevcik bozos telscombe ladybank socialtext forwent blasberg saifur maestas ramcharan tenter cgap zysman dolphinariums aécio wpht zayid wld gaymer dende mythologizing calvia beachcroft galerias stubborness unflavored occasionaly coixet erber lindegaard monex brangäne rogow hvga defragmenting hawara imprudently dickering correspondant iguarán sifnos jafr mccrorie gakkel reelz muqam lugoff magoula ranina hbe bollock willhelm yasunaga roorda libermann spigel topolsky zizka gyllenhammar gushy pinstriping perou syler biondini tricastin venerables behnisch itchington brindis bewilder contendere survivorman superamerica dolorous gromada tabua statcounter roszkowski coelius bided rivelino wuping 
buket goreham yimin cuitláhuac itma davignon clop sossusvlei hydrobiology antoniades maghaberry garafola fgg margutta daniyar colnaghi electrostimulation guarnerius fastlink tabcorp lowline sccrc ktvb landberg mcjunkin kfda aapc autorickshaws internists landfilled biospheres wenyan gaunts keszler luminarias weisbach drillship earthiness dubawi mostostal primeur kwassa shtokman gabara hobbling feminazi europeenne drybrough photostat crynant heffington buratti shivendra barriscale eurojust rimpoche neudecker isaack zyuzin miremont willox revolutionizes pacioretty xuri hiremath esquibel scoggin esoft totani jiron mnajdra openwave czerwinski sterno daping verdelho zbv élémentaires underperform kfp apocryphally customizability zagier allmen righini kapusta filippis spidery exotically cincom roseann prepa johjima ganem hoelscher cyberbully hoolihan bewsher hasland mutaz rezza kincraig desmopressin populare busso ombersley lenfest folkboat efo umbach kawazoe segol calderoni yandarbiyev enshrouded decelerator schr hasc faiyaz gallaga islandmagee domanski hansis quietcomfort rhatigan rikon montaut pankisi scheibel dirtied comeaux brender antiterrorist iacobescu brummitt gosier chrissakes dauterive strontian rugao lightkeepers bording niaga kujawa danelle givan eastford alevtina jacy steinbrecher schimmelpfennig unchosen odl rayvon chernukhin nicc nicolaisen retune apuan makeing innodb soulseek possibily fonio sláma pliosaur jaran razvi filleul janot madliena desalinate kaiserhof donyo theoharis doorpost spirea tousled kondaveeti estor shalane silka neurodevelopment pogs causalities geolog tanenhaus harkened carshare mccutchan angelinos nettlebed painchaud biesenbach permadi tregoning definative kuerti guidugli memorability coverlets fairgoers molodaya shieh averre explaning hoebee verbalization riether stranczek ,if liudmyla zharkov scurria jakeman monastiraki luber hachama aioli odditorium fucino hayt illumine nagui dtcc hikind alights pynn boeckmann vitrolles recessionary 
perich tooro barrowland koshelev carmyllie pigmeat barrus shoesource photog domanico decoratifs brorson electrolytically ncmec mostafavi appreciatively stetler luxottica reverbs gougeon indecisively eliopoulos wholey chihi cogsworth eldard loughinisland bulkers allemann sundermann chemtrail hardelot footboard masasi shippy unusualness sincan justness swazis guará mukhtaran withold antiperspirant kuci boudhanath talke ballinascreen dalhausser kouwenhoven okorie balcomb nishantha sleddog sebokeng muzzling wombling hively kurbaan hoegaarden bollox sompo dichen proprieties garnica asph zwanziger scheper nanotyrannus zeltner winnecke fardre rosello pointner rhinopithecus thabeet greste yunfei montier abramova xinjian bischofberger puttees nikky partings inventorying baxa powderpuff brouillet belaboring basted jotting corston claytons qipao hirelings samey glenconner chortens labiaplasty definitionally billett rezendes gambut chinmay blady beiler giddish jadot meridith whetted junkyards ghengis bureij tavernas jongmyo carrasquilla wvon buttonholes lammtarra twardy dracunculiasis spliting boneshaker unattributable labral epitomes marcedes kidrobot nomiya spiritan bicchieri packman saigol ossos mehari kochel barbash hotcakes liscannor tasselled zankel sillerman burgum sarasola abortionists decorus kesen staveren jazayeri vhc artprize corver calibri chignell mbacké dartfish punny bradie palese easterlies pookutty samworth palming quietist hagit wonogiri multiverses marrons wrongfulness nyons puenzo barlby aleksandrowicz ideacentre umbarger stimmung efstathios cinevegas sertao mumiy icmec brynjolfsson rehabbed salsman appellative okaz feagin gildor wmb fezziwig infomedia shurta leleu argenziano kestutis cupitt recored leamon popik dedlock sotin riblet duckenfield fuemana coaley pothas vamc petrof osthoff siwi liqun pranay rooijen zixi freegard bouda turbulences cambrils philpotts lewkowicz habia dziedzic dongtan langelinie wailea ailuropoda looseleaf reconnoitring 
cabergoline heidar afet railcards svitek hildenborough meerman actuates nemitz ingrooves abermule kasuya decheng repointed avet coddenham mediawatch trabucco aapl silbar kollege mofongo klina straightest woan godtube sigmoidoscopy promotors pepto pollinger merzenich sucher charboneau guoping whitecliff zingers eaec timbavati morbidities sharee hasaan arico kaiko ibragimova documentum sophism medicexchange ectoplasmic bunkered backbeats casalino shiau olr footsoldier thetimes chaiman lagomorph kotite emong stewartby oligopolistic hawpe recordists dalil parabens dyken togian ahavat wathan overide hods abdulnabi sevengill taghmaoui mpika vors hdri horcoff drms mignano helicoptered cavataio embalmers pestano dcma peruanos hkma dbo silkscreens kolosov chiaureli breadbox conexant cockburnspath pillages besancon anshul zebari sfbc ditsy multipla bruneval ohri gianlorenzo aokigahara clutz odhar lionhearts canete drex slicers tarmey perachora macia grados lvb jianwei auchterlonie btq gaped farebrother terminable daufuskie hbot saltern sonnleitner diselenide salei wanya orlofsky ableman vanney saccani ranomi agaist pennybacker sidedly perfer safadi sayeth anhembi monina mignolet dabeer yusha skalka herodium hecox antipathetic doore uvi chadlington cornishmen robustelli hartzog hiers ooijer oilsands inarajan dismembers vlok pelourinho tonkolili turkson karamarko jarc probot servicemaster toumi lesinski athill grodzinski prosen orapa dotzler musu demimonde wetv rivalrous aquaplaning cullins goldies brennecke gisin selvey berani microdermabrasion snuffles zeevi cozzolino sweetbay sukiya mceleney townswomen hambrook drano racs donigan taravella whyatt panych bhavans rapturously murtada nvt inexpedient zakiya delisi uncorrectable spillovers kembo kalms meldrew lynher rosenfelt dhavernas globemasters kiffmeyer haraz katama coccinelle garing hammarberg gobowen arbizu sallanches zuhra aschau plagiarizes benzylpiperazine hido selya subretinal zayda zumbach delagarza rowswell batelco 
pandor idealizations indigenisation kng vigano natpe dodford lotty passent nelis abrol pussyfooting blidworth uhrig raili kilonzo bookending whisenant retter gliha hackert decompensation pasionaria khaddam laveranues damnedest tractebel aprea waskom colorfulness massen deia blippy redmonds mazzie gayot yevloyev sludgy harrises ashna saborio dewdrops dovolani taligent balie lakmal deads charlus careaga gerrity nedlloyd fivefingers cinci onancock bruegger neveh jettisons scheinfeld detectorist xtrac reming zirbes disfranchise branigin cynog quindlen voorheesville hakansson ratatosk raws tebogo marrowbone devora nhleko butalia sidoli fache baptisia mawas torain phillipp dkt hanjiang blushed suassuna collaterally rayno bilharzia delbos buhai cañaveral matlow nasab faddish stultifying llangennech ohhhh kniphofia trudged yeste saza burani budded deap genelle benwick phaedon chiew gawr creigh shurin korydallos valie alite kylian borje khobi gorres starchaser vasik jubilantly negba acheivement mareen urvan gozzo bureacrats herrema thake atls cawthron mongiardo croakers moshtarak osteopetrosis quesos capenhurst pithoi callidus ogwr groucutt skynews darcie kassell stuhlinger zurga ohsawa ippa heybeliada schönemann shepherdesses aldunate gnw cringeworthy reticulocytes heggestad asenov quizzical calfee impermeability manjoo strugar bagster repointing gastronomie mazarakis dongyue ccla apga pueyo joydeep unbaked eleventy minsterley ossama castlerigg skirrow katsaris codesharing clynnog barkey natalka hamidreza mesud pedros trencin bacari bazzano wornham jackowski fasher burkhead zhongchen chignon magistro laundromats tracfone mcpeek daunorubicin caipirinha warshak itca poutiainen mirvac reall matakana manoeuvrings gite astrantia whipsaw meows doobies silus chaudron hemin inopinatus cerie angove chiong reata astrophotographer stibbe misters erlichman happenned sprake screamadelica vivus simum wittenborn barcham tikis wellsprings resprout typer ceroc onomatopoetic bobonaro selmi 
acclaiming bruggink ilegal bertaux modolo huskins bugbrooke sludges gomory ilsfeld orbin bonci bolinger diaoyutai steinauer turbotax regulary ghilas esps kolja tittensor singo nelon amatrice sipan motherf mvb kelan salsano mcvean dissapeared bruerne milbert comanchero suton keela oncologic rapidio ardeth régua libdems anahi ualr trumans genereux ossawa reinheitsgebot menosky esthetically interpretational herrenchiemsee sunbathers banditos ottl chuukese decapitations jaksa redenomination paulaner sempervivum natalicio procrustean faneca hahoe kumis follwing janell elijo rossoblu ceridian weik nael wolfensberger hosseinpour kaddouri akhvlediani rhossili zeyar excatly apda nimisha fragasso strensham aerialists clashfern pasturelands miazga wending chettiars hohlbein thormanby bisenzio postminimalism espirit dorge impetuously dulls favara dauman contrabands xiaobing michod outshining incongruence campell martials beckmesser adagietto buellton stazzema partovi hodding bupkis prosimians guaifenesin kaarst insound vahidi finocchio cocca yahrzeit humdinger bioelectric stratasys scumbags llandybie peasedown binter adroitness belcore puchner jiechi gapper kandha albaladejo cgis mozeliak bagosora tollet wachler cliviger bashara cnil uplisted circumstantially leefe mickeys coppices xsp corniced iema cfbt advisedly pepple hypocracy riesenberg yodh landrigan parmly sideswiped safiyya berléand lmsr feijoo ecotricity angiopathy lisps pandered wli hollesley leafleting labuschagne mutum stangel kosofsky skycargo elika prolife trinculo sarig susse felicite elts riadh frecheville chlorofluorocarbon peligroso derrett anvari barzanji onkyo muarem symphonist djohar oesterreichische itamaraty umpc shinmun mockumentaries mihailova veejay brint dzongs kanat epiphania zahner agx soundexchange dozed saillant nihr milanovic polyarthritis abdool aarc nerved spansion kapral mcgavick monongah luderitz russkie mabi marxer murrelets conjecturing rwt tacker arimura kindof tibro prac arcan spodumene 
cherrypick huaqing pushpak redmann baldino druthers davises pyrethrin bohne uncf itaipú gingy secularizing conversationally daddi elodia proser kilger popularizes softswitch emec hyperbolically manouche dunum bustamente meadowood pickaninny creamers yabo flechsig azodi muumuu pereulok lothe varujan eshan kaupas socia worman ludeman bahad atlantans lenoble morari proselytise nograles chalkboards lindmark aimco hagas hoved leggott aswany pleon herpin krulik tindill trillick salda tauren wastefully butser passard abdominoplasty inacceptable pelagornithidae naea echolocating tucanes crawly safavian meifod agropecuaria empts mrdja bilbrook saydam cmon imn frenchified carens crossness misnaming rayas skeie saeeda bergessio bersin glycated visconte coastland watermans blumenfield muzammil felecia margeret silodor eyeshadow sawadogo beseeched chipeta deever fernleigh mbes universalistic alvo andron saltcedar skimp pederasts kanaly scotford alexion skv khavari kingsbarns physioc dorsoduro lisnagarvey mindspring correnti ninevah fimian steffie wyken innateness strangio ditech antell tumtum parren shoumatoff noncredit canyoneering barsetshire jaunes rhomboids prioritises tribhuwan mcconnelsville blithering rübenberge geise clerkin vulgaria pertemps playsforsure medalia bankshares amagiri dummerston fusha manesh cierre sylbert swearer eskay marolt salahis troch petrouchka icod marsman navickas lynns endi galotti barrys hakes scheindlin zubi ruegg mairs duffett nexia plazuela moena tuatapere baluard nilay arfield eadgyth welbourne zent guzy powar furchgott raulston abudu imraan majano keisel bosavi ehnes electrocutions rebiya garriock labaree ossificans pavis savills arboricultural unbent selley ataxias fulminating ankleshwar tylosaurus lober busteed degus littlehales iddings alameddine zunes durably dealbreakers famen tomago deferrals vossler uwins faku tudge mannone pardoel quepasa mcquiston allegria haaken ethambutol mâconnais cogdell signficantly overexpress parranda 
rakhshan timestep remediating knifing seidenfaden sanidad mcwaters jerrells rousselet mvnos backroad dufton bndes laax skyworks torreblanca jerrel loganberry nubs cullison femaleness unmc buchdahl djamal lecterns thinkpads solidarnost luddism bvk rostill francies evaldo mozyakin tracz xiushui rachal arsc melioidosis sammartini rosmersholm memet huanta denominate prestage driburg hawked satura nouvelliste seksu alltech aanholt pasqualoni pontbriand prober westcote homecourt megarry isabelita subianto busloads kutsch sapeurs ohsweken shaniqua tutting flowrider stefanyshyn brauch munisteri kordan pichola aksenov binstead wiil bielenberg ekeblad surfwear meulensteen hbj dahuk veney jalonen beringei muscardinus knla sherring nadzeya kemnay sabb whooshing zenone jingming kellard scic propably serigne plimmer kalenna merrigan carnyx belman carvoeiro jantsch tanztheater bedfords yakka interpal splats ballclubs ajdukiewicz kutler reticule suto personalty trockel wondrously ferriol nusserbayev mandarich huggetts haggs riotously ysaye kapsch enticements overriden sloganeering schimmer momia spleens zhiwen bunkhouses ziprasidone mspb flagey zanin ysanne mellisa sabden drakoulias quikscat qasemi ablyazov quintano availabe automotives dayers garbageman vendaval ranched shafiei prsc thottam roag medhat shevket serrell astete crystle iprs messiter assitance herlovsen shipbroker renaultsport koteswara schanzer lapize quintuplet hidding biello affraid numis univocal nonono cryptomnesia vitet hothfield yanov rayle sissako rrw bilandic coyoacan stonier dundrennan oltra moun eliecer conz outmigration powerlink linzy peddars mastersingers pankiewicz hinck couñago firstrand unsteadiness gegner lumbered searchs zanders corvairs knost dermatopathology dunmail levs branquinho formulator decriminalise bandstands indivdual orthophosphate hartsop aggrolites mahen kingshurst karmapas labas llanfairfechan porterbrook mannville kevans kerfluffle mutara brynteg loutro hardial nonrecourse iafrica 
wims adduces clusaz hrdy deniau mayfest steinhäuser ringold goonewardena rucha mashes nikiya timme delanco etron faussett mikros poupaud gontard firoza jagatjit ferrare folksingers wikramanayake humax najman rasmi demineralized merkl rebars cornflowers pitroipa hotheads fkk ikebe heartful bohem joanou segontium ianniello invisalign valjak epigonus erwood kurara motaung decribed sinkin cous campanario brumwell whitsitt skavsta masontown maharashtrians hotchkis mirax boggan chamisa politie bryggman gudmundur fraher disingenious swainston irureta pavier burias howcast minver czapla hogrefe autonomo kurson bridgland breacher charfield cargile sexby meltwaters vignal adriamycin deselection sanjin ticketcity whittingdale sovremenny sensa circumlocutions bcms jiuhua nielsens yurij greeny cygnets gelsey improbabilities faithfuls mesquida roggeveen luthe lorey gome attune fole ymax smoosh robocall cantelon dulcet whedonesque jwoww faubel hempton peppler ruwe dichio tookes enamora pâtissier huthwaite howle empanelled clearasil stanescu moverman ajani wahconah annisquam zonca montjuic miltown dakers transmissibility sanmen trimbach mcdo affluenza falola bondsteel abertawe consigli springthorpe cartmill tomiichi ankawa raghuraj buangkok plaxo deadwater izady companionate depe bancos sidell mandroid candian joleon kaurismaki cityzen robar tearfund humbles guotai guanaja acquafresca ranty saude scruffs kraenzlein vaughters eginton almanzora bridgemaster sandercock schebler leinen tehre segale rosebowl plastination alkylate confinements hasell factionalized saah pumpe ajaj miconazole magre choclo sigfusson drusen intertie ordish guodian binyam biglerville treichel eaglestone pyy shoebat omrani nordegren vauclair horrifyingly kleiss kronenbourg kesner husing unsheltered gospocentric kumyks wriston tryptich tamaro sarwate nellist disquieted minimo plockton lucado immunodeficient afari reallocating tootal montefusco umbilicals pantalon nakas tidey widdle managerialism splashin 
assails iannetta aaco meny larded zurer prepayments lloro gamecity partygaming doffing pavé puppetmasters energias hafizi jarren disembowelled kammerphilharmonie rejoinders lowthorpe sarb inss eucryphia guarentee devinder artola ichimaru tokonoma stobi montelimar tongil bioprospecting planaria jodee judum tatem ofwat duberstein satinwood coxheath burghard bridgegate grasberg murlough rareness wrth babor overreactions sekkei coiffeur duboscq xinlong bingos glug schamberg kuitunen montanans brechner whiffen cutshall wtkk galp nadji bufalini girardelli alaris polu bergelin hollenberg chawan secy sarazin nasruddin competely steamrolling bourns noninterference carnall lepre tokunbo workover clophill nkem betteshanger daneman koobi vedernikov atkinsons revin shrager isbin comapny putinism stawamus concertinas stateswoman murderously kandol breydon rittenberg coylton micex rutina malebo hourihan cheapening baychester shabazi feltsman sieghart allocators bicyclette makeout pcyc kotido piggie siguiri larrington nessesary veloster weaponless thorpeness nilekani jonelle acxiom uchenna presper purifications underplay dogmaels semestre mcleroy masisi enten mccline maricarmen kilogramme ryneveld jermichael velappan courteline observants unnerves grindr kwtx spiddal shaalan jamiel altruists alexeyeva holda afleet transcribers undrained dahon sandqvist midwesterner preparator yermakov fragniere kwv brahmbhatt ifop afkhami bradshaws trbovich arcahaie disassembles duquoin nigeriens dillsburg dentmon lewman elima hovik chawki halstow tellechea arvell bergmark slurping cheerless hernandes lly zhuji gdv rajasa buzhardt haemophiliac myrl afic yusup dabbler vry sprezzatura riels wooburn ngx reregistered muckler inflects cedella ribaldry backshall unfussy grosberg striefsky swabbing kecman kolesar phuntsog humidities janish kubacki ruaridh soms bohlmann phuentsholing bradham grosbard musli kason papalia friske alabamians iafrika stifford laicization breughel stampalia elektrotoer 
kraftwerke cetiosauriscus luctus manzel beyong floodwalls stejskal prochazka telefono jamaah hijabs holgado chargebacks maxes nanya griffi chais undergrounds mormando suwat hütz unlatched splicers anogenital disorientating lurz coloradan ffos excluder gmes aahe mortuaries picassos synthesises gll fargate grimness onchocerca trollery rivercrest smithdown slops ospi sackman aviance liberalist jiggly nomoto ganju cruzeiros yeares afci briody daneyko noyo kurnitz bramlet gabion dörrie skorpios cybershot padura richhill hiraethog klrt churandy voluptuousness ulcombe dirs honeydripper yahiya bkw wharfage plantel abased slusher greenheart knowstone risp champignon kneehigh belstead mcbratney mava reoccurrence bancassurance breds shibboleths ransacks essendine maxwelltown hurlyburly ahoghill wowser panjandrum gvc dialectologist jurançon weihnachtsmarkt werkheiser declaiming vyntra carbidopa zhaohui sitko rockenfeller firestation palaeoclimatology glop futz llw tranexamic misguiding schlomo shashikant boediono budiarto depaula flaine elmbrook nicholaw socializes jtp maximalists gilfoyle hagwon kanafani zicari tabletops accosting undisputably qazigund rafidain ecns aughts maraldi exenatide fumé sublimating empi stanlake pether sportz youla overfull wamalwa microtia ringdahl degress jinyuan oxenholme caahep ihedigbo abps tailgates tbsp amaka linderhof iuliano jeongeup sarraf concelebrated socios bardell irakly vinik florenceville wedren gigatonnes windowsills equidistance drohan pecho nooyi walcutt saitta crueler stenholm bifocals granose geometrics balles nebuad mcrib terabit grunwick storlien cheongdo obermann isolations ziaul princi fesser bartosik ageas bristols penygroes bentleyville swingtown elong fornham kurlansky stigmatisation deià perricone ordell klingenstein hechter tiguan brouard lauralee grobet sharah everolimus showmance agrosciences iufro valliere henenlotter costley kasser ivg spiderwort playgoers icefjord blackboy bewl farges ecopetrol odessey narveson 
steinhof noncitizen kirchin unutterable dellys sanglier zoete underlaid thornhaugh scitex stolfi delcambre sstp smisek korber insulins feghali swindall westroads shakman grandmas kenoyer bristo stongly caoimhe danos feola saggy croyden bankatlantic ngongo mchinji possibilites meditech benesh bukharov kittipong unmercifully nemov griles neurotropic anway hayriye homeaway mailhot outweight fiander springers inniscarra marthaler gholamhossein fchv schimel millibar finelli bruney graziosi speechly yevkurov urol schedl allaying aguecheek npia zadroga fujiya jerrica doublewide ronal rectally beels cridland gladis sangji chewie llangadog unsettles burkei gawronski tomboys gailani reganbooks pothead multisyllabic ragbrai lassic farvardin aronow filé unnerve piskorski cheika delvalle vesterbrogade behdad sibenik daney sangiran hazlehead wargnier nasserite carfrae poofs twofish kimche nowness mannon pleasers fournel fary kaiserwald basem lopat hcci sentinelle gabbi latheef kvbc buckyball schoomaker seniormost rosenboom nacm piszczek uhaa duflot ijichi ibori macsharry hartcher coggin retweets arrr thassos waialae pinehearst eliad spigelman lisc entrecôte pigging peelle huaxi tvland dupontel hensol jawbones lyminge russkoe scnt jodhpurs oscon capannelle pangram farted kthv pipsqueak cornelison konono carkner elegar giac cumberlege oehha cober momeni suyama entrepreneurialism conservationism enrichments hcps constructeurs lilliputians aselton aith rogner blepharoplasty mamoon szukalski uvira allders policys superintelligent craviotto nicelli burfitt sechler situationism dooney juvonen nakorn vujic internments meeson tipplers kosslyn kamana jncc forecourts suburbanite kosdaq nfib trefilov unknotting sonan mazagan jujamcyn tumescent imiquimod tabberer rfps galey meshkini ropey nipp mompati colossally tricolori cremating kazmunaygas hoppenot nebb snapback presenteeism erectors righties monegros shamas brachet unfastened begelman renou hullavington gumulya uchino superfecta meihua 
baneful ladha dorasan minuted eurorap latton montminy clodoaldo oshu dieing lahaie gamey amoss bugel neuters lambrook fungo kabasele unfurls saunter wingreen playgroups catalogers lacina pieterson villechaize rangemaster wrox hoiby corren lightheartedly francisella looksmart gaglardi fukang portaventura quintron amre cmrs wyda configurability vinai inseparability indicom shawish lipofuscin aneurysmal biocatalysis shalders friendfeed griddles bennets gorgui zednik steinfeldt dums tianma kalw ouspenskaya justiciability zwilich lattakia jarringly crotchets eaglin unpasteurised heinecke doomadgee dahabshiil cuckney clendening solorio manchev chokling strangeloves superbia superfans cubillo pasatieri gioni balice bedspread spiritedness pifer joling turchini jakosky amrc harwick unadventurous relex yarlagadda singalongs pongala annica lessingham kedrova fiar siepmann seré zicam bryceland baym muneeb differant iapp macaroons adventurousness grathwohl lidington qurans mackesy valuble wunderle ballgames sugarcoat stalham secunia jaiprakash crossbeams choristes ponniah paranoids hibaldstow belloumi travelport mdpv shaklee inflammatories fifteens chingari dlbcl probabilists nicey daggerboard towboats rimell nocentini qiagen lare wasmund desquamation buker runts frankweiler karry icex kozmic oxaliplatin starkness porthill senba adeni pearblossom gauvain mehrzad weetwood desideratum poett atayev tiera naudet eut ofdma huaibei hydrocracking gezhouba runco brimer tengboche andretta heilbroner dinks danyal carmazzi vald tesseracts tallin foetidissima ballbreaker mavridis basinas wehn hibbitt millerstown andic laino enlargers kalashian yonglin cerveny kuller hefley dunckel haxthausen anthropomorphised barsamian berthod restauranteur indosuez laperriere goulard bassplayer hideaways lijjat guidewire fauld satyapal jouet manikins smallscale janakiraman aiyer hailer salmeron everlong beatmaker akerson mabbott alphie brobdingnagian parres stoneridge reconceived ooooo kinderdijk cevahir 
embroil injuria syko bankoff hasher verderers socastee barewa kilmessan grovely gissar zylka carrols gulkana ibai beineix cairene directtv bittel tomake seino tytell scalabrine grotzinger ecps barefaced debevec tiare bearhug camelid alium vasiljkovic sassan beguinage uliastai hexachlorocyclohexane stantz allées wieseltier saccharides chopp rajendranath enteroviruses unitive ciji simlar krack nessen bushwhacked cbgbs maimi encumber neoadjuvant caddied asep mirit gunnels rzd demoustier wyff macneacail rezaian obala calamansi dahlkemper cdz zieler singeing biotechnical spio wickwar osbornes timana lightings benney moriconi kunder interruptible stroock dinorwig millien mows pifa fritillaries limani godchildren piker bankfield bhojraj bpcl nyh rodolpho heary babeu zimring rippert buchina capellini tavare masoumeh shabwa thangai nelia cfia goscote menorrhagia mcleavy racegoers sheilds spyders sijia ramush insufferably harinordoquy hesp imbricated biet youthquake mouw fariha retractors wickard obss dexion kedwell bibik underbite fursenko ecotourist aufgrund erdan semiha volvos antilock timorous mayobridge stalnaker audigier barrueco tollis zolotukhin yovani fosi kokubo lafc waterlilies fettercairn ornamenting henkes salmoni starcatcher cardioprotective masloff enoc blucas thomasin arvest mundhra buachaille enic shewchuk shortenings chafes dulais kyivstar morticians immi yauheni danter blondet sahan addazio ranya commensurately cyclothymia reponses moennig panagopoulos leason pommiers oprichniki cfmi jossie guit alfonseca centralian ousman serms kamares epigallocatechin jingchu bygrave strangelets tumlinson leiweke rihards civicus valbonne seacraft vaage nagarjun mhsa languor lhcb rosborough slh nasra lagergren bassham energising lollypop kxmb motiur merbau muchness tizzard gioiosa phull muktinath wndu munhoz thessalonika adorni aiim pranger krupin trislander jeckyll anslow muuga etios mccarthey azzawi gragnano demeyer oquawka forkhill peelers jiaxiang bintley llanas 
fernhurst fajitas dalene heraud boysie condrieu feasterville skenderija iedc aliou calonge buckeystown invoiced tosta stockists caixaforum suaram andrist vondelpark distington lovettsville baculovirus levinstein graden rasps chrysalids playmen plekanec somei eclat brinkburn browell suheil mó goldenvoice ofn derris hajizadeh legerdemain georgei umbrians woodmancote onw parameterizations alerus gagnan orlanda cramb narey hydrolysate honks goradia playlet mozilo xover bonos boconnoc mezzocorona mclardy indentify farideh hunsbury capco bahrein hayatabad suicidology undersecretariat hitchhikes aeriel illanes alogoskoufis reindel steenbeck taim levegh chandeleur essentiality blepharospasm mahender corduner okeford kachu charcoals clockface güttler globokar realisable shavir stiffens allbrook degeorge nissenbaum rectorial ruskie hcfcs nyckel disconfirmation manelli spidering guignols poloni vongerichten deschner crivello tiffanie subsecretary creason panadol farahi waterboarded extreamly caerwys crawlies carpools dulzura sinz pilrig absoluteness bhaya krap worksites andriani ilani eculizumab cincotta prevelant fiancées greentrax shotz volpaia wheldrake certifiably benignly foxbusiness mahay moussaka burnfoot sakichi absentmindedly bowfinger fehring yantic serpe montecristi benkert luing ripps farrells pspv sprod threesomes hillfields centella birdbath davita mcing picower schwarcz kocha crigler manassero proximities gruia stradley catweazle didomenico bendelow edano secour starfruit leavens widden killik finessed bke duffs boulden hanegbi uhlenhorst diekman jbic maaco unhealed serap imin krm haberl microgrid navaira pusc radakovich hermance zarouni allai kosse boim buttner calipso ventotene shoffner claverley hoffenberg rosane sabetha nilin akale floridi zamecnik füle hallum monter liebreich principlist tumanov barten dobol worfield leafa natela greenwoods hgcdte zhaoyuan langlie norcom smmc mbugua priggish roshon gmap peperami odlyzko dowser shaddock polycarpe 
obnoxiousness prodrome bracker niemczyk creuddyn showbands lewry labranche verschaffelt cappuccio hintsa lamely nolberto mulhearn bernaldo sailesh atayde ortved stubblebine afeni hyotei sourse mccorley timpanists wholescale townscapes tuffley congruity jigsaws ballz understaffing mouthguard maclaverty atras shella reconfirming nyts mithai mckeel sapio klauser crystallographica cycos romeril chanonry tarasco ctenophore cozmo nativo sonestown felinfoel guttentag lockheart kbi horseferry wallia coalición noser takamizawa pheidippides villoldo omata workshopping daycares plumm dealin doretta upthread refueler joppatowne klabin hhg sprats bussink rifu triesman aicar desmosedici wuld teks shomi pzu boundy unlikeliness wildavsky zares poppinga depa springboarding bubblebath budock stevi eitb flots medion akamas gigatons kokkino mimick femtoseconds lenel alima gonta maleo smolders scrabster baldeo nullam libary ener sebek marginalizes paulsgrove computeractive foglietta shrna srour cabaletta bance josemaria zennström delich sunshades waunakee jilts pletikosa bioresources unchallengeable zyzzyva europeanisation blanchar roberty ellef hamdullah macroeconomist momos pinkwater livernois creasing htis wxii mcpike biagiotti eurex yeses guyanas mentorn encinos weyant jindong jeudy guihua neuburger kambi outboxed korch ,they pastner rochina retiral akris halophytes dilston gildersome mouthfuls bkb netti yoxford sunart annother bacot kallos acetosa avastin serbin selworthy zinberg hoxby hornbrook pendley julani palauans cwn virgilijus venetiaan bergtraum kaaki swallen pneumococcus adron schneeman bullmoose arces bonna northfork peaces josefson kunigunda buñol demchenko cyncoed nubi anupong churchwide aratama gulia chelsfield blackstuff cuéntame sansonetti lanners sambadrome definitley membrillo maylam penally temeke germanico cupidity ruwa ayiti mauston flagons betweeen medvedchuk hegeman gunsights adila parasocial montem bemersyde readmit jeelani eckstrom ladany baised cookridge 
brunnhilde zuppa vauthier narval forequarters tiefensee admissable jackboot cringed sahli darks suskin daikyo foulston targamadze cagnina elyssa raptured povero metricated rutman knuckey handbuilt autophagic leiomyosarcoma fortney weelkes daddio glitterball elice dabigatran lumax thinh flugga petrovietnam papathanassiou buitoni motoyasu tabernas eyharts mousehold jetz osbaldwick everday yukai korty louay iita hoblyn chiriboga potage camdenton sabco lehand shucking reprographics ervil chidsey sarabi friedmans eufrosina diametrical eiders dermatol kinzler monkeying yegua yaan madlener lappas declaim bananal beeskow privatisations ahmir zweden wilzig eventers heyhoe titti rushby newcourt prety giovannetti melitz jgc moteab disintermediation fze guayule worldpride staa heekin fermenter odubade méchant teahen aouad byfuglien unpressurised borena datapoints cibotium benhall cefta wohlstetter assane wingdings shanto aguerre massicotte hogin specialness corian vollaro lukasiak qkd ovulating tonja decarli ghahremani appier remizov bytheway cmsa quadzilla eibach sukey urg waystation standring squeamishness willmot bedpost astroboy furnell afps cawsey fuencarral underprepared jijia tavita lojeski lowi pozole rammellzee msq swierczynski anstee rollett lastingly maseratis kirakosyan hurun onek deerness einzig srms hensher souqs kauffer misplace rosebay waunfawr boora killshot cantuta sgoil maguro lfh paranoias stripey cascella taibach lagunilla eadfrith sorafenib repossi battlefronts ethnopharmacology skiving dolidze barkell haitien milevskiy deutschmarks siyu blacklion haruta vecellio shiplap modjadji lundblad clova ihde bufala grishina forgey tomescu elymians senning cxr gohary kaichi galguduud queendom lagerberg buzzkill kakkad muthana decety delara panzeri yigit petagna siderov dignifying mininova paektusan benka cumulation tyshawn rayes killingbeck humerous renegotiations banzi xeo zarian analysand ngubane pörtschach coychurch progressivity blander coastwatcher denselow 
zaffar wron endocast cfact borowiecki brevig nonbeliever uncouple sediba quindio mediapost manicures shakeout mahgoub myristic quenches whil barming hable drachman purdham caunt gledhow cadila stopcock retreading childersburg perovic rogerstone vonck stankowski chelmsley rifi youman momand ewt saper hunstein simma hoene congeal arakel aule hammick paddleboard howmet shafee romasanta gabas shippan conveyancers leonide dhanapala wickremanayake blantantly rwandair bancaria karbon kotara itinere outernet shoy daheim langata naccache jianzhu arthit quesadilla fatefully qlogic clybourne showtune stemcells gorre pelous papahānaumokuākea gatsos overawe titsey markab helgerud dreamliners meneely tajir atcheson lisia tarnation schoolwide terracini boesak grossfeld goolden incent munish siriwardene misso colins webcor wway scourie swir jsj positas silopi seining guiglo halem adeyemo wiedeman kacie phibsboro ruggedly soifer yida rodricks abwr barsoum omegas fajer nanofiber carnlough irchester babby aaos márai stucker lears ilina labanotation sitel lettuces jera sawit attakora fromt changs genessee meitar kvoo brantham steavenson toshizo brandman urbanek tajinder danshi sbas varno millworkers werstler grisel cobscook adelaja barwood statz shrikhande buccaneering colbourn stansell bensons grupero wasel kornilenko elderflower hosenfeld crcc simpy astrada ocreata exacly xti panelli cottontown spiels manglapus puskas présidente mistyping clonally statelets gizzards kosowski laggard africare espersen wellfield moneybags thoen skipsea maskaev wachsmuth yiren scratchings gasque quizzer barikot brandau providentially metropolit mesika resons beehler thurnby comparitively lambi bobrick piercarlo lüdemann woub iketani eolas courgette dissemble dawsey tyrangiel replenishments shihezi janacek equador siloso abubakari azizov mesón busser auxilio oxf vingaard derosier picolinate szymkowiak kcf ayele scabbardfish unkillable gacek farocki unrighteousness givaudan deadbeats cetyl grushin 
geisenberger sabour perquisite histoplasma asiasoft accidentaly mycoses shihri daallo gecas merrygold dhale confabulations faqeer skeates stiffener halali boura cassaro whiteabbey catterson kanuma lexton monteforte rustics particpate flavorless leaney zdroj decompensated karsenty sagely gvardiya bravin udhagamandalam halethorpe athreya wildau bookmobiles freeflow jirón cyle gravies olympiades watervale solemnised eliding gryzlov moneygall cental ushanka mayrhofen propogate balouch lobley hordle callenbach colposcopy twerp uphall debashis zibi campling avelina crooker beauregarde darland inview unrepairable ibarretxe mccaughrean bethia acep gauch emptier kapaa shijingshan endal hairlessness tunb gasparin absi ctfc mahlsdorf bridgen trecynon antimonopoly cohabitants mirinae voestalpine namazie auditioners rosmarinus ellacombe bilmes vesi auspiciously cipta digitalisation direly perey liscio kingswell zackary unredeemable gansbaai ozs lengai eppel cheorwon lighthall termine kareli rodenburg kurzman cheryll anin dexy dingding bukamal willesborough stickmen manque gelbard brotherston pilotto hohler gillmer baronova staehle medef astrodon mald tisi librado calnan saumon boisdale warnham depodesta mercker seabaugh butautas montas baloyi crownhill lomaia botten damns rizzolo sigl filamentary unmodulated riah stanzel writhed norouz doisy odesk naugahyde microsemi prows enow veramendi regente tucanos professedly laters garshasp finerty clevers icaf welsman cardiological mangwana golik mpri intellectuality medaling siddartha grzelak feruz disconcertingly bakeshop barbate cnvs streambeds orignial weissenstein bourjos gangbusters tarika vaynerchuk contextualising lutine tassler regularizing gaiter broadclyst zasada licona bellenger rectenna prewett rhaeadr huggers golant syas houdek castellvi dumbness hefter hamamura antithrombotic retsina schey unbonded restrictors minack ahome eyefinity perthus fouracre helyar prayerbooks certanly galat undershoot massimov sharkskin beaut 
untrodden azarian ajara plisson nyasha uncoiled jinglei bresser asesinos mangurian stannage pirogues nadym malolo marpessa ofd wallison camerer irrelevantly slavich duplexer merenstein meyrowitz revile ratemyprofessors intune lisl suburbans corralitos mckeegan proofpoint dady mesoamericans coutries asianews irruption lightbourne ossabaw helpage jiff hartsel psos yibna cinched wilcocks flury wishology clomiphene stup baala baptie tirez koops alkhanov huaneng everex coiffed availabilities solidere cashner hodan recabarren cvetic nendaz angiulo dentsply pagai barazite victime jeanneney jinky vinther upvc marshack cpz waldbaum febles sandipan romes hspd eckerman frang libatique jawline toha smilow heliophysics ninkovich wambsganss tofield ostergaard joselo fados nassour nelsinho appologise jetski competant medidata rtds inves vestra skryabin urbanize glasby pagon buess pathobiology vitanza sargus rowayton berky kaust dawkin cervellera roiphe griefers komei bealings tajani unexceptionable nakaima rdn midcourt busbar mulheron yoshisuke charlatanism tapeta calderstones undercapitalized parmele savoured salivate flechettes stroj kidar izzah kobin jobing osnaburgh ecofin theofanidis photopigment sandomir effra kepel covan brusco kasubi adulterants foetidus dubby gorgone waitzkin wimbley powerboating medicinalis lifeguarding bmn gropes synthy blethen zaeef ladyman cellulases venton jersiaise slmm arzak quiett batfe campilongo stamberg cler sharqat bishal pedigo florita foreskins rurrenabaque blackmarket gossiped gaylen escot beschloss accio zileri thereat defaces superville guiyu jefferey jemmott waymire senlac ceed lasantha fowlmere gulangyu hanung selee corrour overington schwantes fallings bambú hughesnet kilani lunchboxes gorby gardenias medgyessy teraoka nazik cevian bellhops yeremenko noves lowlifes hackwood chippie alpizar scratchers boilerhouse nanninga armona wayn devenport lastpass dyneema kopin recompensed stormin arraf isreali softies calcific heskin subcity 
malev brewington gastronome hesen luciola natsios iyan craniosacral grubin tabloidy edamame chattisgarh futral mirant doored antonenko rlj medecins cenarth jejune llamazares skylarking franchesca wude yeongam veselka miot fastfood lebovitz deutschmark jubran eix guédiguian monan radiosondes ferrant sajith parasomnia burkenroad sensis novelo whispery karumba madiran ohtake freyssinet permenantly ondarroa outbreeding lijia teca brichant lupillo imrt presumptious sheathes venusberg weitzer bbcs ponzo rowlings kozakura akber stefanovich toomua immigrates oloroso robel ashrafieh conterno kiele gradgrind prognostications gavins cannadine huxford ldg girão zenbu thurow doorkeepers keppe welco wifely schulhof outdistanced giallorossi turski cleavable nehantic peccadillo freelances heilbrun lenor chaumet marlovian silkmen oculta mcgartland warumungu vedia leduff bioterrorist barberena konchog ryecroft albarado shehryar ruhs silverfast tomassoni flato destinee lisker ditter amena vru pepperberg rickerby mankoff hendrawan visanthe macronucleus kamler whih crossey ockelbo ekerot sanbornton yetman mmsc rutsey unhchr eristoff perritt subaquatic maramures cushley crosas neish nbpf ibram denly kamangar relvas merkt reimold walsgrave rustu boldyrev walbank chapline visibilities zhimin patellofemoral eex leanza slumlord mmj gutterson nedry mineable hackbarth lisdoonvarna velours cardoni drai zerbini maccaig batiuk hyperrealist eichenbaum llanfoist manthorpe mueck goic nohant carolo neureuther mcpp rostering afak hawkings whalin chemould steil aalberg knwo entner basudev bantjes astafyev airheaded yaque houchens nuez extérieure mamberamo macauliffe coppedge dettman unzipping krombach exonerations antisepsis fishmarket rolon kalafatis bedsteads gavigan emeghara englishwomen serrania gdn slathered ssms jilian barths deuter hindy hesket ferra kovaleski gabalfa yijun cryptochrome gobber sharktopus shearon marw luno niskala kolpakov murielle rpx sathorn elseneer esbl kenthurst comaneci 
lopham rummell airtimes achmet busbars sedric servient soilent universalizing kyriakidis chirang arkema botteri zaina brancheau sulaimaniya cafr telesystems gewurztraminer prostyle lasan aunjanue clarins fleeman milkis urpeth blees acryl vinery achron eflornithine haying vinters resistin eisemann cadishead ulker klodian companionable towage bassir brún newi youren fanar smullen teuton litterally ostar ibiquity hauptstrasse wolstein aguacate usss anonim brockham capesize szmanda salihu pemetrexed aviel catchword contrario ghiotto vasallo stagnates velandia gluttons salti mihajlovic novastar mysociety unteres haloed backcross gloryland tullman yga angelitos foulest kondratieff crif ijp quicky federalisation goldbeck fedoras avina puji emilee meretricious nawas latonya oversaturation pyrexia ghufran lysippos cafayate gertrudes riederer banchieri methwold baldick falasha jaworowski gosaibi peculier magara vibroacoustic jheri untwisted hepi hunold wagged skullcandy pathless birthweight penpont henselt cfcl anthropoids bachardy lobiondo welldon reenlist mcgorry kurnia spago shougang bangarra updatable tje tolchard raineri deru trevethin kernochan rieker prigioniero moneyness onthe kerasotes troxy cristhian dummar fruticans bioenergetic donside weggis lessore atalla yehonatan aake maje kales alere lienau toups maximizer grage malbis trepca stenungsund coper loveys poate zajal spierer democratise unimed giusta assoumani mercaptans adeli genzebe tuama euh shijun piotrowska bartkowiak locky laughren yessica feury delfland maidman metalious tielemans filgrastim ehlo sirrel bespeak oich sheu langbehn papermaker elgen exaggerator cicak budarin lieing hydrofluorocarbons kissack leodis yungang melvina alverton waisman abstractionist fassino cumbers muhanga biobanks savours prausnitz latulippe burglarizing delisha muleta busic nicmos gayen akomfrah rafaella giuca koshetz tadawul winzer realogy migrante kontiola wdef plagarized tadahito abunimah dickason murino kamli hulsman 
panhandler matc boath ekram barquero legkov poretsky amuria stratcom piggeries whitleys bitingly herfindahl alanen clignancourt kular brotton waymon fuliang thrivent etra credico kamaboko heartbreakingly overpainting gego arcu rechy cringle stoclet yeso carate cremator rehear borsalino trivet krekorian dotun sonner sansum cazzie coyer cuscatlan pioneertown customizers irimia provender neotenous marm resemblence snoot superlight bunking duhks everpresent mosborough rohrau shallotte firetrucks casada cuyuni cattery kolachi lerato faurschou navigo pengwern camon hezlet holmesville mccallan ditz deckchairs bonsignore reposing starworld coachbuild feethams hildenbrand amoako kazal beeks letendre youyu colpaert selco pimiento parnia cofresi nabers multipage polemically idonije mitterer nonphysical wachner mandoki zales wahler inflowing wowie tirebiter manhart sarmayeh dudka hidradenitis mohammadyar ridging aneke saadoun duanwu siginificant bluebonnets sidique witchford ekici osel baliles aleotti redrow serdes belber droppingly propjet oqo greyboy bitartrate psinet shervin langis hershberg mutoko filipinotown clobbering elkhan anucha mcwhinnie drazan raido witherow kme zusak neurolinguistic pinget nottle michas resarch arnao acsh tradd birck wellingtonia macoute wnat vloggers evenlode fulgham ugonna vinke gaylon vanderkam colza carnan borregos geniality irbms bracquemond alckmin yousendit alloush ascp ferda reinbert fortunoff kylesa dockett snowdown tameer sangatte caramelization ghl skanky courtships carreta obikwelu neutralino intermetallics mongeau dalmiya pece mayombe assel voirin childhelp venero bunaken moei carrega presspass gearin unswayed bartold immaturely oahe beqir maximillion anderl cuidad engemann bluescope tamlin arcati blakenham mentalists playlets azw duisberg dunbier greatcoats mirams counterblast seismographic szczerba bubbi oatcakes usak burnam crewless muntadhar kimiya underutilised witkop awani kirkee sbj dnm yunel irbs szetela usec grigorije kutum 
molls zhihong hatmaker rvl southminster contrarians dromara annastacia tooks dashers boschwitz prator eizaguirre alliott digged renaut laureat diabolically leopolda heirship undesirably vallegrande lorinda akqa egad motorwagen leclerq sild bordman evdokimov fengler mazaheri jereme cailin podeswa disembowel granita worlde chavit amulree auo cach pollos choicepoint duley fonden jeanjean abergynolwyn rubiks fialka activia whirlybirds airborn fode ponseti rozina iffco unifrance autoparts centinel exhibitionists convergencia ascod dunhua oneshot sportstime fishnets otcbb chowpatty piris sibghatullah swartzentruber nauert barmaids aev inauspiciously dancefloors derogated frontbencher jeetan peple glumac zhangs xrp chelomei florman udaltsov bilik maaten daithí pentapeptide weidlinger gloated shalhevet zoppo gentex backley byock meidner digestif nilas granahan roehrig kafar micropower tweetdeck eliska antedating phalloplasty arcangues trillionth helgen hasee gehrmann sitecore hongguang prabhjot alyea citylights pitchkolan ncx amirante renovators hydroxychloroquine broons garishly touhey renninger dolans larrabeiti gubba screechy neets walikale taisce gundle xfe bradney ekv plavsic coffe tingri homburger biddenham lecourt subventions kirshbaum chds everth petering hessey roundtrips muskett beccafumi faline johnstones peahens unecessarily vorotnikov steorn westbridge naida maev gourriel modnation salvington samaire staes sydnee resections nonobservant kelvyn charna coordinative durris anze vbac milutinovic menia motorik uelen dancehalls rokia breyers tudy matulino intosai oxymorons vises munshin punkish nedergaard studwell xfr markovsky abbassi rudebox estalella glycolaldehyde anuga nonmedical trinitrate challow danzhou honigman petracca steiglitz overzealousness miuccia gyuto pierini gatchell eriberto hersonissos megalania delicado orner agvs dambuster vitaminwater seye aswani toukan dbas tartrazine kinbote perdigão helstad gearchange keelty chaplow humidified snowcapped 
tsatsos mackintoshes meanwell glom phototaxis unscear somborne barthomley korobkin dunmall mantelpieces passeier rumspringa villarejo rodmell allsaints capitalises ovelar cadoudal chicon photosensitizer bowdlerised retranslation binsar malori ruminative aflalo isoflurane maxiell tureaud mirro zijian boleskine hartong clickz gerenuk kopplin gitanes taumoepeau dague industriali haverthwaite gettelfinger brassware portended jiuzhou rakhal hilberry juaquin marturano chessplayer dougher corking marseillan wickersley fengxian sotoudeh toreo ueyama mummert langesund cogbill bipolarity zylon crackhead gadlys schwindt zuleyka pompously sestieri salespersons nieuwoudt raphaëlle sukowa enerji churchgoing watchkeeper wttc mazeika perdicaris insync tulo persada ballywalter arkins ameliorative arambula rookes codetermination halai hutarovich reynie mystérieuse magnetotail ferdin gobeil lemesurier winstons naad bonitzer ulibarri asmal oure terol zuehlke closson ternay phusion solutia pisarcik hekman vulval mascarpone epipen quants gyrodactylus sevran teabags ftca insensibility qpc payin severities weyerhauser tiendas moosilauke wakened raap ziobro housel brightfield misagh harja battalia microcap dowlatabadi tyrrells izmaylov deuda craigend traffig mövenpick yevseyev plutocrats bucshon flasch kendama barkway browbeaten avramopoulos ailin aflp predock tapatio ayllon bozena thyolo tutone aiuto pruess kulemin froh chanels reul pääbo bape byw aouita tarried wfr lammi feyyaz aerc micmacs restrung calderhead omotoyossi seidell brammall jaine hurdzan cardno guttenplan outgrows marnock tresman colbran wyndhams jabin shoaf funkel okkalapa kppc rynd schinasi voogt martorana rewarming armantrout manoury anvita tenille delargy bausell rurka sebouh atonic gwd panner nixes dotage afam unarticulated lombarde lanseria fieldworkers kikkoman creteil yijin slowhand edgeware kolzig neylon altana chatburn cautiousness kinning bruntwood piggybacked breakroom bancaire huckabay rathmell vende malarchuk 
guarente xenografts cemi evolvement incoordination uusitalo tabai cyclophilin anjulie edgworth paterniti gallai horsedrawn shrien bluepoint pompiliu overeaters lazin naegle ketura kaleigh thundarr anozie daus cundey nantel goldsack hackerman prologis quincentenary kading bryngwyn wtw bukhsh rancidity borca anwaar elephantopus frasure bsam labii vath abce brunon wappel confict retyped anotha waj broadwalk geisa miqdad taxidermists cascos boatworks predetermination unreturned mizukami suposed philippoteaux cienaga ndcc insomniacs rozhkov sissener engeler achived sgg tarling microvision zoppi albariño passauer koyuk convallaria alom beanery kamishibai ruffier rotliegend lectus bernert overdramatic grevy undefiled bhimani laj railsplitters dangin akinnagbe pilsbury wgaw rotheray cubing zhurbin zehetner guanghan gluey slanderer batini quiets sannie cantora kulis stierlitz receieved howsham apperances arnaldur hoisin charliecard kamanda methylone navigant tsegay tollbooths coore reenlistment crooned tsurphu dabengwa lavatera warrap refix splashtop hamleys metrostar floreal colombani apprendi outwell gweneth wtok wittner donorschoose goodnough stuchlik hosain corkin bioanalytical negm raffish sigfússon ljmu erlestoke leimgruber andou enunciates juny skander yongqing configurator trustco kswo ketchen hydara faucons reverso emaus bheki avandaro loungers collender shoniwa sariwon legimate huggies canonero dolgoff vits tokarski delauney swordfighting blauser hooshang kulyk sollie howardian gerdemann ndvi khondaker bentalha diers uspenski benjo egalitarians rascist michelman somersaulting thommen yoff rideable carbisdale squadmates bergdoll bambous salee zijin ramsdale llantilio arguer scaduto toshifumi zhifu jinhui decidely clamato karthaus mushka boothstown rectifications hellertown ferguslie uneccessary japanophile katawal workplan attributor cumpston schabowski qarabagh utj bogusz kernes bartholomay langhammer dematteo isfa hinderaker oldaker feets teenybopper adland 
cheteshwar fashionologie basudeb carby bramsen prowls hinxton chaperoning cledwyn madonie shoppach crosspoint pelkonen gushers eyecare osipenko abler brima igniters unistar hnpcc henwick cigno cantagalo aberra cadley yapping kosilek pavlica engela sgps rustomji ketterle kreuter corect thabang sirop banac sjb wachiramanowong holmesian bartling cortis stainthorpe fumar gics zairi reresby yarrington mahd seawifs tropper camberwick sparv leeflang styer kimmirut baljeet mclaughlan fazul borinqueneers chanteys woza wandy acers propagandized berhe puah grimalkin raanan weismuller ngobeni glatfelter seila xisha brinon antao wabun fabulists puva eventid vaporisation kiptanui pinkeye aedo eavesdroppers keyholes telang jianchang oxblood hartack sepehr agrihan parnells tacha zaker sulim confortable abongo wistfulness bankas knepp exerciser malila ensisheim wvlt ribby reya castellabate quoz exabyte musayyib mtoto adesina broadminded fudosan gehrlein beezley riemersma preforming kralove finmere camiel cicoria scob hansch puligny resuscitates delaere fendrich southwater vanelli whirr johno ajil jgb wangechi kyats tonini trifu wesely aray partwork vitalii saddiq eshete unrein dureau nattering dukha ormer hydrating orpik nyangatom anjanette vilani opdycke vaibhavi tipnis prurience unsealing rockwool clitoridectomy burngreave pianta recumbents mcconaghy moonbow gasometers patpong thuresson playnow sudsy legibly holdcroft froglets oliviera fonzarelli griptonite salie spatiales bistritz soulfully yasuní slickrock greaseball hyoscine neringa chimères videoed cutchogue stergios aritz kanke gtfo kasrils cardiol paté bayada langrish sarabeth hofmans abderhalden gamehouse herbet wakefern dozers omeish barbella buley exning ripponden iev cohu burfield mammoplasty carrock inamorata veiny compris carnoy broidy barrand groundball dogu caprese genesio hallwalls kwhy tred maulit facinating lifson troed testteam steria rajabzadeh mjk stron buckleys gulko huntingtown straughn exoskeletal 
studentships sternburg weizhen plainwell moonset fings antle altmaier multicolour pfitzer myddfai getronics corfiot peggs nossek chaon ortygia glenford taughannock breeched rokr saffa ramiah mikovits searby interactives ngardmau luambo fritchey gwinner mosisili lizcano husak chavarri nebiolo grampy vicolo hancke quadras reggane uonuma amberson bouchareb metts aipc spik addtional contar impasses breward rilly overcalls soilih linny ethne caracazo spicule stairlift laide adlerstein personnally borve clarel csfb perfluorocarbon cecchetto stanzi anthonys mainak muraro kailey arnaout prashar alica leor donats scraggly rapada azima footbinding asplenia ppap adjustability skokloster torrs durjoy harpring sardelli besra hediger reddings baechler moviestar retconning herxheim intracardiac bestower afos cerumen spevack masanaga filippelli convergència ogd bibbidi kewpee junkins comotto mandlik berrill shaibani blasphemed doscher tonj adhamiyah humoreske resus remounting kellye dontae daiso monse thannhauser virological akona nasad mulya cnrt reindeers kusc doronin yuxian gimpy omotesando aymond grumley kelam rittman ussery ppx impex reckonings koussa conserver matal celent troglio mistrusts hiner kinkell asraf countrys sonai herault malnati entangles diety renehan stendardo engh ansanelli chaviano raschid jousters salmaniya whiffenpoof quistgaard bodeen reportorial aerating drcongo lidderdale comed microdissection costcutter stomu carcharocles sunniside jurg rousham obadia pieres ekaterini santaquin carrbridge croisic khau lonchakov baatar xantia humphead rawan steinski cinven lochsa sammir plaat muallem serigraph allmond garbis amcor melamid airwalk aitcheson terminuses convalescents glast carrizosa mirriam lettin montie makhoul mumbi simmie rosburg serhan arellanes modernizers gamania engish thalassaemia hongmei amirault krovanh velvel orien stelzner paddleboarding stolojan pacelle nadhim abdelhakim hachijo loughgiel lindall garches handcycle devins kurtzer bleckmann dumke 
digresses dstl paperino fouzia poux minchew glowsticks bootmakers fogelsville shudehill kelsie ingratiates grecu rotfeld guanahani zayyat rebutia unreciprocated rubberneck adevarul pejeta jarrin voicebox psychrometric cuerno chauliac sarka leves poltical labrada onate phats stowford bloaters finistere characterless coxwold micon dengfeng neuffer manderlay kincannon reelect steinacher bloks dln colker itemised puopolo computor maxinquaye batched haagensen unar velvelettes kilmany koshino taqwacore kilocalorie kosik reformats sanctuaire afes soliola yazzie arcosanti deruta haemophiliacs salvarsan libanus enedina teamtalk screenwipe icasualties ziporyn francescatti udana wolodarsky donilon neopolitan crinkly sistersville karriem rubert dodman fasih nahasapeemapetilon pungwe lafrentz mulville ruettiger rhinegold houlston carion gilesgate falconí afterman exelby preprocessed proctored rootin eisenia versata kijiji matondo cjeu exorbitantly winegardner nhpc maaz barellan zuhal haikara jostens idiomas hesson korsten lahman darcus chishty alexanian simonova vandalisation profitt fishwife vreni princelings stilley igby moschella feherty ggw flemons borio nonmember oxyhemoglobin pierer yorkhill grenon stockbreeding quil strzelczyk hydrozoans timanus paranavitana mucke sdrs daung photocoagulation tightwad astacio microcircuits delory inacurate spandana repass tamyra yopougon borlongan voluminously garms spurlin elvers chorn lacedelli deily heyy aliff shahal lifa greave duensing tahmoh atep tridgell gadhafi dniestr bronzing cantering bigpoint vonetta reappropriated garath drumpellier economizing rondonia gyimah maiava derb yacoob wackos ajala saenko dehumidification prodigality brookbank methi undeletable whelton ravichandar throwin carmyle replaytv petm drash alazraki hellqvist shellabarger spino reinnervation lolling michalopoulos goodhand ysc valiasr horobin beardyman disaggregation khagendra barkett mcclellanville gares bfpo buoninsegna szoka mcanallen zoophiles handong 
ballygowan somtimes concievable kagal maruta jocketty pinotage furbished niver ritters hanapepe splotch malem nhial bananafish recalculating uddi agms naturalise etag cottager iosseliani nerpa elidor frisina marinate clewes wfie videogaming hartly hagy kgan taris facture koteshwar ellmau isri redican kakakhel manze fedco cusimano gareau davone laurencekirk nouv fanger azize copters dhh moimoi isoline termeer glaciologists rememeber llangynwyd rostal pedroni werkman hypothesise tollesbury fahlman lanus eug sellotape schnabl szatmari rogart rusape hoyzer nupen raisings newswoman kallum marshburn barnetby kovack delaurentis sapey macosx dusautoir weightloss biostatistician ampico lennons vernoff zmed jurin alices sudarso islambouli lynndie hascombe játiva hardies sublicense sluizer dinkeloo celex kolobnev playbooks burtynsky shadowless gehr akti shringar nazon weasly ncai homola benander duncairn magris housemasters agsa woonton akahoshi astrov hanborough kleinsmith barlows jaric ruolin ewl mateschitz nonpathogenic disgraces wajih begijnhof winser grimod worths linesville abstainers prestin relm aniara disy lenita gagnaire camaleón liwen solidum poncey headrush muhs themerson deterding threeway seaspan roncero xinji jawans serifos cussac mokashi saarbruecken guelfi pedja magimel ringham pealing myracle arestrup batebi cpq darwinius ruardean mnookin tramon kawthar weinhandl bavetta poppit componentry rowetta soundalike bqe sheko stethoscopes preconscious karabulak fusheng melangell acassuso chastleton prowled elain legget polybrominated malesani carnosine paritosh liriope grisoni bernon brandberg hillmann tomka ustyugov pluna capsulatum vengefully birao bedsole lichtblau samek bettinger brammo brabenec unfactual lert battani kubis sadden slemp chemotherapies jingxi flegrei hohberg lobstein tupman hennion contruction kalangala aspros realme corporatisation fischoff ospital brein scatcherd ribblehead bookclub lassan kalliopi moonachie greediness becnel leapers lupit aldy 
kalus ashenfelter branum laister coloristic tesauro jiye flahavan silveri valspar qiannan holscher whiffs shiksa adbul liks rupertswood cockup fuzzing okpala opacic obvioulsy godelieve cué mcdc andrex sweers zumar benacre prepubertal kilton whitewashes gorgeted gambella friscia tetney apcc picabo kulov sathyu arcano sby souleyman schmooze sinndar majerski hohle imli kapugedera johnta thursfield synonomous kasyan mennin glyptodonts bickert razzy gurdev geaney pcast afz foreordained medlars shikabala wilh bza ipss rfw panka featherstonhaugh carders punker negresco eurobird brutalize ekeroth concering doucett piatkowski puchalski coeds handz leboutillier callejon reny perfectibility whow flattr towry zagorski bjorlin yenan gitarama floros aner peraino hrytsenko numark konia stationhouse fortius chiggers embarass pbms amster esselstyn voltmeters hemsby fluidics kalaeloa rtms baranyai korbin asmi puletua clingendael pietarsaari aliments lustral allami priem fedossova extranjero torresani bioaccumulative rlf framwellgate aeth noncontiguous cleyton bedinger nykesha hamblett jadon minibike wbe jcvi gillikin housen helú hesco ngema daigh warzones joas boscolo rossitto loyiso bivar sugarhouse nemir micrometeoroids hogen trainspotter cusset tratamiento sockalexis dtap aphanomyces elona ifosfamide dissectum shlaes krown lubben appi rtkl yordanova gisp shahriari zonneveld seaforths brickwood lajamanu appraises lewellyn sineva olestra ultraportable baloha lexar kedo imrul gunst cryptogramophone hospira seafoam shpeley dsca barbery gudiya pilbrow thurley aneel nswc hamai garnons negligee shandra pariol basrur panathinaiko metreon cutshaw kantaro shingleton outlives caddying apert lewine odora seligson fencepost kitchenettes steegmans sabawi yatseniuk pimply scrawling ariege jeremiad horen unchr sendings kedgeree ultrasensitive playworks pehaps dormand qmv hagbourne psaila konfabulator meduna sieu burried stellas fashir elucidations azamour truecar everynight pitstone groarke 
cyder desuetude bocian littleford fregonese morrisette fairplex bedspreads poznanski quiggle hait walbert minstead elward kryger nonconsensual broadoak diversifies fountainbridge saem stoli eesc mordt chinaberry catcott nalgene lucet hawsers romneys terrestar schlussel neuharth busabout ondrasik whiteheads shorrocks dclg haemorrhages ahlus watchkeeping vih externships bedlinog aspirator jokwe manmeet bings casner dovish coyuca lonnen chivilcoy micronucleus nobrega jakobshavn yakob bleeders informatie mallonee asfandyar roamin reclose garnant versyp mistrusting necromorphs worlders munsel loughrigg humanised sudakshina hinnigan schulmeister yanin flitton faraci trepte dmarc sbirs kouachi cack ceccaldi ocarinas wlt onlooking hoggatt daybook hiper doriana whodunits ranallo kirkheaton avci hidy waldbühne assael lawr oportunidades morett hollandale fornaci annabi townsperson palliation karadas sallying venerdì rokhri pandher stammen verlyn orgin slacklining voas wkt patru mdea meris railpower erel pittsburghers dambrot mondragone kirkstone hieronimus llanddulas salena orido redactors pollington merelli rouzer ajaan firestops mrnd wincobank rheaume biomarin winterberry kucharczyk etns hirwani jopek houli earlybird laboureur savané marras bdn mcausland gaultmillau lysandra branda chkhartishvili phh rysher wangel navitas vesnik globality ranae frankham erlbach moorends persnickety repacking sissay sulfasalazine horsmonden nury louisianan cwmbach breeland thone pryke nordyke saracino soemone buffone millea suprun transcorp safwa subburaman ethicon soulforce electrodialysis starin géa soozie squabs navo yazov saltmarshes kovarik bielawski kohlrabi justes biltz knowest colinear jenbacher lousteau mcmenamins etiquettes burtch jaroussky nafar slavka samouraï ufu quenton stanely dumay tolimir leistikow tagliatelle gelles cregagh xiangzhi siskiyous maxjazz tusd pamella dontcha traute subversively burray sutomo sideboards melloni mevis zadik camisole hamson ordure imod sowerberry 
acoustica caher didanosine topliss cotinine saes ters sedova pettoruti succesfull govenment biffa ogunbiyi bengeo farty tahari haseeno holaday fauchon suppossed kristiansson amongs karise kléberson ausra laras verlinsky concent globalising tanigaki maeil asteroseismology filburn calabaza politicial glassfibre borgeaud remarkables whacko unrepentantly pozzovivo aprd commonsensical dramane isfield couchsurfing kenes yorston laq norde enterohepatic zeger cremello folkstone parasail barel cenovus vanneste diaphoresis dpss colaianni irresolute rideal seys raoni embling garavani marcham narro pérec sterlington buffie franciso sophina liverpudlians burga sokolski kitley abban fussa folinic scarpati stude isovaleric topcoder blazo seef augusten zehntner auguries hemond hattery zevs yabbies ekranoplan gleans retentions cheal crabapples nordtveit muthspiel renouncement kavre cornavin cker iros bachleda chaldees annisa irton mings melasma muravchik mohai gedeck histroy agcaoili noyz dopson caveau myza bermeja dergoul jonet nocella cefalu fangtasia santeetlah kratovil schreffler howabout littel unscrambled ruzi mughniyah downdraught trampler alterra antea hasmik demonstrandum siahaan mcqueeney yekutiel conable geldard kstu millhone laiwu anla mukogawa jpac strenght kfh fresson weizen twitpic hadorn erddig ellerson shono carlita khamovniki chinda veysey scota grotberg djp unfamilar nitwit clintwood wansink connington humidor moviemakers carriageworks besset dewe moussy dolch ziming verdery soif plyometric suchomimus cstr swype mcrc alrady blaqstarr opik conceed emmit patternmaker heidy hydrobromide pharmacal sakartvelo aleksy nastar violaine woodmansee daguin seethe lamya readhead berhalter ambev lignans neuros ruhrtriennale crism richenda garefrekes guinobatan virtuously tiptop sauerbrun balza ladji canright cagno moqtada dundela gringa lcme grascals polebridge comission monterrubio bellugi autocue csq perturbs apitzsch yadana krajcik sandelin ktrs shirland sharara nyko 
tenseness kuito wenming geumgang kidon amout shaqaqi manke akoto claines mautz yoshikiyo jeanrenaud rouhollah torbet proceded chuanqi rochin arvinder carco moluag wickerman viggers hemric stanlee bachna acerola arrivé ugv recomendations thrusted bhimji dubroff sdrive aronberg fctc ulka cullyhanna minja kirkconnel mtj underachieved hangnail nolfi abbondanza pyeonghwa triacanthos balikbayan enlistee thous skjelbred mberengwa dalto minia misconfiguration neuvirth teahupoo cande kiker spritzer ziying braathen giuli guell tbas sibilia tranberg lenti astigmatic dialysate birdhouses schoolyards dolinsky norelli haemorrhaging tinier postl bupleurum crommelynck vlti biki peterlin ezard kexby knackered olaine herer bardales primis shorland cvitanich kamien varicocele gerstenmaier tippetts actionism brusati imbedding battiscombe ferals bettine czarniak ziglar trian hiptop botos fleabag akey cuisinart centerview tryscorer tilma grebennikov freeskier dannelly prognosticator backen refaat blackcurrants joughin ruffolo wildomar wtvq waghmare rcvs platel klochkova repitition statesmanlike laro nokesville darwich pattishall heiney hospitalists dazzlers kullberg kealia verjee gaieties hibernated lansac trendiest dejen androni whitstone montechiaro finkielkraut besmirching bluecross zampogna poutchkova leoville favorito norine wheely coari degray sissonville hailie eveyone afpc khasab sieb sumin deconstructions rassi birna deprogrammers carhuaz bulcke wfe nesson standers forewarn implosions gervis jeffcott denley powwows biopsied bedes phylacteries wastepaper terrariums hedgesville knowlson borodina orsillo tautly multitalented sherlockian dankner lehoux bürgel lwazi shafiullah launced odero saarloos lotteria redish jammys groys pecc ballynafeigh favorited lactalis henschen discrepency faily delman gisha nationalizes brislin carmax flavie acclimating cruceta reamed shustov berstein charcter adkinson neuromarketing verot bitsie portmahomack cowcatcher granfelt unmit jansma annulments 
wellton abuelita kefaya leadley whop diference wvla brosnahan rozonda alcea pardner aviacion sauget jumbie philipines countersign fukuchi otoacoustic jopp pilning fanjul ripostes kisko soofi openable scaroni worldport kvitfjell uppermill ruden nitroaniline gwyer babayaro chanie dicillo dohmen excommunicates rendy issele tâche democratique gaidheal zagel seghir stemm xtv salopian fisma rosenstrasse katzin beque bruichladdich breckon ihave hardnett ratchaprasong leavings lintu currywurst ilmi beefier feminizing gamechanger relaxers skripochka claerbout ventrella outdoorsmen yadkinville nebres vélib arrgh astons slighty takala stears luuq pawnbroking pignataro yazeed rankov landcruiser gurnam harrovians bultman schussler lewers spinotti dealy turnersville kajita imaal jobrani dodgems jaggies pleonastic gereshk piratbyrån fortingall melchiondo involed vanino repsonse ggyc querry agera meibion nacke quic wedgies farooki foor neffe fitzwilliams redistrict dejectedly ghadeer madari dolk woodchopper mebbe fortymile brants driveability lacotte stargaze intolerances gambarini alaux stifelman octodad pugilism deroo almut beglin sunfest lleyn khorog chesaning tfb halfwit budrus zarvos gengis mantovano crevecoeur terell oechsle entryist whybrow mitigations mootoo marcovicci sighisoara faenol fundatie loovens siteadvisor zaky edenic nonjudgmental obici ftos turrill nbh silverstar boudia curcas stuhldreher scaletta kranzler sankai textura duric glenmary avgerinos virmani mutebi autodidactic triggerman unessential daiane mispronunciations cyangugu larbalestier coar bruntlett sybrand vedantam ghettoize milstar masip capitanio xingxing peppone bonhomie headcase brinkema riolo defintions kiyan alkon girding tão piperine purrs druggy wkow boond sadakazu absents willott dongling dicking outcompeting tunzelmann lophelia pagnozzi zagan grandfield kishenji minong lutui tôt landkey dumbell paratrechina antihypertensives unsurpassable kettling pelino durn smoochy allover hishamuddin kamte 
goliaths deichtorhallen duthiers metanarrative diammonium ligustica valland coie lathen windtunnel silverbridge bartles intertek gracanica akopyan vadivel professionalizing vanags flanagin pjn armo preziosa narum pokerface baross schotz jialiang kbyu prasher emigré unskillful diazinon hisanori xos goldminer numberous darly icpd higgenson adjoa bsx raphaelle grinker caressa teemed gassings nayeem sliva vendy wintersburg mathangi malaba montenapoleone fidell giggly amarasinghe balal depósito fixates commercializes reeked bobridge mackell klain pennacchio causus nonresidential litterateurs noahs pevensies striscia tursun mimicks himalayans aimlessness nothern croisset quasha thistleton rangnick pomroy ahari pinkville eddo jaitly meling onh metaweb uuk palmateer madworld palka nure wastell yocheved foldes telescreen berzon nesar jiuling tunnard gopalkrishna hursti potboilers redefinitions caze dinerstein jetbrains kalamaki yufeng delbono cartwheeled heartrending plancarte narvesen ekow champcar gallegly ickey dayley guyler gettis lodwar stefanowicz mavuba broderie insouciance rutu milham penylan ostrum michuki yaqoubi hongjun makaya korup hollymount astd pervy yanhong recapitalize odegaard hunsley unduplicated hwyl nvda helfferich xtp jimmerson sucres moonfleet nixey fayzulin schobel sajad barzee brutalised tapit southcoates aquis repassed wazi présidence beichuan disconnections leadmill mezan flexpoint mayrand staddon gritting pedlow boericke hutongs parvesh racho lengthways touchable enemigos lincicome intermarket baffinland henkle communicational untrustworthiness garduño kyger arano yopal riter leitgeb ohnesorg susta chusovitina fortugno hrawi ashling mortems suddaby economista burjassot besom soldierfish ticheli periosteal kindy collectivists alcopops briner fexofenadine commercialising bysiewicz cocacola alize rwenzururu abdisalam sunmi cottoni gebru notepads fabe cryoablation grunty baart achaemenian floberg earthweb zgs jaitapur jamilla taula rivermead crouzon 
bottone ajijic saltgrass rashaad gorner monem electrocardiographic christianism yoshizaki geotag pickfords laroy experientially resizes coosje haselhurst toadlets westhay autonation novogratz secas ecosport balneotherapy gilje smokehouses sudhalter whirlow azmin menagh bladerunner cigarillos buzzcut martinico vivyan groenewold skys biryukova wannamaker aberbargoed springettsbury goulue pastoor ringelmann tubitak chure hueston phogat securite fouhy gambero isx jibran latitudinarian zampella prq regalbuto turizm seide beltram teakettle etas eschmann galled markale pusing bisgaard hyphenates afte eryl crims motocycle darris kamaliya palmsource ndaba akros morasha backrests lemen bobrinsky fradulent wraparounds riano jivamukti inza bluestockings foible kazuhide tnuva cohmad urx gurganus gunwalloe askatasuna youngish charlottes constitucion woolaston indentified imaginasian apung ramdan drouhin qahir lascano semashko counsil oiv fleig patchin gedächtniskirche cybercafe ixo cooly langmaid lykins kzo doell harrisongs pegrum mourier zix lischka jondal sheknows forestlands dirndl westferry kolyada goodyer menahga sébastian slydini recalibrate bashall honu shithole wardrope beigel penacook fecht graco irvinestown dubi bichard counterplay saveri langmann ercot tejal shaybah dipo prises takur hebgen idealogical heijningen filthiest corruptive nettop ccta davus hoseth randone pumpherston pary gubbay susac samart dessena kittner fahidi lamasery robilliard actioner armleder ravening navios bresnik oxygenator hufkens creake tuberculosa faluja morphew shirakaba bradd ojjeh iskakov matinecock bellavance lisberger stingo satchels baille naicu halff slayter radiotracer newin hometime flashbang varnedoe kawczynski llangoed azia unknow nordenberg dica kepple maitlis omaezaki papillomaviruses trashman lutfur osem chesire macharia niida paharganj helfman lennert moggi oglivie siderúrgica bectu sawsan mutsuo capriciousness gibeau tandan compagna chepa remax llantysilio silberg euron 
eurocodes autocars untempered pseudomembranous kightlinger brainstorms mondol renken llanafan kauswagan coqueiros plumpy pahinui resealable airlifters arola dhone pilares mondaine imerese rokka burfoot soens tuffaceous metters crada epically stevey martynova veinlets guiliana kenefick angelas nurme baño underpayment frankwell robalo hypermutation hiyas koules chellam shibui mirzaei edgings counteroffer societally friedenthal nokwe housebuilder corralejo crudes brunot abes shahri hjs vigeans debriefings bancs guandique bentota vestar mindgame analisis encom larkey reconfigures unostentatious schnetzer atasoy gliss gegenbauer moistening bushmasters laquan skijoring xiaojie nebulizers gco penicheiro worstead scotlandspeople nark tasseled teares westow beachball ludek kenge tapiwa talibans wanniski kykuit palitha gainsay aamot acidify jarallah cantv goitein neuropsychologists desaulnier perignon korac abeylegesse glinn aptenodytes prps miyamori aldebert iccc nickol fanel trefil meechan literaly hertzka mahsa uarts sheepy hougham matlovich antier tricon adonia rrts obeida batheaston unwra wrecclesham penalization abiraterone seter nicholle vermilye benhur sabaa videsh literarily ecodesign makiki amona doong msns joti soyeon neryungri polyketides bookworms handpiece schoolfriends chinaware czekaj takura greenhall verrerie yoa seedier shigellosis wachman wardensville atlanticist murck krh microtechnology windhover dawnn perplexes geeves varejao ossana gjorge outvote karapetian cdfs wcos fahrenden fumihiro montz davinia muraviev sadullah onozawa qahira fancying helbert gazdar mbanga shaunna berzins tretter commixta transcatheter alphavirus vilà jarosite baiters sempé illston lodder samoyeds asaduzzaman secuestro varey marcali onselen dfz vaut vaze labella pasquarelli chetrit auston seascale glenbuck trien emri miniland glod fetcher formlessness lepard cockley dashon umble schmidts büchler goldwing qalyubia birchett varadi vernix toughman aletter samim zehri rumman bortoli 
schillebeeckx weinheimer weijie tantalizingly spinnerette borgnis baldies nagayasu rejuvenates gorio pigeonholes gustov bankfoot perfluorocarbons mirafiori whot shcherbak ruther beermann palaios posess glamorizing blacklands haefner sassano hoeryong lehtovaara papanikolis talya claesen jaouad blilie thakrar wordfast checkmarks flatterers bettauer shaari tepi soboleva goarshausen musashigawa airtouch unhooked doorframe ligang dossari boldenone aurilia skille rectifies calcott obomanu dahler cabezon edgmond deutschlandlied silverblatt jungers lindheimeri whitsundays marsans sionko seaburn musse rodeph syverson didenko microbacterium straumann mmix evaldas portero canaille dmh turahan lemelle freydis feuvre rajarathnam mastenbroek addlery cholos schwarzsee tunie loyn gripens tiptonville pigtown tarawneh innerbelt kellers navman haldin willm mollenkopf doege stoneygate hadwin belhaj parapolitics blackwoods rimming espelho kutia chaperons gattopardo zey lugos wycoller residually advil rieppel belsey herzi sanely carbuncles zhabei heartstone hpn blotching yuly texturally topalian quandry erbitux nesrin rbv bellringing lesnie tumults sylvanian tastelessness somis kcen ferpa ostinatos emara whitefin djermakoye recentness pangma seegmiller barlay olenka alphonsi avlon sadaka glamazon nimet gmcs siebels cheatwood rootworm deltoids llansteffan stiebel fessenheim salicin nbtc schlicher mediaflo lensmen zingaretti cartloads steans werks nattawut sambath anwari delino yardena pirton caccamo hsx mukhlis gdh claudi adjournments airflows rechsteiner golis styluses carusi revkin misgiving masumoto denegri sheibani maysoon uvaria rohi schoenke rheinenergie yangquan kuske parquetry sumantra brisset googler marzook mcconachie malese siphandon hikkaduwa klitgaard mobilises vindu strause greenhut tcx adiyiah grischa jummah koranteng rankle noirot éclairs smallbridge mecke tetsuhiro doxylamine choplin philadelphi rossborough nitel oxybate meulens trolly contortionists ajas pearisburg vsel 
kekana wharry panagiotopoulos insureds offie curtainraiser expectantly soton correlatives rohitha albala duijn moeaki gabardine robach kaiman dunnings rhn homogenised inews othar outbox resolvins retox deciles salceda boated butzer allicin creepily intramedullary akumu mahiga roulade fimbres nangpa lki kplc huarong allix djamaluddin petrify pietragalla precised fazzini antoin coinsurance ambrosino syna foeniculum cadgwith primicias yilma sherzer huilin kilgannon preternaturally eeshwar ladette degaulle zetra mainlands neediness betten tsujimura chafik argungu gossan mazzarri lehmberg whimple trackback carrasso machugh grumet natuzzi turducken letchford andthe izen hettich inglesby blenda powderkeg kilfoyle spooning kitov tyneham vatsyayana hartburn spanghero haldenstein nassos renacci barkann proforma sizun wras shumba featherweights houvenaghel péry illiad disposables moghal odey rozo showstudio questors diehr jesusa loehmann búzios giblet rajoub nqr shevtsova lilach locanda devra somekh ilpo yoshinoya bonni colloquies vinit biddick schaffel straightjacket neighborliness rolnik mozaic boomboxes trapezes haimar pupusas spack hongyu marketting konecky careens bvf sadah malpartida mascoutah tronco grafstein cpes warthen terian racoons pashtoon heidinger faha hilscher olmeda ravid letch winrock cailliau behavour bergères goffert tcga boulis forna heartline colorama hongyun lindsays edmands fujitaka alnabru gorenberg sabudana oklo yuldashev fugelsang zimny kransky sutterton razim hypoperfusion deysel malekzadeh wallerawang shifta myrtos yeye garnishee niçoise ceric kwamashu ibbi söll paglen frase dozes ucavs tottle lcmv lehninger yodels serebriakova notenboom fabini malacanang pollie zigzagged wvf mahbubani whins fuselier amchit sextillion firby mailbags centralizers cityview multiculturalist farmboy scandlines kostadinova peregrym suardi tonina tornero redlined ebg kotkin bernick religon kifah bartholomews hollyfield relabelled hamin fiscella bednall genuflect gussets 
mackney qualis fena leicesters tarnower beatties manella seroquel tatang lassos shiron closable kilmallie sighvatsson woodiwiss boscoreale ghur poeticus arepas loury poha dreamgirl pboc brunne chlebowski llangurig mayom vokey skillsets kalatozov koom salamin bottai enfold scientistic linglestown bhengu finagle csts crossbenchers incaviglia dutertre bonura perjurer narrowsburg pimstein wellner particpants jrh soula protuberant tianchi greiss teper daybreakers boonmee nymt allroad kornhauser mouradian cowlick jaffri lnx tonel trimper cassazione schoodic capadocia houliston kalikow elfa bracklesham waul gardom aroldis mccandlish blurrier aspropyrgos bluestones surnow blitch dislocates stss aldham childre grindheim certaines elsmore finans tjp marreese ropartz ziploc eyewash wassel karlton takeley gelabale wazowski andrij esajas lerangis avvo finma hongji porrata secdef divestments mouriño sapkota washpost uitp shapoor tieton khac recross progressiva germond cashtown elos chemi knuble kusuda miia romanowsky genth corrao plancher hommages eastway dessy klaar verklärte mccallany mbour mamadi ottakar oosterschelde bceao kooijman kaca eeles pandin saabs pltw zywicki ppmd kilspindie rackoff medigap faiman idolization pepetela dastgerdi antipathes edden gamebird darsi fédrigo promed raupp tabc einbinder chucker coverlet magnetoresistive preeminently piceance hegerty ikg charnia adaro crawcrook pej knockholt multidecadal dvorák littleham revu amoa boalsburg witchu cosker fortnights moneychangers etps selinda djw synology giannotti adiel santanna lindback smoothe diyya khazei calloused downspouts huajun raybestos tanged charmain repowering gröger lukins jouster rakeem upmost facco kepp fassler bartelt caiola amnestic aquilini gaos kothe sujal hornbeams faid acually bergl eoa dozo mozartian avow biaza scerbo guerino yoik naturalistically sherrif paneriai unlivable relentlessness balmond louey mikhailo netl ballymacoll zuccarello stonehurst reserch wjm yogev hoggins smhi chuff 
satwa jeppsson khalip eversholt croteam cuckoldry roqueforti tintinhull mze yelvington nightfly headcovering siyanda abdirizak monvoisin meisenheimer womer mocvd thamer reqs roddie eduction unresectable loverly panino housesteads schoolmarm teruya ilyukhin rayborn pricier vanderhoef unreasoned leriche consecutives titletown midwich zmeskal livetv brei ibtisam hdls sadhvi agalarov ascd glaude dagpo kedumim koomen papenfuss khn dfcs varekai crpd giegerich shepler yandong elderberries threequarter besancenot hawas brawdy mallary memorialise krizia ozturk akune electrabel dorri nanoengineering yachtswoman nightwear gulval emagine beause almalki anyaoku insulative itic dorosh incenses econometricians semmens teppanyaki biery cloch juggs blonska johanneson tabuse urschel drabinsky teixidor iwade overfed perfunctorily scallywag extemely yizhuang damascena djivan oppostion taprobane baali komack zezel gire septeto skulk telman munnery splitsville ispi barbarously davinder bartholomeusz inus zenovich rangeen tsiolkas ranalli snufkin robilant hajrudin shivered bakwa holeman bugaku mitarai ritzenhein banny cuttino fortun sothebys goodnestone henize grai garthdee giengen doily altiero elxsi kalmiopsis emachines johni sathyanarayana menconi foghorns perminova berniker bienvenida afikoman onesided kgsr stegodyphus yatomi wajah hypoactive crms decleir lindbeck pentacene jahir staniland champetier arlett rodenkirchen borletti inflator kensey pinwheels thurcroft condotti sybarites nauticus bartik sepehri formentor umbers settimio almandine hannukah caprine poinsettias mossbank krisha ibou dudin karlyn tetlock aghaei watsu prerevolutionary carletonville scrunched immobilising zabou lsms kagwa stensson margreta countercyclical tume kunuk douanier dualling llanuwchllyn gedan domme cnhc pythonesque cameraphone whisnant herson brancion teevan miletic overboost duramax grünenthal intentionalist megaplier alamy rluipa wingerworth machars scaturro getgo buckalew gambol maniam kunas uiv 
mockford allina kair depositi pantos anstead bufton röttgen sutherin vacio retegui seghill gobaith anba defaulter nahimana pontificates kottaras aló krogius laureys scoonie pereire bastiansen scut treiman dhaid rangaswami limmy amihai overcompensate itslef wtvm thorbjorn karone meecham baylin blodwyn ultrapure avil skrew kouzmanoff loiza factortame flatford gaelle belenko krtv mcallester firewalking ashli proppant socarides juvincourt anatinus poneman shaibah propsal mzungu scattergun torkington sembler mahayuddin bickershaw tuwhare mepc lockups tosar fuz gaitskill zozan smartlink tellme kerlan spaceward naypyitaw cheban trolltech kristiana szeklerland elburn knapsacks coronelli cornball carryl vernus bentt ransley ervan dobrawa gallan sunbather eskander pineta rirkrit monea bedu psystar northaw kiala fuensanta tigrett rubby usinpac buybacks keilani isuru odil perkowski citect millbourne truvada dolev oishii naby hollygrove kronenburg villarica bobkov riffage baldor tpwd babbles irremovable azimov leapster cherundolo criquette pleasuring inhumanely redgrove sagana cabas vujovic lillingstone shantry hooser bendixsen rzepczynski mandawa llangyfelach sniffin katcha sirmione boyardee adnams radioing spirometer promiscuously alieu millitary arseholes guasimas leauge titford lenkov diked bazaruto usst addys dickmann estermann sofres dishonourably aryo taters snoo caraeff handier ocurred jiujitsu kyllo tzahal qambar jackeline brkich konnichiwa wanta anesthetize mardian dylanesque dsquared clowe goddards drog tcad baroody nmsi pozarevac daimlers hootsuite trully phythian steatohepatitis germander brougher marcelline doas coroico peroration unmake immunomodulator celada inlcuding bermudiana museological tarriers freis gintis borsen hyperglycemic saucepans stathopoulos werthmann emprise morefield yulianti dreama makoy khasanov pdry kalian mongstad leons hairballs mazzaro quedgeley hassim shlesinger upwellings taigh sickeningly melaku plethysmograph seidner cyngor maumere 
kangyo savaging dycus laryngectomy laminectomy antonette mangone flexray penone geraerts endtroducing negahban smertin sarthak ftes sawford balgowan hestercombe athiests cbfa roisman dafis seens stonewalls injuction pressurising cinnabon inaam ckmp ammoniac felshtinsky kevjumba giustra hhn desanctis cieslak miracosta carlsten bloodflow talau calamia shaohua alala fluck antrel vicaire budejovice gumballs zicklin moea rupen haimovitz franchisors ekwueme fleischmanns pomersbach gilsenan arbutifolia frates rahmonov accts wakening trundling lexico manzanas keny gamonal finchingfield cullybackey disbeliever mittermaier unseasonal buco schoeffler yarl sibbles thomopoulos panke tropicals cleta marchmain ehhh corletto abdulqader dubarbier polygraphs newspace elbaum noller golshifteh pavlovski velloso plantersville ghawar prespective engholm djokic paintballing mileti pittsville baitha begetter obdulio zappers jasons wyfold inpex calderoli fanton contaldo unevidenced yihe alexeeva uncarved nurgaliyev goldfeder caciocavallo obligatoire timbira showunmi imri corani honglei mangers kellison obtainment condescended kerchiefs falkvinge canzano boogies pierret cotti ghedini kanstantsin croation ogbonnaya perrilloux bovver fcuk seahouses barabash revitalizes licko multitracked kyber palelei ericht peyresourde bialystock holovaty zarnecki vecernji yoffie predappio lapinski mechler oncoprotein greatwood steffans stubing champalimaud forgacs zmp lanting kinmount payor färm lycabettus driza salong schissler copperweld keiper nandina fronius hoffen scroogle keam majken pakis bergenheim riester zaran tichon mediavilla ulead gayl balikh bellvue pullbacks eurospeedway faletau silvaplana libba bendtsen snegurochka mikhnevich cutover dayville aymaran szczawnica oxcarbazepine histo consoler morre chloramines leontina jaunting handwara keerti asml colefax schaake polynya houran maig nizhegorodov pachira ticats musks mabu reiterations unprofitability conciliating rasouli tudhoe dexa tannat 
bergenia frann visvanathan tayrona ormat sachtleben crusell desmonds pactio nefas homerooms stogumber plakat aranka flamers forswear esomeprazole loj masterkey problematics decherd bulverde auldearn verifiers fbis mansu egoli kanaks btz ivre arist filleting undercounted transportability changsheng algún edilson cinesite morson cacharel swiftboating wehby ottendorfer rhyn interpetation thohir tomma jumah luhnow caylor mathen wanat sedates fredon deprave flocke sivs eslinger llangorse mwv meletios tanksley kohno middleway meisler habegger nevzlin yakubovich qorban woodentops driveable ecat refiling zaiyi hardener ferihegy treon prizefight agrium jlpga omarov tatsfield superlens pmqs speedtrap gongshan kesting graviano ehrc coumadin uspstf manjiro twizell medicos jeeter culson utcubamba hubristic dinette recipies gpss trittin adjacencies mythologists orduna litzau jamilah dimaria pallam grandchamp nutech isoft joppy lucedale rathmullan darque abrading deekay vlavianos bichara hiriart maerlant semrau leco kallawaya migdalia fergalicious caravelli diavik sorgente muscovado edwar allanah sanitas daumesnil nethergate innovatory kailyard tacvba preska zuill inverbervie expatica cubie rosti bourgain carisch cranage streatfield gianola immobilier midlanders daskalopoulos korneyev unbid bhit nischal nyugati vxl emrit dientes susato liebesman predesignated taynton ksfy krahmer misdiagnoses schellnhuber nebahat inefficacy bartica gasolines stuffer sickman lyk mimoza zopf intas hejira parati ohlman jfe tlingits gards kitco vampyrum roşia tekulve aded caudell pearmain mulana enm mutilator beregi landgrebe mitchem nyclu oakhanger discectomy horsing delinquencies dykehead mesothelium saltimbanques greyback opentable ezpeleta sundal moustachioed tendance ncrp coid pavanelli leafield rahnama zins kilembe biodome zide birri secretively tullberg greenpark newgen thrombopoietin deflowered alit rasuk sposato scaphandre bobbito anophthalmia paasio sgw refuels turkistani blackly rine 
spondylolisthesis siddiky guarnaschelli loompa brighty winebrenner fontaneda nirta geraty kindig hinglish tuscon srimuang judder kisber dorade ircon elease unbias indianised glassie glaisdale angoor wdam patagium khek boldrini goldfeld fardan prophylactically eliasch senselessness burim lonna waaa equivelent rhia mbytes tincidunt seamark unbacked coursen psychobiography chitterlings accessorize disillusioning nrlc chiarella hirschbiegel vehicule cullingworth emps worser tuftonboro mevissen tamaryn lekeitio chichicastenango goding nivens saloth ruairidh radiall moisi kolesnikova frisa brents tetreault lievre woonasquatucket sharri zhaowen eraring yogen mantee northsound disneys dhafer kerhonkson terryland rakt kruif oheb wratting usutu crossford honnold heriots gathright friedrichstrasse standlake podebrady roover icrt clawback afgooye cmms rosano hutches immolating ebling basnett sturz kosch cripes mcgilchrist guelzo hagemeyer fiermonte articulable stefanou mjpeg hederifolium hadwen jawaan eyenga wissant sadikin khayam rusche overinflated dajuan messily olesa ukil lisinopril mccleod sasabe dresnok asciano ingimarsson bawl jaynie dockings rykwert nuha mosab biopiracy scenically carrolton playtech chiropodist escombe godwyn alloways bryndza perambulator burtsev pentraeth dillan insulations gritsenko quantocks sharaa junck coolman baltiysky ibookstore artangel casevac blankenberg aymes reformulations wayfinder uncatchable elliots edholm kazimi swissport chci danskin sagaro spanbauer magnoli seneb utti mctear skii klemp morillas skah asajj tencor sqd hilke abates bitesize outpaces northglenn hirji wolgan freeriding ranadive crooms cifor jennifers mixson omfif brasilian epcglobal shigehiro seatrout jeté helmsmen gccs fordney polarstern wbgh waimanalo gaca manai loopers meanly yevgen chaucerian cavey miserliness carnitas mixtur positionally kietrz verea mikolajczyk astras cocentaina atton schoenenbourg najin xrx bawls quenelle bakara vainonen plumwood balikesir sonero 
xiuying stagers kamco antimalarials burder counterprogramming hoas anthes ajaria zapater chulack wielgus africas troldhaugen pasteurisation buccellato anying ammos danjaq butrus cotillo warmack israir eizenstat exford breunig vanhecke afge riffat newtownstewart suburbanized granai scrounger kfsm novozymes sendup snookered deposal brachetti deramore motoric braw malew dearle drinkhall obgyn coalwood starehe polyphonies wolak trenk delysia ibat arhus pekhart panchita kobel swiffer léoz dolge yanowitz karley widner rsno bertheau nnrti faad disjoined dreg hammams edwalton budrio nambaryn qul tanihara kettani stanco csav irinej francon zoopla marinkovic orginization newbuildings reinstitute mesmerist rothken dragnea piccone shirburn redesignate odlanier flameless barvikha jasminoides lianyuan microvesicles wannian noyola sattari marjayoun ikettes prouse lorson bouldery yema recognisability sternness beeswing araj dallen mkek transvestic sherrick qte cheif sigitas ulee unendurable amandeep magdelena ridiculus michelago cadc eoy harmfully galama alayna rafaeli mellgren penyrheol mppc maidenform priceville prweek lbh standfast gallaccio broschi rainsville elefun dreessen andenken maunga ltrs phoo expositional salesi cobridge stoical amondson bessant hillandale slunk soans mcammond nalidixic appliquéd eraill todesco risom latife alier gdg cavalierly jegi renningen capdevielle remorselessly swanmore mindblowing bixente simpering celski ramsons maysara suwalki dominatrices pulsates rollerskates antidiscrimination vuoto kumquats ceibs gadret dmochowski assistantships villatte duhart nephin vlj newbuild cpds kokka lefkow strossen impelling froyle corrick disincorporation cultism tapash chainlink bédoin whca sloven kadian fortifies tonderai sedd dangjin shufu achilleos nugaal jerusalems eggbeater homecomings mikos hryvnias osedax rollinsford hornafrik uresti barrhill trustor twala hohenwald epicondylitis bensouda uwsa naohiko noji ennen kasza sturua lakay megahits toothaches 
sahba scorrier papian uigea jff brenston aymaras hariya qaisi cucuta oedipe surkis dmgt lacivita beziers meilyr daylife attles godmen unactivated reham mozah juliaca kelek noordhoek scouser ljuboja hrms clou farsightedness chelonian plasterk schlangen meiselas cafasso evenwood osterwalder veenker examinership anesi toguchi icid blairite expn farinas nanodiamonds roschdy escb rigotti shoplifted katinas vitreoretinal cartonera delocalised paining yct birching iparty shahrir urmanov firwood cocorico ecms turracher pojar gudu caneira hardings unplugs betemit digressed mohaqiq komondor wollenberg hiltons lafita coltsfoot lecia leitzinger cayre citibus ndna naouri peole bowlmor edde modupe irresistable berin limescale imediately wiggled brandyn farzan dolent scheving braca overextension angelotti sephton zyazikov langseth carie scappaticci ridgebacks beled artemov scampia cheekpieces keping gatepost coverack moneyless bramfield massaponax chaiwat screwups dyana brooman lightheartedness gxg melroy geocache vaccuum foxing otologist masalit opinel branfoot drakelow mingyur tennents catcliffe bandaid bedpan uksf listicles roseana toolsets hoblit homelife iads ntw beumer ghorbanifar crapping endrick hirsig mineshafts maktub antiangiogenic kaftans sulabh urumchi copyboy deogracias yesenia sxe anzures frv electrocautery hupehensis miltary leappad albor clypse remanufacture ergogenic toubkal undesireable seroconversion lochridge mukhortova viorst fouque gelis flander ciner polga morikami mayest mohibullah chapstick monkish fanciest itogon screenful kondracki bagillt lyor crucorney intercoastal televicentro poitrenaud lynche adney berdos froma qit undateable aliveness farecard photocatalyst liason rayn rumohr demotix bristolians harleys kalhu knifes tilders sermanni laurant openvg hileman odongo constructiveness jurich guincho meerbeke laibin kaena hendri cobbers maese leftenant fargodome reappraisals firooz bukha schroedter oxenhope fausset wilkening calarasi lefebure automattic 
ringz cheonggyecheon geschonneck gutsche migron pollutions hdo quadir yinghua mahowald bajour berntson rtas abessole adex ternhill telesforo crucifixus spittoons sahibs amponsah faucette nisp dunlea unsmiling postured dragic abergil bizarreness hauserman conection experianced furl liverman eule ayeni mumuni dervishi bavington oladele copulated lambir sindall ruskington mccains paralelo millisieverts snowbowl landells dawyck mutinying slic kromm baroudi taake suryani bilsdale servals quartararo antirheumatic hoogerland modiin brog aigas overdale abdelatif journalistically gaag bumbo fttc bustros heartwell raduyev edutopia prodigals varinder excerpting gabulov dobermans stussy zud lykov pastafarianism linssen louro usasoc motsoaledi balis tzachi kostek linfoot azmeh monzo rössing epaulet sweetbreads abdennour nugatory schuback dubiousness helghan blisland openreach yakobson redbus albarran cockrel xiaolian ujiie fazi prachya hydroxyurea laisterdyke emollients melvern holditch seea trimsaran orietta uson buchholtz engebretson walleyes bryncoch lifecare persuadable minurcat athor mcquilken unregenerate gesticulating maurilio gozlan hulland hasheesh rabinovitz bmmi pawleys begleiter tutak orex horschel tsukuda aregawi monges abrashi sopo axels künast harrowgate durdham blanker pratfall cerato fults sherill stenotic priss spinball bremridge terenzio thamel angely tufty workboat jaccoud marymont undernourishment baishui castorama newid grimmy torrin raincloud coalminers carmelitas mancusi mooching slacktivism atlason sijie goytre smdc antiracist heaston geovany vidovic sfinx mesnier liliyana velis pullovers soluk cji benstead comentary thamm fluegel agner disburses ncqa photoreconnaissance southtrust bessels insinger forecloses banducci naraghi lomography ventes rareshare romagne hutzel visioned enalapril schmeltzer snitz dxs bearingpoint airmotive optra aswini cristol acacus shanno tranquilize klyce tekere edified postell rasnick bucchino instantaction cervids bsce 
bedhampton horsehay rokos degroff forkey bierk nuzzle elyakim rosarno cloy sygma krautheim botataung parkmore kaillie paratore sitwells tricomi firedoglake steier menzi rusizi identifed swerts epro conviasa hardheaded rearrest laguiole boulerice countires porthkerry mccririck schaunard petina murrison quebeckers weild burde victorianism ncoic porchetta wowee lddc mozzi weeraratne zarrar spred jayalath genedlaethol altha overlayed unwraps mythopoetic wriggles neros jackel calheiros blackler bournes uzice prestonfield touchpads turso umberg rohwedder bouboulina pepy milnacipran granot zoncolan bensham gondwe magnetotactic bashore jethou conspicuity facr rosewarne placerita elbrick ehrr machain idj nabataea dilweg schnauz myuran melberg corncobs acera swid unbind melkert moodier kiryandongo coverups goodwell emporiums melphalan manitowish wame dellucci baltimoreans mountebanks ceredig hellesdon untenured simango codirector shamdasani marygrove dhaval supression finchale windowpanes csci mutki fitte mcpake bzh lecker dexedrine civically mamanuca shishmaref myheritage patricof mickleton marianos gizi superflat geffrye rubislaw trenary teeman arakcheyev papava beanbags wordly asot glore jayati stae togiola nongoma govs aroung dmas volodia musclebound yens sunley hyeres kluczynski sury pakong canottieri egb essiet crapload imperiously trouten aytes swint arbaces karva fontainhas arrigorriaga hydrokinetic zalla handpainted nccl tennakoon tankerton bouchaud uralkali zuccarini vcsels rawski presidental woolos keratoses laymans rydon sneetches espalier slipcover flather bechtler muddiman buynaksk marcoses snowbelt hanadi winterslow lliw bockel chollas reile bmibaby nodak zingales cabdriver genter blacha unpiloted karalis elmiger mcmuffin parbo dealth standardbreds casaus schneiderhan manriquez dresel maxamed ucil hemorrhoidal nubira braatz coercively stortorget stirrat bewitchment debauch muliro nijjar krautheimer bibhu otgonbayar mukkamala oghi grindleton ipana perzel tidily 
tohti dooh boroumand modernizes wangchen crathie peginterferon quanell blinkx goulston sê refashion efthimios linell sillinger politian broughshane peffermill merendino tisei skyguide anfernee dirr pandorum barraged activeness oilwell honkey stenehjem guarnere japonisme rexel tauss reetz onesta amendable humiliatingly verel tassy baathists yepez hieber laytonsville idealizes bilsborrow earthwave kummersdorf manríquez russek swinderby zieff halwill uuv knaup fazzi wmsl empathizes copon ” ermete hayfields twizzle dominey caveney shobdon bisaria dunlough theologists giudicelli hollyhocks deuteronilus mauney wholely rishel loade bivalirudin steindorff salati korengal grazian humectant zoysia lamarckii archila anusara ndd copen bolad dorthy jacono tapulous limpley muranaka atomised dkar smoulder doorposts daveed paperworkers kadivar qosi vactor wolfington greyness oduya amag tabarak greenhorns polynice natalegawa pocos voluntarios acci rescored pastorek xtube weligton gloeckner errazuriz racan homesh marichalar disbrow relined abqaiq contibute gamerscore antithetic scartho devolder pseudonarcissus pezzini zuidema dcmg cadaverine mulliqi furusawa colyn undoable shapingba terrett chlorotica creţu bacai ncci feba dyanne elbogen feock golon robosapien varndell artley sketchily guilfest contex electrowetting mazzotti wliw karantina channahon melnychenko unretired canonbie dornsife avrum barathea villans syjuco watchwords shenwari matea kamdesh tannishtha linaker limoncello eephus manifestoes raboteau unaesthetic trouver pertec postillion simler chatr shimmery pcps genex govts rummels gheit kinnon elementaries reengineered padborg tanishq fattahi quimica tiryaki arjay retch anquetin serama rudine nasolabial graduands polymathic kadugli dks westernize hickton gilwern cortaderia entombing aptness commerciality showest bolkvadze sbdc wyszynski longi polariser couzins artisanship jsow exar htein sumud nonprescription omgpop windo bindman grossology exhalations karonen potamos 
pudney pleating mischer trevitt impenitent beausire baltan nfx mastoiditis calitri blogtv aloisius entine tuinei tahiliani rakija pilarczyk hemagglutination perreira calcars whing wfn corralling woollcombe injuns varenicline botc hoseason cernak roohi zagged cairnie dayanara skirbeck hourn surace studebakers idiosyncracies mallas marrows cogenhoe mezcla mahmoody riddings shamsudin guangchang zyb algus hipodromo ticciati workrate worthiest hhe saunt kctu vatersay googlebomb horridly darse hemianopia nonstarter guruprasad spiva gamber lacemaker resit phrai unificationists reigh leasowes innercity villagomez devoran aboutaleb outqualified sermonizing addded raileurope elkes lovaas octopodes munkeby filmfour stolichnaya kohm schrafft alwen unhealthily lamelas mcelhaney kayalar scheible giardi mikulas fundoplication hannelius cheesesteaks buchannan essaying kiddington therof fawzy twx bujalski arginase balslev korbi mammadyarov whittam ultrabattery ramrao scootering lefton drizzling ascó pbrs mehanna hepher kondh impliment diffey sunburns celliers pauletta slynn ruffer reacquisition homebuyer starbeck colajanni medalling orten longboarding bergisel voca septuplets assignations tshirt meisl mopes dijak hwl kosove ludbrook underclothes trakker rebney teisseire baxt klor thegame bepi psnc premonitory fotakis limewood ewww revillon recondition synar timika becoz alkis underbid yongbo kiddin overzealously substanial mohney kakade maafa maxxpro pascoag assasinated biologos bodrogi shigefumi wasta dorinel penalosa betide gupton xpcc cukurova rickaby luiten soperton radiowaves freij kilic dhruba thomsonfly vauvert houseware mvrdv walem motioning flavourful alferd calangute whitehills jovians placemaking shrawan saxer mckinnis abderrazak mildy maintaing bickington polzeath broggi chaddesley ieps brilliante yantis hitner marong gandhis parities shawanda durando cockers elbers cockman staiano damrau steinbacher wibble contractionary araldo pelargoniums fengjie nonino conks ashjian 
honeycombe mandolino shukrijumah hamburglar kreischberg moominvalley nowack disseminator illuminative hafei gwion therma nanoscopic roughhouse embratel buechel attukal delousing corteza schoenmakers rohrich korshunova bernadino mazzolini turcat timothee annotates pitcock poyraz jakupovic martagon magsafe freimuth intraspecies lemasters qmu hawgood krx champoux swartout maconchy hsct ramson xiali dmfs snocap jersild generativity manyata stunk reoffend bernado pennel resurfacer rothaus beckstead crye hryb tkeshelashvili nenita netbase rokin ncircle pakora booij timimi gendall akognon potsch pontcanna kadota raib bosisio coldblooded collimators grelle rabigh moisey midgham katsiaryna meole glotzer ledermann thronging dadis smallcaps galinski patakis dith ouakam jeanetta grygera astroland kristeen gratae mouthy breitbard tradeshows shinkin shahrvand raikkonen distractingly queeny lavere activehybrid hde deonarine bodach wesa zwane yingjie benzing dolbear ubad kurr tprf relatedto moggy agrestic naofumi buseck gaslights daters firstbus keavy asotasi exotiques mirwaiz vivie wnuv kranjc improvs omnifone abatements minumum hich mehretu shuangqiao businessobjects tods krehl carim waber fieldorf lcmc kamanga parad dvalishvili gritten pavi languge creveling ringwall kunak dmsa spys moiseevich pentney ojt outrightly greeba geocachers keqin alterable reactively weissach mlpa corbacho achievment nimroz polruan civilise azhdarchid malissa seducers queerty cottaging techwin fabens krugerrand kuragin snam outrageousness lingor pharmaceutically inquisitr uspap zolo abominably lunik ishiba merad françafrique ainun raisi divemaster rockhard heartthrobs beidler neylan chalonnaise wideload avini olanrewaju dexters eckles bornholmer camelias mcgeever ritchies mahmudul trustmark densus brunsdon ambrotype mutchler tepees chickies ndabaningi bayoneting payables condemnable graffeo polaski vhb roketsan whoomp ammara quinter artemi imz greenboro loyalhanna popalzai mcgarrell handey cxo 
galdakao lightcap klatch ekimov ebullience eurodollar ditchingham dumes dimap stear schnucks refreezing pluss bawku cookey contries kuumba masimov schiattarella quiting yugoslavians remotus henschke jokin soyza iraqiya sartin saharon gutbucket kamrul grabham unlocated laggy overpricing ruakaka ceske dysmorphia hoba earlimart biljon reinwald ratm lingafelter beverlee naqqash tdap wolkoff radulovich muzzleloaders bame iame jafargholi boisture chegutu easterlin ejike nissans seitaridis secoya ihd lauran gallman sestito lakhnavi humphris celotta pittinger mudavadi kirven briegleb chinwe hrpp klapwijk throwley soumillon futuregen selecciones bastani karsts manoeuvered palinuro irms usherwood southcoast neukom louca hawx malmborg valdobbiadene bowey footstone cavanah kirchick niedringhaus sarcosuchus excedrin sabaudia denbies antiretrovirals arkansan scummy rubinsky kaylin soonish denicola legrande scapino xom dhlakama prolixity faughart zorana todaiji perezhilton pipilotti colombine confreres czuma saquarema dtz seethes roflmao unbiblical montres galbanum oscommerce biltine sicco hamisi eulogising avocational xinfu worby flyger diki hagadone witteman saac gianbattista tsiskaridze bokel kishorn envying knoebel lustleigh partier invigilator marata undershirts bocco lebowa lotfollah exhaustible abaunza overgrow zentz hoeber marvyn hedican mccleskey puelles kuehnle waing mcgees eyden gogia hassabis scythed chones haloti teleconferences abag traymore tongchuan beignets mezuzot ninkasi sidoti backhands ensalada ybas petrache pontsticill estuardo politicising saltis lavallette ctas impenetrability karipidis lanken caerhays wyth hocutt borell monthes rogé cinestar trupti revenges henahan pfanner preise rends macheteros leshoure multiculti kyungnam chillenden bubalo jambor pursley tmetuchl conglomerations mckayle strous vasks pettifogging puranik uset anso bainum qna uncat whinlatter bittar tarutao chinu pedregon ncfl macronutrient nfkb thorncombe wcas awak hasin paams bozz 
superinfection mokes suicidally lininger montsant lifeboatmen telepathology convos meagen aers adition bleadon biancheri biotherapeutics hasheem kosmotras badwi twats keratsini musts cutepdf spicewood ilsinho wefald bfca altemus pensione yeter villalona ecologie dearnley barbwire usaca fanless lituania ditf fibbers hartstein mozhaisk webtop fenyvesi globalflyer belived severability kyobo badh cissi andritsaina invista sonographers gwy cernik hardwire purveyed fovant monsal fishamble cinefamily rubano artax siegle gtaa dunecht frommel rpu unquantified staincliffe loakes koprivica veech hipness billyboy sdps ruhrgas drafi mahamud elides orangewood moominland ruha rozell goatwhore tvpa hunchbacks coronated shillingstone shortell khalifas wordage zanganeh eiden sharrett waab vesko icbl chasses nachtwey watnall heitner thurmann arbitrates knödel kohnen telecomms clannish rusnok belterra shovelled mputu okell jiyan notarization middletons zdx baches ebbin zubairi mianzhu dickran tsec citrin acroteria kellyn dissatisfying littl grethel bloats emelin zacharo extrem selph iabc creamware wusb npca reoccupying somontano kremmling malara adeem maarit nannetta guarin couette cheik panitz continentale haemodialysis catarino negs sennybridge zamka vishay swanner strichen photopolymer jeramy gaft polyneices gazey spearpoint dualled gundelach rilwan zaloom triax creaks dasarathi tasini kortney seeduwa nnk pineridge bederson tsotsobe limpar caldwells lomon sweetney braungart rustled everbright somehting industrialise croyde croda sodales augenstein jamain ledsham rodhe leang brockhill facepaint gdrs ironist paskov bepicolombo whiney signifiant loggie maing electrotechnology mcds sdsl belcoo prokopis maffs mcgeown andiamo jalees harpaz northrend haros platysma davilla schnapper geyserville ichinohe channelview lambdas evis doocey maryinsky humbie pertis zikri mwnt ilych omnicity centocor staka pulverizer deryl cabc sterlings sidd egide morston gasifiers candidats setsuo ballboy visan 
skrepenak brajesh vork buntrock thimmaiah mccumbee tchs sleddale ribbleton bolkan aufderheide reifert lavagirl fuchsias overexertion kalana yudu citings weidenbaum spanx redel lagardere torkelson hunston parentless tantan beelzebubs dokumenta legian rascally darusman duiven kobina huizar bladenboro monasterios benik mamic exposito higareda ratiocination tufnel kamtapur epsiode roseraie kanowna sullins thrashy bombmaker zehner peir whitfeld lby lawing nympho unburden qfc sylvinho sengi golston siptu kingscliff naaqs viscious hamedani samsom hacktivists selvy genstar vosne alagia shweli dorando flygare ubukata tamiris presentiment tike amicizia gainline korvette tecpan informationally diski rollouts hepu improtant cienegas bassong blaubach shenay holender misalignments hybels queudrue crossgar brondello imbaba holomisa allio dalwhinnie chatree stahelski fuksas olowu outgaining tubingen headlee geiko multiunit jenniskens shuan fangirls jalila ramsays nooo cognisant lindinger eveything angeleri windier gelston pudor karagöl herfurth trepanier ribena bufford enoteca deininger boxhead regretable eroticized malyon sandeno boggis massett nagarhole morfydd greenmarket workbenches barchi poletown dahalo elvi scottishpower bluehost mogilevich garcelle mckenley ongpin chudzinski guilloux gvg vart eagling diah livarot grapnel chingis sepaktakraw tranquilli ilaga xingfu bottomly sehnaoui onevoice kenzaburo masey iltalehti optionality resettlements ironik ragghianti sinohydro steinhatchee funks drumnadrochit hambright porfiry frik halswelle skane sandefur manc snigger channu rabone roriz etj eyedrops mthfr mbarga huatabampo rufforth phenylethylamine eventualy interflora serrao beaverkill privee gscc notkin mudhole asgaard farmingville conserv bido momenti cager fladmark lukasik hannawald marlar lroc schiefer vulcania jmk ciso terzieff wibe cucuy moussi analogic nrha halterman nodari boetie toaff pinboard dgx ndri picanto issie tand vejer changeovers maddrell huallanca wadhawan 
enyeama weech boisse ratnasiri brickbats movables contadina starlab creatura brzezinka zitron rangely mcconathy valpak kisielice pachyderms grousbeck antonveneta porticoed maiani buspar critcism simrall romary resendiz houris viiv junxia valrico merche cowger mohnhaupt xiuzhen writhlington sevoflurane valhall craigmyle sontheimer amama giertz aleshire unconfident testbench dilon verta shailer longson jacobowitz abrogates straightline fusina reclad kemperman ellida sunlike celeriac ukelele brueggergosman tiernach poruri smoketown joselyn danieley paperclips harmoniums vilayanur soufan nyishi khunying breathalyser liko birdoswald antia darkrooms pullens horkesley borrani fezzes rotunno oundjian mcroy aquafina twitters lavagnino hoffmans fanhouse yebda outclass orbitting taotao hanifah gwahardd glacé pitmedden allisons sobolov evercore icna aasb skegby pytka accelerants mcclard tabbouleh garnethill lamborghinis taiichi mallie nads glasheen zonia sofinnova anorgasmia aycox farizal aast zenoni simiane jewelweed istm arifjan matteoli yacoubi aabc paiz heter niaaa acció arzoumanian ayyoub nokta incantatory harrellson chiran mukherjea deadliness euphues butyrolactone minney beauvale infuriatingly veruschka xiaoqi railey huxham burtka hellens delacruz underwoods keiter zeewolde dicha muddler succentor babaoshan loopnet isabeli sugested unitard vailima abony pataskala resco americablog perkovic bindoon sennelager asbpe nedelin thta mji unaccomplished similair duss buchannon austrey bottaro epalle unclouded starwave caporaso jazzin fnx alire nhis wheezes labarca barmaki abdulrahim kreindler digibox wjtv writhes yazdan unharvested alipio gayler aeschlimann angelical zevin glenallen buddo rutsen larrimore korsgaard zialcita rbh yongan terroirs zilberstein frand shallowing obertan elmohamady cagny egnew svcs teeb weas antidisestablishmentarianism deliverers iicd geesh bostich bonifas dyomin maldef yre theu psittacosis apopo servillo verint nofal lytes gongyi bloodsports afrik 
shoushan mancunians dimness cassoulet rebuilder reinharz skyhigh reira groundstaff entraps bebout exorcises clayburn richar rabei lucked pinkies transloading groomsman antan mirbat grannan ogiek tidies essayistic computerize lengthily ezeli dochart freymann suer littlebrook sattelite waramaug lubo safesearch escandon nuckols spiriting istithmar cabriole traprain freindlich sogou champy jary tecktonik lligat sportwagon blw hamutenya rowsthorn aschwin redstarts vaccarini grenadians mgrs chiya offcuts holekamp aasan jinga laconically unintelligibility eccb lychees shatford paddleboat belkis doddle kinmonth pilkhana burtenshaw mountainville nowick koellner bjoern bathrobes codicote sehome ugland kesse comparitive tanwir barette controll radiowave grens marjanovic andrean tavris batek mesospheric witha rigsbee nasaw breathability kenniff weijun nycomed dörfler statelet braila mussen depite emplacing rosica mindfully iknow canion ogonyok abshir ambivalently nethy chequerboard craftiness bawana snailbeach mokae hillend zhongliang rasiak cprm scampering soud blazar sonenberg inamoto ducruet nivkhs lillehei rangiri sequenom lorwin githongo sumthing parred msec gooley sedlescombe cochinos gotshal katiba fakoly ronette wanni gazania ripstop minibar commmons weissenbach reassume jaras penwell geia bput cedis ccea franze follie bilges glor munto lidy koten overcharges sfmta mwg nhw ogunleye bekri jerramy dokan runte predannack gostick harbisson yuskavage havi shrubb megabucks westell assp delectation abrines inexcusably yoran etelka neeb unhitched fictionalizing tsampa fryzel acquirers aslockton drawcard loftiness bogeymen llagas steyl tippers wevill sweetbread koelewijn nanan bismol nicad japes ekp komla parve zernov dmdd falconetti transgenesis overstocked qap raciti eezs cotana vorhees tamaya scudetti bolano mulamba metzelder küpper boasberg archstone gowy eurocar exsists katchen wijekoon xiaoyang fayemi cooperrider terrordome mosset bryghus matsuhisa lecanto speedos 
abdominals pandove gratuit moviegoing prevalences henjak awes logjams etak cyrilic volpini hedonists londyn codger allbutt hartcliffe sundram smoko stucki insurrectional khandker denenberg texmelucan hooved movilla evoy kalantari hiccuping khristenko veverka recommences pentaerythritol wirtschaftswoche funtua tokoyama colugo leatherheads garscadden torneos sivanesan leilah pinton afagh vikar guanling doublemint forebodings nancys klich gazz amyx oefelein querejeta deaux detouring totto webcasters talland westbahnhof waith cvrd townhome sweepings cookley timberman halmosi bojo okemo nerem belorussians dequenne sougou mursal ingrow halver backsliders unpremeditated lainez defensibility banghart sabayon samho premaratne badgett harvel darfuri chlamydial opuwo karlmark goldhammer relleno gilgo keeran inkwells refoulement kohe tretchikoff rusafa financeira schapper downbeats quansah aerobus sgorr thanvi kurzawa afeaki tollman voluntown varenyky madrassahs lafca bringuier qadisha drub oposite hussian nurney emtala nyatanga anyhting baffour collectivised coert angioma strunsky mislabel copado otniel cruiserweights eith randomisation ehen greffier fflur juckes premack reprove ajt neep embeth concentra sowards anshen hipkiss austan readymoney caitríona keshar keleher biosimilars revalidated hagelstein rusling wrangled bchr concow sweetwaters mpds kathiresan shumard leslee kontroll lucita ancestory schoeneck lasvegas poquelin tyronn caires feca jobsworth limetree machrie natera mesick whelpdale flamberg aloneness minnear ashenafi honchos fitschen zubrus kamus thecityuk baldemar exia hammed keram enak lxb bocht gelineau wineville suhaila garretts manfreda stairlifts henault suttee sinowatz huether karnei pominville inextinguishable belue frisked loppi mazher terrazza fritas oystering nuca délices cmmb millefiori balking creaney ximending cpoe bajuk elblag scheetz regurgitations hubbins kanwa kimutai compunctions jimm cancale monans phumzile boet pigram opion khodr koumba 
arbitrageur pheromonal shairp mahbubur snan hannig mcconnells kehilat foxhunting makosi humanizes avst taxies rebuy demeco lougee nsidc strivers egusi acab yurimaguas hamiguitan hatchings selectee gwynfryn souri hydras nogle alfon sorpe puddington kasavubu stumpers frpi fantabulous drennon sisterhoods alwani plumped sweetlips yerbury tempier grandinetti tostada paratene titman transmutes kahlua seymours totok boyda linby amital levieva gradney hurka sulkhan conero kropf boasso teleflex dcri torpedos christofle piaui blackgang meningococcus bunget fiserv gwrych infratil denuclearization avedisian yumen telogen mirchandani kringen deshay colesberry fremontodendron ignac fathallah shelekhov alemtuzumab glatiramer blastocysts boguski eurocorps hajari xnview safaryan ballwin hiranya giubba mihoko rayl loreburn basix sickos yuhanna farmyards brittani averroës wasserstrom grafenberg adala fasal kontras srivatsa yanbo stabbers duhe gysgt barrells sneads quieten lccc lupane hinga chalfie hellos scoby riduan najeh eggborough teratogenicity galten jarvi gurey pkv highbaugh brauw rsamd willowy jinmen desanti triay decison opiyo meridan nostrums valeen frankfurters nautiluses stotler ghoneim brecqhou delestre greeson sliwinski artwalk loku schnader jaliens taracena fennoy titbits czajka batkovic kulfi rotherwas elegible gudmundsen montgenèvre benakis claming speedcubing wittekind surayev malarky birdbrain chalhoub horridge todorovic factset nke newsfield ddraig shepis masaga ditu gorostiaga subbaraman alaia jukkasjärvi panus otiose hemophiliacs drezner particleboard havat minsker whitetails circumnavigator loverde wanjiku yinon mavica ozdemir lamphey ribhu arrowood gourmands ginori aiwf entwhistle antil lombaerts stiffed alberga berowne heek jumana mannofield converage notifed glodok tharon kamstra mujra neede mònica urasenke capuçon minoff ruminates philmore nienke talamo barquín nunnington alledgedly xol nortriptyline radlinski koskoff kalley ndas gofton dookeran deréon dawne 
cambe brayman tyrannous alekna iberico vacco navratra thila tairi bagpuss walgrave ghanian bovin mercurey chinde trenwith sollberger streetlamps greutert pączki nobodys norweb chartock nutrasweet sureshot thuoc sellman inbounded flatlanders vaporising isella rizvan bungler hukawng calsci suryan streck ryue dunitz longstanton modhera kimbe edivaldo overtakers lowa shirah gnlf interdictions bharananganam vandenberghe bhana quattrone radaronline helpmates serralles unibrow maryla armanti sasportas tangela lenôtre vacillate americanizing gettman surveilling midmar theway hiel anthopoulos etok sfra rssi kunes arndell rangelov countermanding korr arshin borisovka superbug gobstopper buncha interrogatory belta liberhan ivlev catarrhal leftback bilinski jediism samel mehman contraversy tofo mtvs artex kvalheim panauti hoodless micheel hairstylists ciabatta vht richville radiomen joyes ovalbumin sqc minnett unnervingly auxier alavesa dingmans courbis nimród roban myslef greengage begon matala itsa forthe unitt apti lionhearted lawbreaking boeta hempsted daimiel efilm movsesian aleknagik honourees namkung kirsteen lhh georeferencing funnest oxxo grossness macarthurs strich nikes shivji aldbury simister anhe idzik niurka macena ifakara keala summonsed electrocardiograph phalluses tibooburra hesley oligopolies irréversible clarridge shoora brightmoor fergerson unal dommett mortifying gerwin jovenes saracho vagenas suseo marathis hesme kempa ostrer ferlo shamin falastin gurvich indigence kervin cieplak vacillates uksa niloofar umani derriaghy ksby weem sidelnikov forlornly venugopalan reconsiderations tarish unsaleable muthyala pronatura boerse roehr bambalapitiya cushendun frémaux nasrudin droughns senecal oladapo lickliter aulenti moeketsi breheny bauby aryal obediah riflery daris jutanugarn diora baiba mckelway favalora remic baratti hualian bothies breastroke leibish michalewicz winkless cohre sniggering hijuelos osswald elsewise qiz airan jukic rhan finci villandry lcfs 
rochemback filmdom fmap inclinometer fodé encouragingly chanaka sovereignties sebastiane chavista slobodkin kestel avolon ghandhi joze acdp demoralisation vanlandingham marikar alcalay prejudicially chiropody brolsma washstand atwa enfolded attallah cobey remley afrim saturns passera leatherbarrow nows cheongwon trevanian lushnje lidgett centrefold panchakarma mishcon brixius silverswords malavé concidered clearinghouses destini boiz füsun ligambi demer landler baetens jinxiang ipratropium unsatisfactorily minoprio koge cunneyworth angy dharmasala schmidheiny rephotographed beeri patinas zingo promptings iczm denouncements tamaru heertje calzadilla wenxin petteway slaten narkiss wuwt khazali adeniji nyamira brainers gnatcatchers whorlton katp hyperextended starbursts douridas leuchtenburg memc fiorilla pozieres recommencing faiers denford onesie rozental looooong delamielleure misimpression vandendriessche khamees tagliaferri defoliating redbrook czuchry theresienwiese jigged yardsticks langgaard lcps oring ehrhard benza mrgo brezec fountas contenting aigua agoglia sambhavna cookstoves norment topica destocking lanikai virginny fcfa coronini bandele cesp pontllanfraith curtsey tabooed wrapup cxt mcwethy betsi raborn qionglai eviscerating ceranae katouzian razzi artyomov bûche nibbs dosch woodworks locali froward multiscreen shubho ericaceous powa lylah clinkers mclinden galbiati købmagergade phip kapoors gillison ihec sharfuddin cocooning jaeden anky singhalese baranoff pelleted shenington powermac jlm spurted dehnert khune aivd jeffri omair zumiez fauss orick frig cooperativeness ghafour kinokawa corriero icrw whur rubiera cusu sombody chumby patacas yerokhin besla lipizzaner herefords whateva flyfishing cuppers korka hortatory figuerola ascraeus hillwalkers butina giantesses taxanes hardev genuflection emberley opdal valiha badshahs wiggenhall coplin fenstanton agran ,you paternalist varughese casellas mordern ruwart ilos kuca tonghe bolita hamamoto phinn yunhe 
maamoun nazzal insch enviromental freiamt verbrugghe istinye shaugnessy miryam absorbtion hoofnagle perplexities gonalons ashwick chachar kvetch insourcing shorecrest nutritionals circumcising woosh valsartan dimaporo backwoodsman igbinedion glittered lunts eying vigorito filipowski kilmeny scratchcards jogis casseus kreutzberg oumi lispenard gennadios desisa ruchika borf asuquo ffin brunken wiard orignally tatlow doven tesche tagab surapong becknell lqfp simonstone pallisers defilippo archibishop jdr wcjb puttar confab despoil commodes blashill bareknuckle shieling bateer rebids lanctot hanauma sharyland klauss ariga almsick ccss skewbald dosb novellino elot briarcrest hessell heilemann skaugen spygate reductively hekma emmannuelle summerford somnolent gotobed montegut teashop saxonburg prairieland jyles fubu canakkale nammo unclog raqibul roadbuilding absurdistan ozmen attili orrefors quells revalue mfps ganzi rosily casualness penix provitamin reheard aufhauser frizington klamer kosb luzzara parklea hangtown darcheville podrabinek reinberg nuttiness avcs manpack yade dispatchable kmworld inabilities hockenhull gardeur phreaks stfa heffern bahtiyar plagens bluemotion doree haslen westburn gobbles munks arteria gdo ubaida hospitalizing weninger eckbert turnball suada bookout zwelithini tuile unorthodoxy ermen viably oatway icet imk broekhuizen damehood phills magubane hackler csit wiechmann umkomaas shevon goldmines cicig fenglin maguy congi virb kazatomprom trika xchanging immunotherapies bobbsey practioners jaramogi pretexting harfield rajamäki unacademic grifasi badenov crams worldy arranmore psid khonsari ependymomas jubouri therriault tabei sivak sozo sischy brettschneider gunta breasley newhan chonnam beaupuy reaccreditation escalon marijampole milpas pirkle jailbreaks altares tatsuko mssl jwaneng telis breteler dunseith sixkiller négociant sutrisno rosenfeldt semenzato markwick reinbold govedarica emerica nashawaty steggles fccc halatau clinkenbeard kemsing 
iriyama ahuitzotl lavaka audibles winsten lamichhane karaszewski lenat kübra bugattis oranim distruption stylize lefleur zegota grupe neurofibroma buñuelos cervoni quami goonhilly archelon rasiah materialisation moszkowicz woolavington fivers braudy latenight laugerud matadin battis josemi majoritarianism howcroft decani zogg onil hamao draughon limache ronca mamatha montador schurig selc belloq emtec hucclecote unionisation shangyu confessore gilat brautigam orthop gachechiladze marzec contemplator kaledin rowlinson dogmeat stambouli dgfi culpin bawaba heve molinar insufficiencies scuffling onanism fallowing buyuk chinee decendants oeuf unfocussed geldzahler moutet massys rossmo bamn prospal llansamlet triangulating sophea impersonality jabeen salfords munnik cye winterkorn jamestowne ettinghausen countie pinkava stefi saghafi maccioni ulrick massager lopota basner laughner flambeur manent chandelle portersville blanchester glouster supercool desaster noooo gyepes skittled kotulski trandon fernandino ratomir wiessner chimayo helmers casadei salsedo metalink hossen paykel benu fantasiestücke teuscher fiesch gardemeister omrlp numbskull spelunkers yuanlin ncoc kipchirchir roets goldhill wyers mugg stanardsville forker cerletti llanddwyn unibond wormlike aagot duskin bandz arancini beierle painshill gebo foucan fengyang emsc beimel falih recolonised dongsi thudding kenderdine chollet fawkham plagiocephaly ahip kebri santapaola femtocells middel erwann delphina occhetto preysler volaré maraging tonteg locater charikar jazzercise outplaying ctsa dolto kibale darnestown romayne dennery azcarraga saunton mucormycosis emasculate yike kutiman fedotenko kasatka timpone montagnac tayer wolinsky keem bistrot yemane everyting proximately elgee shafayat sutar bakkar mcdorman biancone sagle ladyfingers microenvironments versifier neice kadriye bajillion reengage schmeidler exalead dominque weitbrecht lanard tiwai benally lasn sliman biowarfare baloise gambians ikonos 
bairnsfather rovsing fawc burdenko zinovieff pessary bryers jxl dutchbat cely trinneer trabue hardhats drabs ratzon contently prestrud megabases hosty dottori stanfords zolfo dris pokers incarnating eskan arush twana parkridge afful moonlets caldew puddled demonisation kaylani nacirema eglingham cherrydale gerhaher cedarcroft eggimann desena investimentos klingenschmitt shortener insincerely vulpe varne salawati staindrop gutersloh unamused sightseer hicp huaixi vocht natalist visnjic vernice tinnion overpaying fusaichi putdown zeltser greenebaum keily kingsey kempka mcvittie pahlevi mutassim yetnikoff nazims effexor schmick latisha cletis dipsea olsens jetons subsample coarelli grear borucki malie kcbd collegue spattering olaiya iarpa nyha mcy sûr densitometry masunaga gurwen nlj crosswicks sehk dawi nedkov denno cilic dedicatees kuester antipathies porcellian matriculants risø vilu reinstadler golez sawe pipestem lindzon strumble abss audiovox shaldon flyingbolt byman cleri ruggiano wimble durdin taleju kunie earthjustice deloss olazabal nrityagram standbys nilar crotches speedbird pettinato vermiglio ivas kfsn zarganar mejorada deghayes montipora dornum corer placket pirnie attatched renacer solio chandola bangour scsl blighting nostos standwithus khuong coulais fullman goldenhill leisured aglukkaq supercenters alpro binetti gudger gomulka incise litan riepe guara argi jehl niederauer araroa cens baleh misko podgorski kalyn bettcher haggadot infills borgstrom skarz tunne vnr tigecycline punke rmh horrify asyl paudge cedex megill munitis antczak breinigsville nigo jameses slewed verbale banyak conduced sheild katangan strewed prepublication fluoropolymer nationalistically sedco sinabung mackanin craciun futurologist preloading lazarevich derafsh yonggang truants berretta thawte enayat adamos sodrel quilling daohugou monochromes gemelos comar tannock paname malthusianism whre carretta sabaratnam bines lomasney artemide althingi galster ribboned matrouh enqelab 
lligwy abdillahi ventrone jawas lightbourn crocuses eyeline goodings benzel pirone hiemstra bantamweights meshack trackballs vivancos judiciária gavarni covic colace chulmleigh wontons cantinas irfon meeds panoramica yuans wenling stanikzai grieder stantec frautschi displaysearch backfilling whitnash bcar cordiner atomico tortoni itel arouna santeiro gantlet jackup teklehaimanot llullaillaco kotex zhengsheng raue primarolo fims lesniewski redelfs hunanese fransman firethorn occludes eggold ferrisburgh farewelled avrakotos munsingwear pinol tuakau wonjongkam leilei lihui costard daosheng equiped cheryomushki egglescliffe haverkamp esmo neora chheda coachways pistoleros evolvable baissac ashdon concieved prayerfully proctitis dipali gitex averette suborned niermann ktva harrovian kamgar hendershott emminger jurewicz bicmos prudishness plaschke gbeho pizzaexpress castigation suppertime momm avoch teklogix unversed morghab featherstonehaugh wflz misperceived euribor paillé straightfoward kliman flum verruca cfcm branchflower vanmeter livadi dequan vicuna montsho orlev hokanson mcleans marun morrisroe habayeb cromwells overbooked mordue sonoyta griller botkins weatherproofing pulmonologist ghlas xiaoning deossie palada matmour bankhaus upcycling baalen crimefighters limpert pendre dusenbery xfactor osin telkiyski webberville rscn skouris henrard lubinski mafiosa lardo jakobshalle pagliero eurosatory cromac condori slobbering goobers eaglesfield iccn finardi laurien kassan tssa dolinka hothouses prive golovina nykl angiographic vandeventer flyman ellenwood mydin pullicino tamely pappin broadmarsh gaido tervo kwakye dacourt raichle karnstein baghouse leage lonan saulsberry funmilayo mayhall cepacia hardbacks somehwere fundable glargine bowmont sanyukta kroese stickup meadwestvaco aurthur kalynychenko biggart brundall salicaria liberationist mccormac delicia hamadou tiken upchuck photobooth salomonsson bellany dehere mardell klecker stuke carsons nierop flourescent ceinwen 
armonía fodors fuggers ufood ankhesenamun cukurs witthaus sanchia abdulahi czs ctirad pother fatted xlf spaihts sportsbooks teched hatchway derriford crumples switz loudmouthed quorra idoc donsol malignity lookaround salmeterol piebalgs mosha anchin dutrow glenesk kuliyapitiya playon bohmer slad stahnke stajan secretiveness bhopa gaebelein bodansky lovre pentamidine fealy perceptiveness aifm togger waterfire kpsi chediak muresan farino subclassified mozelle nardella axr exageration superfluity layups juicebox burway cursi toyboy infrabel alysa candaele disprovable bakkies criminalist januszczak thirsts plainspoken cipollone bricktop gertjan birsel lawsons dewars concil heynes marsella witzig riblon jacó sauerwein phantasmagoric aibel outlandishly sdsr khutsishvili dettelbach worldwideweb follain atj humpin freudians pillowcase capanne yongjing etlinger listserve gresser hahahahaha tompsett siyaj loffredo riesenrad neroni ramaiya snoozer tewin oxazolidinone yuquot rusnano litto berain louisans alcine knook mothertongue scarcella agaba marshon ventolin rimantadine storkey ealier wartell polmadie bracadale khalden downings smaak vinclozolin gollnisch spitalny bradon hughs netafim tynda hinet brickmakers thurtell fulp vermeiren telemadrid despond lfls tacher proceding lemat gulet gilf cyanidation biopesticides recoupment paszkowski kinmundy subcomponent askaig abanto phantasmagorical budging danais bartin taqaddum luek siemion lupini epidavros regeneron kinakh atzori viler mccarthyist lepchas khelifa mylroie fennessey maegan mangabeys churchy damji unmerciful rosan benmoussa organizacion chinking jalula parrinello tyrolia mikaele bankasi nishita kohring mclernon sinaa vitelloni xposure marinis niumatalolo nyjer appli leysdown wibbly reveller kupets hispanico mbyte walorski keirrison mickler walhain multitracking enobarbus sivok livvy crainey cepsa nieu flins tuberose sandback ramseys basran berthaud sukhdeo leonberger cambron zelizer erke vonitsa dharmatma kuljit 
stehlik haberle caftan memin yudof arrivabene waxx krulwich nunatsiaq speigel yambio subsidary maralyn tunley léoville vrolijk mtkvari folkie roughley belittlement parakhouski postpunk detailer convecting pectins stanberry bellingrath rolm dvoretzky kkp pilfers cyberworld wanniarachchi geidt manninger eirias banglalink vitiate atones zygi overmach cnsl newgrass bettridge detsky peprah entacapone esmee lapides unsought visiters originations banse sedbury andreis tanjug vpls rajula guilden emboldens phalcon duick waretown lampell tibouchina alando rocksprings squadmate unenriched nurofen lazarist homewards zahiruddin unexcelled firestein romancer chodos landsvirkjun importunate wiedemeijer itchenor esrf feniger kurosh engo mouhamed gruyere cattan discoursed cherilyn diabetology wahm freerunner goza cicciolina enckelman galluccio junshi trulia jarmon bulkiness jutge grigoropoulos travailler stallingborough shaff argar mths goromonzi clarksons egee lindrick gusi glanfield sington lollis lemierre viterra hypophosphatemia suhey massei amenability bratby guidances cccl boehler padmasree ledcor flexographic ozersky ciric illinoisan tandra glafcos coppen peellaert polycarbonates vidrine agui eriks carlomagno fetishized meaninglessly nurenberg subsidises ebchester jcrc nonwovens orshansky woodentop trubee hipple manetta dubuis keysar tsoukalas giin spreadshirt hiort pogorzelski wienerschnitzel caujolle landsharks binladin bedient sccm astles bses diverticular okochi zelazo crowton biebl dibben huchet cotner jaslene hilbig schoenberger southpointe ahg notepaper ised crossmember tregua ranchito draughty simbin brinnington samie collinet nekoosa qingnian offermann welney vilnai nuzzo afspa pfann yaqin cinderellas anyother hecking kidepo shoeshiner eminger ijustine mitchener lassy fingersmith sidestream fullfilled unenumerated triaged plov ishaya motty okaro ayodeji bohon kulay unilateralis keauhou aernout tecia deandrea muscatatuck outruns benat wannabees wentloog engquist 
bantz passacaille referrers quislings indestructibility groman pacult hobaugh doocy antik lepik ebrima børs wasafiri menstruate burgett titanum hazelett stobbe moorjani sacriston mmpa stolar southbrook coffeen sabih hoei bluteau millennialist blindfolding kandak yanomamo zew mceneaney backheel tunesmith demoralise nunnelee shrimali desicion rheta tawaf milty pister mccobb albarello ganor clopper midttun goetschius frazz crackington zarang listach stopsley portabella biked burriss filkin salukvadze pivar uraba montemezzi kownacki gimi aberdulais meganeura deincourt ophthalmol susantha posnett mbete nieland picornavirus jelimo petplan gunrunning yumei rapo spaggiari hmda maroilles veigar fenomeno jumbling drollery gunvalson weste reefed unboxing zerner ahrendt rizzini glamorama potlucks sterilise cassington girgaum rushydro adni mylink masayasu kimbel bonami qiqi pinprick stencilling grillon silversea transmogrified talea meuli eqo ktuu dentice mackereth attaullah taphouse peneda intertrust sarpaneva waight neuroinflammation fujiyoshida fettuccine rockrose ochlocracy spaceway axium undependable dongdan callaspo macp fdcpa flinton masciarelli mannatech forsworn inlcude yeakel exulted méribel forbad qayoom omniture unhallowed berish bluetick vislab prediabetes earll loratadine ecuavisa billström izenour ileto caravansary karara pida bellgrove hackbridge marse boizot petrak kashechkin kiknadze romeus scherbatsky elorriaga motiva murum ammari lampitt curth stakhanovite buchloh hamberger karnazes earthrace lustenberger nehruvian zarema asira vengence supplicated sibierski torrone tranquilo scissoring danged desbarres deinstitutionalisation nafo treelike bungs merchandizing bishen duffell radica rashidiya scelfo taty volnay leerhsen bushbabies agianst inherant sanso laplaca jozami insolvencies millisle derouen irania thundershowers chapmanville hobnail azobenzene beedie senzo paavola bizar swelter gillean saucony corrupter crones murgh burdin soueid eena kingsbrook inlaws 
nmwa henrit idamante neurontin nakanai nurek semprini rully barsby bofa brakni pkl manumaleuna fasolt sferra deferoxamine rakib hogstrom tesar analy ignitable dugoni newshounds rhy stupka groeninge scarnati venla cromley kpcs offeree abergwyngregyn fredrikson angoff downtowner khemka verla champi alakai thamir frikkie xiantao didgeridoos cantillana intertan flng hedqvist calahan mcgleish lauberhorn lygo berlian pessaries osteo goldstick kvadrat koska dreyfusards xiaobin meisha varini silovs betanews placeres msee scandanavia heurelho aafl orrorin extrudes gangmasters alticor ardersier arrestors rundall flipsyde amport amézaga benfey nombreux chagossian secound brodies scriber nackt discolorations playtv romgaz unscriptural mundella videophones djilali bodnia escabeche frita dzhezkazgan matopos carbona dihua halkidiki dishwater llanhilleth tropos cruchaga phmsa attieh piccolino nissel cnss sebou melder flamanville baky kuroyanagi shirra scougall jomaa ilw bluck plymouths mericle creamier zili kosek yira dodoo grahl loynaz postie tigertail lerg kyrsten eapen awuah uweinat wackers noella graae dshs swets mccolm beechwoods ephemerality beltre murcielago jerpoint concordville ullas vulpine fmoc lhomme xeros cinéaste leflunomide electronvolts ingoldmells klaviermusik coucke impera administrational hopera unpick romac neeed bppv bertoglio commotio sprotbrough opap duntisbourne meddeb trmm dweebs graiguenamanagh pocketpc giselda sadun birthstone cristman uamh gogledd pidie diverticulosis colonialization wtg hardrict macrobiotics jaymie estrich aonbs weyanoke homen creepier wickers arandora obg raring mistretta unsuk hardhat quernmore swiveled simat limy pennsburg bedout areta bridgework tidswell learmont erechtheion yellowlees powdering systemax oheka weatherwise whiteladies indecorous villita pleau mascott fttx pintsch harff odoms larpent lacava lisha vampy armamentarium litchurch ouwehand yerli maricela obduracy walloped butylene wineman cardenden ecmm fisi bricking 
binaisa cber jacquizz matkin hotnews orchardson mealor erot drambuie dillistone unawatuna sunfeast pammy carbaryl reesing glyncorrwg hahah bicks upbraids cassutt melanocarpa piepoli defaria complementarities freewheels housemistress musaid taleban dunkery epifani kfdm miljkovic smartmatic brohan salvagers upsy newmore upperthorpe unthreatening magnetars carletto scuppers kuenssberg colorizing reciever unsual guilded easygroup mentougou musn bindy schudson undiscriminating sérieux zaslav vevers molate possilpark agritubel foxen alra arfi kingskerswell posnanski penneys religulous faccini kyei northchurch triet saïfi okolski infocision recliners fludarabine drooker radfan amoore calliste monomaniacal carcinoembryonic ravat nordnorge bockman straighforward harsent minimoys uhrich cheezy malbin pdma lockette merchantable moonis wijesuriya nonclinical lindemulder breitbach lanoue fretful dazhong narcy maharajahs idolises cloudier handcrafting maizels andersdotter crisped siasia maimun qitaihe soulman freudenberger lovesickness reinvestigate blabla wrose smokovec giddily bloustein qxl gaulin arduously sible grafters cropduster meghraj eyrich nasba abdella cardonnel benchetrit imja avrig libous hbx stief tegs undomesticated latzke frats crossharbour dragila ingratiation sellards talukder lainer dechristopher hohlraum gurn heisley lausen cbpp panders perceptively andf mastorakis decentralise hoper lizano chantels greysteel unelectable braafheid flinched woodbrooke merrall hotplate matana raymar whelps abdun verita borjan aftv slogged mycotic lewises rajiva unexampled zabavnik tsay muhiddin endcliffe woodlee omoro formalwear hallenbeck morquio unsustainably sigificant carbonless bandoliers hichilema reinaugurated indispensability swordfight rohlman spillett ehsanullah stalins ascendent neisha benja morri growed beus wegelin asby kghm outshooting assadourian menzie fmrp davia kerly anete dichterliebe pinny viers semisi racketball slonem decriminalising kaituma xau stamaty 
privalova tupungato woelfel woodchipper nofziger mitsch promenading lijie stealey berlant cubbie mirrione aniko kandara efax touqan unselfconscious mumsnet impugns gebremariam brosses diggnation dabizas abbington venturesome karbaschi andruzzi maiolo kolonaki guardamar defendent diestel ceftazidime epichlorohydrin heggessey golfs cocotte brynglas dontrell smartway oler delfeayo noki starband oudsema perphenazine bloodwork clalit merki maust oyewole faac isik lungin unpointed romao sanie hewit kalandar sibillini xianghe buffin njr fracci amcom yatesbury dicorcia annabell numenta puriton entrepeneur annys boustani aquil arhuaco atmosfera karatz cplex nasimov battlestations bunuel clemont superbugs houry dangeard dhaenens sakhee endocasts gaung doublestar elaph lagrotta junan priess eftekhari ynglings bienniale dimitrovski exito cracklings rssb harouna misjudges wltx huixquilucan neverthless mclarney ghanaba domenik joza tharaud yongming travilla iamgold omalizumab tommaseo cagna lagutin rockfeller overbalance woolies byrn carryovers arrison kiberd piria makukula moubayed radislav helmreich volksdorf tolk forcados tombazis teigan kuperus njabulo balzano kilmeade baswedan hardart coloreds trecartin morrel depreciates neilly hoeller haffar lonn chyron apostolopoulos spates kasler imbecilic hillshire heinig kriegspiel bertilsson qmg zatoka sanei stainfield abdulmalik adlestrop maktoob windes dtsc ognibene gharavi damasceno tiddly hemanshu felgenhauer shinga cendrawasih alfege pluk athletissima rbbp americanisation kolhatkar polypody iddi behcet roosen yaha spitler bakhtar radatz benucci sumei jerusalemites szymany bollini brunstad dormered advancer handwave kathia tokhi biznesu loosey linster inva endometrioid toshikatsu maama finton anisette prignano tubeworm naseema tangelo polich saimir bamburi tanuj shatti leuthold riddiford cfsa motoman muick iwarp vestri cornermen panier contango tuckson vanke optout fawell ivinskaya lisin swip shakara lugny wdel ouroussoff cuit 
koharski thorneloe wynder allgeier sirous doralee ahlqvist redbay iframes dormon druidical asbc baechle tahmasebi inmigrantes schalow sahnoun inveighed gordeev dessens garrotte llansawel composts dubowski swimme romei viveur overthink trisko posset nebs meskel hansal klumpp tuigamala indrawati mayenburg unsuitably bharrat lownds lection mangyongdae ellerbeck benificial golaz impostures laphroaig kymco wooo lorick glasslands fischbeck glaz bisciotti aqc vadzim giersch watoto prosport thamara heroe rentokil linthwaite proia creich jenine lisetta austwick agliotti numerologist guilbaud scinto oliveria disinvited defamations fidanza seedcamp niaf aqis seear kpcb hile leadenham zyprexa mlstp wessing buscombe bubbe steepened lartey bensel anglophilia swoose gnosall belabored lidholm holik tased surfleet kilminster seamie svete mygatt kristjansson mccallin spoonfuls zetlin tritiated bullimore mcauslan microliter comunn gayford tellabs fourchon wcfc koed branchless arrogate halfcourt burningham bubley nagre cotingas autogrill goenawan pariseau infinium decrescendo becketts mandarine gershenfeld brambleton remploy symptomless goulon strupp sigurimi worters westhuyzen borm artemision naoshima drily krayzelburg livnat fgt asne okagbare voh carlyn preval horth lessman pieux dissapointing nosaj eardisley presland safmarine preservations lamari rscc hilu arnouville kestelman zour dybek ultracapacitors muirton henchoz vyke veton bracers wellham floto waggener papathanasiou aubie haubold plurk nabj bagneris matzoh wadie cuonzo metabolizers kingsteignton cesspits piraro cowcaddens villis mossville natzler coupa okung unfortunetly climbable julienned goodhearted sickroom bosler shevelove ogri cepal issacs duken omeath appuldurcombe keukenhof samways navolato marown trokosi henok rhamnosus swk shamley imia cristales swakop drumbeg barnstormed solecism ticos unrefuted salzano bitts engelland katumbi kendu whitebrook mepi dastgir unoriginality chastized unconcern chesbrough nuttgens 
artemether glaudini economides kavafian hadan puggle brechfa shiliang qasam ertürk jerba deso aogo visse englehardt zohur sabogal hoofers cssf whitebox todrick protoplanets hallsworth afshari elpc korshak innumeracy sybarite shuming gotzon yemini korans brattbakk arod parhat pécresse mallis redraws panyarachun lbma stylisation blaisdon ekachai wkts westbrooks pickney molouk geanakoplos henrichs stopwatches kievskaya autoworkers hmw nlv haugli dirigo kirtling stemware jamari mouhot giambra preesall mallesons sylvaine unnoted travelog blakenhall atome audebert ungenerous poquito criscuolo landham mizeur mowen chagra canazei sheilas shahade curvis feloniously flopper glassworkers kerruish hergott whitla foodgrains yasutake merkland vermejo wolfendale latkes excrescences tonita togadia zubaidah mcverry wwoz diginotar grudziadz ebron liyana qualys unfound sesler shembe quanxing amoungst eigeman toolan mändoon jurf bearfoot polfer svae wastwater slipstreaming underminer carcassone okuonghae egglestone propellent embolisms dyc temascaltepec unstudio pbde lulea chippers bridcutt buerge rayonier mogel usao jobard hierachy napoleoni uncooled applebroog uninstallation tarator nalen rootlessness perrottet despatie olando ligthart openbook kingmambo frewsburg abbatoir yanqui loisaida loescher maffi hoever surete msss ferroalloys hydroacoustic santner kerlikowske glauser beepers wivern cyark koprulu hypotrichosis humphery galella coproducer moqbel keypoint neckband bruckhaus onne middlegate vulgarian cibula smolen bafflingly holonyak overstress banche teet braveness florale chieftaincies raafat buscar karcz elfs roustan shelfari inisheer pultar corbelli pentel sandeen tatou jajah meiselman arachnological bires albuterol clarance koepke demeny hradecky bphil smokescreens gritted magreb griesel teitelman cadabby caulked marianella karpa nesconset exoplanetary jiroux crantock sayah pernoud verástegui erker hayduke phillipstown microcapsules novatek scifres valeron talvivaara 
quirini chiappetta gurría mozartiana geosmin eidelberg kaavya ospringe newfields verstraeten korson ruam tuebrook nanjemoy sinnamary schneer angiolillo shahinian ensdorf janota hoobler prolongations gvl brandweek shariq wachtell mayda cresent cazayoux carboxykinase yamase bmy pontac venas audium replanning galleywood polyhedrons tristars pageau gyt wilfley daveyton ciga longone bogomir términos deskford piii splashtown microphotography marrella yundi imane mspa ravachol afor babatundé taysir preliterate juleps aora kislyak treet steines marzelline gardam mtcr conagua niblick eumc cytosines pcba neelan angeloni grio notus yigo jantjes geale icesat opentravel otsemobor tahseen minara elokobi klesla manqué cirrhotic naguilian bowhunting hodsden pattin tweeny rixi biver symond godec budgens celac schabir jafarzadeh knowlegde civvy metzker rondot milna vulcanism egnos umbi tajrish seismograms ghm giostra santalla fhsu marijane olimpio donnez unrequested halbreich rakytskiy godmanis interring moonbat knechtges hbss cuddled cptp rudes rcz fumarase bankboston davutoglu wayda reddan leedstown ngandu hudna beeban maarek dewen systemes dawkes rinca lynfield folino karpets danita carnality thunderclouds mecanoo midmorning jiggetts manahi chupi arbin vean utecht hottelet doagh globalgiving wilkomirski kalami zvimba mesones legras ogonowski duking ladnier moqed ymm tolentine ubh europeanized hargens pesic chouest spitzberg brangelina osteopontin sistrunk druker jamesian breder roseola hamze rockoff viggiano rinspeed mither geodis rouzi zaytun antithyroid cibulka kannemeyer regardles disengenuous suffian translunar tchadensis lynemouth osnabrueck hickersberger wymeswold ncbc bday haspel foglights ginia palmettos harto rangin fwm dhali patzer okutsu unstabilized allariz cnaa mandagi coving gemco semira llaima bluemner blai cuccioli ojp vbied pasd jabil radipole viyella scrummaging bacik nexo cryoprotectants armathwaite intensions tzeitel jiyao romey crymlyn manhas gaetjens mabruk 
irrevelant molini arec saveable uscirf tingwall respondants jasjit funspot bonnyman dependably cuecat siheyuan yakubov trybuna superfood wimedia caramella fotu makala kelsch citycell swankie representatively palmos awarta cannulae portee dpsg scheen raziya tepecik zhari whiskery stiperstones oever deskside mawae tenne nres adminstration aava unedifying trieb alveley yerofeyev kaktus kotagede freeheld covais veis steinlager tepperman burnetii austyn mornhinweg againe milgrim reponsible romona baribeau fuzhong unalterably nordex hrabowski phap mallar isungset moschitta stadelheim esthetician khatuna wesleys herschler tsuzumi philistinism kalmanovich tarina surobi molavi choueiri starsem hellebuyck laane operastar arianda bonati mithal cidi specific emro rechristening colemans tianlong doggies forgie realite thumbsucker samii osthaus meho cooman humanise tacom feczesin jackbe ruesch tennell diaco padgate nuptse uon walloping spro ornamentally sunroofs carsington sydneysiders asbos leney clifftops ashara cleansings seiners overselling butcheries toscan larm songkok kelin jarvinen lauzen immobiliser citius roell haria morbegno holk ellwanger grayce babyy kalpoe kosintseva unaudited trusov bahador firemaster kreisleriana tsri elmgreen arrrr relationally cudillero melika rzepka gastronomica sodis paygo zampino gromer redmoon tianhua purty bennachie lowish lootings tschetter punked mcconnon geox gartin ballymacarrett terrasar shehnaz schmier jacomo credos dodiya hirotoshi bachner tryton maffey onora newmills hidetora dppe topware landfilling igem crerand ternes avilez petlin borse storeng chacaltaya ukra cordoning surur abitbol witholding lamsa kemmelberg ionomer cyw guardhouses wheelspin gatecrashing rostad entwining wcrp factfinding gepetto reforesting braniel broomes nazeem dumptruck arthurdale dilators itzler julieanne unassimilated butleigh cuzner giggled abbou agronomical philomene bonaiuti ottavino mecir mohtarma piteous aryn gallard jundullah cleer javaone bundler 
pyott reconstitutes ribeye mojaddedi lopers seatbacks comported vaporise loginova amping teledensity dedinje boever eigenberg zamolodchikova eyadema ratico fya albarracin ravasi moosewood vetos fornarina solazyme fearfulness neckarwestheim sedlak briceno emmetts effluence meneghini wawanesa wuterich claggart camalig circumambulating mvovo chiselling hitlerite buyung ellinas groomes nayim gearon innocuously gluskin brida mohamedi mewing retha egames laddish rabina fookes deader lauterstein thushara sonderkommandos perspicacious stempniak uud eji globex onofri juicier sebok yeild adul redspot waymart kaczmarczyk naquin walkom nomansland vietjet verhelst colworth soder maskawa hamstreet struther gerontocracy liscomb unmoored technophobia ckr muckaty pannus pouty xylenes glading dreamboats edcs budke bechis grumpiness fadhl jalon labouisse koperberg drunker higsons sentebale myersville harvinder poppie photojournalistic petrowski sailortown taranath cinemagoers proch csfa unrefueled plek grasslike jezza unreflective cowey sutanto chlorpheniramine schilawski sentimentalists lahcen troutt dighe eleana québecois polyphenolic battleborn nseries vaill meital smud blet liaoshen firbeck effectivly barnehurst frequenters jishou cardiomyopathies gelashvili hosam wcrs risebrough kitchenaid sucio cecilienhof dezenhall otisfield twante entraining edmeades olaves amulo jehle linera wihtout lateiner cassen atsi vaccum lucente thees vibrance errm sallai decontextualized rattlestick algan blini rajnish fannon berzsenyi goodsprings kwoh jayes savell antjie kajiya melchiot tabane tankerness hirafu gammopathy abbadi bcca rotstein smrekar tibberton freid tophill nienhuis outdueled mislabelling bugaloos bigdog arkadina kfoury rezidor wielun xiap derderian bayrakdarian sodomizing turetzky mclarnon smallfilms arcadis tejinder sljeme oopsie shirting zaniness filosa ribamar mahtani gaulke wjet glenveagh odgen brushland stancil herlinda srecs mollinedo syde mennenga plean pompeiian 
congresswomen drawling coppage eakring triallist emergences sonidos casuistic ameloblasts writin theoni hospita stranden posteriors rhinoviruses acquaints hoeflin hakel kilbrandon rudenstine gibbsboro gnossiennes guffman riskless uniprix zoubek preadolescent lewenstein sheely allaway lorried quraan preciseness iglu preassigned ceec annouced bouzas replacer ollas gouriet holdups adcenter munchers baharuddin werburghs worrier dolomiten outplay ehman candys dirtiness electricite oshman jiyoung polys vallini whippersnapper swri joung shimadzu mcha nonfarm vakili dawr subandrio veredus particlar hamodia friss heilpern towan wanlip aaia avtomat uner ostby ultimas hisato broadhalfpenny kissufim mulched effulgence sheltie grdina josefowicz eini rasmusson apicomplexans grouses cesaire diseconomies pollentier churchfield bodha mendels yavar brighteners kimlin rogliano dakich scorpios biomanufacturing backpass leonovich klunder injaz roever fusionfall pifs kimsooja funiculì zock mendive mcgoff formisano emtricitabine liedekerke melendi preppers stcs loughbrickland werbach waigel gameforge emmanual custo miit domonique shockproof khade parlak quarterlife luthi rumbler livent paredones dentyne rohullah eilbacher nakaji restorable safehaven gossypol kianna spilker adewole saute swingley marggraf bods bromage suduva medicom qayyarah angoras scoters faleh canizares nanoporous embalm ccie lagendijk zoomerang zorman pfenning megadrive misidentifies concret arieli perkinelmer commericial césars paranoic bolotowsky dutasteride crocks brooder vlasak chimi raunchier leparoux externalize wagih rothiemurchus overbilling smert chikezie zanno demio shrewton parfois soplica schlong amokachi tinson sinochem schuetzen dunnit oxenbury norfolks psaki zukowsky asfordby tigertailz coalter luncarty chhun strutton danladi lfd icor abiyev paschali ripetta cameleers githa auriel grazeley forepaw capucho krauts estanguet turistas dilligent vivisectionist hiatal wessells radiantly bichsel knotek 
metinvest crill speegle verkaik portimao neighing mulvenna sterilizer coccolithophore accessit tomsula norem geothermally roizman assister jader krankies bikeable datacentre edko azhdarchids candiru mcnitt tourian mcus childbirths ljm sodding bravissimo gravitt disrupters qingquan toranzo duggleby lawd shootaround securitizations bunye microcephalic plods coopersburg babani soundarajan antai threee ardler wcaa reice multisectoral fandemonium langwell guanghui harsco wogs kiffe macgillycuddy travelex lansdell yumashev tenterhooks sandjak waide saffer gaelan codax kambanda gudina dhanoa hynie laverick risinghurst teya enquirers assuaging beles fxcm farenheit sigmatel titrating morganstern nutrisystem streetman castrates dasornis shreddies boyata favila incra ursodeoxycholic candyfloss stelarc souse rosabal kneelers iwb presales abdessalam terrin easebourne sanctums stichill lechuza skaug ertugrul hereunto prinstein patatas entezami blx ricchetti morral yorio tchoupitoulas galenson sasscer misappropriate josserand lachanze eesh outsells camuto khee mardie geralyn finham gukurahundi belters wkl nunchuck ostapchuk smtc unpeeled cipinang gibbsville heartsick nonbank pauzé buchel skiverton upbeats vacanti squonk lochmaddy bannigan culliver krummholz contect leav neddick dashoguz neupert startline fogey hawt niemeier soliah anick beckii kanze repola erpen biobehavioral tusayan smos pfennigs blackdog higgens vilifies waria tyskie dineley karpe zhr chodo slps chaly swinoujscie swebus pedersoli mischaracterisation hoofbeats kalva doorley sellable letheringsett konchak odelia gfo snowshoers braunohler pudlo chianina haricot angotti précieux pretium heurich cullera pasachoff kalamity blickensderfer debmar leavel shmulik blumenkrantz bekah winna slyne decontaminating elazig ului basingwerk nilotinib sifo attaran latently sajjadi shajara geuss kotev jensens dequeen cstb gcon burneside khondji ponderings nweke freudianism bermond xuesen adere zahr harmonielehre komano dirigisme 
tatp eortc abscence kilcreggan bouziane livingsocial shoichet anastrozole berntsson suzane goldmining sagalassos chombo sherries shiying taloqan inalienability bvba oesterle refurbishes supervisions ardill hodeida valtin elizabethae usweb listlessness chrystia fluoranthene thunderhorse brevin silverbrook reconvenes tiffeny tardises inshaw biocontainment lenglet murambatsvina slithered chophouse alongwith sirtuins taubira shamos multiair inturn bajic spyplane zawahri srah phenylacetic vaujany choubey liebau uhb cieslewicz pathhead creflo vidim refolding gillmeister galloppa kathrein jayceon goodyears genyk ojok roiled vitznau galavision kalaa jolan omantel filife cotsen bertolli tisco gopac paddison knightstone modernizer hybridising nonresponsive soused arduini kjellson qasba chows unoffensive danys llaneras mydans rootedness prominance unlovable dympna hemington lofaro thierse crabbed penuel mlat neurexin palethorpe helaba palitz stoneworks freret overlanding rbo zeqiri chiaverini expiated gravenstein aliko navigon carlee droppable erotomania smead tinapa benway morain britsh graumann pfaltzgraff tabram muzzleloader bridalveil pazder betye vicinanza antipoverty doubletake relators dallos astaldi ghandy cammermeyer chapelhall shoshani jewellry misener galvanization zawadi harano tomatin buzau phonebooks texels arowanas skladany oppegard dejun chesa annunciator bushwhack werthein weasleys kilravock taghavi helmig himss toxemia gambar currell hanem eldh banyas ulipristal guanhua malampaya imlah engraulis unccd carnon thiery downlisted flaviviruses celebrex almaza spruell schoolyear artero dujana jingmei improvolympic villines treater blackle lavallade opsiphanes karanovic dyskinesias alÿs chronister betaworks sooni brüssow kemsky judenplatz kdrv buckrose giorgios vswr stracathro memi kinlochbervie bonakdar ssees brían krajan guadelupe titterington offenbacher codevelopment emn quaalude idiotically stepfathers steppingstone blindsiding zamore loteria kheil burnopfield 
uswitch dulan belaunde gushee alotta indefinitly autodidacts gogglebox lagemann royersford berends harreld buehring organza timis garelli curico landshark amigoni midón trendline ecgs romell kennoway bebber lizi elisions shestack sarenne geyt hesta wenguang ranua bedeau vacansoleil nrtis affion ganzer gendel opies wrenthorpe exis hure hakuhodo zulauf devoutness rihanoff jianmin hucles vinified authier mythologised periodontist newchapel velfrey toosi kerkeling enkhbold supo ozploitation noppadon grea jakim radiologically rehema yoigo ladarius shamali lypiatt psychoeducational pushup kerris brotchie guffaw amphioctopus habbit teritory hermanis reardan freestyled scarpitta killea clerestories nonalignment worbarrow illuzzi bullman kavvadias guesstimates shantala shora bonfiglioli cendana budington dafnis bwm enpi ajvide pubens strassberg tesich shawi hamied kco loserville untrusting spedale mischaracterising glacken engano wristlet gaastra drye oyola rielle safafa dykman drdc commmunity susil kaprielian acet wowza skippering nonrefundable forfour ubah cerenkov doright wintergarten shakily reznick garlow kutsher krystkowiak danjahandz bergwall reforge azuka qalander astrofisica mbuya usmnt neurotically eue japanther braginsky visn sicking nough ortlieb communitarians totonaca poisonwood countenances namika sapelli citarum spitteler footitt raimondas tjarutja manometry lulo doaks abbeyfield celsi trummer kilobit domonic amde jasad toolworks mackechnie iapt vansbro reinholt unfretted soutra afterlives powidz zahalka merryl chetek rueckert spankers benllech denic giovenale andonis nfip sibongile irrelavant vavi sharikov bergelson sohrabuddin foxhunt boneheads intriago camie breadsticks tokayev llanilar tards nyamu wotruba uzoma tollymore dolega steepening overblow pinworms abkarian cliq elsener odling ferroalloy agyei goldemberg shaheer quizzers inigoes headrow mechta succotash ishima minffordd kelter kondracke dibala carboy flatlines harberger jarah balkestein oaktown 
aups amorth subletting mindboggling durif tresillian annobon starliters deriba yamburg issan jayasekara egidi googlers nabunturan sportscasts hopefuly drivability kenig lehan ractopamine kabiri brodkey alacris againts bordeau riccitiello cuddie lotterywest keva honeytrap sehar credle hossegor ollo nutkins sensuousness karoi zischler serioux aboville charmley sanatorio bagpuize centrepieces ziskind probelm uncared shirly lowassa oilcloth mctc newbottle adman lifg offf mideastern jiyai tremarco polixenes ropin longannet untruthfully bigshot viktorov cratty madey suqami simatupang labate volkow subjectiveness timesaver psychoneuroimmunology komisarek galantamine careering stuttaford christianna bedier cyberwar jackiw rotfl stripers mcse haibach cawdron ayudar insoo delpierre jimbaran multistorey kolenda beleived francophilia merowe sheikhdoms netfront slaugham furd whodunnits schlenker changhong fruitiness eonia empathized facemasks houseful granpa tambaqui whitebridge msos boulangerie hfrs pfefferberg chishui yetunde drph sharqawi mistoffelees solta taleyarkhan admiting mesereau franzia casler molera ennepetal msika reboost ordin runnells gockel oshinsky prezioso thorvaldsens thanksgivings newarthill treatement extracranial kandell pushchairs pálfi alistar lcca jelic gaelscoileanna aalders chryssa munyaradzi tévez beguile calcining undervaluing lineen beidaihe delval eawag evenhandedness uncombined fortuneswell hogtied hollered solarte harmonique sechrist aestheticians ures bioanalysis ostermeier strangelet anido bimla crout kasturirangan pharris swindley kochavi chancing vintry gerven emberson carmines kvirkvelia pogrebinsky laughland shneidman aliana heho spelaeus damerham emana gigerenzer reinartz ddda mulligans godement milki savitz biviano sabriye abraj suthar siripala cadarache kipkorir nendo pavic endplates igcses hajira sounddock kuhar guenveur lasnik drymen plasticized opuses lansman haleigh matsunami serras hermens chiverton baichwal watsonian mihajlov 
votaw mahas drizzly corpulence alloted outplacement mussomeli madain imponderables preambles overbid bengkalis rabten vincentelli gedevanishvili lathing gake carboxyhemoglobin lemahieu ownerless abdulin audibert bucker kappas butembo ourika tiptoeing towse caustically guhl tragicus löscher inadvertence dftd isah glenday replicability swahn golling sensualist rajawali ateeq poulan nuruzzaman schoepf plavix dunkleman barati olajide goign ribao izze glints isidingo etown yucel karpat gewargis synovate mattoso sarcone typicality puddicombe siqi izhak buttitta deguerin ottauquechee sewerby gullfoss eranga supacat roastmaster lidya joines rushbury tendinopathy kopelev ilori yayin barsukov ragano muthee cefni vignoble nacton changjin repave melsom cavils castled undermind humiston saona ffiec panameñista stavudine cornetti latia steelville traidcraft lemerle imperdiet aebn sollicitationis fuddruckers burnhouse scarba woodgreen abco bioelectronics kounen importuning werbowy lightheaded zozobra vernhes microbrew moonscape marchlewski contexte adgate bordogna naqdi boded loick emigrés midwicket soru dollarama cdisc backpedal intervenors wle chunilal zens horstead crappies gartnavel bozorgmehr daylilies karmali lagavulin sulphates voltaggio annely matlinpatterson newcastleton underplaying fossilisation augue euismod castoro talvi ustvolskaya freaknik konen thingamajig fairfields temara wellins iwatake wallisch coulrophobia carntyne cragun rosha listin ostrove eunson hockham shcherban sipra kubina matlala tontons donnycarney nightengale tregs dolled universa kutak wawne dsx elmbank tiy barraco stoerner charlyne klarman stoeckl temko marzetti dizzie unladylike herner insistant unworn yoast unfreedom worswick paen evgueni yamao tantalisingly dolens pluralists karpas prescence blumfield limbal sorabjee ruthann mulaudzi chnage aktan parriott skidby kimberworth oversampled kapon leylands bigos zanker brachyglottis gamesters diagree salafia habitué galunggung marasmus touchtone 
chedjou ativan logorrhea shazier breschel buzbee leyer probly teiresias nanoimprint milz herge ziade gaokao grig nihari xeriscape miptv meulendijks torghelle tisk landtroop kresimir kasman semina annik muthukrishnan unrolls contemporània ekoku jacuzzis linsay lacp tamares sadducee statscan omond forastero ninians gaerwen pizzolato repurchases taquari thorkil westerton dominicis montuori gentlefolk shattock favino omnitel sikkema poliedro ehler tapentadol kikue neuroligin phonecalls drighlington lilbourne maitres vercammen scotswoman genworth selvarajah dorsin barbae heldenplatz sportscotland monden growden wardriving schweppe eatoni methylate alviro timahoe classiest garrad almine cristovão megatonnes princeling eifionydd ageyev novinite draughn perkel unrequired tetranitrate coyte novatel araripesuchus oguma steeplechasers badshot nmk dallied mortice graphologist belenky norito programed obstfeld westwinds gick daesan juguetes sadiqi kayumov yangdon leasburg mularczyk landivisiau spoc lipgloss meeth shizzle nuoto glenmorangie abbaszadeh sidlaw oakcrest schaumberg clintonian hongju spni doublemoon fattoria tavenner letheren parex deltek weat autolib deheza veet wpsd kpb notatum mctighe mccadden trewern brillon matusiak ketron ostiglia demery nieuwmarkt darulaman weixin ulcerate tizon guaranties amkor lizanne streetcorner grigoriadis ardbeg thormann pizazz addressability jonh kerrera barrott hurth gurner quiltmaking cattles kettlethorpe mrem etto boarfish waialeale jianguomen megajoule lieke passantino fup dje cregeen legitimisation starcher jeda worthpoint differentiators clucks tabing brassbound maryline azorian yongxiang twofer stupp odnoklassniki shellings boulestin zambon undereducated usdaw tonkov salsburgh laureana sullying fayçal intentionalism piznarski sipson thiéry remediable aslani alasia emmerton cownie gathegi soly travon yetter tahera antigay chilhowie rieman playrooms attosecond multidomain shingai bitc ssdp safaga thilawa tses ramanlal daudzai 
slaymaker tishkov ballysillan stephnie fredricksen skillen answerphone wilfong scown clubcorp elver permanantly zaidis leatherdale nsas sphenodon bilour wizzo lenko camorristi therien andriesse pingguo thumann taktsang mironenko assaultive ruinart colletto auror rosu learys roughening windau cutforth saulat indictees ingrain glandon uncoded nebeker gobby brandee hélas tauseef cyclosportive feagles potulny defuser curtner ufdr varazdin cartal rediscoveries freshney zuhdi micciche youngdahl maranhao fajita branchage ternura perminov postwick eunjung xpand suppo boldo pipavav inion colangeli sterilising halbherr oakthorpe butovo dunavant blazingly arouri anderlini gilderdale unventilated whiffle huadian cramerton intersession chymosin kruschev gemütlichkeit sanitise davitashvili doeth koty allmark dsj riprock atouba stewartsville cancilla liiceanu bempton qzone shenaz torkan amandi callil venrock abramovitch yrp recherché huggler find recolonisation ostracizing cayden depilatory keppinger nahunta mahaboob pyramis schork wilmotte siok univerity hedwige metri urozgan lessie mettam lighthorse fluffier putes fatburger jawwad summe nehal bignold gazetting natassia wedowee haessler cordobés confounders wilmes zhongyong nevarez mohenjodaro durakovic gosto auditionee yosif benifits sklansky cpni serga hwb eisenreich crescencio tibias scorpionflies lironi auza drogin mythically wellingore parapluie humectants motsu geordan mooned epoxi gonz milies verheugen relased hannock landrush henker vanatta macombs cholent chimere somekind seedheads voreqe delgo aylmerton görgl hurns traianos microtransaction porthtowan kabil disintegrative shadeed havlat rodero busicom leinenkugel gradiometer marrieds vittoriano roths inconceivably neukomm trashorras gursharan nkosinathi postcommunist rooley mcclement purewal iswaran paralomis guessable romberger insitute vbci rollison drawdowns stansel corteo oblivian superblocks ultracentrifuge nemescu umschlagplatz extel exceptionality lumpers 
eberharter stuard moshonov soekarnoputri monya secton kasilof nervet worriedly riviresa cofre nonconventional filsinger pupping hoben hiriya tirian arwad duchaine brooklier goerlitz livechat humewood glenridge noisome cranton reyneke bood egcg besford pchr dutchwoman tediousness kenson animalism wholesales catterton moskito kresty deshapriya keells fakhir lavrador superflex rutty filar alchohol dupire kilteel chalkias surreality kinnerton nonsens rodis rapidfire hashlosha dessi whinny barke cystinosis forcer eilberg brodnax sevele courtlandt eestor luneau lahlou hokin periolat miag wildtangent schowalter homeworks januarie hefin schickler groundballs cassada unsympathetically steffin cretton threated gusanos griswald leshner odoriferous chatam impersonally jaradat landford incontestably gynaecologic barkby autothrottle hiltrud lossie mittelstaedt hajjarian alspach bakong kics religiose clutts cristofaro viperfish waymarks garamba swaney jonesport skotnicki sniveling excrescence muqeem petai ghez masterplanning sycuan pelter mcluckie cricut shchekochikhin tatianna hobin swinefleet lamay cambreling uuj todorovsky ambah tetrick siafu coppergate supprt varndean leapin vitrines dongala readdress weakfish chantha ghadr sidlesham schairer kathputli vigilio doenst djh zayatte gbarnga crosscheck toffs abbotabad ousu burcot sirven winecoff dearbhla mantzios gilfus tempestt toity mccarthys grandfatherly streptokinase esera sibiya trisul nowheresville girardville neteller tailgunner proschwitz teambuilding hempen flubs chaye gunwharf fratini bundock lahrs meggetland statists ssae gammy losang wyns renco negationist immunised dutse carabosse merav macchiarini rediculously blezard lubambo broco oatcake jinro marcal macchu ranawaka reverentially unfed catshill scamarcio dassie gvi rosharon nassco amacom cqi handoffs wijesekara kolvenbach arranz lerolle balsiger habitués mhf eyedropper shayegan mageean vanacore mauran zosen erechtheum symbiogenesis zfn razumkov zhenping adverting 
butes daypart aigai pagliari fanpop rabea chamari screamingly kyar tigrayans scawby beckhampton arcelia hangi kasting risperdal shaffi nkwocha hagmann giannuzzi kampfer vouchsafe aasiya moreoever kliptown gorzelanny kget leckenby kabuga powersports provid paypass jackpine guzzlers otolaryngologists beatley kukors herodion messara gameness corseted loomba logisticians matatiele käpylä masturbator toter glabellar krysiak narrowcasting llona pedmore europen sevastyanov nabiha twb dcos ferson keynoted roadrunning shkval affectively ghods valaitis bluebeat doxie hannesson steet serein hemas wolffs moneragala spinderella bågenholm gfn allenson zorb vles fraternize funiculà chaun gangbuster entranceways cango lagrima stupidities artyukhin trooped clarine obita miniaturize montera iuli tekamah intead barau pekerman riffola mondatta pennac vasher aggiornamento ivankovich wikinomics flein hittner nizaris deskjet unmentionables rizokarpaso appelfeld heshmati barelvis postoperatively neurosky deobandis gemba pianoro middleeast argota guazzini gullivers zart bioneers transglobe houdry herbertson courtine lassoing bkd dahsyat hanut lienert permed paultons letrozole donta celeski sandside lubaantun greenmantle broinowski notecards calzones borgeson eurobond derbe mullova bodla fontwell mastrosimone denault guidestones penallt pascarella leitman gubelmann bress clunkier ortgies sheremet rangsan weisenborn ratray blandishments fianceé bududa ciocan cortazzi boucaud skytower matrioshka eychaner cdbg undammed kleo horkstow senesi varsavsky tittlemouse bestbuy colorway caramelised underconsumption proctology saubers shiroma mirkovic unretouched chalino jeanmarie oreti distict mathmatical cherna ahlan bollan langfuhr shelterbelt humoring greenert sherwan brank winkerbean clevelanders superheroics madkour chaudary fuhua soeder luiseno kwoka suject serialist brockhall bonaiuto hartill vallery giuntini asoc tordjman jambos nwg capossela cissoko fraudulence christianise immediatley 
predetermine lomong armacost burrator bessan econtent gambrills plentyoffish shabangu tingi vademecum kludgy aped basked akkermans neidich powhite swashes solomun afriforum paixao liliensternus auslese gaziyev merner bondt shakhnazarov decares rosalio zabad norbrook intisar antiscience elluminate kbmt narjis shoebury grichting mintlaw mendicino paleobotanists breadstick hirohisa dluga karoubi artal catic gestate partygoer abdelrazik microinsurance egwu pharmacoeconomics bosquets rottach stuntwork saburi buccieri toolmakers busato zaro yasutoshi familier promet benchwarmer střední lunetta logistician newbuilding dailytech prpa mendini gartsherrie fumigate gatehead petrovics warsak pwds waisting quaeda dcist leale chibhabha teresi dulle brauhaus mofford aphorist unfeminine sabras goerz stiffler itals seelan sliney hoytema myway migenes hipps bugesera escaper flimby narayanhity bausman tadini karua oppama scrapyards khormato lofti panzi maikon outclassing benli outmanoeuvre suryo apelike horspath chillcott neuza walstad libardo dagnan giusi cmpc terendak rowledge batangan leunen fracasso whitecoat sibbett somedays mommas montemurro orchitis sybella gaffie hammud delyth ilum tapei dahna nohar stoneferry caporali speedtv venturis kaufland roelfzema horine bracingly shaarey bremhill treichler kornmann mindspark multigrain kyries sprey scribblers conatel zinio rasulo soderquist cukier bulgy monosyllable cobwebby fundamentalisms oyvind hobbyhorse malfitano clps deonte zair expedites entrenches investimento kaniguram longaberger toston nuray marlise sawali autoinjector tahmineh eventi slooten promotable slatyer leobardo triviño wahnsinn salvato larman wenke songzhuang parkham ginned pamelyn brainware irrigator bolerjack ehiogu keneseth rebrandings autocracies gelligaer dipsy briargrove deputize sirak tigi brye dombroski stoneywood selvakumar mobisodes hexing alfold offseasons zduriencik hemudu domainkeys glantaf skanking demayo dandapani jakovic raydale kesc baculites 
symptomatically woodingdean duyet netanel lockets cercone despondently sanzenbacher cedergren longinotto clads brackla gadzhiyev tugend mckeating genaux maccartney chando pendolinos usts slacked datamonitor aghahowa naturi ozwald irro andreolli beraldo shipu merkers kometal flypasts wickline conrath fazeli naysayer citterio hinchliff blakney spaten bburago glieberman macroregion pulgarcito dorsets lobanova recyclability sobri chinelo triki bullsh aracataca vitripennis legwand excellant qassams cumiskey accessability stellina schardt takamiyama autocratically splaying handsomeness lightle cannich wheatears malony kósa cloherty manycore ropemaker velindre kcsa skey duffers motorboating weinsteins ruukki danspace costolo barouh demobilise callejo welfarism buiter noumenal bowties hualalai alethiometer ngmoco jerid marciac primitifs scalea pribylovsky meirionydd teruaki unislamic schmidtke qera lsis firestreak huarache emporer littermates delaet divincenzo conventioneers lewison unexpectedness junzi dhindsa paustian weltman wahhabist gemberling casperson lvx segas falstein westering supersoft bulks schlozman ticklers arasa almana byrsa mangaldas lajčák ayoubi lojas marychurch tgfb wideness pochon nalaga chilango colantuono ciesielski saltzberg weisenfeld sparticles laventure junking lenhard spinazzola subkoff hoffpauir photogram umtv haltingly hariyanto transmogrification cornpone frankensteins darial mcvoy asrat intourist gildernew vavoua doglike migden midcoast karats graville adande pontani snoods thornless distractive cume whitesands broadie harple nirupam hoity blinkbox fremlin hainton vigabatrin etendard matarese pait photolithographic lvsr argetsinger lapidot wittenstein silatech cogo aimal karandikar hypocricy rhwng biever snuffing wensheng sofri tackaberry wibro tennenbaum kapl hyseni bowart sherwoods pzc keratomileusis akinobu binzel spelterini pretreated joppich flutey ismir surani kinz bonnefous mellion bryza cornershot latavia mehrabian kassianides eventus 
catholictv spinx brassicas pavley aldaba bacala powerbar antias tipitapa crudele brilli niederkorn drobny brimscombe nuveman bassolé barnicoat weinraub shelle beckstrom hegi raak utila compulsivity goest semitendinosus synesthete futuristics klops acquiescent ulufa rsna narwal muzzey poipet stiver urbanworld volcans mmea vleet ichino galiardi vallario itq adnkronos itos gancia catledge vukasin landsmen csem kruszewski quenby mpwapwa yuriorkis colligo winata vukich mydeco primitiveness dbsa bentov schwentke hspc burgling donelan zakary untrammelled eacs barnfather kashka bhangarh valcareggi haketa kalskag mansourah nutrigenomics quataert puckish computerizing sabby scarnato mistic ghettoizing lomell rubashov pavee winep trolla touchup woodmansterne flatbeds daeron quinol objectional smedes antedate rosebraugh stearmans hedayati arenella wsil vredenburgh behren barany awde fazioli grasty minmetals salvucci hunsecker hamifratz balzar tritech polypharmacy takana zentralbank sundress imvu minjun schloesser bodge ningún kebble ferrochrome dawdon scuffing darroll unadopted hilsum glimpsing prawle roscioli tunstead almirall eisenhut pattar rapho mave mohajir soltys enamelware trzebinski tintypes piyapong ghaem louvel dongseo archea cytter transgenics raedwald sternest martinovich haisheng contextualisation immidiately noncustodial adhir sessums mantega scandling parlante louttit rumbas brightwork chehade samaroo unat townfolk hillstrom mories baginski hayel chansa ngls zhelyazkov ecrc habila maryculter hennon frivolities enano ebird shaymin subjugates herro maximova tarman mynott strewing timbale mckhan dosnt eljanov farfel baccini demsey sixpences deltaville pagden consigny schaafsma marford alashan chinyere lowness zyvex bertoletti gasfield larri anandasangaree panoptic lyndell crating kijabe rouland selsam lebogang doodletown fuan pinella lram reddell molterer relatio occure baxenden descriptives rhame azkadellia pyrah internalizes schrempp universalized mazraa patmon 
grudzielanek eternit txurruka stagey imerys quietened islamification dulhaniya wholefoods sinornithosaurus uproots ticuna vuyo leakers combwich timnath zigo shamwow reman architecting howsoever steffey wolfy tockwith soueif trenholme rorimer tieing slcc corales nonconference telestial tregaskis baisya cocodrie damaturu htat flourtown hosemann eruh hangups raegan valand tristin caldow maftei louboutins samawa bocs kindhearts hopcraft sakir loyden pelles lopokova costantinopoli intercutting iccas magaz quids recomment helem desparately hulings diegans nmw georgas bewbush coulterville matricardi forand straitlaced tackiness emailer athersley razzall dollops keiren saimin vraca biolab labeyrie rygbi moati beachum howaldt frenetically mbogo francophiles reçber chocolatey cherkaoui cnngo bridies helferich embarassingly cervara hazelbury cimoli cotija flashings stansgate bartleson pasito tartaruga prepositioned shawsville ramappa kevi axius carrollwood koutoubia credenhill erakat torbinski qixing cumparsita gazali rowntrees brusqueness pomés elystan opprobrious sollett charlson schmiedel shafrazi mames kochav buildout barbini bombon systematisation bioid abeyta backstabbers hypnotising kaieda invercauld nxr howdon stuben mohabat oberkfell nezavisne individualizing chernick chachas sembene unsexy usnavi krummel pvg macroprudential naah austerely xhci playzone raser precarity highstreet stimulative rippingale humira youell dalbey mulumbu qaderi gamme larranaga cannily shiremoor astolfi kieckhefer ruthy fnk repoint gatx matsesta allegrini herlie workum ippodromo dipaola spelter dhawal aleynikov paddleboats junsheng gadzuric fanboyism timotei wardy fogal hanzlik dervaux migra usam scarified endearments kegler orebodies ajamu flanimals salinarum stashwick snowcats eworld kraprayoon lazing braxted loanable folden ndx tomobe njeri suppurativa simpleminded tonno mccullouch rakhmanov debswana cashflows heidkamp lozzi centrebet beggary ncnb convallis hatpin listrik catchable acnp 
schaad beatts gentrify matvienko dannhauser trevin dible vayner russen ababil lavonte barrasford aiso busies maims dejonge parmjit musuem ifq hdad winterstoke montets hollyshorts helos sutliff niky khuzami ashante eastasia plunked cardoon upskirt zuban stumper shorthaul cardis braise tziolis songline dochev tayag plagerism grzywacz mellott federle opra scantron yaghoubi milmore stryd cafcass hurrican sundy jutzi horticulturally atacms hilari gorditas krans dampf oip medialab immolates mbongeni janning lawnmarket honeyghan cascione ditore dynion skepper drumochter bakich trahern hadnot galí spanierman dickheads dubbins varricchio coraghessan belgrove gagandeep baghdadia rentrak airin bierbauer arcieri xiaoqiang achcar usinor handcross torrelodones advisorshares ndubuisi mahinder gpos systemization panjwaii premisses sflc akitaka ozen eximbank daikanyama tdecu agapov chuao zelzal woodwards intuitiveness relase penallta lebhar rebuses ndukwe grispi coronie bawitdaba waorani pierola almay cusic arvilla mishor nissenson trivialise temma hahs kaethe overplaying ricon maysonet zubik claycomb teleton arturas emmaville ablow appliques wasen centenaries ballycarry pintscher scaur verraros polytonal migrationwatch badejo pressurizes ipet nanthana védrines crittendon swanland xijin codel beutner dalindyebo manko bortnikov mucca calza naghi teguise carnduff oakshott brodney natia premal infoline silenzi myia dettol okposo thrumpton choat phix spindelegger meert allahdad baysal smallfield starkist cpam keano psagot santaland palaeontologia swartkrans vernell seastreak purveying dakotah naybet cutzamala bottome borusewicz noemie glendevon eagly wallendas deanes gerhards gestated culio gurode copmanthorpe nanosatellites heckuva liverwurst undoubtly prifysgol jarnot mathema adamsdown darinka effused moonlighters glocksen waldrom sevani ymha merret garlieston isns keliher kozyra vilalta shadier ubari hohensalzburg shmaltz kouts postcoital agentes fruchter sellal gassée todisco nsga 
kitigan kitesurfer castree nkotb englar mostviertel metrocards moema currenly seabourne agy kristien mitesh tighar naison sleptsova cauterizing cracklin stanojevic kalida afew morowitz bonesetter depreciable zorbing egomania firstmark seatback beguelin absentminded gyfer schoenborn darryll pilotta zinsmeister odenberg doctoroff sebastion facil wattay vranken attaboy francom samidare quirkier parkerson mailmen srikumar hesterberg carsphairn raphi matech rachlis kbtx laskier artmann meghani schnider postnuptial tholins sleightholme unitron liberations spiegelhalter telectroscope makarim joumana mridha serradilla bhagyam relighting coreceptor naica gscs fcsl mediano skyscanner ehec eulogizes pressac mieh contractive ashqar eurolines nemazee pantex baselworld mccuistion semenko daryan janitzio hertenstein eszopiclone kankava zelezny rafia maisa securid enflame beyazit tampopo groundbreaker padwick ferral buonocore hightech candiate serwa levitch augenbraum wittier polyanthus filmhouse nalapat daubhill ktnv sterkel ceris harmelin emed eirian scarlette lewandowska mareks adge failand rosindell waxie bourla hypercar yonamine thoses fatuma gessel lafta clearcuts combien fielmann prenup kalder datini motlagh wwpr klawock hosi kotis sirenas carcel stefanko semiautonomous ruggie svenn diglycerides readsboro ahronot waitlisted kurtyka petrolatum txakoli djojohadikusumo himmelb kerttula conkright reamon stryer ouca weine torian barcella chetia powerhead divests spetchley whag wildblue stanch assuredness suborn parsonages unscrupulously pocosin rastegar throckley kerosine depetris nanopores lington argolic centron jetley hattar saltuk cauby kfyi nibelheim areeba dtis andreoni vfg oleguer coviello carrasquillo akthar bickenhill sebba tchula shakeshaft bragason multiplay misallocation smartboard sures forehands gabbie peñaflorida pulseless rbz livas sagkeeng moaner metroshuttle bioengineers morrocco campervans fishcake llanfechain lulzim mbct farraj terdiman metrolina peteris 
shareen ecofriendly hachey dishnetwork farberow necula collingdale hifa panthi kabob mealworm teji scousers genesco creditworthy cosign migh blahblahblah outgoings janczyk fraiture abbeytown nawroz reservable devilry trebling ponnuru phalguni dzeko bocoum stierlin gianetti bekoff azincourt peppo tearjerkers duscher sandpits wetangula distressful blared schlaudraff electrodeless desensitisation penurious texico ecrs anatel kramar mâche halemaumau megahit sabera ulitskaya lahia hisbah repeatly sangstha langhart kaechon pavlis arciniegas hignell ezzatollah usx rusco kowalchuk lowboy pitchess machno albannach usin cobaea idbs rohrbaugh nisku shadowboxer maroma safonau schisler dyfrig sherando taurian halliwells wizzair tengan rocos videon muoio maryjane boad fraschilla eclectically wormold fibrillin bourillon mallwyd kolter villan vemula metris lambadi comau welcombe cawker sivtsov lashin mapesbury vearncombe zoltar cambó kambo klix wallpapered reversable yasuchika potsy periclean shumon hisahito remoulade eaie pouts ferneley shamel raquela gawlik dqe tessem walkure theire demutualization salaskar akaji pilkingtons revolucionarias troedyrhiw nolt lebrock zvjezdan bedrocks esurance soudley rahama plangent gheen bortolussi poltrona cichero wentbridge peahen derartu niebel kharazi telesales bearley kopitiam bernville kiriasis dunakeszi bastardised dajka browntown pecvd theimer spaziale matovu cephus duesterberg cynghanedd luxin highjacked skans aberdonian stitser hewanorra harrowed zhongxun nbad maos balderrama enev eshkeri mfw activee subotsky nyoro scil garantita covereage quennevais professoriate vulgus natm fajt tooths akef piskun nederlanders laurinda holbrooks qfp sandbridge jacquemetton exeunt bregy rentaghost requirment marrah tida abertis obloquy swicord nestin telegrammed pecorini blackistone ripoffs teleradiology procreated tonsberg nmpa darrieussecq zucchelli ahlbeck taylorstown kyogen wedgetail reinvestigated varuzhan biergarten fams kanevsky manaj poobah 
rames diqing cawl fickman melady schlachtensee liguo alaïa samorost terrones augmon mweka kanagaratnam schefer heliskiing addyman karasyov bilkis carthel maulden wte ewm elw khogyani eigler woolfenden elsenham pavlyuk citypass medek nawash godtfred mellars rodker kiona toubab sanjayan rudden xceed stumpp ulev kammerling grafter laverstock yuejin botwright tarpaper tunchev ananthaswamy sagiv mlodinow cherng myfox goofus macdermid frafjord radric agboyibo kanoe hypothesising yuanmingyuan scurries anbumani natalina filipi whisperings haneen teddybear shovelton walasiewicz rotovision ballem levenberg middleclass robeco genoways grgich brenman downlinked lavrovsky somport jenkem radkov dangoor nsse hobnob unengaging arafah sorbets itsuko fruitport panor hairnet cannnot ebrt compartmentalised scattini prieska chaison salunke scraptoft alberstein habomai vennel reroutes wft pollensa holtman karal dajie schouman mukhriz jaklin kdic parvan trainspotters doncella klieman corlette dairylea kazaam loeper heilbut edeka karani brainlab endoglin recidivists demaris candar doddering heiloo herchel rafidah aaap deadened castellazzi tsaritsyno clecs adili fondas magorium fering followership subassembly trilliums matalam gadhia celena palmilla liraz dasatinib gombo cammas typhoo afmadow burgis krishnapatnam macoutes vectren amiriya midlander kalashnikovs bellicosity kostovski profond awford mackean jeffires foxed mortgagees midsole climer pramipexole dolichenus steira melanins edvardsson vallar requalification tuley chickerell efird ceibal kicevo krynicki caqueta nakara zfns jobie pickell clumsier walburge dalya mehregan rabanal firt ichiba parleys serani releford jvb jevans abettors mcdean transcriptionist ferryside lasala dilbeck denuding colums artifically lykaion micho insu underrun pillowcases hpw kkc dresen hightail splays drospirenone recharger cizik gumo wijers througout hombach determinists potten usni ulner crole deffenbaugh angelidis pvrs oversimplifications teggart 
shapiros purlieu suspendisse influentials oakamoor okah defleur njue limply feinting krainik nautiyal paviland eksi osmany inhalable mimbs janwillem exotropia prahok malltraeth doggers brownsword trivett polon metway microgaming ccid babelgum lochans bossons mve swiper wessell boysenberry mlbam gopalaswami britania epazote ghostwrote shalon jpi sensate bizcocho stehling muhib cusop adang dobrowski kanellos dibbell daybrook gulyas consigns kemner stanke vickey pineywoods dsme kizashi naguru frakt tatad tahri morthens pliability ciociara olofinjana gunselman samotlor hamdeen manegold musella shirasu inventure paskey yinhe countercultures elisabete nimesh derriere jesser pikalyovo onishchenko sommersby nicolelis murlidhar abraxis waterboard enig burbo thanatology luffing oxcarts pacesetters marbleized tichelaar buit likasi salsero collbran frizette shamsie haemorrhoids violetas hjc cuckolding trexlertown mcketta capuchino supose beug lacerate cronenweth tourvel steinsson tennie remingtons passings semakau tiozzo rvx trumaine immunomodulation tuitavake stobhill pritsker dewell knowlegeable evictees annella hancheng tiddler pzp schetyna gratteri kroell fraternized punshon pattyn clonoe lasica matrics dols pdgfr tovil tredington dinelli mphela interlined teenick celene marrin ecopsychology mintal sparco tropicalismo beinisch bahre qorvis jubileum fortesque hussan alagiri osisko hillmorton uclh venson passeggiata comestibles consiglieri coheres verley zirndorf sutzkever schollander borck lacosamide toczek nedum rosabella rafu lecturership kreutzberger anmer jicama morvah ronak saidel shahzia earthsearch dallying holtom amera terhorst edrs hulanicki chachapoya geff nicchi sateen reprazent miscount givon gasim fulshear wartorn shoenberg sourpuss dileita bwx propositioning chunfeng cypres bodyworks goliat minich fullagar alnitak disestablishing iodised witchetty evercreech exeption lumas olenicoff bodys ghiglia hypovereinsbank herwald mour thropp kvetching nabala llangibby 
imprecations akoni crazzy adducing earplug papoutsis verbals hipotecario ghafari sweigert symmetrix trows curgenven ladak takfiri khalfallah disgorging yaccarino grumblings kindnesses déesse xolos sayres ricou gloomily snitker chakib mjt faulder degregorio goddammit lierre lithwick diagnosticians kravetz runing bishoff verkhny chemcam muttur turtlenecks pcma assistence laharrague invertigo gayman cuntz kelway suggitt bowle khmaladze neostigmine plumping simmo yamoto monopolizes nonrestrictive crushable deaccession misericord gotchas pomares colaiacovo schoeps brainteaser dirgantara sanner somberly nfte griever sorbier sanft jobbins sayulita bossangoa henly grabert zulfiya shukman fayence gjirokaster pasayat sovietism marisha nickens cambert afable halloumi shizuki marchbank geriatrician formell definatley yhe shipler deportable shooed seife unwieldiness naken wokefield esops stoolball kislak casola butana gothika asrm cornillac hessman adenomyosis goodrington geritol bullmastiff deshazo mcway spinningfields boonoo bertolino waistlines retoucher sankaty nonentities fernao discrepencies riedinger desvaux phytoestrogen hunkered torosay alreay jaymay carree peyronnet doyens tetherball swfs cusiter shinnick alsp besseling fansler protoplanet kainer cowered essy samrajya loogie reusser xlb amerine lazybones muffett siqin smacker delfini moscot lupoli henio itex bellota rautenberg gernandt penparcau anuszkiewicz ramblas flagstar dejoria crossdressers exogenesis jro resending waswo kynan rollcentre demopoulos labone barters spacewalking alfreds pneumococci hudi wahib wahweap ajira heerema shavkat malifa bookcrossing reacquaint barryville przemyk catelynn aereas phonautograph franscisco bioelectricity treister bigscreen woodsongs boghall gujran hjartarson bodnant forsooth strassburger chaises comen wildenberg scrips yering chitale herschberger neuroimage tubeworms unsucessful amikacin borenius maxvill mohamedou jauzion mapou kildale soumaila shivashankar kurtas stickered 
kreischer jdate toibin däubler worlwide insensitively rahlfs areti ispo enjambment dudmaston raimunda rienstra unlovely cytokinins maschmeyer ulanhu genever benefiel okorocha casimira rozon cutajar catinari brearly journalisten genetica kdnd felbridge cdis canouan rakhat dakis laneham shdsl romcom windover galak berfield pernia cymreig islamophobes rubaie holzen gtfs popera swarth surojit alotau majok halcro orse iene quesion nerikes waggling heginbotham consolatory ranworth wertsch barbarities kaffer napolitana contradictorily ultrasoft naeemi fomr deontay assemby beltone ridgetops stainburn mehrabi marooning doumen jinmei cillizza ichaso rsls saei disip zekeriya delucchi arpels rassoul strettle colsaerts debbe overemphasizes vineeta charminster weimin misplaces liveability limpias blacksod pelon ieronymos luw descibed pushtu alfas layovers aukin oxpeckers hoshyar antley suisan ifft deconstructionists kuney pedone haziness taxe crays landisville semioli ntombi bloome maloway tatoo bahareh lnl minhinnick countrie klawe rawabi zatarain unruliness bacolet rheoli differnce facism neovascular drear endelman odwa marquel xiaohong kislitsyn camou netcare tave brooklynites savely lascio inure crues javadekar rowthorn muniesa dismissible harbage shortboard moneywatch likings istead arment pietrzyk jackers aperghis aobut ahdal rosevelt dumar finnentrop settees edaf heidel matveeva whql chernus taxane gurak yowell alsberg gladyshev psychopharmacological ninewa slinks decriminalizes squeezy titze birdcages pinelawn riggott aaaand nurgul desensitizing hadjer coonrod garl karsums falkender flegt babuino umán perchloroethylene lakdawalla dynam tetsuzo accretive laurenne bozan tijan nausheen bootlid slopping bendheim opernball orfield sanlucar brogger geophones bootlace glueing ojd kamryn longshots thoracolumbar hobman hodur fonssagrives traini cyfres liebler personell piousness malashenko maslansky underdiagnosed plackett weiman sacrificium guaco gerstel sysmex requestors voytek 
piemontesi forestburgh waghef tsunekazu jobarteh bumpkins tandel extraditions vcsel wucherer kiviat llanbrynmair davidowitz jiayin wayson winmill gollings gurrola ospar claffey goetsch cangrejo solu changeability tristani harwin achivements philodendrons atiyya luecke huges chinnici murkiness expirations hunthausen selloff bramlage udca biolabs bozzolo zarghun snowpark magnini mchc huanuni huya almrei profoundest zaraysk finessing ivery sawbuck menzer claimer dalmahoy anabuki nyyc shuttlecocks easther oecologia containership nissman triolo mirfin sankeys skanks sundholm baseggio nevland maxa bubis dinunzio resentencing cacher martelle wizardly holshouser guyville jianfu majlinda thirza kypseli bodey isesco photocard pessimistically altenrhein kalachev gamefly retributions tetaz tapuach souers kampia felinheli khamar easo paroxysms upthegrove barbadillo horsington exulting stoppelman sciamma annalong dabis levines leinert ycf hrbacek venuses ristretto ilna outted lacker lahiff banier groopman editorialists agio effulgent longobardo norimasa oriane dunger hellam prunedale hovda primmer arroyave hawkley homeplate ribic izuka recrossing deuxieme gumpel unsurmountable linteus sensitizers lamna charry derventio intrusives golinkin xihai ribolla dilwyn mcmap voorde aamna ihemelu superpartners richart yongjun zasloff mazzio deibel nemon dfsa wakestock duder fulgoni fangyu qct ubiratan kobeissi coniscliffe suctioning waterscape gormly krin complaisant mclaughlins prifti tarisio hemm aneuploid ungerleider chunnel coquelles lyophilized danneberg mely saido reflectively koppie breeks oberkampf shemin kerbala compulsories darioush genrich microcosms roxxxy nocturia bicket hpac frizz grandcamp glub mariné zuccaro scrutinises whaplode zecco polge trefniadau brutalization layaway ozgur ucatt senegalus madaki gosal fayza alltwen jangles pedalled vpf loktionov altoids yingjiang banjolele suburbanisation poyntzpass spendings leonidio eskbank yetminster gaeseong credenza nadeen twing 
tkts rehau msrb intell onik loughry palletizing sanctifies herda sprengelmeyer lockdowns huesman corpas stolzing thamsanqa barbazza vocalising rafid csere mccorkindale kurmangazy brakebills sheilagh hardeen koert commoditized shahrazad supervening cavalleri amsallem ashei blooding copulates calik unanchored irdeto boltbus laywer alabbar madchen penllergaer crowcroft costières piiroja cantet pueda uluwatu gilkison alekseeva exall bruehl misappropriations dustoff aliette rosefeldt bcaa vencer clareville xintiandi backmarker wetherington boomeranged spamminess talacre namira amphicar kayonza schaumann aleisha parboiling extrememly indulis noncooperation unreconciled hengelbrock wordsmiths yian gogland aldam inauthenticity lingang fibrates balkwill sprackling staros cassez feriha craigan payees thers heathy fundacao flaneur discretions anglicanorum sheas tarum rovensky baojun taton rapkin spiting verleger tabarrok radnofsky glioblastomas draycote berndtson varaiya countesthorpe masui hochbaum quaintness pluckley fellig reifsnyder blanik politer appaling thell andasibe clampus itre schwieger grimey cefas yotta rehmatullah schwalger lambskin opex afran tristique leski erdös zelasko thwack tomasevic llanddeusant diamantinasaurus klonowski swithinbank corvara folbre lodish tatro comprehensibly stingily kuun innsworth soboroff rodder hansons dunfanaghy shivshankar dialidol clyman magnaghi dionisis demographical baroin nirim budesonide tzatziki accoring torlakson leigertwood yoox rekhi taransay pandeglang hatvany heidelbergcement unordained zuider invertase hellawell fehily recertify dodrill dymo benicassim unstamped boluda adsa gostelow ghostzapper demmer hassig mylitta umcor corleones cinisi greenfly shawkey nosseck scheila centralen penrhys lockerby linkoping abusable colantonio pitahaya cruellest malicky coldrick crammond mincey burcombe clipsham montelena lustily cmha kvea vrbata huls storry diara mulherin aedc choise ganciclovir cazal nordwall faezeh etos zients setser 
arade northeaster fhimah lewinter pfra jrfu exasperate nwaneri onischuk ispot jiggles tithed sofias lscg yoncalla farrance immeadiately freehill nebbou crume frontbenchers jerrycan bontnewydd yearby pressers palemon dizin hockfield faki founts eataly hamood plga kortan underpaying kunimura recommencement kteh matloff boardriders rossing svare mssc bizhan flamstead incitements chamu montek frolick fawzan daequan egglesfield rutube aibu rabiya ldas musictoday adulterating lochgoilhead famuyiwa giovana scripturally hemsky rawstorne cenex abeywardena muffles applecare willke ocassionally achten signaler memorialising nsbri skovdahl surajit rehoming alldays krims practicioners autissier anthocyanidins sneck allert zwiebel kayaked ,we bismarckian remedio foucart tripes coryphodon hddvd revist wiggo stefanel cegetel villalva gloor hoelzer tinte survery exquisitus expertize olara utech calyon gennaker phia quantrell christow eunoia latry larapinta bozovic heredero prosperidad rotz rooves lebara supressing ferencvaros berdmore linighan crable edolphus flirtatiously overstaffed witbier jennerstown chocky dorschel priyantha rhydymwyn samaroff aknowledge maplestead ripudaman byshovets compliances veguilla liani vichyssoise brownsover toura harow onetouch chattrapati wittersham ogley agrotourism inergy eassy seductiveness ctrc weger shanell uludere mcgladrey neasham xilitla documenter hardknott mohareb figues motorcoaches rogalin kelleway postilion hypothecated staedtler bejart koiwa paesano incautiously thoes narissa numbersusa lisabeth belabour hexal kormakitis zhus waugaman hommet melgaard kondoh wenski chunari moty gongan licenser ructions stedham mcwatters firstbank zabit beaner esplanades garinagu roadholding espon measha isdell wolfsthal mohmands shedded zubeir philippot castronova andrabi jovel holkeri shrinker sealaska mcphedran jordanelle lumper auder lisagor eletrobras gallimaufry breadmaking oocl burrabazar westmoor numerable garçonne nanophase kanfer joyriders 
peñoles litzenberger klimchuk madheshi chavismo sketchley buwalda brasso jarka grèves lorentzon toonces adetokunbo populaces reinventions nacchio dongarra napiers treepeople shifman kandos sylvaner curandera unimprovable cantel dlugosz driggers mentation hoarfrost perly lasciviousness lual deyes ineradicable mxd jouin earlsfort scriptment cfius koyra canabal freestate janala grinko boolarra youlgreave synthon cente ryeland ellmore alaykum strel cornillet cabret deadheading fmtv yehudai karenia usurbil leibell transgenders rantisi nuttal shagaya holybourne kupreanof mallorie buddleia khanvilkar vilija demarre ebaumsworld surtitles werlein russets uder medicale pechmann vorobyev matzos commutair scholtes homestate baddour muminov inebriates matsikenyeri deathtoll boutsikaris haicang grv gigalitres eurodif erzsebet rintala chalerm bladud rudetsky serrin ridgen diapered kobes larbey plaa preventively cristos equalisers hoften zembiec cubbington gulabchand dunsborough gergis debjani jinshui legan whomping ostreicher mainbocher weathervanes spota oduor barisic rememberance aestheticization pizzaro geoss navanethem duramed bioequivalence divito polesden daurat aurach carports dongsha castellino amea jonsdottir maray muirhouse crivella nonsenses babyish gorebridge nailz belger uhw noirin matchpoints merceditas calihan oxleas starrucca jfr ballynure motorcades zylberstein khachigian ekker erdoes equipos wrings gkm kichiemon figley stagnancy schonert pathy fenlands grawe misdated banquette pelto waksal sportives schlepper diala ecolodge smelley regney diod reddam leyna aircell khotso oringer crox sanitariums rpsi izibor peirano brackenreid joern pontprennau wadood empedocle chocula hernon worklife pecina wemmer johnsonian starphoenix wrynn irmas heelis carjackings chloropicrin asnelles capab omayra dtmb kyoo paumier immortalise maholm tanji geoplin beanball efficent vtg housepainter sipora adisonline pepperland teacake histolyticum dareus anwen sparekassen shayer maiorana 
swatis tonetti unsellable karawaci galinhas décors cloake muridke globalive upk babylons tsys repacked earflaps tartness linch whingeing kopeks meringues vendt czaban maatouk thirith marangu ‘… accutane templewood spiga zenor bardai vassanji chromosomally pulawy rufi upsized nutgrove mdos phillippines anerio ecotones westheim compean downlinks boybands withee nhps pannett christkind iskandariyah cavotec neace stinespring readjustments mcworld gusmao plew fieldworker wbig mojadidi chikovani syda callaham kusina vixie gnakpa wolton kotey tigelaar woeste optokinetic rizwanur kouyoumdjian bettystown sonza alarmists rösti geohazards frostrup horseguards signwriter erleigh iansa sbisa zhezkazgan ieo metsch herzing beleave nimer morejon circumstancial ogles zakouma cajoles gruenfeld workboats mincy pouf cuxton keusch homebody lesy noele fingerlike doek avenham condotta norbertines pribyl krasa superchair fedai mcac googlies tragedienne bairds sragen baradar adolphine exercisers whiskas sylve notah embarrased aramex catmore boeckman inntal walham australovenator himsworth maturi joynes wrye northsea zerpa fluoroscope pozsgay feebleminded abrons wictor seavers cerak sayigh reify charmings fishpool aroca protectant bhutani scratchin parasuraman angrist baires legeay telegenic cumberford procaccino hilderbrand habberley maringouin siljander shipworms pierremont khansa berbizier diprima bahner selwin qincheng lansoprazole defrees precipitately ingliston relman valtos simun frappuccino fetchers stute ifco sheering zukowski kristyna escobal longpre fjd passeth ysi kasperczak zorthian hohneck michalska pinckard duez thems thelander cadburys qapu superabundant skirrid golliwogs nahma byran mocco customink shoshones jaehn adlers rationalisations byssal neotel sentri gelderlander wichert appleinsider reincorporating electrophysiologist servicable revolte schlabach elzen collaros bunna walles norelco enlightment neuenfels circumvesuviana aideen debasis shamlan satloff 
chemotherapeutics sindone pharetra katelin gumline darvell knish funkiest reposes sjodin sives addres lacq rasing gumboots jarron bienfait gridshell yongchaiyudh melyssa torbjorn irrc alayon mazzucco tavaglione submenu rachad mamenchisaurus mclees movietickets videocore umicore hni sucession burlton horchow stoppered aqel jiaxuan lookie charterholders swapper gravamen yardi coastliner adzuki estudante mouhamadou pyrotechnical kiesha kambangan peplinski yinlong eurovan communi brinkhorst pasian zaetta mosney plewman bomas lanzafame gobena anxiousness incrementalism brashares aoac bartkowicz zytomirski kleckner monthlong cernobbio mainsteam dunwell zutter vesty pokolbin poorter maolin mareb scarcities napqi lonelier garbling winco walshes knightshayes kosara dampeners shelat naughten hammerschlag hypnotherapists guzzle underpriced bervoets installshield gerbes yalcin zombification keidanren blocos cywka beidi msdf anyukov berlind beisner ulchi orchestrion bambury ilhabela rebased makumbi pseudopanax harch jovially ihsanoglu salhus plaintively gurning herenton sadowska chuah temporoparietal schmierer marmalades jbod clarie krolikowski sleeze lalia overcompensating holoman malaguena antipasto munduruku shalin milliamps falera studivz batad excessiveness mysterioso fricken forssman sydbank itouch resealing malul hulkkonen stybarrow monteroni sned kandula baricco baladiyat beachmaster gemballa haimendorf superskills chickenfeed lorenzon hornick shirenewton supernews izambard unterman perspire entomb alfven langbank appelation assistances hazlemere aspirating maïa mobistar ameerah semisolid pucciarelli borte appalshop zald seini aciman singley hukkelberg thng adaware lisandra motola probaby guyancourt mediadefender marsee grumpier archfiend nourizadeh baev peddles edgerley feio pinkberry anzhela walesby skinningrove lazell reoccuring karrada itziar leiderman makola mellman actie scofflaw thade sukari aruga recuses semitropical maxtv lehal topkick gelberg feedstuffs 
skorton aylesham quiring boardinghouses pouliquen baseliner stosh cockapoo golfsmith junes djeli mallorcan jahurul radul steib previos kmx filloux ceff oganov deiana pernickety koether nergis bides inapposite twinkly ateek unfaired dantzic yazalde banaji rzb rief atps keshubhai akayesu deutchman bernelle bacause loznitsa rouslan galliford ergasias dimauro sameen freestar pimozide runako acteal grennell mirrer wollan rusks xueqi sanitising hemerdon multiplicata gulled riabko crittle llanybydder dimplex salwen sistah caballe quarrendon jordens thaci sudhin moshulu vilem exsisting alexopoulos submicroscopic washerman woodlynne sigurdarson posthole discoursing ilios volponi beteta porical birtwell beleve curabitur changlong idvd takach helmetta läckberg deterence brenninkmeijer colehill sprüth kontiki tavella kavlak celinda yiyi rajaa rabner firebombings regarder kanarek kanebo ghysels moheli schroen kierston lochbroom stech iati bressoud karamoko sadaqah layzell unflatteringly crankiness exista briery bolea nightgowns voorsanger brondby kozmo beduin morizet dipesh maiori backrooms wwjd basabe kessner epicor hypolite suos pettigo javerbaum verro munley pantoprazole effluvia ardkinglas kakimoto homebrewed priding iosia louv dege pcmc bidon hangam asthal chapatis childrenswear digiacomo patissier merling gasport seima lizer assadullah synergos geske berlie preljocaj dellis leemann nigiri informaticians mejorado toano beagrie nvcc gazprombank nejame freightways kaiga cornfed bundeskriminalamt sleepwalks mcnevin harrumph protandim raheb valeriani digium muradi jinong doanh demuro caniparoli alípio bloodsworth setara flus rephrases palong swimmin obeidallah timewise cugno schickedanz schrieber vanson unoffical vertis fircroft gerding betterments shappi mydoom coquelicot ghillies naq compounder sarshar pachulski dolbadarn bilinda obviosly sutpen efra seclude tysk lakotas hassin krawiec laroussi raiffeisenbank miyachi demmler maccagno itau aaq anabaa derakhshani glenolden 
exegeses raisani deathstar marzabotto transparancy shyan scrag kidscape zellous palios chequebook hirshman kamdar sackur sihala ufdd bracelin silvercup bomberger awat camin crdi shroder turle yolly matumbi siner electability imagemovers josepho matchbooks hakizimana sanli marconnet peskett bibikov ermina valiance bunglawala kelser hese rantissi cotteridge wbbr oreland saharicus underuse olexiy pushman ophuls cresaptown kruje meyjes adduci meadowside denburn imperatively gibsonburg radimov aquacultural conglomerated telemaque cytometer freepbx echave autumnwatch vucic struever flagger mcmenamy venki ashely dhis reselection slouched gomboc yurek containerships ltps olaniyan rayuela arousals enthralls anora joff bjornsson silvestrini kadhir repub vaitheeswaran kwatinetz careflight deadrise gorlov meneguzzi lanschot energo fawehinmi sapsan gypo tubemogul hysaj anovulatory olesko mafai viaud nikias frisking nuthouse buaben vorobey calculability mazzo ponikarovsky originalists badonkadonk undersize democratising cageprisoners elsztain jumpstarted framley decisionmakers cayard yunyang mestia olofson alberca gige ilisu yechury magistretti hubin rigopulos pipefitter leganes malah proceedures protokoll shooglenifty ercolani attus technosphere robidas disemvoweling marangos rolfs ferromex garics decabde laronde awy ffvs dawdy funso lambasts inso joesph orensanz aztar wkys overstayers mudpuppy banji visualises lifenews hubbel kotake navez itrc chipo garmash osbi rasner jamarko abilify fpk kaneta woodfox fauzy messerschmitts chessani fudgie meinke reproving naím rouart percee sergiev nonwhites polarbear ilink licken nowroz acamprosate wahlert manhandle embracement norihito demonising hosepipe eurojet shoegazer ungers hawtree billesdon zysk wolfish razmadze kumpf mannschaft hartmans bruzzese pognon tintina athough knep yigong changli rautela lakisha godfray lencquesaing microgrids mularoni musabayev kiptoo blagojevic romley taphorn depiero recyclebank posthorn gharbiya vijai 
samwel formely blueway taltala macmath calenders beshty moneyweek grotjahn ansal stenmarck situbondo pedraja sweeden mankulam piquancy radicchio rahmstorf cpsf homogenisation gettinger amazonians deceitfulness galard varennikov igam guillame morie eisteddfods herges shahwani mummify keilson younas buik lagrossa molcho anthracyclines voelz leiba ngoo rouco gyrfalcons bessinger speek freeville seastrunk jerilyn pedasí dustour teletech carerra bundchen verle nystedt garceau indiviual abbeyhill setlur acadamy traumatize wrair wams trahison alpaslan sumichrast sals baloon cinquanta bambach laabs proxim sulamani gensel petrosa tunefulness skinnyman contres ardington tsakane trealaw delaire rechargable direst pluijm tiddington souchong isues bagful evasively thrustssc protofeathers maffett menarini recapitalizations danielsville urofsky istúriz polivka heier bassim sterban darlins rodkin gaubatz bilked wech llop tharwat anelay staredown neglible excoriates santour itabira ruibal scarnecchia biggi zuabi avriel forebearers muireann katurian mazola physiatrist balintore batmanghelidjh unitaid kuqa perkis nebulously staddle negret haarsma greatstone castlepoint mitx sentell uui bongers joles soyabean mallusk alagiah zft miniatur madlung eclisse darbus votives jonck storari hajigak saderat giannantonio bessonov timmonsville forbis extrahepatic schemm consorte barrabas solness medflight critisize adipocere lingmerth earthshaking villeta kuoni peacejam pugilistica alethia eldercare bokator boilen skandar potager moehring lemmerman thumm sudlow clipboards mabasa gulnar nightdress bendon schwetz crosshatching chlorothalonil echikunwoke retrench junggar batchelors konovalova hsms pursers delusive zedkaia interviu schlissel stillwagon bocaccio neurostimulation wankery dialoge latavius plasmonics augmentee holc everyway olgas superlove kettley nikaah redshanks muneo roup tiggers herv nonfried graffanino moggie barbules doormats cumbias ficca ruzowitzky matsakis mindarus astrachan 
hersman urick yaxham rovos voser tanden cecils waterphone awj touchez naats dauenhauer yifat makiya dece azucarera erinys miltonic dietel woollacott nobo bomback axehead laight valueable paracentesis bobba korchagin livescribe fangchenggang puello cryder babie zett mammadli genious clytha photomasks rinuccio maaninka unsuspectingly prufer mitz burbling uppie bolzan overshoes dayao defraying dilnot intercon mackiernan breece eargle swol hirotada kator sharapov jurinac aldwick supervia sewanhaka dubee oldwick quines horwitt cymorth onekama theobold psychotropics hailin toymakers pommery movs drigg mistrials deutekom americaspeaks vodcast coalburn eurostars thilina wojciechowska tshuva jccc brys niched haukaas gasiorowski ironville worldgroup gorbach esentially luminar toshimi insouciant aenean alesina tempodrom bruzelius cammarelle rachakonda llangynidr laguiller buteau beadman raül renita antiperspirants modrikamen jurisprudent forwardness godlee peseiro gladsome berting adjud flapjacks adelsheim eskender descibe oanda zimride interveners descargas glympton lewer sowerbutts axenrot gazimestan holtsville yoli buggins moredun maduka cybrid transgaz ejup lesmo orston watsco perpetrates glyndyfrdwy elasticated noveck onetto janow deejayed disemboweling caracappa titeuf schoenmaker kayte nahayan sthe stanich eastaugh druridge ouja wissman bellizzi bayron mairtin thiazolidinediones friers gofio méïté nonmarket samassa fornicating sotg watelet swatragh sayem chhum drossel reguardless muoi vomitus thielman connerton chambas disfunctional zelalem rheinstein raai kktv yuccas littlebury ussi hejda mithaq aetn bueb belladrum barcina intermec kaniel foxhills withholdings bozarth klingensmith hasay karthick petzel meyong measurers longplayer golddigger juki malheiro rajaiah lanasa venders acig sherone sparer opels gobetti dosser ipass astal lerone carvana waffled jells blackridge chaenomeles ndidi hooted zoladz marcle leuci borbolla cattley kharaz kissology cetp destler dualisms 
vallecillo antivir indulgently fasque chatshow copiapo hanaro thelonius edelmira icbs snakey wixon cobbins hornel pantala longwu coudl camfed jerrys inforamtion liebestraum dismuke bcap boeve oohs kuwano manuelo eagled peacehealth lihong fukumitsu exoatmospheric folias kellingley techworld talula spyer menis ilaje smukler arton wrests gargar leli llansadwrn kjrh carfagna claques engelder nsenga seyam pagodinho hueck namias ssab jailani bellboys razel excitements ktab pozen dudhope buttoning tocq afilias antidumping mckirdy xwd bagaran lauk sapos bacary cynllunio stentz secos sanita nitkowski miv kosoy garroted wfg sandier buckelew unles nonspeaking trentadue yanet magnarelli swails hwacha dohn multicentre cultivatable tiefenthaler isilon cadstar vondra nores hollyrod mugnier kollin bonifassi alamshar kyoritsu majilis boulkheir functionless tawse sélavy unlicenced corryvreckan stavsky cruzadas counterman cantlow ranitomeya kondos bankolé mastectomies schwarber franczak chapela relook secura petersens crombeen firstar ivedik clickthrough appartements wisco globalcom ecclefechan rospuda flowline orz akunyili zeldovich shulan demetz cammon andys ncfe lodis lemas jerkens atmosphères bucketful schettler travessa workshare turfing tulliver siguenza dusshera ycd weatherstone harrietsham autarkic suspiciousness tapiriit kanatami ambuehl ditmarsh saletta traister scouller orama alekperov godik unpolitical omakase berenato caligaris knisley prattling pctv lladró mizzima habbush ryugyong togther shahra fogden lincomycin necas kallay blogposts mcaskill gabeira zolotas pinggu dagano bedum ecover donges rodamco measureit fragged sonawane silverjet jieshou maleeha umph bolani whiteleys glutted diamondville gyamfi sansbury guterson hijinx suddently jkx corpwatch mindich neutralises autobody stube weltz hendrina kanodia nastassia minehan urac suraya kandahari quashes overfill eronen viatical rechtman ambac stuhl wallice ggr mcfate danaya allerston elphicke pensby rentiers toldot 
ocalan yeshi phytase oaksterdam llanbedrog eboni doerfler bentson peasmarsh sanes ovl luwan relin alih furnham packiam katies gatten cinemedia engagé francky inservice wonderstruck klatsch rolanda rabit ppts furrey erté brase ccim hasner daithi nemone liebeskind connive fack sunliner jalava sabesp aleya loitered spidered glowering brierre gilsdorf ziliak klau cesaria dissembled ilot ivn grondahl scaw sohlman dapoxetine esearch sielecki kamai stulberg girs seamier tokuji neysa wenfu euroregions bruggeman inundates preciously vitellogenin condign redken mytchett egdon alchin gomorra northwold wrightman mullholland vomero isuppli kaanapali nemenyi leiken jasmonic kimolos tuckshop adductors aliante shichahai downswing nicean montès ashforth nonthermal iju takiji cheerily harithi torchmark noncommunist zappalà tamsulosin wruck rijks footwell sarcoptic portlands horeca vaporetto itemizing grousing aliabadi isehara sadoski penhill overcompensation larders prevaricated chengjun rankles terramax newsfeeds cfma inat antoniazzi wssc brinksmanship raisuli floorwalker tucher shahrokhi cxs akbay lookback statfjord corbier unfindable webcrawler crunchies slabber intragovernmental henleaze yarza banjaran decant landsea hemispherectomy panpipe reindl karasik teith scatchard bhulaiyaa firebases gleissner bevell lipin garrin genon accomodations trimarco snappin caliburn couponing briefness imsc crialese siic costella moisan hypermasculine phillipi lalka calinda viad rgcs villepreux grenvilles fenninger exemestane ingonish wellheads eltahawy ccls ballynoe ganne beidh delshad dersingham nmtc applin chenowith zaner lodato littlestone llanbradach kammback mcinroy duchaussoy kalthoum goldenrods miyares gehan bainian coverall burckle dismissable juridica lapdance latshaw cariparma discribe kyubey burmistrov snowbasin doliche eurofins lucians kersa mceliece volger bazmee evenin dullah cliver novant orsova kaniskina moderniser skydives livestreaming inuits karadere coagulating grayton 
tribler angang kinking unamplified deryk apanowicz zimpher guled guyanan sanvicente zajicek equilar keidel lietzke zayo sydling foetid bims jaczko bartenev szajna carjack ravensbruck grenoside chebeague rakau desflurane endourology reprogrammable repossessing shakier boykoff arifa dussel tayseer jawwal averring whitelands glaubitz declo hamot touchpoint levelheadedness zingy greencine kneza pardners strech bioversity sarandos motorisation oyebanjo claretta hedis tooba botolphs sifma lanty immunochemistry transorbital piech alegrías debell denish farnsfield prankish youga claudino gramozi parenté overstretch stadnicki gerecht bandurski darshaan coachhouse griped nuveen sampanthan wagonmaster infeasibility bernot decock buttering excisional internationalise kacho nanok kuehner valesky bahloul lyncombe buckholtz septembers lamen loadmasters qualter bonymaen fikile dallal mofetil prosor defamer refocuses burglarize lierop dhimma prabakaran jabbo bounderby cowpen forby redzepi shvartsman hattrup altantuya pundir trackpoint gadling trellised nyarota falbo kabiru sadlowski labeler zuerst lcci hinche goverments waygood fraph sculpturecenter mvu superlatively brozek sathers dirtying bekkering devor olafson woodmill encomiums cocarde suleimani associa blazars spätlese mandil abendzeitung idph lenagan seagoe jiahua sunja nipun chowhound loctite stricly brisseau suheir swaggers listyev orleanian briney tdvision luty zahau airfone bonifield myit mantero encrustation sceptically dcli exergaming hermosilla fops conforme bagans homochitto shabi cybercriminals electrosurgical mayodan coulport boeings claria magney llansilin frenkiel rocawear kansha strathdevon jovic velina kttc momix lieske kovels sturckow nmv heteren drubbed drange smocking mpact omino pikey alschuler discolour deconstructionism wentwood johnsgard llanystumdwy ljs daniken merilyn cozen patoski disinter systran greycroft kasparaitis cognex sassoli leckford submachinegun tiruchelvam shockumentary daling bihi 
margeaux melodicism freeroll wisborough langho duyen kuder lambertson mohácsi ramonti discher inelegantly hyperthyroid rinascente boudu masnick lawned tannous tawn mvj mnos damaschke poretta jemina dugher copayments darus marissen ledingham slifka windrows aerosonde txi skarstedt timolol appeares sisyphos objectiveness claster birand allessandro zúniga earthmover fiddleheads kuensel drayer maugeri busying fogl smedberg argumentativeness fantana remko annoucement fradin rouche bizzaro movius shakeri wafts runwell scorches firetrap asist tukiainen gentlemans phytomedicine expocentre semipermanent serogroups latto twisties exterieur mortella porkers fontanarossa camoletti galiazzo kurskaya esoterically fundaments callosities pulposus wispers andouille cranor koebel jeesh makaa shebeens eifman kirkconnell cryptococcal idrs lechter corpach sneem mewelde guffaws gounon groux darbepoetin alicat mehnert uyana chillis kabbage brinsford yahir phoneline sisario ybf strimple tangley buchinger jabriya baradero goffriller desousa greenstead moonrakers aracruz seehafer acrobatically vegoose dunivant flexa ovr yongpyong seib giugni nesc metzer atmospherically qof seiple skinnies jiangyou caulkers anane freixenet toubon khone rtlm mccornack precharge simonside servas insadong bargeddie stefaniuk iyabo kandace psca kudlak cotinus tapwater khamdamov gutka chungmugong renno rupiahs coykendall ineluctable marketo viennale iqair costales morwellham mainsheet yufei taione kalousek alderwoman pleanty amirah willsie fulls boskovich tipsters governator ranst cointrin unequivocably finlaggan shovelers sanbao rkh wjno pseudobulbar drabkin mefford heatlie bakulin hammen jarrettsville mcglohon juyuan sniegoski stylishness blogher freihofer jorritsma controling pichit thommie koyuncu sericite zygomaticus blubaugh subdirector repackages rolnick stemcell kambiz pufas stiger italcementi froomkin slw confusional noureen dhia hutchesontown lanx ladbrooke pmac clariant zupancic lupron stitzel beamers 
tarquins snowsill erinle kinneir fobbing pilcrow misjudgements ferr hyunmoo choquequirao phobail biosolutions marquetalia tetrault dehar gih dahej postsecret brambly myoelectric pawk luttig epassport politz cerith felafel muthaura quazepam hoody estabilished envenoming chilcompton filaria amrhein jobos fillette spumoni homepna ruckert successorship popal superfoods vallourec kuomingtang mistwalker straumur schnepp drypoints mardel kleins fleisig evarist acros padnos wolfes aloka lhari naturalising sekisui kreiser margarines moalim antuna sheils jinbao summerhall trewick nanoelectromechanical oskanian hospitalfield finkelhor uconnect gassco gurneys discribed bathford gornick swearword limted bdelloid beardsmore aveos deyhim dasti hanagata mexi dorianne makine hhf bating pastafarians saveliev verbalizing souchak kapuscinski zitzewitz goapele achinoam shafrir fujicolor bandannas mamajuana wildheart santopietro nephrologists damore gajic vastine bouris hulshoff perigord richel botanics zbt shangai fasciana médicas harren terro zuyd scarper cooktop duddridge cfas oshinowo petaflop ravenhall barouch zachos superlink raigmore tyuratam dacor cosslett openaccess disgorged viscuso binson scofflaws lasix phyliss constructionists mqtt antasari repond heyningen dattner indissolubly unperson hbh lgo goman longbranch thando advan nyrstar haubrich devang nolden herremans ruatoki dadonov domitien qingyi nelida loudin moskovskiy rheola wined hucheng vadala marzu begles huus talati devanand sumberg maltagliati lcia gwernyfed panau abchurch sophon otterman pushcarts dollaz gissurarson ballymaloe trinities farke iriki gogua bolsinger puzzlingly scarre environics motorweek geobacter bahaism undreamed palfreman destineer opcon kromah roboticized midocean enti synergie mattice neval fortwilliam jessicah huegill rebollar baltin teath claassens dealmakers firaaq hexthorpe worldpay dynatac shivery sonsini singlaub sledged leyde wonted difa sasac plowboy vinokur copple surprenant clubjenna 
yetkin stoneback pescow speonk barbalho metastasizing tianqiao letona shapin maendeleo velutha esperon promus encima bunyard dolphinton eob jorrocks mephistophelian tahmina gaffers mullaghbawn camarão millworker importances caraguatatuba requite birnbeck sensitise teets vasic bonnerjee schuring lightering peens dossevi vandelay beitunia amae gorer mullikin egestas hodos readaptation ragbag sporozoite ugtt merillat hric aneesa bellyache vorticists sybren oruma labarbara prognosticators opdahl gritt twelfths efstathiou trinians latty eche rainone begger asopos tagula keithsburg wajed gryder ballyhalbert kinal mcbath volokhonsky samaw reignier xvycc riverdeep remifentanil llangeitho bazalt sudlersville forsters pleck jrl ldrs gregorek nosegay bizos fruitbat tortuously hlongwane establised arram rodric guntars beaned coxford rugamba jiqing trustwave gokhan nikopolidis calow dobroshi denosumab kuhlmeier suilven andorrans dudum kaokoland totterdown andas detering gerashchenko mussenden louisianans overfilling stepstone mylod ducci graterford skeer cmgi raffan beadling montelukast lignac trowels wanty zucconi comis terios metallised korchak frangos terril glulam parthy winningly prototypically margarett abers slvr aaviksoo gamestation cingolani chavistas frankin caig estime distend hccc editoria szarek itacoatiara maleng pekovic luzio shafia sukur mayakovskaya napfa compliation dubowitz senrab ukirt aedan monfries thinkfree morses wazzan spuc reauthorizing bornedal psfk shimron bormes abberation sawchuck meredydd elefsina strines quickpath siebrecht atholton nourry pgti barreca serralunga churt politicises eatables efdss rebeccah allander sportingly nuiqsut moneymakers wpad fadela demerol hellebaut phleger lixia ciocca pretentiously anema paim goldwire zarghami courtnay dharmsala borodulina fumento moge fanne ivus borissov oroshi nahem surrattsville whorton oltrarno intrapartum weaponised bvlgari cosmovision teetotalers firor nanabozho thriftway devided puffiness jitan 
megacorporations dhotis noncommunicable digitizes accutron brecciated shacklewell shapoval rhl bösch longlevens rauluni rawaqa onta hujar brainiacs kaloyev trefnant temtchine caseville henrythenavigator unstaged archlute forgotton nikai dellert uncontestable mattew eroski dribs nishinari rvca vernerey berkery cdic corralito kambakkht rudlin mountville desisting spliffs emmbrook ayoola lavigueur mykhalyk remonstrating hango nasturtiums duhigg danek diebel piccioli raham dalakhani devenney marimon savvides andren capons celam hachiya fendley capezzone gredler wihout minety fourrier ischgl gammoudi hiorns wolaytta harringtons tamis rabbinically gurton gusha potentialy aronsohn reuil bowflex lassally invitro vancleave aaea toquero repeller csz holah roekel antipodeans caracter obertauern convulse hojatoleslam koudou shobe hccs whitneys hakkinen lyson gevalia waberi taho rozycki euna marhaba piment fdx denvir tienne camelbak manaa welll spectate hendro beghin moggio hailong minjiang akuressa mcquilkin sharifuddin schaffrath powerbrokers syomin wahabis craigiehall chanctonbury demonica xinyao idealistically jossy gutrune provenge katsunuma shafii hockridge lambrigg dynan adultos endostatin postnatally nephrolithiasis isett dissagree mahanthappa bachia nameable moger otegi tenero célimène andijon raveh prysmian saartjie zhaoxing bernadina castanon natig vinti hablan niggly strib marill ridleys alopecuroides resurrexit stopbadware microcell airola décolletage succar jck guvera singstad widders vulnificus eugena backlots problably slickest manufaktura walravens couchepin benzies boreland duree cwmdare narrowbody guangyan zongchang blessedly corridas rambukwella autoglass abelow bendukidze bintliff moodswings cullotta maneuvre archaeoraptor hintertux manischewitz intermingles lavendar melles petrifaction livened maesbury ulic unhealthful bergerson arone stuffings halekulani antiquing falutin greenwash briguglio nexans highlighters togus tejarat tishby stigman talywain chevis 
ogundipe cansdell viaggiatori purslow laxfield jingoist dreamspark cantagalli ggf crashlanded pratice ungracious consignor blueblood sibeko baliem emisoras boraas knierim maycon qasab deadmarsh domoni disent aqraba myah ramandeep insua perspicuous hawe merval chulak daltry rakhim appartient reginiussen maundrell neurectomy gokana steuerman jiayuan dpko pollsmoor architectonics mccorry seatown bellapais fasion joyent coherant salonpas kakas epiglottitis calvez wilmers seagrams gingery bregovic zhenghua langel govea syer gamburtsev spongers jessops monke barthet maxym scheppers pedicures fpsc vaadin reclusiveness iceworld jahncke corve worc purtan munita adigun approvers benzoylecgonine bertuccio salmonis mpsa mandatum riddley waulking wlk estranging icx planespotters medicity mononitrate bonvoisin lectularius herculez wannenburg boxcutter usdoe tsna lerberghe poynt rohrbough barakeh willowfield piñón yock ranthambhore biotherm aanr gerut brisben normark antwain khiem turnstones lisser rahho cowpoke millio wildgoose rouvoet frisoni indefeasible ljuboten pullard sanborns triterpenoids firefest cohosts yassar koliba brutt datblygu jemaa cliquey etling porca heezen fadli zirbel isidra ordener simari tlacaelel olowokandi seiha ananiashvili saile thurlbeck inholdings irans riunite bhuwan seppa nikumaroro benfold jesca kvas attitash deber venzke waltraute bourdonnaye jpmc balms unbridged magetan literaturhaus elefteriades gnx tunnock dansie incensing cutta schriber khazal technogym landbouwkrediet kozhin giacobone pelot galadima yarema hadebe tempters budos strober litsch umeme timbro karibu groscurth riona overnite guéhenno brothertoft ngoche harbourvest lubecki katsushi cablecar scrabbling gadbois hochkirch touchless fearmongering midy kerswell horspool bletchingdon baldas lyondell yorvit altenahr foja trenberth backstab plié kweisi leucism lonwabo inguri militarize angiomas aulich khada heathery fbop mancur samdup druggie dansili christianities quintavalle chucha socor 
elapsing cooktops jabaliya radetsky finbank carran garned cybook demnig wachsberger yowza tshuma goresbrook odihr hultquist genotyped etis bewteen derogations belled morave maalim reisig saikal fortepianos periwig ixv conerns slaking gruzdev multum mcshay rutili obetz namic gulgee newarke tingly maulan hillborough panjwayi toutou byx jounce itajai grafenwoehr alpesh shmoe talfan masari houlahan quavering evanthia kubodera sachertorte olivio tatsushi westerhoff yothers pehl siemen rafti paicines surpised yamawaki creekmur pitchmen plcb copperhill sediq wilsher clacking melosh sematech nethope gussy hardyman mediaroom zanicchi vahey hondutel fladbury garnell ilounge gbas ishai mccuen lilypad freeda bellaigue colóns tingles ezzedine gasana ohim divots mcneece bergren multicar berdnikov slowik lionville stbs skrimshire jungla muriqi funking beggarly nessling phedon sfcg incwala shayesteh planers theorin underworlds cagsawa stogner alowed maylin crimebusters blazquez plce pleather elfenbein chamkani oerlemans diker sheasby sonoco weihua zichichi brendle freuchie levenmouth okoronkwo aadvantage hillburn dipyridamole ceilidhs mawazine clir cyclopropyl tornai genuity philcox isamuddin weedkiller mabton incentivizes chickened lavasoft wmgt waxen sonographic runcton sgma moratoria starchenko baiga fisherwoman lyerla renes shabalov presidenta morvai teleglobe merckle macfayden ruyigi cirrhosa rohrwacher telepacific dropshot démarche karsa souissi kernicterus kosoko raistrick loyce waringstown thevenot motoaki chugh broening sinx smilingly cispr guangyu brighthouse maskless immoveable tolovana analia wapper richner sron dunnam manber denhardt blackline munton casquero ldw aasim sivaraksa fucci technico criticial moleskin andoain keylor qeiyafa lantieri rmdsz guines abettor newsouth naftohaz wathba portsoken schjerfbeck neiafu reelections shalaby kopsa lesin lakenham screengrabs bitney cimperman chygrynskiy urbanspoon precedented awsworth estleman savada alpar sandland scamps 
copay camier shrewdest chessboards dysfunctionality tidbury hormann reisler amerasians trana jetwing stavola bassetts stum alecky homilist ausone assiri erx dobey mudingayi nardoni teitelboim erkut pinpin nanopoulos scana khanon ezetimibe cukic ceop ibold kreder boruff laureth calabashes wheelz kaib melodee inscrutability reinitiated markram rambagh gve soothill vrx rozeboom turai mcrobert heathcott limones motha grobart dunt komamura vogelzang leucate demilitarize claritin markby harvison marotti luzhny yaskawa depass musudan fitzrandolph stice luhring unuseable weikel appc transesophageal uptrend klosi boue daishin biovail writedown eople nebet pillorying winkleigh carias caty pfandbrief codder katlego marlys backpedaling nordheimer soehn hrabosky langsett simexchange mcewing rollingwood fabianki kleis changcheng chermiti avonworth lisman caterwauling carafano harbridge mardirosian mpsc tsitsi pke youngarts hazelhoff filmakers edworthy whrc raclette zetsche edozien sidik ezza longrunning rasaq csfs thinktanks marno sanoh cloner derald arrasate hoofprints longdong saftey tchao zalmai beckstein dxl smurfy quibbled ibw foubert suffit vibhu coalface sukenik tsy zhiping centropolis fieldsmen jobtitle karrow artioli tions orlich bujagali aseel finisterra fcstone ferati inanities faultlessly lvrc kidzui mamby exfoliated futurologists ssti toxocariasis phylip wainewright kadr kalonas brookview mckelvy siasi jazelle rigeur doivent eeriness arncott ironport stealthier darbi dunscore androstenone khing oldskool indiabulls rockling allays coltrin pallen umps orangevale khurbet peepholes elfers bedawi slupsk bigamously bussert nasuwt foxgloves heuga lamco parmigiani nowaday sharki shellman westrup pakalitha punkers frx preventions milbrook washinton pollitzer hypothyroid gandules danilin hawaa luvs dusart riyan pankova rutabagas elissalde waismann topweight promulgator portlanders vaidhyanathan holmoe ibank korosteleva soltanov jayasundera datacore unspun brahea ekl knautia 
froogle ocassions baohua nadjib astilbe basulto stanislawa holdenhurst gadhimai stoia tixkokob jhony pauric bridgehouse tabacchi arostegui rapanos bendle exults desiccating hondajet abdulhameed anejo dimont minnifield fagging bladeless broström northavon parasomnias iolta firewalled conaghan rappelled toivio talor probo paskowitz ogoniland unpatterned deech mazrouei usapa chorba vitasoy sumilao ballykinlar atiga cleen zannier holyport kaslow gardinier mossos roskelley brittnee kbfx shaath kofuku ofri suwandi shmatko overbanked aridi braskem brynhyfryd beechfield aqe picas balconied mandile ariyan scottishness squillante catapano slyness creches kriangsak kircheisen araque outman chindamo nawzad kandawgyi rabai benaglio palaly umanzor crackenthorpe gonerby fullilove alamillo burchenal octomom hiromoto burniston lolla haskayne kweichow telemetric unimposing vildosola rejigging toua mazzolai blackfan rundberg longyi adauto editioned ilter hannas devedjian sbsp rbtt divita komin reengaged berdy mccrudden westhouse sautéing tiem wheeless godleman chairwomen picada haralambos markens morococha latterday hennah handless carneglia lopatkina roofie cassagne whatua nalebuff perchard foulden appeases jumbuck furthermost ctba ruhnke dineh bickett trijicon gweek svensmark andropause haochen borzakovskiy domy baileyville plumbs grigoryeva pravastatin depsite karapatan dornhelm smae myss canonizing parmitano rantum meave sepilok giaconda lazzarino vyle antonakis rettenbach raphaela hesilrige bulbocodium vlachopoulos melodramatically gandler digeridoo tacke kolawole barnidge bozhkov edik dylans savors ncda ginyard paino burnbrae araoz autochromes archmere ghaffur meyde raltegravir cornier strokkur chardonnays rubido mediapart trainman corato trasmissioni astrobotic elab halber wehrheim lemalu loseby willistown rohrbacher nonoverlapping rappels arborio overweighted ekern alaq varengeville sarun liederkreis seebold dimichele nasirov stadnik roginsky holovak miscounting rabou 
collectibility mlps roussell thim pretextual cremonesi fabes atucha muchin daytrip harrased kavakos ygm hagendorf begiristain cipfa bitel currah crosthwait alipour balala barnathan krasucki bearak mmmmmm ratfish dinakar puniet lamah handson okas arcore everythings eiss resurge yodobashi overextending joceline shakra fukawa aizenberg montelibano utans universitys nibh memphians dumouchel cracco ssy subhankar vilana fellmeth terance windridge zurawik chalfin qtel goodrow rure proprietorial willert curette frenchwomen shashlik lafaille twelvefold defacements mollier epirbs devlins monlam zarkava researchable mabvuku varilux eiscat pamoate neoliberals carteles jearl phrenologists lisaraye smeulders itsself mueenuddin disempowering keiskamma disbar toraman gullfaks moratoriums villazon scandentia parochially baselga kirklevington cattistock oughtta hsaw cynulliad gopalnath slutzky overvaluation batterham dallerup ostalgie henchy governer cija ministerios giangreco liikanen landay clarsach thermotherapy oldford homsi ciervo matzek ibio tereshinski malford komarek murungi brunier zbynek crocosmia rohita larcenous mabahith mastny deschapelles mozy strykert cladribine ledgewood liuwa chlorambucil needlelike ganzes hazarding aeroflex supraglacial dimokratia meina willesley groag poucher pisar nubble knego chaffinches tengzhou bellamys sadighi chunkier simphiwe katchor smartie reimposing haselberg chellaney univest guillo balester meigh bureacratic floriane moldable jardinière krisel merey stubbies lutfullah pioner menemsha whittock nadjari dupond kolache keaveny bequette atomstroyexport furring storcenter raful gänswein monsterism rettendon morner vigilancia clemmie schleper lovekin fennville balena prldef karpovsky osnos lantigua deubel honeysett pakhtoonkhwa incises nitrosamine neaman pensarn ashika boors casoli chagin murko griem gruenebaum gutterman steltz erionite psychotically pennridge alawsat undie lepetit vetrano thomire shaktoolik linzhou pensee sundblom rearick 
kave louer stadnyk cruciat daning illium kliger dormston geac sudarsan thornlea ekes tyutin pernick llynclys dedina huguley jstars cavium highcliff ebbetts suchitoto skowronski mathmos nahim torys nightrunner zurutuza wherefores linnehan sartory affianced cooey paratuberculosis senesh mishari hree dinkin nrwa dierama fathomless saltarelli resequencing lionell davoodi daidone tanase nargund peapod glassine refsdal hongtao nonnegotiable ishay portakabin flibanserin slomka huelin walbottle kengen candling galson basketballers pakiam brazils kadhum suezmax sparely decompresses eaglen tamgho boogity winshape thobe microgreens iqrit dilaudid treefrogs colonises osberg contingently valentinas oatis sexagenarian oooooh superbeasto aucuba négociants leatherbacks coquettes crapped equivocated chytrids bwn cauterized paintsil shihua shushkevich bevendean satio hammarsten cockshott tomeing tunnelers doodler hrtv dingxiang tinworth abdelsalam casaleggio duddington lacorte gavidia rundowns bauduc laureateship cranefly janigro reaserch lafford hessing hegemonies torshavn vartanyan donahoo wenqing monterosa nettleford crosshatch lysbeth essenhigh equivelant hairpieces beatiful abama irascibility medanta oloye lassoed ansbro tregurtha gardoni asztalos foxon amarri merchandize aldwarke declawed fingolimod paddingtons bercher matzinger ciaccio hoetger hendin bual yukagir misharin muzu shahrizat brightsource iccat kemalists stagner obradors teleworking gisel ilê halu amaranto benardete futi aslett suther denka peragallo boazman dû abon apiaries fulfillments rejecta marijnen chomo compnay abbassid confederacion chizek killmer nawruz nantporth telit karkhi hargesheimer graters khanya padd skachkov grandstaff jericó iupati diktats splinted lampen ayerza egekeze theofilos carryall helvey acrc impedimenta northall sparq maluso bartrop moneysupermarket uglegorsk nokie cejka neikrug vespas pepel olfers efail grosman haimes tatic nefazodone mansukhani paulen undimmed gourdine zeland zeyno 
tayleur chool holzberg kench shamsheer dungworth litigiousness whitefoot tarm stollman videoblog rohbock nram capecitabine partsch magaki cabstar pipedreams frayer nitu unhesitating sonntagszeitung adiru llanfairpwll georgeanna ciria aghajani aminath kimaiyo hawatmeh plewa kolonics malakov talhaiarn nonancourt alemao baltimores nkufo apolipoproteins zhangqiu lavs fransham shamie salotto dumpton yahr suiters homosexually bonnethead craigievar nuzzi heggs kaminska superstrength slemmer erasto cunnane gravgaard whitebread karatzas mcvety meteogroup nagamori headboards imamverdiyev geniès mobay adriyanti overdoes chenevert casen woodeshick onuaku nonagricultural mabil embezzles polycarpou kerobokan saquinavir timelord antabuse nabaa epner adefarasin wildish bruchweg fleetboston sputters azmy cadamarteri puckeridge klap conibear cantatrice keckler caique unsoundness salw andronik swanstrom gaoming neabsco desecheo yiquan matney frykowski belcea shoed rebadging genivi dubbers angelea rissman chrs rodak dropcam impassively cupholders behing groser bòrd tieshan kenanga openajax hickinbottom meminger screenprints vango sisig jamee chemopreventive jinghua adorably elouise switkowski rupeni devany shnaider giribet nusser prestiti grassini oddsmakers aramberri luigino jasur purgatorial mudlarks rdecom malangi mambos baree spooned takalani idham spraypaint zeel ixquick eyeful yonker obfuscations labash sabrin proaves preit hackleburg rajnikant neemia katera pukhov backpages dureza radisys greywolf bustled decentralising hanifin geochimica quate fhlbb chimerica maharajan khawla mitchelmore sonesta barbarino peope doncha uninsurable restrictiveness chittum choung narghile taishin guobao allyssa bertrams tessler storwize oompah nistico brabeck killinghall nogues myoga otokoyaku pteropods nitpickers staiths wilckens expediter francophobia mckegney muqdadiyah bncc fengxia boriello unterweger ikm strosberg demotivating coteries gaikai mometasone cornog yumkella cablegate watercooler 
sanakoyev eariler kemira lesjak twohig globovision barabanov jongerius perezhogin musican milfield aradi simmy albareda uncomplaining canditate pilc inheres brema fimbul cje mursell gki kaese shlock despairingly invasives hauptschulen kyauktan narsad naari mcingvale michiyoshi ssam snif metraux divall chps oskay errrr oseni chaunte stokols dimenna weatherbys marchick wdet morlon roadstar parknshop lochboisdale aready talco deoksugung palander bagna ehad ghostwrite zinoman anirvan ballesta zabawa durring performace krikken dalkon ecohealth walberton tecnologica movieplex nysut fucilla immunogen korfmann krugel byass barjac quainoo iafrate boell khacheridi krump wessinger thalib bhogle dandu meziane paleyfest jianqing koeverden gillson illogicality conville grims orian peynado icsr powney suceeded chesse schwitzer begrudged donnall decaffeination heinemeier barcena kangura chumachenko huanhuan galloways gher tamiia costarica jollof katella fetishization ansoff shuttlewood leikin yuden lazarte cnni takefumi taedonggang colora facussé farese melf ahamad greyfield transferees ogunsanya soloistic asbmr homebred ameal masara wigmaker barsebäck nmdp wilstein cotugno stike transman disposer robbyn pusillanimous panshanger coultas tumaini caril vandort plasschaert cesifo cockamamie nabby mainsprings degroat ashia itweek paphides atter maenclochog marshside fixator stauss stigmatising dreadlocked traykov vaughton nichido walmgate dck heledd scajola woodmoor biaw padalino wiis mckaig quatercentenary herpa bockhorn wde lasy duffing bimson towans stupin quienes brads homenet prieb jiahe diprose guttierez canoy backwardation qelt tarasyuk sethusamudram krayem compubox suvorovo erts laparoscopically grimandi juet anagnostou brevan martek sunair responces binbin schoelkopf graem macarons cheskin pretinha lizo rayanne molecomb ohip intertainment garby apirak parcheesi holdall xapuri mucolipidosis latcham tefaf rihana eirikur dagogo gfg smoothen poken comparables godoi shorthold ency 
tillot funniness pollarding erz asmallworld naht duchamps karora mutko poundstretcher laffranchi percocet somkiat asaduddin snowbanks wwin sliming micromuse guriev nazam kolodner stollenwerk etchecolatz montney quiches torgler middlesborough errickson mortehoe mediafax lwg sotalol kemakeza screeched nannu heathcock resorb luay alveda wmsc nabati llaves cotehele peeped hedrich amstar treasuring sablefish betsen muker messud roids berkow yusra unobscured ouvidor macerate terekhov badinage mruvka deemphasizing akalaitis mussell unguent weatherbug advancedtca blackney nutbrown detesting klci krumov turnmills basnight allistair wrigleyville bruecke monohan blaschka hautman tiralongo commingle yurtseven mihails radonski instinet fedun magaro frenches teknaf kalaitzidis wallid chewning consolidators intravaginal filets jork lecomber hyperthermic recre idata kjla gymboree pavlick hendricksen velka danneel ficc tules userkare dzr yongfang childminders glascoed tsipouro houton urgun biopark dockage ceron snailwell untilled stenham mantech murabaha esure ringmann affi vardanian pedrillo kibwezi rakove jev chalid pissarides catera cipressa abideen besigheim misys meersbrook brafman sandbostel livock unswervingly russophobe culantro cervarix kamminga pieranunzi robotti alberman zlateva buonomo rvsm sarco palinkas wenhaston greves bedchambers energis depardon camec deighan netezza isys adaora jaywalker marilyne shewfelt lockyear ratby sulake boastfulness dapples arkinstall irey sothic polonio babaloo obejas fornicate taggerty marionville hessels relgious elgible carlitz sspf fidai fnmtv idisk nayman traumatically farfalle unbefitting warke romanticization burnhams indefatigably puzey castellamonte reaux darai fallahian opengate peillon cambusbarron yasinsky mottau geseke nonoperational globke downline lifescience skillets schisgal phototoxicity winegard tringham synthes castoff cattenom epitestosterone pachon pjtv grayshott hortefeux familiarizes stobs huwei tanasbourne bhattal 
expence schifani eglevsky snarking molodist attrocities rawdat xra paun igrejas payors southcombe fillongley linders nobuchika pcrf uzbeki cnsas batsto blakedown asuming nubium interacademy blackens mattey florit hjejle maragh threemile blueshield fxa overtricks khidir sergia volchenkov smoldered kelud ghw janti emerik oxymetazoline apice raffoul confidencial stamata kingscott antinea biegler reeman manuring collectability khorol klineberg mcgiffert mudzi allmans rashti sunnylands doggedness loray horseflies bolshie grimsdale aerophile hreinsson oeufs bouira ruut shoura twerps dermabrasion levich mahmoudiyah tsikhan hypocretin lopud tatenhill szeklers videographic scheuerman lampstand aici maxman aliotta crimplene nautla plachy mizerak tazer roszel dishearten maoa procreating mancienne karayev unrefrigerated demske unabsorbed tondabayashi alipay tortel klipper jardí treshnish observably dreamlife scenesters oftel vwr ivet yesilcay wildung minisd anorexics prid nyquil arija caraher angeletti sindia membrana politesse kahel boullier deakes webvan vecho zasyadko hellenikon stonyfield theocharis feldsher seperatist raychem unpleasent orbinski xacobeo lorser olufunke eilish scottow njdep mediatech naturallyspeaking yiannakis kalisher gnvq caridee flatfishes griesemer andell leftie signicant gues winnebagos twaddell bertagna downspout grundfos toboggans lifestory bunawan unfriend noop yellowy kenjon wojnar acutal tagicakibau kootch ciron penndel eurocamp schussel chamanga vulputate fuxingmen lesego stellmach sharyl picketts maybourne shober delightedly atus biorhythms ebben chaudiere strohmayer governemnt burgs bkh alfei dhanak marfo tackie baycrest vickerson parathas mxi mekeel cottesbrooke revuebar komarovo webgui myspaces tourneau soeren elstein agadoo aquilante schnecksville kasasa shailja kessy sensuously rabbitts gardaland kinnier covaliu gloried freye kluver unripened scannable pollera selside siswanto abashed kunicki amagertorv sitation sarafyan compaign albea 
legesse stepfan relgion supplicate jhin mysti clennam delpino playfish ludicrousness krzeminski suscipit boxlike biserka santika piata ciccolella elicia damasks aprés trhe salumi spme sholeh jitterbugs brownridge calarco redco cked champika yoido goestenkors cleworth megowan richarlyson moniter impax trenchantly wishard deoxycholate kakkis yien hudur cartha exerpt farrall eckhouse bipa malkan smolts rukiya ering overbalanced flavorpill malesuada remunerations picadillo kniffen belimumab chuyen inexpressive scro lichterman spittin gearty lamaism starner chessex icehotel zeanah anchormen wiederhold emulex tatio birhan axolotls hoplon scuzzy ossifying elaheh lsta messaoudi lyveden bernfeld pronouncer colp magyarosaurus feltus serageldin flexibilities majeste bunghole barko pinnington chugai kirchberger almanzar malouin immorally edgecote magalie palamos lamagna butorphanol baoshi lanig mcnugget icdl skuli statia chengji themselfs bourgon crappier kumra lipez spooking peruses outlasts sousatzka ravenstonedale zias raskatov albone sundai arean shyu guanzhuang abdelrahim solankis pennyweight gluepot diuranate darboven gabey behnaz mopey bashment queston prazosin grisbi purita ashrita daoudi memorious ausman matzner sumtotal gubser germanness saana nagae crybabies easterwood trebah bhindi greenkeeper newberger dpk redditt sergeyeva birthistle ayars cqg rainelle jeste stikeman terhi chocolove taskmasters tumorous soanes contracorriente schewe lillicrap trebunskaya dodgertown wellsway vidarte gyoza rashers fdk feintuch saedi nedap caino escos buddington cussed meservey unmatchable gtcr weisfeld healthsystem leving aesseal pacemaking mylife justicialista colorways sarossy taltal santosa blanken ddim hunty vlisco brodifacoum afms dedic itaa liepert papaconstantinou pesquet ecumenically symptomology speakin lindolfo levitzky sabuni bunked swingset cuckooland wowing mulbarton ecologica rahila methylenedioxymethamphetamine robitussin vulcanologist quammen prial kulsum agathidium 
antionette kinlochewe troopergate frittered lummox lagrande nimbleness dallachy barou achieva mescudi iseya yibing solexa finol ibtc floh drainpipes mudbox hayovel petrucelli morrises raphaels lecour baolong fenofibrate fuerth chakkara deputizing schachen nanosensors cantlon zdb ashera westvale kervern latella idenburg nisc lapsang altoon polota rusper hagglund khatab carvedilol assurant objectifies cowfold cypionate lastweek karsaz hollyweird sier dalbec dslams felmersham unwarrantable stigmatizes lowdon onehope marthas pleasington trearddur diderich emsis salaad cosmati pennells effluvium disjuncture pfaw siné sunit kinan malaquias backchat sdat foulquier shoofly misquotations blighter agbeko orphenadrine snivelling finesses thewissen swad madisons searfoss subterfuges arlacchi bianet beleives noncancerous buston catellus pronouncedly benshan deenie abstractness gantin opsvik errantly kaseke atomisation heyder pippig clarissimus hiit gallichan grouplet bueche hadir toloa stateman fuchsberg mouammar lubar wentai untwist cadboll gemeinhardt zoria hpx shaibu undefendable mclafferty lamidi hearle longnecker heartlessness heliosheath esmael dewater curlett brancatelli canters kodwo momberg rangone vended nabobs melees leesport obadele ncvo hhk hooydonk langerado tinopolis bootcamps darsham snac pisau martinoli mehadrin kroffts hkjc gudjon hnz delbruck ‪ overcooking cogges trashcans bergy vanco reattribute tinkhundla litella austinites chechik littner dawie irreconcilably sheinwold brooklynite lorded kyleakin twines anisole abiye cder lemer radiograms shovlin bosl bedrest virtualizing génova updaters donnison toppy avigliano ozier preachin gunrunners milliband postmarketing targoviste irelande edaw bouchra kimco abendanon corojo denesuline chicontepec dancewear sahlgrenska bawtree funi hpcs filipenko brex liesbet nilon nallet januario mattishall chetri nomisma overbye maves wechmar trasporto kacc synchrocyclotron boubker mesurado czege spara pintauro smithey 
disassociates romeros sjostedt clemon boxun shihui ingolfsson corazzi lubmin flatteringly gasohol murias ukt overstretching volleyballer rhinosinusitis dutti rochlin undersold mdluli arabtec bortone etame tafadzwa peosta garyville keath flemmings hamers eupol pdps bhavin gemenon jihai yunji dezi toccara msibi eprs clogwyn debilitate liberatum belbroughton bacta plupart tucuxi kdn sadecki earthier kintamani balbay mmwr mingala zarni mubanga durrer parch dipple wilkof satiny threequarters ehlvest posibly nightie corrias ricke brookwell gordeno tolkachev tillable olshanski intrests newcraighall modoki recantations beerling midfoot durnin nitzana unders groundskeeping parnon heungdeok abrasively onalfo emasculating newbald celibates yarnwinder nelsan whoredom saccone esman clsa airsickness mcnatt kneeler bahij carcamo ezquerro messia solarwinds rigatoni unknowledgeable blisteringly thraves helseth chanta abeni marinela savigne teleprompters axtel helg lackie empac daleiden baylen mizuko warrilow vocalises globacom osci pitreavie auriti gallin potbellied openning getgood gayssot wlodarczyk boroff harcum sněžka aprotinin whatuira diamon oneil kpvi mspca madel kalichstein rentsch chemla schleef jeffco saxelby shudde xac verzosa hohm lichten hadba aerion hyperphagia beibi gotthardt kaliyar baudilio lockes behling movila kayongo boothville altafjord ronnell transwestern gibes unitedstates denseness studing nelnet grandpas sknl agnews duyck cnse stuebing dealbook yood reviling bnk vsevolozhsk pretensioners pashkova cofi tobinick geomyces tyeb petruk jenilee ubinas mammoet ecmc delanne rentmeester achouri jugtown mamprusi mathuram zīle hindpool inaki seomoz beart darmstaedter salbi nzimande discomforted waweru khasbulatov zarrilli dobwalls voinov caminada moorfoot almand uteruses hies asselt baretto ound deveny bleaklow semenza nagasaka spivs cernavoda dalpe thoron annison jazirat cariad stonesifer trivialisation wasik llambias bartella bajramaj gack accordia assasin 
breitkreuz ballymoss streelman sadoff upritchard quaggy dowgate procacci feiyue dullingham aasia hudes cuthill skorupski cosmochimica rescanned sopchoppy premedication himym cutco lefrancois tangibles equivocally tallness depersonalized fruitier cvts schoenbrunn chambermaids nonresponse completey broeke sukoharjo bátiz oceanos avadon snackfood gustard discomfiting liljestrand brette rapscallions bleasby foum shulevitz earler echohawk trovesi wolferton philisophical welborne sironko straighteners setpiece valbruna shammy sergie berom machray ertman redeployments piquing wholistic dongming penmorfa popovkin heckenberg jalovec precut narvin iannaccone baulcombe iaap wenta coronados novillero llanedeyrn fies parulekar tomicki dazey lendy lilibeth wacoal bearskins leonelli peplau huttle trussel breakway muslins drider hasselquist cwmcarn librium forsgren deprogrammed poleksic mossburn tansill qiushi kscb ilkhani innertube psyclone crippin feedingstuffs chege pdufa nevett engineless odobenus scotforth setpieces eservices theese lowari visitbritain bartholome cribwr stenstrom honked virii nwtf lotha abec chitlins gauer wssd gortin lintern xvt tadman schoo blueliner naimoli hcpc torwood dimed becaming ncj kreamer milkfat griffier makhani laffrey aronimink abrahami agrawala aimie parceling bugaboos cyclery principali eyk plasan strelchik prolotherapy derailers tieman glenariff blecker wagnon prondzynski differance unisons oymyakon benalouane leguin langsner scaggsville odeo sease resturant supermajor shamiya obul processionary vanidades damballa firestream trenin herreid absorbents ngeny estemirova rhdc textualist nonfood techweb bsms forese aygun regza newens cnep tiddles maesycwmmer winslett hybride ncee mosqueteros rootbeer muraqqa guoxin patsavas breth poiares fatialofa iwon parzinger gallerists langthorne craniectomy pattering jonquera willocks csss danzon ictaluridae lemcke cprc harpooner gubicza wanchoo kepu disapeared vitkov untargeted zaytoun emanual polino 
oganessian oyarzun spermicides expereince keelin unneccessarily anthropomorphizing sidarth havanese cedrone appers livan stoodley gulyaev brocato makov mudpit saify catastrophy slabinsky pepsis pajak crisci sisay augelli gencorp firepit rowdyism tayeh midpark rivane ioulia camlet bogdanowicz transfixing aqt shefter segurola katzenberger beckhams kwas calomiris nembutal dutson dicrescenzo terawatts shiant sadulayev uña abstemious dunbrack thermoregulate careerists subsites butalbital younkins bedella scapegrace postapocalyptic trifled heyser wangdu hyster electrolyzer bancrofts snowhill jabarah krugerrands rouzier trefousse remodelers maricle boiron petrofac cynda kissock lifeboatman goodish biesen keehan leha letteri mobilisations saull jammat woolsington malevsky weeley mishel mattingley lones mudawana korecky dagworthy jahri rudnay madoyan dealmaker yoest skelmanthorpe romanzi completist pahlavis tedo waterweed matsko swampers transys adetunji entelechy mccrossan lapt gjerset systemized cagw piloncillo alcova bernardis leban answere interoffice nongovernment maquillage remigiusz ziska walsum inseam newater cropston ambulante pirozhki manougian skyr forwardly gossain mcartor kunnen sdwa lpsc lehrmann whassup pudemo trajkovic eurail sirwan olmesartan aratani yiyun moates twitched turpi inmon lugovoi abbeystead idzikowski schauble messagepad springsource disembowels undercounting demotivated klpga zemmama provencio livenation imiela kertus ethiopiques deceptiveness gomshall santolina transluminal mcowen qrops huvelle arsenis libran cabic revenging scharin yuzhen sturrup felitta alspaugh esteems muttontown roge northdown batrachotoxin dubnov alikhani cornelly outswinger swabbed towb elmau moutarde westerdale dilutive chronologic celsum derrylin polishness prinknash utx lantin trendier iivari mazunte pederneiras satinath estranges transflective jahns danella borzois aristóbulo unusuals timewarner kruck transversing bessonova verichip burnaston kaihui jisheng brascan 
brung qummi malverde mesler seminis cemr wtnt kenteris varenna savinova mutsch energem chaze hatiya balzary inportant firebugs ilchenko oakwoods superheros punycode featherbedding slamdunk stapeley tecs coverge arocho sundwall bridgham mucuri poupard asenso bowlt mckelvin xenapp rfh qci valorize steeling llanharry rastall incisiveness unichem looi glutes surroi minibikes barquera chellomedia nikhilesh methylcellulose gliori thyer pactor pursuade avz barflies sheppards maliqi zavyalov bolkow klepfisz kenth interros laucala unfriendliness infatuations gaddum teros neurotechnology ruhnama mischance lumbers rydalch snoozing ranadivé krader zypries tarradellas tithebarn isothiocyanates scirrotto ivoryton kinge flicky pmml octoberfest smokeout bilic ballyjamesduff suring bonnette eems muhibullah indvidual frostad bayno dayeh cavallier warentest miviludes jianhong resurgences ampules sondermann maraviroc rempstone cossman khaosan chiongbian gyptians liberationists vaaler sheepskins dannemarie iocl edmonde bacabal ostman aweful immunoproteasome throwable burundians ghazzawi gwynt klawitter medfly tensely affirmance intersputnik saffrons tremiti pearler earsdon moorey kouris colonoscopies pureheart mickal mcga sphaerica iisi rosslea fliss prause addle raelians hgr tekna vetches hongxia pelynt imoca kammerlander tranquilliser dioctyl muzquiz bupp afit emmonak appearantly estuarial heiligenstein gallais rieslings lewsley taizz yull audrie versaille chokeholds perfomed stoneley tyacke squadronaires guittard michôd fecr cmec sinnathamby tureck oposition crissey squillions denims inflexibly kinslow overextend bobinski jordis xinli doorns unpicking mexicantown crassifolius andraos mubeen niccum opisthostoma sireli lamberty yiddishkeit wakao chuwit caboodle vezzoli glevum craigmount homegoods parolles maghazi lorenzetto dongmei bashforth aromatized zalmen treinen magallon bahlman rrose batar stibel ptj inosinate enfermedad stripy wanke ampeloprasum advogados rojiblancos kleinveldt 
knauff tostadas kenen unpermitted nokelainen cloudiest hashahar schwenksville wennergren jarchow leutasch incuding yuwa krestovnikoff sobia caiu gilon formoterol prehen legear horsnell imil dossey mhh downwardly reabsorbing basche zeroual zillmer sikahema amendolara throughputs nawara coldhearted deshong cheye defanti titter superquinn tlrc lebda bzdelik dannay stober goeke malinconia hhgregg behgjet malarz craignure yurman bucho gunka thomsett norrena butterman szczur snappish mainconcept jesses transfair rebuttle mediu elsby cheesiness longswamp postflight sherels xedos marikkar poundcake nonradioactive abstractionists savonnerie gasbag synners dueting loopt strone mercadolibre wtaj wwrc rogne kernick anoma tomasky swimmingly microdialysis nadege luminex newcleus cirad kilmurray ocse armful mazagran malodor claypit frackowiak miyakejima unendorsed trevon baracuda dashcam randiv castoffs emak reclassifications borrie frittata jellema shirat fillis catthorpe tributed accussations dematerialized dapkus takotsubo swivelled bastwick hilgay carrodus alonnisos lukaszewski duologue hesistant underproduction arouch pizzini twal cazuela amukamara amorosino thhe trannies wisoff dsrc charleen esbwr enthusing jacarezinho oberau voro schuurmans araia premat changhui ladron oapec bengoa gullotta wanxiang civc microseismic llangynog recive lobstering saferworld talwinder convience microblogs hausken keeslar careered kokoszka brinnin heberlein moumen loita macrocell weinzapfel westrick kulula thriftiness candesartan gittisham copdock haulier feus claunch lazarescu moop ravenously ulemek harperbusiness decelerations tkf kangshung farmersburg celestica wombacher rubinho ladwa jotter laverack birbeck momposina frish unbuckled millfields dejanira laketon manala haakonsen tillstrom orcadians rahs zykina riocan radwin hockeytown toyen ejg serape rebaudengo kweon schilthorn enertia yeki belkovsky kaputa willinger boart atrisco scampston allums electrocardiograms cineplexes laryngologist 
rudham saksena treacly strategized sakie twigged hendi recette edar glinting lefkas possable gransha christain alteon overpay srijana gwynant eseries whealy laurean brumer hadewijch yoani putschists bubas vulvas mebazaa ongwen buddon lumar fluegelhorn zapiro champex shipshape charecter chawner roadbeds rohter ehome triston zmievskaya mcclarty laaga agla manhattanites bonenfant exactas oblinger sahalee mealie hatrack martinstown supernationals flowserve wokalek keraterm carlat superantispyware arguez teaspoonful smartcity nickless etrace poyner relámpago gurewitch tobón burled bewailing meriah userra petrodollars puthukkudiyiruppu avenyn faidherbia diictodon kudankulam pumfrey fluorescing maywand momodou glaciares ciee lochar vonder loehle kubuabola blash dayoub interlaces budish nosher eslick kailyn rotherwick encarsia noriel hankes mirthful boonchu caled winnisquam informercial cuill marinoff feniton dirtcar alleva perspiring suffuses killoren fingar feminisation specfically unstick oakenshaw amrep simliar krumpet byi sojitz conquerer morsch dragées ichetucknee gotomypc gnomedex openess rossberg niua ndolo discernibly acholiland sanit cardullo owg skyservice wriddhiman trutanich childminding heartmate joren aramayo snizort gradison alayne sightsavers gartcosh hanesbrands downpipes manacle chameleonic olberman criselda zagurski craigville kronenwetter pinking bilili mcguirewoods yenta garcez psyllids berenstein nopa satisifed toneelgroep xde mainul griesa hankar tartagal visher unirrigated antigenically torrigiano freddoso hinteregger muglia scandanavian dzon potrykus appcelerator sups diacono geffner inchmurrin furnishers respers eyssen hutchcraft minzhong wojahn badree wikler gloucs kreimer legna litvack indefiniteness weigmann permeabilities droned peramivir unenforcable artexpo nenno southers wordwide bucatinsky himmelmann euk noorjahan haideri xte affinia oxygenates oswyn numbi rajevac braer eduardas préliminaires thébault fishies aluwihare avantha jahanbegloo 
edaps tamie nitot scantly khona clonbrock wessman coquetry moscowitz matsuba ballysax godinet steinbaum churm diepen epsdt elizardo dieste petulantly stojkovic biotechnologist atfp jaures willand bashur kasse solae farzin mardiros tongkang minitab footner christmassy unclarity sichting attemp hpcl rafaele boadrum trelissick arvel massih maume ajg ctis spilka coarsegold burgalat sotnikov semco solecki sneezer shumkov armanda knowling zargari farafra miembros cerasuolo haufe polastron corbisiero latting placates qlt housego poreda pruthi lachezar zagats cocal korell ible hyperosmolar werntz evendale pogosyan togwell kashagan anothe pecherov kegley macuga jorgo sviggum filmbuff arthurworrey desiccate cullers museon lagon heydays solove fattie lagrene claverdon gonadotrophin bazell eotvos snapfish voshon kloner cachuma ampo gordeyev manaton demeulemeester klaveno kincsem weirding vindija solchaga llandarcy karos sarbi mindlessness zulay coiste mtcc sriskandarajah biondetti bewail cherkasky unassertive sayano wintersville yachty omotoso cyrkle wafula ugueth fluttery iveljic phonegap labourlist explict marraige mazure stright opticon tarjeta agrama murrel bossiness hfn dipiazza datadyne labèque rafle gopuras goupy donnés metc drissi huwaidi galtür wutc makey hassiba morleigh absorptiometry kendel bruwer cfts centerburg rajgopal galácticos cavenaugh asplin barcade anyama mennes murugesu orlaith relat hunkers ichc dodsal glotz symank statoilhydro fith faeroes edz revisitation celeno eqip darik allmann clancys zawistowski halau moussambani humbugs anthe amriki mahla bitu nemchinov arleston oxney hamito nahai wmus geschwitz sangpo schrimpf salarymen landfair aurubis groundsharing orebro spokeswomen theboat phials romanticizes portos birchmere berghausen proggy mousses faser gomidas savanah brecknell hulc karic roelandt allyce swoyersville delegitimizing reimplantation keeter hantman xintai anney jaiden minicom housemother gatecrashers tindell pipitone reyher truing mbele 
radanovich mostapha wachira conflations devellano waspish transnationally franzos humbleton nsereko smiffy iping goners strandgade rigano supercrew chens prashanthi liakopoulos pirfenidone dudleston gambin covad rixos tinklenberg leijonborg tapeh gabrys prou densified chicherova weigelt dechellis hiong demonizes wilnecote mazmanian andelin westclox metaswitch ameliorates hassidim iskan feugiat lidstone admaston nocere redcaps eqf thakeham streitenfeld dishonouring edocs spowers metenolone riecke motiveless fydd falettinme tontitown poptech yanhua craned ossify tianyulong pedn unembellished jdw grassle rudyerd shrivels devcon misdoings ninio eltis tillou tzortzis ronkainen sweid premiss konocti borgdorff bcfe mcgourty bushed tamson restrengthen katalina bhurban pirus nonfinancial badaber upconverted traipsing lurex luvo soosai airtankers fonart baktun icily bitsadze towelhead kurkova mitiga cantarell fragola timespans oxybenzone bazzini depoy viharn mubasher adtech bhavnani mestas illegibility beydoun chineese hisle corporatocracy vassiliadis altberg lewisberry klieg debusk schmutzler dallis teulet preperation unstimulated qdoba stammered parure ginjo tinky rightmire alpargatas unfeasibly dzus mutalib armelagos daylin odintsov vuzix arette basam abpi dustup loida coml illume tachbrook seath semiofficial tomatillo gladdened cencosud lisovsky imperva oluchi pbsc sisqo popski seaking sibat flocculent roadworthiness kiltegan canf kadie vielmetter otylia amptp imponderable viagogo eskdalemuir shakiest afiya kazanas ablating brewdog hwd dissuasive malafeev intellegence infinitas haved khemlani chiemgauer yianni bananagrams zkb elounda bourdages bengtsfors pcpa karpowicz measureless purificacion danseuses llanrug kamakawiwo unvetted chlamydoselachus kwr cooptation kalabsha benen shazad bague makler wandoan venery jiggery georganne volken wieger moslehi wearin hovick penkhull buric karlovic lasota hoogervorst giltbrook taiyang michou ksde crowner bergemann ruddi sorba pefki 
luckock dierckx onamia mangena loubier bellach spratlan sudjic kaehler bloodthirst gerring beachland terrorises ghermezi karnilla roslea cristoforetti aroud untagging geniune viall lancome ekundayo spiccato lshtm unlu savouring zampini timewaster mackubin moharebeh profonde saltines ritmanis middlethorpe pekinese jetmir pitsmoor lexing grigolo badgery clontz maginness ahnlab sadigov muhimbili hatband motherload bronzés hellgren evidentially reynish eida tillion silveyra dendreon deanwood nincompoop winsnes fornicators dabit drenches centrix hockensmith reitwiesner oluja kinawley iass uncategorizable snuol ancyl brandán beagley daggert scorzonera steinback xiqing asinof hojjat shaikha onyekachi imprest patane bidari ranaudo felizardo habeck staniel dauntingly learco solidarités aiston pegoraro metabolist ciresi pasic bowfell petroleos mutchnick rubaga romanik sakubva bilking bems setareh fuzztone orrs cackowski beadnell villasante rogin manvell kocharian livaudais tailcone littlechild panormos ayachi margenau cryogen dispell cruzvillegas ibwc unenthusiastically istrate jannaschii wannabee trellick mukuru amiee kalandia geraud pitshanger salinero nycz raay vastitas wilan cctvs chauke sakio schwazer copine wettengel estorick patsos supergrid lendon vaisse samsons crestmont dipton plaku sudac kulbir piossek pirrone sigifredo eppy cronberry ntsu tecdax sidha beyda bcts gelan mingming deisinger beelman spart denga akakpo breinholt marhoon pickrell huntziger humanising emanon pentair rajyavardhan haeck laina acria underqualified nerenberg samayoa redgauntlet hainer lensch maerl enerkem burkean cullaville stoneyford gonk adnoc zinha tussling sicarios razes rakowitz iraida deoliveira anathem quirnheim barouk flancare arnost gaugin glocca siddiqa ramelteon gostin jinglin makapuu orsted molinelli narayanhiti gangbanger repect manalang loyrette almondo rollier malkiya manobos yashraj slinking holetown zov wisman saland pequenos ostell huadong jerm thwaytes burdis steeled 
touman tabel afwerki editorialised oakgrove portait enfolding memristors santacon feitian sebestyen hodell cianfrance liptrot ponzu currenty zeitouni pushchair housecats ettingshall ishasha mhcs rickinghall fujara greasbrough firstborns unfading beaurocracy novitiates berumen hellraisers sedgemore kingmoor chesuncook lewites defendor sadasivam pacc mileson kelber degenerations collards massingale dhcr takal mansson freds eponymy patali flitzer skyrme sharkia mdrs cartoonishly structureless wellawatte edgett husnu ecchinswell chitralada fauconnet xianglong baldassari hayarkon wohler ncor inhalations dhupa tantia onatopp goetzmann grayland susur gobbins expressvu randeniya slinker taygetos nozari zappo restell iggo valerik arhab ginns prizegiving helft metropoulos hamshaw necrotising udry crassness craigneuk moleskine einars gricar ahv shobukhova aunger bacn leverenz achak itchin interposes cherelle deats jadav dicterow jamen whisman rosengard marimón shaheem mainetti samboja invigorates cineastes nipr rnl antons glenshaw refacing beths scherzos sweetish skea velouté brutti uniloc zieman wendeng conspicuousness chateaus themeselves outfought liquidates jamband oica chambo eveno umred elachi debenedictis pricasso gwede baccalà cliquot anaplasmosis grandnephews tsakos koshu betsky deaccessioning kostrzewa overwash monocultural wwoof ladled kinglassie ilhami misinforms shakura parapluies trainline transatlanticism gillooly romanski cleamons verreos difelice simsek pareek doodad mainstreams watzke ccfa grisewood cynddelw ufh bellafante muhaimin portesham unforgivably accoutrement maitree zaplana balatoni parwich bbls lotina shaima iannelli lavorgna superstitiously scarva kelmarsh beven heps nahed undogmatic shunda pâtés imperviousness sepulchers becco liddi fleecy goanimate lingonberries fdo mabior hicieron penbryn cark musone ndiku gentians clamshells echocardiographic foba enucleated laverstoke celebuzz höll simley pettaway nwafor fentons lindis barbree zubkova mhin 
buggying fisette karah hillesden kilbowie grahamston bisho ditullio bealer chinch spatuzza sharston jiddah eifl nahdi rotr birkhill topotecan kabaya ntshona matatus ialysos summersdale nikkanen wondergirls tarian bedsitter soldout peychaud misconstruction sabagh elbulli genral perina goppel mathivanan madueke felsenthal gloamin unfriended spinless bunions dufrene tanygrisiau michaelian demin mombo portskewett nataliia yse trenance anythings considred myelodysplasia gerassi superpartner aggresively shipbrokers schanker tetrominoes khachiyan chambless muhka orphanides udawalawe dhondy hutaree flightsafety kabashi pickfair varbanov hajja foldit publow nanodevices siteman deconditioning islande amputates brogel zeshin shahristani freshdirect choler meropenem groeningen hospedia eveningwear golitsin farukh lumberyards jibu helarctos tortolita laveno usless spacenet painkilling eiderdown zier pinnau eclairs kolen landesk brizard cambier piaggi guilbaut siegessäule mcgoon begnaud tufiño sapolu greyston lampinen albpetrol kempowski bluf illions disenfranchises muylaert homayun remue barcoded axman shimasu castagnetti dalswinton overstatements devilliers youdao myelosuppression sisemore neckerchiefs baura monoline rebagliati decisión jeary cowdry paessler ilkhom ghahramani hoevenberg booklovers volen soymilk childen biotics steffon weeda bovenberg parlays dobbertin wigged duques kariye karpel medinat flavanols vietnamnet khandwala skadarlija bewailed wiesmann aylesbeare horningsea shuttler suavity bxvi whitevale mortoni krzr kerkow electronix trialogue philosphical sldn melber masch ncnw stranzl widmar melany valian paedo raghad seitan paciello elisse minatitlan afrol najia manyi yuping doilies thebom donowho hallingbury faffing mahfoud vulcanology minisode whackers musalia atmail flics annees darsena viglen vacuities iqn nosedived customisations befrienders trabajar wested kuperman surrency paedophilic deeding wigglers svilen llps jumar magundayao localising illiquidity 
outmatch durette teodorin sparkplugs mahboub plester gasunie consolers zdunich macellari xiushan mykal marchon seierstad prilosec frankcom raditude consumptives stmicro bradbourn edleston biohacking dapena savennières bahanga camatte newsholme territorialism choge cmds wiseburn csorba snapdragons hallisey yubo evets lineweaver hogget kaiserman stompie doubletalk bragman tsvetayeva janahi narcocorrido issu grindavik mzwandile sputniks sapochnik mcelvaine cajones spritle krestena poleo freegan oxi expalin gawking hartin decembers photuris footways garcha dobel shepitko petursson fastenal malph ibot monua critisized serwer kelps guanfacine synaesthetic soderling youk kinemathek meghni philippidis daggar lourmarin autographing killinchy killary hanukah mcelmo lunk rieper algermissen nichia crannell nonfunctioning greenplum grimmest telmar cherico diacetylmorphine amson fiascoes postgraduation fungibility entrenamiento udeze pearlington huwara garnero kreitler benzarti mathebula mnisi citygarden chocked sabersky butko natynczyk aleqa radovanovic bleo mooty autoshow saamna unclipped waldi almosts macanudo ktre schubertiade soooooo zeune gurnos fictionalisation seychelle spellacy millstadt talx pfefferle bellway grabill hamdam grassfield sagheer rostovtsev archerd undergird berken besuki chevillon atmar watana ibrar spaceliner kulvinder jaleesa thurne qalys iscar spalter oodle youds scotese mazhilis rajakaruna easthouses buczacki honcode christene tahina caynham segars mulrenan fressingfield mccamant magden keepership wihs dragge abukhalil unsustainability jonrowe sodan benitz atuona kutesa bluejeans synergize fakher clootie dipdive winegrower taiyaki milanello rivaroxaban bodorgan lewak ayash romed fiser scanzoni ziolkowska pedrazzini jaico hanemann pontymoile lukavica koenemann sutz sandle schifter malagueta ischenko clementson colliano suon shonna peul chrystina allbaugh hespèrion rsmas cognacs hyaluronate morphologist viruet collabnet philanthropically tabards uelmen 
baringa yosvany kajlich yousra sportman seighford dulse barriques werlin kakata tallac counterpointed matekoni jinli superbubble mcclafferty jalala noveski trelowarren gauke rochell bushwood forston garf lynsay seiver cigarroa cridge glowworms nickolaus agboola reparable albita tawanda natco sangak pinhão biskupic kleindeutschland junn emert misremember wyrsch larding parlaying jobstown worldspan aharonian photographics nicaso kalniete poultrygeist suspiro hamfisted adastral ditzel piccalilli gavvy baoying nouhak deidra turnock boonjumnong cheren gyger onyszkiewicz kablan bartolotti pado zedo polston piena mexx gracioso buzin stimmel bernall bryantown budenholzer updegrove rubbishing howald longparish sulfonates mckenry abdullo martinets waxham ricefields doveridge clarry kaimin stahmer lutsen tommasino gastroschisis brassfield googlewhack gelukpa skyworth artema miltenberger cabragh berenike preqin jakubko telelogic thri ctms myfootballclub hufton dieta cysteamine soldevila jeol broodiness macys helvenston knology schumpeterian ducktail pnk maylee numskulls norimitsu afsm decarava charitybuzz funghi definer zarabi kholodov adamowski diène kadikoy mataya raashee beigi yueqing halation aroun reseed bapco fufill pitlik themepark invigoration pacula schmidle strathendrick clearout sosie mcduffee sternlicht ahdi pugnacity tesei dynabook strogg iwmi penfro avowing intralinks horovitch hypes proabably mascarade csco beechcroft pickax crosswhite dunghill exmore sixstring zettl dueholm retinoschisis slickline manliest lienholder thorngate sietas didnot simensen sheinbein mppt eaglecrest mptc oelsner bittinger grangefield namhong arkengarthdale tcca winnemem tarazi valcarcel leyner danay lessness stickwork mildews tolver robynn snay hinsch kennemer scottland weidenfeller montorgueil pedf koljonen tamulis birlas polemis visted endobronchial moheb shearings chamblin firdos tabart benbrika kabanova jalin yusi skibsted currans lefkofsky sucharski falciani hqn freespire tacey 
literalness herzsprung sweda ithaki cpvc luevano zekai paker brackney iwmf slobber brandstater kriseman golledge moonwalking garsten elementum juvin weijden everall runswick culdcept rappold songfest shehla rakotonirina gîtes horng dichiera hooff domspatzen nkululeko ameln gawley trasher sprinturf radelet kovachevich polyamorists trug supersub ollivander maoulida nykoluk svq vaila sbinet chilecito calker overgeneralized glodwick cassoni haibo traceurs kyaukpadaung counterclaimed datelined freiston hermen joannides kinderszenen sexi caira vishneva afterglows skinflint happenning setebos lightstorm keasey compliancy nubbin hayya ablator jeffrén graycliff ultimatebet kospi reckers abbar kervezee scup chemiluminescent cfao kupfernagel waterperry epso mussie pommard popii etw capbreton masuma teversham beanpole deceases ashp trudges boccardi purnomo telenav carliner corah weier birzhan menheniot recompose kupersmith wramc gpda kafé lechleiter dortort azema badjao terrex lendvay controverisal iberá edenmore statice shanteau kutna untransformed alack pseg candidated zanka saesneg santita consummates anjuna perimenopause turbary tomkat shonky publicises raslan sannicandro mazdas hochstrasser deerbrook sophisticates itson arsic ohlen strathnairn mulcahey loston broomhouse dscp pelliccia ohss uring afkham winglike uclaf kidds madcaps hamedi trumball mettawa kemoeatu ipala aktogay futu amington sumwalt nooshin stylizations artline multidistrict agrument pittaway covin witherall machinarium medison finglass kleinkirchheim professorate krassel crolly canolfan halfheartedly ulemas giraudo praktiker eshed oyamel mlangeni nickolay anpi baninter tastebuds hantaï agentless acrux rasharkin berecruited genner masseuses hathershaw vagas kuric ecotourists harbo skaftafell massini mourides rascasse dayjur ghashiram latheron aldringham kadhem guiltily shamai pfeifle ammirato illbruck dugar healthways ustashi tengelmann pavich stutts zizinho gony ahima ketorolac lifeworks nenthead 
ignatenko sankore empg yerbabuena nijhawan kanani hooah vadrouille lupercal khakpour guynes spuy timbrel wellers darwall panici mazahir ashbrooke longhaul skying shearin ahali methanogen guttenbeil grazebrook puris sammel japanned pinaud sentelle yacimientos acknowleged sechseläuten neurochemicals setola depressurisation touchflo craiglist webquest faida kolly phokeng gingerbreads garganega catchwords batook frêche schaberg froedtert halesia relenza biersch wiith ispra airdropping parast suzhen ayish mabaso benmosche innse berol ownes daughtrey ukho tainton perkovich fleshman tendinosis bhabra naftaly holligan aslaksen eristic ecap iogen chattenden charise medomsley foments impracticability centenier raaff aletha soulquarians scarle sourton itno cuzzi kristeligt stenin venustas ivester bentel underpopulation piver gornall nutopia katumba yourname onp batiks kabary zenonas goiania scaffidi colourable farmwork gooks unmetalled spurgin darrach ashlea maixner robathan gezim fargesia youngquist sofcs dimmesdale zaffaroni cavins repositions joinson gerontologists kleinsasser methimazole resurging xcelerator sermet oyedepo poettering infanticidal cpat kabulistan atletic vatz grevious beshir vantagepoint bogaard shamoun nejla licitra demutualisation poncher tavassoli filthiness optioning samran mycoplasmas unshaped placidly rothiemay keleman tauke vasisht clouts kruta goransson wirh spadoni sirtori belder longde dallku nienaber counterpane gollub ography brakspear uzay mitroff namedrop sonare conceptional wormit pantalones caremore trepp opaqueness overbey collaging warbreck dyet chasity priveledges palihakkara espeically cosan akokan dezso pupovac tallwood yezid saavy guoxing fujihara shenanigan ceftaroline beddor vansanten anoush pabp medwyn hmmmmmm haggui dorayaki outmanned olba kaurin savundra castera wilfert livesley gennet moataz siekmann intrafusal dvts castlemore maccool playscape suicided uzak quilicura masrani twanging fadal pinche narazaki mbss aliment edmondes 
naturalmotion accoding homestore zauri cwmparc merkatz toolis afld chromophobia crisanti cleeton grenell preecha chilstrom susheel kandle cathinones micrographic bushwhacking hyat blankness loungewear suek aslanyan unfractionated wantonness oecussi euronet sdms bailong harolyn sandaza scollen gdls travco kasell croner labneh baoguo stocktwits gradoli opsware hochstetler castino balkanized methanotrophs goldfein keisling vukcevic levangie jarbas ebershoff kifri okays karakia kickabout adamsons incentivised corsar unsteadily keval electrolyzed kkg redrup nobutora dysplasias hlx frcc conneally kaniuk cayugas curraghs vge prophylactics cylch erviti wangui yashili blumarine roosted hcas nodder countin tripleheader lolitas yacona guilloche telepharmacy evidance vacillations wilier jives cobuild nashiro ashibetsu mouthguards aouate lardarius simor redus enervating rensen presure gerig gonzago incr cluetrain schlocky civvies llandegla slok twiga manijeh bencherif allco macker ludtke ornellas shadowbox mouldering deboy tchelitchew myomectomy gurjit hotlanta milliamperes sareth eiir lebleu semidetached reifying chmi rving kennamer hymel imette palmpilot dallet gallenberger hoerr kenmochi zinnias huisken manha chayan ballindalloch yvaine sveva prayuth stennes premed halcomb chengappa culpables planman karkowski tiviakov bedsits qazis nesu epitomising kkd confiscatory malook hvtn echolocate tranen newmill juyongguan brannum tábata alexsandr berckmans grimesthorpe sheika bleau tjan mongeon kimveer querbes rebasing avilable blaum allaf gaspipe boyner airag aniruddh zubaidi holdback yihua comporting jezebels shabandar maggiolo tcmc anencephalic gratefull beken ostk delante pellin rnid faridah trapasso oix narkomfin glosson bretheren wahyudi stanhouse zeisl shamah jetersville lazarowicz wajs dgv dueto santeros itsec brassai belfi oneday hegyeshalom voevoda mssp buckel chomiak mandia steinkuhler enoxaparin drumglass charbonnet innerhofer kilmington undesa kazdin monicker unfreezing 
ardtornish tity fouetté squarepantis merage geijo glengad ohiohealth octopod semiconscious sioe teboho brucks bogoroditsky tuschman investee lekiu onewest egyptomania xiangjun cheapshot herzlinger nailgun toshihisa loughney anisakis yapton unkept blintz bounders newscasting laundryman pute heroku beaning metalic vadera xenophobes janmukti evite bolitar lorbeer tillar incisively dongmyeong gilds kerswill sarofim bludworth methyltestosterone laire weeder leiman reclaimers colludes greenkeepers swoopo schwartzenberg ginor biruta marymoor wunmi labiosa chastely mikels haddadin csrts metia politick ossetes prbs cavoukian darboe pockmarks madhulika pouw doyel dmsc competetive sagir tirich cutbank atomize ljubodrag sauced particpation escapists lekkas snowdome bennies unfilmable kaktovik marès utem importent hesc blowpipes aframax barbrook biutiful larratt serologically ulvert gamzatti darity pensylvanicum ypu reaons carinhall proselytizers atpl efca erislandy boiceville amarilli treehouses karrenbauer pliosaurs amason rapaille zapa yehya prorok nomvethe arvier gine ceramides revealingly salihiyah psq checkerboards vcloud comres inum mapstone patriated keshawn greenwall luallen supan rocktron sundback fuseini yav pulikal schappert unbreak astrobiological lefthander woodview wiam preisser veritably sagdiyev danzinger fakroun hübschman exultate cinealta heywoods waffly centerwatch yastreb torgiano ivelin pelcovitz mulryne ozak boonies quadricycles avioli guofu raffensperger clunking zberg glemsford markheim islandicus unthought samaddar saunters sluijs tekkonkinkreet deuell jamesy esmt bpms cessnas qalandiya guangli prepper krasker mrha trippel mcbane ottolini hockwold soumises cornall kwali telespazio aaaah stallholder turinui hillerich aidans matthee transsiberian kandhari beerens netlog qrio tigerstyle sexualizing buttershaw cebs qrd sicklerville nafdac bechor wome sweepingly lechón nlbm videoid murmurings pfaelzer doornail fadesa xerri frogley supersets undergrounding 
prelaunch kopjes demolisher averchenko ayris deanshanger breakfasted inteligence tomans microfinancing hartbeat gotsiridze electrifies berris hads optiplex dazi schmoozing natz watamu ionophores clausing schmidly woodwardia wiosna novikovas lubow patootie cscec ratia papiri enormities ekkart herzogovina poten iexplore emrooz littledean idealising tradeport nirvanix bridezilla shaorong kruzenshtern lucevan gurdal reusel goldschmid pradit nizlopi sindbis pabrai toussuire fotiou whizbang mosch nyn glowna tigges cuaron juiciest bpos kilnwick coolbrands imroz plutocrat tresser pushtun utts szatkowski warburtons valvettithurai pfluger yanfeng bohrman himelfarb weidel arenavirus bacsa invincibile arathorn vocento masthay hydrodynamically tobon shiling neigel daulet lipow proroguing joviality larten shericka birpur starpower costebelle vaser maxum cartwheeling cornelsen kharrazi rubloff globalise boekelo aayush aliotti comparisions cioe familicide williamwood undg bossart turbulance oustanding carpano perraud kortedala zunior fischlin pornthip cockling framatome photoflash psychiatrically lockin nemirovsky harmonisations thorntree leanness farepak friedler pollara fanboi nbpp recolonise cringes matteini samiha jackee serricchio praesent akwaaba ostfeld werz mendola tuttles metabolomic binstock alhaarth timergara comella huraa gelsinger klingle contrariness lazara pricewaterhouse washcloth spierig newill braggin dambisa pices crunchers yaponchik mehlhaff eits uestlove haggled tonucci codicils kehillat bromstad simandou madbury slyder teklemariam berkhampstead amezquita demarche trebetherick antsohihy mirabito prais baccelli kenepuru frankos demutualized foremark hostas quf decarbonization chastanet trinchero roomie khowa chaouki mantica analogizing haydu manifattura cannizaro soufflés burnetts nehar whistlejacket biddable inconsolably rouille numbs mixcloud straubhaar kountry unpredep kendro gyrated balpa fayadh icaap certicom ainars herdwick halliche goligher nightsticks 
burnmouth haverigg sautner ofheo pokou nteu earing sandpile shalah sarsden denzer bauknecht nonporous feifer gxs galthié tostones zanan coutry subependymal cect kowit propitiating terpning odstock marysol griswolds zied makgeolli feinted thalassotherapy noneconomic warney cookouts confrontationally defatted contibuted germà biaxially trindon rauschenberger househusband reken democractic nistri saids youness barama fosterville keroppi hamuli knapper makkasan moalem noffsinger dorre chiropodists cressona mccart yongzhi koets avobenzone wisdomtree mehrabpur nocito bodian amercian centreback daywear lancelets hankmed disorients injectables janee commiseration delibrately whispy rayven kerstein giclée calleguas natsuno geere skinstad accommodationist snekkersten tortajada steingraber stechert pijpers polcies hamshire phantasmal mcfarren ancar minuteness snagfilms behavious jeremey sherbrook grandmotherly congestions sunami vuono shamva directionals deputes ngudjolo repossessions ravenshead raetz swappers horsenden harandi kobad quamina melitus logrono kynge woodlea haapanen kiaran sepon otdr columbans videolink unti wampe lyondellbasell shanelle gelardi supplicating deadness newmyer tautz hreik nafld mercal mendlowitz freuds neij whaddaya sanest tacis mpongwe uwire estafeta boursicot sheridans respons swauger errored ugas menacho simri mianserin beckhard sharfstein lanci cids hungrily missles avern sannes kelkoo kassiopi nordhus catw sumann perinatology tjaden nonvisual yuganskneftegaz coedffranc desser aereos leymarie mktg rustbelt guei chenette valvetronic quatt robla yaitanes lutalo beoley mysteriousness pajam asmah hamler mizra rangaiah chiuso rushy statnett tornante haldiram gtalk mutitjulu puroland bruh taglialatela reinvigorates formenti shinri pustelnik louisine mnich aysun hicok gibbo rabelaisian tenderize coxhoe yanggakdo yulieski hfmd saita senoia avinoam demore hamidur emblazon lampur riceland vilt badeaux gopio mellerstain souffrance tsakopoulos berkmar 
osgerby telefoni ghanoush hirshleifer giambastiani wittington flensing gershenzon vatterott witherslack chakraborti sneakiness openmrs ratjen hierl icross dolga afrasiabi januszewski trovan catanach jacquemontii maffucci gaffin thermokarst cropthorne newling blacksville steepled dantin pomalidomide voes woldemariam nirut cuya demystification zakin marzolini catoche herzenstein wonderfulness tendencias awesomest ,not convertor serpentarium tretton atlasjet caparros mumsy suceed stouthearted requejo cyclope winy asharoken husham brumder overstays bekes brushback duckbilled reallly tassone unfragmented walson odong ethisphere furzebrook availibility skytel laurieston preapproval communed thurton sealable hazily grebner lasp euroarts funari aufschalke kottwitz stoneycroft wickstead stratta sotirov istan cnosf paleologos nyagah curphey szaniawski ciancio agreee wardhaugh kaleidescape urtubey linguine shawntae belza googleearth pntl ripplewood fischers skiway sarwer squiggy tracys adhiraj hauswald krzyzanowski callam superimpositions paromita lcec thatn gullo rurally wahiduddin herti nestande ratifiers naumovski psychedelica ucal oykel dolgoch harridan klym muchall muthoni blackson liker limeade servaas fiacco xiahe islamically reportid fictionalize medawachchiya spratley rezazadeh geomedia kearl aravosis virtualised sessoms lepeilbet houze rastrojos langfeldt humpers drukman radzikowski healthscope breindel bergel wpcs ijamsville harled laudomia nuvaring ameria thata floella unattractiveness maib lebid presumeably pecksniff pappert dkms aara matthis middelheim alleway bulluck everidge bloodying subas desagana rylie petitti huseman rathjen cybercrimes expensing cyclobenzaprine magheralin tamariki jetboat tobocman tasch acce lledrod socarras crurotarsans almasy exasperates qiliang odland seabeck planarization obligors khandaker dumpstaphunk ataba freej guigui moosejaw tauqeer ached michielsen nuart parlourmaid walchhofer prequalification bioshield ahamd vissers iachimo 
maligns märzen ildiko mossler imga petrols hainz behmen shatteringly fertel resignedly engell softlayer ulsd huaraches deviser pasetti pittella aulton volksbanken famouse seargeant percolators ranibizumab heida bloater skymiles quilligan gyroball gofers waldenfels torquing kalikimaka antron khumri kandola philippakis zadra battat sasanqua bialek joani paddleford brynden schallreuter kuechenberg spradling norihisa weinroth chatelherault peduzzi baned precent shontayne kerchove bullfinches abrahall packbot chieftess copestake kosak irschick vaciamadrid judeh chaghcharan antillano lgh alleluias tiding aspall cishan sarnat tramel urbinati mpcore colliston rissi fonzworth peiyuan karatina buenrostro handelian collee consitution flaggers conrow stunell vicon hourcade ortal boutle gaowa issing iglo tavecchio dagestanis sciennes aisen snookie mayassa gilgel virality korangal cessa gobern bamf middx perlez sandbeck nephrol cetainly camisea barbless ferrostaal hornitos mandak addley bartholow penhow quenington wilonsky malborough anthim mrisho antos ynyslas dilmah strandhill sciency summerbell jebet inswinger nonthreatening naquib delbonnel purepecha dastjerdi woldu unsparingly rummages lemes smses socialbakers petrin bedol hockman godlessness bonhill ungpakorn etait anaran xijiang baengnyeong wahabism dassu dirtball bairin playpark unmin canidates stingel gudrún whta batholomew mcmillion belarusan khalap circuitously descalzi autoridades thati protz luxemburgish bouajila slayback rissmiller liebst electroma rabeni blackballing kusadasi poolewe marret angio vedo evg harminder veasley silvestrov joffrin avellanarius esiri mucklow condorrat immortalizes granfield ifpma snitchin nerius kelisa staikos rothert lechat overeagerness crookedness kinlock tinsman magueijo oner anglophiles iavi locane necklacing handiness opulently herschensohn abilio multicamera perling volpert ayittey restitutionary boardercross virdee sagehen hennicke lizeth disfiguration fabulation charol micahel 
mcsweeny sosi moneyfacts hurson hyppönen yakushin badawy ultracapacitor gaddes hoty deathtraps taqwacores marnoch soroa stoep mullinger ernen qichen hsdd khannouchi emons microdots cooperazione setence plaudit syan appello villemin rudresh augello quittner lazonby solderless goryachev actel roustabouts chicote groovers realschulen kirdyapkin banadir garfinckel embeded billionths tirabassi groundlessly giftshop heatherwood antonowicz lamri doomtown heloisa rangnekar blendtec caganer guestlist bourgie schoeni dandenault astemirov pikit waterwise lrz jwr badas limpy bellers abastos temitope acore garajonay rihs barrese corbella purrington wladek hiving burble radosavljevic chunyang santra austrailia caspofungin venessa pingping monjack escc adado gambira jakati particulares peignoir holystone adran lathem landwind lomans underinformed ottos ddis wottle lakenhal leuchten papantonio loewer wilczynski bipv kingarth prate radioastronomy layevska telegraphe obaidi munem gman oelhoffen fasolino ational hubbing philipsz kafta catastrophist hutments gunite giganticus rashford underskirt boffi colapinto demello deepal vontaze lockleaze vardell petrifies balluta stutton zekri cuis mosti wintv paranavithana suard urness schildknecht wakimoto chappa riffelalp guberman amdy barvas planyc gettler kayyali adenoidectomy howevers konchellah preceed outcompetes carpoolers dangermouse rosemoor ursua cê attivio lumphanan wuertz optimises microchannels ewerby imtc quarterpipe gius bodiford dehumanizes kongevej intermediated monofin cercas duvets skunked pipistrelles elcomsoft tailenders gastroduodenal srodes winothai govermental oltremare thembi misattributing xpression ravey nodong xingdong grieveson bolom cahana outdistancing bloodred cgro nedeljkovic bartletts akpo epv amrany supsa laband iconographies zehetmair lucketts milinkovic reargue sagemiller mervat alveolitis fluffs passiveness egarr autarchy pbcc zurn beyerle guderzo pfic bredwardine jiping eulogia cheniere chionodoxa 
craighouse undistinguishable celotex ardfern calco intersexed laganside boehne elfreda radojko quoile hertsgaard bitove grinten tadepalli ludes ginder allensbach caldarelli smeltzer spiritless pennario coile desharnais yelloly jervey timmie getzel iraklio multilaterally bruener lugner boskin copaiba arikan hexamethylene compatibilities nadelmann dromintee timani basinghall isrotel mahnkopf celades maimaiti parirenyatwa brannagh pithily hareth venerini decendant daghlas watchfield miscasting institutionalise dutko knic roselee salans particularist yordanis yuniesky closedness scotchgard ecolabelling canjet stenner loher midatlantic echarri lentos asjha brielmaier hospitalize wolmark unchartered khoresh bronchoscope palocci resop goggans spoel jhd willebrands korzen wineskin cleanout prescreening seona tollard delelis hunx smoggy bettington velchev filz tryptase pokesdown tablers lembongan kamie ritournelle baulks moshier stather combustibility nece grieshaber lungley tenpa maunde urip rondels jessiman rkn haefeli gatell tentpole martearena pimmit shorwell dorment comuzzi maplins milc orandi neuqua rakhmonov sebbe chronopoulos unibail masto stoutest motamedi knutton vatter ranchlands jingzhi argumenty astec cirac mihangel worleyparsons bernhardsson mellissa fredin ogrizovic horreur stormville mellotrons genowefa battre boultham movenpick mohamadou skurnick sautee jomphe gaube aaii kagin dechter abrahms burela bwiti kilve hayrides documentry trivialised xva fontainbleau naimo maciá awtar propoxyphene galic maranon stripp barari nonsmoking fatmi abriel iiu sunrider ciis maruk cdex flyglobespan powerplus markwart tornabene geerlings midgely resynchronization pullapilly macspeech therms mwambutsa todesbanden dubhe uhhhh aviod eiermann ligairi cils floozy niedere boundry ulstrup glenmoor caseyville drongan kitzbuhel garcias lacerte quinsy guilsfield newfest laili sidang omelek schwartzkopf ryzhikov iovan szafran turbogenerators arli comfirmed zoshi bransten gobabeb vvi 
cnns momart gaomi softmax condry suhaim rodowicz jozias littlebourne meribel monopolisation facilier fasola leever portera grassmoor lievin epimedium abecedarian tzahi bankamerica luze daubing tryline grandaughter combourg binamé berria rongshui quaytman metabolife abdulrazak anakena upconversion palepoi upshift natanya slappers bowburn sirvent gautieri waterbus pharmacoepidemiology harush cotterman theh pretention dipascali rhodin healthspan dzhioyev saveurs sitthichai westlb dergue ferrazzi liudas celimene cataleptic faru cedewain dallison pursuaded yorongar aite gramscian sludden mehdorn dyrosaurids phenomen purpusii neuropathologists pdss duchscherer devy kevo phoonk dangel jewfro mindstorm attenboroughii nitshill reporte tiefenbrun lisbona windtalkers qamber giclas homaged rbgh shiquan gnv discman forcefields warson asbarez sublicensed troshev iachini makutsi curlies duprees hershy dujarric dauch simione pedini brizendine bongi staaf ashapura mistick easkey elenita moayed glauberman grzywna kagwanja espc chemises dolkart vannoy dulaine janifer kamajors stompe hemy ppta jenoptik saqlawiyah mellowness jaua desigual takas panish muguerza marinucci obies jannine botai intracom eyeblink neurotrauma polyaromatic kibbutzniks congealing bakonyi flashbulbs meldreth shaviv hexam positon trippier warfel dorazio corriston quinsey cattier obana macnabb soreq lukensmeyer misdirects fickel touil calda suang rallier souleiman shakiness luxuriance pavarini weensy alzahra zisser karachaganak meacock nibblers wbx kazakevich econômico stuhlmann fibrosing hashemzadeh saliou rolltop logsdail akepa nitpicked merzbach agonis sandborn krupka neumont guttersnipe keycards thunderbox prosthetist nunzia honua kisor skydeck nobert ridlington voaden bauermann bechtolsheimer inhs ruinously wendys glenholme dreiling casserley tableside heven fruitarian stratifying farri ausaf sewp unauthentic reinard widenhofer petfoods sweetshop zouerate witchingham fliehr bandeja oconnor nanogram 
electrosurgery kimelman morohashi sprits savater rechargeables meroni melotti okapis helendale uplawmoor arcsight pompea significan kjus thoreson zarou osakabe dongho kampongs arvato astrobiologists dittus zarins monterrico mycle mwaa ceep unanimated codeblack stomachaches llanllyfni nympsfield goedert levoir ukrainska panathlon handsaw rauth semblence metanarratives lacavera gwyndaf appeaser bruenchenhein pontification pensively vlasák maximizers denninger doerflinger qawasmeh borovec oyeyemi holidaymaker zawodny portell ropery enourmous sickafoose zambar twentysomethings oluoch tollison aframomum scheie rhosymedre younglings geocaches redroofs cashdan floatable pyrocumulus traid ptin kneebody winsett swinbrook ellson achkar calata poillon dendroctonus whitsand pharoahs hinterglemm sarvo kufri kundo platonically northbay copnall fmcs salhouse maghen ligouri wooderson skateistan idlout raisner nwk weev cristophe stasevich killeavy biobutanol diwaniya fillo catacutan navigli oxborough propylthiouracil garitano sokhna ohlund aissatou ceaa groupuscule ifema tchmil basini operationalizing disarranged mtrs channy bleedings hajaj biotype tiet stelmakh baumkuchen downtrend trellising bookbag publichealth upholstering sweatin rallys schrenker rlw amikam barnardos shanin caneel intergrated schuil immortalization kokua wieting wladimiro bigonzetti kekich chasanow eatable silverglate fleenor noogie mulatos houhai hoagies raichlen sulaymaniya undercovered falsly lobero parky similiarly cherien rhessi cossham shainman treffinger slapshots tmap iseas alecko badgeworth alberobello eirgrid gemar motezuma inviable jamstec umca lapushchenkova longfleet jamiya trabants cnnic bulpitt earther blencoe teardowns tangalooma wonderkid cullet lliwedd sifiso toben ballykeel drendel focis slenderer emotiv kurwenal pakoras mekele prominente nonorganic arlauskis vandereycken zonderland presbytère preppies titulaer microblaze ludovick hawford showery jinpa yomp kaster shizeng dison bonifant 
rafetus balfours lrit imperils chubbs oversleeping compsci spacewire klaudt turpen biohazards dreifort allthingsd beardo pyritic damásio shasa voicestream superhot bogachiel nappier moob accretes financers radulovic incorrigibly barenholtz malsor blazejowski ulimo gurgles olivarius pytchley perficient lovestoned abbondanzieri thoratec cloten auyuittuq counterpath largly psbs lodh lajitas mieu bihani monot weissbach moskit shirlie hilliker scottoline abdussalam cothill crossenny biodesign chunkin clubgoers vogelaar bookmen feminity picosat wirsing mamphela ivanisevic swigart zuiderent nigp permal lammerding mgahinga lornie arcsoft corrosives totalitarians nause nalluri lorino fidencio zizzi wupatki kroo permament pieties egoi dieker padil tricom trefgarne sidelocks kroesen freebee acheiving treglown atheer wellcare lupeol bhajji hatmaking khachik findern andreopoulos porgies cassity slask trebly phonepayplus ferrett amieva ghostlike revelries marrel fance lekgetho hurll pheobe crianza laikin scauri madurodam overgate yalumba basmanny hammertime pricilla ottenberg conyer gappa tmcc mcelfatrick anthonioz beautyberry parlá rauhihi brezsny lamara spuhler vivisector nhlapo traceries brunches decoutere streeters tadamori rubenesque polychronicon abouth tarangire scopwick loizides reihan kial steeps kingley trauger rugunda sonagachi tomenko unbox vanburn wonderlands tomasik filevault poringland gunderman averageness bombsite dibya lakos dobber enfields islamize potoroos joggling kippy azera vallow pornos colino fttb msrc matze chevrefils celf cappucino barkerend laodicean naide rajak bmhs onen najla yatom rxt hirons azziz rudong plodder atba snepsts craigen acquavella worsts scrinium matsuhashi carhop drexell wnbf lauby mlotek balmat aptt crassly darrett caraco lucimar sandmen liemba titanoboa efland megalyn lalmohan cinderhill shaikan arbitrageurs rogatory parant hesher boomsma ektar pompeians wanguo camerawoman curers bigotted lehra homeplus disembodiment pacbell cyclos 
seym moucha cerridwen momotombo bouret nanduri celesio bisnow dannah handpicking ullamcorper kathryne scandrick norne blares magnaye burhans athanasiu ferriera nhanes sajous darwent alvarion braccia eekhout stacher inaudibly aquascutum marchwiel chaverim flowertots baljinder wissey migiro gibbets sayong possiblities oshchepkov kinton ravetz hadfields potatos tampion jestyn mohatta dolts wenjiang danzey gundogs scalds raske lippspringe ittai eastcott zylstra hollingbury novlene sentimentalized flystrike handleman comprimised moviebuff appropriators tondar goure raskar libeccio montchanin petrobrás paradeplatz simco vwi linneman brayson turweston liszka omnicare merzak sermilik fisser cathro beyster snoopers dylon iwpa leppings hurles michiganders hiney emtman hypochondriacs villasanta penninsula europeanised douchebags eagen pattisson skimmia roesner phippen pavant enfinger wolfberries jessopp dahmane cochi yasny rkf forbearing sphero luthar volchkov bishopsworth falvo umas behnoud yinger venini transfigure longueira teavee wingates sukhinova doumar duette virada kerelaw uhmm vachani earthships darunavir muthulingam sonkar marchy driftnet hefferan geml epidauros takeoka titwood vodcasts montario bornet rowner shervington sommerstein aktyubinsk barding sikhanyiso assuages dietzsch hawed ramnarine lifeclass resumptions bookexpo myroslava hlady elettaria weatherburn rockii stracciatella hurleys musicstation rawly teinturier ashelford muxes songtao blastic francoists farangi schappell mostarda omapere mctwist sheerman shimba binnington effe harles marrese shimpi cassanova halbeath wenban osteocalcin buyin rivulus scrummage emodi louviere muaz amerasinghe satel democrazy igbp yaichi relearned reverbed dpz polycephalum osac paetz aktenzeichen nafh khoroshkovsky omelets zhangmu leamas pandak qud cimm trongate midways munyaneza attemped mccoig ywc peregrination unipolarity slobs tetragon prusak sportsbet fatsia changho gasteen czeisler sarubbi lacome saurischians okecie 
douby lochead promesses tatafu fitments hattam neifi awra legambiente oilskin eprocurement hostilely wilmorite convivio wingspread sabaot wattson dazeley chetry swade bacos bernette hairshirt alavanos mushahid huichon botwnnog barnz tacci unfreezes hoder muthuvel toshihito itms hodgin jrtn mesnick cittaslow bokma vicissitude majmudar theatreland anisul ulich prj hollon bcrs lyminster porphyromonas qes schoolbus malkeinu bartee cowbit conveyer dezza steart plinky ackah stupefaction danwei ackner nuweiba jfx mccreanor braslavsky ramipril attique peterculter inciarte vidic accordant encumbering jochim fallouts tskhinval mugla tonghai gallu hausler broadsided commandaria brogeland medero yipee unaggressive shakedowns csst ruttle deblicker foinavon cppr matsuev ostling blattberg docusign undemonstrative lensbury bänziger woodbank contry curvier unfavored patronizes bolometers caterwaul headends addthis strateg dyspeptic malandrino lamprell bolze whatchamacallit bulleting haggas franza tortor toensing gressier cisi kirkcolm kaiane nmrc mahasin manoncourt woudn burkan carolene absenting rondinone aniva romanko vukicevic rets yakhouba cumberworth premacy stenius irresolution avanzi megginch wolkers bloglines benzos faurecia alverthorpe penglais brfc fizzling galarrwuy fassberg declassifying pinkaew haensel cattouse rizeigat leftish muigai supermajorities badenhop dimasalang cvca skoyles barkhor redrado nagourney clubmate whimpers aircar icewine pummelled whitelee munafo mohanna elmswell karatzaferis elscint kroons primesense adelsohn eidan dhusamareb domenicali kurfirst xct setoodeh racaille mustapa guiral antidrug hoggarth theodoropoulos ronaldshay counterspin serfaus isssue cadnam drewermann teesport redated kiess chionoi milhazes veley snafus skep chemoprophylaxis massell squareness outsprinting velikhov nexhmije tuppen doyal kapka aossm bilderback muddier embi uthayakumar mccheyne iluc maturer gurland atention schloemer stellifer nals vitalize cahoot scharping 
assesments bankamericard themsche woodlesford yujiao beral demonologists hydrick bleys guiping tushita gotomeeting chijindu latet dandified ewington karimullah ukcs allsports mccomish chlorinating muslimah bricknell ilgen liitle semerenko nirenstein milnsbridge abdulkader bront kizawa friedli cathe palatin porong gfz garatti anderes natella adenike ommc camak schoolman tyin penlight guenin orientating letestu clise saure everytown pukaskwa cryptochromes taihua galekovic larney kacin repurchasing scusa modifed decended newsreading nefes knome anthropologic transformerless kenroy mcdavitt elwa chasubles ethylmercury bublitz metrohealth odama morebattle kalkhoven nashir pencey sondheimer rooneys antidoping alela nsba tipner leleux cabbar vocalizes massetti borgella veracious anagnostopoulos rollable marfuggi shelnutt abhorring viticulturists pokery stanno flusher maymana kerl iyke ralitsa limaj wikstrom naevo jern tawake intellegent gaebler atat dumelow ogx ambion triffin twiins youthbuild jasperware rauer whissendine mcleods gerobatrachus stefanescu sifters milosavljevic denize eyking rauzzini marksbury dingel wholegrain shehade kilson lodin mikesell sherzad medifast mulisha hemiscyllium fadell papermaster vuguru epilepsia hardye gibsonia haeberlin yankelovich backlashes supes petushki andreikin dyllan pratical deveney isses jidda lanugo aaldef nabozny jerseyans knockando vittatoe rummikub nihonmachi fuzi gillenwater financiere gakayev arraignments bassat medicinals philospher vandermeulen consignees sawhill sjostrom legitimates bugajski sangiovanni museumsinsel hyperventilate genser bevers bowheads lrip sahal mcgorman brissie beefalo palpate bluecar agrifood honigberg speling bayefsky tangas ejiro nigsberg killelea orbiston derwyn weatherstripping berrynarbor catheterisation guadet deleeuw fortepianist rulling doulas wfed nashes babyhood kunpeng ikitsuki bogden gerberg ucea ambiga rucking melda kendy downshifts aigu jundt navtej shakhlin stereoviews nottie vuvuzelas 
tramontano foulbrood gaito eatman lumbala mboma gymnich motoryacht chollerford hasl cauter keading gelle mommens mangane controversa barbone ryozo uspis garbowsky excelente realmente bedfellow norooz fritted afnan felly mooing pyrotechnician untiringly haitong batoka toileting exacttarget practioner mfon granvia trzeciak treament caijing tremeirchion witman kadewe celebutante cucci sundried socol ruvkun madaleno banahan goldbugs movment parklike ressurection lecs uerj techni koly hoseini unconventionality barbeques dawdle cidp walthour cornstalks kalakuta vcb withthe swanpool gruffly aerodynamicists bumming theladders significa tobback covingtons linamar paveletskaya redlake bharose zytel ciclovia slavkin rothay middlemas bootee hotdish verron intelligenz tarara irom santiesteban bolens pyrithione subconjunctival chachere shpend lyst debriefs bluecrest kuppusamy fikes challender guggenheims chongde muzito worng beilenson zuhur llangedwyn farcot kinkajous febreze shosetsu kathar cashpoint rounsevell souki intercapital godbolt citarella raqib nguyet exsultate centrifugally postlewaite uwak longnecks undetonated mitgang bawar adamses fobbed firepool grimentz blikkiesdorp geophagy katende honeyball iadc linkshare standeford onstream jouf vampyrus guelaguetza renationalisation neorealists meakins xitong hauswirth aapor tituss outworn wildeboer shusett sunswift behoves gargett willerslev mamère tenku craftswoman engraftment shamuyarira lefebre manenti centanni belleayre intenet ucst wscc catano icba broon karalahti blackmans kcas werin taroa napitupulu wagland cothren saprolite santacana tolvanen iliushechkina chandrakanthan yosu arquilla bellars mirikitani ciofani amytis lambri anagha veerabhadran usmonov munitz trosch nasos esquer hrelja bewsey gluts kiyoto bennifer bramerton bradbrook realigns abkhazi murakoshi richelson barkeley zwentendorf irvings londono greenfinches ghriba vallorcine kmic garnerville rossos frobel dinamarca mohring bezerk brookhurst baranes aariak 
reighard debello galyna mulkerrin brebis zhengyu hairmyres kittisak ocfs deanston kingsweston gazar prospeed geria gestas laini madlala fasters truog badrutt biocapacity goldcrests considerately gilbertian botin sanparks dovico ostendorf hempseed delerm horseheath qomolangma pacentro couso bhelliom truglia haryono skunky zulkarnaen darkland vesconte naite minwoo salteri zimbler icklesham shalfleet conductress vagelos zanies emneth haldanes fiterman morskoy bantered stojkovski dadá falt gugliemi chilblains lownsdale rainieri claar aceituno stangroom gritter elarton mallomys robroyston kelvim ouside erron shernaz nonrational zaino whyment sedighi almsot mendik goodnites nandos micromirror croshere fagatogo heidrick arunga scarer pikoli bumrungrad colourants asna radiotelescope kidnaping ewy bishton agustinus revivified cste rooden shamsa llandaf gagg muthill madiwala dhgate bercu khoshbin kones mirzai gaier disaggregate garcons shavonte partlet krissi chistopher nsam kaballah partialy mushrif easler forams mcrobb osmer stuffiness codeshares demutualised nosal hamelech pritpal kippe gemi kentlands orlow howletts mappable indiscreetly hillarious infomania exeo colbrunn gentilini aigo mouthwatering metoffice alinejad brockweir josefino pellis ringmasters stilwater yesco soering muntafiq threescore reboux handgrenades anatsui taborsky trehafod xiaojiang donaghcloney harpooning doodlebugs tomintoul buergenthal mamabolo pressoir ebata rimonim einav claimes polartec paauwe bartletti nettops thumbscrews adario otedola hvalur enchantingly montereale sanc groundfloor jezzard peltzman duvoisin bukuya sexaholics nicolaos pmas pacos communigate dsrs swennen kamyar zhaozhong travelall kzf envolved futureproof whimsies jessell waldren restitutions monji friockheim lecount bacons wbv reignition aleksandrina selectiveness kmiz himat uncoiling carbofuran quso superweek cavort annino ergometrine volanakis romsley iwant halifa cilgwyn muzenda sacharow unsay patternmakers richboro 
kottkamp eurazeo dudinskaya lightwell laskos ghisolfi anixter pirgs hafa liebscher iraschko shetlanders ncts sonnenblick morenos halamish playphone blandi furer hinnerk kleinsmid mérindol kirchschlager fainlight protegée gvaladze southdowns bransgore sugarcoated praa miniaturisation alfonsa intralot costless enslen constructora niwari desecrates enwave ganiyu cadnant qis fernandopulle indistinguishably omilami plonked coalman glenfarg njoro niedenthal riggen crouter anthrozoology ceniceros uln hevener traigh rogal eslamian beaus buehrig wygant unboiled lidgate nurturer gtcs cranfills kwast bastie sannoh cunin axlerod reisel jebara bagaria cointet destierro rimel arefin odyssean zelotti wijdenbosch bught ratchadamnoen appeasers kpsp kwarta accetta pegatron furball jamme squeek pnes boattail climpson banderillas thumbstick desecrations oim kemel docusoap rigth firebricks tulipwood hairband adeona lorelle tremorfa astrom kozub dragovic matco dorschner starlike qalqilyah tweeners brohn postdoctorate microturbine outmost astarloa trawangan rummaged kittleman azizuddin stingiest naumi gardai suchart pekarek mmscfd rocheford mcswegan mawenzi jucu morling highjacking barrer shelterbox cvx gosberton heliconias parklawn kewadin regrew soddu mcletchie tolz coopetition sbcc wakeeney trimpin readercon nafsa nanoelectronic rockcliff killock pocked opoona pixellation penlan outspending comares wellemeyer spreyton sedloski ginsenosides lemonier antebi bieito esar ruthanne ligations dordon egpws robleto mdeq robideau ofisi percona putdowns lidon masilela stollsteimer vershbow andreson utilityman villaine kanehira lorden skweyiya heilmeier likoni fonctionnaires ordover dahntay placek irwandi honnor surplices zwiers borribles pangas valldemossa hamriyah doetsch metfield lobotomised chemoprevention pmtct ranunculoides romito glatman snuggie haertel rasual lieberknecht ekert sweedish ogled kopylova tatsunori phishers tampax petronijevic mcelduff engraves umane majette sponsered 
babyland akpala textless yoghurts kinyanjui hayrick vlts eurest sashay dadc tschida tinkles meloxicam sportsdirect vorinostat kirbys ihealth mascardo xkl wiesberger audiosurf belgiorno lilavois penasco borovac yasukazu themselve nooooo tsend dariye amebic shimeji kokkedal perben achelis boxcutters cluss unigate indur tesl solás somalians hvcc gilfedder adre interart soupir haaften weetzie critized carlotti disturbers tuiloma wiszniewski drogenbos spellar weigela oluwale vicke whatnots bmis fetlocks maij addleshaw raivio aviara petawatt calamondin shooing ciste untruthfulness libicki jointness hayb hannaway mclauchlin manhatten counterweighted kirrily civilisational chimichurri mousawi dogmersfield yzaguirre usbs deloatch nakhjavani petrila humulin ousland crossties univesity varnelis slivka acquia manab abdeh terser teape karpasia heelys fishwives selfs amankwah silcrete waterpipe footlong mulero maaran waguespack schwerdt fernleaf weishaar telander melmotte yunas arpc flashblock atsma nyaga fsta ocv rajbir aldbrough noscapine kluever aberdares wapt kindelan dman tappenden tiputini gitau bertele digimarc rabii oversensitivity noilly gandel southpaws proterra corningware birgeneau apostolakis chhibber reediting mokdad stavronikita iztaccihuatl salit schectman tradesperson slimehead eastlund hijiki praxair aasheim fiskin triboluminescence upbraiding alhassane kansen birkitt nopales wolz sieburth jazzing vulvovaginal psiphon accruals regather georga reoffer sprüngli slom maksimovic shelina kibawe goldby uppark popeater krups elizabet dbj esclusham judiasm roseborough exterminates conjoins bootland exhange pengzhou bywyd pnueli valinda mormeck lackenby euroscience rishawi leest evony lindenwald ithenticate pouha eclectics chernovetsky bieng interpersonally schellinger protec diad oelze arpu cannibalise gyaincain roddis kolkota besthorpe oggie filmclub voracity liebrandt wowbagger iaca brunstrom hidary larma craker garelochhead lefavour evro lgbs bramshaw wesam mekhala 
screwballs avsim dixter hashman ridgwell lajolo himberg kroke gebelein touchsmart trebarwith olarra olafsen atogwe aerotoxic wacked dedeman folkies laterly mycfo ryskamp vangsness magaret ners seitoku bluestripe barratier potjaman sourcefire hemicrania baroch lmgs sharen starfest montay staight veneering aircom makiyivka huichang bulletstorm diskeeper flytraps breezily icls arvizo livelong mitumba lumsdon groundsmen bourguet tyvon designline ciclavia cupo haluska bráulio greubel congeries solmssen jianghua gleno disfavors willl turnell kumsusan eswar entis qype kogge bamboleo mandli jaberi wildcoast baduizm papakostas aeca ujpest lavel famciclovir insalubrious wisocky appal juszkiewicz xiaodi cruzer vadar bidoun yarou caihou woulds insensate frima mayumba preventatively gossiper untaken rifqa latinisms kitemark nahri abercrave brettenham haizhou partnoy goeller rudolpho comtech karbowski tribemate wickenby bigest perindopril aruze itandje davidia hintlesham sende daruka sherba qubaisi fagone eliminationist seadrill deliverability dernis braganca guanica salmani kennys barnburner fyles remeasured souhaite copayment cowslips omnilink bernazard magzine ecla creetown sartini wetterich gernert aqar besche banford legalistically llandwrog claridges wakaya converstion tirosh rosthwaite chattem wufu tregothnan ljube manawan remarriages valders lifebelt vivisected tfiloh tateh veneza cartload wallworth sbic gansz patriotica dishon colohan neveldine witout toks renau ecocentric pallidipennis debka steadies qec caméléon naweed xintian hatheway clevie nalawade regence breashears chada aylott gandal toley assington cobr pustule zelensky crookenden amiriyah favignana smoothy knappenberger icklingham alnilam barassie inseminating sunup windman grundfest rightnow pascendi overworks euphorbias niebaum giebler ahkami pelletized undersell holstead striano hachemi scammonden demircan jobbed renhold acculturate chevre montopoli latitudinally yaojin flophouses tubiana dennet fenceposts 
kushev paochinda irps spacs fantasists wakefields arrak grossenbacher rehbinder wnci melnitz mabul hildon wohlford favretto digicert bryl insecurely contextualist allpress herrenknecht cybernauts seban orwells ermias boming jakk bencheikh matichuk rabino kribbella creaser contoversial editorialise prattsburgh impaneled langebaanweg hapcheon pgas naziha madeiros blankinship combers montieth spokesmodels openminded loscoe routan lorcaserin khelil sterchele gaggers ostrowsky ajavon gibsonton onesteel llandyrnog kimel swartzendruber roas tamasin emulsify harcus solet cuka janhavi refreeze winik zctu ruzic dizzyingly inkoom wskq heage apms sapte vanterpool rebholz yeghiayan prestidigitation faez sersale bynea interveniens bagnaia wattanasin helicos fundu kaeberlein kessingland alaei mcging elligible galitsios wazan raucously morago trillionaire brettle ballintoy bickersons beffa techinically finalises unitus devree mispricing mengual samory mimara aamulehti innocenzi gàidhealach golby nanofluidic bidisha pukes vehanen kavo paradize eera kabura snezana hoaxsters wellingtonians ritblat spacewalker alerian cracken sheryll cntf hurdled amreeka heijne kahaluu przygoda nasti darba muccioli mtetwa pohiva backdropped himmelblau vivion mundham neukirchner disapear pelephone lhg archibong buriton pluriform followeth coux partagas semidocumentary tanios massob breakover microtel indicies liom maxin spanley naïvety summat devalos rokotov ncma dignissim certes xiaolongbao preissing mulryan antrix heedlessly dinallo nvision tauqir belltowers croham hackery scharffen tronson shtreimel berceanu bhcs hypersleep wesner lacta howrey ndic dafi molokini kisaburo massanet zuiverloon goldschmied whitlingham treut shitte reichmuth alfonsin eckelberry claimers chancers sonntags pogopalooza warmsley rosett seeda obhrai bojaxhiu benayahu gsea rsss arzani harkham corixa yayan cutrona neurocysticercosis asmc hörl dukovany sparham levelers choriomeningitis ikes derryl bharwana kitzbüheler ppic 
kaneez puds ballotta jots roëves ananova krzystof soundbridge dialoguing caucaunibuca nawe proin chalone xianyi porterie bradburne rahnavard stevick unevaluated aceee chelsia microsofts nondominant kentro shihuang jansing leadless lonard livemocha pulchritude gokarn milvio furutani rememberer edgeless oteng gleghorn yxy hrz namastey ramoche mutahi akhdam pbtx kagarlitsky balestri boosler kupriyanov topiaries szechenyi pampling jerabek everybodys orata ymi mcniel celedon highfalutin pallesen shadbush vvf deradoorian jink fazly mantei addiewell kurylo esserman monetisation vectrix cibarius nothaus triniti lehew skogland elgan ller deflower joho chilwa llangennith froemke powerwall lundrigan bumbles stucture petzner sunshowers hawo woodwalton renationalization landaluce coxing perjuring prequalified fazackerley washlet opulus aristomenis leadwood fichandler troups cufi bigging avda rcom supervoting rohlinger rudell thorstensen idealogy chrimes ramler eidsvig tawjihi disconcert lacaba cwmtwrch sumarni globalists nsmc zanga mardol fondles afact lohaus sensabaugh redflex fieldcrest hez squinty hoj melloy sarmah distortionary clist alvr kushayb seracs cloisonne hylander haribhau npsg urgo xiaowan caxa reroofed diriye cbas schlucht olomana socioemotional sobreira turistica horsehide tulpan homesense pneumonias higashikawa schnatter serajul poolroom moutai aubyns xinyue horsea polenzani bhundu neenan saskin scientifics fiatlux harray talaa amateurishly fillipo adonal bilateralism sonnanstine mnemiopsis londonistan nirwana cassara basely gonyea shipshewana methenamine markeith shkin hodd bolac bewkes cyfyngiadau jirina barrelling ofda annuitants elveda keahey tihs wenckheim murrie amytal scaffolded youssif befuddlement sportsquest perlozzo seuseu maddix cremyll marriotts ezzell theunited chairboys painda jerins illiano toiles gulbahar heatwole qifang pagli yerkir swaptions enisa wykehamist wannes peperoncino diagnóstico envato denbury duross papandrea kutuzovsky jibreel 
ziebell kecksburg orlandersmith lamark wulingyuan gidleigh luxoflux gordman callingham chalkwell similan kapner airboats chillax evisu aedin ecocert coreys gemologists perjure ostyn decisioning superfish offhanded idahor widing bunni garreta sylvina heus sharkwater mcgilvary boukar balson kokam carstarphen sitnik wbli dogfood weinig morici armodafinil jailbirds unmined ednet gwk emcs polcyn jazzer knupp acies nddc losper feadship barzilay waternish jieyu carbonised tomasic girths beauceron moayad khider qustion palfi virginicum faini tonette livestreamed winnard tattie koryta gevo gyurta bxs noiseuse ebags ermelinda edsels bushier maleta cheikho yashere neuroregeneration wedc egerman koenigswarter bumbler caffee pigorini lury taoshi consistenly yonis ejei feklisov sinegal senwes gapminder nawaat truvia tournedos reavie tetepare scissored knicker bergfors alterian viane interweaved tsilla fruma olchfa tussled vitalizing dezhi seekh fyvush subcommander emal thomass libnan hardbodies falcarragh conver genitalis christlike nilufer nemcova epidexipteryx weissbier macduffie rightback drean cnca jettou keskar hoblitzell androsch cohran hajiya jaeschke pishevar wildbrain foxtrots unfathomably baup gregorini hendrerit afac tailfeather independants skarsgard tioram gevorgian flunkies hullbridge showstopping treculia aguer unutilized multimember pratically baltas navacerrada polpo triquet srcl yoruban circosta barstable ritzema sembiring ropinirole eucomis jagtiani racecard thimbleberry picozzi tehy etretat prizing juico ogah fxpro onrush assests hakkasan schlapp remunerate spasmodically heavyhanded tellqvist wadsleyite starbury brodziak hametz seethed nutshells frush profootballtalk seatrek malzeard morelock shipbroking moonlet fairline wladfa djalal sundrum zern tornell kiuru keema labradoodles aford galardi marabu brezis obiols muirs intimidations abete cirelli rivm skott computerware aysen gezeichneten cufflink keycorp rehobeth nametags issacharoff madslien laurenzi vgas 
rouler otoro yamashta jibson hults awkard weatherup alic rilee sukup ubilla tinkerers baraz vermeers sedins maurward issels manbearpig lathwell trethomas ofttimes wistaria excelle novinger nirc bhuleshwar lizama powderhouse reinfried munsterman usiminas amrane himelstein radiocentre manoug oritse sejal hesmer neroche calmore toyonaga daneshmand zyxel biohazardous batterer hooo fascher tanter strongin snuffs turnquest ngeze kersaudy funkausstellung chargepoint homescreen bigne hurren wajihuddin sofo redwick duguet azadliq footes sawbones fangataufa chalfen prugh fibrodysplasia kozhevnikova noeleen gnad soxx uhlaender angarola mikhalyov anbessa soothsaying hasenfus weirich valmeyer ixy xdc calibrates madarasz muniyappa fallo nicklausse toothaker brelsford waunarlwydd gappy sandale sueing kicc edelin glamping pedowitz laddering veglio buttington addiss dumain dyw hougland iddison kagy unliked kipsiro schicker forenza bdx yorkies fusses jakubauskas lanette karatu realeased stisted trihalomethanes raitis yinghuo wanderin lapush ladina martinos conjurors chorwon spiritedly cowardliness sabattini weeraratna kettell elbryan mundaneum infectiousness bhartia pugil hafele miltefosine lorenzoni ubicom fuch plessix sajed effah blessitt azabal addendums benepe blockwork ptdc coopt sersen hcso tamogami biddlecombe anaesthesiologist corticobasal uskudar nicker shenice gowned careggi fean rockyou amgueddfa scoa assualt nearne manses omanthai cowsheds dillie nuzzling fsbo hsj edgeways dilettantism rahy atomizing schreibman millholland friesians bovrisse platnum palmist midafternoon submittals scandalize lushoto shwaas meaders liggan frumin cicalo lymphadenectomy malgoire sosnowska nazreen resorbable ozeri qilu cassop flounces hartsuff mercede lifeflight didymo tessenderlo trochanteric iepa paidcontent jerrelle wallick nowpublic advergames showcaves quinquefasciatus nidar totters binacional lodsworth dogsthorpe darweesh fougere whetten rickers kadra hatless yoopers pulsion etampes 
couloirs philosophising geneen encipher haggan exisitng dioxides hedp callixte ifx fallou jazeerah glenis kamron parhelion bressanone unbraced snca umile quance christabelle eldra chaffer lachrymae glocks namers ripatti atomizers schnieder docketed mastercraftsman decertify microbubble akanbi scheufele ricer canar droguett pannick guayasamin hesling attenboroughi milonas sirnak sabahat acli matossian sigalit hlatshwayo dassen bonafini crute whitekirk galvalume undervaluation yokich mitzel peepul megaraptor pesonal recrudescence delling cobblestoned deilmann domnérus outsurance gattelli vollhardt hypervigilance seacoasts mssb mortifications rcic ralcorp inciteful gload leinweber miram worple jandarma changeups gratwicke herden sequestrations portrack unvaried grossart compromiser dysynni sukhvinder hadrians mansión violens euroatlantic hassock quicktake tarita judaicum expain plusses fretts zlb taglia jumadi opificio kobasew licencia scientic burhoe grucci perked brandford bartelstein anty terho pinecliffe maneaters gilhaney bottke wortel fyf dematerialize subclauses goglia bbox dafang microcellular rafie followspot anonimity livingood atkeson knile militating leckrone banaszek shakespeareans fumigants strutz farcet ibagaza unperceived ursache anounced woeser irrgang sonograms adlung leventon bridgemere distributorships awx fuzzily izetbegovic hopers adeo oudolf bendick councell caroler massarotti tabgha yashvardhan glengoyne azizullah devanadera cutteslowe subfreezing cincinatti shapeways saemangeum chhim boskalis lantus pankhursts suspicous roxi unrealizable solidaria donziger unclimbable gowins referal nossel kurage exculpation donham marmaton stangmore mogk refurb mitsamiouli doodled crossbenches hirohide haastrup unpunctuated bohnhoff hollstein ferrellgas palermitan derlis vrdnik floodtide nosik byrdgang ailed diffent carpentieri kowboy perissinotto bursars mitreski sobinsky mullenix enforcment diebler adrem neuson subramaniyam abjuring cripe hajredin 
deplorably bocks broucek sturtz janat dinova sanmu slowpitch hdssb rabinovici fiene rhydyfelin colao krümmel mansholt epupa lehideux tjallingii zelt mourie apalling neighborhoodscout oska hurring hunh sudarat hatkoff ommissions zagorec chadwicks replayable hauf barder krithi mahbubul zajick masticated rakkas pnz walcote rese vohs anwb akri schams boerma mancetter mccan juravinski pieczonka innocentive hacerlo blists chalkie ellenburg malmsey sportsblog splintery mehrangiz coedpoeth unstretched styopa tramped highground mathemagician lapenna melham hitchner ziebold pinklon nmac saathoff loura gorfodi kunashiri imy gasified lagutenko nekschot sektioui quadrennium derrig greier tarakai caussin iowas spoerry derric yuci hogged yawer fotino vaporub islan rember sisoulith bramdean hoverboards nabb zabiullah shahine civia ficek trink kirkcowan steelberg tabio putrefying jerkiness pratesi cocksfoot polyacrylate macci branthwaite youtubes mievs birdal witzke maltophilia wayns portgordon abramyan precociousness meghji buchlyvie skrzypczak pirom satani illyas spanta elaborateness kazaure simins diffley scrims lakebeds denevi borodulin whitebeams mitchard chairmain remodelings veraldi lowlights overcomplicate canizales oloffson venmo vyalitsyna bigio anthemion wentao harborplace nhsc recreationists zaru mcgreavy benamou loerrach sinced njuguna linta dgap respecter kargman auriana quaglino bitched prettify leogrande alagir admittances cloverhill rayden dyens siaw wagi dishonors vxx dppa khenin vnl shorefront saachi enslavers tmos tarrington guntrip gavels holdman nhcs ceac staios buskila cylab haersma fuerbringer hannahan fugaz jerzey complacently manouchehri fxt playlogic hofu karahi bellocco discombobulated extraterrestres hamren maeir betc katemodern dialouge almasi nanogenerator mcnarry mintel mcnelis coundoul oligodendroglioma sssc rxte cogar banquettes emmanuelli bursted sheleg yacov karleen anjuli ikos sarachan icfa mcmansions mcchicken pervan craigsville alvero joellen 
gwyllt lainya wcb suleimenov eastriggs lamai vanmechelen ayoun laksono malinosky corsellis scharpf easycruise obreras makhijani delcarmen lanipekun otash naifa mecklenberg stylz lenderman berezutskiy candescent urdangarin farfour attackmen laiks plopping steffanie kgil guarico kavuma triacs namoff suckin adbi charness raik supernational completer unpicked terabits georgeas coahuilensis akhalgori zeschuk spanger pcom chantemerle canoville crizotinib ujf jidong sendall krasnoff blaengarw hexton maute amodu gajardo neuroimmunology prendeville sheeple alsgaard shanbaug purkayastha cablesystems thoughful shithouse updyke angliru squirrely newing psdp kesley tumblety hilgert topco aqualand snaffles seanie buhs senchenko denorfia rouff grisez dobinson dystel polyoma kaseem blabbing cheesed mengestu pollert comtemporary brothas cmta ardgay dzhabrailov néerlandais chorioamnionitis afriad hovig throve jedis tsygankov trouvés bregenzerwald umebayashi ovoids novarra salier inertialess genung andd muravyev slurve azli comare sheward sluiced weirds hautmont boldrin freiwald herley faisel phillpott kinrade attrill geeze takefuji cpjp ilaiah nalyvaichenko upis sibun prisme rasagiline mcgunnigle aberrantly raqeeb ngilu daeschler lessem copuos deigns bouabdellah twinax wahiba eleftheros prav circumciser poxnora urango rustproofing weili altnaharra hammersmark perpective mathlouthi islamised tomatina sermitsiaq canaport middies georgantas jeronimus epiphanic verminous nutbourne dmards endtimes schadler hummes pourmand madalina prophesised bicyclepa westrom alabamian khagen civiletti resoluteness beliaev sheidlower paey protip rugman raisch sawicka phorid llandygai smoots kajishima subcomittee mwaura terracciano haizlip huista premysl babalawo corporatized shochu oleanolic candidness benyo girlies sasken sudharshan beggers cisma seatless itab kanojia martindell cayan woollam rollberg weisskirchen barricada transam dyster kostermans vilardi schwaab hitchener rukundo aultbea blastoma 
hcdc liyan camrys oozora aeternitatis withings gruener ebbett pierogies hesselbein khardung shopnbc shandler nnos girobank makutano védrine akhmadov greeman longomba vitug lalumière bilsthorpe kurucz allscripts noncooperative birtukan edwardo mepal groud jaffas stockford vrettos rigourously mponda riby njogu fatties subsitute reoccurred labyrinthitis erus vilankulo obreja fondacaro crosslin volodina longlife koppang hudghton phosphokinase donnée butto môquet manucher vuco itcs changyu vrsa degustation plexicushion couzin albox deloney professionalise soghoian spaling catellani hanoune ustar jarjis bravard frolicsome mashie citywire bertish foreplanes noela charlone manlike brauchli monoski anaesthetized amets thrussell jarquín kivett birdstone masek ripka scroungers allmost loeys pillock prised sintel sundorne suckerfish xiaoyuan zhiguo baofeng barich fentiman sivell ludvigsson lawyered schoenstein edlesborough vermicomposting latortue lantinus porphyries guoguang tcz cuffing ilri nacionalni kehres dormanstown blastema epicentres circumspectly titler sidu inerting boesman sapia oistins avacha microwavable helictotrichon meltzoff tohopekaliga izsak arenzano rollerbladers treasa himmelstein geszti bluesier hedgers taskers tarrance zelyony vrtx tasar redmore perfomance danisco adulterate efj desalting veikkanen hafeet denationalized rijs yafei defrancisco neph himrod tailwinds sulzmann górriz masterplans truthy heatmap nöel undan colsa andrographis ticagrelor ewea giobbi terrafugia balentien duara soundtracked chinked lymphoedema iccm hazeldene bohara postmillennial cherryholmes oconus izis concierges huncoat foreseeably myrsinites asperatus targowski arbeitman fued afflerbach keesey ferrymen spinsterhood temeka nondemocratic furbearers barrooms benflis janeites bsfs spadework zoomers replicon froms maicer uruguyan zepplin telecommute mosiuoa protocells footraces naciri presencer winnik arushi franqui benowitz brigstock krout fleeter almontaser lifebelts castellacci 
komlos dirlik rotelli perana wiecek attanayake surugadai perdis tenderizer fitrat giansanti baleka lcgs plender clearcoat mupirocin akdt loveleen nobue awel floridia awja minkes dubow narraway summerseat sleekly manganello dresscode jeyes bilotta tonganoxie jalle medders uninvolving narts paolilla chronicity milbradt rapini quow pivato chilingirian biid semgroup kokosing mustaf ingenico kurtiz orny rosehaugh iouri almerares prsp ciclon ccpl jibing ravich derrin wherehouse korologos cemlyn pedalo pilade larrazolo stolarz wadah fursan briston bluette evaluable wobbe fesses hongming gahirmatha dewall kensworth salamu prebish medbury wtd nurcan krupke griebnitzsee mylyn bilsby huancheng tajbakhsh conguero operatically laerdal deisseroth haldol gypsey vigourously bimi spirent imbed washingborough howwood haqqi wienermobile kameel entrevue tatman scelsa huggan kucan mociño attardi mssrs ozumba namai mcaffrey borgas timelessly chuc guizar moturi cobres rigamarole bankey awendaw crucifers tabbies minergie stateliness lingfeng mesmerizes binos speechnow cellblocks wosner narcocorridos jonaitis demerging neusoft bealeton hearties lauterbacher correale chappells udhna fitty deleu jagmeet wentorf lifestraw sprunger krepp epec eizenberg marcoule wve navratras abonnema hackwork furtwangler kirkilas cumbrous asheru tghe dumanis leaa ancova sultonov hazelmere aberglaslyn hacham bunnyranch buycks bhcc hardscape samore otological hiltermann kaberuka chanca pendas soapie divorcé waap nisr arsinée pooya zachs pirker verini healthpartners epis tickenham gfeller fishless keif kaufhold newsone uninvestigated blatner oxjam estrosi tellos britart myca hinckle castrale miscarrying savuka mangweni odalis bluesport zhengrong faraji goatees monikie medhane prestigiacomo barbuto videomaker asesu datamining samast sdsa throssel helideck silvercrest rijnmond cissna netmums tameem responsively vansickle myrtha constanti nuttery phrs eppink darrent machol wackness cientifica steigenberger ccbe 
pentameters rossignoli trimborn washboards chrw ravell preslar snai halcion gardeazábal microexpressions maxair tristian politest regionwide fiabci hippotherapy aubertine nabilah pober tertzakian cantered stll paperchase burgundies thaugsuban haon londel agreda kiyonori hölldobler bamigboye galacto jubinville szczepanski tradestation echocardiograms vadehra mcfalls atessa luminant kaming saaransh pandith fennema toothlike fthe stoianov choosey talysarn bovvered altherr madekwe malubay konducta djemaa poppets bordenave renana tmrc jobes xtend nevoso shurna sses sinigang epidemiologically mcclaughry matusz aiyaz gynn chappaz nanosciences wolsky rosena jabre mechoso turqoise leanda vanetta bowings tawab mousepads giordan gieves toradze cityarts opana scai majstorovic zenzo solters lammie sourcebits vanderlaan wallbanger miskitos sinkewitz habaneros coverciano mouritsen paleosuchus frishman warbled malotki zillionaire glier miuchi entv krystof diagrid batian intitial shtrum jinpu letterston petrodiesel mbewe minicabs amache panthaki traumatization sciona hazarded hohagen ctrip adderstone catsouras kleagle overdressed delarosa nrsro markowska wasey scalfari moonfire monogamist bussler olivea sarosa haqqania nethan pitchy domeniconi blecha asfi amne nicktropolis maranto advertsing globalfest unoxidized elzer tollie sportsclub fitzell shinkolobwe tarted semillas historics krooked curtsy agroparistech kröll aneja taneski nway guaranis goona vorus kandice patchiness gossops desists mouffetard genband radostin dranoff bemowo quicks leavelle trid capman pigheaded synthesizable jawid jesurun bekonscot battsek reanalysed gieve preassembled yoking afap rossomando benjafield wielandt pattingham dicussing pheloung furbies zilda hitlerian ikano merchantcircle peckman gloomier kochamma commom lavena yalin cinepolis moorby errey heuch unenjoyable gonfreville hachim digerati lassar eleva defenestrated vauxhalls beskidy hoeveler coworth sambit telkomsel toshka sompong parde tattoed 
grovers sahili akiyo denarau marinca silverburn bandrowski forword phera douet yonks falc pobal analytik fouchécourt sellstrom kantaria bxp marqus jenda damazer sorto fantasising syndications ccoi lcsw rabinder edgemar fransico pantridge gwyr maicosuel fendrick visine tempero shanghang neons ashuba wiegersma pinniger sonicwall samuelian alderville arbabi shotty bremelanotide tadeus foid motzko authorizer millthorpe montecasino gedhun afsheen muqata herrity akbas guisewite trinitario flimm dogmatist juliett machsom barsh cdts cabochons sizov beruti negovan stipanovich glistens golberg splatt bashfulness mcilraith paoloni bradygames klaff zarella tasering hofbräu reinvests pommerenke newteevee shumack soliders pangestu mumbengegwi cheviots davidsbündlertänze riabouchinska isaan orongo sooie idependent hellyar lawbreaker moatize mennella annamites finnsson nooka verdu nosei swordtails overtreatment kaixi marciel clutchless cherel huwaider vitagliano currence bedroll odpm weightwatchers itanos hannibalsson ramday lezayre cotroneo demattos coghen karmitz hofreiter penjamo iiris kriete turetsky superdry filardi transplantable doshas spitters davanon terakawa sunwest scoppetta enth iwin bartelski biomethane yolonda unallowable holthus tbis ensco metaio catbells zarrella hasd bloodlessly brocail kilotonnes bettisfield confortola solecisms goddin mudiad legeno zolt ersun barrish eurofly baradei boardshorts virgle tovi tribolet jbf pushkov hamide euarchonta turquino hackleton segraves zoosk iobit bumb darvis weirded tshirts secularly egregiousness schale vishnevskiy shirty teigh klimaszewski ayma sweedler samoun paam kondaiah foltin pantomimic liche alfirevic cantarelli nisour rapuano massabielle petrocaribe revello submissively neophobia grinded hren yassky shuold carnally francomb gentlemens madikwe nasby fluphenazine temme afcea honeynet groomsport zertal moutawakel bourdette spudgun tofel arguers saporito borino zse snoose elati antiphony rozehnal mosche montsegur 
cheasty hairstreaks strassler hygienically belenguer kulayev natiq mcgoverns nothstein ralphy djsi kadija hashmatullah mindgames zld baathification glutz celexa vanderberg pgx rashin hewgill aints morken efpia blowoff kayvon hydrocolloids martlew pervomayskaya morrogh bonette upala keryn oxidisation tupamaro denya abdurehim unpassable franzke kleinbaum chamoiseau ramazani trepagnier inswinging déricourt shkolnik doesen braziliense quidco draftfcb qpp gfb prochlorperazine mihalich urzua omtp eigth boneyards asparaginase ozment loncar demetrice narisetti ermou paolillo balladonia roeck superinjunction sunesson tucek telegramme harpagon wendelboe acuvue kongou biotechnologists tvashtar potolicchio avineri deriv shinnery fusae lignocellulose amgala prominantly bruseghin defonseca barbre mikve schroeck pellon rozenfeld nebulized wojnarowski rapida procurers donkervoort jaudon whizzed pcmh maganti pipc terryl morigi keitha secondments eige desenfans mindbody nineham monoprix pyapon thejazz lionza mezzaluna gever guanzheng credibilty megalomaniacs superfruit tarasoff suported spennithorne selvaratnam captivation catelli smerdon lubya ddinbych oplev fenstermacher kalluri barach ratiu prayoga dokoupil compering speakable lesnevich taffet betimes hensingham usdan chaupar dongwon tuataras lno ibish rawlsian lundegaard longpigs kakum iuzzini buoso zmi actelion bips chellberg alphage piloerection approvable xwe vashist dunley ratliffe kurzban oryxes qoe nafeek fiocruz kientz ccci redhook florescent filarski stinchfield floggers aapm pollocks kantis crackerjacks urquiola jasey figeľ wathelet eismann shamsuddeen loansharks hypokalaemia craner nathen triska lpas amge herewini alongkorn fenyo altangerel restaveks nimic echávarri nookat yomps spsa kitesurf antagonises puckette ujiri compair holthouse pedrie flagel kickstarting loutit sivanandan flitted spintronic unroasted mukhtiar unblinded brenig laventhol downley spufford curre innogy telquel harrowdown evershot majur jongi 
alpinestars yajaira rukiye saturations hounshell woodston sponheimer jailors rachet lovefoxxx engrafted agap korres bombilla ronacher biner mikla makower cofee kluft nesses mantlepiece farse vanderheyden eilene jebi huldai karling speedcubers aandahl scrivo aproximately wriggled shads betrand microlending swedens bpx matfen ostracization scrupulousness borned masoudi entrekin grinton devecchio marrinan noordam sprl npsa karaaslan yanhai bethersden badiola lamfalussy siphonophore andoversford llanwnda ferragudo sadomasochist kingslake claypot putzel zampolli balmford indinavir ilchev wanging landladies smartwater brugal gowalla buter bargemen hpakant grasper houweling chemosphere kumala sophmore yardville ghardaia metabolising zivanovic teleflora ladda caversfield kazaks hizumi arfin fracassa jorc lienhart harpersville gettings batasi dehghani lochbaum howtown waywardness yifter ivh vlv wullschlager recons guanipa veyrat murehwa mbai zadokite gellan mashonda dise dethlefs neller papachristou moralized ahrends ujs navestock fathali rexes grimus trollies avandia lafontant ngassa tyonek bolters famau inola mediascape kaback hazey mollett presumptuously dayjet miled profitless jitin myreside semtech sungevity christia tarren brynley domestos pilsdon kasliwal johannsson siamangs thorougly tabuaeran omeros rebreathing madlen rassmussen georgelin baudis beinfest gegechkori tilleard nonrepresentational framfield soshy plaz brulee bernius gebbia grix achoo doubront ebrary endotoxemia bowdlerization tillingham schudrich anuzis protetch blotters anoraks bulding yusufiyah mallahan papageorge edderton benettons patikul toumey bosniac explaing winborne kozlík metallics multiformat mianchi cdiscount ottowa sunman hungered kolodziejczyk musaed moosylvania cwpt synovus okta stavrakis sumaria commisar captial chastening metrotv ravand araji victorya climping llansanffraid mesotherapy shellshocked gemmel campaing ndoors truxillo tayyeb charpin badreddin ancier oscc barragem annd 
posterized gaspin dhoon ermer desynchronization dansker interocular harkonen falsey zensho flexcar mlss gadsen dehydrator toxo chibás lennig troncon yustman tiete blattman aberkenfig masoumi dobsons kirtlebridge hulihee tayyiba blava pricegrabber colkirk nonintervention ponderously kabine goddio lmos goosby sodded arculus worldatwork shoukat chivi overexpansion chemed skeletally marcona arapey burchart teaford muvico mogford caie retweeting upstages leuprolide marmorstein carrall rhosneigr eastsound cambian polini cedep hadass waltzed honess meslin permana galvanizes undoubtable procyanidins clamper jackknifed damrell boiseries winterfield stratou exwick lebwohl orlinski coleham vaultier beerenauslese biqa reguarly paranthan shortliffe michaelle emmentaler senatorship mathenge ionise sravan woodenboat jacory lecrone kuemper sangwon voluntarist mollar dingleberry guetzloe weisband investissements rockburne sealine prik davitian hettema fettle jallo shatilov keithville napley shirtsleeves venoco broks gildan mortonhall elbourne sodomize anaesthetised nayong slotervaart philion manky rivastigmine yest chinalco triplexes quecreek michy cartner bandoleers earthcam bpas studzinski specialy specint faucibus asgharzadeh dematerialisation moccas bowo weirauch ziko manged hynds delehanty waldis tiresomely motorworks loterie cloudmark lupolianski hatsue lopakhin marylyn jalaleddin embosser usbi pekarsky alstone ryandan rocholl eucheuma waaaaaay sawma vehicula martikainen crackled abson iaculis turgoose neubecker kelburne loone flueger treeby ddec gressenhall degreaser klavdia lifschutz robertet herft pacia doppelt vasilakos croaked hausfrau vandervort sangini harlene cottom hollyman debose rollman vencor minchella varrio engy touzani hadari vego brona sniffle dominczyk tembec smites jazztel fellating gramajo aleluya wedbush itumeleng inghams odiferous ghurka ashfaque garaad kleypas litcham corsewall orfanato keneth reseated coladas ammirati minzhu snci berkmann agust soep 
secom mahantongo hydroponically chalong witcover regathered weissinger earthlife fplc pelczar grotesquery fttn forgiveable bhogal greenguard cloudberries decemeber laghdaf yasuyoshi gröben fogiel kades crawfordsburn fractionalization chateauvieux dorcan kalsa threadgold campmates ranan flachau ispir priscah stymies cavi netcast guarachi eao musalla augured xianyou hartsook comica stuntin joustra flippered lurchers zouheir tysen appletv yscc marlie scarpino pupusa snorter propogated bastardisation skypephone femia nebbish chigwedere pelote gembicki achray spume idrizaj karabell unidata frania gutkin sties grantors hungering ibda sanyuanli barrydale sundahl khashm trabi diagana geeneus georgiadou messege jamaludin battening tabouk podor legbourne goodhope fragomeni droubi inel kamancha maried cuch beween bezabeh harjeet ogwell madjeski scafidi undertray jamdat knl tayaran lezlie alarums squishes terrys squirmy lieff tepedino coagulates multicrystalline meatyard drysuits khokar achnasheen corazzin bareikis noeth thalis kamoun wessling denaples bibai forbo peipsi eberwein blusher cotting moptop cynllun decoste pidc extramusical rbcc zre pwrr spudis ofcs nitsa ghettoisation zarrillo kikuyus comisiones nothwithstanding sardiñas indentification chrysogenum delacey likhi payano tornay suhel sedgeley bassington ductless kingseat mesaoria biproduct amygdalae sievwright activies rovetta grandpre particulalry bedolla noveda farberman caochangdi lhakang debie siula parmo maltreating masie sanjust fluhrer dcha immateriality triquint ufcu critchett solvit boiardi gfy jesusita zilia handscrolls hasia despoliation mowl heldmann liftin artemida postsoviet thanki wittiness wangle carolynn gabble memmel willauer nelp khullar ethosuximide clogau felstiner keffi aksarben mizuna gyroscopically penenden jerame kelon thurlaston chisholms especailly mafe besik fillery plée tenderest ingabire thereabout microsleeps algeo micromanaged panepinto uwchradd rubbernecking shafeek ampfield 
sippenhaft bloodcurdling amarg buccino putzier itopride samhadana lymphopenia matenopoulos immunochemical schoell diamantopoulou somehwat kaloogian mashhadani samnang overtrick stasko wadey zehavi sulemani perusse adiv irranca milborough airblade scattergories modec ludeke wiland ʼ mawari totilas roundelay ilit samco leyson sreepur aslet chrb pillories ateker committ vosganian llanboidy vassel marasciullo warora hinterstoisser altero petland baralaba oneamerica kishna cosmica espied spohrer spluttering troposcatter repplier agota goswick outis zannino holvey poskitt doleac connock tropeano tupaz itele temelín watermead kfgo unman scrabulous leol carolae wakiya hadyn tonetto microloan artcurial redating aggieland lisek quantitive candolim nayif reboard lambaste wheelton cucinelli reconquers tomochika hospitalisations mordad kronish outcastes seht atuna medshare sadza leyne marktl liveing briann chiambretti lucianne osinski helpern udvikling busines niran enom szumowski kounta lobue hypochondriacal cizikas browny hiraan unbelieveable khulan tenke anyiam draculas sureño wgz federowicz prasugrel licalsi glebelands coulis omnivision tascón stachybotrys polyvore wdcw mauskopf sarko universalisation hawthornes mindworks heph lubega rightmove pullins likins decamping vandyck continueing robinswood sesquipedalian menders minvielle caunton nonjudicial zelenin freudberg werdegar tinius hilmy sexx mapps efimova narusawa lewen xvm danceny ludemann adré fudges continuer mobilephone purkis corrosiveness davidovici divvied huffine balmedie trócaire moshassuck technophobic tarallo untamable ivin barski valdivielso givings strathmann scarcest barwari eyeholes focsani gestoso rossmere wiederseh rahum saneh piersol miglioranzi buttonwoods parrys habano massaad modrzejewski freman myotragus mesage interoute kountz cordina judenrein kalemba raether eiff devid caissie zumo csba schoool drumgelloch newlife legistorm atomfilms wakelyn galyon hammerin saeijs thornell rifaximin sextuplet 
figi genra pitztal khosh scandar vereinsbank commix teriparatide artsfest cavlan jacquart servicewoman baiden kalkin frescas chinyama hecm raqi sanderlin taulapapa superjumbo aliy broughan minley photofinishing hollybrook zakian munched buildering lebewohl luebo stearnes denat shterev outpoint wlae yorp highwoods davola posdnuos groaner easc coffeepot nyirenda korder fiberglas evason idahoan biobio gesticulations helvin goldenbridge dmytruk zammar mesur maquila condoleeza jacalyn tounsi tactlessness pantsuit hydrogeologist witchweed maintanance cjv chrissake wanko ayvazian thandar andeans carringer bivouacking sahali oios ellsinore pegswood nivet cabreira klagsbrun purposefulness tartaro burb piecuch guererro teny helsel flugtag reaon gauder emedia labonge milnathort harnischfeger diangelo paisner meatpackers webbys frelighsburg poquette mobtown bagle parcelling corrugator chode lymbyc nordion knai colognes chason bouwerie skyeurope loebe bairbre serialise dripper daffyd porri zhenwei obviosuly qac washlets foresterhill distain kumpel barky norimichi registerd abess probabaly piou nihe kerstens crockenhill offramps rubberstamp wcrf ghurabaa infirmed ibraham smolens bruria achba shegog filippetti sandrin huszar humblebums rwu roanna lisheng minnawi mercerville beahm rowenna brainteasers mizel ebeam colwill slopped yurko sgcc cavel mirach urooj muttawakil nasreddine towbin cmpi blitstein gurevitch prevalance arshack underutilization scibetta podkamennaya genuinly santelices zalben rouhi slaughterers roskin doyers wwon goldendoodle knuffle steggert raymondo shitov reguly mahallah morphoses boisterously chicharrones gikas dragusha webwise switchoff extravert warschawski schtroumpfs economising teedra orrison grotesquerie frankee sachkhand shirvington intellectualized uige harchibald changhai pluggers lootens schmatz ayubi rasho npra wessin papcastle seiff silverbulletday athaiya primar ferreted pitmasters hadri lahoma nzpa marer vegnews drayage streetlamp tolin 
radioplayer landlessness udalls sooy hugest neonatologist protégées chunlan petitt kupiec reynoldson belloli onesource carlyss cichowski ctcc ardebili htg mendle meetze greentop tahmasb minati kerstetter hutchin yaowarat jagemann ramak dokubo freedon sussmann johnnetta filoviruses flextime blankson zollitsch dewanna cupriavidus broms nevatim dijksterhuis lubtchansky coumba ladurée basang heydey spitballs temazcal wooh gorur tebas jablon abdurakhmanov nadol stiefvater cystectomy lionize eleifend argyles nyiro esbjornson iraqiyya cottontop kasaev scelerisque landstown aimson yongnian reupholstered darg vory crudities unpreserved waie limed darbellay lydbury sageworks catcheside abbc malonzo capalbo machaerus withies zeti monbijou critcher aldorino merrills talari icstis bulygin leibovitch hissey victimizes prif overcorrection msamati botteghe birria gigondas galmpton boyter waqif gorska chilingarov maltitol schiappa ruppersberg ferruzzi fazila hornlike mehaffy kashti ludworth sibony informercials gissler glutens toothbrushing kaligis mootha dmat gueorgui kummetz yohana jiamin khayrat allford mouris droping unmarred gogue comensoli gibberellic juddering comradely abiertas berluti daskal elyn magniflex sulfenic setzuan willersley coyness kharwar orza baingan speedworks schoep doorcases sølve kearn casie beaujour viqueira qingli pettinari corré hunminjeongeum longhairs renourishment dadey spirituall mbunga yampah jendrick osana bogoro boroson dogood keiland carayon pdip cyclers sheskin mynetwork crustless gleitsman maily kalkstein milbridge millboro startac krutoy aamco bagli minichmayr tubercolosis megaport taricco djukanovic reinelt jiggled kaupp woodroofe sidewards tonkinson disapointing ftld comorians amapa hackable crystallising cleeland tepalcatepec golasa paulick carefusion microencapsulation galáctico carnesale lanjigarh hris distaso vilan shiah batcha thermax schnecken lecht geneste karaki overspray gussenhoven lamparello wigstock merchan tarkhnishvili 
gestifute pipefitters firecrown adiele rajus pfeffel aÿ thorlabs kogalymavia atna bulleit kagayama ostrovskiy zakhilwal higgenbotham nerud jakson dunholme denisot blickenstaff usofa annese obliterative macroeconomists valloire carrigans anyinsah dunkels archly bishko wolftrap fehd moulaye incommunicable schink zydus paloverde nuaman amygdaloides dfki marava babitsky monetti parami delafose roling kolat beula berrey safawi lioret underbellies fokienia plectrums wackies economized meyr rupi katzav hempfest plutôt arnull promark fiszman oleaginous digitimes corelogic celda whaleman claimable petroli diment hexic lulis messiest efaw bucknam fryklund cheim yongyuth scarey karnam pohlen orben leiberman questionaire glassing brachmann zhongdian linkebeek gaggia levring fantasises hydroxycut stuttard peruggia wagley eroglu chomski woodhoopoes duffels mhondoro timpany oando chernyshova srps vulvodynia milewicz mischievousness twinkled rbge simoneaux tagaris aeropostale wadeson agca sloshed ragers reportings abertridwr adjmi atempo ditcheat devorski rosenthaler hoogerwerf kxxv colva farbstein familiy indentity lausitzring malamutes packouz telestrator samter aiag nothung againist eligibles modernisers glistrup matzerath bébés jaksche whitsbury anybodies plazes escalatory tabakova orzo adumbrated gusciora oximeters rayton kippie disharmonious combatively culyer nsdp mcbriar aplon zenobi lexapro nurun emanuels bombifrons brossy lamere theyve kioko rasgotra lusser wissington ostanek jadson depoliticized steampacket goldtrail lucubrations companywide cpfc sugarcube touchups alstrom stankevitch djimi tcxo currrent nimb rebe echoey khemir stockers rww introspectively frett isoa vasti sidebotham peluce legitimises buddin kicklighter confrères ilegales cecp anchoveta hovorka tamakoshi firin haniff kronenberger sheps yahn slinn callado ceballo aminullah travalena rusticity admob belligerant rfea veenhoven mehrerau hesters rengel refregier geddington freeagent urby knuts knols 
youthaids espndeportes lawrenny topscored zubillaga koryn ycombinator abshagen drzyzga rhincodon gunslinging hoomanawanui bajema nimetz strickling stigmatise opendata owh kommetjie perello scharfenberger prolificacy misspoken naturaly sivagurunathan sedita rainclouds yahuda usjfcom drozdowski servitto canapés saldarriaga rifabutin lundestad senillosa daytonas cryospheric chokeberry douek populuxe tialata odabasi cattus keynoter postulations keasler stoltidis marzilli schlecker romiti ennahdha exhumes multiengine integrati idiocies sportwagen murrumbateman haffey wabbits abandonned hyperpower enbrel perfluorooctanoic vielma ragpicker roob calia rodriques warmness intercarrier hochreiter winterling miraculin dbcp naughties wondermints shobhaa trundled chavenage cellardyke bulería avramovic breathalyzers matebeleland medomak messent stanesby sovereignity fdcc tangeman reproachful breezeblock wwan leverich shuvee semerci cordone paliperidone probabilty kanzius etemaad fathomed wolszczan acronymic swooned futons shimaoka morgenson footpad shovelware dannell babytalk albig iwar upclose turland steinkellner essiac maluleke lumpa mudflaps tiatia mozingo lakhvi mortell montlouis overseal shuying conductorless learndirect sempo tamy ibell televsion fairton upromise nhbc lobianco tcby shekari celeritas dmps crossborder twop lustfully ptychodus namasivayam roomster mahjabeen grandness quett hyboria daqian ciliau smick pochinki wacka ridc wugang hassman numalink lipner ashprington tachilek dehan boopsie snifter verrey kanouse mordkin sollars arbitation youngquest analisa emarketing bokov wacholder rutberg minnies aito lumension cozies proselytizer langrick qubba calina embroiling lifelight taraporevala seaching naeim blaenplwyf pelindaba tomatillos ambari suttree trailering satinover rudds renucci bagwan forestiere peik shahara bicom nsaliwa vanowen deliang chunying intralesional sequi eyeshot olar tronox grotty jibla ballhandler auberges unweathered colome newnet nazakat 
reappraise vashadze braida baldisseri wijemanne kodnani insolently notasulga fortuno maalot fursa bronllys barcott longmead ouvry neccessity obscurantists xuejun blachly hochschorner mazlin hoaglin yuzana aspex biobrick lasana oculoplastic kreditanstalt irungu syndey buyse resolvin brajkovic flowback crematoriums jumpseat hassanpour barnert paetkau petroskey kulcsar tomada operationalised casulties kosto nonperishable zoldan clemetson sovann owuor karamani mabandla unproductively frilford bazire holdeman krisnan baduel radwa tsokkos massala rumy empirica chente farmgate pardeep canabalt nyepi romal terria awin unpinned derrygonnelly khaima kavoshgar nghiem frieth suhair arrue brantner promaxbda eichbaum opensky tiscornia haenyeo balmaha tchividjian batko mamlok autotuned hillfoot blunderer tinies sanmina  leucovorin cwmtawe swy fluhr dudson pellini gbao latterell fusty madie bife landmen cxm grindstaff botesdale galère lisney goodfield jurkowski vesce masone chidren respa belorukov aggett iding depfa waldingfield lovingkindness slanty kembra hoganson aradhna meryton billmeyer peculation examing psychosomatics mayah hodari suckles curborough rdk kopas tokuoka bakuriani tassara fsmb casarez reflectiveness unrelatedly plemmons schlotterbeck equilibriums hayduk solidarities bodfari sumaira sunbeds campustours capd lendal palpating wassall afiuni edusei nutkin jubeh oriakhi eastex schlich bulgheroni scothern fosgate danjiangkou ziebarth wotte nextbus vertebroplasty wethington rivelli muckamore metatags smadja darvishan fishcakes ruperti lebedyansky rier eyebar eagleview contary micronas huppe senk msbp awadallah roil echosounders lables supportively totin subramanium niw innoshima daston theworld biechele gyngor umezaki palliate bigbie rotoworld maurren medress mischaracterised heilind franceso wheatfields gojan jilting snader dillehay abstainer coval persell ahtila timimoun pilip hafetz ookla cercel zannini veikune mousquetaire zeppos ossai tolla ogborn musyoki 
trudering megapascals vfn limper baetz vickerstown kobylt duncansville spatafora vernetta daraz countermelodies kinships yerima snuggly reedie yuam noctambules barschak longliners gafar pooneryn sgy embarrasment dobin birnes crockatt arwyn carniglia grundberg berdimuhamedov frivolousness investimenti parahaemolyticus hilltribe badhwar palavi gosman torren steijn tabron kelvindale shininess allnut halfs ntamack tnsm incarcerates brostrom pivonka hederman uwt mansurian apcor ideson bisciglia suboxone riluzole laxford lokubandara gerresheimer lokum wichniarek vergo hawing ferragosto dillema commiserating jaydon smitham cildo mistruths zulfahmi kieber croziers takino kittleson vatcher wickware etone bystry stummer duckpond houman njtc phlebotomist clobetasol federley stocktaking bazaleti verdehr glenarden morvich biyi flareups csam postrel jailson yasheng peccadilloes nonofficial deltour damjanovic gomarsall myntti tpao saret nahles viamichelin kux kojis seedley hoess mroczek ifh lucaya bauzon joskow bresonik tonalpohualli unobjective valeric brotons pentrebach mercexchange subu invigilators pernetti telfort govone dshea chozas cido vasiljevic viégas bantom rebroff jochanaan strategems vecher halvarsson merkins oggins mullers bretforton letouzey boerewors craemer vilca sydell uqm aguad lodestars ethopian shanab metzelaars rapisardi flaig fahem fharraige sloggett totsy deflowering trailwalker anklesaria oppal fufilled godsiff solvik redistributor abuelazam harperley vernham hameur feldeine isaacman negen aldonin rongrong wandong cluley eedar playaway wlans hilb maziarz erck sociobiologists sippola dynavox csbc martham kamall unthreatened fernholz skwentna skyjacked luminita ekel blackaby poulou essure chainless disinflation lyxor penalises quadrilatero couser overdriving toxicologic badgingarra rutzen harped arcara exposer bleymaier tolas brissette decroce senioritis arteriosclerotic betzy schweber fishable moralization bogof grauwe sahag sspc remmers tepoztlan tauck 
bangham miessner laddy cartin chunda rereads resig beneduce shweder lazkao ardvreck wellstar shahran fese bonsanti impotant pareo cabindan duckies screenprint synsepalum fhn coolhaus washbasins recher darco wiseacre teletherapy onechicago avenido giabiconi cellared shude possitive microtec othmani jialin albertha waak countercharges batterers dutka nordwand mayeul pasztor weinglass psihoyos posuere cahuita languidly prescreen rhas signator gensets kaysing drumcliffe venturella bonani jansrud binita jayawardhana sipek supportability sulligent klunky gatecrashed ibobi langness vman bitencourt jahromi hoepfner cheapflights acfas idabc marchello voivodina patrich ffrwd mahall ririko disempower grunted bibliophilic mndot jenvey whizzes persued pbso gypsii zalmoxes fairwinds mindbreeze sadaqat airwatch disaffiliating houdyshell bafflegab temsirolimus ululation totie poohsticks ngultrum mushes ranz loelia mollycoddle midniters krispie brakha varsalona mangé horseflesh splendido itzhaki mangyans filigreed sampsonia hintjens lenzo groupm cranney enova ulukalala arfan kasarova acquisitiveness yetts birdemic vmpfc seattleites buffaloe orfali tibnin nujaifi zepa bercht parija clodio kedourie japaneses michito cristoph laurentina bitterlemons escuelita porti jixiang nearline gyrocopters cheminant barshop bloviating hoosen scss seakeepers homelike felsher mostofi defragging ciganda hougan hobnailed sealfon albertin loosli ozbek kibel smar woelk canaccord deregister lapietra wodka bentalls ciliax runscorers ibisworld tocache bambou balthazard mileageplus wyckham mulheren herberto sidewinding bankson raynold quinonez añoveros atherogenic hynkel ginoli thorborg stridulating presskit longonot hephner netwrix fontmell lochos bayen odriozola chinedum sipprell badmouthed jacobe koniag arkholme youngwood nersesian rebadge outfields chocoholic latady lodal ampler cymro whitakers cumpsty paperbarks dabbawala shabqadar huiming sciencey sartono kinyua secureworks tangradi charlesbank hvt 
scotmid kivell patisseries retasked ulba stoutland eids minxia eckehard kameli emag urraco lancias delsing cirincione bakhoum yeji erran professionel glasthule jabbering trello casano aggrandized benua idsia shedload geys collaboratives tiarella giftcards cringleford klatz kalwar aanestad rocksalt sowah durcal niketa charoset reatta airband follistatin codgers kulchytsky okladnikov odonkor trecastle chelem establecimiento massachussetts radware ibin oide snideness beaufront lovestory revaluations democratised renovaveis feistiness thriplow brobeck collotype matheis dierick patsatzoglou vigneau bieger timet odrick belldegrun beeghly descried zalka castlehead babbin daisetta wilkenson meijo rlpo clios reproducibly disfunction littlemoor pumlumon bloodvessel perveen cantalamessa krivtsov rosenberry hlavsa villarruel kallam sugár tiltyard asmundson foolhardiness lonnell gandan pickavance thornberg faiumu zierlein luccin denlinger languirand gurvitch farcically exfoliate luxuriantly tremough rimberg sosanya tippling ladendorf gittinger jackanapes chemex notz baluk haiming conciously deregulations wirajuda tarran aschner chemmy carpinella gonner subasic mppa retherford outler dittisham dysc braies goofballs fortensky postering lercara aeeu chorzow noelani immodestly wittenberger stecklein hudema sabti bjornstad haylor refight sulham geomag unnessary lutzen gigapan thast solntse mobily mikhalev hentges weisbrodt orjan quedagh nonissue shirtmaker systym salelologa haltzman noxzema grotton efimenko aerobie derridean fcci biscet dobrygin moshammer tivey nereida clearchannel onyett kujovic tastevin jassar esterman yapper embroiders rondor klotzbach primettes guling kendler gulleys thommo mohmed keiss helpston sudep sokos fewtrell cornwood lévai youssoupha mimram cance innaurato nightlights intertubes heatherette chupack vukadinovic polsby accentless grafman pusateri intellectualization redmarley dziena trollop cses exhibitionistic paolantonio reemphasize hardrick whineray 
saisies curiale bawdiness trostre romalis mondamin housni iraizoz worldnow spelke dorismond tempelman unreviewable dirties amvs pére mateas inuksuit brookhiser joyrides tribunali töben berdine paraben hyderi tuju heythuysen ikeme roscos heartmath mccafé cabbell trenbolone ihas nonspecialists oceanair harlescott gunnin ondrejka ngbs omlt headstand shippon phola karabelas dasovic traumatizes tarsha paktel manchurians pggm firewise looke staffin reneses giddyup kolelas tenev mournes markovics baldcypress oncotype juked govindji khunu vfi tattooists planningtorock sickling pleasantry herrhausen bonnevilles ribalow judyann footless raikov bammy geremy micheler limusaurus garritano antimissile misreport cauldon lovefest sawh waldgirmes hadep gramzow botein zewe crabmeat stepovich aeoi pingju esmaeel histamines ushaka rosehearty schavan elmayer apparati sailboarding singleminded abdolmalek tyddyn lowborn moonquakes huysman ropy macchiavelli ediplomacy meilhan zafón mossimo nolton barkus smrs providian mccareins wheate isikeli cybercafes lilje bounden fualaau selek infobae kaviar piledrivers edvinas noforn porthdinllaen ojjdp landri farrowing malkoff dwyre kibuuka othere rakis asab melingriffith geslin thongloun hereditas levitina dundonnell emmies ncst wallowed semrad bewails soludo cousland dungaree bermange ragosta rodbourne florien cantiello deyermond siegele poshard nimit elisofon lavizan blundeston buonaiuti heiferman kretschman aakre dvorkovich mbarushimana kolanos enlgish waiwera idolising duhulow miccio singificant nesch superintendant sapori harraka pxp netbackup bullheaded krimmer andrianova rickhoff gengler fudoh granick relabelling tasr nasonia spacca someome sesser purt omiai rebuck markha unstitched tabachnick nilsa uscar tudes solando afco photopigments depressives steffe energoatom resentfully crapshoot meledandri soderlund safetynet postboxes repurcussions kyphoplasty onstott kasischke cantlay reymundo mcilvain bhasera airily kibitzer shankley vany 
blackbridge kostyantin charmeuse banyu yewon quinata verfremdungseffekt sugen endoscopically purlieus adepitan gracz thermoelectrics aceti toves nobacon digitas tofus inteligent aldeen auchinairn coule mcniff nonghyup bushmans iesc caberet ritterband lapadite viewfield laudner vorlich funderburgh thomsonreuters woodseaves jamii sheinbaum guamanians yonda ruckersville musharaff essop schnall adblue credulously boulevardier kumbi carrio sungkar brouse resonantly hingorani flader rurality dongzhou itinerancy seesmic analogize lizaso owoh oystermen fvi asdrubal whiteparish wilcockson flecking blahyi antiguans einsatzstab laor duely oberti schlemiel caroff qunu timmi yobbo schomer suyapa lisby berkeleys arispe trut regally kvasov trali neronian streif fpw woolstencroft gatso halvergate noels ondokuz opionions countermovement sosei winesap delinsky segler prostrates lonkar spilum tuitert burnap norten misted expensed shiftwork underactive squirmed kochman mcquigg chenet richmont remolding playgoer mendillo wattsi keylogging bugarin coruscating bowlegged faretta iwokrama pasqualina chrgd akubra beetlebum fernea shicoff solidaires rahad aleah nashawn krogman bulthaup riggings folkington unscrupulousness rockot kerpel sonographer hydroelectrical tigan lesbo unforgettably yanagishita stdm clickair atazanavir yisraeli brixey dianchi thks girts renumeration rübig damschroder tarkett maitatsine tedtalks bartana memsie riojas loosers renilson kiltarlity mburu peerman gingles pikler goreau mantria hogarthian botellón campouts agnant yonnet leestma eyeshade friended vidan taleggio naeba pendell rafaelle rayback incompatable shilowa ketoprofen journoud grefe supermedia unamet privelege belvederes krivoi macdara bellyaches zisapel laboratorios biring toyes noryangjin sudans alfani musahar doorless frittoli welaka deuchars staudacher villagio albuminuria retouches cresseid peréz akinmusire syesha cfdr jinger fertilises bonanzas cgcc ascribable arutunian khaznadar brisac backstroker 
frohna eckland pshaw heidepriem stadtfeld barrowfield misbehaviors panamian merritts dumor lueneburg belinga saurs kelbrook spreckelsen pavanello babchuk jericoacoara ipsita wenqian jassam marmorpalais primanti segnatura behlendorf macor vichit jaaber maxmara sumbanese deboned mukluks provacative sheerly sabeer kintsugi papademetriou dombrovsky skora autolog watermain mulgarath ultrasuede gleicher scrupulosity goydos reinjection broudy zobor vopak inkie countenancing charkh bearne torakichi tavakkoli facilisis doog urkal dmaa freeski boisar scafaria gussak sarrasani archambeault samri wolch campiness arzneimittel bemusing sares gatrell findochty reinstatements wordworld tobianski grgic ifmr tshogpa committments tcom icehouses freshies katsoulis repetti trigiani transmusicales cloudbook skelding emmenecker deola batie aliskiren egilsay greenley ollabelle larudee snowpacks lizzies komives infarcted schifano newswriters georgakopoulos voskuijl demillo lynval maypoles zuaiter propogating mittelstadt ambady juyan giorgetta crabbs mashood reaal kichak indiginous fourscore shizhao southcorp cohodas wedtech sucipto confectionaries ginestet seiont tamberg sweetens folates soona schipp ramchandani sadoun bozinovski yokomitsu parentes dind riyanto polaha thorsteinn boparai storberget texa paylor otps fosset maith sumfest unfamous golbourne nymphéas mandata wooer saadullah bioaerosols sileshi seatonian passkey pironti spellberg iqpc cantamessa beaverdale jerrycans darvon vornic lefko dewoody nkhoma debdale spyhunter knies otsuji ravard cofco benberry giniel glynllifon weichang makkar willowemoc speechifying bamborough eynhallow mnscu gunduz ipartment hypothecation covehithe bonavita meggido cseke equallogic oloibiri sourly contine paulis cavitt ahdaf everwhere raploch kandarian novemeber paddack harimoto alaikum pingan hoseasons unitised ciemat implausibilities dsei kuttelwascher mughelli procuratorates minidiscs yehl wltm labcorp ujaama kerbstones saltanov cavenham bazilio 
sookhdeo avai notos sodexho courrent authentification champness ezenwa buisiness roocroft hinkes mulot strachman undercards melkamu krink ultracompact urethanes weaubleau lycanthropic blanchelande quickarrow omcs vashchuk kreusch placente baatin trutnev ruding shhhh barzelay billowed riverain biologi hemoglobinopathies sweatsuit ingol unforgiveable ceroli garboldisham upmann timbalier zawar dorenbos sakher lovric kreig chavhanga awakino dingess sultanova piquillo spillages vistaprint reregister racheal outdoorsy cradlesong wenker yaweh frosterley himo karkov ausburn childminder haahr donastorg omagbemi nelima payzant kukoc casetta pinchi mawrth francene mazzante shufflers prensky npoess chocolaterie lincon bauserman huxter bafétimbi ruchir lewars liraglutide marignan torryburn toddlin diakate shishou bechamel bartolina gurnani wennemars knuckleduster vanderlans barnetts barrea pardeza carriden nutso secrist wolkonsky squirrelly custers ennobles micromachines winwin titcombe solokha homogenizer flinches britni thelocal lankov yanming paulene mondia ummed keeth amsale vernors myto medis glenrio evps ciocci goldenacre dolora seatrade diskus bossini xnet epiphenomena kristofor balcarras brillian chartis matichon lancope unsubscribed benzopyrene macfarlanes moonalice scopelliti musaffah foremothers argyro lélé lewins netbox lornah anella gremin computacenter sdic ithink xiansheng plodded distending nachmani boochever lhrh jcq amorello mcmansion acusing parrella thurairatnam gheesling rasey heftier toey aboveboard abdinur openhearted aquainted shalan buhne rsma ysrael colourway armiliato nelfinavir tealby lihtc tishkoff alzner ranin sagastume startpage rowdier fulljames shoc vanderhoff telepsychiatry saqqa bmcs polmaise lincluden bobošíková calliper hajin lochaline thorsell cistulli irenic trefeglwys connectable wernersville kettlebells zunil intercontinentalexchange veeraswamy invadopodia smaltz pekkarinen chintzy rocanville batarfi lovecchio nanosheets kolinsky 
willfull benach pichushkin limewash examinable kochetkova derges subnotebooks seatings multisim jovicic duffys thac arakelian tomarchio crusoes lqts biospheric patchi haselböck camay freckling ironkey reagins tussler gutjahr starmore apua hanker sumang coppack groupmates koroman enervated woolls boardmasters regenhard resino terunobu cyron ioakim jerrett nyne khaldei rohling seagrim fanfreluche vanno baptistina sumbe romanticise shehi ivaw sspa liwu zappelli diffferent tysse dyckhoff kolpakova clevelands renuart retied terminix dippolito dabdoub cajasur degennaro blaeberry boohoo okny hcz hayehudi ecmt seleccion pelago caesarstone dijeron sartorelli jerell fmca étouffée cypermethrin sergas odiah bloggie seans goldminers egocentricity mcanany coziness muhammads emprunt mimimum jeyasingh tulchin buckypaper lurasidone shenderovich pril preapproved spitefulness hawkmoths epiq shouln athawale nellen wipa groundcovers deonar shansky controverial mwita coleson sbarbati marlantes salesroom sadagopan wallyford spookily courgettes asphalting foldberg donellan veterinaria moscatelli bodysgallen fortt ruith macquitty fishermens shadley ackwards farrish schioppa mckaie carnegies althamer warndon sadique cervero hungwe movens torosian scheuch burkino capicchioni relaxer drenning shavon braan hollobone desutter maipu stmt lashoff dilapidations openx playstations tmsi hother photosynthesise dladla flummery cibil boorishness lebovic rhiwlas birkedal eiseman rafed miskiw bugie gumdrops hahns nissley morduch uncoil baltijos noninvasively schole andler eidi padrick dhanani abduwali deparment slumlords mavrodi mendivil constâncio darmawan faloon siloviki pamarot iskandariya pranoto arrowed pricetag inveigled frontality najafabadi megastars relevé cbes syahrir subplate mochary microsomia sinnerman pladda homotopia mirali usuall dizzia terramar qtrax béchir dimitria hartner dreamspace biosurveillance demonaco novellus portz ostersund skimping biopharm buttala trabeculectomy ijaza 
spraypainted wessner rubinfeld graddick cinv kvvu jianqiang kozakai luos haeg numskull rabhas multistrada spergel orosei krauter gulfshore romanticises skullcaps leaser aricent clothespins tlili moscona loudmouths glamorization coordinations incognegro haddara glycyrrhizin reconciler edur fekkai honnappa concomitants zarinsky caçapa successfuly mcphearson natos neyo poppyland queler nlgn avidyne wavefield sheathe moominmamma vuma rorc technophobe lichtenthal ˜ wortzel gialos thimister selter nonini titrate aggy pycraft thoday sloppier selvarasa magiera fadhila teriflunomide mwah bawadi tantalize sonrise vanagon sieversii depakote pentecostalist icesheet catched karwa staxton schara manster arinsal seemi apartado belyayeva outbred stambridge neded pantoum lavolpe seremaia gulbransen bermudagrass localists savastano zdrojewski rolwaling pakuni bautzer sheiner energen dionte bridon ashong goncalo iaee garate dscr stemgent nawshirwan fadem billowy crystl landwarnet lambastes raffaelea theberge wordplays seshego nush mccalister kangle microlitre darrion gleen duffen dayroom overbuilding litster tapola rosicky cretinous compartmentalisation fizi ticketsnow rocuronium gashing ostracise rutherfoord chalai uzis janecek wmh aboya linguiça killyclogher shirks pastorello jurisdictionally tiffoney courteen yelwa sutley soupçon matk poju felinfach brocheré eizenkot nannygate tonmawr dargins submissives redleaf disgustedly greff mixups gisla stampin electroencephalograph aluvihare zakarya paunchy rpks annelle babywearing langrée kovalic barracking ratey vollers ajvar nfer feminised ogas dhoo newbill huntcliff smartypants langlo chaouchi krasair gnant monimail nauseatingly kolde weixi numeroff kneeboarding spanakopita klyuchevskaya overgeneralizing astrologists msrs skylstad remmington gound scaramucci ashikodi shirdel levisham nabaztag mufg kuchis burres takahide decompressive trella temime analyis ziplines ezzati reframes pummelling mckail malinka chado streamable assumable 
neidermeyer mirandinha iald tabío ruli quitno connett huntik zoutman breemen hossan upcycled tunjur kerrys jimmys generationally akiyda eresearch roudham snw grubisic ademe policie nuthurst kuhnen epizootics enthuses roesel walbourne biasone bangara hapsburgs christner pramada voracek nbic feinerman colarusso kimona tophi abakumova undershaw giezendanner livshits khaleq minimed wordsworths palandri ndele daughton heizo aristizabal mohammedi ejigu atim yankowitz delai puttering tradeking disrepectful andreotta goodmorning bosumtwi giberti bilbrey fralin tattingstone idlis eithun bolsas dweik finanz cengkareng sivil freeters supras parlon bronconnier bardelli bochatay unbothered patzelt grätzel vanguardist bagage neurotology llangernyw tuohey dellin overladen lilywhite saveth shirihai adco ovonic moralize mikvabia elixer kenric schiebinger drumond billière swiftian parmeshwar haering braich menello overlawyered nautically arbory neyts hepting zagoria acing coudn thermoformed denter krisflyer overvalue baharom haircutting enviornment mitc arrelious pomonkey caballeria canziani grisdale liechti bazayev tinpot cihai clambers waterlines telesto tverdovsky verjuice nored eusden soumela ropert narcotrafficking kismayu twyning poussaint matsumi gigajoules antepartum oksibil plaister numerate spaghettios greenstick shadeland citrullinated oakdene eneida robiola irreconcilables vanj jiadong sarcasms begue doncieux sanderlings riminton bobbers ironsmith greulich zalis subornation backpacked rebensburg bauerlein rejas rahimullah spalko dogmatists deblasio proctologist laggards truckle altimari realarcade respose kaarin ishikura sherlocks devening guayaki vergangenheitsbewältigung ghostnet wamphray tocilizumab rafizadeh palantine conchi shamari glenboig kurant serpas peyrelongue prestidge hirvensalo waliszewski begram comtrade welshampton smurl pruna oakar dankers capricorns webasto moolenaar greenskeeper guruli spellcheckers quarriers dakoda aphalara brimin shwekey ammour dtcs 
colgrove smalltime finlaystone bannisters muscatel rabago monocles guyett lapdogs krzykowski talkativeness aircraftsman olimpick ramlan berdyev fennig rfmd norry actéon naseerabad cherlin patriquin cangnan xiangjiang gatete moroun nceas merlis veedersburg sportcity cheswardine demystifies smurfing mcdarrah munstermen chekhovian tutera mcci concertation itfs routinized mandle raghbir breier bentler genval bizimana romoli stabby camusso tchomogo prematch yaghoub amny woebegone byrnison athenahealth sidlin autenrieth phenonemon redound individualities mabin shousha dochia flambéed baramidze yusmeiro virtualizes botel bunei anoto compaore holtham shipe connoly gilbart ramit keams yesnaby pietikäinen maxjet hexion stanners finzel limonade mochdre concatenations monsummano marcellis tomandandy briosco borror nautic mcgladdery mancall vladivostock nessel santomero kinoma bahadir unsterilized swaption toyko abdulbaki ramé cheapskates gaytan webo crisan miraya mirsaidov jazzmobile unsharpened eternities marketeering blackjazz azare lanpher delea trajanov saarbrucken enwonwu merriott derker rosener desyatnikov meraki borchin calmus unet epynt defen luocheng dannenfelser sanitaryware saithe soonchunhyang backpay marzol paralell stanz etech rja freile tricksy makhanya engima manadon xhaferi dexterously tapenade mbusa pommerening salaheddin warcup longfor ebenstein hongda capnography wenwen unparallel sodini outbids kingway wrvr unisdr piratebay kerchner newbrook larrivey bookaboo nonviolently vendange octocorals brunicardi tury lamictal luketic mcnairn rostker koffmann irineo guleria pakleni falin picure overplays mbacke rikhi bobbles maitha clervoy fareeda leverington bkx renneberg fujiang tulong roedel natca papadum barkero aaish pflugrad summerhaven telerama dulieu mehru interfer newspapering cretney umetsu losts rauhala mertins suhaili moscrop papariga yogan couty marex greencore phentolamine iceboxes linning draskovic lalani beging spicoli cobalts gallowhill chancelleries 
pents cotechino dhanabalan delicioso alliterate apiarist lorscheider vandeveld harstine fastidiousness offerred massagers fibernet yamdena cpga attracta ogio wunderkinder belview wonewoc catina dementiev karlinsky larcenies arpana corthron tintomara raissi stld pougnet babineau juelich socma heyneke intelli metagenome abercwmboi officialese mayella audiocassettes waplington hindhaugh labit aarnes advisees saxagliptin sihem kikis coconspirators demises lidow orsoni djurberg bertron niedenfuer rhinology uniworld sdny tideline robling gallou offencive boluses poupou mindreader alwayz nocsae unrepentent ziese grandpappy borràs chordoma shabery bancoult slushie monolithically suell ashiana digiday micarelli banters sudafed merseysiders ramaley quantique meteorologically englis hindia pulos whar ischinger mundanity quesillo eglwyswrw fedexcup axline drabness tekonsha guarrera karasick stenotrophomonas hyperconnected reventón gurrieri cosmetologists potterspury larod selikoff jianhui oreskovich crimini ofter glumly darcys shaowei ateya essandoh bramos goche jlh segretti farmoor ilonen fraternisation gareloch abscessed baraa bikkembergs skrenta kandt longforgan accanto palaeoanthropology slavsky sikov garding altheide sebelia baibakov russkies stolarczyk ragdolls gudvangen sicsa casbaa nisin fuencaliente sunbed sabuleti bearshare talei tamango olbrycht scarola highmount karbalai jasdaq zorkin estruch admas mckamey rappolt pintat flatpack acused syntec brightons knowler olowalu upcharge guessers pagnini ariat pulag spile llysfaen racingtheplanet jesch mondell harmans nortman mdax bladderworts zeitlyn irking nixing horsebridge tamor stovold brimob charalambidis almyra ferrington heartier yacuiba remonstrations eepco airvana forceably sepetiba jugaad riveria iyiola simantov judis schaible brauman chromotherapy uprg delcroix androsov drumlines deckle plake niuas geitner lebedinsky gilauri furture rayyithunge arcega labutta restor nyambi phobaeticus makaton calbee aetf karsner 
appalls crypticus woodforest thelondonpaper hunain haywoode brotheridge ruskell skypoint volquez azada limn mccormicks angkasawan catfood erani eckerson rentrée mortin pasquariello cowplain nartey salved derron underlayer osadebe parleyed urrego ogbo wasden quindici reeker cgcs inkjets charith fonnereau olalekan paraders parodical flear bayart undergirded shieldhall kelenna ahhhhh butre yovich lousie copnor duker stotijn bielat predicable anez kamaljit killhope gussied potocnik appelqvist crashlands paveley belvieu damons moulinex bakalli ganeri overreaches brolan refere intercommunity baphuon medicalized yosvani floodways mousseline kabaivanska tarzian emison skiffington beneatha ciliberti creds whetting beancurd inoc whelming valad burmaster authentics sweetzer unicredito ghedina erzen deema mobinil ishag scenester transshipping houssine talibanization cafarella decare gejdenson crickmer hazley albawaba kayna rhug bencze orice unclutter seismogenic schwertner chikomba nakazono swaleh lungis montelepre allurements roic culinarily givanildo spencertown vergoossen mence papadopulo gabetta niewood ballyedmond quante murgu sidespin hollifield gones woodfords yugos abuelas hostin ewalt lavisse tido lesnick benjamen brewerytown portending anonyma falgout myhrer rhj prokofieff shujah backdate jeromes forestar cuénod woundings warrer longhill dehumanisation orangeade laffs ccra oldbridge leffert netco gaudioso hatib debonaire ficcadenti renaissances dowlin pittencrieff antunez endalaust gloominess déja sydykov attra mssd gienger koumakoye suzon vadum radnich interprofessionnel parles florales saratogian kishishev easyrider eastpak wyotech dunalastair klempner fahlgren beezie macel pacbio toeava luliang explainers glomb flashgun quantas everloving florentyna flexfuel gutstadt ponos nordfeldt bonannos bochenek alligin biziou canwick morrissy chipaya crafford springford graveling kanokogi rakewell workaholism chauffeuring nirmalya nutricia estoque minneota astronome 
harkinson vnexpress ffynone lardons clowers savenaca refinable kftc starquest winnefeld wohlfart hemopure dunskey ahrendts wojnowski goosens kalid endlessness lingmell woiwode botryococcus creadon kirste hopstop munadi impolitely cortachy natkin livening privilège snpc numerious mamane mathlete manap tzorvas orams sitings ijaws puthod kalivoda leverburgh walkmans spiewak sazo wbcc sascoc wadel nivard haldenby policlinico ochowicz codding biggish sarosh santulli guanidinoacetate anglepoise felise vivisimo erlegh benshoof vnesheconombank gulya skinsuit ovadiah sandzak galanes belched dccd sconosciuta scenerio deviltry skofterud kapha sokhan theonas holstrom vilia squelchy danchev semiskilled venita mikadze biranchi skarupa thid bitchiness scrawls muirend myhren footmarks filenet potties motivepower lambics bidlo stepfanie toddlerhood chegem xobni preponderantly frodeno ganes brynmill sterilizers hapen gallogly khalidiyah lafi slitted colacurcio shearling parnu daskalaki oshri unguicularis gerardia baqeri perfectionistic oxera ittre cragnotti bureaucratese sebesta palisadoes opeka falsgrave pietras treem lawnchair nayani physiatrists tamkin salcey unpackaged telmisartan bartak greenwheel hilbertz allrecipes festerling jiming triade jumpshot kgd cornman aboubakr layhill deployers malnar zangana sindal anifah nimick greenbrook sylke scheherezade devier ryals bizbash construccion coelodonta widmaier gaggero unife inuendo nonobjective rouleaux maow cadenced pohlig sakano celente malaitans lancker ladds donees kordestani pwap aristy elsje répons honsha boccieri whatsername coathangers wadeye outpour donnette shamardal senu görg barja tatel socco hogger amarchand wendlebury ofisa zinin cisternas aiyana noetzel nagami kirkwhelpington padmapani asug stuey slegers merron kritz misbranding seabridge bishai cheesehead xec wigig dayenu placemats akello octogenarians rubial tuleyev leslye hrusa oleanders unshod ndjamena kucerova lanzer uncaptured underplays hadippa hubacek 
cinematheques turnbridge catmint bnsc jinming bryceson lapindo supramonte mrff motrin aamd deferentially komadina bevere hefferon balmore jongro lapido unicor massarella crundwell keatts shlemiel scanio grimsey unimaginatively nfz hatchlands mndaa waylett sheffields mccluggage mancia misdirections solbes desgranges bulelani buttu lincolnway partech minaev basaraba recirculates genauer larraine houstons schlyter sandora norichika netvibes stansky fleckeri wapshott farolito samovars brunious mongie lacovara presidence paradisiacal bogdanos iervolino yessir gridding tankini riems mittendorf bockius agiza irrelivant autostadt welthungerhilfe baardson khiid superbeings appollo ortuzar interphone tardio batham kapadokya gurfinkel willises snowsuit snakeoil housecall vanpools bsis flassbeck underbanked everyplace sagoe ebble vadinho aeos blurbed reexamines addas boerrigter lydersen korunas tsuper subdwarfs ashlawn heagney casel campbellii jumani bandos procedings morskoi ingénu scit balistreri lusky chapli ,what monthy doci carrino energex edmark bulte abraço towerhouse godshall hargittai holusha chasten mexted penarol unattained francome cooliris liying condem sandquist iley coxyde ullock adoo changiz chapelcross shiyam tokely walkergate imler lavelanet valler holper debouching chothia petritsch monogastric noorzai sunzhensky tanusree meridia glenmere schoettle oeltjen sharktooth cdfis benetech obstinance cotty artbeat kregg chivhu trembly rozita dedge sharak llais aerotec cauliflowers patronization hammamy appeard teig omov kozun transload iolaire matricula teekay consummately shemyakina manteiga pitte unwitnessed bulgakova bloomgarden deursen goodnights adass overvotes becherer andriyan endeca deaccessioned deshun winterbrook fnma marchesano awwww contagiousness gillmoss achao acidly puppyhood deriso okinotorishima feierstein flexicurity trelewis choksi norrey honeycrisp womenpriests telerik rasky immunomodulators mendilibar shareeka dollarhide hemichromis sharemarket 
unlikey palins tunheim dikgang poteri consulation enchained idns transparant noodly orsières demoulin heroismo biomodels covi sghc lenhoff licea ibera remorsefully dalkia logboat individualisation hewage pleb llanrhystud reemployed satit dissatisfactions fiorani flounce faldbakken hyperspeed mansaf cnty silano yasini awaroa hergert romitelli vujacic bettinson nimbuzz quirkily dikgacoi mergia overdependence ddw asciak bunged doveman tamagotchis boska fxi soung hcbs heronswood eiffage sigalet mercaderes stigall orlinsky segedunum flna cyndee clobbers sumani bouhours lardinois ficano rouhana pubococcygeus oportunity boulianne fotomat morrells binjamin janene bessac ouazzani sexists temuri ingenues lobola cisatracurium beddingham yelton strathyre comeliness weinblatt bjugstad kusters leches stavropoleos dunlavey metastock lauitiiti overactivation claeson herati skysails fakey megaregions alil mclogan woodrell talega vergelegen tomdispatch ddgs cimes kampfner trikke hemann cheparinov wised shors braunsteiner podobnik suzue merissa torff sswc atsuto dissolvable runnion strohmaier niri ashouri mctier ramras zreik ihuatzio khaldoon saffari mamad teleamazonas bener tulsky fracs aguanga moneywise rommer dtmp mengozzi azzuri amarilis kafala takiveikata chantale oscillococcinum zurawski illegaly thompstone boilersuit choiniere subscribership mathiot vgz tajudeen liangping schuble lowenfels slews yearend greenmail varez matjaz zuazua taxista boink tecún lundine sasnal quadracci dhada yeren registerable wilmsen seibersdorf tenative antartic gobstoppers ruebush safian mauricette llanon ogunlesi snuggled ishmeet klinkenborg telegrafo abidance aragoneses winda adji ovos pedalers oscr abduljalil misfortunate kerching rollkur rodne zaraah ngrc sloes epeat abdullayeva bordowitz aspart dellacamera bartholet creton horrall beaumes isacson weisheng ribner kaibiles espically smyths odent parfaits jarzembowski paddlewheelers fatik khangura eashing boome ilisa boyadjian dacal onramps 
haiqing spanswick swidnik rhoncus guibal plummy elstone hyflux panchina swinstead skanes ovejuna hentsch ohoven speedtest disolved cataloguers wfirst photgraph mcninch richeze ronzoni orender topfree handysize wohlberg manheimer fryars wigeons talinn hovan trefeca bloxx azc morhaime omah boastfully iseq mentholated dhiab ridgmont britan amys suchus krucial elys mockeries saneamento graywater bolzaneto aztreonam misar democràtica tomotherapy quiara scarless amendolia playspace pynes pachacutec guvnor aronstein acftu houssein inexpertly goood sunbursts manba semiautobiographical perci andresito inoa ducange circlets bizspark jillion sturdiest constan willberg braker chalom amedure nonconscious supernet vitreoretinopathy hensby sellindge minipops geronte straussian buttermarket congenially whir osawe yerself strathie malevolently backmarkers meulman chalak kenninghall locandiera katlehong sanctimony simunek calasso repletion ileen plauche sideward muwenda tokayer bereng yenga garelick antioqueña dhoinine fich vezzosi actionist entwines kandie blowtorches dewain chavdarov allocable meniere rashean unhinge cedefop nekkid yutz harmfull ncip cervid escargots lubrizol dawdling newports obeso katherines schwaner bevois nailbiter katherin roseae valjavec wease avient volleyer bogdanski nachmanoff holmsley learing shurland roepstorff fanga gwybodaeth somethng hoogendyk ticketless pinecones matison santro piram nasara superinjunctions gourdes tanera ivlp xaui lunzer feese enterasys undeb cleofe midblock zeituni pasquill cantilevering tooo asni tennants shaoping badware impremedia dowlatshahi bijani arcc dunscombe genmab markovitch tormore onging skirda camdessus hrvatin laufman saidah khadaffy aideed albinson bidognetti ramify xiapu vonzell superhet tuomisto mcmanaway niznik kobia ukcat handspun bleats meddlers halavais bamc islamey usarpac huiqi ribaudo marant mandrax wirefly tayman kanja bloodstreams yelin puces cubers autoalliance deifying sergant corpsing zumino krumholz 
marsis shirqat highwinds labourites hüttig srirasmi mayakoba franklincovey incompetant labry maltreat talerico stanols caftans autostart molaskey aquasco tyaughton artron myaung streeting zaiser denmon fahrenkopf makhteshim guastella offed promedica abbotswood lidsky kickstarts sanakoev falkus amberleigh nonworking mbarek kitenge severer saimdang karoshi tickbox dronedarone kasl speediness decaires southcenter ervs sadock squatty skidmarks snarr spak christion scervino timbits chuku scarers phenylbutyrate silverburst exilim tarnopolsky nonsignificant hmsa summerteeth trinchese bojorquez kiffen iocg tsikata whne abridges kapchagay rossnowlagh bauler grabovsky qwe scatty cmz wavecrest statures irvingia fuar artim polyheme sheikdom aboriginies murieston babycham ambinder insensibly reprofiled lanzinger almansor barstools bhusal spiby slader baxterley jendayi voronova molzahn showbuzz grasscutter oyan pealed bolot sierens agelessness tamest prochaine randolphs lisianthus jonai afficionados waterer merali bluffers cjis gregariousness berding sowter shazli vantongerloo sveaas bitsa melot sloshes rubleva poru maltesers rhoscolyn ellies fossdyke gvh aqmd broadstock lyerly shipham avantage stasny overcrowd kharge decieve palavela misiewicz leapman bashardost chinkin dormido subsidisation microconsole streissguth ingreso vdim kleinplatz doogue caricatural aqsiq aftergood gardee idiakez malee kashua zonisamide markiza steepleton alney schmich wtas sumiton tafara marettimo philosopy litfin irongate bovisand buizingen eulberg fornet snic resettable marrion berriz jingyan pegase suffient stieger deafen mcnabney overdraw penycae cku paronto tregarth siveter flexibles imporve overrepresent vnsny nioplias geneco paroling isouljaboytellem alair privilage golebiowski collesano psls gordita dongliang tensilica dagne aecb lightmoor viscusi wapama coffeemakers jianyu berthelier mugavero kiwanja wallentin kice umds ballycran hidefumi bednarczyk mahdy alqueva bellovin ruhemann yammering 
spaul monteria obanda tonsilitis sebha colglazier ribordy minkovski samode boites camy danzante thobela orthe minker xiaosong hashr yesss dacher moormann sangamo degand decluttering schore megadoses biscot nerdish longhorned canefields senichi minxin izurieta santoku sullenly mauldeth kaney atthe ssentongo xaar peos berz familly mobilizer hollas donskis rollergirl aperol guangsheng mitterand rabanes tolcher trebelhorn hgn microenterprises ruswarp cogitation oreodont viant thearc tilera nuoc assurer gwm heppe pakledinaz pelvises torquhil pulizzi levade bastians moscardini anouchka longano evergreening architected kleinhenz irmis chilman socalgas colford bendavid peabodys mutabar cordrey duvaliers buiding chainrai dickert bomere blmis mercantilists serk courtliness glufosinate castrillon catya hillblom outten tipoki priciest planetree recompute schillace autodefensas storrier twitcher stakeknife cheekiness bargemusic cpri cemetaries carsia bembeya greisinger douzaine bromden marites britdoc rahhal nakamachi abbadon ifrss dictor brechtel giammetti srz koito casebolt peipah maked barbury cambone lendingtree overskirt jasperreports uninet bekki abilty tredecim supertarget buttars cluelessly jinguang linick detainers arobieke unreadability bonjean pigsties rathole syddanmark demeanors scarifying fastcase lunesta terzan boov obvously cherohala arrangment kvarme lacamoire eabl trepak debarment veze shamkhani schuermann böge deetman stous munches holeshot sarniensis wetterhahn salseros fountainheads floatopia adarius marbo rindfleisch schiebel actt intrasquad micelotta sledgers calcify mandaric demagnetize irrecoverably jagaciak spetzler carpenteria assuras rocktober dynamed demontagnac sodomised aspal katleman beatbullying stalags espinola subversiveness tronolone lekas heptinstall comissioned ehx pulverization nter roadgoing zhitnik draman cabarete ringless chebrikov sowton mcartney sahebzada shabunda uncrushed serpette promod cncp bateen kuluk weichselbaum gangasagar 
wheelis copout forceout differen iacet aeroscraft vray ouaga ramanuj gutty killean honeysuckles kendry treml icfr comtex whitelegg voskuil margasak choephel raquet predications openbts sasae moraghan deciliter kauser heslin mccane zerhusen momolu nonstick bateke cesid kumakura teulu vmap battistello nmsc tapajos struk tengas communites hoenlein kalandadze sovcomflot enery adventurist rfids missis couri roomate julika martifer karanas sardini comptel haipeng ilko surber telecity draftsperson vigneux korede overbooking pricerunner herkenhoff gandelman heanet cavelli kamol bramnick tibetian skyjacker yotaro finklea fundraises crticism songsmith spreti nubbins nevadans demonetization pharmacare ypersele mceneny kuapa krvavec preceeds quenchers tadulala decaturville taleo folksmen mayeux enviously furai nccf kislingbury roote friona allahpundit shabaan verycd chammah wildig mccaulley irishcentral dinnet ellers kelco advfn maced tonen pparc albertyn raikar debulking spagnoletti whisby mmic fortysomething kabr krotoski alijah pechman sajo laborites fermes sicav westcot bawled spätburgunder bacheta handymax birbraer yonfan sligachan vexes yardeni inequitably transcendentally tailcoats peson vakoc shackler americanese nasiriya koplin alendronate bottenfield fead tangena tulloh alevras mouettes lunacek foulmouthed korcula teyon fidlers figleaf godinton houtz pitsuwan serioso mammaries schwass japin vibratos varous asoif chahed stockgrowers birgham khatchaturian sahibabad molotch cromagnon gollner nolonger macdevitt hölzle thadani neftali caseys threepeat solemani elisra evano njonjo archaelogy dietlinde cirella gilkicker nyth baldersby gianniotis yonaguska reemphasized serpotta demarinis lockeford joraanstad danyelle casanave sleepout cremates aadil vongsouthi pretre tihinen haixia vetco nerger mistranslating ozinga druyun kalume compartmentalizing sanmiguel stengade terbutaline seider infraestructuras adelgids elonex asias ajirotutu cocalero haiba lowlight kadyrovtsy sejad 
bidness lazne commonhold eastborough schons triflin sakalas tangherlini karisimbi archnemesis hawcoat tongxin seany derniere poncy charara gloving hallwood zastudil progam sportsbusiness bartoszewicz choquehuanca kerith haqiqi ttyl mepis rsme tallentire cemetry efromovich irishwomen photobioreactors snores behavorial pierrehumbert shevin golland peasenhall lickorish houssam milioni unpublicised icpdr quyang langwathby moneybag upselling bbnp daggle ogunyemi rogaine calpin kyrghyzstan trads hiscocks lindenfeld psychoanalyze giannoni vantini contagions rhostyllen wayyy emarati othaya smoothes winckley maisonet vieau torness mehli forseti jerard tafralis sharits resentenced insiste unguents incestuously sariyev maccambridge balbach waterstock tirey dagong yeske flammang nemko journeaux exarcheia skedaddle comitology swiggs bopd barenblatt halloy austrailian veuster naeole battlelines bogush autodialer aijalon cipel oluwafemi blackberrys mewstone videoboard huppenthal tromboncino listees gasparian chaebols moclips kihansi gunfighting scaleable subspecialists miert shibao ccmc nphs practicioner xunlei starborough harke gottex vgg bisk avalan deoksu bordeira sinisterly aghadowey scialoja ecrypt poleglass gibralfaro mixa xativa jhw milfontes seillière pallab okam cannop yorkshires begovic baburova paciotti islom disseminators tweeks wolfquest faehn brittny townsell precipitant rydzyk blairism berki ctic rafiei waraich abyssus onsens fisherwick cheontae medix pointfest michigander missned bonekickers cionnaith khawja naray shaugh profanely trishaw compil codefendants dubrul gaiole poofy nuvinci rovero rouanet plimsolls metrovacesa jlj nrsros caflisch laferla superabsorbent nailin spithill cyberpsychology ghanaati trest maehl haberal bayernlb ambepussa glancingly werehog comperes monaci kindie kiren kcmc glavas brauneck repko mooses hygenic unhip subgenual arunas chatzky bucketload donayre soaping adamik lataif sitruk birdstrike hafsia yezzi plenette sperrins streeten 
pandjaitan menelas tappings onexim semblances anley skiddy totting issak courtley jeremijenko horsh vaunting lumpfish piglia choeung dubrave riosucio diiulio ceeney dostoyevski oleophobic thaught arastu farmable jeffre steenhoven gerren ballgown noras merial taurisano pantelides aggressivity coadministration nassarawa ellough ouanaminthe chicchi caofeidian furla scariness dragados hotsy dangit baosteel dissel cariso brunger kemmons stovell longini basanez bensch houseboys xuwen deguzman haridopolos panarea goerner togbe suborning buidling kharel fragos chionochloa maintenace gyeongbok yaiza clachnaharry melissinos clientèle civilianised ozm smiter theall isayeva jinke browde scything gianfelice knutzon jagermeister jinning ankarafantsika depaulo eventim hoffmaster birkholz leontiou stinkweed phoon amendement siala prostates unwins trulock macsorley nibutani autonet enrobed vorp foreside exiguous guestimate berley nexium didit borensztein moph vonta einstien mtwapa fengming goiters plusher appartment bahrke malaguzzi hadoram possebon thrombocythemia clearcast stipan kalee dades keralan jurjen thatthe slatton sannine roubin lobortis ataq bruguiere gauchito admen kierkegaardian boromo josipovic unsuccesfully chabane freakley cieh molchan yahadut fratianne gagen lipsy ashcott convivir ziehm importuned mirdamad impeccability cayer smidgeon liadov pootle synchs tagammu kamruzzaman cbga syfret hotair monchaux maroussia cheslow didactically eisentrager charisms ntelos bowleaze glawischnig narcan folashade kanayo navada waistbands blastomere schallau redds xyrem solarworld atsa toomay raming basenjis adify bluefire dinkelspiel introducers levere panagariya avichai ayson bariatrics otterspool robosoft kurtaran akau kozinets yunqué ngiti feltner uale mundan avows inchnadamph sensationalising pitoniak zumanity croeserw pedometers knowingness sunworld roessner heartiest thiébaud nurock bajans mcaree mcgreggor stamoulis kuneva kudamatsu earthmovers greenfingers ndege anniv 
popken machholz rudic anatevka korzh westeinde otane fotouhi emarketer nyaru sterilizes wisnieski ecotoxicity relaxnews fouler sabira viaspace moordown minsa benns peverill danwel netsmart ludgin vyvanse ffbc nyanasamvara locavore wgae craymer thomspon dpicm phallological spuck lucenti epiphenomenal richardsson placeshifting khalifé terreri lepori opensim funnye torie hesam enticingly unbreathable shargel rhizomatic bitung marmarth rosasco croxford daraei luly schnellbacher rxi ipbes diresta dynastar beewolf skobrev surama choirmasters sdna stingl arnesby attendent suppor göldi jonothan llywodraeth steadiest fraa idolise strummerville strathdee hushmail arriaza parsimoniously schlopy sturtze soumare ovesen reargument histoy jihong rottenness horethorne movieclips bialetti kvh berteau fromthe chemezov becu baumber cordey ranchipur nickols valderama borrowman ultrasonically spectron outrace oshrat wasiak hitchers huangdao midshires giering electricty ringshall kastelein gancarczyk gastman orobio khanin tequan yongjian artmaking tanamor splutter gudnason melich acknowlege vengefulness jintai showerheads mezereum nehushtan wizner mccrabbe shinwar ringstrasse torrentspy kathoeys ijburg thokoza ledden sidestreets djurdjevic priciple aihua paleobiologist zhengjun minju federalize beharie irwins mcar danniel kalyuzhny clientearth babus burish britax npes plantée chewer dramamine balgay einig allurement counterfit hafte jamaran vkr levalley jabbers zebrawood mathstar mdds outwoods assembleias molestor storeowner fetishize rugani wintner shalgham betrothals supino myspacing vomitorium hanit arthrosis tith aehf eerc infields bionz arnowitt kayf tourmalines tonning wenzler jingping softballs escomb rundale controversialists eyeliners malverns sawani kesho centene derrybeg portio fussiness morrah tahmima nimir surley mereb mijak schyff quanjude shahier kayanan brumas philosphers mams xtina biomet orpah fingerpointing merediths unfeigned orgad gidden melhuse upnd michella 
strathmartine herongate ettajdid decoturf drossos oxetane kshirsagar manseau emson aeltc coraci higgenbottom sofield wepf borker nanomechanical zazai jupille ghussein parricelli cintiq autopen consultores penetrable copella bugnion gorenjska suruchi yejun extell roesgen wibisono merrihew mullooly buddenbrook pistelli tholl zwecker blogroll somal leeswood zingale grecos savulescu dyesol fabozzi pochards papper prototaxites pfox doughoregan ovulated blavat raffaela ballyhooed sarisbury chowdry unki ecall gamebreaker digitizers digitalsports horseriders wunderteam rosaryville combee weinshall imagesoft purwanto remands blekko dembinski bfsr mandis boeufs maienschein gladiatrix kinbrace xee cregger mcgahn malkus sovern yankovich sundsbø cybex rüttgers sarky matli egoless dwy pruners lungworm collamore porttitor kucuk unremovable headingly bandier runback tabú urdapilleta xrc prosseda namechecking défilé microwaveable newbay politbureau renovables cardioprotection slimeball distributer schoot lomeli lazars sharpish mesnes insitutions kraska mintimer ticketholders kehn polical stringencies transship bildeston birky transmen clitorises maasdriel kaplans tracinda zilk drapchi tonello unenhanced uprichard farter gemfields mambi belnick vidalin mechling ghida wallpapering corridan dangrek rhio grandberry colotto nffe flacon glazman jabby gogerddan dmos rubinow isobella lootah bruguier thornbridge lambersart slather passcodes maturan lawwell stoschek ersen prechtl stonestown lengsfeld crss actuarially lupercio dimitrius shizuishan saharans tanski ghavami zaslofsky wakens healthvault boemre outreaching backstopping ousters paška inhabitat symphonists grunenberg underhandedly treys demann naccarelli kelava alongi pocar mattrick parritt sherco rwasa reinebold eastrington gonaives kwena sogeti unscrews xke beci encryptor cotoletta belkhadem traue rafikov anthropomorphize chanoff sergy merideth ravasio airiness nivins centrify manhours avunculus geekery brisman thway hayfever 
madhesis dafter unpleasantries dickinsons mulé jeppestown editis klueh torvik lyndie politicals housesitter anomoly antiabortion consitent cytometers sprouston plaszow frago cynar acti cwmbwrla manil gazetta killylea junaibi proactivity abdulmajid amalga appetiser miramshah aquisition polam reljic gárda hamod cristeta velcade prespecified bodyboarders hamzanama arvan disagee croquant bartal healds videx clanchy amoi imoinda riml summerly claisse laboure wildaid briffault barrymores espys scatterings figurs penybont previti karrine chadway kumudha larish ricchiuti becomeing choralis trubshawe frsb kiza vestara meya hartanto fredricka chloroformed stanowski drizzy emler jezz doper hitmaking suprapto jeser igaming cymbalta weadock sellam chrd ballykinler stankovich trass mandale dvoskin pearled haadi benchill cloran karsay villafana bacelar hustwit ibeji inventus deregulatory tirr penumbras colcannon sodaro reitmeier kuentz codexis cindric sabetta elvs drumaness keqi biehler foresighted cwmllynfell zalloua smarr jenufa ayorinde musorgsky metgod gastropubs kiriakos columbines antiracism nezar excercises ivinson kutlug intersectoral radicalising torchio squirters akpinar olgivanna sudworth domalpalli alhamdulillah malachias podkarpacie vqr macacos drumsurn brandell sidestreet redisplay sherchan unmo shotgunning jaiku tortorici tabetha englaro alierta konvicted antiobesity ailean janaway clubroot theel chalifour mynt guek tillmon pardilla hagemeijer steepen plutoid inhorn magnetotelluric critelli harmonises naotake sesenta helgerson obesogens unsell yibai protractors lunesdale lesesne camperos panoramix wcbi mwandishi bourguignonne composter cossin ebid lambson kilcrea platek ferngully cutright hadassa riyale perfectible merouane sheepherding setian morrocan sampey taohua pasteurizing shmarov retreatment perthnow ologies pivi yente rupprath cachalia lazim allstream hrsc knabb gilang librizzi kontrakt edder blacksummers aminat colage syeed mahfud croson tumukunde petroleo 
ryter chloroprene huitlacoche poppleford steephill canala supermercado vilsmeier oncall maraud mundanely nonfactual sakane ornare nihombashi armini atlantan machlis mcara ferencsik calitzdorp relitigate djellaba houwelingen nadaraja berglin nikolaevsk pampore mabarak brotherson chimpsky freerice glutamates harpin jested mocan stumpage balmforth kasule karpluk rashkow maake strigl dagley khazanchi soltz wertman adta stroughter colicky cavalcades fsin fiesp bbbc microbicidal reappraising graymont hajian ungi fuction maramotti herchcovitch celebrators gridpoint karamon abdulhakim ozkan conciliators beddows tqt electroplankton dtos caromed initative photoaging tachographs michelmersh donehue ndcs qrr hvala undermain hemophilus sanghatana uncleaned gunters wakeel komarica garabito hetian qmt bollenbach wittle jaumann fibo rothgeb ramda pyrotechnicians midmarket bertschinger undervalues ljubinko ajos chavers troqueer izzadeen collaring bimanual fogies greenlander cabat ellertson slocock twibell atalia kartick scandalising akimova nedam dickleburgh aplington minisodes tokaimura unconcious serasa pasm handbrakes phanor studiousness clunkiness sabangan moonpie kyota mahvash mudding requalify abdelwahid thamesport islero rachleff vancouverites wouldham tudose zikim woolcombe baracks mockbusters maddicks neuropsychologia sweathogs beraud ongoings portnall nazarbaev tegenkamp stepsiblings frankovich powervm juergensen bayal zahidi hashmarks halleux ottoway whitear alasay pargetter brittish berlijn grotesqueness dawidowski theyskens banyana lowles reconfirms jaelen dotsero voegeli enewsletter diglis medog couraud yulayev insightfully hcap fremaux joyfulness moudarres dopers mandour aesc orlowsky organix ngabo ereaders impalpable barszcz buckmore ottney vecdi epla fobbs bedsores hedgehunter longfields amtrust glogowski medicalert cinacalcet colorfast entrepreneurism juppiter figler akharas cycloramas schi heagy soitec doctore ortez kisseloff bolivars sheeni barawe fortunetelling 
norona amarillos salamo ahmadiya browers monneret zhaoguo odfjell notorius hanowski juman sambhogakaya quickbird federick klockner stojakovic ciegas worlingham rasgas chadors overpopulate konbu trbn calibrators zhirov hawelka robiul mcmissile blackston lunnon boxhill lakhwinder weltschmerz fasani altounian cladded containerboard sinfully pavlides jayatilleka quinson highweight chacachacare ryol fumagillin veligers jabron benzes mothersill hawkesley daolin verkündigung letterbook bakopoulos crdb aliano bystrom zesto cosla outworking kleeck birkland yeay catterline gaeng dainties pacnet anwan bijlert fidm maisano lachasse downtimes lapolice cemm mgus unware recontacted eidinger flounced aiyegbeni beav culioli jttf norita jermey listecki rodio dabrowa whup egnor kerwood hanzala montserratians rebase dumberer analysys nonmelanoma sumka gumbi baugniet respites guruswamy rosettenville drumthwacket quantin trendiness tarheels puygrenier redbelt worstall bellei suzzanne kleek edmee guidice sedillo olso multimission mclaverty karelina cablecast cefepime trevigiana sathit leavittsburg vansummeren knockbreda kalfus impudently yengeni smartdraw friehling pointwork gatward lilyturf nogovitsyn bobbe efthymios hughesy philps joola krave geisberg hylonomus centerman taloned malow dibber boanas kipco eisenhuth chewits maxia soberania listlessly nyswaner samoei djangirov bandwagons thumbprints wnyn sarunas myoglobinuria tangibility haski artfulness lagosta lingdale heugten faustman rebelliously stacksteads towhid lamanda toper fayiz stellman vamoose jewcy murcheh ressi shunqing ciolli javarris rmj wises burgerville habig kannywood pawb archibold laciner suneson prevnar borrachos dinnigan hitchock kuhnle agusto dailytelegraph stango silvestrin torregiani proh hinestroza toted cholewa lackman autists faerch crimsons badme fiancees glorying shimmying dudfield cheverton likin niloufar charamba mussarat powerplays texbook mcgain abdessemed waldhorn krolicki detter leurquin mihas pixetell 
konw shamik nduom redelivered kosmix approch keyrings korade jaquette gerischer kcrs shanwick roomers volutpat biong bavidge birkman kestin chelsy llanddaniel duckers pingleton stofer maricich breard getco sabey carpinello sargentini geovani piershill beixin sanmar lutzka balestrieri heptyl dangermond brilliantine votebank ncms mikov postfeminist cbda stylizing andisheh joella lumpini zaccardelli telmatosaurus bocephus marrian pierceton akintunde cutline atheeb agranov dicksons miad sheered korogocho synthons bazid rossion floodable prawiro moretown gamesman ropewalks mulheim desveaux guardianships campañas jaschke mcclave aphl privatair agripino aractingi swinehart putnum mischler becci efps foxhunters paddleboards ptbt hongzhong ritty tozzo porsch wrenshall minneiska montbeliard codell emgrand vissarionovich puhinui ahlbom aning hyt kriger demagnetized selukwe molestie hadcock selenoproteins boyatt theepan pissant paynton firrhill smuckers cwj lagrein pitkanen bukuru wtu velislav drearily pluri drollet tepidly gaventa evhen bendu apostolides tebbitt rasanayagam seagreen midelfort foulard soakai santore plassnik protech kozhara alchemie cliftons susd hellacious telescreens nugen zingerman barck tuckingmill echavarri seanna koyamaibole communtiy hornton bardawil cheleken locorotondo lapread laria blunderbusses governate arhp evennett thrashings caldercruix quirimbas pokeman viven ovejas lenwade lasource pastiched larmina catamite faryd ramadorai bohnet finex quéré semini oldish tejana contente chualar bunac rathwell vivisectionists merhav meeder dvorska muntazir rotgut gaunless yuguda teneycke riffworks hemmerle spankin norweigian eragny myrmekiaphila chaing lungarno scootin taegutec damningly productivism majoros tarbock phedre prooves frind idioteque slipware eisgruber ziffren melittin undular ghozlan carjackers mccalliog ultralow platitudinous concessioner pornotube kiltartan nomee outreau barston cilice qaiyum borroni pmis saboor dowley rsps corbus 
overcomplicating fiell aralsk travelzoo cdph recomposing trogontherii chandak supercasino bedtimes cusip wdb cwmgors tohn cheeseborough trapido sperandio batka xinjing statpro perfumo newswomen altinum selvie benchemsi pevely taxability baghe bofinger kuchinsky fanboyish kaly dowle birrer ctcl girasoles puked beitler trimspa zeltingen ailish nespolo cccd gefter quitclaimed parcelforce nbv byaruhanga deutrom mussy zeidel inclinded cachette reegan hamzat psti allegis devauden pacsun qiwei akerlund elderspeak xcellence sheelah unbolted nesto geeking popover dickes resectable iaam cantabrigian killey veteri xanterra samouni esgr uncrc awac vsw credability rapetti khudobin beenham mehus anusorn hyler stanney schubin resnikoff gibo ncidq foxtons tippingpoint bauger beltzville rinkside mutuma pedalos disconnectedness csikos lozeau oberhaus aronne ungroomed dilruwan lamboley calleary ulil makhluf flashiest sanha lukoff unionport busuioc oddments mihigo gahl terumo tussocky peskanov perozzi stignani geor serwotka kleinzahler taneycomo moonacre homelink lydman dumbshow abdulwahid elmy shvat affectionally fungai amande sesno glenallan kcls colacello reformulates castlebridge rumbly hollywell craigston leeum sadis gillerman snuggling seelaar grealy mcwhertor uanu sakiya comacina marggraff unstinted hirz overutilization reyle arvans bridi oppressiveness nerney ingroia schroff laferriere palladini hallers sunkissed housecoat fürmann mortify powr msowoya dorry morupule imperiling byol daquan koranyi disingenuity zeitlinger pandemrix jasmila exculpated tardiff fritwell stewpot kellwood tolsta forthampton hongliang ghadiri hansdotter robotized patua wuhayshi bact rassas univercity remicade twinnie naringin pietramala ladakhis elsehwere fdrs kekexili islamicized roxwell hanhardt chihab pasteurize unring chirla wardana pendet coldhurst bogarín arbete muhandis chiren splendore naanee spivet peynaud rosland cmus pdge feniscowles margairaz llangrannog chafets thato capitated oudegracht 
treston cardsharp oostendorp callejeros jesselyn karadjordjevic thania hundered celar harishchandrachi amisfield particlarly anouncement purtzer bmxers nepc juanjuan polishuk winborn chewang gatfield easd iglhrc abdulqadir jarrid zovi codecision souffles desko agrs hanretty glenisla actally averis backcasting senner snoras fulminated tator yosser bulevardul striezelmarkt hyperinflationary killeshin withouth superbrand loosley marwani lebenzon chewier carharrack showin petries tueart edrf pgnig suñol throup tibbott ismailova picnik sundel klonopin jorde fanene lloy merco annuloplasty montly vlo milcom engross jbe shouldst satiating putley freebasing tjostolv wijffels cinemanow francescana giacobbi wommack nayda priuses starkest ringuette thurcaston precog mynatt eurlings yunding alishba glascote ginen kurras complected ipda transmanche realage jeebus bortolotto purcey salvant huberto kaidanov mahboubeh laugavegur shuanghuan elkview yesil flightstats obadeyi nasonov felicio zafarul oversubscription kupl qadhafi rollston sarley bedsides speyrer songtrack klete steepens baddy sportfive lachrymal nilles kirklands cecom mmcc stanyon resuscitators picturetel ohlinger eurocities mlab rizwana wolferen leiferkus dewlaps defrantz tileworks sarriegi evjen schaars hadida chillan rexam stooke tongsun dolcefino gewertz oneweb gollon macray thalamotomy schremp sceptred beechdale milele schulke xiaohua superiour menoufia ximeno skyjacking tranquillisers befuddling runion mignons manacapuru compucom hensick shoutouts goerne ghettoised veanne superduper trystin ziama monetise flimsier ingestible sheley lept istre wmmj euthanise solangi measurment merriest leonatus moscicki stogdon tyring icewater nalani costamagna truveo sifry trindad vider reflectography genuflecting painewebber overcollection kovalik vontae cenit threespine gruenewald filmo vocalink unshorn kinichi khuram woma guixé cranhill misguidance vibing muzzin chli máncora fanfarlo borderer deficiences gruesomeness discusing 
blabs mardani treverton charabancs hubka kinderland shackford hazin sterilant direc hmic benfit yagman crystalized dehning tahoes forceable anyango olom nontaxable nonperforming prinzi venenatis kameya herterich hrafnsson roehler loverdos mcerlean overule scriptless moochers smartish poag poerio microtomography pufang franses crochets lashbrook samkange grober jamphel vlcek shionogi rotschild vapourised snoad moskin swartley panagi pourville madini stayt trucchio thuds magticom shmueli desperatly lipinsky sfda zahri craigentinny clavulanate reby tweeden vexillifer nayeli victimizers hydrocephaly margeson burgeson cannom duvvuri limning wardsboro omeir zawacki axelrad minadeo sakhra ghalia cinches alabel doodads wadded metacarta commiserates rodber pasquesi caryophyllus inzko satherley inalienably joose traduced mannelly pillitteri stetsons burlando evdokimova gallante thorstensson unathletic idodi unitl vriesendorp oneview soapsuds whittamore rewbell zambonis kadidal misspending subarus basari piperazines mutula daneshjoo bridcut minoso fairoaks malakh chages kornblith viciedo dulski testamony gumshoes substorms glower stachowicz urechean samoëns pruneda polz moxidectin gendelman summerhouses ndem induta thethi kiyoshiro simeons mieville guevera pembrokes earthward publised possibe outraised chocs multifold viscri pickier mauresque laparoscope mizuuchi dragset hcrs midflight fedwire novelis fortville meruelo momar libuse brodt hoecker lasersoft airbrushes prasong demarked homeshare ineffably kickout fvn masoods giggity immunotherapeutic liakhovich despins matthiola gestations gamov agarwalla pangsa capio tricolores xrr rainsberger majal begell geerdes treena emmes wesray khazim liothyronine effecient coliseums andrejevs biocare wartman undock cheesewright vollebæk witthaya arafiles sekel submachineguns synnex zollar stonewood frittering regbo symbio fenestrations husbanding newmoon zyg villabate ortel hakainde woylie feltonville asbs baobao weighings hatemongering 
rieseberg tuiaki pachl pinafores gyd belaieff grilse hirigoyen canings huntsworth zourabichvili boppish buhund cohabitant ssmc felicisimo bartis historicos shardlake ardhendu stadhampton turps witasick transfat shecter harindra nominaton mapit nvra matavesi brookstreet excersize wynnton talkov harmond welshness amigorena xaba araca tartufo chidchob hedl cressi casd pyridostigmine relining tacul vialone techart thaleia khudair domanick kieselstein waniek elbaneh knijff divani luofu shatha rahill auxillary aight wainy ghio ecta sensitising unscrambling vaccinator poletto constantinidis tilshead anniversay xactly madeirans haunters weihenmayer lewenza binham andringa eschool amtc schmetterer kaltenberg vylegzhanin guardedly deprogram inlcuded eurocentral macmerry popejoy ensing laurentic sandretto garthmyl jamileh zemmouri forsgate tetracaine madadian anyar graeter kracher economos demodulators jcom glunz yawed glusman tolchinsky pogan mitofsky metacrawler overnighters sunquest groms mukhamedov hmshost oneword wintonotitan superorganisms gunatilleke jetfighter newmachar ittiam glof shrillness solero fonyo phrathat superclubs kirzhach monesi swach scolopendrium verisk bachelart küntzel unrebutted maralal coffeeberry europhile razzamatazz yeyo lemoncello drumquin threatning tegtmeier chukchis wongsuwan deludes jolee equibase chandila transportational unhittable toadying shukran spilhaus gianopulos sveningsson oppertunity wawan teruhisa rahv avellini hameldon firesheep jidosha wordnik triggermen campanaro rumbelows glaud nabatiyeh beashel veiko minesite nower triarc emosi wildi pandrol srmg fluoridate parmet krongard octs dalesandro allocca sehir millier cofadeh hadjadj blumine haeussler llanelian manross tagaz sujeewa mouglalis bonu bocog manojlovic larossa koffiefontein visitorship caringbridge milimeter gsee fukumi ligron tsip rustles mediavest fdis zamansky ljubicic purkersdorf croe varsseveld fintray guita illington guntheri midon apocalyptical vitalized grabois 
loopholed fulgenzio padro yangchun golwg dayong giannattasio suzyn khalaji ibata khwazakhela stroth huidong llanrhidian slioch kronick rumblers dasain bodett pivarnick shirleys albou kerrygold zeshan absinthes cpms scelzo etzler rompaey forgemasters xiaolei peerenboom lounged lliswerry ghayur migel unbought mecia heggy fehlbaum birthparents videocast macanthony famadihana kraeutler shiregreen raeva xudong blanefield elderhostel hurowitz falash navidi withour samro wiebo elspa goodweather sportcaster stilian pdab fiszer moushmi mcsporran compulsary rineke fawnskin krue indispensably caveated grrrrr foodchain philamlife nelc bjordal amusa milio bredel lebeouf littelfuse kildea sulforaphane dialler theilen piroxicam flordia tenaris vagator russak haplessly eckerle poplock decendents lassman hinkelien jarba fatsis homm takotna baldonado aglietti guelman erhebung glapion sizzled tomaszewska redmill mcflurry offish sukadana benish multiannual maesglas maidique reagin gillioz sentaku guadalest demings langill vedera censeo endplay fristoe mixel uncontainable wieght makas delicates navision cioccolato rasslin wisdens bershawn malvertising chunyan wyberton matchstalk winklevosses gaylard registe marnhull jiansheng tokes reigon revalidate ofatumumab atypicals bewilders sleestaks redware oxenhorn delphia janessa sharbi arabisation rodnyansky weleda spinny tidrow mikeal dipg durose detents beitel cemig schoolbag weybosset xboxes lmno goldklang iseran fizdale mcfie neurohormones kitchings infinitus seabank warchest chumming fredell neonatologists arbelaez dragonera yankowskas prevaricating prosecuter casten mansuriya treneman synta secluding lockhead kassirer beltzner europeaid bonisteel villopoto lietzau tighty pandoro parahydrogen elefsis chedzoy cortas flimwell artuso kallin kritzman trifan glisters grais ungraceful foscote peronal sciullo musos troyen inoguchi segalla dudarova irell esspecially fforestfach scheringa ricciardone emmens belykh muturi ahmadullah caldmore 
purdin kubotan teletubby richarz belous multifunctionality luben queniborough declarers onces jayda tenncare willcom laboon vanishings plse purees successfactors palenzuela haría dakins payoneer rtpi hyannisport crosscurrent pricy overacted clais contextualises proliteracy arterburn hongu muckelroy extenze neten winnacunnet schweihs odalys lourey fakeness lipszyc tredworth terui borcherding guiders floozies marvelling dravot loughcrew inititally komomo herzigova wyvis ammendments cymdeithasol webfetti grillini stroem bedner oten mattityahu caladesi lynard tauer burradon nullahs schoeler myspacetv annahar banyans phakic wenfei cheroot breazeal sonke tiep liwonde diomansy schumi tonni rephotography pitchside rushanara aimster calabrians recordation austrialian achike plasmoid swanigan hushpuppies licencees chromos kiszely hackescher burhakaba barung rekowski streched tandas oniani intermittant yawned soojin thurtle jalapao morgano dadds verolme grigonis pilhofer chics odeen nrpa alvilde everblue phear manjinder lortz promiseland informatin smashmouth zouhair trusthouse prizefighters burhou ropotamo geggan thileepan rudoy behre gomphothere attik kimpel sorzano rudby rasburicase wdk adhami bloomy cresci dairymaid pollies rescuecom salke lilliana ratemaking abigael plenaries axley peelings bakry workshy orkopoulos mulwray consultee barreno fehmida gwtw wickaninnish cchit neighborworks ymba offtake gottfred fenthion symyx wrigglers vulcanised distillerie oneconnect szydlowski transmorphers peshwar nargess freeflight bernardez mbos nativities bowett reback semprun mezhgan potholed conna muzzammil oberly fange fixmystreet fraternising petibon ectd twistable bagatela findlen galerne tatties hemingses andolsek jija aberpergwm inrix dicosmo denunzio fishtoft haux fiving nijrab catinca yajun glycemia telnack casani tureens brixmis takwa legalzoom uhrmann hershenson playdough funeralcare sutardja cbeyond bugala jersy kaniz filched scampers kontron condorelli palsas chugger 
brezinski instrumentalized sothoron warks israely wassila bejesus tranio acurately liathach mutarelli serreze pepole konyves centerbrook excon eigenmode mimouni payes goshka ragtop amsc shonga shapewear vullo cheret fitou sunjay barrit teoma carmontelle ctol dzhabrail rapleaf gusterson tpcs guerzoni bigombe llobera hordichuk herbsaint oxyfuel kishmaria spotfire rangle baltimorean ituran culinaria jatuporn isofix chelf chre dehousse gymkhanas scaparrotti boaster nafai stimming kobborg danieal huggles changthang ashtons eufemiano landsdown omnisexual zess percipient paulita eversheds curado achnashellach bbbs lrmc reynoldston woessner divaldo weibrecht netplay macbird aglieri outq mountnessing tores lamur zaborski lopinto gibor winey safarik proceratosaurus beretania agasi restefond tiedt parten painfull microcurrent bulgin ringgo akard billik componentone ladens potashcorp ridgeon grijpstra arculli wislawa cardsharps frataxin dateland yerushalaim bargepole licuria kinyara kureshi ballog seemd walster galamaz prerow acebes chemring stepover bienkowski asness krsko fastercures swingometer yangchen sefl götschl servanda bolarinwa komfort mofcom wära marette sportsnite monitory maunders daroff chmagh jlens overstocking ketteringham fgx chartchai himadri shimmies niga accoona petion clariden shuqing gardenesque biegenwald mfdc meunière svankmajer pfab hookway victimise busken rosebushes rimed specfp castrations ultimatly vatskalis yids radcom sanela feazel educable ambiences sativex anone favrile neurobiologists trigged gilliss basista laipply shockhound seasalter hafidh decipherer kayson ebere momon ciccarone neau incestual seewagen parling merka danario dendur fantan intifadas yeat fensom cohabitate millenarians ethinic zabumba ocassion thistlegorm servicenation xoops wsbc griessel elfish zakrajsek iftikar acclimatizing insertable cholesteric infantilization getu cinemex heeks hjelmeset solinsky spraker tortes melini cimc svetlova dervaes magmic kielholz nty 
paragominas willye riverhills santillo grudzien autoliv jinko loudhailer taxotere peopl hakani globins athenee archvillain bryncethin djenne bredenbury härö manderino getzville elsenburg peradventure krei swiat senafe carolling deephaven tvcs danescourt nellemann turtledoves ljubisa traffick rutto roels ozal grippy exagerating pompoms ither kuiken companionway neziri podlesh vidanov enodoc hausas lenadoon twirly spymonkey talbiya hayan pellett orab laramee lumieres gautrey balky buckbee xijing haleema sinnet costliness méret finfer euromanx sandersons wcwj pougatch baocheng sacredly caixin beeker birthrights systemised aucklander wamakko patronizingly incans aibek poron kasirye strewth liberalizations groundouts jaquith landmannalaugar taburiente bulwinkle vdh kirp javitz ouertani balach metabolix sexily refiguring gestair aasld cosl antiporda molmenti lacunas gwenyth populistic bopet tulketh wildbirds mirrorshades kutahya sonnenuhr bensin securitize zawe dmvs dovell bradpole cloggers acropole tetsworth kyaing sidero dungavel maltsters clariion dapagliflozin yesipova aronda troedsson kwambai frob chevannes trusteeships desalvio rezwan jokesters hisself gaymard riecken argott superfreak vidcast ruetten hairbrushes aparent junchang tyhe alshaya alsf pargat parimarjan sjf molendinar pundole chatauqua scolt mobisode siobhain restuarant malallah unreadably coinstar gstp frischer azumazeki exaltations franjieh meskerem loizeaux verdecchia cottas puthan channings chukwumerije colthorpe unintelligibly sherazi teamsheet agoro shamini eshaya elementis almaco fascinator maximón saganaki trostle fibulas ababio sifs posssible niedner fogbound settergren trollers hereu roggensack jhonas leiths maurkice subotic amhp boachie mlungisi bewitches digitals enzyte undersupplied mirones arabatzis feudals hatef witkoff samogon shipmanagement derbyhaven pigeonnier rockschool mared geeked nanosolar harlyn bedlow kadum filtrona scrase planells simplyhealth nymeyer lisey boycotters ljubic 
defang iqm dunbavin centerbridge plumpjack irrepressibly hanah overoptimistic chepchumba cossall culotta abiamiri meys psalidas waterstreet leivers headcounts pummelo knobe stathatos chaddi fricassee chfc inveighing ruyle shopkeeping ahould boumerdes bolsterstone douz rautureau ranelate arpaia envenomed carth majorite lousiana diptyque odighizuwa shenoi kozmus pison unspecialised baosheng olesia intrusively nekipelov kamy cicchitto mslt paux froncysyllte klensch nylen cryovac arzano faycal esperanca rakishev heshima aschan mcglothin pcoip ziqiang khemiri minidress rajastan gumercindo efrati norcot supercapitalism zestful shoob nonstate smolar lesters feiring milds eyrum unteachable mineworker exhilaratingly myelomeningocele sitbon powellii indofood alego zahia gueiros haula sipper cabeus buschel monomaniac keed freiras naek hurney englande dramat morawska lanzaro sudova meshumar fundies ozunu permut hapshash onasis betão xiaopei loval minghua bohorquez sidener hemminger fsid maggoty diveroli colourant gemmy caulcrick tosatti hydrocharis zhenxing nyctv koup siimes fojut figdor chuckers gayeton liniang stadelman yathreb fingland havenco flirtatiousness burkhoff baksheesh torley developping yassini bogusevic alabau aquada featherwork arbours gilfeather lervik rowhedge superport deluges massone mawsynram acasa scarsbrook moulinsart scif kircus tyreek inania gulino valueclick unifiers eichholtz kobar coquetdale improvidently jakez pradas earwicker qargha mentel fionia hawesville ponchaud esolar stalmine borghezio moistness edutrust othewise bamler ranque forray kilchoan moim dcct adulteries sosua auditee acsb cahirciveen strickly lochtefeld descisions orsborn querencia uyesugi ajumogobia appelhans barrena dapibus qingchuan chelston deputizes dilwar squealed millionnaire shearston remedia amrutlal nerad rossburg prettied dreifus sportiness savella kiiza kashkin croxon unflyable andother endodontists katen thsoe ballygally hutman calafeteanu hypermasculinity balhaf 
woodnesborough gecf buchser opeiu norene gavina vergie absurda thickett signalers passop saeson verticle mohajirs stamsnijder theplanet heatherdown lataillade billys toxigenic urbal admonitory khbs reuveni karpan kukes nyana zarlink pongsaklek kriesel enjoyability faskally codefendant declaims joyrider cantinero trevarno wantz jahagirdar noorullah sokolovic gouze viemeister oelschig bienwald biotime ilgenfritz rasoulof middlehurst mastrogiacomo reburials clickjacking miskimmin gruman leyh bradsher gabetti roenbergensis autoworker candrea bdds hemorrhaged westerkirk binman nimbler affalterbach farmdale convinient forfend praxedis tantaros icaac boleyns kashou linchpins bernabo jelled hardeners frends mgx zian brevetoxin opression aaus apkws mellersh bartlam roundshot topis gerindra cheseborough boerman transamerican schiferli chdi engorge smashbox prinsendam knobhead greu preselect saharkhiz neily juett reconnaisance dobrzanski koofi ortmeyer imprisonable ipplepen gentilozzi getaneh pyleva zengo memsahib yappy serouj familys gusteau predicating kralovec microvolt transmontanus strief serrie anoosh yahiye ninas perurail quickenborne saaeed moneysavingexpert cherenchikov mccollin heca masahashi bookstalls simbol announcment jnpt kikay shely winai guayabera skirling bouzereau groupwide mezes chuprov masculinizing radwanska winegrape huget tanhouse akenhead stalkerish beguiles wallsten piratpartiet guiberson colak pacuare wahhabists kortz meisch curlicues plunking hardihood jamus madell tullian mitered neverfail jerseyman pianigiani titv aspirins epke solomonov travner germino mountainbiking lortkipanidze disciplinarians schnoebelen metastasizes troake mccatty shellacking haasis rodrique dorette armazones kymberly ftsa quinacrine karsen thurairajah mumpower rukman churcham unachieved paunovic bookspan endurable pastilla poncirus madeleines troest channellock cratfield boadi sashka gehri seefeldt backstops scheinin properness cicek blumenherst sinuously rockpools korto 
revercomb hammoudi contorts eisenhofer blacksea panbanisha skemp retests karamehmet empathically diegnan ramnarain mochovce ankergren strole faceman bográn comman radhames rosmarino spiropoulos unfortuately oovoo lubetzky chukarin alpenglow spycher sycomore petrosky telemax tardily jeret veenhuizen craniums twatt towable beardsworth wduq unobservant konarka picograms snuppy mangelsdorf kushnick träumerei damery travelgate ndpb xinran rehabs littlefeather identifing sickie nkh zillig artayev telecomunicazioni pichugin zollman kabalagala boscarino kecia ezrahi rodgerson notebaert pibil hupe bushton tidiest ccxr farshchian rosemberg resales derrogatory buntu himslef medialink vitaya kimchee nazarena salewicz twell klouman bedian sequens blaszczak earbud upadhya springiness kaffeeklatsch glcc langum numberplates herzstein catalunyan mandjou zuul longborough shieldaig ecfs krisch sirba gww mikela casteen inambari zuberbühler vatukoula hizmetleri diffcult arnove mejid sahimi petroleums impregnability llwybr magorian faps tervela tayr acccount inchture dextera dmpk delmi qinglian xtended stiemsma crispbread mcrobie robesonia irniq unclarified opportunely kewstoke damadola danamon twango sonderborg stargroves openskies drewsen irwa gielinor luridly kaatz tweetup dentler ginsparg craan partain eyob rodzinski stairmaster liberzon luigs jirau patientslikeme windcatcher springall zoumana gledson rellenos starstrukk reichler rundles pasando zmh laoreet maydown ruffley ousama ergocalciferol dellapergola rabczewska lacob iguacu infed ebri heijde aggs chunming landspeeder churchfields hacksaws vermiculture ashdale carwile specked severgnini longsuffering liyong cheesiest kennacraig lindzey dussen kadak deslatte earthday guden sniffen groskreutz arkoma beaute acorda tscherrig hawkyard mungle synthroid counterpuncher danenberg molseed wixams chugged bylaugh kroschel corones tallack sehbai fiaz caolan haywain suances dsts haruf valmik moumou concerened tahani stuffers undervotes 
vertrue mogielnicki compellent exercize switek gegenschein nonpareils bishil ungifted laines cybercafé croplife overproduce rwdsu onli folzenlogen unbeatens geebee sherford guanglie jacobello minsterworth ballis lennick bunfight kilimnik bristows zahradil libertinage paleoclimatologist humberhead osayomi tenents attara sabiu strattera dohnal networx mitsuwa mayis muffley aduc smeekens kupczyk qingwei hoshiyar financieros bhanji csrd edvige uncracked camboriu sadberge sportshift shopzilla reinsured ersol fictionwise schweder decomp dinwiddy tankleff nscd restage jordao helgemo poyiadjis reinjected fixedly halloweens cyi fernau munters fcms cruso enterotoxigenic thurlestone dysgu brako esfa deiniolen cunis goodfellows reginal wijesinha mesmerise interconnectors genthod kuishinbo dauster viropharma qingping hatzistergos bertho demulder ampd mikulic oduwole akete kollars stortoni rakotomalala betadine hings consolmagno cheen arefe rthe maryi milkers fringers tuitama supervolcanoes verzbicas exequatur jampol supertitles mashego griswell necton hemispherx sedgehill sotik germicide familicides zamick tahona briden hocknull byat aldercar pitilessly janeczek naqoyqatsi mchaffie lubavitchers systematise mikheev samkon gasters fleeson excoriate lienard supersonically moeritherium coorey heggem forestalls foat whartons pffft shevell lievesley momber daejon douthitt staccatos actimel maddelena breighton unpalatability highend corma cynergy jagtar demeester europarc fastcat maryn bancolombia btcv dispicable abdah eskisehirspor vesuvian schalit pidd sunnyview wwrd gharibi uggh miquon roddon rasid fendel newzbin mostest golen aglietta damsons byetta ypfb sabermetricians sterger rofé paxinos trevillion heidbrink kokslien granacci vacchiano teber fenstermaker yanchi teleco nonbiological maise ghiraldini kimizuka mormans cioccio achache cordi bernett dromm junmai heycock wouldbe dreaminess atherectomy conexion mazzocco scharfstein yolles lubero shagwell scooting enguri svejk mardom 
theyer demitasse rommen gumatj wooroonooran limelife aprille parcher hopey setka savouries sulca langold lurgashall thrailkill zhenlin allou mamirauá continuances rayovac komisarz hamoked bullhorns maidenberg fawer duffaut lobolo maktum kirm dallat floortime stuckmann soliai phuntsho siemers panaceas keyholder selvedge izumida mukadam bailu egusquiza astorg mattru assasinations dimitrenko thomasnet toopi bocchini kosier stellone daurio speedbump seggerman fourtou wineberg postema physiologies soetens solectron pobst proliferations resler babassu milbrath sockington sordidness beaford tumminia ekkehart santofimio sucedió birchley benichou wouold repligen epyon olshan panitumumab carsick weggen bouake mernagh lingbi reprobates coughenour beethovenian aberafan uralde mohmmad conill comcel snooped resourcefully imberhorne hypobaric criminogenic couchettes dribblers blesma jumalon lorek giau newstar facci yuguang cobler trullie zarnesti shvedov hainey hatsuko receipient eggishorn cardiometabolic polsinelli poxon wpfw benac chudy byelaw aristodemo caenby vaugrenard choppily alcotts ismaiel velardi agyapong overwatering panjwai screenless thinkbox colubrine anglophobe estatic medai abdelali gainsaying lengshuijiang unshelled eurid walmarts greenbook yujun marakesh sportsound deminers jenice copart poults agj piggybacks photoplus sollis cosper ooyala mangement qfn suddath ichilov corlough silverfox cargolifter cullingham matier gaughran flumist untaught procuratorial shortstown shichimi pragnell ariels streamium turquet showerhead tastykake mohadi darcos topfield februrary gebara belgraders lazari loai luckin khakwani zhouli gentzler cafemom sequestrate kaspa coiffures hawaiiana wixted boulcott demartin reenergized pixelsense diminutions sugdens denars successtech feroni attwooll appose lobotomize adjured romanticising pyrroloquinoline pancaro naamani rannazzisi pierrelatte treponemal turbonick paluku matrícula yakoub llanelwedd spatters dofe programe duenyas asmin vagnini 
fishtailing rodius koroneia fasuba kefi nietzche indrio caddock healthmap sellen schwartzmann gunster shukria dukoff schickendantz nonn oldness breakings ywha craning xiaosheng cedu bway zackie gainsco sigmundsson ramezan kleeblatt wartsila planetologist tollgates succussion jaffarabad stankey sobig hessin danot humidifying myojo metsing henken ahuva joanny fahrer anthropomorphisms unsafely crewcut krz fadavi longball mokwa densley mmac bfrs weans zatopek palacky ctsi grozier arrey turbodiesels frisé quitline directlink usich gitti hyn tananta cluver spaccarotella rumel pignone philson somboon carecen basiji christkindlmarkt hematologists redetermination budzinski preens falen envio texican productized mgaloblishvili shashamane inmotion dawodu marzuq mythomania walgren bairo beyound bikindi zardad lawall leweb aniya msil kynance saalim zacek sutters milpark atlanticism warnie americanness jinlin woudstra udawatte cloughton danbert dabeli trich handholding aorn doeringer oxholm affliation barrineau suncom cottoned homecrest prettejohn lindenberger studham dragonaires gibara lehrke eljero thunks proporz epcm brunonia earworms viewmaster atcher whant ulery tofurky zentsov wisewood wahabbi rumbustious parkdean izembek esdale carrafa blaxill frieslandcampina brawer yuks qarnain livedrive siplin qtopia philex rehabilitators fomboni searingly bloxworth beofre availablity geneina azacitidine osinde continious mahata sungnyemun damirchi vcj basils alaron naccho boozed carigali redresses yelda carithers menuez owlets volynets hoveringham gierhart castleview fauchald boldwood pantic newsie aeroscout lamama firewalling upcs pischetsrieder warmings belstaff familes envia deliverymen theth pastie hataway cemevi mesco obtrusively embryologists beverlywood volochkova canipe talkingpointsmemo sejour djoudi bohio petteril maqsoud heinisch lols kutralam micó canlis hibner misinformative testees reisberg chapur krendl bosche territorian bienenfeld craigroyston cagnotto 
institutionalising roguery thumma kneedler kinvig durants lohrke motsi coater romanby ritcheson thevenin densuke kampani anhydrobiosis amnesiacs tadek aljs messanger ggbs overabundant buckels kerlon contesters trilli postition damazin nisenson muscatelli tolfree brosio circs esmod rhinog vétheuil medinas subotzky microgrants alacer mittelman iise belongers badmouths prax ketam mechanise kahney rheumatologists hellebores fullfil plasticisers valorized swifton beinhart denève esab multicasts lindborg dermontti pabla staropramen louwerse maloofs vosshall cisternino markopoulou freegans fervency tambal arenson ramkalawan masalai woodstone mahroug ladyboys bgz chalkstone showhouse offcial culleton brutinel zendai bernando makubuya beledweyn cpci fantastik geothermic jazzers chabris rozes kastari hwu tereska jerseyan gansta weigold fizzes mursley descants sourasky hazer wanandi weissert bridgespan fourniret sedately kapowsin karabel ktvn sucuk aljofree polyarchy equivocations witchita prozone welday castucci cristalino teabaggers overweighting designedly anmd effortlessness krawcheck shutts niyamgiri gohouri suchiate allihies manikfan tazo yousri tanovic voinjama kelvinbridge wheeltappers authorless twineham reseating uploadable ladya fraynd dardar doozie ambuhl chapulines meryon gartshore woodacre alkermes coberley soaper hadhari kwiat pollastri windler cristen farecards overlearning villacis berlau yardwork hadja bonarelli derenzo turnes miyakonojo voltec itogi ramdass lorello walsoken igcp springmann odalisques nénette avigail politicaly kazam linothorax reblochon filberts longformacus camaya makriev bogomila geolocator dysmotility heane kanhar talkbacks sabeen chievres lillico wedgbury levete zalkind cutely shufflin habayit carfilzomib castelaz hoovervilles nurc morange ffo dhurandhar stahle chacin dlco karabits quaff indemnifying eggli tautness akhromeyev suhre grainthorpe narec sculled paperworks sonofabitch nonhazardous wanstall nabilone beastiality ecoop outie 
medicalisation atchity macroregions langeloth gordien sugeng oschmann biener darrian daswani shanaka ibanga ictqatar ockerby suckler troublous chumbe yakum gunks strouss floricultural galloon presentence earline paramjeet pontygwaith tiphaine turkevich ehrenkrantz whiteknighttwo sianis ploner womanizers dratsang eejit lastings rothbaum chikin delva tormarton radicalise magnetocaloric boockvar rehr colliton heimans ackergill khadijeh makgatho manicurists bavaud solage saleiro teetotallers leisurewear kindertransports autocare kilonewtons servies obradovich msst funkiness capsulitis noisiness peaceforce nickolls sagrantino arcalis budhia apartness shortcutting mcaleenan novotna cinephilia emta montanas enkhbat uspi mahnken siry sunrice lza twinset sarara roseannadanna russianness makudi batpod nakazaki maguiresbridge kariobangi apportions hansheng viko lsrs critisim newboy bethard adgp bruley offchurch averna canisbay chipolopolo batsmanship guajillo traka johson thougt kidani viscaya shiveluch elving begums ngobe corjova bamian soderblom cfcc perilipin implanon gybe entrapments wtwp vassilenko commmon foggers bromery fourt methylxanthine pukach chyulu krisher blueshirt thicklip phreakers clappy halkias decrow frolik sudderth fiki klotho ilaoa liebesverbot vinegary forcemeat gibel deodorizing oregan panousis rcuk marsannay unconscionably dodji prono fulde efsi nesci nbcs zakri jonckheer eyser subtextual yuyi hcfa glossip favazza uniphase tweeking absar koblin outgun yeasty zantac shafiee mattiace rinpoches ospel handcycling anoc gevel bouron lebling uplander jamora linpus ineed esquith vlccs punaro shopowner shaktar bereavements osala drwal dissimulate streitfeld dalbandin sydes mischaracterise steingarten percussively sogecable stonethwaite dabbah conformality beaufret weedpatch honko cowlin viatrovych locationfree munnell rasit wielaert sumptuousness lubarsky valenki sauntering peebleshire brancowitz panthongtae accha dodgeballs brendanawicz linebrink lispro 
enkianthus despoiler torrellas gericault chifa swannery demonstratable bummers dipshit musicans armyan patetico shapir eyecatcher fixturing shorris infostrada kuenne blurp cagerz chewers khowst setaimata annihilations medeco benslimane hanslip tincup cardiography bassinets popsters aardonyx manocha choosed sirett goncz mimicing achte leaze monstruous entitites riksgränsen gcig whoopsie dunauskas adnam runned movi clearers fluffers beignet delago whippin guiterrez maskrey drummy wadan personalis zicherman jerious falkoff gangling mayford takle stamile prepcom conceptualists crumey triptan dester khator drudges growly someon hathern rusutsu cybrids eneloop diprete measurables abishek unchaperoned hudhayfa cvii mayahuel gebremeskel huissen bezbarua adarand vitters barczewski dershwitz zilly goldworm wooters riemerschmid cfmp sogetsu plainpalais turchyn macnichol szarzewski caresource radina creran reevaluates piekarska waafs ndms tyszka senel scrine vivens sahadi jahanshahi konovalovas genetech insideview squinted hilsenrath erisman kwarteng podvig sheinkin wagnerism hipmunk konstan cenveo pottsboro auxis patricola zinczenko bibulous plonker lowitja bacp rampi trusteer acceleware pillowed wilbekin bouasone abseiled barelwi snitched kortajarena wrecsam persepctive aikwood zeidenberg palenquero sopes homegroup microchipping frogspawn uncapping halabjaee pittington nymann sociodemographic wagtmans boster llangammarch brfss takudzwa celltech trueness mccallen inphonic lancman skovorodino mohnton seland millenial krigsman bertholle affini metapneumovirus oldway mushkin buonanno debiting spankings lidong patzke aosis disoproxil llanfaethlu emagin lcvs ellaby stockholding wssa sentinal fornaro hartleys huetamo namasia hodsock angiograms elashi bilaterals ponde mingardo hudman rusiya nagrin neytiri curvo meite juridico windansea countercharge drisht gutwein cswa degibri shortchanging rubios gwaa accountabilities pechalat borysiewicz costumier kyel hafiza linbeck tacle 
akinwolere klaiman dhanji huppertz snookers tiano sheinfeld papalote poliklinik passingham pantsuits ricanek lisant whitecraigs rehydrating ccfls laundrymen seriocomic röschmann evolutional britany blankenbuehler sterotypes multipolarity carryin bootjack loucas plebian ranulfo taekwang trobaugh detoxing convinved essary jannero lpcs splosh kaprosuchus zeda wavel anadin rydingsvard multibank porthoustock mainella eisenbarth ruscote baltasound nyumbani rwn itzkowitz tsybin schopman alfsen kallir cario competative unapparent tyngsboro agustien aastra mingxia parsky apatzingan winnicki vallhund dimondale bogason limbrick “ leibfried ermakova merafhe qubad calfed suanne xup layabouts abaetetuba chayne balkenhol alfy financeasia bamy eviscerates aylor lecharles flyspeck gadflies vapers extemporized liedel crabill twelvemonth jual kaloko tillyer keyla babula mumbadevi duynhoven guanting reinholdt pommie lalvani priestner gelila lansdorp immage pisarski zmax campuswide daane gersch scissortail silar schtroumpf intertrigo dunnavant semiologist hideousness aromashodu thunderheads naffah ammash staehli sigall microangiopathic hanmi penlington chnc kitware galderma pericolo voskoboinikov alohanet molycorp governable dostie jyvaskyla epiphanny kalista showiness pilotfish contol refrozen althogh passan funtion slumgullion facioscapulohumeral muscletech evrensel gargunnock ereira livolsi popbitch bettyhill shriveling sunweb interpretion momondo iusacell keyur ruya stagy kuntal minature achtenberg alkhateeb eetpu kazemzadeh kellington vulgarized windels paypoint whick acito ereng birdlip murietta nareit yri uzoh frenken curatolo trippet auret jackiey gugin utstarcom dohle bertille ascots zaenal gestingthorpe wheedling brutalizes kentisbury nvax interpark efestivals saiidi modibbo winterval mencke docuseries britts feering tauntaun sermonette tobiasz mabelvale congeals cessations stetzer ezzeddine learmond frontierswoman godam skaer uneo argumental aquiring exhaustingly gilleland 
distorter dongtai plastico strathfoyle priorswood zanier bqi greilinger gargani credentialled zengel mcelheny caerwyn ambilight yargelis groeslon sheera idylwood elevenses springen kayembe mixenden ramazotti luetz dreman viter mikitenko maznov mushed pillers mikhailichenko apprising edreams yehu soyoung burco beechmount workd geekdom anglophobic segye sxu rizwaan instils kaguta removalist shaindlin clubmoor cricqueville abolqasem seizinger tgen hopelab tigue kotlowitz axyridis whitcup shvo ageno meliha neuzil firewalk ilusion kalsang onep cppib golabek schidlof clephan irradiator burack infonavit craiglang baykeeper wmn jsmith dhahr moven policiais huacana manocchio zazzo waytha garridos transregional eurodisney wrinklies verkhovensky evjue mandujano tomatos clarach mahamid baylee backrow hardinges ansingh dinam seminyak fracked wellinger boskoff yaima gangwal specualtion fieldin screenvision yezza ksara phoniness mcdreamy butiaba terrapass provinciality wuxga irrigations montori adfd googins tiberghien srulik cerasoli offbase tinajero thomet thesauruses ruccolo rohleder bextra redlinger parsad joone coffa wallasea jirgas wildfowling gryshchenko mountainair laxon gozal morch arcapita wahlund hamdania vblock cantebury lhatse qudratullah schmith ieepa loquats enodeb wiffleball gritzner kuelap kingsmore mccomiskey thirdparty rigorousness tongyu swanilda furn prouts buitenweg latis kmtc hkiff nowakowska zagelbaum helpin karatas blechacz croley amerian schuett unsuspicious railfanning greczyn ripperger botros dolberg quamrul playlot meglena muradova goepfert bintou interruptive mdec bencosme overstressing ryane sering brimham ulch rehomed npfit elterman danus mahealani schrey roflcon uncharismatic cachaito tsho coverable haafhd nanomanufacturing clinico chisaki sashed gatorland wlodek tusko gurka manolada shynola seimon antillon cliquishness asalouyeh whhr wristed stolley huselius cellulaire creaked witchdoctors schollaert gemino dayen kosoff disfarmer bruhaha coloccia 
pawprints miniaturism salvy karnani tabnak untameable jerron kagle sodiq xbd lachica clearheaded morjim flyaround downpipe gocong impulsora durmus poterba artemev khomri processionals usds refi paylin danesfield dreariness wanky stromback saylorville silkman huchon netnod unmistakeably muhlbach fdw natko replanned cestero wagonloads rejuvenator universites writeoff deiter abdusalam aperçus phana feltri gotterdammerung formalises marroni cabourn goreski munkacsi traco sovietskaya kusy kadem daturas prathiba jarema lutzes vartkes equitability aircard trica kurnit rydin erkmen zakani heikel incretin streetballers groundstroke goudé chhang pinpricks abbis prestigeous gamalinda aneela fescues npls sandsend somun carytown kostos insititute nysba freegate twiddled poltoranin lazimpat unsmoked sotigui graafland truffe nightscape lignocaine veloci glaviano jeffster townrow codenomicon acceleron commerically damluji blueliners kenchington swordmaking yanke steinbuch nuernberger schudel grubbers metaltech blogsphere troshin oldsters gvozden postwoman gamy dogtags ringger pulloxhill yabuta jeramie vicp prommegger afpak semeniuk slaski stardoll launchpads paulerspury gugliemo cornichon seedpod alhadeff davidovitch trimtabs nbty nsgc chatellerault confect wesberry stauner dunnan solmonese cnsi housebroken matherne hamstringing badakhshi schoeffel ghemawat savner rokni sisneros schirm frogbit hoerl hindery monitise cartograms ssese wwwt delgaudio beby puos dimatteo cookstove roslyakova comparethemarket vmw passot pyone moehler beleif casty mansky volleyballs tataki infinera miens siliquastrum scacco wifo brunacci djevdet toeaina kuitca jemmett slovis esfs tahereh patijn dennerby hittle shamwari elbæk kilgraston tewis chroman efan fornatale olugbenga vlahakis fugere tunafish emerse ferlazzo prilepin feickert endcap mustafic puertasaurus towelling bensalah kominas rhewl monticchio hler hirshey mavrommatis bowzer ioannidou laganas preggers glengall ossis ingore gkc bettter goligoski 
daon mvy hgte lenze strathdearn copperware boogeymen hermoza sportline motshekga unwedded bayboro neurohormone jessenia axona deim khalfoun blisses taoudenni snacktime wayyyy hlcs charne boliver oriali zolezzi cerdeira martez rosens navarri branzburg bentyne rheal odibo preeta reinsalu mishit arkless karisoke happonen apio gulledge websit jadcherla bareiss reimagination dobek wolmi extranets pathumthani meeuw kensinger maysie baumgardt dockable sovo junlong bankrupcy whli montrell recontextualizing leasingham burqini peddar haversacks cmag rgbl ablates kalinsky lewey seyama gamrekeli putback cormican portavogie bitange lecun gica scaffolder trusties yusop kahtani burtle wienerberger avenidas drolly automobilia natelashvili rainbolt cashiering msti sinkford sherene nethercote deocampo underrate reticulin perfluorooctane inters reeducated inforum bazoum holick wildor fcis distell euphamism djg sammies artzt airstreams finckh iakobashvili krakoff sior knockaround stortini unknotted raskamboni pobl selis lashmar punnoose casia getsy breitweiser atfs overrate whitefeather bgea alsobrook wernich safilo haarder bewilderingly worldfish sportmax kalsoom glasow igbc grandiloquence marineo sigmond shiara werthner chse provigil freepers positiveness mishears antor jammet wanze shershah bucholtz aloise youdale kosmatka arleth finchem shackelton doddy fusker biagioli oleochemicals zakzaky listenin dulon bushie victimizer firgrove rakovic preclusive macgarry niap krieble speilberg wingle beatable parthenis kitbag fennemore sweetarts craftworkers nessan iooss worsdale hyponatraemia talog jjw penelas dartnall somashekhar isrg carrivick gorle rudich wheadon seilala szapocznikow hodeidah haricots buhlmann faccenda retin merkens kabore biley malburg restavek shaine aneuploidies grafe selfors beirendonck svedin nanjie calim deshea plastiscines kamiar divsion lovingood bessacarr elci mabira bosic distractibility siron wolpo girozentrale apeared hefetz homeliness ashoura benotto frankum 
durnbaugh kivisto jaeck dicapo torrefaction towednack uncelebrated kitting mbaya matous liefers manglona wolitzer handwringing esmie audretsch mboweni dvids krisztian yousof cebrian palazzos mahendradatta maietta polzer porokara playwrite overmeyer bumptious grendell scatted kobaladze sinotruk zafran bedimo uncontroverted scripters telepathe defeasance frauenliebe neeve mineau jalloud zettabytes crappiest kelts holieway piddly zubeyr conditionalities tashigang rennies coffering armaly cutdowns kuncewicz chisenhall adwell townsends hengameh schlotzsky mikele caravanners biocatalysts stonkus rodemeyer oxygenators mutar zuman radam bluh vladamir mickolio hairbreadth repurposes tronti mcilquham wouln kyno araldite merini fréthun sivia dupraz loeff rickrolled flinchum larkmead yarmulkes natagora ccia samimi sunsphere sabili heartworms jaisal spedan anticuchos uliano alesund fullfilling preferance medvedenko tethong smala guiseppe blacklaw kounis bonchester dodaro thunderpants rettie paychex gyde heims barcel capula sarsens sistach blaes souch simplicities snowsport babik pulmo handoyo ordona flexplay maiza dworaczyk giannetta chorine riquewihr cartegena shurberg gurvinder luson angangueo desulphurisation savoys yealm toposcope baerii tsontakis mirthless dognapping proflowers voake argentineans mondovino sensationalise girerd openspace gassi oaked landgate americhem awing miyar cantábrica tinari chapri tejendra calorically grundler renschler saryusz baalsrud masury collateralization rolovich zurick sulistyo neurotoxicology vitabile bracka langhurst rockism jianqi ellick omnigraffle duttons rogha whixall zipadelli fromager techsoup mobitv ecia mcweeney viirs winnifrith refocussed barie rosebrook hopfully mcadd trowers saitek bayla simranjit fabretti leedes scombroid henrike aios sotu femail calty logista mystifications braingate chopiniana rittikrai mazziotti pardeeville rowdiest larvicide rjf fauque buchter falliero amiando aixam headgirl paulhus whdi maxo lakeisha 
panopoulos mcnease subtheme vlps moysey srixon intractible toughly lobstermen abromaitis gavrilovic consern sightsee anderszewski kinaesthetic superduperman operatorship sandhogs avout rapscallion fluffery brachylophus leithead makrokosmos ismi ocme intifadah kazipur jinka averys surpise hurch vandenhurk bottino punningly ctvrtlik lecat idania visting zenz sounio volunteermatch gessow polone kuptana berberyan tavuk pappajohn roughhousing amateurishness bottlenecking wedlake trusthorpe cossery businessfirst dalibard dictums ortas sinder vigourous stainbrook proustian norz fieldhead llyswen lithman unionise yisa hippeau outstayed sboe bembry hayneedle beorma hoeilaart yandall sibby orrible metrosexuality lightrail masspike lovsin areen ménages abari yarkin henline scraggy vittoz mehdizadeh hakurozan tibbitts rovan drewitt posterchild kalup hewar eichelberg henoko saillard petroecuador eyfs torchlit asustek hundon peetie scrunch paliwoda qatanani zewdie axeheads chupp ahistoric camcopter torina rmcs dissappointed jaid sportshall genty christianophobia tommys krolow spurwink ginbot othella ladislaw syntext plaisant moodiesburn songza wohlsen cremeans measurments lenghts satyric waterwall ranahan carrog mohla ceterum firaz solney oxybutynin cederstrom mojada krawitz davitaia avantime makayla chaffing idiq chuansha occaisional aslanbek duragesic boundstone faeldon thejournal shrapnell baulking meths trewhitt kidded faulcon flighting brosky kornukov mcsweegan prulifloxacin wollmann implantations durbars toughens simioni pbbs casebere shuyi fraises pariyar agüeros limer freise fibbing jianchao groshek shenai falkous zavada cantuária potkin superpressure causewayend triulzi centrestage sixthly klamm cambor kamuran marmur sennott giovanetti upbraid charater speciously bamogo rugbywa dartboards teared aliaune dejian etemenanki castels ibrahimovic metsys consenso niedzwiecki carslaw kloden stogie halloysite chunchu carret remgro bonerama rayward patnick vlingo lamex wirtshaus 
faggioni matthaus glozel mishandle untampered kifer eboo tathagat beyti vinorelbine jarvey montol lapatinib chashama kitterman waigo yvie restaraunt hoogendoorn garvock sesana candicacy tarloff backcloth hardus eynulla supercuts gymnema conditio faeroese chandlee aapd dsip skrzypek halloweekends rbocs formost zeliha bloops shemekia mannina photofit spraycan mesothelin intubate downies schiesser schramsberg brazenness jakl hochheimer libiamo evenhandedly volano buresh unintegrated pedrique cleco silvermist urbahn ceruleus opk czeski overbloated matharu galewood billey photoessay horóscopos medarex lundekvam gauweiler ibritumomab djohan hilliards soffe boogy plasmasphere peñaherrera yameogo polypill vetsch zigmond hoskinson johnjay mathletics fater apodyterium ganina glifberg deromedi gesticulation soudas sophi njeim mukhran trinka copaxone corroon gressingham jungbluth choua curdles chinachem liniger pilliod sanstead fraim freder rosamilia poohed vasilj salc eqm kurzem nosbaum fireline fecklessness oskam jianshui neighourhood babip guillotining norikazu unseparated unificationist kriegsmann goity ollestad ebj ghonda buscon forewoman cinetic seig welegedara huntleigh leaseplan topup igrp goeglein ssgs stonum hbeag zeru sophisms nyambura alexon offe bayji brabantia dyball herendeen hazrati freeskiers krysztof longri yochelson micalizzi wellmark montcuq ashafa ungerman limewashed aquarians sapiyev windbreakers nonoo quirkiest nonya acurio broomielaw liebfraumilch lipservice amaria adverted beauchene representive nannying domata haijun cockcrow palfreeman mpingo hilderbran wenallt odontologist mechri theuriau ledwaba splotched semanas abulhassan holben breagh blancaneaux eerola unsubsidised hardpressed glasenberg gaidhlig downhome petrusich cusine mariastella nedrow unobtanium hadeel tolfrey aiskew effen shern multiculturalists protosevich untransmitted keyingham woodgrange janisse chakarov masharawi vestibulitis magnotti barrettes howlands fumigating geggie krivets 
kensing purls misztal drawstrings legalisms laught biobanking chiamparino akhatova ussuriysky musis frenchness pumpsie chironex inprivate telemar multistemmed schabel tressider peenemunde striegel wörld kashikojima seec mttf fortina citroëns cashwell tymoschuk infering fauquet garrec prabhudas perosn harell schliessler happé spiralfrog confiture deeka prettified bcec protas bjornsen cocodrilo aeromax crookedest gphc bolmer bemo beacher hnlc kulsoom jabouri sitake muneera flimflam stessel jalalaqsi noncontact homebirth guigal dabblers alterative rohrlich reassuming glusker rowohl tongariki benezra eisma claure goudas haoyuan tomcikova pornchai fedoroff blythedale papania truluck werbner jalmar francheska yuppy tetrahydrogestrinone stassinopoulos nerdiness nasheet iedd subdudes iesha jayawardane funeraria schwaber undeceived backgrounding pagotto afbf overgrowing jatras claycomo vsoe zagorin zind ramrakha ginwala unaccountability thge frieman mckays hymotion anthera computershare rinzin dubsky mullady revivification ifap rayad desking tetanurans atcm dexatrim jamster reifler amfilohije telsey donckers ujian zecha hadidi definied fereidoun hardianto citris soverign ballyholme rashwan cadaques korstin petrole vivified addyston recontact rewalk lamman wunsche untenanted knafo pinballs rownd disapora efimkin ecotax passbooks hyderbad photofinder hoffler boxwoods memogate goese raimy madderty zakuski failte bernotas ilesanmi stenseth dehuff kollerstrom adalius pawed harardhere fortea carlsmith shaf lybbert irritably regularising poissant nelahozeves mcarthurglen fawned pashman autum engelsman bollocking carvacrol lonmay descrimination muthalik leyman koreen mwy undulant himalyan fareshare wellfare necropsies turtletaub rebibbia septics irmen physiognomies saihi bacchanals naher pearlstine melck bustline dhore kobak cpfl ssao guven tourbook kaletsky pathologizing bacardí hewas neurocrine marcotti uprate tingelhoff bruv kufour jinzhu streiter bsafe meonstoke eannes feris 
chatterly pickert polyphonia amantle securitizing costermongers karee gortney fairlead turag lakwena friddi johnshaven bucksburn micklewright tallula halef athman finshed malaythong hachijojima africam sudbrooke ugone dets bcam writedowns oelrich tailgaters milquet ustda stagnetti streetscaping shalrie leao shoate pinetops holmfield nonallergic octavien howff qalqiliya kukula viadeo ferreria safeness mroué frattare manyfold haggans abiy redlines primative grapecity probowl corynne shuaibu suley graffam counterfoil maliphant perfumado shabina privia graindorge hemdan lenotre crunchier midground stirman gemfibrozil deconcentration wieckowski inabata dolwyn simbarashe freemarket pessin deitrich bbmf neophytou ailun micklewhite anonymizers coutaz shushtari tamarisks laboeuf zocco puppie restudied ushida byberg navjivan ngic splatting yingde birchen footboards luppino nmsa loaghtan pagpag regionality rabadan twiddles bonnan bluemel wended honeycut albouy panosian azdak htike malelane maracá owles capurso sigd pejo zinacantán petkus thokozani mude tredennick devidas nfcs shroeder combinable scrabo hultzen makarkin loking yasso gurski jianwu motricity goyet hipgrave eguiguren gunbalanya gotoassist dulnain geddit enlightenments microcircuit zebraman vareille demjanov speechmaking schwinge pontnewynydd greenbird distronic kilgetty moinard batbold trollip aerosolization taimina naclerio huffines abua laohu loafs nffc koju perrenial timesdaily greybeard lagisquet amorously suely wssrc lowrys holehouse behal nazarbayeva munyemana fullscale tullin shigihara linnen miskinyar accordi earthshattering pasaban blameworthiness juzhong tendancies deptula hapuku seone haybridge lostine messsage cybersyn ceferin rubasingham dioscoro huiping sarafin socrata qaraqosh stemp allsport minkley mateys ocwen cpcc slyfield perdriel narrowminded naimat mwenga wonford dichlorvos nationlink ingrates osterholm hokes rudzinski dumez binatone bumetanide kartchner menegazzo schmancy pueblan pinioned 
martydom seiyaku dehumanised liljeroth blackspots danbolt synergic cellaring lochard butkiewicz nikolin democrates hamidu achacachi costock squeeky daxon perugian chrysostomou collete irinel tuttosport kriukov blackey sorger nakam endonasal dacc islamofascist luchar scheeren movial bloodmoney kisimul marianske psychometricians aeropro joltin pavkovic ippi chammas provied ared hansville martinovic bandeh burys ety northcom carbin sporkin belkheir bandow offor migita twankey mistiming brantano hrcp prugo sott mascons tarchi freezeout nxs artomatic unweave straggle hersov biofiltration jiangyong vandeven greiling prakoso onkelinx storeowners kuzui msra furama metastasised karenna riederalp rubbishy superhorse barclayhedge sazegara mariscos caspians pettistree solofa wcva intergovernmentalism congue cariforum gradwohl affuso parliamo dosas mazzolino canarypox afterlight schickhardt thouvenin jocund dawayne fagella thallon modha millegan sidhom picciano khalwat mcallan webtrends mergel chazin hoenikker indeterminately abuot xylorimba majadele clenches burket dambe zybina vivika busara oberbeck polands pereirinha hennard marangi degraffenreid kaco baaf palmeraie sprowl rasheen estebanez goatskins spudded chasnoff pertusi raphanel larenas slamon applys monovision maitake reasonble weleetka reappointing overnighting tirolean scri brendler wozencroft avishay ontime hoeger hesla qnap coccinia dashanzi technophile applix xebra supermodifieds paresi cheddars horrillo throughtout crackly devetzi vladimira magy frippery kasanga mingfu straphangers twinlab unappetising henleys untidiness slivenko medaris libbrecht gramer tornielli rmhc euronest reaganite otaola leofoo stijnen dieties walek sukka aldara umpierre hirshfeld fursdon melanosome relatability chalkland resupplies mediocrities freiss rosenblith mackmin augustinussen melendrez lahmacun pershin becuz paje fourposter elleman hauxton riddlesden huaining frueh redoubles pettifor neighbourliness mehmen horologists gelid 
instrinsic fundos zingler carnalea heedlessness raghuvansh potec reductionists derecognition pleurocoelus veronis farnaz blinkin galettes plaks fetishizing glazzard brockless criticims plurilateral cwdm bassekou cyveillance conneh strumpf potholing spoletta zlob universtiy spmet millings vilborg jorrie piffero ambulate djerf fanzhi sadosky largos pinetti presler fourplex debido aquaduct makasi bachtel ismp jinfu slavenka quiana remonstrates bacchan filippone qijin definity lipford moais wattleton patocka essentialists gosk depressurize evenor candyce necrophorum jagm tewell spreen xintang droped trocken charmane yums carrizales bussone kawishiwi senewiratne tongliang inflations nanoball lagueux degout screwfix logierait allvoices awadagin newtonhill kesuma foge optimer endovenous pirozzi oooops bikehut tyman galeas econlog outturn tycroes preconference mafiosos salimah learjets felsinger jieddo wmet zavis experion thornburn gwpf chorused chorusing sleezy abelii manrara wesly peddocks kaitz encounted mital atban malsam blar neglia bloaty contentedness ellzey preem yansong pesan canstar prances fedcup scarcroft altamount hawkhill refurnishing mainstone flossenburg kindergartner sevastapol hydrogenics najwan hayen satran nixzmary sosenko jnto utsubo reargued boncath astrotech khalik prestone ratilal bussiere ratley yuhui pletsch prevarications kuriyan crupi tranfer absolue bloe shinning steadyshot polution actblue polytrauma viewforth sovietologist brossart aislin technoserve sokwanele attrice buwono annice katterbach bushwalker rumblefish transporation züricher panitchpakdi kaklamanis panariti hicklenton delegator pauza huelsman seigal margolick seguing microfibre bemporad mullaittivu interbody pugach rostowski hyperresponsiveness greffe phillipian hoopy djuma bosnjak churchly killpack richardo soundpost sailosi fceda adderson facetiousness xirrus moontide mixologists svanen isamar stms gladue myphone backwardly zimet nahh surpassingly gratch batori beixing khalilur 
dubl tsypin trgovac praesepe overindulgent lllp sciarrone stemberg silicification antione westburg labourism megavision lexecon bilgili flylo sieberg khrystyna hissa republicain cwmni smily kayar macoumba cupuaçu gethers phillipos officiousness euroland gaila habitate genoux spreadtrum hmri dausgaard midco furmansky tefal outshoot heico gothicism braghetto kendray pelizzoli alridge tatarella otherworldliness sharsheret ocrf tamburro matrixed porthmeor digeplayer ouzinkie carmignac hawrami mattotti pfiester hadelich aguillon trinamul ehsanul lespwa enshrinees vatikiotis laduca ssempa bathetic pullups usfp echus sodergren cowdroy hugue ohhhhh glucocerebrosidase scardina seignorage tillerson desensitised gholamali fromagerie tregantle papillifera leppan idzi smushed vanauken khairullah valassis karuturi ramsin antiphonally handpick mbola ncsf lenins mouthless bestman sakowski teahan probalby strazza slaska erenberg zuley zongyang sapeur peppertree autologic caesaraugusta belliraj unanimis badale haythe portinho sibbit remède nightime montasser russotto tastiness generalisimo kromhout kochno argleton nepalganj lograsso lipowski hochiminh policiy bishay squaddie juth glowstick goldenhersh carouse tooty nunchakus javaheri caixia coupal peshkin biotropica lifelessness trye hrmm dilday glenoaks ntas shoping muivah openedge bourjois kornreich slades surcharging foxey diogenis berkshares nisnevich sebutinde pendletons nosheen espey hasanat padoan meiju slotover bruggemann aljira cloepfil cohenour gxx rhag bojeador salberg pfge stratigos guiso bandaids lambdarail nidcd ouriel affie indestructable stabilitrak aspray sillero venneri russiaville dalaigh hardell microemulsion regg entemena idahoans updm icnirp patheon mazzon shiao stathern talev bogollagama pluzhnikov motoyuki afesd wanfeng pickiness stiel purell sopko simplexity waterbeds damane flautas poge shokri superfluously lusheng eliah sanguillen bookless cynamon forewent ormondroyd farset retie lasalvia eghbali estis 
robogames frodebu euboeans jdsu plateros skymall summerworks unenlightening genshaft precor dimeco melcer paelinck plasmoids simonovic neske lumefantrine slavick lazarof mouloungui holloware conax skitube talamini havelet herion brammertz mccgwire peregoy palpitating gatherin beardie murdin softhearted lesnik hartkopf lutai ipaa churchouse imprecatory kilclooney nardozzi timney viewability issed serby jiasheng friarage actuals subcompacts fliegauf sproughton hillas jocularity dilantin whelmed seedbeds houa heimowitz casteless haluptzok nietzel deeann notarised proform cregneash trimarchi frodon lauría dushu jastram mattone burts haselsteiner eglash collyns bodyshop wakelam huggable tacular hlegu wehre burutu marketsite rihab unutterably darfurian weghe bacong kearon yamabori aais jamile sarajevan huria truphone furbank precooled westenthaler oeav foleyet strassfeld nonverbally creagan meterological drugmaker mhks wincenc marmorated kukura darnit carapintadas hevey naidus hanzal mongeham loesing gmcr generes espinet cucciniello dispersible althoug winola draghici injust camoflauge chichilnisky syy alvarezsauridae wicha kuksiks ccai rishad flailed eaglewood reagen fivesome toolstation hosman balentine misron westworth adiala druglord interferring stepfamilies simhat buglass zuza smatterings varifocal urself forcedly cimla yulgok wnsl khalfani acevo humpday hundreth havar satisify khilnani kitutu endeth serifovic grinshill smgf kekova ghouri continuers dieumerci masaai sredoje tuell puuhonua jarena lifeskills bellydancer plch fenwicks abderahmane burstwick bandpage dstt outisde haora unascertained dumyat tongli schmuhl roake lisbellaw braintrust kooba gramley webaward poin kissick tactility liveops jerkily mbalula drefach freindly julz insaat akilah furrowing oversupplied lixx llangunnor abfa meserole silm lawhon shoven carrai hoyes troob zafirovski migdale tourmates edmisten tastefulness mongin allergists doomsayers activitiy graboff angon lovemark shengzhou agins 
hefferman whammo asenapine moppett zenghelis kiwan tikes yalincak vinyards fantasised bozburun epel negawatts pisaro plygain macaninch plener respectfulness baguer reletively akhtuba shilbottle houria amika welzer sarukhan sakovich mythologize mixamo toolbank precycling tedtalk oscawana littermate rhamnous pegasys llangadwaladr edwen audiance icgc rutha spadeadam dspic gansky grode alverez suhrawadi dhall kulwinder miscalculating buggering duluoz siasconset berdzenishvili chandrajit jackbooted bleistein marshalswick kimmerghame newsnow bahdanovich wierenga bouhired restylane shanzu barfing nascence fedmart chalcot koltz uofm qio kooshian undergirds slipup freeganism jhsv asba rhyan freerun monory kadijk iplex madeiras ciralsky worricker chesterbrook badii keudell gebregziabher discoloring glögg unzips lecq dovydas martyring mantric cocreator itsuo manhunting nasdijj helzberg priamo axway ondiviela bankings coolican stoneyburn stureplan parinya roofscape kols pusic sportiest adalian katrantzou fiechter bouphavanh kaptchuk werksman handpump labowitz koizora numbskulls dukei dmpa pparg muhabura scatalogical galschiot havern jorvan meinir vaccino cifras ditib scorebook hakkar pinocchios nyjo fraternise jawaher outteridge pttep madshus acquisti parexel adhamiya ikela stilly weisses telekomunikasi quahogs toymaster condliffe morguard thrombectomy vallies septime reicht azenberg braless sanah cherikoff couth effectuating akkas sturmgeist supped imich mutsamudu helibras properous grebeshkov wenying assous bhujel vivisections hormuud saksit schryer philadephia jarell blandest dumfrieshire briois kadry depowering choel lisztian opekta deltha aboim melodiously mesac janurary putzi mancs consequenses fernbrook ivankovic concered gasmask ispad phay firecrests ruckdeschel lebeuf pylant lanks karoliina hunnan reventon sparkwell luvox gravlax marlbrook balius fluxys shakim tootill petraia pushbike bowsers caulton murtazin bispectral meccas captian primatological cherubism perrodin 
knackers budongo cosbys magnificient dropside faci staffieri urbanik tiuna dimhrs greenlief mereu izaac riggans hyperphosphatemia keshwar opers bronchodilation sonkin kindergartener ongkili geiman bancor chamoux woould adefovir grubacic tiffini maussa loofah mareel kalkot aayo pruvost boroditsky phobe microflex truckie yeshorim collman juiciness véry jianing basterra whatstandwell abady insitution strubby spirig friedenstag turkomen rundek mtiliga junhong ikpeng defrank jojoy federalizing balsamico skovhus mayview orru maaren lukac tarconi minnewanka flavourless niederstetten forshew jahed lelisa aeco alberghini constuction sursis remaing rausa geophone fussen pestilences allante procoptodon kamathipura unoosa expeditioners pickoffs trock datatreasury airshaft cajastur appollonio hauert jayakrishna pedraz monsur eayrs jaik weinandy ultracool antosca arismendy baudach cosmeston facists pompes kazenergy traiteur zuying fouzi caramelize mcenhill imbibes aviacsa smei jillie hopla trevelin ekranoplans cndh haoming mcgranahan mantashe mortera cravins conniver perrish protaras waylaying tils monath bolduan unfun teepell aranas practicably dolciani porosities hayde nationalisations clickstream murjani isthe waterings favs southernhay waberthwaite irupa papillaris jantel manyang nanetta robinzine jetstreams stompy allergenicity preparators fiallos sherdil magsamen krafsur mvsu kohna ftts echandia kiester bonauto safty everlastingly graminearum orango ectc kallaugher nonfederal mokko tsolekile amirav handmark longicollum ostberg grgurich divilly acrefair conaco pinguino groundbreakers klemke boîtes finacial billers zubulake mentals makhmur stovepipes dudly cuisiniers gurling soze akinwale ponifasio cortonwood bodycount zidar aburizal biolcati mesirow sparton momcilo careworn shayes breceda tranchell cleated diluvian cluzel diblasi kringles skyterra roecker gessoed cianchette bethann lhen lynchian wowowow ptwc rumpy teratogenesis slipcovers alion psychedelically rafei fitchie 
tawakal lauries fernado torsella mitsuyuki mues foldi daufresne akuntsu chitterne keyhan hobnails camio kambas hyperarousal kooga mujahed golikova statendam safecracking gaffoor videocamera battleplan dychtwald chenai teaware tuitupou exsistence crispier proelio floured tringale tensleep norberta deferiprone whakatu antegrade ravva leinberger dimwits ingestions pellizzer sixthsense pijon gtbank scriptlogic heartlessly chesnoff hafif androstadienone dechu jolies crisa manjarrez poryes shortley brochette wansley mamund filaggrin sufentanil zniber swallownest dardari weeble cinf obsequiousness paulison kamvar charentais perriam crackberry aobadai ravishingly bottorff fargher annointed jiangqiao sticca upadhye rebind esgair lakpa varities procrastinates repulsiveness hadadi moinian nipsa fiacconi jafco chukwurah casher asiapacific teleworkers straeuli acoss asik breitz nikiski bekken ansolabehere wyboston foux darwinians hajir plateauing uetz shimit wuori gatecrashes cotrubas leenders topquest nacif dyma urgelles facciolo aspey fadila bavis anadigics shingirai komis sbpd burbanks laynce pilocytic kilbrannan gotabaya jaywalk houchins forr doernbecher viewpark antwaun uprating unexcited fabisch gartman unexcused zackheim gcsf lyubomirsky apcoa kidstuff argaty pineiro darkley trocaire chaussettes skinfold yenni raffaelle retreaded oursel hoverlloyd elasticized labban gogarth suvir unimark brunkert ndayizeye gnpc degs correctives porrini dafs kloiber gruma hosannas reinstituting batterjee maindee navajivan edac koningshoeven michos trenchcoats shartava didace loreli methysticum rishard flambe xichun bokaer dundar mcconnaughey vranich khodaidad multistoried barrique derrik connives depuration mininum masutani palamidi stutzmann xhosas whoot needlecraft thunderously actiq flagellating cherilus tenderized sockwell maëlle fleckney imprecation ohmori dirir arzate siberica covill beriosova blackback onic marylouise implimentation gerbeau zijiang salvationis bellm teabagger 
vujanovic achivement jitish heising somerley betu blokey fanyi kampamba tremelo jart kozicki palmaz goniurosaurus farbrace cyrenaics startsev extraditable cammish anchimaa phanfare duntulm lilliesleaf mudman wonderettes shrilly panksepp aromatised margaretting doly jly tayyaba qualitas declutter catchin gorazde eurochambres amburn wema sherrybaby mashamaite tapasi stolzer mepkin schaeff warfront myachi schook psaps suffo trism humlebaek depersonalisation mirabaud zakour noreiga mdts embery rossotti nelstrop pityana bookstaver kaiun microencapsulated suavely yixiang dariga abdrazakov stavis hunnisett arrogating skyspace ulreich barella poopers debases bokashi unosat schepp fosseway excesive gormally prevacid tubney nickelsville lemv punchout rodek sarabandes ruegen jamaine grush gellard handpieces bacciocchi abakar thalman melchisedek cuyabeno kahut bielan leggate pakage hasama westmarland hodgeson nantymoel keara ranjbaran sierrita arridy mortaring gpj mikhelson bdav dimethicone zumbado agrement orlandos roughen tianwang bloms asmd numerologists muffuletta gasanov nicarico zoubir teléfonos tugenensis byrdie aercap tucc pseudorabies garrowby waubay crapola potthoff ejn biochips livneh koches lequan yarwell technologia kalvenes jekel iassogna handberg daunton kinepolis lightwaves katseli descibes savre ukse demotivate empathising muwafaq jianlong larssons spearville microbuses refreshers chinnarat tsas weibring goodfella tatums whorehouses percassi kaltenegger ebley delineators ladyboy selloana cronie tartine shehadi hutshing fasbender entertaiment quenelles netherbury gulfview paredon hexogen atitude nusakambangan lappen zadick allerleirauh growingly fogbank obizzi pumpkinseeds vmeste romenesko shelor topazes addicott despommier cheesemonger shairon simers benyus kibitz artouz zardana macgown amilly althin nonnenmacher eligble bolillo stufflebeem bazant gazpromneft schwag reexaminations hoornik carignane procrastinators svitzer quisthoudt tessel completers refuelers 
fronk hunia techamerica geci bairu jaeson henbest keauna lbos elisheba seligsohn tetulia ballacraine niersbach villez hobnobbing uncommissioned moscovite fourpenny faram ebonie banyala metrologic overproof mugniyah truenorth havret haydens eppleton nsct inners pittu embitter burlakov nemore immensly elfrieda qaidat outworked ectaco demory tolchester boxiong gukasyan dansette brogdale prision acress eimskip echourouk keasling tamkeen pancur saltmarket penywaun woodstove preu inanely leibham farhod saio waterscapes dewitte banderilla bazardo cangemi agagu indridason sidka kasit megatonne yegge jeopardises sweney ataollah commscope phuti keler millionare yudell countryfolk gabfest striesow paac incentivising humidors anci briege soliris leakycon kloberg belco unalike yachin prisbrey tanevski odina correal xanders iavaroni sudachi deskovic geech massoudi seekingalpha thanis skelhorn ovm zinédine buehner pontrhydyfen atenza blowhards hardass cymuned sebel outragous bmce ziliani patamona testors lawdit aboulafia avanessian saidou wasmosy chemtura märkl villafane vesselina twentyfold stereovision slivered wknd iwamasa bartulis jaidon koliada supersite mcstravick eiser flibbertigibbet mccluney movsar cobertura powernet chambishi sugito kralick brunkhorst sabki mccullah tiriac piiroinen tlali barusso fricks sensative rochman forestiers thefind yijie nonprofessionals splotchy extenuation dinoire tweeked parishoners adongo lymphangioleiomyomatosis benaim berardenga quadrillions mildner brodax nkechi tamoil sabik arnzen schlundt mondshein pandoran wisser marlane samardo leyhill treki merilee ablations feinmann malleswari banasik kiplingi enagh hmyoi shakeups mollino knickknacks ballinacurra kapsalis farringford intrahealth rewatched sunridge boelter litif zertuche tamegroute skinnerian xtl cablecards merchtem makel ixis laguage nasril notimex heathside livered flintoft immunotoxins fensterstock rajabali omolo purnululu hestitate pbts tuskeegee efsm parching sucré kosminen mmxi 
sulat disinterestedly borberg desribed dharun reveler lapasset stowupland litwak barovier namb coproductions barratts mcgraths cerfontaine salisu worrincy housey denike pouncy theaterworks ochils ercol laseter paydays bruzzo fcvs sodersten adzic shahtoosh forsbacka brakke formulators poulard mohtadi tahaan marculescu zahava firecontrol platespin ruhakana rennix gloveboxes schleifstein workstream wildearth yosuf golway highcross gunes firedog mydomain streckfus unhappiest moyock shof mistreatments zanoli transfuse traumatising nooke warborough trigeneration blangkon tapner yazpik dehap weyhill parkay conferenced lunchables acharacle propriano viridor ndung mappus officals sgts corsino encrust allanbank underselling accme kataib gocar maniatv genchi duboff sculpturally kaamulan stubbly kikuya hometeam whovian chelmarsh biopolis theatregoer veiws ceis bernstine alejos zipolite sopogy penyffordd movoto miquita dogsledding cranioplasty leppla dfps pmps danshui chiman benschoten decarbonisation buzios aestheticized valuably mammographic ibmec garimella winterization verfiy mlec phobics sheevaplug benkahla ligitimate enj krokidas sosnik shantelle onal organochlorines monicas lukaya lawer neosporin unstopped leatherbury agues kittenish aldicarb krippendorf billcliff aritzia altounyan philadanco rombi geomicrobiology irrationalist martsvaladze jittering gazeley zaretski visceglia louette rambaut jesuses bidmc studentsfirst maladapted blindspots cartee bookpeople blud perske prahalis trinite smartzone guanciale dorch silah pegol kettled bhuri fearnhead jabril lybeck coeck topalli esdc filofax armoires voxbone taky proshkin roneeka thca scariolo dolceacqua omnipod smarden belsat pxd naalehu bradsell dbts onesies macleane kikkan notarize midthun decommitted prosed gaucín sqaure seasides speal banif mutasim rodeway oeri copstick oldcorn underlayment mannkind hallsands rickatson magpas lanceros kvitova unfashionably blasband luisotti favo churchillian gasoil matuska karademir 
fettering lifetouch nadil lupset tienanmen lesiak highjack photodamage kingsfold morizono gigaton kiver porteñas synfuels wojtkowiak emersonian umhs wating intveld arthuis poutre fulking openfilm bagrodia akhundzadeh nonbusiness sharvit extr buehrer blueburger fokou compex taer supercute paronnaud boiman diperna kirop salfi phelp anaylsis avanade fluorodeoxyglucose vasovasostomy worp micromagic pardonable forebearance labossiere esle drq quinziato goedde maurices sanliurfa koozie chebotareva kunimatsu akobian bicarb tobchi sunnies detroits volx unhealthiness tohatchi tumelty hidipo yotel flowlines bethards fingerman mollick complaing wookies madieu lothing prometa backlin koomson piddletrenthide nacco revoltella balze chiquet nndc aejmc ducarme uncastrated chokey intacct sorpasso bluegold disposability polignano kiplimo whittard trailor glanrhyd matel takeing augmenter salaris meloan totaljobs paulmier shutup mizon korde colbath mcgirk yrm filmgoing verklempt lovrek occaisions inul gopaul butzner pruritis andrades healdtown padavan jinfang bramly questro montecinos riia epea mulyadi isackson integrationists nisource neobaroque demora crosstour focac saukville kostyra buce crickmore gliocladium kogas surreally mediasmart lassale blueprinted brannick glycaemic husbanded donnafugata tilleke realated brookers molaa azcarate staake schmaler treprostinil littlejohns muntazar lincy cumo dogus overtaxing carisa andreassi pipemaker baoquan permissibly haniska anindilyakwa liangliang haverland villehuchet omidvar thusitha ralepelle havlik eyerman pushka aresco edisons cinecitta graddy jiyul asrael sexpo psychoanalyzed telecommuters flookburgh entitiled goulson krod mccahery outmaneuvers sadovy deschacht wienzeile appetising baritonal changeing pondberry seawings alarid nakarawa dicyclopentadiene cege nbrf wdn sugammadex undt attendings louthan dzf fuifui telepods multimillionaires kirkharle yirmiyahu kircubbin glagow gospic neic moonmen perenyi cerre maanshan ledvina 
motamedian rosenfarb britting stukel efss desirably pricer gugerty chumbawumba eiberg rontgen palipehutu helyer vitiates oxclose jiangong dewael talvitie theret jalawla erquy peaslake conzen ghizzoni rumasa badanov macnulty metson schleip deworm avelon ticketnetwork lurlene khukuri fjällbacka bearder jozefina concerta rehling balou wmik sofiko wrage eehv barioni ultrices condimentum caymus jenkinsville murmuri pirouetting destitutes galitz nagita lapthorn decilitre transferjet dixmoor fireflys blumel endorsment langbar factive gustus immoderately governability shomari tejocote clockhouse burlaka persing dmitriyeva osteoprotegerin diyab podgy demokratia terrycloth toonattik brennon kamela hameiri forristal fecs rockmount wardah quaysides mcaneny durney alkalaj pirouet glengary lovcen devenick haylee garrotted promenaders ochberg senoko diakhate rostki myrle slansky mhlw jiga clov seesaws nunnallee widenius pyrek weaponizing francop edag sadoway stavrakakis corncockle discimus cesarian dawney judiciousness roumieh wellpark guojun hodsdon rosende nbaf labid spectactular antezana haddrill velculescu headships mainsails pattered alred culata forebearer chalie sheader seeburger oubiña dsmb heubeck devilment katayev camín gorson predesigned feep saucisson bertocci brevetti industryweek lumer loiterers loadouts nstemi leaphart titillated aghazadeh tigipko corrugating saldate activase gailhaguet qimin khaskheli enevoldson monay bordner rohlin interxion munusamy menchel welters brooders kuong louvish tocca sucsy humouring codorníu darwesh henneberry hankamer altinger krisp wpas purevdorj zinifex herbed itac vergassola wellsian cpsi mlyn acustico multitasked lman undeviating suzukis loizidou brachfeld thornier marassa jamiri anesthetizing dhaher klag ronke consentual podd umme proppants peasholm ingenuousness upda hopyard scrod sundome koblenzer morroco amlaw bishopswood tollbar probab rownhams pemco nembhard rocchio grapeseed kreger puistola sulej nonfeasance certificants 
lessy fareboxes pslra gruv kirbyjon liukkonen handzlik opri uhy behaviourists tropically walvius byrdak teardo balblair dedaye ayuntamientos graterol overstaffing bekkevold kentisbeare kloesel légumes benedettini gianini utilizations galka trawlermen subex klauer ndjili electees tillen kysor porcell yoshiji ardec servisair rapiscan geronimus millimetric faliraki herreras manawanui marmonte leggs balsara capercaillies minari outdrew stockingford jalaludin terrmel hanieh bipap drance diedrick telesford svich silverados babyfather nashvillians housebreaker cholita towelettes gurrelieder crookall upends destines pictbridge appealling sensée ronney bentoel laviano llanfrothen apixaban vlodrop sneakier lahair brushfires unreadiness goserelin sleepily maldistribution dixies birkmann heidmann hirohata fundie reitmans qsp campbellsburg bridgmohan lööw cambois akkuyu garce photobox placidity clendenen riggall marathoning boduan rosenstrauch interplays wyedean njpac unqualifiedly mayger stalberg mazsalaca mutianyu absher anouschka towerstream bachani chepchirchir touhig khoro iwcs blairhall qizhong zurbaran coremedia opposit aispuro utla waxer cotney halldin setai portese concealers requalified reshard zondeki lavendon microids archwilio ceftobiprole krawetz bonacina terrye qaed quivar idiosyncracy incrementalist weldele rascism chenchen civillian wasowski alchymist ghyslain plawgo ebberston crailo whitminster callcredit debonis petroceltic sterotype ericksson hillsdown sameshima vicken skivvy skyscraping deiced biomedica taktser mignoni representitive gordinier cervantez kellogs perhap sofort deferasirox blubs shrewbury africanisation vaders mukluk biebel drobot schotter bubblehead zhongmin acidulous kathyrn interbike tigist ruddles ruszkowski brewfest rehaul starchefs wertheimers towbar ambach icti roberston barhoum merzouga conergy mcalarney bellens sibbi megalitre armentano takingitglobal vagliano newswriter ciechanow dreyers marcheline pownce spicker khitab rischer 
portin ecoh vacherin frucor carbomb hanout vernand cherukuri wvsg schroyer synder acronymed schlieker hartinah mancot blackflies slimfast brillembourg blokhuijsen fellate nulliparous chakari litwa berinsfield nicor greened decribes naameh confessionally knoweldge hoeg exeat shivpur safieh walkern ravellette ozio hickeys delapre döhle daimary paymentech pengelley mankani littleness montegrappa jabalya landlubber keyse apprpriate ltee travelwatch ccpr crockfords maradonna ferraiolo segerberg taluto temporizing newtok hately picrite yandiyev zonules pawelski microfractures lilygreen backtalk stanleytown lippiatt roentgenology quacked poplavskaya squeeks nctj larminie gaviña fulanis phonebox hanoverton denominal sautman virtuousness ayouch sybaritic giltner felciano moodily jagpal godsall hijja micropro villency moceri spakovsky heselden jolivette unsalable greavsie relton acbd hindquarter wishnow uceta harmolodics ancilliary mejicanos glencolmcille soleau farmanara lifevantage garnishments zhenli wincy repossesses chikhaoui goolam bhum massachusettes mayetta tyondai eliasen transfusing frale mullavey yehude gauzes steads shaddick calçots labourite shameen alexus spoetzl kearneys refenes safinia paugh resumés irradiates mostafaei gurkan penksa pacquette côt staybridge samsova innolux precontract hendrikje wuhou diniyar somethig barqi cleansweep ghazwan salafranca recolonizing vbp fissuring ecsl poortman franak lisitsyn viktualienmarkt glasspar adagios polimeni chibas glassbox gural fzc opencable labno liene factionalised uchiyamada toasties roodee wifa cornucopias bpce jevington interlandi balasooriya twizzles kleffner freefalling pupilage whirlybird underdone morcos yanover suckale syaiful geroge moabi westamerica dorsainvil tessaro kedington nightmarishly aleipata devisingh blakroc pienkowski sandinismo lawee galliher urton amorales bioindicators suvo govenor bullah nhsca ganbaatar canvasbacks satelites usry semet portune cranksets munchetty ulley tauntingly nardello 
periquita infanticides jinyan gratifyingly assitant latchman cervezas ngel hirokami holsboer dissappeared eastsiders commenee stupples menhennitt cablesystem dellenbaugh tigta cunarder annibynnol ralley wayt masterfoods gareeb shaden sepura wheals minnikhanov srebro plastinated copilots loulie topgolf okg pentala brugnoli kaesviharn malthusians hackbart teehee volkers yawuru pliner unseeing collyers infuser smartgate rectums anbinder charendoff inexistence lyndel imron lorely viscarra tofas alabaman fladgate herberman dupouy volubility europabio rosabeth hectolitre truckline ralink reelecting metzstein cheryle apneas altom nonscience saroornagar zabell derra inexplicit koltur whimsey physalia motived buoyantly rewatching gayakwad loquacity reindorp yaps bryukhanov bidborough bilat sailani stohler norair caminati bourdy twaalfhoven painton faughnan adriene hanmore weitzner misun mozza ebuya rychter viticole eloran frontyard landmarking gerhardstein hemingways videoegg shaoul celozzi derwish macrame apfc millvina murueta ashkabad overdrafting prodger pogles grammatikos belgacem palenques joseff zhongqing uytdehaage revisioned nailah mombacho basell elote taposh mussett luonnotar backslapping mincom aguak kurashvili lazzeroni xenserver aï aloia prejudgement weirdnesses naturalizes komalah wamsutter ggd akima elayna scls kortum discbox multiroom zhilei eilian grofman hatstand rebaudioside nymans ecoflex necesssary bicky belhelvie globescan whitcliffe legwear steung biventricular riquna vidnovic marinker gastronomes woolwell mulipola rensin chrichton martonyi airgas huffs pointscoring ughh flays shinkle numara marchman prickling beiruti dalcross mdrc wieldy houngbo kasle armaggedon feminize planit heamoor androgenetic contenta piazze jocose spezi rsgs mading acidulated pulmonologists deparle ksenya minimun koepnick slaa lifestock espadrille sonim squaretrade farmery marecic beqaj pflieger westleton radhia tianyang rejectionists wheating thje carlstedt tinwell contigency 
theraputic andirons hormats odms expro dudeism sacranie satpayev seroma drewniak shnayerson radmilovic maccullagh ballanger avli mallek filizzola loftsson shumake kontis dutheil gussman sunee mooren queenland cubria galgaduud rothfield bahrom aricept southwoods guca josic bagir lagosians oberriet lydstep scirica speedlight fvd southstar havaianas kayce pochalla shchuchye brooklynn werning beamsley froriep gergo stockel santoyo esack bollwerk parlapiano tssm fluview onechanbara caciotta chineses azarakhsh aggers charmat pielou acpm bosen hapilon medevacs yedidya piffaro inmet nonintrusive feldblum vittachi bosta agok lighthorne halfar squirearchy binde bedevere nanotechnologist tafua mtcs hulhudhoo trivelli damanaki amortisation ulrichsberg benstock forearmed isab craptastic vilató cready qgc kasatochi legalises veysonnaz fion puffa lakesides hyperpyrexia tehching lonsdorf keggy bakkali tentativeness fattouh hirschorn jaliya ascherman fitiuta konyshev borring skurla acquity windygates undersoil bureaucratisation hymy wainio emelda billinghay protocell gottesdiener cottered lakhbir altick kawananakoa thalken lionised resurection partitionist worldpanel rasied irglova untanned varriale microcastle willden mahamed poolhouse gurtler worthenbury hyperflexion déclassé nekesa sarcococca arcandor binette monarchos athrawon falcones rosellen djiboutians itag paruk weavering avendano bridezillas zirinsky axeon tunin noushin dunion jialu kessai milwall lethaia legnini vipa westmost mirabela submenus dogwalker indietracks debruce tesev retrenchments gallocatechin cymoedd kooyman commagere mcnamaras baayork marquitos hoppenstedt polysexual horsely amerigroup newbyth uría nomadically chirruping lague blaenrhondda andromaca bernos celling natters koening djama oostvaardersplassen mythmakers centredness schlep shitless hipswell derai misplay lechea hewings lypsinka carene ncrb cleanable myeni hauner pinkowski dorene sorour llazar ilsey michalczyk terzic quinceañeras hollowood 
fozzard grasonville feofanova grinnin basex stermer spacefarers admet intoxicates lightboxes ruing sabreen cursiter ribis sochua smokler nyid planipes alizad dopod desley subari minashvili noseley polissia tigerlogic bradstone stikes görlach indivisibly mallnitz swizzels taishet glendy spainish hellicar reesei gaudily strategery gursel taxations bonica masarik koogle chinotto soumyadeep flipchart lambrinidis mondialogo freshens priorslee danyl avantis cnq outswingers flavanol zhongsheng lelkes howies teanaway vinales dannis kcdc cervelo tallygenicom shamlian knickknack chesterland amrich leidholdt sevastianov shipowning repellency navarsete kentia bdcs surfable gougne raucci bracketology sangars wonderhowto buchert tosatto blaris nextag prilukov beansprouts supersweet lobjanidze algiz gnip impetuousness sheffi kankowski edeline stoups monumentalism fredericksen naturalizations mosts esom woodloch atlanticare nasima gutwillig buglife buckteeth readmittance rubh barmoor balmacara brignall langelaan passtime unremunerative crego eastcoast mcneel grovelands pushiness patharghata emmas tylar savol reassesses cheshires accoustic overcorrected stoglin freshbrook iwth olgun mediaguardian relaford avjet kwando khamal ohryzko aquagenic frowny peura overexposing paybacks trés rsos necar ocassional sadato haematocrit trelech barbetta laitner advancetrac popley terpsichorean aapp tenspeed hedh draggy aberman khawani epok elese tachyarrhythmias ogbogu choucair degioia medelci kullas gaspara autoload winnin readmitting flotman maridjan cheongshim greenzo jozini kefraya paciente omrah washaway tolchin rosehip wiatt microstamping shyest jahmal gieseke piccolini cheneys biddings possokhov vovak guintoli ddca nimda fontina spottswoode artington hymned huyghue maame dobrzynski konis draggers acelino sietsema chalki ppds abdelfatah xinguang devide derio bindis brassie mispriced prinn vitetta mckeand showier blakelaw cillo choudhrie sinja drinkell sazka growe cobolli khosrowshahi 
unretiring bakhos ramales keslar dormobile sauerbreij streusel wideranging fugett schmiedl fenney gigamedia jamont broadspeed erson fygi neurosensory wlga fuci wheb sundowning tolzmann bouveresse marisat rossow cracchiolo cobles overgeneralize lubricious stravinski cetnar bersagliere labeef brottman shinyaku benthamite mugridge hurk codere unpruned wineapple laccd tpbs authentications seagen contantly smule maeklong eqp zomet mhango chippies seratonin krawchuk chellsie gonig eett regester westa wtgb pansing abiocor leeched manikpuri ignatio derenoncourt meranda institutos elavil juventute abdulhak irwyn hempstone unfried tarfa blowjobs eyeborg scharr orked aldert tthat gudino dinp helmkamp ususal poelten dormansland nicolaidis iosh aimin browett deskey paradjanov kilbarry mckissic fodge shalita concealments zandbergen breakstone pakzad usfj stralen lamattina undergirding kwiecinski clobberin mahiedine hartop chareh priyadarshi rarig thaworn carskadon sarms korbich jinsong parametres portoroz watumull caroselli vcit maniar coagulans camilia vancouverite winterswyk proview gadarene europarties sichrovsky longsands uaisele mpack knockwurst kipen bulimics slobbery truvelo stephanou thembinkosi kalanick lineartronic hebog instinctually cluing douggie mortada amberton kivilo bctd camoufleurs umkhanyakude sovereignist arbaeen kumartuli theodores negotiability lindwer pinck ganaxolone oakapple apprehensively damier johna izetta wowsers jearld kingsand jazmyn longyuan guangrong cerpa dhalwala debis hibernal depressors eastfields sprink heeter lbws hamami mandjeck gimmelwald tellef liechty spiffing hnidy hemodynamically pratichetti muchiri pleitez poat neurasthenic antwine cprw eiroa odontochelys djakpa mosad dugs farinholt silich preacherman kaylene znaider casnewydd jazil sonova liebehenschel lightpath tumulak haridi sarajevans chistian sidy mcmickle pisarek biomatrica kreplach yannaras santalab bangemann spilborghs zarela nuuausala kianoush murambi verbenas softtop fachie 
scudders esders koplow neiers kuniya hennegan marshgate healeys panick croquembouche unstintingly quantifiably kesslers pancrate trestrail kreinberg owein wainscoted balitsch pfpa delord forground lowliness glenmora djedje embroided sambol iraqia hcsc windish tpas quisp sarvari spritual backdown electrosensory multitenant kawecki mikala thermomix zornow wilbourn rabideau tobas unclad ginkgoes hineman sayanogorsk deyda tourment amke tornberg garrowhill fabulousness nazma taouil foodbanks sagel commity thinus boisi podoconiosis kijak bauge coopering bevanites malibus promotoras rivr finnimore ekmark nissay ebbesmeyer unlamented boutté amakudari inniskillin hojeij nxdn scoldings webcaster naptime akahi yerzhan buttenwieser kinkos unconsumed guaracara sulcer afflatus baman rupai hsmp emilsson ettl candleholders garana gregucci abff paretti accurist mantained osondu kbro wurly fluster sonyericsson lartin multiorgan unhas stakeford bloviate sacyr agrilife quizz assunto asbill amellus postsurgical malcorra crullers pullup longueur anticolonialist czyzewski kachan shulz musze watchout sabuk incoherency arlingham malarek fogger appexchange bagnato kitwana maxiumum jasvinder cydonie srizbi rown jixun kools iniative risper sandlings safronova avitan jtekt kotula danzantes touchpaper tarlok shipard euryn rozana clofibrate lovern zonked shiells postnasal ynysawdre dunira kenion dobratz designworks molestations galeya uncontradicted salterhebble verjus rotork papau shamhat boqueria theere reformable realer crematories berberi gurmani tibaijuka fleischhacker ukfi adreview kianga iwanowski paumgarten wisut fullpower ebang utsi barbulescu bedin civillians melucci icher abitibibowater habituating dasheen hefting alperson bdmv shamala medievals scilingo kalanisi dejia bacharuddin dongdong masticator ruymbeke ansumane upgraders multitasker squa atsunori mandri ecard biedenbach dolphus muhoozi deedy hospedales tishina phonophobia attitute sinsemilla eulogist rabani amritpal dubyna 
pavri quresh portering mansory shinique gettting tocker ngun whitegates sanitarian serralta spanfeller tosic farsad illegalized studesville securement gamaleya apmg tommyknocker skirl glimepiride ziadeh jakubec benizri unmourned signifcantly yemens prewritten pleasanter abdille captious kocic fenceline africat akinsanya trecia eqv pomfrey feedbag shuttlers sellas koundara magassa saleswomen stopps salesclerk khatar dyilo ogou scums checky hasip mnay livinallongo underminded hibbitts neuticles jinchi woodmans ficelle kovatchev guilmette deatherage hanjra musliu ditherington allanbrook cristini capsaicinoids leubsdorf kreher processability stirland nemeroff joyon peagler roswall surtain monsivais hosek nichopoulos borjesson trenter murfitt llanwrda ffrdc bassolino hcpt mukwege sommerer bahmanyar dimora gertten supera szmyd huncote skevington ssga bannout michikaze maddala envalira dakowicz wuthnow alaton pixo savik bannang hermé mediterrean cloyingly spinnerbait abanazar standa felley recces linteau pragasam kindlon renowed ternium spos faegre qabatiya anmm lgtb rectennas barcarolles orrey pohan cyburbia phallocentric zeppole mishloach meadowhead riomaggiore rolinek koina zemp calibos swished mahameed girlington arraign monsignors camerman dehnamaki ellickson abdelhaleem herritarrok narisawa subpoenaing marincola leukemans watcombe witters poipu cloudbreak daggy mzamane powerchip containable panjsher eiopa clubface cannelle kurbatova resegregation hillberry fedders tenberken adamsdale nonmusical ingenius dissostichus myyearbook sfsg mofeed tuninter anynana bockting leswalt bhagmati tokarz jalozai reinitiate tassle cottonelle cadjehoun carprofen nnrs pakulak malinverni calise arraying sungold ffred recalibrating sanussi greebe beiji secretaire guleghina giannoli pennycress mongillo hellishly grydeland coorstek smudgy guilts crase khurrum comastri kleinke beautifull boguslavsky liberis iwsc chainey livoti mollifying gleadall vishny coberly postconviction gabis scool 
protocal zimov arcone mutie willekens clarium kagay carlise yawp biowatch lupfer covario guiterman bikhchandani halleran jasperse mabula polumbus bulukumba diglipur zanzibaris garona shrives kinstler progamme tennesse citrusy torrenueva janneh mycocepurus compostion zaimi broida celsia undertrained ullyot cineas pincushions mouterde moisseev naseum estover eluana sumisip moonbounce scrimmaging fartlek thieve gambro hayshed impresiones simearth matela putrefied demarkation danchin kronospan bullbars kenith warstone bunkbed primoz incredibad haustein roastery weilin dinamika ruoppolo doddie goodlooking stylers prepossessing prechtel qpe seracini kohal pfsweb bundaleer rockenbach nonsocial barnuevo interwove carbapenemases ozgan gamila oppossed gopin vinnell crawick harilela verdick getjar butzel rjk demurely beloussov housebuilders sniffling nawijn sharoni fremstad rightholder cheko hbas dxy convienent shaimiev cannonsville bittermann outrushed garaud mariwan nèg sadwrn pahlsson pingus hockerton pockett convertors chiappone farceur rivada choucroute aggravations africentric areopolis faddy sweco glamorously clermiston transvestitism alualu skyrace novavax moganshan shreiner rtss zoric wotsits liuyuan milek pantalaimon borsboom satterly tocris emvco palevol keigan gerbasi woukd drumoak kustow fajon longtin fazoli gohpur inlaw scherber velib slumbered brennans mokin béar kirschke fractionator ignizio tantillo galgadud malcuit atomizes opensparc mississipi trawlerman dashe bingguo fitze dascombe eumm visnovsky vvmf dissappear blaxhall molchanova gilberdyke donewald refinances maléter kimunya delboy scraton handoko wilsonart earlie apelt dinaw barbagelata mafuta mundford unwisdom scuffs chrismas altmore goulbourne duaner technophiles tobiasen kuye gielow midani ojinnaka overnutrition naruc instigations jalopies shestakova coreth julphar klatten superpole escura ratanak foodists gravetye bidadi hafley pinchao wiedeking easeus meraux housewifery nuseirat fatwallet gableman 
dehui safdari mckines sweatband venetz lowcock aruz ukfc sendlein dementri rifs flocken greatland schelomo reres pikelet marketisation clouted akivi tenovus saayman eiter simner soraida moniza genuses zootechnical opinionator kolsch ohlde patchworks emun brendt koobface saera unlevel luhukay pointillistic learnvest valke swarnamali ballinspittle impeaches shilly nsso capuzzi hettinga akerboom goslee bakesale polakoff grumpily falloch radas rentas shourd smackers scharlau boxier ryongchon littlefair chichon moohan matsunichi eifler spiegelau prestonburg maleka kuol tillack falfield solfest tsujihara tacchino gerallt villier surata beurden cadmore photoshops bastardos kizzie creels vref actigraphy igps snaer broyer shafak hoefflin tided nicus nanko lufti millisievert manick funestus pompadours sreenevasan kipunji toupees melland agnico fusce gassville antici maclochlainn reverby equitas canani urmo shticks marcey proxicom dogaru scowls litl xinkai damagingly untracked polovets surgan geissmann nourredine graybeal bawdrip emotively daughdrill ultrarunner cefixime pruis steinways aashe yungchen glanzer oppenheimerfunds hamade minyon heparins caqh bodenhausen varnavas robyne unparallelled lickley doneness piraya bodeguita roscigno bridgefoot podgor tetz somjit overachieved hokkanen pamphilon lighweight torsello advertainment amhi frean condescends thalasso ghinwa higazy roves mackeown ninnies dusinberre apgc mapoe hoekman cpsia vipp chuño unexercised polisar maponyane clattered rofes kanning jetsetter patsaouras lochrie mugrabi boyt bowdlerizing backstabbed drimmer bakhash geralds eastwoods shakesperean betokens sambazon hocked restitute eshet seftel roben whinfell sensys staudigl wonggoun chattered wising llandel shacked oxiclean faldingworth bigan riesgraf higl nontransparent competiton yuyun xilisoft pinkos payre burkeman scbt geofencing frustation sherree celebrative jaslovské millecam seadream choos alkerton snarly manzanitas sitzman seekins simhon sodertalje 
boeker iiroc teie convio sampley shageluk photomicrography priviledged antonelle rheinallt mcuh mahound valacyclovir ednos topex llansannan heterosexually palant penjaringan biothrax backlick dothill constructability zorab carpatair eary pattonsburg regulski bevanite naiades strumpshaw wilberger koele inheritence cablenet rosegg barzak farhatullah edgeplay ziplining aldsworth lifepoint camal intractably ryz branders ction pattis muchnic arrrgh seroxat kolinski ilecs admist inactives harperpress bhittani deveci tamps zimmitti fanleaf udfs burnton particpating chargeability unterkircher unbelted advantest nyfd dilbagh priscu execpt tohme nsri daignault kacyvenski degredation sucessor evgenios mkoyan delozier yein puddefoot britcom acousticians iqor trachelospermum updateable circumlocutory wayah burell pachysandra mles maime menie crestron unworkably sondek justifed misshapes bajevic daydreamed critchell ventor psychopharmacologist abde westerdam peggi besties torteval stultified pomelos tiedeman durnell aissami ferryport reenergize confected harlestone roncagliolo stridulate indiepix lianying wescom cebull sensitizes newble mukhsin shogren hagatna ratnoff teyona covanta stanish texterity thermador laicism vendeuse handlowy stephney diybio empingham disolve soutane gorard bezzaz guerdon hebrard nightclothes aasra unpf balalaikas alperon tatling reopenings alienage küblis hagwons nodia raffling marcaccio denude wenneker bermanbraun standefer cisen cacak emmanuella lazarre disalvatore orpe wigford giampa moldea luitel muziektheater menchie aggarwala bimco lanesfield rayful trancelike coaters alhomayed vantages ogunkoya campylobacteriosis zgonina disbursal jannot redwell nonmaterial quane contraints xenith jassen veroni brulte cleopatras overhasty xendesktop saney sitkoff sunderman reoperation morgenroth pearlfishers flagmen taeke biagianti alchemyapi sciubba teesville domoney withywood nuriyev chonda manarola covansys sbsi baccalaureates wenjian indem sarracino 
cohabits hastle miza hubig beyler kermer morawiecki tenneson powerboost labrea patima youming aarebrot rimvydas dxn fuds shrimsley sciutto barbut palexpo pooni cypriaca polgreen dienten butterfields maymon akumal evertson uncheckable mingazov matrook wallisdown thynnus unwonted muqbil meguiar dibono pistolera tonnies carolle powerhaul tirador mcmillans pulzetti echochrome raïssa lekic flubbing presold denominating pasquerilla lagmay quickoffice empathetically gayner zoueva kreil gereb bassmasters redcurrants myguide monzel samardzic amiry sequenzas accelerative honohan hermer baver nooij dijkstal thornewill contrats kaune vatted sherriffs microbrews abrazos kriemler hartmarx infelicity janjic mechelle sege kwizera aldermoor crystalised datagrid macadamias nugee gourin beauracracy ghany kajeet burles hadramout kineto seguchi bauw halbritter misher praderas armhole nyagan lessel snorky trivialises jabur ekren nbfa shocknek kristyan amapari ₂ broadwindsor hiraman perano uapb swooshing ucac friskney hemmerman sanping pasttime gallbladders avionica emannuel auchincruive spotz kunken ramita irrelevances outstate sidat microturbines kidspace zinnie ulumi pastırma kilcooley narah helzer portioning bimpson fenners peaston trotty rezac wisconsinites ssos dohyo blighters ranchettes twangs dushane grandiosely steamfitters eiras slovin pollicino neurotypicals abdikadir malseed vogelman theroy tranquiliser mbango shrewsberry bensedrine nxec jonquiere wedeman noncombustible goalwards barzegar moriston ardossi kauders benabib hypercholesterolaemia annuit maginley kanoute bedells fideo rexroad yamaichi reimposition brynwood eppi travi ruegsegger freeny lemmouchia enfolds huntresses philanthropical areds wyevale uibhist facebooks slithy braymer inverie ultrawide vlautin pitau smutniak magwood daleen ostracizes pinchin schumanns kloser hotsauce tushan unauthoritative mullivaikkal bryld izad selphy kornat crousillat headquaters dolgen roscuro rabois swietokrzyskie murewa fouth 
morfessis eilen mollway engla euthanizes critize ivalu lokuarachchi westco afterplay readably locati poienari leonnig zezima soundwalk jmcc tillim moblog bloodthirstiness guama chels liplock baaj kylesku riann hohlwein onelink baalak quaggas tallan minouche frecker chemaxon pennybags brekk hebl agbogbloshie flagstick samakuva brundige neuralgic qdi newswipe beaucarne chenxi timeworn aozou lampost vigneri hoggle yardumian mapson merisant okagawa ciroma houge gubay pynzenyk tacolneston khumjung ppip edry mackiev clai kadek jarraud gtis windmilling carringtons quivered fruitadens puzder ambitiousness hakstol waino cucchiara rgbe shuvalova dwygyfylchi coverted castrucci readier frz pupkin nowthen underequipped ambuklao brumbelow biotoxins wolrd fabada reeep graunke iog dismasting useem lanahan binya nayiri nawiliwili eyeballed outperformance tomcod nickelsburg muriuki durman dannat tavernari whitewalls saaid zeen warshow koolaid mpgs winced hounddog chilga eckenrode lensless hypermobile decentered mcgauley miyanda momment shafilea ranabir fathoming laret shedrack refinish maruge zeoli ferrugia dioni hettrick rozek comito pleasured cipulis biogasoline gnep balloo replastered zolty zilinskas havelsan econetic rcdm fattoush foregoes gojet greetwell tedenby newshound grabel atterberry aleg amitay squeezable finty celebres turbervill caydee shillue txema fairmead fiscals vaziani sintim trandahl khanani fuggles immunoregulatory stupefy hookline csaa cisowski ambrosial miljus cyclamens getback vancil withypool ncsd shoulberg ghesquiere tabarez schoppert fallons maarof enw cbrc certainteed peniket inceptions chowdown legette pebley ushs astarloza burrata jazairi dcra patisia daryoush couderay caiafa chortling enernoc becaues desalinating shandley tonkovich oxygenating caminata polner nakouzi gearheads nogar listkiewicz poteen gangar hoptman overprocessed dubernard remilitarisation spianato bathpool bagrami decitabine utzschneider intego houpt nordholm amorin shujaa sadka 
questionings muranen nekaris datelines sholar uitikon srch inexactitude cothay hengeveld griso onesearch defrock foister eruptum vincebus blubbering reasor stihler multicare zazueta supranationalism mapendo treizième paxi teletalk toing scabbed desia donnachie shulock begor chappies yogpeeth backpedaled jermon flinger kicanas trigano cottrer automats utec konduz jauntily lukesh druskin khwani infosport dederang dobrynska accusingly napolioni tunison icimod jops fireguard openadr geocenter bassingham mcclatchie wolkowitz tuffah ezawa alnylam shanghvi jayla zarar adtran whap abysova gormans healthplan faronics taback catcall hcis jeeze orchises osserman optimiser mluleki piggledy kayrouz holographically betw zwirn cazaban boerwinkle inglethorpe doomsayer ponden microwatts assimilator carnochan dfsp tanarus sakazakii benisch kowtowed bardelys borispol haeusler americanconnection dansky imaginer enflamed collery ravikanth immy cancian trulove choper leegin tickencote cylinda hoevelaken ionizers intratumoral pambo szekesfehervar kezerashvili quyet montenvers helfert santaniello ciil vdma transubstantiated sdlt baseer thweatt letherman jillings highridge dornin bronglais porthgain haveman omegle lulucf kazakhtelecom ellingworth kaseman airão standstills nkomati cowpokes goldnadel aclj vaccinators richarda droperidol fastnesses eulogise crosbys shuzhen marran shoeboxes casden paradizo genuis yaalon addresss vrinat klyuev caberta jasmines takanaka geocell thandiwe prammer swordbearer enthusiatic goldheart gyude globespan aaditya gdgt hoeness prilocaine palihapitiya perisic anchorfree norned wronging wammies sublingually dobbies riener pongs bezunesh zya rosemeadow dsny didden levell quann schaedler anwa sangomas mzymta undreamt babikian diggles howroyd duckworths guoyuan gentzel hlavac ashlyns monarc afront druick plantiff selldorf snickersville arisman falacy refind kokol shopsin unticked huska uyeno pettys leebron commet ubogu shadowcrew unseeable fppc petfinder 
pizzitola gueffroy guangjin wanly tyrannically electrospun omachi cvent gandhians dorice titbit zwolinski freedivers paltel gasior biederitz meuris nongame quemener feldmeijer bargal imagenet pomerado penaranda gumus overindulging obdii interinstitutional ausnet mxenge outgrossed missileers tarvit koopmann cizeta flahooley salutatory overstimulated meeru tricyclics jamarr bocardo donini llanwonno hogganfield butylparaben spoofer chandrayan moutain wakuda serried callused elies verryth zackenberg helth colemore powerschool bidonvilles pucillo delimkhanov dummying jakon iress sergerie calica egeon hatefully wisbar vyatchanin tecsar soffa peakirk lisen brudos tiedown rohen qarnns bootprint racf limc lissett danly schwemmer scwr hardoon brunnermeier shanika brooked ginnis nonpartisanship journee schafft kveta commisioner niueans kandeel goovaerts bandhs carinci neimark booysens sadou tallington agraz putterman helliesen seliga zinkhan trickiness comunities diddling hearin zurabov caplets bonking teichner beardshaw soysambu beting keshawarz whatevs ergometers acmg sibilance lania activant eudald pincock probabtion leatherby falavigna passeron vigliotti sherifa fahringer fezzik palguta encomia modahl actl blazwick koolman ephemerals canós guiliani kyemon slavonice cuautitlan quaffing kariana norview iomart incluso dmea sparaxis wheelabrator horsewomen kozulin yuesheng devean riar easi foushee liexian tolkiens disenrollment auwe thouvenel invacare jesta piotti motoblur unitrin mwaruwari gambril scurrah shawbridge engelmayer taniya tactlessly beegees schuelke pantechnicon hollaway methody saddi bollito kolmanskop kilm mounding muthamma kirkbean mvelaphanda matrikon muteba massachussets chacahua labwani kofe familiarities menzieshill touaregs alertly overprotection baltal grandey demonstated mefeedia enemys debtmerica renderos tiagabine onsi mocap mandelberg maykel footbrake happends hawliau venediktov doumeira khaddar roksanda egeraat kazini appraoch eymet kaindi deghati 
passanger noghaideli nojo uninspected microcredits bullshitters zoomtext extricates valaika ivernia sgdl reccommended daeg kaluyituka kutik linerboard kincorth slovenliness graincorp profiteroles jaeggi samenow keffiyehs mannerly scsr safaa fainthearted mockney schmautz wndc landrin harshing asbjorn airstar mccalpin polyone vizhi hisakazu twizzlers supersensitivity brahima wyburd abeta choreopoem linds kacst denef activerain baluyevsky tarves cokers koua begbies hoodwinking tischman ranelin molnau ustyurt salmona morstead mishon moskovia rawsthorn faceplant innerchange ludmer kassoma geeker clamors frithelstock sambuco thiruchelvam bixley submetering latman rhawnhurst joei mangaliso shedlock fluharty norwick ayariga hijrat skyrise howgego hoefkens lichtblick bisti underclasses austerberry pedestrianization thumpin guisasola fooding sancan literariness whelans koonse stocktake banafsheh porfiri rettberg estore worobec govilon haipe phrasebooks maccuish itasha mazut vitiating sabertooths furfuryl unhooking dimpling chanchez kangai lingani eebee brovtsev desoer pebbledash zocor ghorban poisioning dafora cairon shpe subleases artesanias dodes quantapoint mehmanparast mwcnts alledge calvey talb editon cvvt gateaux waggett babnik druckmaschinen allido phok gjelsten lostwinds hodr goumri decisionmaker aussiebum gigaset iqe scholtens maplecroft underinsurance jameelah lipoteichoic ruhlin unilateralist lungomare redenominated tilmant sallon telebrands squelches landaluze recipie extremest howardforums rudakova xiwang earthweek lgfl qmd kowalczuk padaca queloz jegou sharzer petricone nanh studioso arenda brager vikingstad clario ruenroeng gazzam woodturner dloc campolina fourballs egaming paraguana mashers migrantes everchanging mandabach hochhalter popmoney hongbing laborda nouria bottigheimer jenco ningming hyperglycaemia kanjo guthlaxton hateship loveship marjeh zentiva aipla kucherena glammed intimidatingly buzza kryczka longabardi wellie fastskin korns tiotropium izala 
schoonebeek casadio tarictic mikalah hinnies piscatorial hitchmough nimetazepam subsquently wickramaratne connubial kukovich cannelloni reline bierwirth frykberg morcone ebad guosheng littondale primario masalas caten ilston androulla kamecke gregorys toudouze gfatm zagorsky milca fqhc atkisson saidullah rajapakshe husic ulanhot cahall exasperatingly tupay markmann meïté presumptuousness hökmark agressor tridel harvy pottering gockley schoolday jalasto sublets eatinger rubinger valmon pantomiming sklaroff whv biotek charlcombe mulege cairnwell gelete yedda storrar naion viglietti derocher soat tintinnabulation gerstenberger flywire fleshier kasandra nealis tatis ccop lippes schlesselman curnock mettingham encroachers heckendorn encierro banknock nyron fantz sideburn dufus finnkino delron impoverishes bidded hypalon speechlessness kukeyev mosiman lagoda langas wicus huilongguan contradition lawdar mandhir otemachi inclan bigamists krupicka kneeshaw freidheim murderland sumatrae brookhill owchar keratoprosthesis yousufzai rearend fahid kuks capts khyam dermont addow luach balleza rouiba notnowcato angueira crissier leshy caringal owly wheke pancaked tosspot fyle zalon alamieyeseigha noiselessly durty greenstar akitas publi terekeka llandegai hinderer gageby skena gelvin richlin arkefly veracode ejupi webmistress ladybrook cognis soberness moyross marguiles lineouts schifilliti magnit jongbloed zindzi rampf ausenco girifna svoronos toddling benhard noblett scotson chenoy elegba aggreko cigarmaker pirls taiohae sonol dukane rospotrebnadzor shakirullah redbush insertive interwest snowblower sakhnovski rajnesh zecevic strommen escudier dejonghe deadlifts dumbasses hexie vasella slma loscalzo noorvik sardarov beccaro faustmann recessing filicia floorshow valloires mouraria xanthomas edfa dorame samthar baldie sandya lightsabre baldes pingers opodo obamaism glocker njbiz havlick grushow lambridge banishments targus socities anathematised quiwonkpa novair fluendo 
countersurveillance liqa omali whizzers bermant mammaprint gannochy newspaperwoman eayre bakircioglu oxonians ezbet citicoline trevyn löwitsch westerlands humitas kwock dalein sedici oceanco liuqiu aigs experiement faruqui illian punggye organiztion cruller dalbavie akesson goofiest crowsley forugh pointcast braemore bzx grotnes shadowmancer scearce spenkelink gracian terentiev janadhikar altnagelvin kannika patano westat callerton mouttet unversity leiker togeather chathill vodaphone lenney harcombe uex lajolla rald swackhamer narcisco movano balkanisation hudyma ragle kaylynn lansanah ferl derreck mboka langarica bockelmann tomsic parklet loooooong doesent mondory baudour damsgaard sedky tamashek tomass buinaksk mambu brusk earlysville kunimasa amale fishings champon wagonlit ztohoven appall ukranians preordering udston meshram baxby medano hoteling udofia dedridge golfland melwani costellos eej lisanby hessam siccardi tolcarne niebling sezibwa garecht brentry qabb jarritos aulani bcause chiedozie volanges gdhi prepacked mingma stereoscopes bessis bassens larese saoul facteurs horseshoeing ccow avrami grattacielo osaid propagandising jodelet mcgaffigan briseno kurutz sartore ipsley gosens vegfest mundle nashawena virot crimper autonomi appreicate oystein naschmarkt gutsiest whitehair bivy svtc minijack sautin hardel cerrejonisuchus dörflein chaib sandlots schnetzler eeda pokaski litigates rowly bretter sérac guusje clumsiest didelphys baatz katsavakis josetxo zickel dominquez gagliasso sarbu feebler tcherezov merched orjuela micek ecletic tomkiewicz vtn rajive uriminzokkiri scorpene bucktoothed beachings manhandles microamperes simme romanello teeside bichons shigri roia cavagnaro nothronychus harraby tonala shedfield matsumori watersense sailability ecgd ziganda fiscalini arbeia levermann todorovich velosa spedition pestronk eigel otepka aknoun sudokus comunism batbayar glovsky chalmé mamina clonan brietbart undrawn liftport albiceleste secondi rottino rummo 
braunlich ppss velti vandyk impresive timanfaya islamicisation divisionists tierre andartes wronger tomasevicz spead émigrée floozie dunchideock guriel acga gulbin baumberger romanick upconverter boubakeur romaguera growney solariums tryng ophiopogon truffula aviations shellers mcapi rusan sakabe sillito lombax sneeringly heathway zlotnik shacknai happended jaksic lomnicki mcnamer imitable woodgett restif twinjets srednyaya atassut beaurocratic ravensbury calander mayorkas glowpoint reanna ceratinly kildrum horbaczewski continetti bucketloads rokpa kosair breathiness zenshin pilsworth samcor huanghelou zhongping sajadi graystones humphryes tional huxleys mccafe engeman mouhammad neogen gujerat beatdowns negasso nomuka nurnberger asimos proek gidada walkoff kotil fichardt amisi discala passagework unfaded stolfo iside blethering polverini nohria selespeed wheego jeansonne katyushas steinfels gaudini childproof immedately caesarism mcfeeley emcor udj sotterranea kimbal vandas flatfooted earthrights fragale maharanis gamcare rabar bagnulo gaulden bernies envirolink lussick brachen baalke brammeier maandig levings emerito plbs unintelligble emperical shortcode shirker panteleyev stricklen perretti lgvs heastie frearson tumlin preoccupying daradji marbley wetzels tenderhearted kandamby mohacs atpg maezawa democratas pamala stegmayer varces crucifies baltonsborough holtam dictu purées underkoffler gavaghan acerinox flipflop jockel methlick abogo destory moviestorm altug trister electromobility ifrica eyeroll plutoids besets riversway grunters clulow telea kennards akiachak idasa aliakbar rienk statusnet peiwen dubler bodzin emch mosebar advs wawr suffuse uyeda punctal foued carribbean sewardstone hallym darv glossies santia farmgirl gieschen miscommunicated hosenball norfed crashworthy geekiness mckintosh hamdoun abdoulkarim asnis maghami centreforum leffel amalya gareb stashower lanh gloveman montplaisir lawshe adach fatfat idoko flashiness eeig yabulu maraden tuerck 
representaciones kiest estai caulks netmedia auldhouse freking holbury unpretentiously edwardses milliamp grovewood curmudgeons mediatory amedi rodiles rantamaki cymunedol velton liljenquist sanberg yesin moussavou mammies penisula irruptions whjy draggable dincklage rovani microalbuminuria squarks polyak kassulke karro dyantyi stockselius mcgleenan efacec caliri croglin caryopteris multidenominational ibwa disgorges pasuk defjam vidoe getinge parvenus prabin alamoudi wolfberg llangors cavorts polysomnographic yelizarov pastukhov demijohn twiddly martinussen arctotis zaheen vistar nourisher myfoxdfw readerly betdaq withou cappers carnett hemdani castlefields alderholt katragadda sanfords midomi handsomer tahlil incapacitant emfs vanderveen minnoch austinite lynelle wrang heckroth aepi openworld weast uppies hnefatafl mathee otting hardwear stroopwafels oxys calvery mucoadhesive higest grammel ncfa manceau hpvs falinge diemecke emcf sassier sttr abase fangak guisard miedl jianchuan palatas scatsta barrey soua meselech zigomanis conjur haimanot acergy benkiser shatra panjaitan outfox ederer banwen razoo lianke muzzio millhiser boehlke beroni gospelfest popkiss kubinec investama nestings pikkarainen pylle itif mataskelekele mannava njongonkulu bookeen gonsalo butscher frisks sugarcoating cartref omigod labaton thirunelli takeback mayell tharston authoritie stehn zeydan berchelmann dugg taitu gietz intertech scooted sufiah ayahs schiaretti tenderloins niittymaki eradicators lovgren philleo aotc ciardha altaie medwatch sunshield elberg kiilerich seahenge theisman fosbrooke powerwave gergawi defectively javeed edathy matallana suckage montpeyroux woolie qinan karins selwa baglin whizzy honingham adiyaman leemon kaisersaal mazuch rondavels finical ruaraidh shoestrings optimax swetz tynesiders commutable lvarez pactola teckla uaq garavini bargan arnson pictometry budnitz cought mophie galgate daifallah lifelink kivus michalakis inaptly beichman energystar reprogenetics 
hardstanding khloponin floripa mountz nfwf lingford guanzhou moonquake digiorno billingsly partisian slithery hemion piscicelli distastefully rainouts guanlan ferarri plymale fcip dundonian kildans ipis equivocates ultimata discloser techron gaddo boosterish ernesettle liqing unselfconsciously diavolezza beegle merfeld financieele zoraya healthnet whaples calando seigne yeremiah knishes mahjar anapolis traipse hairnets reynisson sunraycer coppolino sarpei warcrimes lafonta ijams curtici hassans hial mccomber alagic lanrev headwalls stitzer sopping aiful khalilah unmelted vatansever vogondy pipewell jinkee carlyles abdhir shawbost gloucesters towndrow galumphing zwinky artacho icepack maalaea floriferous sawtoothed kargin mrčaru follwoing undercounts cutillo bakhtyar indigos aobo carvelli syamsuddin juppe courrielche superpages depravities debrum partaker philipino guarenteed ferreres prebate depressurised zhelyazkova thingvellir gossipgirl bouchy nohe wocn berlato perishability uhls softphones sanitizes kamine mesinai ersberg dlink sirbu solair luthman sleeter refalo chemico cefneithin valone oyuela ketz vunerable gueguen kstar fadhili conceiver amezaga wellinghoff hdy arsd wounder aberteifi stantis gustines protract doglegs cevital stoltze matee chunxiao smouldered poppiest palsgraf scarantino zyskind nishu tightropes aminatou datai aiptek dogster catchpenny withall fretta brackstone schlosshotel jackknifing ramnani hassie easer habhab tiffney kogelo coartem swatton teilhardina wazzani baupost snowmageddon bareness roussey alekseyeva mandanipour leskovar bershka cyberball shcharansky kaczur kaddoumi hetreed fatton indefinably butterfill cashcard cuvées detemir anmar joulwan drachenberg mackwell collarbones derner belier vallos talil requiescant cavet fwi moubarak lewies knightian consob grazzi ajib donemana fentimans conservativeness blands ndoro filadelfo mohammadu parodists purtle footstar lahme fioricet miniaturizing barraging namanga depoliticization stromile 
moggerhanger newedge glatthaar saadon lavandeira debars kelynack spinned vodden kivo luckovich extemporised blook ballie yodlee takhta isoh puijila caraballeda nagakawa szema beserk scianna wetzsteon demersus billfold jonkoping conversationalists noooooo karron bisel milinkevich yakobi akenfield morisette yonghua kiamesha sightscreen tosher steeplejacks overaged sacia hitar waterhealth cousseran rundquist pricesmart pelekoudas kapitula chlorphenamine ijj shamble berloni ancop karosta fessia menchville stanchart royte khafaji horsebox gontarczyk marida kyllachy slainte icebar posher lennan faduma derossi multitool sitution parlimentary insectile deaniana greated nvtc hollingwood scheidhauer mcavan qiuxia mogilyov enfamil ibmp lioubov unemotionally nidever mugatu mizutori seaweb korpikoski edgerson overcompensated murderabilia alemi rivieri urique yaas nmes tchoyi smooshed sozen lenes misshaped therrell velayutham indravadan jacada melodramatics adventurously zyad chookiat barsch yashchenko uncombed distrait ehsas presswire mahlman strangulating arviragus gueret kildress dgcx todini agazio etiolated polytunnels scade tschiffely ratlinghope kammal ballingdon flaschen ethington mullainathan pletka luera supergraphics flansburg ekber naeve herrara btwn uglovka tifs postions wavebands svahn ludicrious karmon rmps grescoe vodenicharov palaeoclimate instituition beuc clabo tevanian castlestone kibbles qfii noisemaking divey metam slavering swaddle acpt sheesha oseo cynlluniau rackable positioners bladimir kheradpir amalraj lahde cpeo mackness uskmouth primulas kwezi myko panagiota carancas lrx katzenmoyer marchioro leamore litterer siloed distintos dirigiste hananel cellou goldreyer mamunur vnb worldfocus tavalon yuanhua tasti yowling marchiani taked fischbein jenky pilseners vistor duntrune mcvities moaveni wannop lundwall chks awaydays debry blockiness tuthilltown fccp coverity segafredo kurtulus powerpoints vidharba acutiflora dewick salinisation tqi sharifullah vieaux 
kupisz codrea rigzin schelsky eilein pbra pagents ruesga trifurcation eyeson aggar tolmach pharmacopoeial iaop wuermeling tangriev nonas karrikins disorientate rollcall diversitas roveri shahnazi sagg uplinking yardarms weiqiang mideksa chryste glibness brundibar spookiest meeka ahri orphanhood egland commentariolus stirbois farflung booo woolfalk pacht trezevant roslund chitika mosiello herbfarm wholefood miring rosenker pervenche hartless lofters baidoo hypermotard adao dundlod kommineni chulos kreizman staron leekes nesim nieporent miasmatic bhuna faughan reddock egalite deckhouses partouche ilboudo bizare lemonick inpact appjet kilcornan cisler aronsen rtts heinousness gigantically correctible refection baudisch schoolfield arbritration jfcom kehela aptekar poran trigwell stormhold specifiy securus chengwei knowlden deleware ritzau maimbung naleo cuilo dunivan swissnex varejo sharts seabound prinya bajoria infousa nappers sakchai convit khristian janszen lascurain buchtmann gafisa lionetti brokk lambrini branney timbersports epernay effel senik hundredweights abdolah rusciano deodorizer eisha unsellables blandit sperrgebiet cavana lugansky wesbite rebny campsfield frontzeck tlale bisquick keig natureworks mitica awdurdod elmen ondieki vollbracht khugayev rackhams mcbrides rathergate verdini beuzelin deye mikmaq rulan kahoe clis picsel inelegible uninflated zulfikhar picturephone bragar hespos dotheboys seither kromkamp windpark panicos esfri baronesse vulindlu rapproachment lalai ewwww hafizur theif blaber izenberg haggler gaynair huaynaputina wyhe teranga goreux newcastlegateshead newspeople uplifter ftms sheyda hadcrut rentech yalof brockel irenee spewak ewh admiralspalast feachem khadak pisaroni lascoux wesch jlloyd togather makeweight yamon saltiel edwaard miamians cubistic wassouf jube abouba cullingford launderettes tetty sadykhov panich ineluctably endcaps sarantos belhassen smartpen aeterno sabbatino faraya delise wsff nacel dubielewicz unisource spiric 
makali echiverri liddiment flipse boyishly aiss iochroma highwall trequartista explantation ocma barsness vaza yakhont elhage prosafe isold belston montclarion rencurel murkoff krensky pouching tasing sydd gharani mergermarket dollface ordan crinkles pacfic wallal guidera depressurizing tittmann avolio biodegradeable cornock cbac kbx salsinha chasman khabab garvis welfarist olaparib warnig nonaggressive kotelnik marinaccio commmittee arriana garlicky paumer bionaire karash skimpier loynes arlem weyns flatlining mujangi sadeddin mccright gulay glamazons whetzel maasim backlift miguélez serriffe koulis blachère rawah playle mcmoon imbrogno dotonbori dealmaking gocke rhosddu nassauer pingeon lambswool forakis onsong supperclub ialdabaoth melsonby euzebiusz lusterware sassenach hdrs deployability remkes kaviani fennecs vitner cramers konzen hadsell zuheir jesner motherese steingart genocidaires quartettsatz unknowability tuccillo blankfield bedington adlène guskin standalones ghawi districtwide microlite bastrykin maidis aher innumerate gwyddelwern fortunetellers biologies triaging slackens woodsmoke assyrtiko prme khalf brianstorm oncourse moravcik centerback farokhmanesh mijatovic siwy commiserations glammy mayorships arrache muxlow husinec petrohué skippable coasteering onesimo chorizos remanso fulgent innerspring kour misseriya aiswarya basrans lovegren aparthied headcheese codeveloped nathani icrossing convienient roussef palmor barria bjartur viñals castonzo godana americ purveys gumbos ministery streambase iscd walkies berkovitz mooli agatston malooly fauvergue cybertech unencapsulated wackier mukaddam wolayta tsheri vandekeybus chinchen dossia dongria welted inventec walin indiv frizza dinicola lolesi lindeborg muntok kielsen burwitz visger chilthorne qmm mohammadzadeh molumphy esgob quet seper sdrt traynham bromantic tiarra bilham iliza levain kuronen monoprint aperitifs masillae schw fombonne ffor eduviges aflibercept sequinned frostee bactrim lifeco awarness 
tetzchner aminur flra rtps adle majore sjambok jpay kerson enteritidis albourne auermann onibury beechdean fcmb masriadi kringe olinka carolann darvishi itadori dismays timmel darabos stukes kotera leckonby lashinda tonking internap maxeke icbt sensodyne bookmooch vaterstetten dejardin jacksdale lenscrafters lilibet sagansky appliqued hallford catalonians foldaway xenical pitcavage inversnaid pezo surpisingly raty travelators novocain kyosai ophiucus arnezeder kment caribia bugrov fishkind tinoisamoa marpi kollsnes moldavite salzhauer linkline inboden mautby kopelow ifereimi simmerling khano manats bradken recommitment diaria transcriptionists tajbeg suntower tyrannized samothraki branam soakers comestible belizaire schrimm sequella faingaa diverters crackhouse vascar goghs tonchi superdeluxe blondchen nichirei jamell fessing hasnawi salvaggio proctoring biddlesden steidel snowscape villingili fisnik xtract kröpelin hillings universit cyclen energywatch scorekeepers larison hautacam biocode dorneywood impressionistically melgoza whittome tottered arambol matchwinner accet abinader tesman attainability tonkotsu mumblings koulamallah crocosaurus effeminately remics bouly mitsuyasu dunnocks bochi mcsmith wildings heymer brackin isaaa pennan intuits shangjin sleazoid counterterror sklerov solnik muccio ticer smoothers vibrac barbolini bizrate unblended dervin ptsi macnamee includ patissia arnish norweigan elloughton vitalijus gumpertz asayama erres platell wesport openleaks selimov ducatis belesis kinuthia galeote neeka kennametal infida kickingstallionsims palmiero vagabonding lemere tynged trelawnyd ravizza massification tiesi gaurd jeffrie hastoe loë anchia kourlas cutuli unzueta ckin overbudget paglesham campsa cakey wherley vallina airporter ivereigh bokun kagasoff hpq lpns nationa dingemans depletions prearrangement eissele strasshof lerose remaps beautyrest corado sadykov chaowarat lokuge opelt erkes tagesanzeiger monley zietlow keynan cockshut alcosense 
hoovering beragh garotte naivity liveplanet halfaker calytrix symondson managerless delegitimized muqtedar propafenone adeyanju mascalzone hercher adulated rebhan breslaw spooge centina affadavit raasch cerrie fulliautomatix heitzmann distractedly kirroughtree seasonals dopest hollibaugh larque thiranagama panaroma plantcutter monnig gilady absoultely blaggers logsch tassimo borchester fingerpaint aripeka saani nupa farrag webmonkey waidelich bluefly discoverd sonmez upsizing muuse cummertrees khalef jomhuri laique ciputat punctiliously jimale mansoa manswers tafero lunchbreak buluk renvyle fessel aponavicius cherrybomb ascheim stepheson bonite rodenbeck chivian sinco aziziyah quaids reiterman vlastnik digitalize showoffs stickpin quere yamamotoyama lepley underskirts winterwood clunbury lucman rassau brandmark culmstock gidron geotags marungu fanaro youkhanna masik btps haydel kirrane accordin remonstration kabuye navfor kallick arnous prytania huahua tapster wja sabeans coporate badingham staenberg tath defectiveness fencehouses letup bresaola solyom alprostadil susanthika pfotenhauer lebenthal summerscale catfield huiyan malingerers borderlescott filos nevelsk garvanza saanei lippin cear kliniken goggled capering compugen karush nghymru gfsi yursky whitewashers zogaj corndon silipigni tenderizing borderlining pescocostanzo cobá lapatin benedik xensource hijms gralton bütikofer egleton sejer attebery meadway sarantakos interbirth leifert homeside aundre naiz yanyong waldenstrom vlieg tantalized ujlaki bakiev piovano lolis waterbender dangly deconfliction tushy ingelsby goojje chulkov foully passementerie unspool itihad suadi wielinga obuchowski hqa brassneck saddlebrook samiria hoschton carcroft hamams leisle tiririca benetta novey junliang bahrudin sweetspot possble mavisbank priapic goberman mayberg siegfrieds rollerblader hodgett maryetta farar dimissed llanegryn gerani davidovic superceding hollywoods okement rechristen kronman varteg carciofi fontenette 
stemler mathathi terrestial beanworld encombe sambals wescam jenae thurgoland termism harlemites kicky linkner ghostbusting winiata coldsmith bokeria stais updraught khisa planetspace tagine teria woodpulp ghaziuddin caep sterilgarda pofalla shmulevich rheumatological femap finagling spanjers baharistan goore carlae kadenbach ôl regim brandenburgs visionquest pressingly comitting somebodys hemani astrue cialente iguassu bogalay slonaker shachaf nigrini croûte rindy agiorgitiko salue piercingly foxtails tilers exatly aafjes penningtons ecotech pullouts prophete gatornationals molody tambussi sheriffhales stintino aberarth vinnedge ballylumford anology acham milici blitzers kartagener lcpd reachlocal viburnums limerock myfyr gxi trussing prepara acomplish filegate wischer atripla ciccotti worldlink tielke — miscomprehension qtes residenza chokepoint abatacept rofo irbesartan supersizers pitchout banovic temarii cheste mbulaeni brossel cilostazol rackety evangelizers onvif fuzzballs mafokate lirey hamc crct swor tranquillizer vidiya razlan tribalist transrectal rudrani clearedge tmobile roindefo benchimol haratine pegylation travelsmart sostis startech cordsen willingess rosenbush linquist seculin etravirine revera neoedge geltzer newfoundlands couriered newyddion biuku inegalitarian philomont vegter stunnel otah einstruction lipscani erry jaquiss bahaji oybek murtabak beaucastel besluit biospecimen frico kashiwada georgiann pdos coronate laxart beloveds preng josphat europarty omurbek habeus buddwing vistors similipal giourkas bacchetti larrikins belluck kowalke borgesian sysoyev tulipan ozcar ustian trindle frazzle travilah sufaat tameleo nyseg sterr brimpsfield mooradian purke suraphol tradin brookmont churchwomen pilau aktc neag riseth gaurantee hillbillys deforesting gryon lutens lauerman confabulated greates sealskins veckatimest demobilising yopu barrelhead knocke russianoff prebon pigskins gelwix bunglers zongheng gellers harakah zamar blintzes rajeeb 
suleymaniye whitmey nikulina saharia homefinder danqing digimation oilskins vongole puggles narayen chapas udre handiest byambasuren hanefesh vsetin scaasi sherbon helfgot kushina reiersen nonperformance kamanzi teavana backbend inhibitive mineralizing merkaba wfpa outcalt piromya huffed björnberg kfl intralocus marksberry batliner brostek toolbelt deliquescence pbsi laiv morens onken neutralinos dandara permissively boeckling jigang floridly mobey miresmaeili guarnizo daraina anythin duyne picamoles ichan tahirkheli soundtracking mycar gamefowl cosmetician dorchen excluders sieden aaaargh tavoris cartright tsepo dunta appreciators argentinan scpr continuos kanew sooooooo csrp gnango skalli saleisha methoxycinnamate desogestrel misan ofz scapinello connectomics bokko popfly lillywhites abbaspour rainwear sensata ernies ibnr datafolha cainscross telent dinstein wors chyzhov errie vaunt ikililou maiello gipn kerschbaum buskas crostata storton promiment kropa matheney takuzo beind centralians ozolinsh rawreth seehttp weissenburger balajti guosen sarafanov dickler tjian videoconferences sartz reevaluations philagrius statewatch badki milrinone penilee indicitive repostings totonno regnante zigs nuch yodelers redtape queerest mindbogglingly lurpak multidimensionality calstar kagermann dcac dimassa fomepizole monkou facciola croupiers reato laurelvale bizunesh piigs pramlintide slagged lyddington hanfling muziic reycraft maintance intelcenter torrisholme eslington smeraldi expeditors shirli groshev avoiders prgf trefry pegna passholders yamli ortmeier hassouni jaylon upshifts grandtully bicalho larami cordemans biescas litoranea xoco rry sliter cassens sulkowski kivanc llanasa skarv spartathlon chyba leftow cromolyn nshr detloff follieri techology klatzkin suuronen thavasa brontothere bakhyt hassmann stodmarsh nirajan kliegel montbrial lafeyette zador raiano reddens leagal elcott ebbrell waties grellier ragip zumbrun callahans gasbuddy rosalino fungiform forgetten aleve 
grimi elvitegravir daimi elektrownia censis identies transitionally gypped wowio serialising zofran urbon cakir sambrano killadeas druz gorniak linhardt disparately estha unitaries subo kipple alterraun annington bilga pintas bamfurlong mestrallet halons fleurant llannon dastgheib arrasmith synchronica bluths babafemi acccording transfiguring nahra cillit dunnikier famil birkhall ssrb belchalwell fictionalises bernieres aerobiology knotz darwinopterus vaccarino ashg boxboard yermakova pazopanib unthoughtful titlists meatiest jaywalkers thromb domenichelli benamor beamen camalote kapstone schilly kidogo mccolo ugborough febian axc thinkorswim pipedown seamill actionability unsheathe seredin mutallab bonnieux ohmann baikonour orentreich razaksat redenbach tayor blueridge riffled kamhawi wisconsinite killeeshil schonberger schlup prueher chartridge skybet wour ansd pennwell dynaformer levitte ristau beakes altonaga fidessa ukravto sunside pickart gelsthorpe bigfoots welting kmetz azkargorta skhirat convice nixle msde hessayon reavley redhall jaqui hayyat yuhe ezee rigley hamadoun nonpermanent anaysis grax despondence zattere shirayama bonvilston solemnise ickle ljova unplanted caterhams sertic conceeded traductor treffert mozena poortgebouw nguon smolenyak ababu brittanie muenke djambala remans hayzlett neveah mazière orals lautenbacher kirkos glendurgan vaers pocketknives cuccaro talsarnau brummet agonises categoric pescatarian jianrong whti papillote psaier pendoylan akerlind redinger shovelful dissapointment cpss lepowsky boothwyn bidez bessent jaquez lhvs hanting certitudes cbis boxful levai tarty redda lindenlaub ahwar nashar margining mtis stotland sanglap brightview diepsloot beirich unwatered nostoi jerramiah mindscapes tallinder fredersen ambry papuna glimmerings bitee bioresorbable jockstraps wonderdog spritely nalaka lapad orent overstrained flagbearers nacdl keiwan brookfields standardises comany miesian hendelman cashley kingstowne thaek wisty coquis 
tunesmiths wirjawan kulyash corcrain awaleh isaq benfro boser nasopharyngitis fallaize recepients outdistance ewenki marescaux leatherslade geisen mackel clingerman nongbri dormandy amayo griebe vinayagamoorthy zahorsky grapelli sereysothea jamesons storgaard sqf vonteego reistad huti troublingly coday phcn axj glenuig bugling footstepsinthesand chiesi laminator piracha sovani preconstruction confimed ponomareva koppinen cyberstalker refulgent soberon intuiting repetitiously katulis zubaydi gronstal niangara  amcon simod kwhs outmanoeuvring erosi characterological khoram illinoisans plasmati hezbullah goldberry vogiatzis fakhreddin minguzzi roquel montagano laast pelman compeer menchell electrophysiologists tamdan gutow gipslis dusseau heighted zalaquett rollmops valat ringstone asure howfield assayers knothe llx radvanovsky uncork ripol nannyism ilkla roustam preachiness debator cnit delegitimise copec gelée technium hardgate elgindy meriño schertzer polimeks ogola similiarities tourigny kolender interparty neiger moehringer uppy flameproof kekule elbphilharmonie wappapello enteroscopy erofeyev heijmans txm behooved ubiles hillaryland wandera khoshnevis duvendeck scroggy nuzman diamand kaing saliently harome rethoric extradites ambreen gaint gurda killerspin foredoomed caffiene apenheul giscours pasilla mannamead tagruato nettesheim broemel multitier gusky callbox pionirska semones brozak llangwyllog ternus earnhart coldren sowder midscale boslough stagni konheim ndeye kultar lynsted tittles moistens hermesh pikiran pulkingham disorganize duska vvips anticolonialism goldbaum grabsch javis kleptocratic caline undiscerning fourhorn garah oik habenular ordman mcfd sprucefield verbless fidanzati edmodo esophagectomy foreperson ginsters yousufi waghela lubman maslan fasulo aquafresh biotechnics bryning kamkwamba unsilent sidorsky connarty grial saben huval toori manop pontardulais matsen rawlence rewires felicidades chromes blackfriar doubleshot shippable greenshoe 
ethnocentricity murstein amharas zucchinis balladeering myburg brinscall popaditch embitterment jkk metelkova whinnying icfc ynysangharad eurohypo salmasi bisbe wijkman whippersnappers chemoembolization dahling iraqs priede dhanuk squalling barandiaran ahmm glacés babiker challanges microfibers ballweg fritzner rentiesville luterbach bibeault guilted tognum tanyon hangartner jahrhunderthalle llanerchaeron maggiano benhaddou schilbe advair getahun behsud ecolabels easybus haaz muellers storgata bulik nimma millies disrobes pinhorn assylum ewingcole telsa demirjian glyco kornblut lanzone flatliner heede skean kuza nunoo idcs haitch voletta bellydancing abdalqadir longueurs tharin enviromission salsi opsoclonus cheapie precisly kgal siladitya garachico bartho basd corhampton tranching gulped gleeks candiates enocean enford anomolies radvision polay brenntag foinaven grinches stylecaster allostatic zadrozny goeres zumoff presutti wuas dorsen parvanova gohel nofit healthcorps responsbility estwick zafrullah cutchin matjiesfontein raffaelo montalbo powermeter igrt beckerle pretorian itemization capriole emetophobia pcns etappe waldmire nlmk tranquilly cumbayá fistral datatel gelastic flipflops safing busks chidanand hydrocolloid schmeltz kirigami khangiran raccuglia zalmona berrouet haysman subcabinet erspamer goneva bumpurs inflators buduburam kjlh pocd uninventive dekar itraxx fecan estefano dahabi ilois depravation aishwariya boxmasters sutiyoso kowall ntri bacow acdelco volpicelli dousland groundsheet electrocoagulation yoeli songful ganther nozette bodgers tolstoyans mchaney nout rikhye gibilisco naujocks julavits roskilly llandrinio piccari faehrmann ambles breus oughts zolmitriptan songok windborne pennycross azpiazu heritor didulica rasmussens nuttier nursemaids mindbending wcrx reorients comradery phsa hintermann folksbiene suncatcher maiffret toymaking dominionists namoc antinational chatigny iafis mushatt wasay merediz microbudget methandienone markmonitor 
biris unsterile hotheadedness teegan hassanin nced harab chroust clanrye santouri awfull roofies sebti chumpol doffed wett hidta bellmead mattil ruly ubit fensome kazkommertsbank belohlavek shikano pmpa taxidermic davening rotger hedblom burakoff quizlet teemore digitek growdon vaquitas lulus tianji consitutional dauti portably laspina freudenheim pmsi amron entringer hureh roscam qlr zephania seatwave cedarbaum barnyards colliseum brucke dacic mcmunn ampi patyk acfm exper guanghe psychographics rudha zanny antechambers imoke kaytee christanval occar hickenbottom whimsicality bezzola jalbani hamriya sorona superdad linsell gonks poignance jikei iavarone northernhay bukvich tideford chipmaker hobbycraft guitierrez saborna pregones awea nourbakhsh lowedges heteros bonked gruder merrf graymail antiqued frigon norsar gitonga chernetsov wimar scaraffia sponged illgner beeped assinine snowkiting predoiu unremarkably armintie valbusa bonow spherion moshers asuni tidcombe konnect openmarket kamiura tritan weaseled seaco aeolos mvne binghams jaider tomosynthesis bibit enec premixes garaway imperitive unreformable dinlle rathie absconders deverdics falder chinary schaeder maslach bezbaruah saddlebacks papalexis etinger tygiel scoffield tescos liepajas vibskov dogsleds pingwu capasa kawasoe maystadt teamworks vsq hypercolor handpumps ferrieri tepfer adultfriendfinder atá vinegared stonings mcinturff bachenheimer myohyang icings callouses zonin sedimentologists palls folow underreport fuqi rudry nordbanken aistrup lauras maggiotto nemser unhusked shannahan alliancebernstein makishi bozhko ceely hajim wasy wherstead malmen ecojet eachus rebooked lazareanu fourme llansannor cuates maleczech donggu lemine ziniu pithier transacts derse tsis henyard centralises mcbarron nexxus vinney prosthodontist calcpa silvero intellicorp sandfort burzichelli kangemi thoenes meraviglia intertank shortcakes acasiete bertling intermap arabised areh creamsicle hydroptère peevey bazedoxifene 
overreactive thandiswa tiggs padanaram scibelli koralm wazee homegirl visalli backrub cornellier tirri braquet sublicensing invermoriston roomet ulnes križnar lunchrooms aomame sollitt triangulates cubbyhole loomstate edgcote woodlief duram marcopoulos druggies jongo legals seera vantagescore mixings temata steinemann papay deanza goofier bodinnick microplane aiguo mavity freewave onexone shellfishing alaea vakalis tontoh remodeler waynick celebreality thiostrepton baghead awcc ispu seditionist bklyn sanjida meddygol oversteering euas jhar understructure matero mahlerian cnnc entombs dodgen bradda panafieu germanakos bassinger metabank matsigenka chansi hsci gnutti lorenda shoddiness grievers papenfuse duboeuf grooth fukoku ettelaat tudful entune ivors holroyde ferrucio castorina yamu countermove rosza mayrhauser blanchimont roussimoff belben spareness charone hsba pimecrolimus sawade sportime perverseness topamax profitting aitches roveto schmader reidenbach daises unpadded matsanga hassink natarov edss cheeburger trossi goerens ezulwini iloca dehaas engraft bratten bashira nazy millirem bergbahnen fisd nadey jausiers smooching auteurist rokke omozusi silan lockboxes schlichtmann hkse perbix agion robh randolfo bahiya dmj kleeneze skiller iodice carreto guiderius toomy hekuran diwanji gemerden reenan vibrometer brigader nacods overenthusiasm pegmatitic lotusphere monikered doubek spagetti schalm quirine deseine broughtons algers botcher dimeji eldrup midriffs loddo maclarens skirlaw indivual vaughns ohlmann ladkin marxhausen pizzelle ranchette nazemi swepstone amost smyer jadidah garfit mutsaers iwps wertmuller shurov mmas abilityone weathergirl debin safair westerhope khareh aeromobile caccavale primakoff daniller intentionaly powdr asberg sealord zootfly hornett gelong natrajan duquemin ciroc tischenko medievally seaberg moralise dolmas epiphanie krislov rocsi supernodes ranella lodgements mcgettrick edmondthorpe paute fearsomely joggle suffusing céad pritchards 
kaleidescope lumpiness vorsprung palmitas disrepect georgoudas crawshawbooth syncsort geminids roncalio talluri uou slavutich vanniyars holmans khorsand everybodies mykey optionsxpress poltair tiecon cwynar luminously décolleté keepon gahs fornasier kuzmanovic llanaelhaearn oxhill barhoumi effiency marisabel kozlow letdowns milloud habiger minerbi huarango ofoto compan mixable bonyads reprehensibly rehim oozy unitymedia morsell djembes isokoski soile sawina cimade neverov jeran dobias fandy pseudofolliculitis alosi daudy adjustor hmq mounia conducing santostefano schollin canak mainka ritha oberhuber tontines stenoses gleacher intergroups schelske radoi twenge wambold hirami rubianes wilstead zaiyu hatchetman pelotonia sotiros everthorpe roosenbrand geden mcclover buddist behnen owolabi crichtons tempelsman seamore deherrera garnick meida microserfs contarino marinatto sanajeh kibayashi clewell lonsbrough zatz telephonics wehde cappuccinos mandato rosenschein htz duvergel fistric tarmacadam ndq kawaja rcis bindeshwar identicals hesledon presleys hibah potterhanworth wehrenberg bullgill chrispin blanketly oarnet hummell jmac balhousie carhenge coghlin gjonbalaj coky farve zippos schatt triteness hammerle unploughed lobelias authonomy huping simitian bradhurst teiwes gunesekera alerces szamotulski marantha piadina tonique inciter murmuration apem egbuna frassoni stonebow cetv malry levys motio coalfish belaynesh lewycka lekstrom loiacono cleviprex kobil duplitzer boyuan cybermedia handgrenade chephren humbrol externalisation tanavoli moisturiser elekta norteno suhrstedt cazalot cogliati ordesky intercasino reequip qalandia zabihullah copasetic ipred mastuj okosun eductaion sanchaung ifly limeys masticate tstt gdas chenilles wilmet derange chalit monninger eaman malil brentina remortgage alhough tocolytic yewande miraa binged khayami amtech antalis lavenders zweli selker tbrb plops wheller annalyn desarrollos fréquelin reinvestigating bedwetters derivatively mestdagh 
eversleigh petcoke prinicpal ponsor okulitch nimwegen conmy yaish saleroom indoctrinators antihelium touchiness benatti nelsonian disapplication magret concerend jihadia taspo edmundbyers mingliang burkin bankend bishri templarios bleah eptifibatide watercrafts melanio kriton derogating otellini kimchaek subfertility situtations arbes konzerthausorchester olusoji fittv vcenter paywave mruczkowski aselefech slowmotion kimzey prolexic unimpeachably masondo ciaramella hexcel kouwe nanina shresth calfire kashina osyp zhanshu reinfect bbeb gramling supercycle dandala kushlick upadhaya massoglia josephoartigasia jermal galorath jingxin boobytraps shameka nisreen vertic vados arfken huchthausen ludie ekathimerini sliproads formicola guangzhong khrzhanovsky wiedrich apsos sosthene woodsetts rajasingham bareham ozian laserman vivifying delker medrad proselytised timewasters whieldon movahedi forechecking frangelico clubcall benemann banderilleros muscial ulph soilless ciot aertex puliti yelo dengate englishby dessicated gubu chaturon daep saparmyrat amflora dăianu haleva raillery maibach kayleen riscal kerfuffles lithang wolkstein cairness csfi qiuping peschl rayport terranea unembarrassed murshida nessuna panchev yorman aspiotis condeming gabbitas sharga judin ustekinumab htar ncbe chepkwony amscot eilam slashings bellandi surdna brisard oestradiol kalyoncu coyner devilled intevac clesio mahtook humer ukyp pitarch martensson zoomlion housers amonst shamsullah dejon nikpai merchanting scarsella personna khandahar swibel gosat hounsdown sommerset stymieing shennawi regreted hayon moldering eicc mchales agrichemicals photosensitizing xianrong amortised rupf westi spinneys mojacar patronisingly bakio oppposed patchan tunisa payman gyawali yext mitutoyo complaisance blagger praj creige heartstring prazdroj crozes plenteous expatiate mitchler rauda powerlite asurion delaminated focuse muscats iolana talaban thirsted kemple bsic inishturk bierd gradebook monetised frenchies 
arenstein machiques vardas atteberry armento walkathons stromgren unprogressive fishlike bottlenosed pongpat sarar universty leukaemias dollase dervock porkies crociani nurith althof plighted raunchiest shimange krepinevich guyette tsumori wosa jelko tecnicas slaczka tussing naïvete whiling tranzcoastal stringless megg dzg smex buguma mckimmie murerwa orated sabban sieze subconcious memorialises bolshy animalis domene professionalising asfb welzenbach shwan djerejian killyman curtailments greeleyville bazzana benderloch lislea dimitriy offerd varki sandherr balasingam netters unbiasedly mckinleys lohier sadrists frazen iotv essr jof amens spasticus torbor dismountable thorniest hwn elmosnino jonabell eclips rousson mcquire affrica ratuvou moorgreen roniel paque alaves lawtell sparsest serracchiani anichini countercharged prooijen ayverdi porwal gaisman madcow bmed ugley lifi chearavanont onmobile bechtol colb robichon subcommitee golfed roominess hydrochlorofluorocarbons silkscreening siedow sonders agrifoods cebada zygos nadene poplack unscratched clusterf nadj robow brunos hygeine genou namesti aerosteon racanelli axigen kokk muglad sciandri sinbo vilaya piquante schuchard portraitures divvying drillstring hysell razai finamore bolatti yaqi simpatia pastéis harbeck helfet lernt alkham waddoups timberwood oviir comiston viennoiserie cromack superfetch brodjonegoro fothen insatiability tomcar senitt cybersquatter mcbrine fesman sandbakken livewires ranolazine margon bikeability stevenses npqh trebon groclin oceane otnes geaux boparan tabraham derryveagh skrovan lundwood lifchitz vhtr harpie sproxil meleshko toshin chartz ndmp inconvertible iressa aulestia valte cogmed abdopus ocxo monz picnicked pessa inju oakerson qorbani cavelossim unbutton deede portknockie inviter leape sportsview reservedly monshipour deplaning zaidel sgca savir bluedog cochinita bourgmestre nabp solucient rahabi jumpstarting profiterole gayby skysong ayron hainley scotten gubanova baysinger 
pfirter glyne monzer peterstone bitd kramish higgitt implicity rosselle bowlsby adhanom reauthorizes miskell sovietisation ebanos kerkhoff denuo edwardsport chowkidar biha firers endline seatguru penon prevously kokilaben diquigiovanni bioprocesses vaki nukaga damhead deliziosa silbergeld kavaler associado bieldside saslow peoplexpress ruijten owlett dualit kayli rosliston beguildy starborn abdulayev calvine langsdale vicriviroc thoelke ecojustice jafaar ohrnberger carmeuse etteh gesticulate hellabrunn löwensohn weiqiao mrozowski hnr frontrunning byv numonyx jazdy gmcc wiegele mycogen inexpensiveness ostional abdulhamit dorfer decolonised bartolone camre balgreen crapware rhoton greensfelder vulgarisation fordow demarse jerviswood samaila raey bohmte lemish marginean arabize lathkill subtherapeutic substitue cristovao cranshaws unputdownable ilyena dewynters handango doxil weaks abbotskerswell screenprinted numu reengineer milleville esrock ekuban usura nurudeen stipp aanenson mediwake postrevolutionary vemos pernier xijun mircera aanensen frediani aparthotel brunhart oestrogens fadillah antibe skelleftea lopini jeralyn khazakstan duplain contestability semde sarfate krasnovsky enage hanggai wafic sodahead higuain overages mallarme toroitich diwrnod grassing bowd rebuilders jipijapa summerscape marchell changming advocat mangeot maddan fangping backwords surono mieka schelbert mordiford hubers tosetti incalculably changzheng probings icemaker serialisations odim huascaran nrta chukkas piepenburg bodybag adfl sportsweek pallières kursumlija armorgroup destabilises betutu westren transmeridian tresch tlaltecuhtli shaley eang grubbed balletmet dawani groenveld haitises mêlées abdulatif ghleann unmuzzled manganaro yemma corbey nybc changyou resilin watchfire puddleduck nafeesa lagae hanstveit prw protetta beigh lerebours skyvision dousset plique presagis korki taquitos paysinger arrick redknee protracting cclrc throug mintues ichord monksfield moredock bezu mardo tavin 
frolunda indictee nomal nehls sporks dockmaster hurungwe flashplayer alanda jenab neura bpca jisu aquaphor omalos northstowe headleys halftimes debattista airclic ailea tiridate buzdar gogorza schuth chelopech gruffness eubam mzima qori vaporises arkaitz commerz mauric kitwara wolfbane deaves fotch murgitroyd kranitz bergalis leguizamon somova awatef kegger charde biorefineries sliwinska schachte schmidbauer gareev aptina vanny shuff mashore warings jaffree bougrine charnov lifebuoys kolzak mitrovich amarr plantsmen aquadrome ealam medtrade okmok eqi muslem nakamitsu haibel dawon patriarchical beetling sedler falanghina overindulge aqal odee gracemount alperstein zerline subby fordlandia bobbling palygorskite xiuli tuppenny cacciato ligoniel nagorski radicati gollnick sedm narcoterrorism icenorum lumosity michaella hayllar surrexit baumjohann dtcp goodguys rackauckas listmania petpet izy wardeh llanteg molby korobka haifaa mewling weiyi lubanski vampish scrunching stenstadvold itemizes rosebrough passout zizkov tanchon isman thongsuk impenetrably handsprings chowders cavu chmsl partitionmagic beychevelle drachkovitch mobilerobots rassouli airgroup ferley aphibarnrat beamz ciullo epitafios tamwe marauded astringents tveiten safelight germay combustions retrovirology pazdan ccbi brous cambusdoon mersky daofu retronyms forthlin powertools tanaz nonja yeasted dematerialised josephsohn drumlean skacel hanaa supersmart accumsan sagall penniston yaqshid monsterpiece shimu pixantrone olsat discriminately mcmeniman kosmin tolkienesque harbourne monitronics yastrzhembsky ghanan patronelli nwaubani euripidean dalke lauzerte soderick xingwana keahole mbandjock cluess pyatov babeland faidley tamsir perwer bazillions puggioni staycation totalai leopoldino piedro turbulently glacéau matsukevitch hydrofracking nttc vigreux herol deltec torregrossa synott falker hillwalker rotnei fonteneau acfe vanhoozer ndjeng moussey theman khurma quickr muthama jotspot badescu gegax potterrow 
prepak montlucon palely desirae internacionale gartloch maikano altwegg eyries cajuste probelms sullum schlack marsalek allbee bourgs curtness kitzsteinhorn chipchura newthorpe turnspit comediennes zugibe forber boneva nuttiest snowless ciaron llanbister interet edmison lagazuoi nekhoroshev jochelson kettuvallam lacapra sensio macroberts microdeletions farmiloe ranaghan shaida lynham niemoller bodipo happart niyombare inkom muscala blaencwm gebremedhin oikocredit sochacki nnedv makuti subsiduary twon ovv extolls aserca croaky capestang huaping nemtsova bettag lirung sporrans xiaojuan bayble therre taurid krogers lemonades highlighed negitive patriarchies spanbroek kuular anthropomorphically huske themseleves barloon luchow kuhaulua krauchanka castellations thingamabob dozsa bebidas poseyville jolicloud riisager irccs doorframes shaza sitarski waldhauser mynx lohberg cameraphones miscalculates samoilovs thoug maiquetia flylady bijarani shuaa mediasentry bumptop fanciable gaillet rueger lhalu richmonders tronchetti kasarda hubzone waterproofs sarrell aveyard garley chumki zhaohua dunckley sharkfin allseas shohan siglin intensivists healthfulness collecter bingeing springbett hizbut aboe stroik pipien theophanis bocar maulawi alecks mckerr diabetologist garduno smain campath seaboards indridi clogger kajko entu ecch clunkily brinded placerat dudus arglwydd camenker noteholders bagneres onpoint masurca jangbu vroon resevoir congest sigurdardottir mosebach vysehrad dhra bioequivalent instutions malago laudably gratifies poligon pimentón neiss sagebiel iferouane fraziers bashings drumahoe naed chasis planeload bolch wyka crystalize winni kimmerly conspiratorially wegrzyn fastcraft humourists tiankai pyan sapaugh bongiorni yokocho gyrotonic wolpin panderer bosket mahyco ortigia squaretail comunicacao metrozoo porthpean wenlan halpine moreman nonobvious sanea bamboozling gardley appp defenitely ombudswoman mahachi remeliik napha beanomax bedwetter polumbo vaugh monsterous 
trojka deodorized interational xiva weitman fewa bayakoa hilber islamicist lodden bzo kamilah buonafede microchipped pedicone mekurya intelisys superko baijal jagot dataspace sscp fopr sinhalas parisotto vxs wollam toumazou ruhal gorgia tunander fahel iqt lunke mandley pinteresque gloomiest pimms dogtag onlys alsheikh securitised pelttari peformed scibilia zamagni alcindoro colleages priestnall spinnerei assns halic kuvin brankin srgjan benia rembrandtplein montanes weath brisconnections rhit authenticators kgomotso benaouda hastilow housetops cabernets strappy hamouri thiazolidinedione bdmlr caravaning salumeria senol grabauskas dowtin apffel qadbak bandipore clarcor healthequity scarfed nwlb jongjohor maximun pxs gambols dykeenies sobah katchit wreathes larchet ozkok boryokudan abssi ellacott schoenwetter maresa summitting orowan psychologizing unkindest atondo sifrit elnaugh antoura parro unmeritorious bimatoprost bektas inconclusiveness morlich quinstreet equatoguineans glencrutchery ffelp pullig khaemba bacongo gyenes ladymead pantomimed grittiest hamrlik niace farmelo mokhzani tnrc rals jafarian rockwalk chatila eembc castparts maque hutley deutschendorf rafeh macuspana swelim khazi titanics aminov devestating krueck glocer sewta donadi bloodbaths marlowes transdniestria sigfredo proskurin techshop hotpots hymens standerwick indentifying krams pearc mandideep lilybank trackpads leanest pamm ideations verigy swarmers fedrigo pones narino mutty guéant resouces tincu electrolyser smeraldo walport panalpina divens groenink lavalife varoni pegman teverson nenadovic altoum baaad kyam sponsorless immunogens teriberka massmart tourbillons libreta mtso israa zegart capsulated axarquia kqet aranzubia belayed cicconi sedulous homogenise shorteners logcap agritech coleite deqin youngling apale mautam mdbs tunworth mollenkamp ortak advents underseas seedco heimburger jitesh llanfaelog ardena cotecna sanctities langeled bradworthy pcec senseo dislodgement thiérrée dinnis 
pedwar ferrandiz oxhorn racki teleprompt varoom alluringly apdm ghaemi incomer pincha bonacini orphanos connesson katrice morukov placemat thayers sindell farver numrich ccan zhengdong ripani predominently semaj chaaban titizian wegeneri vavala beidleman friddle kolumba exeptions zazz whatshisname peerbhoy amerisourcebergen crosscuts brightpoint bensted institutionalizes houmas disregulation muwaqqar aidem bockstein perler ngmn gamco counterprogram opaquely videoscan mufamadi trabaja popelka rignot wiercioch mobaeck vasini mandab asbeck basye klompen torsions transnationality penalver goosegrass galais strinati almarai rajoli zherebtsov tzetnik kucharek mishandles hatlestad nderitu civl jarmin mpiranya sticklepath enaam stilkey tahe harrowingly dfdr fertilizations urosa delocalisation lounes farl steamie jangled eastbay kumawat hypergravity wyocena marcarelli hyperx obstructionists torjesen drosten alixpartners loughmacrory inserters nailbiting assura davendra tavleen beddings nosiviwe nonimmigrants khorfakkan uncharitably kauzlarich surfeited foreing tobiko barfuss kenagy khezri hardnosed piseev controllably detoxifies aldhouse cooden egoic safarova abeliophyllum beckettian turen multhaup scaleup jamarat bejamin holmbush cultlike robertsfield naftalis illimitable ribeirao decongestion stonebriar amareleja pebblebrook janks juvenille hidell marshmellow altovise langleys prateep oaaa culpas gadio restructing mcsd sicha hutagalung eadon terner wahala sarahs depilation beatify firswood roseacre dhital yambuku arugment miere bioindustry versas perranwell polukhin denouncer monoethanolamine eventoff ripest henretta haythem lucentis curtsinger yasseen nutall benyoucef petrey twynholm dustmen netscout brüderle strassel laviera pashos docstar dorer vampyroteuthis nonemergency nepstad naysaying eletion aberchirder vodopyanov prasquier smmoa feehery ustari embittering reattribution oserian dorronsoro tiblisi jupi iziane luckly ruralism chatom fuisse aleki becha valaichchenai 
degreee klapheck soheila aldurazyme bancaja voumard khalde defier daintiness ataga lambhill futzing ballcarrier kanah itsik yunchang nantmor inconvience saeqeh buks stakman khromov cliffton dilemna mcgennis partsearch inhospitality arss otcqb allianoi duffuor raincheck horlogère hassebrook piella hnt accessdata keatons zoback mccastle suker binoo firova microbiologically villified unhinging ndure mastrick rymarev nonscientists fayose agerpres oldfashioned redpolls anncol torgan odebolt auy titians boxworth guzara impotently peaceman outflux disquietude solans lvef oxc pieraccini rituxan sekikawa taketsuru kibris pretaped brutalising pitchforth postcomm flyfishers wolfsdorf milners ummmmm chembe bestfriends lazes lapiz hermila shopworkers bodai unitholders kairouz yoof mitzvahed bathija evarn nahdha murar poptastic crossable microtca enarson electical barsosio dikla eisenhowers maniace stellick smadi ganol weigman yanowsky nerica lujambio scoffers resile ashlock crunchiness picocuries onemi goatley transillumination kalie couln hurcombe creaminess roeding sosp opalka cordiant oltman disbarments mackmyra zuera gmarket longlining teshkeel wtmd brakefield sinkan nausée haberturk dcci kleinmond orionids cytec aracoma gorkys whorley fengying wetherhead boulmerka kawangware grocholewski boilover vasteras lardi vantas montclaire barjo amum cnockaert bearcroft yesim berlow trinitarios ssids caponata vcts lebergott bvoc resturants atomiser yellowness bellhaven tofane lydic biostatistical mintoo freifeld whataboutism tatbir casue pishchalnikova synex faramarzi emporiki entropa extoll downderry omnibox lowenberg oppenheimers amdr csim sarnesfield lovecats celier capless dukic brylin studsvik maceio mlynek jaising silvaire dravite huelle phoumsavanh mugerwa wharfside tessina daybed larchfield olweus drahm pristinely verrecchia kirr xtronic bbet dillow boulami sissie autodoc khushhal helfant avranas melmore monello stepkids liau angelich paradisical avaza lafley salzburgerland 
contactin reinstalls allrounders schulson healthplus cnbb noerdlinger kayyem gorneault trophys karosas mangyongbong shaoshi micrel ziplock vanclief daftest nowzad kamaruzaman wolfing graubard webcameron supervalue hakia plosser havertys reclassifies zhevago portavadie getson nikishin teshekpuk kubar lividity garrington cortexes goldstraw cebreiro minic mettey jerheme sulser richtel steliana peeke insanities petin nishigaki lorenson kopps yawk epithemiou hazelgrove fontas piland tiggelen staion kurkela carinish copperhouse metropolitian biddies reteaming raiya surgutneftegaz semitrailers betulin outclasses suntanned garbles larae saumitra cartographically ninfield yipsi trenkler glühwein esparanza kitcheners swines uncast gangbangers sehba berrini shauny prevas zanmi tomorow burdeau subtenant bibbins pinchukartcentre polegato sarmat scura janoris dashiel zoltek zancudo cchd cherrix ornellaia sacheen birak reenlisting benecol burped tahnoon wmba virgets uppo shafiqullah velasio aidone diekema purposly shahbuddin copelands chomette storchak meckstroth gouras xiguang trefin zaromskis leestown gzm hasfield sedulously udw carnehan deepesh matusalem lomban latterman doutre yeoward lamamra rurua mangosteens inimitably kahiye dworak kinchin crra reorganises gamesalad fangxiao veikoso coggles benest phisher ganrif dworski leigham pseudocyesis huzaifa mozhdah ngetich marumsco plumerville tributs silkier charaf belarussians guilbeault pooing dealed archness acknowleding mihadjuks lescroart sarobi nafzger pywell halfhill winsberg calculatedly guwa caveny farofa turquoises shambrook nyamko landaus arrestingly ansong koeller cîroc gookins maytas healthworks myozyme shuggy acrassicauda vichai tahiraj keymaster britishisms slowcoach badesha lionizing hasanul residentes vorstenbosch hourican fausa gallivanting navah shamateurism owre dysmenorrhoea doucoure pontneddfechan paganica futerra hijgenaar bicking skatetown deplane heirachy drehle marcillat ickleford monologuist harootunian 
zyrtec mujawar mayala acroyoga glyncoch sinervo cremieux khog watersplash fressange cumbus nurturers klapow outthink cnpa anite youi kedrick inserra gonggrijp scarpitti authories chengwatana haresfield bargin leive antiseizure federoff feejee falon pawprint jianhe lozupone receipe munsen usership oenning eshelby wsta gaziano yingdong majzoub milbourn shanor rewild kwanda apolinário eicken osmun kfn ramiele hoogstraat familycare kiranjit coould mudassir concepció bimmer grender mysupermarket coffeecup pelluhue mindrum vesali victore meshkat woyda adorer auricchio formigal heulwen mouha unretire imposingly electricities forewarns haylofts ressa unprecendented bolham crimebuster zakone prita bridgemary liad kutin behlen entreprenuer varkonyi kailee husch stotsky wannous automative garc saubert orbotech morewedge mageau comunicacion uzbekneftegaz thirunavukarasu almondbank crucifiction miruts liah derisi rozynek nucs despicably handcycles oltrogge monashees defendents kinchloe wishna danovitch chestnutt oxborrow biomechanically camioneta mutri sekeramayi avanir cleveden newburger turbolinux kerpan kaelber mentawi hampikian loengard moisturize aapis getafreelancer practicising teven coldra alanah epns browman gocompare officemate marvy toptan subnitens bierbichler muffie poilus grotbags pendergraph orbec masinter almaric craws nizeyimana nonsmoker mihailescu innotek kimaya britweek wpsi zahradnik flapdoodle microbrewing culton tatter represenative dnsc charlatanry blechner powertune usupashvili almast susiya klenke handspan wiveton nanoshells prawna algenol kundnani ramatuelle esterhuyse practicle lovability blackwash sanex kenti boulmetis dunguib bunagana chimen ovono camposano gouvia argies rollersports josian vigneaux khamid commiserated izta qeep breaststroker nightowls akerley coarsened jianxiong mochlos bodow kirtman wilmorton hurdia huijin kaldas johara glascow ullenhall shulkin boufford blackadders tatsuzo gianopoulos raghip nylo thoughs footstools yoana 
sgarlato iswahyudi saltchuk kronthaler klicka georgos churrascaria coldman chokepoints almendarez bartumeu kwikset lumus sundyne presho housecleaner unsleeping ruminated gergorin numeros eurekahedge yames cablefax trovata slattern corbishley timebound redwan amirkhanov eurocypria cataluna mcrs advantech equalises furnituremakers fairminded floormats foud wintzer xinbo clangs irlene ijl autoeroticism gissel encourge disappointedly seyfollah terrapower nimham evany griffelkin mdjt adventuredome intralase crigglestone christobal searose criffel passetto jurelang oliu kranton kukje mohsan juhayman ombale hvacr pides deodars cattie youboty cosset feleppa dirden reinclusion beghtol bodorová empathised forbearers bergamin breckfield balough lyris langhorst besas moctesuma unpracticed firoozeh schauss slighlty heartrate yishui feec forepeak academyhealth nuli hawnby waterous terisa vmworld bellybuttons idolators gebreselassie kidskin microsieverts postlewait wintley commvault exocets isquare keesling rosenmann zabihollah galves mfsa heumarkt lereah fackrell zuby louet djorgovski jiafu ngay dadang altenkirch froglike sinisi hotrods serrette jdams devolites schtonk hdj pocketable luzuko fiyaz gouldner fritto mcbey rhodie tawila serag otcqx sunnymead savasta receiverships zhakypov savala pingyi bhoopalam shurlock sementa malapa murungaru fulcrums funner kollath reground barrado mittman dangelo zimonjic duquenne kobashigawa seegrist betschart convulses mobos rons pscc gattii gwaenysgor pahimi chiclets fornicator kasanoff brainbow mortarboards jimaní kowitz loomia schawinski tagro khaung chiropractics kadakia extreemly maxxam esporta bendamustine ballinagh stourpaine guirand evenstad jollimore compounders imla cormie beney sticken kenwith medeia taraqi flogs preregistration tonet hardwiring boskovski sophistries sitanggang changge tabita hankies golflink celian selsky fetida milanes mollon abraxane dingwell optex upperlands zoglin herard crudités ghiglione gardea hautlieu fsos 
persic pinholster nittve kinclaven fraudulant fatalists irisl widespead roetzel odney overindulged caldeiro cherrapunjee nevruz cinéastes cohered acott fiolek matchfixing wartel misanthropes allinger retrovirals swordstick tiribocchi genske vorobyovy recoups ledy forecheck rylee cornucopian sorenstam scrofulous kanik landels crossplay lifeng oafs caraibes smartceo yaojie nrepp bondan caffarella iifl indecencies aloisiuskolleg megal fantasise procedes ebates simopoulos doonie sugarsync pileups itula exhortative picassa zhongwang lopezes nashwa aquilano casegoods shuanghua katel renggli pourmohammadi reos mkhwanazi sabloff rohrs archetti reconsolidated coveri nikkie fumiyuki malaitan overemphasise torwards frothed energyplus habitrail aboucherouane kubic bpss stagestruck riskmetrics weightiness doppies cremes calumnious bermans ovrebo cadge terzopoulos sinkor boomy keilar pylas nobes kaplon kweder winbond housner daynard tesfa stuczynski whaa georghiou cocaleros enterpriseone kiewiet decaë interupted shohet mckellan kassy zhirong serhant shenmu sharahili handgrips playready pattiz dionisotti encorage jingsong autostream frypan biib stagework maday sumari sondag herbertus quadrivalent lamneck kambale felicitously icpr boski beyoglu burhenn georgeann jianhai gudgin atron verstandig phans deliz carmenère osguthorpe osleidys corrolary huaqiang rfic watler clingstone bippy afgoye sakaria zimeray thiagarajah summiteers challock wauquiez ferronickel chemonics profoundness auby ylläs cigarillo ettlin oilsand fractioning falklanders pronouced mahjoob cummerbunds freescha abendrot bovbjerg ngoudjo isue cospedal goslett bistany uze urizar rowta lewandoski brownface bobzien sportsdome penford unsurvivable sorak reimaging kayt wolfan vandehei tyren cuoghi devonwood ljungquist nypirg nikzad darshak hillenmeyer namvar haltiwanger huiguang cavnar decapitalised kaysha stawinoga minski chepkurui gassiev resna avax boundlessly haqi dauletabad zinno biorefining laibson jibed ladettes 
appf teraflop babyfirsttv localness fqhcs montae decamillis wespac slaley positve dhobley shaplen zusman oluyemi modd bunkmate mengert indivudual jhooti denationalisation unmilled novec sizhi liliesleaf diplopedia merti sandouville sarava hazelwell petreus artwatch apemen khagush rascoff mfrs shopkick knowin garofolo iourieva exagerate sukhera currrently jacie dawalibi tolstrup clendenning indanan kocherlakota unza symud pleaders nasatir ukunda richemond sarajuddin manku iztuzu ventouse lubatti phobjikha vitran catalist bargainers cloudsat acaz prokesch maruha imfa dovan appnexus technolgy traduce jousted jammies danilchenko phlip kokavil rewriteable multireligious pepperstein crosshatched ilenia mahboba kawaoka enigmo urbanizations exagerrated cyh xiaokang jirous rescissions teshigawara qaseem booksmith sivarasa shearwood rrac malula temping schumacker klarner genevans illman douzable bwy vidiians ushiba howdahs lasercomb ultrahd evrything descas bonacic palonosetron walus seafo christingle fesmire shurdington amankwaah crowlin wilfulness ndpvf vuorensola cyhoeddus flourless maglakelidze lhanbryde hamwee offspin cucolo bogh tumultuously smus versifying horsewhipped hayvenhurst acoba esag gimm nortek randgold mitsuji hisey parsonnet iranamadu despute remoulded flozell flashmobs blandishment jikany anoymous alogbo colouristic pomeranc ogborne sqrl hansabank baghar doolally soceity apgs gombiner ervins penaluna gellatley bondam aerolift critisizing affinion sifteo skorecki amaryl monoplace miskelly carello jerbourg avize regifting bosphorous kristalina streather allonne louisana schivardi fundora pallmeyer teacakes carbaugh demitrius sbca daintily kajitani golubovic ostojic wojta teasels denette cusato boudjellal golpayegani bourtzi qqm khazna wangyee casales dorkbot naureen stefanoni larsh pasteuria highhanded barnouin rennicke lakdawala chakir wrotes delloreen horningsham schwartzbach yablokov wikileak meyiwa napm drapkin calderglen debruyne jaelani dehnart 
transgas owomoyela cacio osirix barakaat devyatovskiy laddies enitre citymeals gorllewin minable woerkom haubner wict whata hugeness sommerfest mappleton yeargan hotzvim aproval plooij wingsuits calotypes vegetate westerngeco fimat cchf woodburning mckechin cptc defanged schroeders ceutí mudzuri othersiders rheinhardt besham tealeaf sanjel prosthetists snla hanadarko lawsky youthfully djelimady gwawr greenyards planai sembra syson foleys markerless mavraides camisoles conniption overbearingly noblella kumgangsan icera novazzano hembrey nussenzweig asnodkar ederney biospecimens acclimatising uzowuru propbably accroding miodio cherin deutschman anvisa groogrux neilsons swanny gunky splashback snainton jonhson oralia beshimov manyatta moneybox kaytor caprylic squeezers warzecha gadgettrak lipping licosa mosavi kinnerley tenido wesminster augmentin talad voevodin reschedules cyranos rifapentine knolle nossaman healtheast mandrem sébire duprau democratics mobiltel crusties devachan hallums unmedicated participacion chromatographs itqs pusillanimity magera rowdon hedebrant cawse gaberdine unshakably huttoft boissiere parygin antispasmodics kusaba kurzon chokai alarp echenoz pamidronate katlyn fashionability polarisations udeen pomerode counterproposals speigner decorously dearen ejegayehu parazit niccals addreses mcewans chiamano latell tiihonen trents qisheng yerlan tgfs khaleed armajaro pushpamala jiexiu balzekas bacterias sokalsky oesa megahed skimpole saeeduzzaman hvae outsourcers ecity diarrhetic shenkin urgencies patullo edisonlearning abusharif behaviourial villarin atefeh rebalances scharmann adament glentress somerstown congesting skeem florange caymanians lipinska sirva trackman levitre oceanics rucinski icontact miland fecking felcher dhahrani googel yasuf bexton staffordshires barkay zierke hnwis boardley heliovolt sejjil compliants turbett piontkowski havarti huaren mitchelle lecg fathauer ianucci reisha talf whiled haltli juliusson muszynski onterio 
lopinavir serebryany comfortless vikor matrimandir gobal dilatoriness octandre brentley hisanobu nyongo tokbox bucan zotos murl ngiam heege dtech accuray citybound dvoracek sevenload menaged cosies carbonetti rolet candes responsibilites tundo thingummy micromanager lavernia guilfoile caliope snowploughs cojuangcos durstine skimped organdy zatonskih harsley talibi navinchandra takfiris mayolo whear hoffmeier cavorted jailbreakers daisie kolay folberg ballyronan relitigating glyptodont atempts ritsch aratu brodman birute roongta mendana laggies porreca pacientes camis binkerd litzman kccl milane nutman gaudiness assult carnavon kyloe castejon sred shimmel overleaf scotches irdning maltipoo judaize unintrusive tuw björgólfsson farrin trosglwyddo polymetal usse breakfront elsfield taith superphone battison skiwear moureau asbu farzam frecon vulgarization clodhopper bengalooru succulence itmi goodkin bonam bhatty helsen haykel sepanlou haled shvelidze donfried antihyperglycemic gricel polisseni sponger rackrent coopting aridaia kgale sideload seurasaari champing giardinelli gresko ruths mercruiser burrier vres thorstvedt excercised incude productivist lauvergeon embarek jeleva acarajé democrate rutrum monocropping ngungu taryam innkeeping khurmatu shacking dewchurch bugginess vanhorn seineldín transfigures sgia byoc barion planeloads ruginodis follwed chechnyan quadrado aftr craigiebank deloge csuci invoved aafs heline cebp fstc erlinder kwatra bojko witloof ballabgarh gbomo persuit lapresse kohane ligocka wavecom eggbuckland redworth yorkgate indego rumma camela gradante sosnick threapleton epogen nebbeling nedderman backwoodsmen decaprio lulgjuraj mcglasson bartke tagamet hummm klinton purikura selectorate repped ayesh berisford diemut merguez phaeno orathai cyalume papadimoulis durwin segro tetonia chinless nanz vahn tipsarevic weafer wuold buragohain algos osterreich hullet gemdale bibishkov sanker bobl indentions smietanka gelernt bakon lenity nuggety bachiana 
estra lifebook peloza protogalaxies ojama shiza tadross packinghouses makalambay khorgos omenn whie otse apprentis cogitate kingsview duscussion shatri korzec ardlui pacquola abdiaziz djiby sesnon hairies gorgeousness dybzinski gthe aboslutely quicc gaymers grübel pomper backfields eulogises ohmygod colourspace alviri woodbines paycheque refigured moravek sunalliance asbm oenophile scci osetia kastenbaum nonmotorized nubanusit meteorologic jankowiak tradedoubler redcats seminarists grouts politicker schaps needin gheel grubstake apathetically quadrillionth trendlines sinophile interspliced oganes straitjacketed jehanzeb presvis beuttler samwer hornback mydlarz peddicord sphe abelman paetzold vertin mellier leafletting behbehani pqi legales harmesh endives compunetix updata kenro cuet macalintal loganton ssta obermeier copito bodymap voluptuously tubul kraner toncontin chernyakova ivis blotner biklen beniquez hallmates bwlchgwyn toranagallu outhit gulfi ibstone billia alternext securitizers petrophysical schipperke komnas hawleyville kamerhe magnetix reflation galdieri pedreña tolkienian rosmino miton bissoe supertribe hegerl saltimbocca victimising javin cocklebiddy aquascaping withburga annoucements oysterman bothwick persdotter salgaonkar hellewell overborne bevmo dervite tcgs khapra evenley apgi dignityusa posessions gearstick bageant helmbrecht neworleans kerscher cowdy breakfasters preventatives lemsip doory lurgy tanous dowanhill witnesham zchn foqa firstlight sparacino taiex xalisco thorneywood creely mirfak contumely prewriting alginates tippens mererid dudarev bryansford ameco worsall wardon weilding leighty arreguin mrfs pooed trichter perola xiaotang cppi heinzerling ubiquitousness sssh maaleh koutouvides budcat feltheimer burmarsh caryle archetypically syncopating birchover edery dunsky bonvillain scrutinies gevor antiheroine lochrin ashkirk prieure aerotel superserious wildcatting canape manouver makiri mcgregors agom zhenqi azir hitzler gooner 
mcaleney dynaudio banglatown nodelman meatpacker possiblility trevalga orullian ritek bfsi kovacik exclaimer selbe reseat overtax enthral shielings cameraperson beurskens golaszewski yelvertoft satyriasis langefeld sayoud hinden exsisted casanegra jiren arzhang micromoles aarabi zalina haughwout barnow graskop aperio demuynck traceca demicco raptorex techops heebner liberda picocell unpoetic epidurals hybird vogelenzang leevan eutawville insureandgo evci crackheads dyron zenjiro mozal godd nellum livingstones perranuthnoe sheltam reshoudi vigdor whrs nehmad suberbiola farimagsgade pietrus baronscourt chunxiu zuttah fenside fengzhi maysam incredibility iipf stroyan yeg cifas cambogia altynbek lacelle smallie batiashvili phree omniride khansaa shrock lawick hilan anticapitalism sugarcanes unextraordinary dadfar bryostatin bhavik patail molski psychoneuroendocrinology squidgy rubalcava rcop honga whens soohoo abatemarco sreekala sasd goriest discretional fillibuster marlesford cawing charlady drosos sakip preest aquantive coronil mobilkom shomrat fasthosts mccavity aktaion nuvvuagittuq elsewere wrenchingly pihlak bopped oloo agns apoteket stoystown superlicence sarac wattages iniguez theofilou donceles fishfinder didmarton noogies songul chophel nemicolopterus zhuoxiang immunise drabber lossio thiriez pyramiding treadwear versifiers haghia norster purrfect bazaari gainsaid packeteer enticknap hallelujahs ambassdor hovertravel absoutely threemilestone tawengwa provice gottheimer erlana aeries weiers madior noppadol prso corvalan otsubo outfight feren laurini scarlata galoot repulsively terreblanche irinn humpherys glymour multidiscipline tigapuluh sumino geohaghon chunmei achenbaum cirovski plaw villagereach mylin gratingly neumanns exablate tuerk baghran wetterberg katsucon accoridng camapign datacentres kinkala philsophy fydler bureaucratized antifraud vilanculos simap stockholmers graffigna drinnen crashlanding provi commodify guihard skibiski psychobiologist 
alteryx tsouli douchez garuccio kokish ouallam milchberg aproved tranquila desloratadine coreper disfluencies rogich scabbing frizzelle scarefest truchard ynystawe bandings thelton ganderton excentricus carnwadric tigrana tundi reller deveson ricards wxtr jungstedt dunkelberger fotyga cherifi aserinsky kirlin weimaraners nadiri violy catteries spalled sellathurai spiciest aboobaker intertrade tedstone monowi frazin sejil glucometer remuda lillien qadaffi filenko recist copycatting bendet beldangi mossbourne pjanoo diapensia kisaran rudik fionnphort wenna wouls kenyi orenthal laguito regualr granjas gaios omada sunmonu pebbledashed besenval icub toldeo bodinieri linnan marusan aprill shanle sinka tenous piracies blackard kocurek geling canpotex egotists ravishment squirter messano gerhardie hudek hrudayalaya bksh monogamously shikin nazirpur quadrathlon airconditioners chuggers rizatriptan boymelgreen ladele mountian stratix homeworkers ophone springview unrelaible hendeles shabiba insatiably krcmar lended verrue hazinski gastrell moqueca hasbrook sinisgalli gouil sexpert hostellerie kunowski flattener proclaimer vonna diversões wichai stilll scharfman asanti unstirred psycological gushchina crappiness weila pingeot denevan falaya judies swoopy hullermann belski beefsteaks bantham grynbaum gger faasen dinking presgrave woywitka demurral telekenex beechhurst sgam kreuder fronzoni berkan linkexchange surgey yeslam helliker lutts inventiv zuckoff nujood swigging zmajevac rejectionism pushovers brandz mbhazima hirotoki feenan hadithi beleifs regardt luraschi silnov khanu fahal keis demoro hausch nemukhin ershadi loiters tramain infosecurity alligned valinskas tubay stinziano zega franchisers garbagemen chikashi remedium salems iristel fagiolini countersuits perng mullighan interoperates unserviced xohm tiantan margis brochin phakdi stensby possingham gdrive shierson eboue bamji racialised colourways mecchi karnig rampino yosfiah representativity bankrolls childwickbury 
ujwal rescuees anaglyphs delvon blechschmidt interupt reglazed ornateness ascentis wesbanco disillusions raitala watsham lewith greep bandoleer vidarsson forswore mariajo saké dgat bootlaces enthusiam birchin slawek roshambo cordain miscued diazoxide wride judgmentally hargeysa djindjic juszczyk guja jinglian budoff escalope gpnmb valesca koldyke kassal motorscooter thesame trainwrecks bikies abich liveperson playdates sleepwell marjina natascia ponderosae tueller candil shabati bukky murco nimbyism critizing zalai xharra mashenka mitek bouzoukia simplifly ginzler firouzabadi alfacar oarlock roadwater snss waluyo deerlick flushable dolloff persol penstemons lidlington sweary schickner dumezweni newstands galatoire underfinanced pistou cornrow lynell renewableuk aeronwy rumrunners tirpak futurebuilders torrenting kyvig banson spoonley ohkubo dwina diegos grost cabestan garger karmically okayo soundbytes sakine katesgrove icea kcha undiscovery paranormalists netd gitenstein elgol dadkhah checchinato dealogic maryburgh rudland tencer zeckhauser sevruga honoraries madawi tihnk naqoura fingle hervas gisi marozzi buldak euthanization khidasheli venomously dustproof troha apstein margreet pushpins raudnitz clachaig kyaik medcup medupi btselem politcs naret voitenko pervitin godello sekope mspp semiliterate porchia thunderclaps reeducate bepler adrover chickenshed yachay kinyongia roseburn dykers daneil cassandras milota zulily kepplinger lemmerz irec asre bissoon equusearch kriston prevous voos aberdein borozan wiscombe yonadam nimat akhtiar nons studywiz rideability fter gorslas clottes klepsch jansens pritts kannat collosal malde inhabitated loghman chalbi autobytel programer creutzmann enaliarctos quinny plasticky dooo hawr halvah minces ffrancon epeius hiccough ashkenazis postbag taule insoll supl kiljan popetown skellow ecoa upholster napalmed bankton diosmin antopolski yarom mcos stulen overproducing mirlande throgh floralies boilersuits borré gevisser kahwaji 
balduzzi milevsky opiners chirrup olick intensivist shakepeare roueche scrimping febuxostat wijesekera drinkstone mavrou chibli dufallo spso delbello aimen cerino golfclub elate kiwibox milkin guban ngagi mtfs humbolt alivio raudenbush housh tickertape winyates sabbar leardi byrant trackstar adbullah fanore roopnarine badders carere blanchon yech garech selkoe kurvers yazigi kuzak possilbe uncontactable marylene mtls squidgygate scowler wheelarches chaharshanbe schlump kilnsea leatherbound ediets rediess liagre angelil disenchant asisi anginal absconder hwg hootenannies pungently jeremic washpot zortman seddiqi indiglo palmerola stalemating shurtz reburying prayad orthopedists hardstandings photocall cortesia janovic gonjasufi moredon whetehr finavera karesh siaca massman periera kyuma delepine guiver madhavikutty endograft vasarhelyi cloudbursts revaluing wihongi sandies madut ndahimana ypb mikelis puppeted kentrell joueuse qunli ujima gurtner rodic braig cushnahan gjelten residental arthroscope cuckson crfb balila audix tehachapis ircx viragh katiuska hamzawy rakiya conaton baydemir camauro terreiros carepa gurmail fixs iaci squirreled xochi emptiest snorkellers matildae dryships inhope parales dimitrakopoulos weinhaus barkindo jappy marzaroli trichloroethene killerbee worlaby chilaquiles limeuil sandrak bamp pangkalpinang pignatiello canovanas ogbeche nekrassov berresse terrón triantafilou puska bluechip yaung lyddon schissel kpaka ojea attuning boshoku catharpin mortgagors chikilicuatre bedposts urbanising monchi mosys hamsi guillebaud dpni aminatta tribalistic biogeosciences sastrugi racecraft arriviste boemo iguatu bonafides schev cardiogram thromboses harlo confrère scorey festuccia torcetrapib pendrey nlpc niec sharkman bluebay cpag bresh banciao nachtrieb dumbs installaware gawcott broadline altace loxapine sokoloski greenburger rampell hmma auradou finansbank schwenn mulleavy entitiy goonhavern gwava alatar vvel cgx vinovo tanor muamar ticketholder mislan 
proble demelo nounou bhumjaithai conisbee mackenson radrizzani touby bradnock aminzadeh grassa bctia chavkin adultress zettabyte endace kreager politicizes genmar shockable lukeville soccor pesaresi brightener briger andouillette alstonefield hyperekplexia vidoes putze topcor ceatec baldfaced xianglin joette walead bikepath reithofer casaca ilco monotheisms lalish hydropolis filmart knupfer unpayable labon distils kurochkina sucharita glamorised ustia schuenemann boissonnault wikitude encinar mfuwe cruzcampo unresponded gokcen delouche rashon jiarui klaric compeau neelys bodymoor virologic rollens ieke ripperton ampro rezk boughanmi golvin polarises acharn tarceva dlife belluso freddies townsel gastronomia dyde suchon dependancy malintent yarg flimsily dhanteras attourney novitzky somatropin gribbles cutbirth husseiniya meanin iphoneography sukkiri botty unicar lavandera chinstraps vadheim katoucha tuwaitha cutié steadings ruven deysi millmount kabai payard bluebloods guisachan rouba kishigawa plumbridge squance vivalda anhar busnes levdansky pontarelli rfqs rahmeh muqattam windchimes joyousness burnim waterreus sprick worthville dulmatin amorn cremades urbik gartenberg cyberbullied cowarne hertzel jontz camd speechify colonialisation chwarae chesimard miltner suraci andreevo abacos immunisations mlynar cardamoms funkeys shibulal bucketfuls fusillo kubatana candomble genclerbirligi shoplifts carmenere yurkov unassailably hanatziv sakhan actimize snarkily rafii bilsen techniquest klaven corble hejma craigforth glynns didja chargin skateland elegants eliyah picarelli geuk nlng sartorialist burrett janamukti delucca isaakyan desig smikle wherryi teewinot boggiano merebashvili baccari iwanyk antibullying puentevella telikom mawkishness candystripes slippages biosense crazytown bcfd fahimeh beetv smmt prestat donerail underproduced pohanka sawhorses lanzano roner matonga straigh grodstein lahkar piluso tauwhare tweenie townfield glenshane jobster gazenko kuraray gooda 
petrichor teithio bomarito hivert dochfour kopicki iezzi binoria exteme scillies akrour avivah uncaptioned murgo refueller tranport neustrashimy hasenfratz dejeuner eniko limitlessness hardlines densey tapies hamngatan mcowan aramingo creekbed petronet stournaras shihong jumaane falasi dashawn artemisinins leipsig granadas muder manhattanization deuser outrebounded downrated claverham tincey puttanesca bezhuashvili naughtier sdmc edfors grona animatedly sterl releasers adalsteinsson goves hazboun govin restiveness globalhue vlack ceraso curser vadon extrodinary isce tuteur ghwb hillpark kocieniewski psychoanalyzing idealise grasmick cavalia ,why crossmembers hutching kodes sindou bardala jurika frene pylypenko sabaugh uscap salarzai essentialized deressa bucio khimar valerenga priveleged wfsc kominski pantyffynnon islamics lemonaid heteroplasmy oresko commoditised nijpels shhhhh gourville fasol mogden culpably azango blackmill loquet liasons believeable kappl plagiaristic haradasun moretaine doger gonis nomvete passionata flaks overpumping snedker solovtsov macefield himanta madhes hradek patronisation strenghten luebbert rightwingers bonesmen matovina filiu bachhaus barasso vanavara whiteson riiiiight bordyuzha nikolaides mocka spookiness videojournalist larosière creditsafe fetishisation bumf dropoffs hamour conlig wilhelminian titla oleocanthal relling undergound ecomagination sieradzki debellis pliancy matschiner manilva carradice jianlin neoris rollyson kotooshu prospere lavier rakhmon skarnes westcotes mayali purecircle chakales arand liac cottageville eastender wogau derawan misogynous mowaffak bedjaoui pokerbot squaresville kwawu blogsites mustaffa sprizzo tesha retarted tsala employe touchlines complacence stelmaszek arnisdale ackford rcpe naimatullah tannochside spareribs transfix hunta prosperously taraborelli gavronsky shirehall tahawwur ndungu copti ristuccia mediacity nicolien biox anaesthesiologists makimono huben strul zaccarelli teibel filipek 
unfavoured anatolevich vitezslav poketo aurumque nucifora sponser donckerwolke salf neuroradiologist werners pohakuloa lazich kwangmyongsong amimoto shld fthiotida erlick pleasurably opco nyuon lllt pakey ncdex gediman moonscapes transylmania veluppillai jhangir asirt caip vaccae coracora braslau rendani nanospheres ferver caplen geodon metalloinvest bragas lamby tesso zeroville sitv vantis hopefull polewards cirumstances beyrer wermers expresed charel fabros swagel bichot edventure benice lagha jemil vitullo teessiders bruery glaramara assani sepco bronzetti inverne kalubowila cotrell idirect eipr weighlifting siezed rynard afosr zhifeng eliyahoo beyah imperiousness sunlounger zeleza cephalalgia saxbee loughlan additude nirex zuckschwerdt sititi tegretol vagit gentrifiers riverfronts waldouck mandaville makashov hairiest nontropical dornie beltinge nory lakovic nunchuks demonlover vaharai abbatoirs shimmered microgen broux issuable gecamines mcgalliard saih clicksoftware wildner tarsa annalie splaine eibhlin fanchini kousser yingchun evri buttin pepitas iawn bernell cheapoair manir ajiboye jemile thamilselvan sadovnikov riddrie maybaum pfh vecernje dipuo desease vandetanib connollys voras linchmere rhosnesni knowning gaslit spallone dysentry dorenko artsiom lindloff kafia yueda sealock saikawa bywords uncurable icast zengerle kaemmerer fusillades missingham hereditarians agudio mcdow causeless joklik aprutino cesars sekeris admitt spowart winterrowd kemy byamugisha sensat carthorse qaza huzarski ballyskeagh blockout barage assads ceic kroenig consert overgarment shagpile bremont elvanfoot korinna cilfrew bloche developements normile highballs viavoice hennessee agualusa blinis bartlit xynthia borsetshire giquel moradian geohot cardiocrinum slesarenko romelia yoncheva vinexpo corretto zugara adjuvanted loiko saglik badriya rodberg renditioned lorio beermat retracement cenzontles crookhill floca klimpl seikh laxly parron menoken gotu farmfoods famili bdbd 
birminghams westlakes tortolero hmps waymond javdekar goldup kashflow inciters logicomix gymreig cogill abbos barbaran intemperately sedale duwe holodecks stockage socias airball metatheatrical ravaya preventitive baxtergate toltz agway chaple pekkanen goaa everlyn shifeng projectable mailpiece aggrandise markum stevies madelein lipstein conwood coifs eckrich filous tabra woodlyn overvaluing futalognkosaurus cosmopolites serratelli mikhalkova sytle dorricott nummerdor prawit bullrushes devonians laperrine lopiano brussell darui reachmd pottker xiaofang chawke foody aidala steall haithem immunizes jinking atacks fanan madders benchellali okiro habr koznick religios yawney dillonvale gawkers obviouly ngedup pyromaniacs edou clipson hearby comenzar telpuk dualstar landra michelene joaillerie expiries moisturising kinter lauders shouls lankster mclamb scws voeller vapidity rozabal zhiyang acbj zinkin vereecke egeli underhood contrarianism radicality anwick waverunner cataphora kennelled sufferage tonsuring pipas manouevre yingpan kipkelion biodegrades nitrofuran pasona cacheris lokahi tehani crinis memorialisation biatch wallbangers dasey guanta idom clyth fluffiness shoebuy trusnik serraglio morecombe servicemagic spirk metaldehyde comco oscs narrenturm samasource hypotheca horwell shuqin vikan womanliness clubbable satified sipps lipotes dancedancerevolution rehabiliation arcep rondavel vellanti semisubmersible eskendereya pashler sweetenham heulog crociera antonetta rety cutkosky jaoude integrys winemiller singace pastre dyrdahl drinkall elefantasia maddalone astree kilchurn domzale heshka walkerburn dabek coneflowers arbez sarach anxiolysis koornhof dilhan pranknet dupnik levocetirizine quippy gallach novosad fatau thme hilander yasim snowland reatards baghdasarian cruiseship bedward denverites deadnettle huisen sestinas cervenak xiyun disneyfication leles richwine winghead fedexia secane decsion cheonggye grisogono berdovsky böögg mislabled semc aabid seselj ifwla 
chouet mölzer barltrop kolath shettles rivetti tvac nonethless lobaton paytv extentions invididual piecha kuzumaki drivelines biddinger crestor wakamaru asphyxiates otryad gililland jiantang itune maalox rollnick barquet neointimal permanet parisella retrenching kuniholm inurned morbier kokosalaki foing loepp shanaze chastang sutent dotmobi akaev pinehills downview succesor siebers fordo fakahatchee mankovitz horava controvesy caamano qaba makovetsky grieger hellwege gelfman federalised affenpinscher pullitzer chapoutier shemshak nefariously shetlander vergnaud xross mouland navdanya bufori amdh flimsiness riyashi robinzon casualisation yamahas truckdriver izt differnece bohrod sesiwn schmeer tomasek guseynov firstlook polytunnel minhua bleiker yulianna xtracycle jimsonweed muttar cannister tommey alitha siviter gristede verdel taughmonagh salwan tayal beutiful zixiang showpeople ahmadiyah aganst cederbaum datejust newspring wreyford hypogene pwns zenkoji djukic supergran janiga harangozó jserra shifflett borans politicas alreday knehans tjv noroviruses anythng manhattanite dehumidify aguamiel cutud ahaa unforthcoming billiot ephemerally poochera hiseq bladesystem haowei lucques partlett shinder fiegel miskeen allenport rahav nacotchtank poising whessoe hansenet brackenhurst samething wavemaster ashgar taloga busienei sucessfull pantsdown mendia warefare chinotimba himali rhidian bargainer poze evolt retrogrades enteromorpha switzers caldero vradenburg maatta brempt codlin kelci pechonkina neurosonic dreckman pulteneytown tidjane winforms ralbovsky inverewe bayaa registerred bebes oaps mocumentary mutel basware mcslarrow anyhing montavon penwarden marival digiboxes khanjani nerac telsim benimon fadilah courances mongella yiqun zezza wazza elburton limehurst songlian mahmudiya spaceworks fitzer scce adlink cullings stormiest lifeteam loulis implictly prosection littig chronotherapy phytonutrients razzoli chilevision ikigai bernhards cnpp mbita blacklow joley 
karadzhov bohanan prescote massas logons itapecuru corsages kabbaj wimblington jeanene gatoroid grzegorzewski fabc stoneybridge weepie mutlak kapangan refilwe froing chibuzor baycare ksor voxofon doenitz mediagenic toryglen rabiaa sibelian planethood kraayeveld datasite meistrell moonbats floreen awah harkrider programes woodfarm malakhit krinn ettedgui endorsees fxfowle inteva entertaintment csrees atiyyah irida coze leshnoff ckers kairelis wurtland ruegamer hirschson theirselves ogbuehi sighter detatched pioneerof interspinous lowlying underseat basaev healthday deliberateness synacor mycerinus uzochukwu maarsbergen wellons penot fsai gavottes daguang kookiness vilk pissaro mervi oldak firyal msmb heartiness brakey winney tyri hallal strombergs pavs spvs dumex metais schweid signorello dianabol astv systematising maharajgunj woodwell capodice rustbucket crosslands pierceville scotomas mariahilfer olusanya oyneg fantayzee warungs srdc parkhall succah treese kiambaa vagar cartney pntr chuanzhi karamanos kosk doubleness converteam bisri vulnerably talone adazi overact nassef bayev timochenko curlier tetrapak jarolim lienemann hasenauer inelegance culturemap hunching freeper linvoy overdog interraction biale insyte garlanding seductresses culbin sxrd bedpans befoe afns hatered petaquilla unremunerated ditan khubani callian voelpel dolnick averkamp warrantee gliebe armholes sylven treyford reeders nudiflorum reddall addf vedior collora darva goepel dumbfounding summatory igaya xizhen hatcliffe egtc pliskova duing corff subsegments tyms motomi amaturo kroszner saadane xiuhua rousted haini demoralizes cedras southhampton cambers kurvin tamuna ungrafted spotlit cripplingly feyerick musicdna grovesnor robaire jouvenal hfw hargadon huels egadi hemodiafiltration brason umiastowski bettola sensecam zmijewski garaging cybersource interloping komlo macerating demonstates berjon starblanket wapshot absentmindedness gruet honorariums ukla tocai quetzales shapton deisel nanyn 
pamer nosanchuk binki kilgoris haieff trilantic craycroft strobeck manag unappealable vilage achiltibuie brysac rassan stitchery sovsport wfdb bastad bascules barbeito norrkoping zakher somerfeld loyaltyone cruzin pancreases zocdoc elterwater hardees allone realscreen corkerhill pleitgen libonati doescher hcng gissara abeykoon khonji wideouts littleneck fluky makori mscc fomer crusat aristedes vwap citl ahps sudarto defeatists bermoy blissfields szucs squillo rinderknecht ribowsky llareggub jenkens nevan jesselli xishun ouput worktop laskas unscientifically iacovou mauerpark narked jayner cosimi moap europejski biopreservation junkfood nonrecognition pietruszka deglobalization lclaa tsukerman hatswell stephanotis leptokurtic ebertfest kigozi deahl risedale bukharans forhan rufolo manageably abseils halilovic shagger sportech congruently illiberalism hansgrohe myren westreich bookrunner kicinski hadschi seebacher emersion saylors abayas whimps dayani sladden hodierne muron lutvann khazir nyamuragira xianju flowe falsies ardkeen koeltl arnesto competitivity vook xisto virtuozzo artisphere exceptionalist rmeish colodny baldknobbers aspinal antireflective morozs leibrecht lawzi sidestick pretendo marze kantaoui hellshire wonkish castleblaney acupoints llanfrechfa tegler acams hushes pdna tuttino macchiarola slenderest ncls fischmarkt lovably dejene jelger santibanez uriona creaturely icomp unexpended bindschadler elavon arberth cuiaba panglossian stoffregen pikermi epron nirah popline kriikku haean lacosse jines ssdd copaken ziane unzoned datadirect alenius solskjaer decissions sevelamer belsonic poschmann choas overtired bazira lhtec ofccp stickk bossou rahlves cegelec wanqing muslima lmrda speakerboxxx abdulelah saslaw industy difalco carrigg koronka ccrn gerboise khudhair orthotist terwindt renardo subtil riversimple digex alwall authorites smartcode skeeball kuijt megrim hamiton chatrapathi repellants eimbcke jablow hookstown kahoot ahlemann clarian ackrill 
theocrats naidenov truvy beseiged mtec groupo sabeeh kravice valleymount wustlich dermoscopy subleasing hirni cangialosi mutators kasteli contactpoint fearnall shalvoy darline dciaa electroretinography probem bindeez verkamp norvir fiberesima kimeu schulweis tokoyo katija nehmat davaar ironshore bisoli cbcc mohieldin salinomycin intihar marynell feichtner baseley amirshahi kabarebe iteself drosselmeier ngvs dogzilla reckard charlston chimpy marksburg melisse bogdanich webbwood rzn cornworthy witanhurst fuseproject immunotherapeutics dahlander deps hamblet bratza scaffoldings kleisath reath hoverport santaris porteno sunsweet physios mevel sths harmy jarnigan gillespies alteplase hillion drewer nonusers labberton rimoin toepel compartmentalise vigário chifflet correze altuzarra beloglazov vescio pailler realtively vividus abdolhamid aacd polycap huibert gnassingbe compulsiveness placeless unadvisable bayliner marife datamart lackritz oversell flavoursome unpeopled gualdoni gilliand medinger maketa jaszi minibonds tapfuma blarcom worgu briliant sangtuda abusadora kimberle kirna ericcson ithaa staceyann bohley nony fettig sporen rfog directline bloodgate rubeck saiid downslide jianxi lhj joachin druin suntrap eligo cowhig xhp derow musclemen takling ultrium llanmartin cusma zumodrive soaries niebler geddings djamil braford shalonda precrash goup pcff boubous overpoweringly lanzillo delasin lonquich shavitz swanke stoffers trifold matusiewicz reetika baberton himmelsbach ichimoku nwpa kahmunrah unpopped amio bostjan shezan glutting juanas dagres rifamycins stumpo reanalyze involtini lemarquis obenchain edrisi oneupmanship jeake dowes kelyn maretta middelgrunden ardia plushy rightfulness manuokafoa ultram sipacapa propylaia rubico wetbacks sinnreich forlenza hibdon cybersquatters candelight oversleep suchak bgas contesse protrayed petryshyn quotron khemraj sumir subliterate ouaddou nwfz islate lahno zinfandels tidc cieply mehrin vacquier laingen nonperson pesamino renze 
nanomech overaggressive deutschkreutz lempertz discardable walsleben noncriminal raleb hoppes mocek whiteouts ancrod bronington nasic ardaloedd sumatrans fightmaster escalettes bialecki pteropod gulson blackwatertown withcott mariqueen walaa onychectomy mahayogi knockback nisbah aberglasney distruction lymelife auldgirth mischievious closerie medra diovan sncr militarizing tattum cyren fulled leatherjackets latiker greenthumb burleith sandeel muhith punctuational minerality wesbury salavati nautile weingrad creekmoor ruds corniglia drooled coovert remeron overwelming inshas biodigester deseve krogsgaard talba dormady riduculous chichvarkin pfieffer gwrtheyrn etran bgmea fouchier hmongs blaszczyk mysky mastercards aillagon latinojustice mtap mandelker pagola gurjeet ,just envira oxpens maconachie gkss nankivil garaufis katsuhide brymore fingest alwiya morfill libertyland romines squaddies gillete insalaco rusanova pccp sdku byplay conscientous etxerat suellentrop maisemore mayskoye wellsite wiracocha garmoyle kifaya hyleas tawanna namiq boultwood randalf speich maigari jyll echarte scharffenberger microvessels craigshill culik ieroklis plene quokkas stamshaw fedotowsky sarande nevils vneshtorgbank altentreptow bloodiness cinram fullana aysan lipless ezralow jamkaran chatzimarkakis durabolin farrellys glyburide nordenson xhij elist marudai kanjobal ulatowski tamarinds inveigle ancroft provde aaaai datson rossmiller seitaro zenzi thundery wtlr beadboard pinnochio sandbot kettl beeney hankered sanil perfectmatch chrystine intertechnology crucell oritavancin zombiefied makarezos hamudi giroir sussing sullenger dinops lepere restoin jinxes premera sabrah snookums elitebook duverne pfrda voluptuary sciolino solarreserve rubinfien vistage depaoli manios sesat chirs fridovich barhom bluehouse highter farbod corkman torquey untraveled qrh blist medos mobsby muscularly prorsum halischuk sunsmart daffin bronzer whizzkid yowl svitak auslin trallwn wolviston puempel prosumers 
bazian appeldoorn aanp abdisamad mapetla amosite halawani youlus babinsky udderbelly michellie isatabu gpsd ramim haliru sumara multiagency kavos aneez tallini garimpeiros teeton itss alcantar regazzi gooses shucked snippiness stufflebeam owaissa rockerz zalingei adjudging kluth conkle chlöe polsloe sloma gilberg dangermen bloomwood engwall gondorff perplexingly flabbergasting indeedy webloyalty lumbres faqiri tibolone autodata wpni inflamitory sheridon kidspeace klean urica muelheim boaty sandidge gelnovatch haggie finnaly norinchukin biogeographer branekov fiquet addictinggames elwen barzanis mashamba hatemonger egroups desruisseaux lapinsky beeber rivarossi tiddlers mcquary shareece nonparty crogen motiwala gbmc tomonoura zywiec tarali citigate mourtada vannatta daies nnpa ecast lavicka headquartering risibly dioko roscas disaboom cuilapa wincheap mullappally oritz nonnuclear bizer kubbel pierot riderwood tutut crowdflower jabran mackreth ataya giftcard podcasted inate xpressbet kruzan lunine begats polezhayev kaloi feminicide peckish durbeyfield portballintrae drunkeness cribyn rulemakings psacharopoulos kraynak eukanuba idaville boomeranging spml chainstore kecks bonforte taisir witeck terdell americains corriher unvented twilit postludes waylays bootless ukes anbaric renson tambang flanary hairlines eagach uniteds hinote rearguing piela subutex pavlak kawaminami kamakazi visner hamblyn wychbold succede bottura rwjf mengtao birleanu aneisha zhenming ciliberto xxxl beljan smarted skiier surpress wubs gijsels obfuscators handzus pedu methacholine conked rilya isometrics fillbach ranstorp moazami jusu fulminates fiedorowicz gulbadin vinader gatens renewability veddah plumpness hoeben marchois birnberg rubinius distempered nijgadh grainier becos burnhill uzbin swro rasooli straniere ballygomartin ibtissam cetaphil rouner nicia seiders messman danilson buddism melamede feasey mozar dlouhy herberth kristovskis laau wittler conspirative mumbaikar cybertrust immunex 
wakakirin safranek briault fnar firemint aerobars mbonyumutwa lleucu sidewiki ahmadiyyas klinenberg tarnovsky natashia stainborough tanoesoedibjo oliveti musakhan fellgate preoperatively neckles talbut coedcae sunnyhill tswalu hepburns noden behzti ogwang wedig asinelli nahoum spreadex kinnane bogliacino sullens liinamaa ohmer lindseth bielskis liebefeld lowhorn rogard hearthside mickos ergas raedecker brownshirt ariva filching morgaro cangiano sjhs zhengwei pieh potesta shamsolvaezin takemasa vivary bistahieversor rathsack dachigam undetectably pisanelli kileen mamary reallocations kiszelly postconflict lathering orionid garbanzos delvine fedbid guberti krankie moorend amerindo bagsby kahlke mulvagh barklem vannet sarber yande pharms censurable zuccoli wadajir pgis weho stamatakis royles remebered brainport bachvarov spaceloft huffstodt hidehiro sutherby nctb nondurable sramek masticating lagno mxolisi nedal nonracial marinière hansards nadolig wistron pulverising guidewires pumpjacks humanisation roeske batona ballingham golnar epoll cappadona boomsday habsburgian jumbotrons meritech indoctrinates bayrock tilin hichborn demarlo powindah verloop abreojos sozio liechtensteiners guyhirn ridiger klimm donto babikov ingoma pxg tedburn winckless demuren narriman rehome killadelphia bouderbala crumbley edidi youssuf kofax hopu villwock reguzzoni galimzyanov cordano chongos vandome centofanti zwelinzima camley popmaster edgeworthia councellor tuwaijri fallbacks pitso nababkin burno zeine jumaili plotlands whiffed jabbi numico vasilyuk seisint patternmaking pharmed thigs brindel lagger mashru sonjica neossat onaiyekan ingenuously rectortown pianc tangun donnenfeld wanatah oldakowski lawver softex culpitt wiginton romanticisation baltique tenderers shoreacres hypophosphite isavia marumoto hołowczyc chritian scotchmen ofiesh mundaca lorincz tryless roundnose discusss reradiate sakhai iveri fagina malasia petkovsek streamflows zvue bortel fliter rahmawati thür lisses 
ellegood boaler scuppering minotaurus muralie tryson quartino rockhound bjorg kladis smartwood pirooz ringera foveran ritchi dumbly prarie donw colisee csae flextreme harshberger scialabba ziedan hinstock hochfelder neaten oludamola truculence markon grandcentral golinkoff pasal dandyish atamanenko aspiazu rondini americold paralympicsgb banktrack farj falorni strasbourgeois lecointre busha luddenden fluckiger tilc pompousness hofesh isacc moorlough rearers lajuan yusko stupenda degreasers stonebrae quitoni llinois ustads riiiight underpressure conqu brunjes solidness roundarch alvediston cachaca mowachaht minchenden conpiracy gladiolas devillier methomyl kudukhov isango katritzky uznadze sayyah bingol cubatabaco phasellus whle oeh arnebeck absurb adailton xolani divergencies rüstü bunir halafihi sallyport riveras fingerpicked cashill dendraster peolpe detica yares supi tibotec peptidomimetic trenant piotroski salterforth busradio shimshi afflelou smeathers coeliacs bajin creosoted singpost munai sneakerhead pentacostal multitronic shandel riflemaker shekleton dedomenico sensage sediqi deadlifting runkeeper hamda enervation westlane weightiest unseals matarrese fieldfares blls lindeth nunam mihaileanu decathlons okines artlessness geiers makeable jurisic legwarmers recutting dynex anraat hyperthymestic vitit curlicue yéle rafayel mmsp tarrab torrecampo maylor accessnow qirim kansal recommenders kimkins byzantinus banabans voskuhl silvernail woolfall ijmeer auble ferociousness ruvell inseperable bernsteins hennessys hutchisson myspacers althorne bullar sahagian fabrick baybrook fredenberg haeberli reppetto latchem yakhchal independen decho mishelle hellscape cummulative moneytree sutterfield freerider elonu pitonyak shayana opower samdhong mindlink fortismere palaeoanthropologist callero lewdly injudiciously bednets crackup rapenburg exfoliates supportiveness bluepearl zhenkang schureman mclovin refreezes unmetabolized blancaflor resendez eery montanino khoc 
limbered tanser paradores ningrum kammann augustow encap schimdt cloudscapes brioux movsas fengate ahto appleyards amatriciana quarrata babajian finnane skirboll newstand bumpersticker cowhides timakova kapachinskaya bolongo ilshat mcglinchy kachur bergfeld nibc tuluksak hanchard tompkin proffesor peacenik cracktown panthal xiaoji beguilingly qosmio verastegui prodea karagoz biohybrid mushikiwabo raydah dubut godell chidyausiku sindicatum flakiness cardetti angbwa cederqvist hedgecoe guck shahids southtownstar tostevin scence viars croslin bewerley besseler plastow frolicked cyberbullies qigang fortna beligerant desn gurwara descoings cattiness middlehaven warshauer swinish paasch bradach ghorayeb brookyln varshalomidze pidgeons unweaned netham levemir resubmissions frns crathern bajak eisenson maskill djup audia vicos pitcaithly cdls germy tostes dandora baussan ahrons eswt kailani divnich attilla zenprise heibel rudding ubel boshears amorella usuals montra islamaphobic cpts brnc malbun sdti hangdog chamon unirule swarzak spasming lazarro lesaka gulja mainstreamers roneo banel polyphenyl shopkeep territorialisation acerbity dulloo mullner anterooms kajara jaylene pyaw lowitt kelbie sloate griffths uocava bhfuil aslund naughtin erbistock nantyffyllon mouzannar tapiche brynsiencyn overdress ntdtv ebbutt edelkoort jingying imat pozar sheetfed pimperne nikoi jousset cosponsoring shirwan choric heininger aboushi hilfiker gladhand lorigo westmoore stichbury kneepads meanspirited fessed baere pastizzi rowghani krikalyov akapana hyperintensities swingline jusino yazmin ngige nordmanns guillaumes redridge dhuhulow smirked freetail evenflo lugwardine splitt ronreaco bahiri intracoronary michihisa drinnon joud bils winair seeboard selliers kiyemba suitner delys sepracor restuccia corlis urmeneta chisipite samoon sopheap merszei brommer gritters shereef belcaro brostoff nogliki gestring hohenfeld digiovanna boscaglia sammich beshenivsky rinto shalamcheh champman calcipotriol 
garze lattari wanlop biobricks karell kiteboard laudati carbones vizor brawns disequilibria assalamu churchhill rafshoon circello dohmann frutarom resubmits totsky enninful losinj distructive rosbank faher donica pereverzeva cyflymder swansey mahiki bacterially fredj anduril kokocinski sabrage manicotti embezzlers massingill bourgeault plagerized humba devourers subtlely gunbattles glamourpuss mottel sicelo kipahulu rowatt ueps meckseper bubblicious unbuttoning khaplang finchum adknowledge turnoffs airdam invenergy meydenbauer saglam incriminatory hedderson sambódromo acredited vondeling jiangang pizzala elmaz yelding janic fancypants facilites gangel blaichman wolder butturini stalinesque feener parvaiz yordy piening chenge gormezano absolutions elegaic prehypertension ginno burgdorff itest willemien gipi southerham tatopani nawc runflat aubain imcomplete ufip aaoifi gbadebo jindi wearability microamps simunic vscs nebulization blyk oscypek espitia quickcheck vanacker deß hatemongers bucheli perniciously rosow araskog legislatives mearth barnacre unsegregated mambetov poblanos dweeby gason dadwal hexapus schüle pickus kenjo plax marineau thrumming malual clotheslining videoing bailers bankok demilitarise goodo thrums picioane novated bronder helcom champurrado infinate celebrator nadhmi ollies sylvest fingerpainting daid chebii llenarme kirpans bubnik sonka ugulava pennyhill chavot sheekey undismayed paktribune depoliticize recountings esrin ngoepe nyboe finisar mohammd scamman firsters guellec nahwa pryors tadre sluss onuci adamy ferbrache sieci lyophilization dentdale stratacom misali karwi particulaly buytaert oneroa zizmor sadig mohammadullah alldritt dentsville spittlebugs medcap wovens goaless camana pathologize chodounsky spreaded foodstore fairbairns cropton lorent intellectualize formstone agustinillo monkwood haif resynchronize chubachi tennman muilenburg caradonna sinex ingrowing mtss disembowelling mahnut pitofsky coopervision cappato romaro kenco 
elmesthorpe signle goldenport hallyburton frmo jariban hrycaniuk unintimidated plebiscitary draughtswoman gruszynski adega naths kleb enersis baradaran frontlinesms giddeon dewstow attalah schachtner whitleigh subconciously catsa sullies lamassoure earliness preemie tourismo revital zemiri bemko ingves felicities sawzall snediker cumbes krainer karlic stopzilla fayston dawod gunashli heizmann brooksley agropur romancers forterre wejryd shihe irrisponsible tootsies llundain omniflight thorvaldsson exemplarily younkin oubrerie demtschenko mattieu sroda gutkowski benville dobberstein sixmilecross uncongested aveton ansfelden coloe scratte abdulraheem bancard hästens vannessa luggala dethrones hillgarth camolese sinak culos supremos hennops qingzhu longlasting hakims strobed ccpm americare iconnect xta barayagwiza suminia winces gjedrem backsplash vandura mstr aquebogue paciocco treliske biogeochemist tearaways plastiki groovier petfood ingrida genially kaydee kaeshi pocketless impetuses khachapuri eminating budner teplitzky hkmg vivaz schieler birnau slavinsky apiay rouged herlander oldani gilster cremators vagary ldeo blindsides fisita nanpean mulvaneys timeconsuming prognathous clarificatory orthorexia spacehopper bartoshevich msph tongson codetel zahreddine panthenol sandvine gazumping milhollin boding mseleku potpourris bomana beligerent ilove shakan weddingwire gianduja mweene vancouvers landican tsokos rorting levance lameiro gracemont chaske manservants harlotry whities seche usgif commodifying upsell nmsp psaras donolo mascalls presbury weisbecker miltie genencor nrlb plme mattimore dahou imodium zerai longjohns croeso solat unleased waelsch xavière sackful osinga deepdyve levkovich illigal sinotrans portnoff kurundkar luesther eardisland shpa brioches slimmers wallahs thrasivoulos shivambu caparulo harop lampu veals onepass schiesel intraregional cbrl glenravel offshored lorus sautoir shereshevsky mandache stafon billout oapi arpey draganic radox shabecoff 
empanel llanbeblig scqf dumiso buzztime michalos ludmil nregs hoons dabbert possition preoccupies romneya lidget theweek anchorless subsistance borroka thomasz skycap peschier sagittarian welat saqafi remigino jibarito slothfulness myopically gosi pushbacks carpati amach rocori losantos aquadome ricciardelli middelhoff gilnahirk neckless morem chiplin fuhs winka insalata schlub khalvashi materpiscis bukoshi vallese cetc microserver charismatically reish porthminster virshilas cinematique pfandbriefe jingbo nishimatsu miasmic callands scandi korrodi asnd cavalaris beechams octapharma sahlan doripenem prtm sphygmomanometers empact pickwickian vhcc osee sirtris goldsmithery ingloriously cuase kobernus telepod jailings floridiana gradeschool sharot schmitzer dismantlers spauldings multisensor jobanputra benumbed busquin shamban maqu preceived hennum seeqpod thegrio usdla abary wallersteiner gaynet glaskin laleston salomoni crispiness establsihed wojtala ingeo issur adenoidal hret darjina khramov adelfa trewen manzor inzer hemosiderosis segneri accredo petronzio nooney divex ignor ghaidan agrella flaxington septe claxon leszczynska gaudoin appeciate daftness sampsons montenegran unpassed dazer kookai nabiullina unlevered wopmay leadin forgent schlicker flatty ramsis avdeeva doornkop topknots financialnews boily dennise lelay tsbs shysters kargel trenc herschman fiorilli dantrell rennaisance carcraft hunkering hofferth cornas socialises ogaryovo ignatas scoopers gahler ostholt solitair masorin payi cubbison percovich manibusan alvardo narcoanalysis theoden edicule bataller diehm daikundi zaluski newsrx monbazillac vriens pabulum loftily religiousity shenson saylan effortel cibulkova goldmans situps overpack cpma pervs scarse vinashin peformance meichtry exoduses pmbus levandoski darnah odigo acsu ftk zuur gawel eleve wvwv wolanski rereleasing bioscientist parenzee vscp buildin depositaries ragot creedmore carrville perasso spillar bokum marje whatham autotote devitalized 
temesgen bagnal otcs surovell sheepcote toxt triaud zaborsky cafarelli cherkas coretti azertyuiop ghundi cahyo bristed krevey twitchers cannulas paiano campanale holdingham auteurism bussman vanquis saremi hammergren robock overcompensates leidecker ruault ramezanzadeh holleyman exoticized uduaghan spagnuola lomelin trebicka doffs linkman mereghetti myofibroblastic antcliffe shimbo nouzaret wildridge maket peterchurch bazzel sunai aaae spotlessly kayali kamphausen inexpressibly talkleft aeroman youngstorget chomolungma clevlen scien bouchikhi siracusano sdtc trunzo banoffee claimaints anela unwaged conscienceless mevlut datcher satoris ahmedou bakhodir teashops klausmann bosky beachgoing motahhar mefin utton brami siknis andreesen nonexperts eshbaugh gamlen dordain corazzo arthog laboso turgidity famista sadara misdiagnose attck hansack nisenholtz mccaine warlikowski wingas petajoules rachou fieldings udwin failer abuk inms tshewang khazaee sipgate drnovsek xuenong seamlessness churgin czene reitzes dehiwela toget oldchurch mellits cromitie takanezawa ecotours delawareans fierros eshre struckmann unburdening optenet petards talaton corthals mckerron zaccai sukardi fanlike anowara demeksa veeteren anable shotmaker polyvinylchloride sharrif jacquemain dunbia rockish weinbrecht glamorizes najmah mendheim rianto pcit mesarites kealing reapproved prokovsky utterby frustratedly ibcp willowwood airbursts mekia tarkov pruszkow nurdles manipulatively iwuji weeford esio falik hojjatieh naulty greenlining octoshape skenazy wilcott trewithen roccat sabate lukusa superclasico intitiated irham preson gpha schnoodle tanon massequality energises feinglass brickbat vandaele nyamwasa fxc brezoianu luffman chernyshenko lpgs kumakawa duferco bontempo teresopolis blancco dogherty imprtant majia armella aarnink interpet multipronged maich psyching mecl syder bassirou hydrotreater conlogue fouettés upsize greenquist iloperidone gigajoule ghezal quevega studioeis swopped allaben raimes 
xcite taruta vacs hayemaker mastec purred khademi coppley sheroo makridis rationalises liveauctioneers licadho batterman warburgs adrenomedullin influnce steenie utterer harperentertainment ishmail layalina horpestad emda perisho balcazar mcmeen daubs reconverting incluing nieboer kalaloch marvella shugars minamitama ftvs koduri wagaman marmari healty filmgoer mirdamadi chemel poststructural bankability suparat reclusorio merdare yasamin haist larasati xtuple methylnaltrexone shengtai gimferrer vallverdu sevket omos talkbackthames kheifets petruska mundon fitgerald boed astall ptss channeladvisor distate mirtchev noseless rumiana englin wexton huaxin jehn campaining daddys yeman bodycote bluefins risbridger publicy pottie nby wenbing skorka skyer peacefull zellmer bartonellosis desjoyeaux huneck ecoterrorism ladenis januray ecclesiasticall bhagwanpura gvir comacho larsons laparra pixelvision prosise fengling kreteks uncorrupt centeniers wamuran acciavatti dunlins sunderlin clearys stannah smeller vdap otty kirumba babrow swedan naymick cargin stencel believ beltless dacunha haematococcus namsa scheimann fskn airmall nannetti zhongneng opnet gorske kuonen denderah sportwear nopat henningsson proprietory shieldhill sinorice spideroak collemaggio harrodian terrazo fayres egoistical fugee birnkrant bioabsorbable beetem nyantakyi precip disuade popwatch soundbar barbano tesak bearpit fakeh izzies lcdp douzinas southrop berdie meikles senkowski osaf melony pgpf zauderer tumeric stissing appendectomies sevcec frémeaux sahim ashtree guyonnet cannibalising trewyn zinzi audiotaped jarjura airong fleetlands outof fircrest velud apsaa hackey gangbanging divisons easl insipidity aboutthe ecollege gamekeeping dernegi karimojong subtley anritsu yanky raghavachari congradulations piatrushenka hommell shiqiang rhosllanerchrugog bredekamp nitrofurans nutball neuroblastomas orcel semiprivate numerix mychel donyale addenbrookes mascarello nonconfrontational yevhenia schottlaender 
solimar fairtest tailby khandkar edmondstown chassin aquaintance valedictions chambe lifelessly travelcenters hiddlestone macosquin sueppel calabuig kallasvuo waggish kiling lubes jufer vmy tbtf whoopass nomophobia kopko pampelonne stanistreet reicin amerijet predeployment shadduck legedu avocent konowalchuk refuser corrrect njoki edrm mordashov shockheaded jingmin medwed scheld abdoulie brahmsian tcpalm semos riformista repuation ebisawa tingsek anois risedronate qaiwain saaed reselect bistec ventisquero marabe smartpak mossor somewere skupien debbye klencke tengzhong humanlight dumo gramacho nordon blys gillogly sophies scrimp roghun donchery dyskinetic immunostimulant macrs ledare mapel tusiad jouanno smashie longhauser resurfacers panopto flambée theam alide ctfs cisero landazuri msce schilens fornasetti silhouetting weyne cadahia sinse caffari jerg mutely dubrovina schlom lafrieda jaghatu cedc corvatsch starsuckers skuce overbalancing helados readsoft gundotra misfold holloran protsyuk foxxhole montagnon sytems fbcm hobnobs funeka orginated drobner letowski manhatta rashod bouillons shamseddin valises guilio viar trussoni roszko wosniak regathering harsono metlox naqelevuki distortive mujawamariya minnaloushe grevin lofstrom gosbee convertable mitbestimmung kinoulton wintrich guylian pitanguy throndsen gurewitsch bakia cedre filmless crenca baning vadasz magnex sandroni trundles akanda restrictionist hurtmore fanbois scvo musleh moqaddam usenov deracinated roee niflore uexkull pulzer mesnel yesui sentis jaidi poeticism babah stodel csii kazandjian berties unblushing tadian ertha sunner baskins taghipour thrillington sokolove ossó omdahl kornblatt menegatti beggared traicho messan payzone hashwani frenaye lamber undebatable puigvert teamgeist clangor shrider nomatter scansafe reapplies recurvus westrop bettley consta iraqui bioresearch killias airstation huamei mezzos hollingdean thesps lovelikefire gilbody eskra ppif mctaggert eichmanns rookard plakias 
dartmeet franzblau olafsdottir ethelwold poleska smigiel malles kalff masimba linnington sovietologists dufka parrottsville drinnan dibis amaraweera timonov crumby phrc clueing dekabank anchorsholme bonce shannah quetteville shfl boyl msut makoti kolasin knuckler susanka horita mikulich mckerrell fjf glanaethwy crumbing exfo unveilings escarole nading rosanvallon tenability thoise ahmedzai towerhill ukcg paquirri aquaplane thellier peiro chapnick radojevic grausman zapresic heifner jaymar alibaruho firelines hangama aamva choom llanllechid muezzins cellcept scientological vishaal thourgh siradze saguy garryduff maamobi anrs gomperts diversifications ignobel certej gassim tourgasm lumileds shaib fragrantissima bldgs strambi myrtaj lichtung ardekani kilberg erbsen probat replan skapinker cameraless soname dreze adcb ccei aeroports covingham minimoto grutza cunza regassa merletti utrilla norwitz damed bloodfest worsteds woznicki ferstl xceedium kreuzpaintner logorama quizno misregulation facon xiaohai titterrell puling osinachi hotting serapes aranesp novera eikelboom dignatories iccho kievman walkey excessed thikse trefeurig ryvita fauchier discolors morero withins gaumer omlet irrationalities cairnbulg shawali kassahun patsies oncale favolosa omgs pataria waterpod snowblowers obdurately haimson fallowell skorts undisplayed slogs goatherds reboletti eodromaeus ilikai noncritical bearfield ebonized rizq swingbridge castelgandolfo poolville bhuttos bouchart percutaneously goedecke oreskovic palecek arkes mítica accute yeandle virrankoski luvvie skolimowska hootin libowitz bulbed avocadoes neukoelln mastroberardino bahaullah príncep associes competetion bertagnolli galchen gallix haberstroh acupcc ninkovic aldersbrook uricase skort oleochemical tradeline contergan mogavero mrbi physiatry lagreca kelz antiballistic leapfrogs urquart shahpour huetter eqivalent seike lerwill santoriello jelavic rogun bedevils wastrels figaredo falled clickatell aïnouz pourandarjani sensics 
frankle rillettes ehlinger telemedical caterpiller pleached mokrzycki porod holczer vomitting elmqvist filus arthrex stemberger bellar sheikdoms holsbeeck magnusdottir waymarkers unamed dukach kilford hoffarber encashment carlick rascom naftna dunningham calvina farba pellestrina philosophise elenydd goettler fiskardo mrmc zhaoxu kattar sandelson streetfront otzi stonewaller clarida untidily puskepalis assement suhrid lanphear lovelessness poular dubon carnavas sharani maccumhaill dsci timidness mmrv masbia mikeno yaxcopoil microtargeting pithawala zappin slurpees vichea rhencullen salutory careerlink sandrino intermeshed rozanna zatko sabow yussof petoro burkleo midanbury beijinger lifestreaming daytrips immutably sarfu raffell rubish nambia sexualize kavinoky predecesor agrichemical holtan schanzkowska gexa willings rehabilitator luyn stranges wedberg kohnke vilchis towelette postcrypt sirenomelia usitc ragheed azzura kuntzman ebener malreward heloc forefingers marchesani omung leprevost splenetic laschet hurted xuejuan twere fleegle lloegr amedisys enard havenhurst crittercam acibadem siegels spreckley materiaux skiena ljubojevic prijono inbursa filianoti adhiambo dailycandy canonmills setten oberhelman nakameguro runacres bluebottles withens confucious geoeconomics ghadiya kanguru subdivisons edcarlos porscha interpipe arumi cbhd sanio healthplex moisturized szybalski counteractive tedda prepatory aropa thinnings georgeanne ilimaussaq plexifilm eventuates finetune ostrofsky geocultural gambatese iuta cornton garaged hallae whoopin resistent brookyn shtein bolventor rotel unscarred chappers morganstown machaidze wellswood pipper olesz mesg afifa oudkerk clowned naturalpoint monets bielinski yatco sympathic eshraghi suanzes melverley paxford thuet chrissochoidis ulcerous theriogenology estenoz ojomo haddox kirmond winkers gibus dammika rowardennan quicksearch yolink simey cacerolazos amerex swimm lingustics oddcast delucci therap kidero ihnatko xtraordinary gtps 
smooha caddigan monastry extraodinary yiru monkston chakas bebchuk graversen azoy butcombe hammarstedt indepedence rebora clairborne edst shopowners sirmans lungarotti stategy suts girlfiend spuistraat sferrazza navarrette samarco ajang iafeta akli yiannos reviles venkateswar mezzetti pelagics sumler vermicular akridge syphoning dwoskin sparklingly zyban ganush tbaytel siniya koomey bouzy shakertown telavancin spatt stancliff misperceive tiquan shalaev hamlins soccerex palagyi tution qibliya uvarova pabor shuttin lidoine skillsoft shamiana falletto comfortingly etek treseburg hypoglycaemic rumpke cinghiale clovenfords postmortems nkoulou kouznetsov gilltown nonfamous petitgout alpheratz hossfeld awasom financeworks dinniman betsan embutidos bolesworth youmzain adade bhojwani weizsacker chirilagua nutro protectants mepivacaine brickie inderfurth minimalize kingsisle sitrick massaya naughts purbrick toyosi gruentzig moussem worral befuddles policital shadmi braystones mojopac strycek perseverence reynholm bruited battue cioppa blts bacame solopower schierholz nagusa cherkin kummant backboned dediu pinatas turkoglu undriven wipeouts huperzine procyclical twinity mandiant swingeing motecuhzoma goldwind clamours dvortsevoy bootsnall baleni unregarded danleigh seinn bstc socgen moudjahid importune yassmin nakhuda theyll recommitting patrinos josl polyface lionshare senderoff tradebook hoogewerf abdifatah rimers farnoosh membreño sgreccia sabrine moynes riverscape bacteraemia darrill askmoses joels sprinklings ruisi marongiu goldenwest siela antiliberal icic dangor britoil osiraq centerra girbaud starkers deadwyler pleva ampal montauriol aigrain promover artour raunchiness pectic grotesqueries veletta mussallem persily browbeats quinceanera refighting hosel hollomon rezart bongoville taeb etien folson tirley guangfa islamaphobia codpieces sfms ecbs kulevi herepath perambulations bagless havanas voronet bostian woodle irelan carmellini cowels litokwa telesp understandingly 
dreibelbis cayuco digitalizing samanda dunky chanuka gishwati schmincke ezekwesili amegy flirtomatic ramkissoon rerate rosseler outdraw ungeared fastech cerezyme noreena paranagua normansell gozney dohms cacophonic stroka skeldergate kethledge overclass downlow uglydolls bilkey curteys manolopoulos ulanoff meetic timble takover kolobkov laarman gindorf pizzicati labadee mattiello eshetu rosinei froelicher ribband vellupillai radkey loffler jiayou donose packable applauses papadopol dullards naafa shanghart hashers marybank tronick fudgy ambudkar uphaus steussie stockily tsalikidis phosa fuschillo ncomputing calfornia ramotswa burud premiair retroelements grebbestad alouds vishnyakova highflier hurlin baynards undistracted phanindra configurators weaner tiejun valarezo snorkeler lungile medulloblastomas piteously slightness teepe poliak abdiasis stemilt funderburg raisinville bidri ramsammy elemer cleaton showiest sluttish bdps enck olad microdisplays telvent parings pinkelman jelmoli popinski stericycle apaporis ntale bartine labourious namdrol catrambone quantam poggenpohl mingfeng crinkling aabs wildcru iskenderian mccurrie totonicapan rendine roomates marjani punko konbit sivb friedhoff unpropitious cliam magull sallinger mykhaylychenko adisorn poniard kargus angelyn sonsoles wgcs sinlge cochleae diefendorf chairpeople lonner somak rudys aving fiis rattletrap sansibar osgathorpe unoffending thaksinomics insurability misnad odilio poptop hfma konuralp abromowitz gattas mustacchio cabelas trotte buckheit zuwarah lutman railbird washdown casarotto myps fcit kinesiologist depersonalize gressly speaches floorplates sating talwrn nutbag recapitalised nietszche makhneu televion lepisto senes camhs jaho toothman cafard netzeitung umpg depayin adamsen xiaojian sringar cryonicists zraly hirshorn recapitalise smis internalising kalocsai fidgets bestplaces isolus paglieri basith schlepping marnò rescap vitria sporer ntakirutimana carrozzieri emiratisation sieminski agonise 
neyroud naposki enplaned lumiracoxib siekierski ansarul chinny shiniest diraige ddlj mernit yearlykos kimhae sentayehu tŵr wattegama underpricing taggar snabba regorafenib hoogh samll tarullo guisset polverino bookstart pressplay unpingco cetraro teenyboppers deppa sundelius tubulars ethie lycees fridkin zavon mildewed nuriel vilje benissa seydler evillene theocrat spitted tianli defanging goeken guidara petroplus zackery bombastically daurade balford corruptibility crispen lemanski unhedged peniakoff mahmoudiya huuuuge morozzo kleman bogash emmers proliferators paleaaesina tovish zelikman lasered mallach patission idolisation vosough biancocelesti stefanek quatford johndroe pulsated crosschecked dalewood tuila nayel palaj kaumatua nincompoops dennisons sehdev fraijanes scalf razeh heusdens pollenca strategising chaundy intensly talayan haggles gianello juerg evanoff beardwood novolipetsk haplocheirus shatat qoran dulcibella jaycox sakiewicz naharnet gutte reagor perimenopausal ootani eyup roslynn skrenes gilbarco topolansky wyddfa dirtbikes manceaux foreshadowings foists rongsheng dhlamini satco alpuri sommerin haaks zurabishvili kabobs shatzer pramono plitmann ephgrave maqal iksv suprises piezoelectrics koite wdh praver odroid scrunchies biocentre urbany iwatch parrock bosherston naturellement nigon lurve jissah effenberger tourgis venkatachari fessio goemaere chuffing seditions gleadell chocking seved morosov egelhoff cryos bhaichung haatrecht gasparovic intranasally melianthus bancorpsouth ahikam gdss delavallade sanburn mckeeva edlp philosophizes riverboarding kulma meningococcemia harlap ladylove oeyen beguiristain speckhard lillyhall regenmorter mummenschanz officiously shovkovskiy argles gorbachov yakking eulau zaab zithromax gleadow refusers aldarondo dinnick hevingham impressario caucas yitzak tomizo ripasso ahhs bellinge clnk etecsa turmes mowhoush hickner stonborough inveigles faurie chaplinesque vallvé mynediad gerou broders jerren flaccidity brieant 
alaha erlebniswelt ltac theslof felzenberg zimmerling pomazan lillycrop bhui mascari alltop lry psrc oronoque cltc henvey orientates cleale trendies rabadi salangi scrunity apptio houndwood butenis bierer reliford zezelj dejongh nechirvan fbar sacremento nadolski mapusua craford gremm debarquement npis jalet fernihough brutalisation eshe cannoned ravelle sovereignly clambakes beliving robotised aguirresarobe doohickeys kampelman marcario vivendo barshefsky gradualness khorakiwala korytko squeegees yidong bellochio sarad guardbridge tillekeratne chanmugam backpedalling ooky systmone fonctionnaire fdci longham tsds sulphites yould abercanaid microvillus piskarev arrogation fiatal trogdon gestodene deadest gallmann charlette gorau crov hansenne nonreflective ezzouar ledner acrophobic adefemi hothersall databased mvela modrica bedsore scheibitz degi agathonisi bougher readback healthcentral hscc butros vosovic rheolaeth zappy bingde wakeboarders abigale devondale nitol saccomanno manguera temptingly chippokes trackday gaofeng hapworth stewardships brussee gwbush urusemal apalled holober kwasny diamonique nizuc wellworths slaatto cibm dhers saudati bohnsack tchato salahudin naharro gjxdm leakproof brushlands alfrey bjorling seube narraboth guised packetcable hogsqueal bracigliano ecopower cashcall everus mummifying villet seckman accom traductores bankinter amjed chemchina stetko meridius recapitulations rabeca responsiblities vendio bastardize atiqur personology sketchier shutler oblon kaempf pimpled cafm kampmeier choosier antipasti ideaworks kidult gadair gahrton yurinsky omido fielkow willersey almarcha luksa cheba ukshin zeltzer baratunde turbigo serabi endemically ermir stpaul esigodini schletter haishu cissel stalzer oenologists paranoiacs loflin ranjitsinh bekman pper thirlaway rusada lathered liljeberg hazak rayhon redacts deyaar mceleny miskovic unrecognizably wennersten heying harverson isum encasements ocen storywise peili yijiang nahcolite vertegaal tavaris 
meditatively septoplasty deolis sosh mooragh rockbottom neurostimulator cheroots montanana foodshed hirers unax wimpenny bouchers persective morjane verg ruettgers trainbearer pharmacogenomic marull chanock cholish underhandedness tharcisse macozoma tenaciousness statkevich marnich guildmates escude bugandan saffronart watchbands cereblon tokon bitondo zarghoon cfed loutrophoros desensitizes tauch lungelo jednak guiseppi whitebirk evaristti confino constition grbic kesch ventilations pehe mtvn mtpd libsker sufganiyot pressphoto overexpose lizhong rohrig roseires moneysense athfest unbendable penrhyncoch disconfirming vdacs occaision galila murviel yussif stateparks slawinski lasante gyrates armstong servie charvil lutron mejorar estlink marinopoulos ekwok lonay izmailovsky ladhar jonjon cbsp kayumba macintyres noyze perfector promontorio joyalukkas treilles fossel higuey partizansk sternthal adegboye troeger niniane bengen zacho sandbelt carltons megadose lisnek surrell churov sherida austrialia datavision bendinat ixys damndest tilberis lynna palel chineseness fhlmc booktime talt magpantay lifewater tiuxetan ambiguousness tomeka darkes zidlicky qouted occassionaly gigmasters rontzki shemona disbarring jelenic kloet gianadda gorteen tranum matinale ecobici szish keflex distrest frassanito rafaqat mturk alliegro elyzabeth lamisil tesoriere caraveo disconsolately rawashdeh mefou rslc innerscope lipocalins sidner moneytalk fundamentaly hongkai eicker kesterton motionx communicability cameraderie dornes gearwheels efficientdynamics rightholders gelatins treborth rafeal dhca tampakan khallad gronholm erte wordley pefect raechel icae divalproex predigested galgorm cauquil schrek phangan solidarite dyagilev rolison tarnya zesa rolapp neyens staylittle tsirekidze mvas playfire gcca whenver tillya thirkettle undemocratically zakria europrop loreno velthuizen eigerwand linenhall spectris istalif cpex honaunau garofano duggin vaamonde prople belcon dumbos wichner thielicke 
westquay sprackland relection stinted gremikha annitsford vitually zige dambrauskas fosbr gáll hebeler gsps exxxotica diamandouros mazzitelli comverge anguishes bulstake pcra igiugig zhengcai wincent défago michniewicz clearspring fantasmagoria alegent grynsztejn lici spörl fromms courreges crimeware refired sahnoune lixiong romarin whatev raghead foodsafety boudha uludag poofter halpen panss horsefair waterden askary exumas electic goetschel vectron babycakes hanoon draftswoman atcitty subrogated livor loope hohnen nurhasyim athr bastakiya funster retiled gissin marrazzo oyebola ffms edendork doretti eckstut bonchev suncal expensify sandbæk wavi awps volkswagon meutia sunarto lorance felner fanara whatsmore bosendorfer amamou outwin alexandrea finlays honeymooner pezzuto weyhrauch genetree undurraga interpenetrated dusing dragin emblazons multiwall segin highpointing dsns glowinski jetskis mudbug tatitlek shengxian surrealistically allnight sunisa demaurice piscean gonged willse eichhof tricresyl chapan vaidere dudash aijala spazzy kaurareg nchez ratlike ikegwuonu deigning stsi kunzman ladygo chulk stickhandling loterij dhanjal mutative hostings feeblest barechested betted warmaking gretch offspinner cogenerated buckwell girafe primping taxers explosivo tecnimont rightsizing kolinisau vinicombe levengood cresselly voil kazmierski sousan cerrigydrudion kheera skyping tamizdat investement phosphogypsum hayleys blommestein rvia barika altimo kestle vishnoi eisenhuettenstadt shandan baiquan manerba oberndorfer fruchterman methuselahs weifenbach keynell olukotun aitofele corenet blabbed gardenstown hosptial bercken reponsibility sukanaivalu ponn neaton gladisch gizab vieites puco aperitivo ncja dandois trenchmouth immobilises tresnjak apacs pedialyte alvárez coha maynez nassirian karchin khomeni cressage microdosing brenkus tutorvista yummies shimali breakoff broadcroft boomtime babler mariott mastromarino korting rustamova garbee roumelia ecodefense nastygram 
locklizard juwi nuxoll bottarga amouri ashoor restent sbihi superficialities richins kenk vivaciousness springlike jinelle hoppings tesofensine niemira phaseal nozzolio ozel turaka halkida luib xiangfen fragger bigheaded mccaysville carnt brashier cronenwett cerminara lenker brizlee bronxdale redhanded synaesthete klarik oseland namiquipa sunnucks bodum lovings meagerly waughs frijters sagittis hoehner shutterbugs tariko doltish sarnecki selmeston fraire selectives garot prosecuters nuhw ribeau seminude tavant daic tommasso graesser pelzman fallowed kovi baldrey brunshaw putschist udic unilluminating ivington hhsc easterbunny gentin milenkovic picadors linganzi bennecke ognyan komac exergames oleana nicsa lltc cholakov priyani sierraville demaçi vuclip grubinger fazakas abstral numatic radogno keshavjee apitherapy kiejkuty sandwood brukman crovitz memjet rucinsky worawi mythologiques doozies substorm bosomworth eurolink bodged kjt pashka hajeri overemphasised hamalainen vershire tennesee sunspider knuckleballers warnken fallibroome avoncroft plughole liesman miniaci quantative kazimira hipkin skulks swainby ubid howlite rangali ydi shoudt almadraba jungersen cacciotti hurtault briccetti slighest brepoels moniem krovatin uwink leylandii rouissi arguido nahn darnedest kulibayev hejin pfeffinger mahonen abina willumstad asiah coersion amedei sandos sirnaomics ashqelon pitocin gemfire luscomb meralgia avicenne salvers energiser harrist iatan zaika strini salterbeck busin mitul superbank addictively animoto cobwebbed dishcloth hizzoner freepost lionbridge mayner energetica shutoffs suparak camford picciolo spaf notetaker fontanez medvei edhouse shriprakash blenkarn anavilhanas backhandedly leezer bouzou warholian cosalt mcmurrough cordasco guapore parentally devenny pimpbot boscono wimbourne qualitest iglfa prompters zarefsky swagged hedquist ojwang acceledent lcdc carbost tyurina plutonomy nypc daffern elementally eeps smichet aslant jgk kicukiro reprieving dovehouse 
seljalandsfoss paradiski barysheva unworried industrywide tactfulness wishfully najer southwaite glistened ostracodes procare batallions bikker cavis moeliker scudding ocotber hudbay afreeca vesilind dryfe norsa nobuki berewick sevelen tellado gabardi vasilija boureima mclemoresville crounse klane kaluka batayneh picogram sylfaen hopple demostration ulusaba zelve twills nemenhah hockering gaiennie cimpor jurki olallo seinajoki malingerer technophobes eyepet broadsoft nursel leogane bookfest wicklewood azalina jostles clasby artiga gavle backgrounders benhassi makunga mouratoglou bernand chiodini sybert devanna vassie klarsfelds bullionvault arcticnet minguez causewayhead rongione helmetless deathmaster clawfoot glosserman griazev meuller bandawe aldape shiferaw bubaque issueing battallion manalastas periclymenum pentabde jurney gladdens batsh tolstoys deflective paoua cheysson allusiveness arodys skylift commsec inocent chubbies narrowings tchotchke djuan hayali kreissl wimco lamdin hackings appdata beelzebufo vanderfield maxygen affectingly celiacs gorier praiano nuvelo gudiel krivsky miok dreese manevich beresfords halycon saudan outercourse ungraspable inquistion volonte langbeinite fadinard siegried simonenko riingo barreleye zalba wqvga calworks cattedown lny nccd zartog ciea eyewriter kardas requip zhifei whiteners nicklasson expediencies grimaced lyton paradisal gianfilippo tahli glenforest jodat chiampou strops accce beysen readjusts nonpromotional incumbant desano semiretired hameeda ripely slaughtermen zolberg deviceanywhere osud moiben outmanoeuvered kapin bracko gladchuk mvcs besmehn wnbd valleta callled terrines talledega verbalizations raizal fruen jalc cossy hegseth schubertian arooj unnaturalness wsvga overduin suherman kleppinger francioni emilo dedas codependence ballone hathout timoni degaetano methanex stillwaters jesuitical spoty muhamalai hauwa marcianos russelsheim sheinin katriona ruggedised potrayed aspiro romesco issaq fatina letisha 
remobilized casalinuovo schagrin doden towergate italophile fettah makhtal dalser lumbu transposases kopuz isaacks wildmoor satelitte kamerion haisley mexicanaclick luvera laramore dihk merix khos halfpipes underlyings brassell weybread shiranthi mabro ravensden miltons nafcu valleywood cyfnod psomas spacelift saleman nowaczyk lhundrup tyibilika abrouq speerstra nonaction snowbombing kalimah ichim underperforms chocolaty bragger dzingai gencor puccioni passchier rivele tepozteco pellman masurier rised trileptal winkowski inarritu chettleburgh sithe globa topno ronaghan kinfolks karawala chutzpa balaya delayer pompons loewinsohn jarzyna misplays involuted mogulus aviakor cepol dreamit lacewell chartplotter cupful cetain tematico bastareaud beaucamps childnet eckblad copythorne catunda machtinger snapfinger arinaga zennie girotto misspeaking friedbert bizx treganna hopkinstown concededly rachlevsky milehouse kibblesworth psychodramas toooo macklovitch mescher toerag larky kenaan designjet amsus bakerman gilbeau uhart rosskeen somporn altyre biank brierton dummied smooze nemelka stantons szafranski lvcva setya batle spetember skytone tention chengs rededicating lendoiro jgg trefler loise detag revells aerobridge doodoo knuppe undriveable dictat breining comeing recevied kepesh yonts leadburn wouild metabolises doelger straeon thinkings nsda relegious wavier easeful cmedia chicagos daise martori coubert yfantis archaos incitec limato gulftainer poulenard presant exomoon gajurel arrellano unbuildable connecter besteman siributwong clunis labb ghostman amodiaquine icebridge superally unregister maharidge korski sobieraj rebsamen juliao temor rompers waldfogel highchair ospc omehia ashleys piontkovsky apocolypse seawaters ultraconservatives byworth offredo nyffeler dernie vivox setaro hosston malinslee tamgaly grooviest cadolle abci gainsville wearies tillary bewer nwas felske batiquitos angban compeletely oglivy egitim twinship westhoven carway pittie brookmill nowrouz 
sekiyu carasa remondi skillington molat racher kunii mapleville giess nauer innellan osfi skoric dasrath barzinji leixoes gynae stineman thoght angélina allegience glenkens wahome lealan loremo demissie afficionado brocal offerors mansouriah cetinkaya roskams operationalisation joichi worldclass beckville kriangkrai idarubicin scuderie fospero ghazel penetratingly rejoneador muzza cenacolo swack thinkfun midgeley ajws anninos delish vishy misusers srecko schearer penuelas veva rolldown rabbiting gibler otone boekhoorn deloused ghazvini falkenborg ezatullah joads microseries scenarists ugeux waxcaps vexatiously campanologist corneliani scvngr akeredolu pozega ameijeiras senzeni papenfus needhams karimuddin audard morrab insi sundus nanosys discouragements fistfuls prevaricate anufriev singlehood dorito amburgey skyjump sterlacci performics ciochon shinwoo schleimer boths endu cyclicality uceny clinks quadraplegic splendorous ledray frenos tagaeri obern croi spcb foodist stojic outspend minzoni juliets cdpc tomashova diaristic ipledge unmarriageable crisford lifecell venexiana reans durdu rynkiewicz fjerstad kempshall canana batuman kareema stickgold saime virtusa streetfighting agentry inexactly cces roust bonachela secateurs pezeshkian nobble senggigi sinet alimentaria slouches euphemized valez rbbb mugabo shipholding abouet jpma heker lapidation flugge glaverbel unrevealing fiddlewood steingrimur bierset pithoprakta epalahame addding tuveson dunhams swix dhirajlal minex maritha experi revee equasion aliber gatesgarth cosmesis percale urribarri menches partic vayama ulvskog hcan roselend korset benbridge udderly ired sundiver glyntaff mosleh briancon capetonians centile guily grisetti strafes accountings decompressions dumphy tianjian alingar striffler dberr bishopgate nordsjaelland kiumars sleepyheads phonecard skimpily gelateria ehg minshan cwele guebert sgic deigo uclu eures levchuk toyako hummelo jaspersoft quereshi rohim crocheters blackburns excremental uaar 
eaaf frankies olfactometer kassow sankurathri wilka unblurred numpty globalgrind mechaly blasini tceq macnissi kawabuchi pdcp narrowneck aeropress timau veling zury styger lobintsev sureyya mmrca michaeline claira genack setiferum priyanto sangeen brouder newgale unordinary yamar lattea muvee hussies demarais cushe supermajors manditory quanity khrushcheva houat danise clermond skidbrooke nosratollah valey macdougald tazmanian incept amreen whobob ivrs grabenstein fibroplasia bulled adamsberg lievsay newsite beedis csip bicing driscol kinbuck deceivingly anoai bicos soory paresthetica xme holsworth rattanarithikul bagdhad encapsulations tabbi kreo jambiya flexipop whatpants sandlers watman uverse dithyrambe loughguile alewa shfaram dfac sanges hasaka steenstra saladworks aerotaxi artio pauntley fonera denhart kleynhans kenexa mendelssohnian oueslati punishingly phlx taurel belenkaya phenomina valorizing muvi xtar shalam marascia daleys splendiferous sidestroke pownalborough clywd sarries dywer roistering khalikov nogood bearlike frogurt converstation gardarsson lessels whg hermoine ciarelli ordnances stockwork stiffle wooder fastforward charreadas dpps bythe unworkability synygy lutnick gelotophobia wrightspeed ollmann grungier apiana thumbwheel reaganism tangina carcas selfhelp beltransgaz tengizchevroil maniq mangunkusumo intervet vasselon pelligrini hunderds hasbi lunghua russello hywind cipto krupnick eastment moviehouse jałowiecki dezmon neatherd faarax resettles wispelwey guantanomo rodwin sdam pemble karto leiermann slec itched bhattacharji pasqualin harrahs rejectors murrihy mcmurran vervets gerloff araouane willardson consu ningsih smatresk ngamotu froidevaux muiravonside thakoon rumormongering malletier undiscounted pingdom isnardi paraxylene solises unconferences ronc yishay torrian mortage kainth fortgang siewers arpoador palisson wilsonianism arthroscopically sornette canicross motoshima puerility textspeak kheili kumarasiri vahland randomise 
margharita mendiluce lozo knohl theranostics beshears egelston smoochie vinelandii ménerbes ferronetti amenabar bioswale weissler boecker novermber bnabs razzaque castlefin yound ubani nythe mesonychoteuthis maroga thiosulfinates etsuro counci brikho garciamendez billmeier jelincic weemer boceprevir koelner osley saldivia osenton ailis purches presorted forlini happed gallagheri stulz fattier yefremova inovia benha dunderheads slusho tanerau makelele khomenei lingamfelter wirework mcleland chalupas savins riccarda andrysiak smartarse swiftnet jensenius thierrée ghassem mercatali phlebotomists nightscapes lsvd linsday curtsied zollars galliers leonardis ragpickers chivery quisisana csob ohab swrda conisborough loadshedding rattee wdig sitompul krinsk francaix romstad catholism restrainer kajillion pnhp jiacheng lexmond unguessable strpce smotrich adraskan multivendor garlinghouse greengross duesenbergs rowehl nimsoft pornanong parajuli eskow bookshare itapagipe treacheries belhocine aseptically inoffensively wippert netview vscc annoymous tuffrey felan nirwan missippi fougeres bubblewrap imitatively krystine vipps asphalts braynon copithorne fearin sogaard notecard sauntered bergeres luxuriousness pickpocketed bustiers rakitskiy ebrill unairworthy auroch sinkinson zaiger mojados gimlets maurita elsag enmeshment homogenously unharnessed harpertown trakys diraimondo jepkorir felicito sherril burguiere kdom constituion batphone zwillinger streetsblog broccolini ndlea picures netminders nmtv pamplet rankest blackerby paramotoring bendersky lashwood wickerham speiss schoepges emdeon amau safiullah boundlessness zürs priefer interbranch bicuspids feigley hennicot healtheon akorn ogunnaike placanica orexis downlights diffuseness bossaball overbilled gryta elborough lemvo tsedaka asbat ebookstore greaseproof jtdm derrynoose shyra plotnik cannato cichan debarring gangchuan kvor chiyome swaray rpra kreisman tamboli calfo karstadtquelle skvortsova lizeroux milhau bingtuan spti 
readathon upsurges jdz vormann kankas taishun mofos clowance quizzically officejet yatama garold oogieloves inseminations idmc bordens likhtarovich zabinski izal pogorelich sovietsky oosh kopetz ormsgill contintental opy phse maletta meirav ntibantunganya manolos bushee unfi rossides moag shurfine bellison extraconstitutional carrys bejel leaseable visitengland braillard hearten clodhoppers nyotaimori bejun zeltsman haroutunian beardon lonni glomming gowler haghighatjoo unreligious leweck saines laplanders sweetgreen steepletop tilelli lencioni chereau ivanyi metrosexuals connswater whatsisname winkenwerder selna pitner divisively matutes dikun collossal trounces devillez coedydd senseable applehead stoneriver beuret montalvan prelaw lices kiwanda fickes naputi maif caddonfoot mostoles huesmann prammanasudh hakuo pomades karmarama ampac bellisle avidia ticktock becauase heldenbergh vereadores culvahouse galashki middleberg acquafredda mendibil falus guenthner starone olner tecc gavlak bridas iteris barzi landlubbers sunnegga mclear cartrefi sivaraja bowhunters hongtong glamorising breteau faygate unreimbursed jiggins leucosis perrotti infovision bowis deanell disenchanting kostich kaibil skinniest iaec laserium valukas protien extremeties breandan tvrs rozhetskin flybar chaoin nonuse smichov peljesac garvins fahmideh princesshay braider fesitval ascerbic kruschke percodan haerter butkovich rhapsodizing icfj miraikan feltes tohyama tatweer bobsleigher midcareer circunstances maynot okky kranish shufelt crisscrosses gillislee skelwith flowy wairimu raspier gofish huasheng muscardini swinerton nekritz ocariz muhlstein nofsinger khaaliq lidbury bellyaching cordts devron kazulin unfathomed magdeline sohaila coppitt fahrenhype gbgb bangkokians wilches skripka benter bureiko pikin explica guglani fimalac castrozza oculd birlings recept magliaro anbd prous agroscope seromba damante ritan pinderfields unresisting beyondtrust gavalda burkha njoy microgrammes airplus 
billinger talabi berechurch swamphens hatwell chondroitinase resposible schweizerhof bialis brainier maressa candymaker trêpa baloji vibroseis trajes jurietti angioplasties dehumanise cubacel ditommaso ibers shosanna inverkeilor dischargeable bcfa kipkosgei gaviglio tarpenning uncrossable bambery teah dornfeld weusi bernacki wyffels bludau roboworld hotplates emilienne bloodedly tewelde corle pagitt kirsanow jamillah fransi madr patrixbourne multiemployer guerci micozzi kolomoisky canoodling orrstown jasani pogosian bety cieri boozers tallackson magicked bazzell sadgrove chrisitan skydrol thickthorn nhema perh cokal sharnee katsande ajinca potinière dihydroergotamine baconnaise llwynhendy doublechecking caree inrena cancell boshu kushel adali ratanpuri backwood bourgass glandford worldteach geminoid rukmana sabritas theives andizhan campailla khalilou polce luchese duljaj ikal kratschmer yoculan dictats nooy lerouge fraddon barmitzvah corruptors pashby ducket lashgar aleppan hanovers znbc hooh exculpating steadward skout mondex cpnt bouncier cabp ineich bargirls uncooperativeness oppositon cocksworth ywcas hongling chesterfields hadee talauega sennels mccoshen wildlifedirect quaked snowslide greaux iula niceic abdow ortegas drouant ccpoa wittstein jetskiing voteing currenlty broide heteroduplexes sandefer fenglei domainers thoumire elss voluntourism tefera unamusing dibai rioult kainerugaba scroggin suppurating moraima mifamurtide aldaniti wtrg tresize guilvinec marvine igeneration langsamer redecorations opinionating xiumei burgar kazatchkine hartas dropsondes liquidnet mckensie vivette suplement canavosio samcef geriatricians romneycare jonesing pheaa desvenlafaxine maraahel portmans kaikkonen devico mavromatis posesses murdochs sloans serenbe stolidly dorato micromania prosinski sharify mcjames iacovelli costeira muasher gervay isenhower fieldman florencecourt smithgroup forness korkidas lövin haddacks taei rummana gwanzura tannert jamille gobb abbu benbaun nvic 
tcan karlgaard alexine fakudze hipbone ppca beaurocrats restovich manorexia shunner ulacia yatch torwoodlee seminomas samso spunkmeyer enersys whiplike responsed kador manuwai sajeeb ayiiia stichter quintiq mastrov brûlées labron kiljunen degarelix oaker jancevski rykestrasse welioya vigiani hafsat recongnized grosnez poshest soldz pulrose steamfitter assuncao terpin norgard yanhuang upgradeability greenhaus multipack towerbrook erdimi seemore statehouses freefest muscato olrig westernbank divello quartermile kosner thodey integras spaciously maubisse dashingly gristedes zaretskys denihan conquerers altaira sporns etsuo daydreamers scheidemantel abeed bldp creutzfeld aodhan malchi yolky ecornell amser trabolgan guggenmos haerizadeh jugraj xiaoshu mcseveney hawksford punkier diclemente passangers adaptogen shorabak anderby strugglin teradici kazakstan guoxiang zvara macloughlin tsaf despotisms bucketing battleaxes nerveless valfierno jianbo deciples highpointers indjai salvadorians garmirian jaaa cannibalizes chaffed romauld boutonniere trwy ashwagandha fosis zvai uygurs metec dillenberger chizhova gostomski swiftboat chout genyen paidos polybona ahmedullah lakafia forsen amerithrax zandio eramet gutschow definitiveness borque kokee tankink freedive qarar snurfer neutralizers mazorra naglaa molissa adek gyari jolomo rindi skivington enomaly abdinasir wiva rylko faliva supprised redant rifugi plopper stoicescu sportsticker latice vitens rowhani dorce subfusc naviance sinkfield tolerence abbeywood maslon telam gnudi sharie mapd madere carpetbagging kinesiologists yaritza favourability jundee abbeygate gongga barofsky hakakian taroni fufilling ritholtz cwyfan atonalism mexicos promenaded llanfwrog luthardt théret nixonian paninis calld phocine synchronicities raciness digiulio mallaya lizin jamika duschenes nepstar infantilized witricity taquito zhenxiang makete tsuyako glofs ettien lowside neskowin gutterball afectados multiparameter rambly marcetta frissons prugova 
nutriset treier twitterati amtek florigen toastie kasmiri hesch fatass dhiya ruedy ulgen subero hasun foecke saumlaki uncarbonated swaffield simhan newsum schellhardt uncap wajeha marchioni crigger hoofprint mosside bernandino kesici resculpted chattergoon unsheathing galachipa tengco psco groer villify foxell xiaoke tinkly lwala kontogiannis hoddeson devonshires nonsupport viceconte ugoh scaqmd zwiener hahahahahaha delger kanaykin renqing componentized ganswindt gurpal kurer sinosteel savta lezli banderillero tilki madelena eaws buriak keehn gogulski tormenters ittoop devro northeasterners throbbed nysp yakoob uindy gantman latorraca lefkos damapong shyann repsect barmal susko publishe lebedko guettler cloughmills springbourne impellizzeri valdovinos zda wassman elesewhere brunching ibutilide benuzzi senoussi naproxcinod breakfasting tstr killone soneira melandra nieberg romashina bastel pálmadóttir dédée caveri mastiha violaters steinhilber fownhope fona crowleys gurbuz gougers drewnowski pennfuture medc boniver ivanans crampsey pontrhydfendigaid semerari profounder mandelblit austrie headlocks kalustyan celebair kondewa nadezhdin feminazis shabbiness dreamier mejdi hefcw gocek patrina mesalands workweeks cheaptickets eikenberg gilton maveron resourses fesperman gladdening yevtushenkov rochdi horndog samau underpromoted rotini meadowfield nokdim chgs lüttge theend fahdawi wwpc unstylish boneau rpet vollman udda bureaucratised pytor kobna terrority woooo addeo akitsugu kuchwara departmentalized synageva estess balstrode harmann graftech afdm reordained dwomoh wgii wadiya aldisert aringo bodyboards matsinhe pnrc chestfield hypersexualized puckers lodgenet hittleman bogied hottes rishell ambela prejudges tangguh pailleron mtbc beltra mothecombe filshie speleonectes hansruedi azour foursquares cocamidopropyl automobilist adolat giefer flager exactingly rabach beautyman wyngarden vesuviana direnzo charmayne peepo hartside tipling hornblum brooklynese pfertzel platings 
brunini bograd tanlaw cerm anandapuram grechaninov kavadis deadbolts hunnington stripteases iwh ladbrook revenuers taklimakan volozh yumari shaygan dosidicus hermansyah amuay schneir caphosol kamembe benyettou lauga belnavis mekdad victoza airgid baronness yonghui lionheads denigratory prognosticate magira beome mcluskey indignance sdax demineralised boricha deteriorations littlemoss hematopathology worktable graveness chiminea ¸ sivarajah movietown coggio bannier datascope trabectedin humoud brodian juling prevelent raynell weijia volgas ballabon babara tolterodine veeco saizen nafez avantel kfcs slotter terrian muxidi trunfio digitial neupogen ikats moutaouakil landmesser girishk unsubscribing motulsky rakova hemingson southbay steindler shesol heminger individial bendelack refolded shakirova denegrate gladen burntollet symeonidis vaxgen vannan kiyora kolambugan tipples neverson herbstritt jdd commuity yance dragooned oiliness exorcize albertos letseng spamer egesborg temptresses greenprint woolmore caffine runnning roncone marysa enman medex tbhq geeben genotropin abdelaal illustrational polcari tigs timbrook cambanis rachev rightous livadas whytes lifechangers kover tirumalasetti trabocchi derunta astex hackshaw settis idds repudiations durrand niqabs seguela faciliate antwuan sudatel outblaze shmarya mechele sonador nontransferable favaretto aihrc bowlful taslimi geotrax agencys bobrun verasun acquited takazawa sunand sitoli periwigs gourdie tholan undoctored philarmonia sailes personalises lawdragon kovalski qjm ptac alsalam korobochka ceaucescu lionizes ramunas combinado jaker gurel baikalfinansgrup açai salutatorians blouson pendon scallywags wearn philistin mcneff sheaff mavrud vernick owczarek safah yanovich obiageli bisignani nerdly unfasten semiretirement ddysgl lambrinos lisowice schwegel tarmiyah hambrough pepcid sonotone nehemias ebuyer vincci chupin noctilux xiaoqiao yevsyukov kood frostie heires counterphobic castlecaulfield temped sadlo westergard 
miodownik assh anigo misdialed sharenow brenninkmeyer orsak angop disinvite xeta malmierca ferkauf chukiat onochie seaglider matambre ranjeni filreis coquese wsts wirelesshd pivotally optx woome sassard woolfs protectees transue cronic psephologists ajuwa minca gontineac uncross pcam iggle sedgeberrow trovax malmanche conod becel steingard freedlander egomaniacs whino meinrath risling fasso weedn rizieq rugrat descalzo pelindo kindra sourcemedia metamucil waemu wtri toolchains photocure clothbound tatma solimoes gargled burnmoor pinnix disagreeably hamayun okst westrock addd somavia identiy cirrate rhonheimer obfuscator melantha boorstein immunet athanasopoulos souchet inplace shumin graspable manikchand cheywa uncertificated feike webawards strautmanis istcs pelite shripal fibrate counterpointing bitani darbys marlpit ivuti trabajan sednaya mandelkern martletwy ramaroson clie ultang taffa chigago dancigers constitutionals ytterhorn placation unprecedent coketown bavani unshaved jackline nver pudenda bengalese gutseriev garrion spaven toplou arbain loest baulin lionnel unitil briginshaw khilani kifner monestary loaeza colleage mwafulirwa capitala otherwordly mahdaoui daugman sysomos beirn ontrac cornum tribeswoman rightism villagarcia bansei aned ashaolu faloria misspeak chikhani tauter barefooting sinduhije deddie courageousness yordenis shenise branin arcigay chervin desvarieux fviii threequel douglin barbanell marshchapel pratkanis goldschlager soffia shabbes yossel berkowski horizontale loked retreive kalukundi donvan salave sheriffe inamine wojas subparagraphs dardick breadmakers jobbery anouck charlety newp tawazun ishwor hrmph kirchgasser neureiter aprovecho sular palimbang savnik lepape edcor sandella quicklaunch melodiousness tuitele gurnon fleetwide virtualy cnso polysyllables johnners civilities dhafir gelula prtg imasuen immergluck jasvir highlevel sulukule botcherby marzoli desertified steets actuant autie boisfeuillet putrefy fleri fellings hofler 
mamounia shribman fastline pudeur azizollah tallamy wireweed olika sovrano tightwire squali farecompare fathalla monoi pastika juddery krylon melanne baggier sirul vartabedian gontarek brancy poehlman ketia weisblum cadger parping wusses colliver candoco praedium stobswell readopt adoringly staniszewska liyel eidu bouroullec gotsch karoun oppurtunities bittenbender padf kienitz joyandet sirieix coue prerace insightfulness gammans matchy evertonians doggo shafiqur antipiracy thorstenson siadatan gradiska winsham hamori cwmdonkin heinla switalski ccmt tsoukas corkish kilshaw forthriver nightwatchmen malnik editorializes whittakers lizewski vollis gohari skarzynski bruell starsia teeder morlands northsiders whettam kihuen garlaschelli renel knoops sulfosuccinate rowleys stefane eyong gauselmann demsky bonaroo luzius seanor linendoll fatica titters arrythmias dinedor carapelli tinterow evem ortolans parhamovich silfra uxue kringelbach juandre panamerica doblo temperence yawovi damjanov penkov tibballs yukes icall hosseinian ymddiriedolaeth intercall dancap hafnarfjördur tinaco araqi greengrocery belam scarpi thierer lactations spocks amarena ponemon knchr specced densify knipl teuge bejam jospeh wasem bodyguarding gamelink bocm kyaukkyi xueying gwatney kaopectate daymer redscout denbow roadsport litlle macallen seremet schruers abriachan mayanmar imfc mozartean lunkhead threadsnake brobst mahaday masunda kerbed tranel gooper haxhia epower atchinson discusssions garsson ysbryd muscian traitorously techmeme southerness gaches traditonal cerisier renza savories rhaglenni jefrey karraker chiuariu beaulah ndiritu geocoins broerse fiebiger karlijn cyndie napps superbitch berès hunke slesin siree scard wohlleben binladen vandinho nset toberoff kappelman iwv newforge piekos beaudreau molitika guaranitica computerise rehfeld hungriest kamathi strathconon kerryson levota cusses trius lokke smallbusiness tresaith hadramut kosasih reinbach gurantee kittelmann holbox shafiqul 
germier pennys sitomer guarch mcgregory competance guennol dislikable ndamase yoville etms nevine gaev laddered sacrileges gaehtgens waybuloo filete harrhy generaux hydrex tingya footlockers resseguie truell joksimovic inded raveché sidya gónzalez doury beavering bishopmill hardwearing mainelli murigande hearld excercising michigans batailley haveland powerleague villonodular hissyfit diera leucopenia nuclearization imprisioned thorgeirsson pifan malulani privleges flyering nicox abrahim unplowed kazenga mvezo turbins alpr laroe technostructure raptakis sphenodontian apsf monêtier piferrer xenotropic santanam keertana roenneberg bumpiness furaha ciana schuringa electrofunk shiroo unchastened piolin availabilty smithsons vetters qiangba netwars multilaterals hiasl kamaljeet hankie thawat mercel mogull fishville dumitras bosacki edgiest cgtp fayfi witchalls jalazone proenglish splain fraternite evangelique ostrower faeth auchenblae crapanzano kuijper ehui vanasek airbuses belcrest kunonga chizen bentilee fellowman butterstone ganko bernadeau kochis magimix glums mirial bonjela ernsberger austalian bentzion purho okeafor paller jgd kcpl eroticised titchner ensorcelled remortgaging glossiness dopplr idloes cuillins ziada clubfeet deeda weanlings birdsnest beamy edmo glovework kwanzas ketelsen cnto allagoa grimbert beckson hladik tycoch robl protaganist craphonso ratafia rezun roboform bezy forcast itacaré keirstead konieczna teeterboard tartly telenews bandou ekgs drummuir teaboy flexnet jingfang poaches medpage misdiagnosing calazans reckford moayeri nocher staement federkeil dowdney xinfa stokdyk azizian transandino serajuddin superbeing shumilina downcity latourelle diggerland deluz foinse carenet reiland xiaokai yunosuke assegaf renationalised peschici lobsterman magoro trusonic menglian découpage albertonykus rovito shushing tokia marite celerra kwamena unicharm tracheas cabbot reitzle andrucha salbris kheny atomenergoprom shantallow unsucessfully blondies peiyan 
yukihisa transpartisan bleachfield ballcock piaskowski shahsavari chatikavanij mankoo tansor wintrust ponciau slipups hammershoi mbachu morosely nikethamide reengaging dodick aaranyak frousos theine flowsheet fflint adnewyddu gurniak saxaphone sleaziest sibc appler introducting gravestock emadeddin squeakers ricqlès pomas neida gooseflesh ynysforgan unconflicted genebanks hapus coatimundi dolbeer modelmaker valrhona reapportioning fisseha ssfa coogi ghiaciuc qatalyst tripps kaminetsky dmitrich furzey nebit lndd comped somarriba pumariega noncash nonexecutive lonstein froes thinko gurwitz ponderable whitling hinphey joselit tomana perfuming ramsley detailers tutan hairdryers tatil muenter stofile rocques accurso evitable midfirst okal ilheus sculfor somashekar wambugu bluecher lyter woippy ablin kubasik skinnygirl eweida shekinna gluek junkmail ubelaker lezana gobshite nibert davonte ssan twinges corrodi krapyak homesdale shirtfront cesarsky appreciator pitmon hanggi lilliam laffita kruglyakov billionare siroki barbourula softwear xylar juldeh retailmenot thamkrabok mayola mesquites habre whitewave pergau craigielea yuegang shalee finco broumand reichek laeng zinicola levadas lutrus thighbones smallridge spinnrade vinocur actovegin yagil biocity determin adbc utrinski aclt presten lumin piomelli dabbawalas tamecka pitavastatin devinsky vinehall meeusen dyllon ditmore roadtrain amparan searc catheterizations tangutica somanahalli chaston haubert probosces schonborn baysse ribiero escuinapa biostar blobfish dhcc japanes mapondera passionato zeidabadi dometic charlsie arnulfista unmannered dongguang dignifies sassiness punnet lasmo earaches couttie railcare pauriol prognosticating sivola snamprogetti tianqing iadarola biocryst obiefule yankeeography panitan akeim medivation greatpoint desselle yarwun lertchai convalesces timebeing bashert veronicastrum nonrecurring fyke gosder moallim afemo ullinish waszczykowski huateng nickleback swissmedic cadx leopolds dispensa 
tivnan goldplated bjelajac perelson refudiate douge mikic inquisitively molotlegi inebriating getreidegasse malarone npbp woodenly mertaranta patrão abdominus blazevic biaudet koring oriza pincham gowlings rockbeare pretlow flavanoids barefeet underdress llanwddyn syntroleum roett giveing bianucci gompo imerslund diyers nongenetic netstumbler siloh newsperson raymarine bipper tvnewser stubbles brodner traxon teerlink laticrete huckfield shezad reisenauer skycraper kisselev qcda aharonoth stamenkovic guilian hwat groenwald twiname parkvale cohill bournewood fitchner heavenwards suhada jeilan mukasei leviston groetzinger oleuropein nutsedge eloqua nutraloaf tumim brucheville coffel mocana synfuel reimprisoned vyners traducing perpignani hayekian omland whackjob habur toloui priceminister izzeldin sakaba riteway shabah stereotaxis emhs shilstone wijesiri maitham ircica gymnopedie vacantly aiport dainese hrj dallies digitaltrends grandparenting wellwisher navegacion kucharz vyron happythankyoumoreplease panio freebooting wheedled myrobella recitalists nusc shivinder molestors addiopizzo baringer farboud ronseal borbor dotori revivify doualiya hieftje kitzbuehel massaman fegs psychs randjelovic sodann semenchuk shandre romeny kneecapped stepanovic crisson hemiunu ridland macquin mousseaux memorizers eilerts hmss splurging rojanasunand burea halvey sideswipes amddiffyn xocolatl redpine valj suchit krobot underley strieff lewitinn lampion afotec muhabbet guatamala swingy allenheads teamquest eliaquim bubbleman delhaye snipper bouaouzan tourister edls sedwill houra canco drylaw dargel kancharla deguire shcherbakova gattinoni ntaba lerab gehlhausen critising poggiolini seitel bicyclers mahdawi baughen kosovans kremlinology makeen gepirone xiangfeng spermatogonial pulecio tesfamariam milimeters possessively electrochem unlatch langrell upperby reidenberg wcoh slabbed vever pryd clezio laviada klizan tinklers warsofsky plann macua guidos practially hesitatingly ondó savjani 
pomata whitelake orbik fridrikas scrobbling asshats abyssmal kasikorn pliev nkong milteer longbotham isnilon whets kushman gcmmf folkier nyiregyhaza seynaeve karvy conclusiveness applier firetide henegan summarisers lpai agwai tarpishchev rocketships nationaly firouzi crosets aquaplaned paisnel gurgler pawelek usoyan sequitors adou travelpost mounty shiman kosters pralatrexate wohlin bonadies zoratto parness skalak hojjatoleslam espanoles outjumped majoras snickered knipton ashikawa kuys chalfonts safrin biesecker gilbraltar borgerson bilion busetto ffsg breakerz teekanne ambled ophelie niyam rhios gaerfyrddin edmead citropsis klores supplementals afrikaaner mukomuko smeary triwest dobriskey caddish mahsouli passalong theone inveneo ululations marhefka koyie gaaa velvick churchil donze jiranek papermate zaftig margara oneplace perfectness mitvol vican horgos crestar copays kierland revlimid paivi schneeweiss duncon deinterlace boniello wildworks lymn jandola madrileños chancelor bushwalk hunstville synek waygal gekoski caite rackliffe snfs ukesa cvijanovich krivoshapka hallerman techtronic morkov uscybercom shirkers subtantial suyono sliproad yatimov zhongfu jamling saltado standups kardamyli cassiers biospectrum anguishing mawejje hprp penjor smink iskin lucich airier zelston multistars sharawi cognifit cyrine franulovic geltner viny penston crausby khrustaleva zambeef nehrbass gierach ichita osael simonoff colonoscope kolmakov fihlani keekle futurelab conneticut krivokrasov nzabonimana confrence eischeid duvenage bayeaux atieh brockert elwi cirse jicin utay abecasis verbalise bernholz eflow olvey kamyab farella urmc stonerside moxxi birdcalls radmanovic cuttery pashin seaburg liquidy lahariya hsyk nuerburgring zajecar ducksworth nobelists pudukkotai xanthelasma transexuals withybush twiddler eyeshades nastasha peringer ciav chananya nsima internatinal clomid beňová derichebourg freear unembedded oatland lassandro andrau hared corrieri gawdat unilife kirchman pfpc 
stoye zhiguang shieldfield fallshaw parod naehring panners indemand hinduist melligan bayik harpviken muzsikás reconceptualize chefetz timbertop dalmations antiquiet sinkerballer seeligson habimana trende talktime minicars portacabin oppresive logophile ghorab sinnington milmoe vigee hagert salos qingsong berkat staythorpe biggovernment jinjin angen gravitional roadwatch seifi newswise oaty kuupik throg riani sterotypical prandini layerings dishdasha molpus hasselknippe kowarsky hreidarsson cabraal crumbed karasin glenwright aghan farq collegenet nigari derrill potheads palastine skelemani elsea napw justic renante srinivasarao outbreed aibileen huttenlocker farjo obtener githu asecna nrsf biopure dissolver fetishizes ineducable quaidabad kakalios lebua attackes hirshon techspace kompak colmado fanatism aberrational hessi letner navigenics suphachai kokoo pasborg colapse mayby bobwhites kasib brandstaetter marichu paycut automobilwerke bertinetti rabbat remaning dncl deplaned mojib teasmade layette pilleth accbank nejib tatlitug blann staiti scanpix encumbers scarberry shmuck jugendorchester bellwin nonagenarians autoscope robinov weiqing isramco lujic lapot ryynänen mateelong kneecapping icaronycteris merald olian superpoke babbled daciana windowsmedia radioclit cohering ctca avoider reiza gilsey malène financo bonisseur flawlessness galippo aisenberg ulcerating mitrou furstenburg pleuropulmonary disappearence lefta glenton abric reponded choppier stanks tryba luboslav furjan gladston intersexuals meagaidh soyak yosses worring dund chamchamal wwy intissar markuszewski deregulates aboi cairndow zygmantovich deninger glasgay fraiman metroaccess thorek propagandas autisticus fowden kamisugi safak anouther xiaofen pharmacogenetic drene shooshtari arvella clydeport zacatecan woodstream freshkills habacuc readiest hawbaker clingfilm deglazing sheikhi tussie procomm delloye muzdalifah utilimaster knar spcas tiffanys fleetwith kulvinskas tcdc fogell peyi ostergren 
venville subspecialist tangjiashan tejani komphela kleier multishot krasikov jokhadze apria palamar jafza soer overstimulating bigbox gindy galaviz jeba isvaran krames grayswood blumau ballycraigy tessimond aramón togneri pottow stonecarver embezzlements cossie missan nells firehoses enagaged habichuelas andrill grimmette motevalli hocktide coulters reanalyzing zoledronic orha assts soothingly tarting oddbaby davonn marchex sidak abbasali antumbra pozzoni ourselfs lamura redknap bemahague meyran brichto atradius candids greear ganjgal jenan billado boondoggles cotchett lohuizen tarentino arabov shkelzen rotflmao frustated pendrill scicolone resizer cheeseball alanssi giula immunohematology kahlah advamed phalane centerplate broght engenderhealth alsingace hannema tabrizian piselli ojuland glucophage shrimpy bioceramic kaven bluwiki markstrom urlicht lavera hyatte mizhar schmoes dilemas tantalo smalto redkey gacc clouden phuea schimmerling solotaroff amsheet digal secdev allioui wgw wörndle vargus gragson reflooded webman pogossian apeal classey backrower maszczyk uigurs butkovitz perscription bicyclettes rwas henebry reticker faulkenberry cristabel koering butterbeer shigekawa parrita histopathologist scholarliness starcevic egans teotitlan outrated huruma gumwood rysanov cnaf henock cihaner fedee crosschecking deathline lehtomäki gershengorn sketchiness caseinate nmx cougartown wildern fermain zuercher mapaction laoting dybul fxdd damiri summersonic bruemmer stringz walian ventless lurita submarket samax permasteelisa hatchards contemplatively billiam jurovich cirle tymms acerbically pfbc caidos umpp cammel pushtuns saurat padlocking trewoon freewire agilo polacek murate massilon wellchild sulili legitamacy backface ahner mandaza zoledronate tzen drably sbrt oberstolz dijken harrasser principalist foodservices acourt cahnge mostazafan dreidels blacknest mdca besanko sigsgaard chudzik ecumenicism rechargers dgamer spaccanapoli apols ghoba bucintoro drozdoff 
literato eliots tabbaneh servanthood immovably elsewehere quinley gentek lifenet koterba magnitka gillers herdy kostelic terremark shanai getler reorchestration neurosearch wolferman tunneller reeh ungratefully sanwi footwells juicycampus coultre muranaga caminer groark sibor nicaro baertschi picaso dunwood ramtane stepakoff yno robertsport serfin savageness showkat jeggings malalignment cervelat nadell uppance mulanovich bucatini lintula sterz rixson capitalsource gooya bendectin sudbin postberg snower tuiles ybh litchard birchill cnmv offir apoint amelda sxephil pluckers elijio newtownbreda smartened abic csfp gaytri shammo millichip ghazaliya sakey hectored carbonex wesb kiton macala phahurat phaophanit masshealth fetishised hexvix byki atns dibert leewood mcauliff bacara bpsd mshtml spasiuk mothershed overarched bowdry glorya hokule mosebacke repping fozia dipsomaniac lindenstrasse thuo rizek nisanov isayevich collinses conformities guidroz pageonce croud schellas tattles jrj springle jeglic whirlies pulcrano mvuemba inteligentes fargana stosic handballers washa quaile infogroup calka benone sorbetto seckerson monogramme spoletini ethereally ltbi digestives moygashel treos ningling sheeter abdulmohsen jirjis citizenm cockburns talanoa bakowski wastebaskets sigard renkl ringgits intinction suraiyya linstrum mcphun voorheis ardah hafnarfjordur bulos schwanke paparella meskhetians blythman joshipura euram sitaula watchguard attemping oreka bugarach guirong doaa antiblack acbar meraas leisk biggam tissainayagam eurig henes strobo gloopy bicek landsliding acision fedoke footjoy disupted aberdonians metarie malangatana kurlantzick ivax samphel kilbroney stah jiankun reoli speechley lasdon veerasingham franczyk sulick kvitashvili dreu vhg accomplis atwick multireedist edmont debak elevage dinyar varik recenter seewalchen quiggins picketer pesalai picaboo bryd craftswomen anessa chappelow squirreling wahlin loiterer etnyre minewolf guiza sciencefiction ketbi skybar 
sestier welldone ogtt mordantly wordsworthian etex mdaa gsci dagai ampaw nonparticipants rennaissance ecosecurities scoots jasny ruzyne dalgetty characias cabraser coelen embroils speece humanin polcy clairee sughrue defterios phising ghodse cherkesov chikankari cocuzza repudiatory yeras cherryl rici immortalising obesogenic ladia dolgachev bienenstock cornrich jekka massivegood streetstyle lipkis thwacking kebby seibald shsp thimbletack cimatron cashers iimsam overprotectiveness dobrow compper dcca dreibholz depere bleepy lassithi kraisak petromin mazzagatti sember oversimplistic virture zarir commandite mitsushige forgoten tihana niada ferentinos loadholt etoricoxib sholle laciga ostovar blask stachyose rumbaut ritsuo creger tregor zaidon aragoncillo scammy euromediterranean shionoya contillo sugahara helsingor woldemichael hattons troutwine revia kafle brouilly fohn jevtic crav spungin perotta sundresses gellin dondurma felippa clinchers fluidigm madrs shuanggang farnood kippenberg nifaz baglow abdullin speroff journyx servini legside genan bresgen gershoff corncrakes slavemasters kittell bersey ameriks samour ukerc mullinax yazdanian schoerner solot cranachan wayuunaiki kunzmann melvich grubel maximilianstrasse arrise aubins bondgate waseley tarby briquets kulovits epad kielar solitariness bedke krevsun hirschowitz illner opher froths canfin hilford offcut tolia gokool enthrallment buhriz woehr nmhc platzl mujahideens aroop rettaliata phcc polititians grajewski codders gharraf firescope casarella masnata jeevanjee searingtown chiacchia wabho skateboarded whitens lakeforest inclduing zipora qiyao remortgaged weisswurst maingot aquazone pranced anmin kingibe rmic tealight oustide coggon hazardously prockter hnwi vesteys taxin cridersville majungatholus dooky electros cjones cenedella capellio mchoul rugare lowermybills craigcrook powernext francsico stonely shantham vanderwall absolem temming cubbin vendeen vitousek pirih wangai alanne chengping arithmatic cenic 
nitb fructis meyerrose pagentry rosmira bejing menuires thomma echem spitbank sesheshet damascenes sternheimer andjelkovic shieber korniloff rasped sattelberger sandtrap idataplex wideroe bridgforth ossete portloe bwrdd whelping russan tascon mioduszewski smulian bumann tormenter chwe midle gelperin bordage neurohormonal fieldrunners phenergan usgi terrantez mallegni pourable hutabarat overcook pecheur decidió hoft doomers ushikubo destigmatize cikobia lianhai externalised pojama cacee wimmers hrft tyber contructed ryndam conocophilips jarmers concon baich hjordis ethopia delphiniums lsuhsc bromm talpatti tysabri schreider bigchampagne recomends regilded arcusa allusively sainey weetos emken soghomonian cartelli exmple logline underarmed wurmfeld belgrader dawut dadnapped menschel scrummager rubigen geremie bonks pantukan skalnik sprinkel simington halyk colwin kunzi salomi lyalin schlutt folkenflik munts twilighters craptacular assasins majnoun brear humpage gangmaster fidayeen antiglobalization pecor delligatti swingball clocaenog ramatam sporthotel argn touchmark kinglsey yewdall funpark pointin munkhbat tianamen kaysi pinkstone dashtop shoulds flello ulleren kanipe iqlim asnc brikowski butani beilock rennselaer oenological fauxhawk parrotting altinkum rechnitzer blahs lukach manvinder antoher capricolum xianglu otila spinghar gyppo apdal tother herzel fapl bochniarz gudjohnsen tranmer keyholders bacarri athenes jinked therafter manglik peader perons ecobot pasinato edenred schodorf siestas artfl vanfleet wiseass weckherlin peaceplayers medborgarplatsen kosovich sonenshine liftings wirginia gruzen uniban wordworth zomegnan inflationist tyna polycephaly hejlik jeeja grindingly gerecke collectionneuse shiya mortarman matzka mekongo njawé mcns pillarbox numbnuts ushe farnhill overacts zerihun reichwald korins archimboldi kalamian yoky colberts panayides hildalgo isof roundtown debiopharm rabins kaskenmoor patapons coretech fretton eidarous birchenall knoechlein 
uninterest eusr daliburgh barberries okole sneum otremba balmuth beleaguer montbleu homebrews slavena wbes unsettlingly bohinc circomedia sageman autoglym kappy freegold jarstein freindship jolette heldrich zaffarano amadito weibe gü djezzy burlesquing sauds previs kamolvisit hovington aahperd hurewitz travelle lifebridge fobt sarahyba xiangzhong uyanga barnevik shadiness whitmeyer flyposting koram stickels dehumidifying churchkey whitwam kevrekidis smokiness pleace tacketts firstservice tevaga failsafes pokrovskiy smaby crowdspring poofters plonka vawdrey supposely totalfinaelf somato bobotie schnurbein dickoh summeren shuffield sayee barud welchez writeback nikozi meheux labcoat gulei heards ballhandling brisiel gybing dhifallah altenwerder fazeelat capotorto youthworks relkin balthrop rachwalski tiani daner senmaya gimv resiting djana lieden stabiles woge matland mukoko flewis westgroup andronikou housesitting muzzamil sabhan dannemiller ancientness suora salvagni hydrocephalic spitballing decribing imbe unhealthiest muppalla colyandro joshing headstamps strook lavenda dimofte lucken pollner oldstone cmvm oniel kirkmuirhill cutkelvin pmetb goleen outlaid ostagar bilbow maclennans jinjie insistences colaton mingjie fatuously enodis cutifani tampep sansing muneta rudovsky yanique picenze demya teragram candidancy izale troise belhouchet formella mulligatawny kyzylkum halac cerbalus pogoing autohaus emblematically janeczko newsgator sousanis follas toefield waldrist stancombe chiffchaffs strb fügen llado andalasia nienhaus dayquil cpal pensylvania balasz gardent globaltel weisenthal kipkemoi baconator ramuan questin suao whorish honeybaked wikinson cardhu siochana tsuyuki samulski zegar troncones cafero yalie betak gastos jiulin kiosko coltec domit ghadban bowdlerisation makutu reintensify patronises tavullia ashame tresselt eammon unfortuntely spinoso horniness tricep leinenweber laible koepplin hypersensitivities zhengyue rinet picochip relights kulchak naubinway 
needletrades disca zakiyyah jumby ammends bartholomaus pelambres qirbi baymont entonox drinmore masterbatches lissome stemlike cosner suhua topcoats eclypse moppets livian knackwurst oosterhouse hmip ffel balestrino vistan teragrams defrasne ovariectomized immergut longwe breastfeeds amsted fromson superdawg skapp brochettes chantix logicacmg calamos grassfire decathalon chivalrously aquitted benzdorp cognard ydy battleboro groundlessness hanane savasana canteloup dalworthington lubet miharja offler cotif dilullo bookatz kournas kolirin banita francess tarani blaxell janger cloudsplitter paintworks mesocorticolimbic kopta honghua nhengu dansko sotelco turnill hundres miloševic mestawet lekka waisea exasperatedly gollaher dircm iqua eschenfelder anythig houwen manslaughters dormy golbin romeva sotolongo falstone gethyn zarabozo vorsah unrigged ovadya eurofighters khames stonebrook buzenberg logrippo fiom kuchner duerer celotto scorebox baggin carlstroem cotrone ramchander hezuo matloob sware musicianly granjeno muneeza broadridge bodycheck taunter scrammed ponterwyd wickramasuriya somnambulistic ivanic horkos satia replacment cleversafe cybersoft ipri epischura serrand keppra outbraked emblemhealth soundworld carcary fardh ryelands nadcp groused believeing windland becamse busansky meredov barbancourt nevsehir dustjackets bennette bourgeoning litchborough clonroche meneguzzo dannheisser outr bancvue vertiginously neotame hussing quiltmakers ndayisenga pennyburn jablokov wabano mantuano barbarically ianello rondin chanukkah pinborg wyludda himler buoying sundeen branom kapalka schupbach chebli mishear bichlbaum roadpeace sheinwald saidiya kesseler marybelle accessibilty pillowy futurefarmers bernadin belleroche converj janakiram palinka chinasa edelheit centerbeam rickabaugh aziga marouelli pariya wildhack tazhin xiangming kenehan extrordinary appleseeds baglihar crampy brazaitis bamff beurle propagandise idenity caulley vodicka lionelli freakouts pidgeley heimuli 
sorouh currensee capolupo everiss clickability sombogaart stilleto mcgranaghan moneeb jerbi futurex perversities kenville firts unkissed gfms forelocks ctdi frunza sachtjen waringin superveloce buglisi gilian casetti suporting huettner shiptonthorpe shatby mckessie ipdr wrongheadedness yemo bensing bodypump goaler sgas antiauthoritarian megilot protegees saich tiemeyer quagliata mrowiec cruikshanks foregin scheinmann leisureplex stegomyia sandata fezzani opalinska rynell blackband sihine mcilorum olushola notrees eiddo helvetas merdle malvinder shagang marianito llanelltyd stamou osnabrock ceruzzi baciro ddinas zerok penaloza tuirc lipumba figher penalites steamrollered andijanis schnelldorfer castara chirikure backscratcher brintons gurp commonside rhoad polkey noall witin catoe toay balletomane micromet jamaldin hythiam petterssen kotsiopoulos hydroplaned earmuff noncompete skimlinks overemotional sourani estevao texim octabde ouani chudnofsky kamiti medela rahmans yahama isaya goethel webwatch resurges tausif nexplanon cdrh trecco wdav foxbar tudweiliog wayport thoden documenters ballee gurnell unskippable somberness aleppian ehrsam pacleansweep fesikov espiner barranger voorwerp bidwells zester stigsson kinlow sungwoo bonsey senfronia reunifications liswood belorusian lukovic simorangkir eways dcsa supernap sevice grybauskaite ezenia lidel kastigar marmotte colberti westhusing noseclip frimmel ossobuco bonders wildblood harithiya viners farouki joshie rixin supermom surgutneftegas jhuapl constructeur pocahantas goate hites bartestree vermonster ficheras slowish cudas atlasphere aggressed jeanswear carafes adventurists ortenzio nides disabatino barcola jobcentres soapmaker glattfelder maaruf titulos evelia komlosy tundavala frederking blether intravesical pasik enlli bealle uscm manamela hayslett barbourne serenella unkles mckool numberland attcks goldstrom amerigas porgras primp oncofertility cavas goolrick piedfort riklin levanzo bouchnak opton moneyman 
riboulet osmans jebson ferwig donatucci bennun labrooy kozeny elisapie friere ekejiuba edmondsley tayon birnstingl allocco lalaland corrupters ipsco narcoa centrella wohn grippingly rashbaum ipti jawboning hicktown turangzai plecnik campanulatus iovino crueller sichi mikalauskas jobholders aidesep hydrocracker bliny starmine ecolife jarina nicodemos berlinia shadowserver methenolone regene deterent cambrex trengwainton mescalito certolizumab leisinger commes huajian muazu inadvisability ingored pontificated contigent okuribito taglianetti sinemet devard infared mavroleon oleos morvillo bovanenkovo namey whon margoyles saladrigas zizzle repecharge respectible yemenidjian rny vystar pandolph nonpregnant pattama furillen negari contemporize habineza ganzorig billerey schörling ellams butik quaility shamefaced vytex mondli zhaparov behove powersharing vigeant grancabrio lonseny adroll peschek skorupa scalisi revivers emenalo eisermann broderson musketier cmca condesending bullmann armelin bingzhang askett progovernment engleby osculation bablitch casuality corera waddleton zubrow cuoio saccos horseplayer postrock gmoser tipline millionnaires indrawan shengjun mobocracy qinjian tdindustries aberation haughian otim kingsdon ndiema valdebebas vdara visualon shaid jumunjin heverly ortloff zubaz urkullu yankeeland bakey iaastd jamake evoh midsections jhonattan raffield infinately glidescope espaliers crossharbor dumbells goldstrike iddle veddhas brazer sullington santh colonography umdloti physcial providência chokshi madeo ginster freysinger fedaia studentuniverse guohong sadigursky opland bioinitiative ehtisham misapprehended corsas koplan kayihura suhay aleris bullwhips rockwaller ueapme unitards millenniumit crunchgear hormazd biomarine biscan voalte meikleour kingsmills barafundle carestream nbpts friarton palsied yechiam guastaferro querulousness payline araghi bellwethers fulchino delasau sehring gianaris absssi untypically bisek breedt pittin umda ashurov zeldes 
bennick penasquitos panjiayuan afirma tshibanda superfreakonomics mombourquette tossups conversly pechorsk gluco outbacks norstrom witheld shawnie bodyslammed chunjiang luftman headcases riverhouse tslp norlington fuggi washday wegs suvereto frempong dollys bumperstickers nutritionism forouhi ferwerda bliar brownlees invega wisgerhof iacobelli cheesey llegan shoket esnard plunks kiarie sechan beltian cosseted asyla ashimolowo decoratum marsdens eldean weldin dabate pagb ferziger rotundity francl tucan mayreau fontdevila hollendorfer uckington mofilm psme mulee ballyholland autotrack exquisiteness reprecussions vringo trabulsi dipsticks visitdenmark oreshkin whiffing nostalgics flugsicherung nondisruptive unsayable ursprache rihn konlive mocad kupchan rosgosstrakh juron straussians pleite reax philidelphia surui whaanga moumneh dieted klouda chinner paraglide proshare bistricer autier numbe vojnovic avecia steelhammer wonderlust disproportions nagg swifties hissom voulgarakis skronk dspca kulala szájer golts jailtime flukey greige tomalley multisystemic citerne farez fishof jeziorski arcelay creekstone rightscale haixing petrack australasians katiya limitlessly aposhian rezulin quixotically tolmoff wapixana sypolt regionalize pitchbook switkes folderol travisano thaty pruzan noritsu myrichuncle kheraj tanyong gerwel villedrouin kaminskas justins hammarstrom kountouris hansman dinsha wadesmill realisms moessner gladestry myair wijdan tweb deat karoutchi kindlier grotech jabon perrig unoaked droutsas greensomes laubman shuteye disport delerme jacomet mountbattens areli prisum quadruply leapfish passerbys koranda implemention nanuq boulters gohara chatzi mecar cmsb mladjan sannae squillion streitberger imobile kersels gulliford capellaro obession conygar schummer omotayo caseware recognisance darwars belluomini carstanjen sfiha leyrer noaimi shiffer mihelich gaddie acritas peones hattenberger liepins paraic buyat liederman iskysoft courteousness yontz botnar kulur 
strok anapu banksters alynda postively monsall swires lahem innata lambroza caraccio stadiem gavanon cundo beukelaer jarosik lumena assumtion canditates ronza trakatellis outdrawing tranferred porzecanski mandaloniz larzelere lucknam loudcloud valtrex ashif anasi hupy unhappier measor shouldve equiptment lomia harbus exigences proshansky myhome bakhat murrayshall libidos jennine lammin schnellhardt lapcat sheku prehuman handsaws serotta zetti zaitoun blvds tardec hkja bathersby shimell bugat senderens liechtensteinische trengrove batawa varois fourquet overpressured fairooz wildberry zhevnov mitsy wefi selham caliskan porbeagles yabuli stjernstedt sandylands rainshower duesing oksala veingrad muhairi lamonaca lakbima shenon miaskowski bullshitter baqouba gestrinone telaprevir johnpaul iffs teenscreen maponya ened inveigh licuanan mohommed passell doitt alliums unattractively quixall quaero unrideable zava youlia tsylinskaya absorbine arminas farkhar hebditch krissie tayvallich disinterments childern labovitz bizilj krokodiloes dumpsites minnion sazzad stadtlander drawsheet kranjcar ikle gayetty henshaws timelier humaya mehyar hdq agnolotti mustafah ennahar etbe nouzha defoor sbec grossetête swimmable barshay valasek brestyan materialscience reprocesses goldway allenspach abhin slutz holmeses fattan banman alonissos lundt houver aversano cyffredin hameline kibali tradus wilenchik moragas ruperts paliotta weaire aviate bahima respirology bakheet giegold arrondisement proteon jardini biedron cvsa guyader electroencephalograms retools prounced almacen avoide ryefield abdulmajeed dissinger saptakoshi wimbledons furure bellchambers firestones atwar deragon rsdl spottily surmont elkstone poldrack jeanney pgmp towerjazz scamorza exluding onanuga schennikov selectboard sibisi camapaign hargray datapipe tapson jadan alsac nolley barnev kefallonia flegal akoo aeras dilello travolution houtzager gamemakers mastek responsabilities amarra wibert sarafpour dassies nightwatching 
powerlet qualifer zakeri phumelela wysteria mastel onx molfese resits yessenia russoniello hypocrit karunarathne straughter kvitebjørn crickmay hierholzer slobodyanik aduaka kirpatrick statesperson reinfused adnec bournmoor millam seafolly tahai pavilionis includeing littledown shrawardine husseins medicates sundowe corgnati kolega hartge symphonically fragranced finkin ppz rechanneling suctioned endoscopies forkin winstrol nantgaredig pljeskavica karmani prebensen océans redeploys mslo stackley angelena collicott utapao serghini zooman mousedeer frankenfield rijos niess sameday chomba decongesting mouncey boisterousness fibreoptic fleishmann pewsham inconsiderately narfe babybel thorpedo pushelberg acquistion gombeen erasmas texturized dislikeable snoozy capitalinos tokenistic chaza yeltsina zhongjun tonights imaginengine gehrt sugarbeets aisher ortube proddings prosthodontists unclip nyetimber stila succoured donovin kajganich ablum ecotality reenforce gauthreaux clanky ocrs pmda moonshots keheliya goony realacci threnodies rubingh hallak cheatem dismayingly poospatuck lorgat boback discrimation osito unthinkably heptaminol trouvadore galv brangman mazhari benicar alderminster stokeleigh snoots forgeard phinda echaveste misalign rovinescu parkies abloom guiffre nding zunaira tugman javen podro ubaydi uncrossing leaverland zonis kernals molner inverrary bagis sladky icemakers designworksusa rozy videanu wormgoor flavorsome rittelmeyer lahj kerger bussereau cupla balornock ustadz makoua croskey careyes yusgiantoro harwoods alcober highboy atrap omed miasmas frayssinet chowhan ctrp lustfulness doghouses bottleworks tahmid ronika bridgehaugh decommissions upline dodt highbrows gamesbeat kingsleigh jessyca embargos megayacht lodeweges spendthrifts hoggin cremisan sebby cclp slabaugh hernial sturzaker hengli dikul trakr alnoor fluoridating poris hopsital lailey elsburg bedran penenberg hiersche betoken bromances uhai megion kalimullah faulker breay pajovic spokeperson 
nasraoui vinella arku monderman deloreans fose kimjongilia bestit tianyun brusseau konopnicki uninfested lumeta chinajoy bouff polistena watban renewably nackington lymbery curbstones santonastaso careworker expurgate timolin rungwecebus geivett vitec rhena dushyantha weatherbeaten cheatley swatters sayadi xiwanzi agreeement luckert sueda randoph reafforestation ciuffo balkanize yanliang tregynon haggards swendsen cannard rfsl lalezar ailie chishimba lefer pround netumbo stavish bordeanu adjoua benzar ruilova batraz itsmf tombstoning hashimiya cardiogenesis ladybarn wikisposure threadworm alertbox mulsum basik lopeti brewitt scavos wclc cistaro tlusty wormelow nipsco ardency elloway palolem lynetta oddantonio tacettin schwope rothchilds trelise slamfest coxley jlk pletch asymetric rosscup ulger ifez kibati poleto ochsendorf neophilia kleps bosbach morganroth pinacothèque gobelet ilane zhicai serson gaveled rauball painfulness affogato sturmius kanisha haily decasia spoilsports yoyes gorodyansky bydureon menear gadberry schoenefeld saffman tawfeeq souayah spdrs zaiying kaanan shoegazers incois marocchino fishersgate papparazzi unprosecuted faltskog kasradze panderers tirtschke humpton donehoo inspiringly aleek celko reveillon swigs waspy hoarsely unstoppably krubally muscati salau guéridon froghopper lagostina angkhana twombley wristcutters rahulan duggans sudeikin mccrohan exfiltrating reikan rhai berrymans redsky stupors siedo llafranc shirong luleh placelessness blusens smallmead greencap cencic mcfarling saviles soutif shawcroft zanne orlie roadcraft understatements anybodys belchamber bessmertnova jinxy ingrediants sehs bowesfield flox lendor grsm mazzalai geleta muddiness sorren epitomy zeits buendorf thatcherites pantsil jeuland réne augert peguy pitwall garikoitz artba greenip foodstalls multibrand neighs occhiogrosso broton fleetness collarette almiron makke fouratt barnish yorked amgad dimetapp misdescribed bardoxolone flightseeing wenzlaff medpedia 
taigman aatf fontus cyberespionage raguin tsra weebles tianwan uprs yamile grammatiko bluehippo uhlin gladish goettl cankaya toughing lueth piccata bazlul basuo zebov compliantly sonji versari kangerdlugssuaq digeo rooky prescoed graciani simps nakalipithecus rehl acef pressboard tómasdóttir meph splichal hartstown bommentre buddying brickarms carful cannibis zquez dmips pinkness swartruggens rubberduck accentuations abstr jachowski overcomplex antianxiety casued ameneh paulhac unremorseful daphnes patriarshy bruyckere moolen chhachhi danylyshyn kwariani truenergy brutalise utiger broadmayne khabarova wylen wapusk kazachenko weizhong ivorra insolvable bakx uncredentialed elhorga comprends saloom mputa threader inteview vaudin photomaton squirms trucs sitorus eywa classifed tjandra mintek gilbank realtytrac andalio hennock gustafer dziekanski hoofing akok olswang sloanes barzey pazhwak ekirch christys privelage aissaoui peaktime yorkstone killygordon cheeriness pymt corduroys bushies ablaza roellig rinnai zechman jacci isermann kernoviae gatsalov pelletreau manificat juwono ileia baraniuk trebert behavioralist minihane amfa oncomed spectralism mossessian narry adiva unconvention universityof bodrato vaeth czapiewski rapamune yochi mcgimsey mindnumbing tartlets rissoles jamous cortiñas kanapathipillai davidenko frakking librio pegden involement schriefer compotes farani lavielle fujitec vandeurzen chiselers cibs fleischaker chervenak cigital piligian npss wreckages seratelli cariverona gutin boatsmen thoraya standeven shellacked subbasement wasthe mercilessness asfari borzou intergen nonqualified foleo schulting moslim luxxe openxml windblast transmucosal antiroll andertons namdeb schaumburger rahid eebc murrietta hecher modie chlorinate haiping embargoing smrc shcool antur formulaically progamming khurais qizs burpees changlin pibs sibaja metolachlor seifried malaz nacubo vatea tourson eltron styloctenium inceman actividentity huybers deian nilsmark arputham 
cantrelle lhundup revpar apwu llike elsenhans sciencelogic photocards vecoli utterings greenbox unshipped krugier erbyn berrelleza birrane antojitos jiuxian acupoint hedwiges bryco lavande apeloig kaburu riverkeepers tadin invitingly kayahara nerlich unmodernised ayandeh grappo withi popli demey taciturnity hahadasha golfnow sophal giangrande resusci oloyede dimin knowedge loua whooley ideastorm embroilment gittelsohn tickin unicyclists rithmetic infragravity winers coolheaded zeglinski caubet arjaan herbenick quinette fotanian looxcie kostenki hungerburg latrina beieve shirred prakit tedwomen loule tscm jazzmin preborn shengda depersonalised entriken cyrela hoodrat oxitec somsavat lengsavad rathnayaka brailer donchin rayshawn seada schweizerhaus nikkole infy stirringly wriggins aschiana helenians vilcanota koshansky singulair gâr sleazeball rustenberg jott mascitti daleh verwaayen paerdegat nestboxes robello tetelbaum defensenews zangari unnoticeably enaged craftier chadashim quarterhorse fredlund hawkei nosiness sdiri bissap dongkuk faradje muriano shigemura figliuzzi comprehendible julfar fluked unquenched thaibev feldmans darington genton woodings margolles beardall bhanbhagta kraichnan mankini icluding xjt nuancing sartorially chomped beltrones jaggedness musettes kontorovich inaguration qama kiobel creampuff kinsel hornbaker knetter sasselov heavner bosto penybryn richen tirawi puncog diems famara arrabbiata boudella arulkumaran pietrafesa olmetti mudimu tradeweb dcvax mbenza russomano kopitz dappling footsore overtaker proberly ridker neeps trendspotting segement appauling sarposa dittohead diondre vicepresidents sophisticatedly washpo wintersports saddo remata reconveyance khuzam pesanggrahan iotc thalians sarwono orther burocratic makdissi musoni lavand lusikisiki carpentered diyali ncae ruslans pompus artsdepot uninstructed knief opko tillbrook groundrules merseytram concertgoer cchp mclurg ahmadreza guanyu melim casiokids samsoe pagesjaunes ziprealty 
winspeare bentancourt issiar bhps gaith proveable frithville xora nilbog kackert accoyer vitalise fathomable crasnick hartnagel nonmalignant conern toyer lakely isbourne unroped guerette gorshenin starosel walkscore aptv signi netratings boschker mindmapper sumate bazaaris kitner bioflavonoids mainy jolibois theuer discourteously anonymisation sweetnorthernsaint purloin fornaio numex defibrotide archibeque kryptiq duccini testee khella ciclovía hestrin gearey beitenu jaysuma mograbi allaster cyberterrorists quickcam stakeouts immortally microplace dhekiajuli grinnan tatenen homestanding ikmal rapidograph icbf kovanda panobinostat sandestin francescani mailo lathams faillace namikoshi mailout nybot stoyka oberhammer khandal hernreich tobashi linheraptor autoinducers redzikowo rumberg nariratana kiechel eirlys sulked subeditors medlyn itzstein urostomy trhough abovenet everloop procul bulin defrauds drifty nonaffiliated vlerken instanbul reconceiving alsomitra wingerden wooers socca multicomputer gawade systech lttle ghormach goldfrank kalaj tradtional nischan gostev prageeth luhyas aloko royksopp hargin wynard nirj pollena germanwatch boringness countersigning wthout cannibalisation degenhart chromeos cathance basterd shiceka amsterdammers headbone bomai tafforeau sherrerd putrov rupiya friborg zipless liquica carollers smin underdressed pocognoli socpa vedan tortoiseshells radhamma smyllie travestied waithman anot antinarcotics omrf hunstad canesta egotistically mutineering xinhu siamas comingled patricelli amock tarabella agbaria blogfather collateralised sucide dragnets boodai afib nycholat shafta touton counterpoised billauer deluging adonde abuya hamos golimumab omble giessel dentressangle fehb rivisondoli wolosky catastrophists regietheater hempson aniversary wotus lilliefors tegas melinka airworld khaji jibreen appup vongo honicknowle guidall afribank haycroft distends dvn mandikian nnabuife highflyers lorgues lazreg ciftci altech yarima ordzhonikidzevskaya 
leocata brachiosaur menary zoloth shopaholics hpnotiq tworkowski hysteroscopic nprs qqqq pantless omax pirlouit stonard nungwi popovsky monical vigay presevo intrepids sufiya timebombs ukrainain elgarian donguy whooped seabight decisons shupp christenbury parguera klerman shortman tauruses sayala millimoles eikerenkoetter ffiv penchée putrescent fassitt tobasco chunka councelor mabele agnoletto sunshiny fantauzzo babying tiffiny twiglets pevey ecogra springcm bizonal talibani orcopampa madewell teten marghescu triner usairways hemmerdinger handbagged valenstein lagerquist gyotoku mesinger noshaq differenct colcci treviglas loaners shaimaa leonardslee globonews nonmonetary sprightliness supportes tiziani kimberli walski uriri kuehler mamsurov belway wagnerites roycrofters takeouts glugs agood homogenising gotsis profundities hnba hemlington milgard parthasarthy herwitz siksik ferrino necrophiliacs subsquent undiplomatically fahimi bavents zvarych olfat ibisevic arnvid llangunllo pushpanathan blagged knpc teollisuuden mykhailychenko adnoddau plumps firebreathing kazn klocwork bighearted taxmen alili dhuluiya greybeards rubgy jedforest arbennig traiter bardsea alloro pergam coet yumilka homasote magcloud cryptographical ourt korhola fvrl maccubbin stockbury goelitz sedas humidify simultanously bethleham durda waitlists pricefalls hrmc sheafs literarian groznyy olling pfeffernüsse flightwise bastardizing tinei machelle weissglass rabalais banyai lizst centralistic filterless trerulefoot matraman copolyester kolluri sefs svatos babchenko darrol berlinde tongrentang behanding gisozi automatist brovina lifland splainin mindshift adaleen peguera uneg proteccion zhaoyu superfighter bernuth giornalistica mentell gerstman nger prospectivity ruhullah smarm telekomunikacja dangcil lacoochee kiloliters firesafe epipens polydoros nestinari ngaba outplays polihale llwynywermod sextortion stradsett hankton crockart tollroad swannack sangermano zavecz franciscos cheuvront drobnick 
oveisi kingsknowe rasate akbars seronera maldoom chickening thymes boerger louizos kibitzing marlbank pfingst tarikat mapco xiaonei emix libaux undersampled tiandong kendrapada ekpemupolo posessed soroptimists eots goldfever uffindell fawbert amparai licensers statemen norbourg dhahir jeannetta huissiers daragahi pontifications wsbr mournings misselling fittleton rockits campcraft bronne acox opeta kivuvu ndfa jegal intergrity chittock shukron perpetuators woulnd valthaty muncipal yarnfield humburg latifiya evryone trcs underpay piccarreta liptapanlop eleo xiangjiaba kanyabayonga crozatier dorli mulvee playacting footgear dutto norak zimmy ionawr alphasat franker pesaturo binegar detchant kscc bluenext cacchione oatridge hameedullah mixbook poltics jolynn donehey kolbeck mullineaux humprey coskata aleikum kovilakom dufournet maccone flexibilty fulminations incyte annmaria yeide tanbridge wilkowski dekdebrun mafco kurowska ungrudgingly diogelu xeko hooty rattiner hegghammer wattad tavernelle surrouding discouragingly tdra bonorino montos satarov unenforceability kgra basijis dailys triveniganj auguin gowthorpe interregnums resonse stierle blustered suspicionless sarnowski recoverpoint leghold fukumori hapened somafm gerr johnetta technomic parachuter serdamba entj troughing iccvam muscare sulamita baksaas bertan overachievement schomaker cosigning dissemblers vilely vacuousness sevent raddle lundbom ogoun schonthal degasser univerisity taqueti corvalis paquis yonko sungen filigrees apicultural seamheads decareau rohlfing speedferries kkim nesheiwat gironcoli redlener ayral babouche orogbemi penchants clerisy christodora grotti mikucki tsujino chancres obscurant pseudowire dreves editrix auchencairn stadd klegg kathwari azzoni beguinages steo redbuds contentpolis dayday brachioplasty salties lepelletier hrysopiyi caav docketing medvedtseva ochandiano amorette scarweather amercan mashar dowles hoesel desbrosses stravinskian ahmadian simkus employement hubrich kujat 
penyberth planetsolar izatullah vinasat granneman chortles yusufov chaisteil bancfirst joybubbles cerebrus eqypt hapsford jaekelopterus pieronek witih reiach brylawski mruk treichl gouves loseling tilbian healthwise chillier dullish leawere benefical culpan kabulov ladanian corien hrebenciuc wavegen delaurentiis boardmembers sukeena bijur stupefyingly germaphobe surliness elber sauven daylit samawi ellestad natour bezafibrate zylo ecvam sychnant hospitalities mortland kalea zync kristia mussed chassaing litein constitional noltland refurbisher virostko troublespots crispins genitally grisolia podrug deadlocking alies mahmod puliyankulam kayed fluvastatin desided tehrangeles macanga desmoteplase korsunsky bargeman uncontrived feistier delucas trevan morgeson samoas asociated cocottes rambøll niedzviecki duplicitously ellershaw tenatively isssues annoushka bigotries tptb gropman morinas trlica abayev novellis pacewildenstein trisynaptic muhidin cavatelli gasline drawcards perjures christensson houghteling internatonal gruffud ingraining basiron abacuses wippler kingshouse sarmas muellner tefe dhirgham tercek cfdb enmasse greilsammer medsafe glem wellby disaggregating argentieri dogons siefker colombos counterlife nyph empl blaabjerg galgiani reacquires aizumi dueber goldenblatt disheartens exhalted snowplowing vmpc quiraing borowik swots hefez maeba microstar kikunae tunefully ikrima letizi psdf luzmila borght spangly aelodau sorcher kotowska pfuj broquet dondrup kindess pendery roadhead odorama caribean vincento mgic inlanders fived tandle wasendorf doubletwist balmullo tewfiq solidays ikhana awfi limbourne wishlists filloy sekulovich yucks chyandour dangerman bottlecaps zhiyue ganiel jeunehomme consquences squashy scivee shayon mealer pulic brégier fatteh portugalia leatha linskey tenuousness kaushansky reget natil kamarulzaman navizon grammercy midrise flamekeeper shontelligence schopfer imposer ismatullah onychonycteris grabovoy ferness aujali pakol overdiagnosed 
nüvi yach kulicke jbjs expectorants liase gametech puffett canonise cerrudo trpčeski hinners phibro mahoganies happpened temelin spookier hatpins motorbiking jedrzejewski ubaidi grats amsha anbin venglos chunhong lucketti bernardette kibebe dbis tulaganov melanzane bleg wenxia kunskapsskolan armspan ferrassie esupport drabo refold brusher sleb makarewicz timra topstitching bertschin llafur matory jiegu kodmani dutney trueblue musambasi fsbs sandalo batterymate myclimate nompumelelo bihn zooppa sleazier milburne santoli vulcanologists ekaette mondrians fruitflies cusanelli bonanos fahle internationalising penttinen serrill narky pluvius oveson daelemans stertz mentary jakovljevic hesperonychus georeactor glivec charonda zentmyer shopsy kiteboarders arnelle cadougan kanjana chibbaro carcavallo stoeckley bouska hosszu raouraoua nucleaire essawi spinesi longsdon guildhe hainje builing demeritt sompop gemeos girds mollee heskel nonserious outboxing abhorent dentally allwin mizin canynge ecec schulters lulic ccpi cbtf louisianians ayarza akouala devecser untangles bidonville doofy heronsgate akhigbe naste ivesiana interactif carritt highstar gmdc aquaduck marlaine riogrande avalonbay baldheaded floodlighted wiederin sajawal distateful vespera mgive groneman intels britanni brambridge cheerier olivotto nazarenas onvia harofeh aramide smashup vogli chopsocky astudio wnan mylonitic senstive kitwood yannic birkhimer decending greschner massier medicalize ibfan bowlingual railtracks chandrasekara heyzer ccctb abcam annularity remin dispair lardelli orombi scog bovim brimo takoe spermiogenesis iuh jugos peijs kiradech stroupe veic acupunture phomolong galovic spectrolab unspooled tobosa jollett nakagaki escrowed manscaping samanvaya ducheny sharmon hucksterism bacas lenzini shemwell ceisler phillion pondlife northcross xiangying mohoni laque piquets gpro federacije splurged giavazzi highhandedness lillas pscu staubitz bursik peevishly atest bursk vulgarians connétables 
virutally chastan slouchy strokemaker cartesio sheutiapik dormie sghir blelloch blacktopped disneyfied fareri cuspid ltfc paischer weplay undrilled sonenshein bjayou cphi wilnelia apostola certisign preprimary pepall blacksands redferns omir enciu waern alwasy borouge tanjim scantegrity gtsi carrez bigum betwee sadoon csango shanny coursesmart teera dhargye punchcard babygirl unmodern shelties steltzer elsaffar istedgade loosies ippl ehic lozito orchila vouk conspriacy frigia malinvestment tiffee karawan wiznitzer carcassés iaus gingerman zurb guyland brastemp accomplishements masterbatch silverknowes blogotheque barlanark stampless vehle kkgn ringhals sempe masura nvz baverman chareau churchstanton popma ivari overides alexsander employess grindhouses pescetarians passata auque ogundele icop viccellio bransgrove maarohanye pcip krenar horiyoshi luxoft uninvested succesive greenoak sermonising twitterers bakwin soneva tonaghmore westernise goosed delectorskaya tysa cartus toppel mnari norlund boutiette simental commitees salaciousness glassverket hostias gietzen demidec relitigation schizopolis valemont roaccutane unquotable camau bernazzani gavard fischell mozhan librilla battaile chhouk dooce multiton consipiracy expertos alvart cayzac arkleston maumbury hikal disjointedly deutag pyschology bobois betweem moruzzi fettling schoolhill amatore triallists mogor pedestrianise apkarian perriers janiya sawadi arbourthorne wayyyyy delwa orginizations counterspace atyushov reacquainting weegie peray mahmad bavouzet iret averment baako tagliafico prequalify landaker munayyer vratil lompo penfriend gauntt bratter tunewiki iochdar lluberes menerba somervale flatteries afren plie loac unallotment strite resortquest woodmark breare hopfinger vercorin calzati bluemountain nukri intoxicatingly quanitra snts tallie stricklands sisia mabunda selectone hanscombe nurestan daulby tangel multisided dilaram uninhibitedly migala pattni paleochora sarvestani hasak jelacic facekoo ipsl 
kwis olabode tson basw repor perrodo berec macronix ciller drudgereport evault nakagin mpdu tamasi sangwa mallarangeng phoberomys bfsu zatar camello constanly birring fijilive hilborne peplums mouline vazon ozaeta unseats ljr harss amifampridine janofsky cervasio ayagi wwdp feldmeier crabber superantigens lepor hobsbawn bannwart braithwell thomley buethe nuvasive remmeber hallgarth alcombe gordmans calpains yanoff clape stoltmann manadel lenzner brugnetti kuzubov lhps plebanski rizkallah chimutengwende potawatomie superfetation gonzalvez severances slurpy schnaars nettlesome punchbag mutiu uhaul cassimere orecchiette outsprint translogic depinho razzing pakha summerlands eurodocsis supercentres wreathe ohtahara daylength proglide seabord blutch cobbetts yazici scharioth watg anastatic hagwood lasarow lavastorm mcquitty kueng gunay futuris rpix homoeopath sampathkumar malloth sanquin yawner klaehn anrig alent turnbaugh burey greencoat hererra sloppiest garriot sigon salehuddin valmond diezani marionberry ganczarski pedophiliac kechichian woodsburgh keeve idutywa boardex supremist nonparticipation klingner outstay pervaz kocca deihl plexes lingoes proenca hoskings outsourcer valkova skeezy panzner bansley telehandlers nreca voluteer tabloidism secong newpoint streethouse baudrecourt englebrecht delamarche vanaskie payd jimly cellartracker allmighty laddonia weills entrie kasrah creciendo igwg chopticon whipton perkiness harrells airsick fibbed evertonian mitfords scoiety rehnberg naaah litoff arraycomm stacelita daysi popinjays claypotts slatternly salcer bachian snivel papuc uncollectible vederson japhy coronograph kamlabai kgotla beanstalks dyszel cutups techpresident sukkari morlok competions almudevar rajkovic spinmaster barware deleasa etnia raeff hoopster jarko sipsmith divesture iyp overengineered pollesch simulsat octuplet decend hujum perat gridrepublic fineries riziki tollerance volitile merchantile longney contois bismullah swiki niba fingringhoe ospraie 
allieu sacdalan sandstedt jabots unpreventable navaro pogemiller itpc videocracy kinnucan monning fukushiro fawal dicked dobies quintiliani sartoria perfoming viotia freerolls slaff rigotto sysview tagliamonte khoudary youseff myfoxla commmercial vigenin aizenman kotalik homeloans niggemann pramit somary drycleaning kazuharu omaid chaunce thekkekara emailers budges ashgabad elastane reorganizational pilypaitis halbertal hfba jepleting petrokazakhstan monewden cvetan ellerston stanya fetion meidrim towthorpe unbossed shaariibuu coxhealth blackhouses satcon lordshill fendant wheelarch daneshill incompetance honigstein handtools fridson electonic hiter matrosskaya strohecker wealthtv waiganjo lombino ulfers macae ducketts herawati angstadt reconception ahlone decieving fillippo fascinatin lantra receipted dearnaley weiguang mytikas myerberg samei honeychile jubbly karoki parween indocyanine brakewoman onbase spectralink discombobulating bibf ridling bosniaherzegovina politial schue multiven lyrids fehbp kitzen waterlog panouse ibsc grimason boyat giantism ploteus relized recenty calem thougth deselecting blacklistings zochonis saidie hafter grucza tailormade saminejad wilb overell pazmino telegeography champetre hightly targedau buid intrado weblo cybercafés fishtails tadmarton modrak centera rowberrow riverford penetralia mayanga obergurgl chbeeb sniders goklany soldera bazbaz schnarch sharrocks slotradio navini gyrocam ƒ brainbench chalvedon picknicking lepsis kostrikin audf voluspa privilige tazim hamermesh schuening lyapin sukkoth kinetin saisi thornloe fereti cinching cityspace lasak samji brutoco shaquanda bingjun engmann akhond escalades atlal mightiness wieben harik interscan salerosa saloli knipling ruemmler villen verrry greenlick inkless microcsp altata fisman sudik cripa elkwood suping flitcraft seasonably neumunster revich ndeti marleys dancelike qushan slimmon fosamax dumatrait murliganj junling haynos fishfinders overanxious fuschia becom mykita songsak 
dissapoint efficently farrage adsb schilf tagme cuttone economix brocke netani greenlights swordfights meão kambaksh plantlike dustbuster rebtel sipress frowd shawwa bronzefield sirree thottathil zaras krafts pulverise azuz zirkelbach multidetector ansv edemocracy paradichlorobenzene lowenhaupt mhenni stretchiness bansky frostick ghaffarian hodginsii khamtay tieback gabaix neonode miloscia keera eneough tsem asiacell kurtzberg quicklink bowermaster yuai dworzak superintends kolade subcribers summersgill seantrel darfurians upskilling tagalong bobbyjo katam guellal hulatt marketingprofs zuberbuhler cagaptay gorsey touboul detron cotey victimizations katial crossfires craycraft estaleiros sultriness ghioane nethers gonzalves paduch rumaihi kitcat polictical chimenea dazing brandcenter amgs rubeis inswingers hovhannisian kalko macguff hobbyhorses songstresses holzberger pivoda visability eqb sonystyle eilt folkson  barabasi debauching flowerings hillgate russneft pedair petchkoom hiros graikos whooshes stockline guennoun hourse gartenfeld devaul donaldsons aktis superfluities nymphomaniacs distachyon goofily sreb medenica harakas backett rapelye chegini nlga paramos sgorio witoelar gasfields skillicorn habiby cotchford vprotect billeaud overclaimed emberlin befouled sherifi tinhorn sheptock nasn moenchengladbach pamfilova uniflex normando panteli anile deranging ngwesaung hilliest knobbe askenazi breann funemployed supriyanto ivannikov asefi zvulun deadens epcos thiefs hastalis telephonists duhau neuborne slowes sowles malaney medlam limotive durup upshire luxuriating manganyi bhunu bedeviling shibanova calvins euri dalemain inexactness condole nagreen zalpa inspectional exoo ciacco plackemeier barti gordel repotting usherettes alefacept flaska mzonke acccess saicm unironic sugarmann nicassio elvian chaowai baudy manorcunningham okram scrunchie llion braband blakewater pontiki grunin casalese sexualizes mccollom polderman heyler sireau roualeyn bakyt agita 
unipersonal cirulli roshia losana unprescribed csotonyi riberio shooutout jalli fcsc acjs silek demanders perfusionists bocadillos confrere marzolf jurrasic zhengqing habitues asge easliy moneychanger learnedness birdingasia coucil pauze schoeneman watring bedspring ferdl capocollo snwa atrophying wafik broadbased bensadoun zolghadr medge hennagan guiter kennya leetle stereographer rusas possesed mamitu ruihua massenhoven subtenants disseminations ranbeer slightingly rsvps uninterruptable schartner mcgilvrey luteinising drowsing whiteaway wysing egana ddit holohoax srygley kikaya neece rakefet morhaim albitreccia sabtu skillett casselle illigitimate empathises ddrc karanfil keyra crothall xingquan sacla colting staunching duponts duea jinadu scer brezna cyntaf seshasayee lampel gjeldnes ryakhovsky bauli hallissey stepanski abgal ‟ derrykeighan chapek secularising syllabub sirait pouladi suwarno haidle subijana flannelette donativum lieck lenell jnet alyoshin youngor trustingly fumusa harllee eacute gutschmidt schenz stokeinteignhead bhebhe molitoris soopers khatak wannen despoilers bedsprings malboro soelistyo remling inebriety votomatic returneth myxer brehat skinsuits nimocks raymor oeics nncc macayeal njha pathlow smilar pajeros valdimir dmjm poularity strenously okula kifayatullah keyesport campagin neiko fishwrap disabilites sidneys ryehill deerhorn kalidou uncoerced coarsen ghasiram schlief chaplins techster mangoma bolivares roshanak lobbyism transcrypt messian gitsham gtac mianheng cimolino blasphemously magnificant freeley hagues fogt weierman higney payslip coddles focalin mangalitsa dixielanders indelicacy sidiqi witheford syphillis netnames morethan colemanballs xianmin minqin dekuyper omra froylan gocco shemali shillelaghs grayness ynghylch completition calladine redlaw smolla rottler tochman sidikhin mptf jurlique attiyeh jarinje celebreties brokenburr calyptogena renslow kolini nimax sirias kyaiklat homebodies bhubneshwar drawled shipit mesiti 
immigrationworks splittist lideres nusoor oncor samarjit pireos tromode mesaverde themistokles sassiest liebergot artlessly kelvinhall ayaa overharvested bolde subsegment pauffley karuba bohling tamarside londolozi goosefish ocfa nonsexist asigra magnoni burkas pakeerah iepe mollycoddling jerzyk republicanos sunamerica bestride movielink invernesshire esys ometto campeao leathwood cpmiec diale aqn slogger mellins schloendorff dembner dcjcc primex henkels anitkabir newlandsfield shiaism snots michéal huther crimdon comert heske publinx inveighs moreinfo rosindale fleamarket bhac kliem plonking myller vucinic naddi captura kniffin latavious commentariat ritti marayati unreel slutcracker nangahar enervate hobnobbed impishly trampy bickhart gossipers bowbridge farcial braue xixianykus incy grygiel slaughterman taliesen anteon zelon salym hensch timetree laforme colisseum cardiotocography jaquess wiand sedran focalpoint anowar toptable kalaweit oeic denstad maduaka meken drajat humbrecht seditionists niederman kakabadse jasiewicz mgat khristina clerically zangmo nonsubscribers jainist turchak breezeblocks giustozzi scaremongers springform blamer shwemawdaw alkalay repricing kraine cahps wanatka boncore carlops chapelotte laveist groshong mandean grish grabsky slinkard itemise tither telenovella szara heinously mylonakis kinshasha hownam chungs streetfighters exhorbitant valkovich puremovement génocidaires lotsof freckleface schuver jwf toplitz accupuncture baranka techzone jolbert naraine svedka cinderblocks unsal masciola hedgcock manlangit herczog tamanaha umpcs shushes wvas qanooni pronuclei rheeder wanding yukitoshi harquail oshea dauth lockview ioulis tasimelteon effies austalia greenmeadow hallewell abshero inklusive takeshis inswing mweelrea looki reprioritization antwun insulinotropic propsals schive mipro cobner sayra lanntair zayets czarnikow banotti majete boeving moubamba seemant caranta inbreed harmut maue vandervalk abiomed ryori hollinsclough benzecry 
unusally weaponary shirtmakers dantesque bvute wüsthof iattc kruijff miilion unconsenting perenially hainline counterpunching götgatan bonnart prinzing fineprint flipkey allegros clearvision klehm lorkowski raymour fujisue kasriel sagit baladin gsce schoenhut misgovernance vobora sydneysider gardaworld threatend radiat esolutions bloviation wiliness neumar enzon yongliang udzungwensis tetangco teutuls filmaka climactically shoosh weinschenk oddbins multidisciplinarity youthnet nagamitsu getliffe confederado hiddush surpreme piaba semiformal alkatraz chorltonville yuqiang niesr rpts kcals dognappers pluggedin giarraputo senofsky unmanipulated dimario italianisation mingarelli hafed patrizzi scheraga pky mozzies moynagh stohrer shireland multiplus nadac yukons vakhtin appaled glj lamingtons perfomer naeemia zhakiyanov sorriest sashays lamposts delfos firewatcher osmanovic langdorf assit orlob satrec amsellem carupano teresas pinco mietzner defintiely taweelah moustiques albain raxibacumab winstorm enkoping bromfenac demailly tudoresque retamales qiuhong perfom amicone resika kloza fieldglass trafficing violinistic ayro kumbaro dirtgirlworld pilarcik drumpark cicheng shefferman trockadero qingchun viperin blankmeyer retore ballantines matamoe tzd tranghese woodrough seatmate shoemark practicies burgt unmistaken sakow elew derw okike bankcorp saltshaker crespina snunit towning citysocialising civan itälä someo ramaala wrdf tyrnauer truthfull kularb paulistanos rastagar brattish pitango espaliered scorcese whitmill fehrer begert galens apprehensible yarelis kadyrbek ticketline zagg ciwf sandbo everhardt chewiness ahdr mmod vmotion fmris brawlin bosire sheeny krivokapic najiba portlemouth bewigged parsai radient kampgrounds maoyuan cherita strathkinness fervant krusell lamorde wattville openhydro jesli avesco fleetingness sitex weightlessly isandra emiro menb itati kiplings atpa limbos klonoff knowlingly lefsetz holzel sabahuddin duckhorn kipsiele warbots gyalo wifehood 
miscalculate attensity chpa kraeger absl loganberries trillionths ladygrove bishow barcha journe schlondorff protalix capb omiyale goeman iréne penholder quintiliano mujati bogarin veldmeijer kavenna shaoxiong mispoke tougias fetz estephe pontio cinelatino hradcany intactivists shortchange ocdetf cortella jethani tegic magheramason dakkak burring meziani twer upseting sonicare apme daugher allbeit intrepidus avonex karaoglan bintree caroused sadkhan abdelbasset kaliese napili scherza mythologising guneratne orringer piecyk aondoakaa vbieds glannau stormready cirg kapilow gaffikin blutt ryabkov babock rreef handwrite prizefights reyad mediteranean buntain spicey nextfest mcduffy pakhalina spiridakis nyanchoka brandesburton overlit echemandu clarizio bmds gardels jihah carmens gerneral anyidoho ortica nevling claybrooks lazslo bassanini uproars outstays coiffured bibring uncalculated orentlicher nevetheless essawy basaran clinard jcj keten constantí disposers optomec ineichen houseago krzanich boasters expensiveness freshour ramstead railfuture rondy reteam krolle tdca sekita mitiku feriani tradional nidhal metaplace schallenberger italias innocense greengauge bedbound ponied impark panted entrate bunkmates dunlewey poac lucasz zhur ganef unstocked hueter franchuk lorinczi sizomu friedelind bouchons lmax bountifully sentencer sichenzia elction simcyp taismary cardfile unadvised hritik expertice sclar nsofwa diamondstone jouyet baudrand tkacik wingels ethnik fulfords stongest xtension heavrin chitron intoxica ashizawa goldbart benalcazar ysursa cwla pyatachok nnemkadi koshiishi ahanotu bulyanhulu bonacich gheni twlight palavicini nekvasil lannen magson tubelike propert heliocentrics rusol traing bocker aviall slathering authers aryanised imlil shamefull hanash wyser pagunsan clubmates rinta chieftan buscaino rccl granatt allweddol mcgillivary samirah nonindustrial dharmana padgitt peroid quandts gallarda famm isnaji cedano posman goetzinger proe radhakant lazydays 
yark riversource watarrka skils ruehle kunzer murchinson wrigleys hotwater chiwenga sosen supersizing dermalogica uncrackable stakhanovites prickers taniwal giveway bombassei maloin greenfaulds uncustomary steinhorn clouting pniewska softspoken sashayed ramms wuan pagella fscp origene vaisanen arthro guzm cbcf blahous weyco vasogenic cobank hkmex haimov mght bridgeclimb flexsys poussins cardinology jnci dedem kohavi doostang scribendi sillen graziana fischhoff italk pple wyndford monetate kanakuk kursinski hutchby bootstrappers nonpracticing ecamsule rajay peltason chej fieler icrier bosmajian footsbarn glassful otpp rtfo matyszak chatani trivets flashfloods jiale yingxia rhosgadfan birbalsingh suky untalkative catcalling runcom chopine ramdeen kabanicha beastliness zarubezhneft payslips cerdin reinbolt rebating productivities daunay speedskaters llwyngwril chinaco tatzmannsdorf chipmakers blockson incoherences sathyaprakash icosium biocatalytic bisogni ozlem sharnford nekrotzar wallbuilders clab cannito montagner areligious cscb stickiest azmatullah mumbler bilro doddery klepfer intrafamily eltrombopag leissner zhengfei muratovic medika amatitlan niederhuber roulades shifren braeckel blowsy ucedo gagas greivances galynker hardup paiewonsky stewartfield worls fufills womma monacolin laios enawene dhahi ayovi drugmakers sherwyn danii astrochemist aznavorian suruj mougenot cagatay hdhp nzimbi tufin aristocratically saddique plommer morgenavisen fetherstone penstone perlson reaggravated sessegnon herkel tarcy chelsio calie warninglid meidi clri playcalling heitzler revak tsimane questnet mynor imagineable voggenhuber greenfuel cwellyn obaida partical tulipae canjuers poolstock backfoot griffth lepselter sherbets herdan lifecasters uzzo claycourt propganda valerien realmuto dossen biomechanic vrijenhoek furnisher kuleto diservice fcbga cavolo holnest vcast ryuteki solden slossberg tansky lisset homelite sterndrive shieff conjunctural hammerstrom nafaa duena simanimals 
stalevo asmatullah glandyfi imperturbability clra tsundue gollen ladieswear suntans aurang diamoutene clayish buoi abercastle handwash hottovy piermaria woodhatch mastromonaco flexjet jftc debbage blythin oneapp fiaa grishenko gaffed budwood freehouse jubril cestari warbly craftsmanlike koniaris ogutu mullivaikal turneth sobp jinmao cortaro fictionalising seddons hilmo meronek triperoxide ooooooh weidensaul cwdc smoothening palefaces chelgate constructedness solwhit magomaev truchot ramsy moider okaying uncataloged wegge witchel forschungsgruppe infratest lumio ladypool persina arader zilic snappily pikers netlogic kimilsungia inflamation ptech trunki vbrick bradville lolled bangaldesh nyheim polycationic stannett bellyful shinguards bakana trupe uneatable kneads austens pgrp leberkäse blatanly ittel enpro redeclared yolette zakira tricoire handschu ldraw reinemund mcguffins miklaszewski effaces ixabepilone sahafi araghchi fatmire trawscoed catastrophies cohr tahuamanu bottlerock gottemoeller greedier anahuacalli bleazard coonelly kombewa floridita nutted steinseifer stenzler byrddau forestethics zirkin geliang gromia lolito corelis seamons cefntilla mhec klatsky mamillius unclogging inauri ukaj bruyette hechtman jolinda guch munlochy presort wlgc marleix batmanghelidj sturley linbo slushies akerfeldt sioeng religeous shuras ideaology meistrich bullwood ritche shadingfield tutv partipants megatrain maiziere dunakin themelis krassimira stojadinovic merland orating saintsations schräder letard mikayelyan heatherden toomebridge marcoola unparented seabolt wittert dufourg nyiso pasedena goldfire unsustained nechi cheifetz finighan dsrl penneteau kiedrowski sajnani nobia blakean grandvalira tetherless breana vivisect bengay rheumatologic glitzenstein ozkaya panang newsmaking racioppi boffey récemment pyant walvin khamisa hoopman poveromo monagle tilter excitment exlusive massouda nsms vanderhye faguibine cayford armyworms bridling opportunites polyjet mariuz pigasse 
schonwald kuettner acuson mpinganjira madilyn vereador aftersales facilty gathy snowier reallocates cockwell jadriya backplates chunlin airhorn relativities jewlery delettrez lubiprostone cheresh coronari bushill pollick topmouth bertodano steinlauf pildes evensky waeli galw leighanne kvaal boender unlaced hemostats labatts arrogates nguan musah nozadze facilely duologues hexworthy prakarn kinlochard airer wildean makkawi hortas goicolea gamemill corumba arné touzaint similac sickled fattens hmmph devani stabbins arrata howlingly tenían gelbman taelor bainwol hambden shubhada maniquis coutadeur kavon diamox eyms maxalt titchy pepelyaev francel luckwell intelect assurity transnationals shorelands volleyers whitgiftians polioviruses matambanadzo storozynski suitsupply tailandia raskind eboda gleasons bydlak sabali unassumingly postnuclear publicaly oilpatch bekay elusively forrister dopage hahahah toppert somila narbeth savuth nosecap doonhamers supové aydogan pumwani bental neckbrace acoustiblok ekmani ventastega provett kilchrenan higiro kippford mckilligan mastis bonnano westaby dovgan lilliman pucky quelccaya milltimber miswrote sulitzer seikel denneboom ermelino clarabridge clacket porterage tekebayev jotischky forswearing zhaorong waithera overbought mcgarrybowen filyaw kamrava korabelnikov fluty spottings teleni chelius doumbe renjen beyton squiggling khouzam barnavi brighteye rinos gecov cutecircuit lukacevic ferez megayachts verardo overpaint ngaujah kolek rollups dafur netdragon marari cochleas lantiq asipp votewatch capathia weedkillers reductil zyflo seting efunds whuppin aurothiomalate cscp cacb kadhafi surving pockmark estulin chadderdon calavo detikcom bitancurt brejc sherak smolyansky begovich katembo lahsa guynn aplogize eribulin parkstead marget burkee reckermann salvias winnows masche kherington beated overachieve leimberg veltrop croudace lisov mbpd swidnica glossily cassivi centrefolds salge niklaaskerk readapt avega chowdury mossawa 
krakhmalnikova netwitness morgantini mbbl ibasis densen timl uhlir dohop quiddities ismm brugs overgraze piffling wendrich teenaids lucases mumbrella tribalists taab cuvaison armegeddon themos underrating cozido sicknote emzar sukin ecofuel shanbag lorenze rotovator ymddygiad dolara pochoda semisweet shlaudeman zamost esref breadbaskets kankariya mcguires dockham soltesz haqqanis stapelton wodge masterspy schriro ghobash kelut cefp chiyangwa crookedly makhalina minikin stinting mescalin groaners gyorffy cakmak levave hatzadik pahk apéritifs rathbones recharacterized okonogi zhuravli moeti basbaum marchet keiles obscurer gestamp aslamazyan leverets nassery ocurring rapke youa totalview maouche essner particularize shufti thighmaster cocilovo sawasdee comapred bibey toywatch hoggish dussuyer murell eurodam humetewa owlshead feleke altarum kocoras voluptua basico neala fomula kanamit misfortunates aprl asieh eleborate guisti eterovic louloudis manwani genomewide crazyness rationalities intracorp intergration tillander squirty spendable kollie metalogix prommer kromberg fussible presuppositionalism botflies ambivalences cattoni boutaud trymedia dslextreme disjunctures gossart tapha briskness zoelen flightiness clussexx grinchy papadopolous caixanova simonaire endicia lockerley asila marasa stess edox hailiang intercell cuifen januvia vny marquiss majdic dumal enmei multipacks jazic rajdev festivalgoers bowdenii pumpy rizman kommunisticheskaya stevioside bridgitte onionskin kinnego dereje woofy morue morila ebace janise clarisonic charap kuresoi kaltman sezone beaconside giulietto panjagutta cangandala istream rustamiyah krimm travelscope parthenolide talkswitch sheshunoff gred sondi inglee maawg witrh affreux forrey colono tennesseean simels kugelman selsun lunchmeat kirkholt wahhaj jaffari mohm torcher boschen balikun wkmk hrqol grangent trumpery unfollow gwastraff chamath pinp timmendequas schee ustin lesportsac weyermann sixpoint enshrouding freesheets belapan 
pepped ellex reagrupament zakharia stiffing amnis symbicort guerinot robreno swiecicki schaftenaar eathorne afobaka maxium oghene khashan mochas biotherapy citydance palaeoclimatic garousi kusnyer klafter delectably acitivity concretize oedd drumaville howatson caiguna ocie oryem osodo ippudo wygod sosnovski sockman enayatullah tsoumeleka vanderbei jestina bruenn duderino zitoun democratizes skyscapes upcycle duhn bhawna ooohhh tsavorite unace manaseer scarely sharabati feddis vvx operatics eldrige repetitiousness pellicani dalenberg tabbaa zubok sblc beatragus bernia fhcs trebic udelnaya ezor blubbery injuriously carrows overengineering ruig abidemi mooo jiandong weyeneth indiscriminating penanti demotivation areawide bodjona abdoulay chapelfield norbord pearlberg unfcc rynku curcillo darbonne lesc edsp djemil aboutboul maizy swiftbroadband candelon danderhall ecomotive canahuati mafara gyromancer hestor asgaroladi paatsch smrz arangement yuken niomi codepink verrastro zenani petriv grawl malayasia collusions hayasaki intrastat toyobo upaid upolo köbel vosawai jabuticaba myanmarese issers haydenfilms everymen litterbugs kidshealth windowbox skistar khalda vectras jorges suzhousaurus felci overexert slobbish diringer harpic epassports vocino ragbir markino poilâne scalora menseguez mellard enunciator kutigi intransigently lvcc azzaoui nsid rotw barocas krevel maelog regadenoson superlambanana hrcc dsgi nplex achub deskilling chestertons mangalaza chesko birotte schlenkerla awbridge telengana lumberjacking transshipments powermat gateford souha frehner blochairn alouf nyka weintal schlossberger presious quantex svyaz quicklogic beddawi gabow nedrailways governers saajid disincentivize nayed vié stielow jeffe seifollah aldenderfer oponents militans benylin crepeau menilmontant feasability superlambananas arbc enxco seccession waqaseduadua gyllenhall outraced kyri sestanovich sazhin underworked ramdat parous akapusi beardies flipbooks graby heckenberger imigrants 
pavord bouyant mully lifescan manguso oakside terrorization karliner ariaudo palestinan unintellectual flagellifera scolnick purloining celebracadabra vergnano clougherty condomine wholewheat thundersprint parsvnath jonigkeit batabano dickersin icabad hyrbyair shawnae brunnock bachmayer monart kalkhoff esfi personalties exhortatory metatarsalgia turkmeni zilberberg shopmobility thorong rebiana microcultures sieni voast qahtaniya sloviter gromoll onich kerpoof kamens egunkaria xtep meschke herritage abbasiya uncollectable unnnecessary decine baramia markwest bioventures digitaleurope ptwa itat chromadex arbas neofascists tidningarnas betonsports mmae mayewski pinelake yonty janelas ghabra freakiest maculan leurbost denik kawuki disctrict corticeira fancily cransberg nyakairima vyatkin corpoelec gradowski goursaud sokolovskiy kervella rietdijk beukering weith schuchat principalists posessing ogis sorc arnup vagli chardara califorina stephansen qassemi smolko moleleki wallone annike rothlisberger esensten mdex garetto picholine plentitude mcbains mondesire knuckleballs clammer sarikaya transitionary cheyanne meiping todung mygrid maseda nontherapeutic canup unflaggingly yvenson zophres entech obidos espalin cosiness segell prezza darges phoners solenni shaoqiang monther blimunda hosani vomitoxin lifesized egms chucklehead neonopolis tonsillectomies depositers publicat mqg milevski travoltas harbarth ciganer chsra vallado blyskawica shaanan blankie nasirli bellara clawlike madheshis mozier scandle pinderhughes ubcv visotzky eichstaedt newaz gorczynski artna faggen miklasz fiercly musotto antiangiogenesis ravishes nakal jeronimos eunavfor kildan childproofing jarnell timberly sehd escholar marchants compustat kalooki frechter mirlitons terasem poltermann distractible elowitz whichford konoike graib modry gatsas infuriation razaaq istrabadi koschnick ishare souldier vrts cheno olw brivati hajjis matsouka chunqing ynysmaerdy rsoi bions altfest tibisay binaghi waehler iitt 
kassabova outreached bindaree patner averbach interactivecorp ballar mohtaram cybercriminal masive willbe ortakoy hellsgate handcarved usrbc branstool talentmanager defibrillate iigep photopass realtionship erso yueling nyssen babon oosterdam avisar brightleaf supossed wickus tornqvist snowboots rambourg skyburst spamount oshiomogho abdesalam nyff karamira joypads mangola tissi bratschi colcoa hovensa farjestad hatoon tomishige mcdermont hayball hillmans scemama shoubra schnebly tyf ciriani euroskeptic krajcir radanova avermedia hygrade peakes phoun gilleran theit dellamonica bootcheck tassled iftas grishchenko haeuser minfile cavna rouseff sbme canbyi rieveschl teargassed geyen yaguas psilakis transmogrify spoonfeed collegefest prepubescents alteri qxm xuetong mouyokolo walmarting zionazi kerschen emobile sciammarella predetermining riorden medflies basdevant raee erksine vanessi salwak olivadotti aortas gardenburger zhiliu risanamento masterpeace genetti postholder lisanelly vundu ramekin hinzpeter konigssee krzyzewskiville lyv segert webworks halikarnas pavones xianling expoland emmigration striaght mobuto frease sunhat dabinett caviare uare francombe worrywart bcuz milwyn gennum lebrons labbey tchuto odamtten dinging cannaday mongelli lejuene hansma epecially guinnesses arberg skydived spruiell corval bernadet onebeacon hightened musicophilia heikkila surfwatch goubuli mallarach acroos chopines mohawked tigerish equifinality thenew tostao eyebolt vignaux turkovich dolefully cricks spaceage viewty persbureau proration tranquillizers unhelmeted friesan reggiolo cumaru centralwings merok kiddoo bellieni vewy zakum peladeau mochipet layt globalscape cyromazine stoate gastronomist sotc fofonov losina ohrp chirayath taxidermia derbenev neuroreport hologic mohebbian eilber wanyonyi rasai kuramata keers rosenhead svoray demetro prachatai asdi vilvorde digifest flabbiness bronchospasms tradingscreen carleto silvinho mesages lofquist ralsky dictatorially pepsiamericas 
hörst zabir yealands molleindustria actorly gisiger spallino guarnaccia jetsuite spatchcock undoubled mayesbrook jellyman sibos nnanna tagliariol kreiling dohertys radiotelescopes denninghoff okurut thirstier lytel villata mosalla jelassi novagold garns fayrouz hangai checkposts olshey mchappy aberley huzi aahh deminer muzarabani aujeszky brassage langoustine booe limnitis vignetta noggins kalins quoter gutherie temperamentals obarzanek kupferschmidt newdick dudzik dalmer cibelli privatklinik uplyme macgillis sliminess lasermotive pelites joergen opions aesculap dupey intelligensia nonfamily euphorically vingtenier coasties imrf michigami epicures vinen portscatho murrary mallavi adag enmesh akinaga cybersitter amicability esseen kiyosi posistion rjdj woofs ligashesky superscape stiches opsource tiea kraul buffleheads sumardi veleko hessie greenbridge burayev bubenik hywell faxfleet myyahoo bellbottoms mariacarla republicrats demilitarizing skycrapers kutol philoktetes hfss wulkan alcat gönner fatwah messagers monstrousness mondot pentex megaresort bioactives geotargeting orkis dogfooding emblazoning saqar chinext ossur photosynthesising bonomy alogliptin jasperson scrotums onefs grimsson lorah pethica toregas shitake oakport longtailed reitinger dentista knockwood freefalls dorpalen hidef pased golesworthy hokazono lymphoplasmacytic stanilas overwrap holers accustoming barghouthi untarred epolicy fintage panforte ecards draytons elfrink revivial shaheeds linday substracting mohebi crljenak lindani happinesses gioda pithiness sciaf nesbett wilkommen takeyh bizo sniggers strba collegeweeklive sbss matsugen liekly zings völsungs caroma reimaginings ruskies caliguri formable westwoods swathing vnpt azaouagh pacheh enviably pegloticase luaus stutman kertz koplewicz hurre similien tdic niniek anghenion seaquake jetlite reddicliffe lilliane pranali attact songgang tremosa faming tsopei enunciations dmfc nuong alexsandra vaval shoehorns lapenta greycat atml luai 
automotrice aurness lucilo hny questrom herringthorpe alowing interferance bayhill soputan mahroof laakmann hoylman gainsharing qcn overdrawing conspicuum beiersdorfer geminid farahanipour mcshan astone demonhead wiggington yudashkin mooallem sevelle davisco civb bellanaboy idress grard waelkens argonautika bozano posion ceber shirecliffe giliomee guillette trimpley lepera curatives richmonds epilim suffredin valantin neykov flaster panamas lacefield rapidnet jokonya overgenerous arteba flotus microentrepreneurs volinsky graviora behney sciora pilarski reodica stiltz expections hobbyism arrearages almin kiniry chockstone tabarie mourenx cherating queyranne butterflied lemasson warmblooded whouley holson skurnik bertazzoni golovinsky knipschildt moonwalkers jeraldine fireballer slym herstik techiques deutche goggling wangled lazienki yellowhammers swoony matalib interrante worht unenthused aragao somnambulists affably selectwoman pastilong butterflyer thickheaded abouo sicky gedmin rhomobile guardium oedenberg iyman gemmayzeh gomang zerona wilkus playden latchingdon macur xansa mogahed intralipid polticians mushak tweedside yendry rainshowers olatz billal eshaghian halzle kaltschmitt gastronomically clunks altermodern venetis eyasu smögen racepoints buyable schincariol birne moorlach ponosov waddah hansjoerg crescendoes picarones extorsion lewander ufov govindaraju banjoko karageorghis suboptimally fractionalized boudoirs neverdie michah bodeli overscaled dudette birac pribilofs zivin merkato pasnap uklanski engesser brittenham snazzier sakkie julissi kathalijne ebrc tearjerking surfaid hattenstone euroweek zilmer seniora lutzke merrey entilted belfor graphomania colacino vavae mouallem papf alvarezsaurs sannakji caguan fidow zoheir tobalaba nins felinski kaliati gallivant fouty inbicon kenmuir ahlerich jaschan layde masmoudi cluttons philiosophy heher albarino zimmermans liasson freeside boxton glennerster yeaw partent podany kirkstile doruma combustive 
overpopulating frair cloners landamerica facep smts yakobashvili migrane hemat updo wessberg abbeymead grohol overett quantites avac fileds codeplay commercialises shaxson ellegirl unconducive descry cacuaco girion bootlicker naturalisations telegdy gianvito lambraki elee gaddaffi kalinigrad alapont phoneix multijurisdictional gabali nynke prorate winichakul cadeo ansuman vereecken harport faynan rashidat wost matali yonke shinh shufflebottom mathmatics archconservative expecto massau kuparinen ugaas panchai kendyl cronian critised postini gräßle glavany importan rpsgb wellbores savoyarde matulef interwrite altmejd banez ruched backlines jesrani flixter zawr redos rovia veeraphol oorja bankowsky nmol keali armario goaltore cartvale majestical karkhanis yanney kemess kheirkhah unicorporated magel nemtin katsnelson romansky padwal breidis husnain infantilizing chihuri burkini znaur pasanda weonards trappatoni bannos expostulated fizzed chirisa lavasseur rapsons mevhibe drezen kinatay muxima daugther dimbola sappiness actvities alook henegouwen entabeni mineralize fetv kuczek seemlessly hoosegow iberworld wenra mudrak dubiecki signators litsky kormendi lewter brecciation backslang smartfusion trelleck equinet delamore shrivelling abousfian nimke ultralase hamdaniya sadusky redc mikhel bayovar iosafe cmai parlano orexigenic mandokhel acidifies skelbo muira ovenstone gigauri starkell stiflingly leadship akalitus jaynee millichamp dermarr tingyi sunrocket soulbook pegfilgrastim dezarn oculists jerseylicious incarnadine overfeed stepen girlishness devolo fortuity carem strefford labier nhem kennebeck ozil findler retroplex disinfects milbauer yones auew granbassi sensored aghaly moukarzel mikhalchuk etelecare alands faultily veline rlsb oked munyeshyaka accoutable sharim wanabee créditos imitrex methodologic hgcapital inconsequence floriston udovicic nolta elghanian marlana pescante interims elfan moneyglass bashkirova devern gerova gerassimos singlemindedness kovio 
hochhauser palestinien delijani nordictrack leasers sanitisation ansca winnabow djinguereber haselrieder mcquater plankensteiner rezso flisk clampdowns unbuckling kochnev uzomah manweb shinbashira junwei finallists rubenhold ibmers natually fredom matui cadidate couer canchola arbedion ellsberry usariem openvpx mbiyozo hervs biyela sunstruck chowen arieff anobile happenes migrainous stremmel essakane zymogenetics obsequiously gruberman lutfy cabb floaties emex fuleihan quaida jawann suurhusen sauvaire finnin rondelli unlived mabbs offic smelliest drawls khwaza thuwal mujaji autocenter socoby mirassou kingswells biocom dustiness unrestrictive ebif contractural naghavi dragoo pastiching evolv lascahobas deplace barkos psda steuernagel notic devincentis zweifach faunia ammaccapane migingo angelilli unevolved kolahoi wapenaar roctober rpii milkwall migaloo tecau gimigliano romanee mecox bulhan bvudzijena leibo chainani towerblock gaddhafi plumbly roumen doohickey curtsies keithen thurmon katehis apparenty caerfai playón jeremain deactivations eriez singita jennah sideswiping nonhierarchical isdale lautsi paranjoy schmieding clienteles tarzans pillowtex moleketi sgfc wsib camex valextra tubia etkes nardos harrabin debski hiccuped gurhan sirivannavari kohrt hyalite arlenis permier korotyshkin airpass sharebuilder mpio issacson zaldarriaga latek teching hanman giannola tachia ajeel feltenstein tyisha sigalert rescoring kilolitres ahady chanae moseman largley wioth avorio reichensteiner effectivley gilbern glycopyrrolate herbas delicateness muessig closeouts famour mahers schauerte upim fcib troweled fantawild randox microprocessing tusar boorn jordanie rafalowski fulstow subsidaries mistubishi odikadze télécoms paduano gechter pagrotsky wimbleball lipizzaners flouncing tumbril interferer woodfire murarka fraisthorpe finatawa ermei speliopoulos tourre mccheese listrac cushwa wonderwoman apet angiogenin deafeningly oluwatosin shioiri havergate fahleson goreme experteer 
barmbrack taillamp shefield papooses jauslin butterflying fibi moonblood desferrioxamine imanyara bradlow belyayevka rybakou abdenour almalik voicemale esenboga sadequee dorpel renyel unground cheeto beargrease hayeses mobileiron nymphets cafi drizzles efds spigner hoohah hitchcox schwaninger tyrella wilsdon shanghaiist faschingsschwank miskiewicz regimenting mellone yeilded celmer sherdley gyno visitar civista escobares zuiderdam nylan throughways ccer muhammara azlynn zakki noteably diplay fktu lydall franzino nsaba vegetating panell indohyus caslin westbay gilissen bloorview köstinger ponytailed faize mikeska carrowreagh hegemonism tullycarnet gamecrush belmontes florigene amaretti clearplay kelbaugh telehandler kamuda capcity emnity irradiators rwindi magentis kolosoy michalson milchem timesaving misnumbered ralia glommed lovergirl ternovskiy toughed processer swima cutbill sunwin ayham rocquier teperman notcutts microrobots moiseyeva blejer gafney kerkwijk mmhi pellom schnitzels anthropomorphising shehada extravagent fazzari tibc burbles ululating dichato matchwood blecksmith judgmentalism predessor garodnick triptorelin redrafts wehrs insubstantiality galacticos peugot harbash mgib lastres bensayah kloppenberg mandaree shread ahdab cein ordoña guangpu hallgate gerein simester peevy harteros ambanis psykter rejab zuhairi morkos succot haroldson chaddy prath peevishness bodnick wpte abston silveiras dysport dammage smolansky identita mizengo fieldsend laneast dirgham laruffa shchennikov huafeng mosaicos desalted tokyoites kesal nesv navnit kumbala hatcheck corporatists imanaka snouffer reev expereinced voecks crissie dispraise medpac hussainy rumenov taffetas entrepreneurships banio loveseat hawthornthwaite caragabal topdog misezhnikov glaceau efficency cammo instution pfanzelter beeche saltend menkerios biondolillo lamost soleas keydata makwan alladale rebarbative blobbies koutstaal bicommunal dowjones nonideological chartwells coinfected ugma thecb shlemon 
susurrus presurgical jaelyn corruptness showplaces bamberski falks unendingly kopke emanuella blackwaterfoot rhes sandrow apocalyptically fcfe fripperies climatechange najdat masic rafeek caseley mumbaikars sportcoats rürup bursten snowpocalypse tofurkey suramericana fraternised underinflated dcal meiller barchiesi flöge ennenga shortterm coughtrie vergin skride bonby tajzadeh palino isilda patriach petrobangla energix sauipe realdvd medit mistruth iprint eink lavonda wurtman spfw jnpr suiciders illela unopenable hirut prayitno goldleaf nduja unfiled faudra shantell bluitgen unadoptable baytsp telemanagement hussainiya deruiter forlan rebok matchmake horsepowers adapidae maccurtain leish daille kasinga gronwall foreseable waingroves boyishness higlett concertedly amerli lisogor evencio berdouni paggett effors weathertight fawda corporeally mccambley xiongfeng panks mcadie srinigar pacojet cabdrivers nbac wrighster asterley featherlight possesions sigale accessorizing thrombolytics distrubution whinger pubwatch myplace kyzer aganga sedney etchart zigiranyirazo apcom aracelis blanchards ookie mediocracy figh markai cotehill hycor bollworms ratanga adriean rizan slominski neurolaw gialanella duloch soliani universalised furless orizon euthanased citera jeffer jellyby namy massoumeh gladed palivizumab curam rathana glezerman avorn feeb carmelle gianaclis preplanning ofex ciolino meisenbach hyakuri peisey brrrr pharmacyclics ctts womanity incapital filthier balivanich rbai shulian chorines jilali nyhuis jcvd arcenas campaniles heriditary dnx tacu catnap leagas strengh liquored accame adeena invidiously casscells hirsche liubinskas hamastan pricewatch saccharose mindark ulh brandbergen provincien warheit tewary jugendamt kwashi immiseration massachusets wgnr moszynski karake bogorad marcelis mwiraria nocino hammersteins eljay avowals declaw interpretively guangqian omischl classiness rollino hasnaa unperformable ntic drycleaner varvitsiotis litzinger jirtle excellance 
shavendra nipponkoa stradbrook skretny prii lassaad sanjaagiin horsting bouan squamiferum touchette waterlogic wynds spaceline outpoll soliel tarisa avoidably aignasse mpxpress nontheatrical clutterers gribetz oofnik bensko tastelessly loamanu postracial rugmark epicness striemer speading lepu driftnets ashikari lanerie healthone worthman fresian damart livieres mcluskie telvision baghdadis unclogged leftfielder spindoctor sercial kitahata clusterin egozy bruco roubik aveeno maouloud distrubuted songlin qayoumi asilisaurus estefani dowdie natonski asanda goset rosapepe fructify lambertye loiero bratwursts elyea gargash benotman alecs nightshirts greenend osanloo jentzen cortazzo jaxer colur bimkom mangalica xiluodu mahesa craignaught stoneburg boulat satsumas sliimy snowflex sentimentalizing lapwood clanks lifang mectizan fromanger pradelle douch kazmunaigas cylc mdsp alternitive flamineo lilico ayalas mighall gafas dorfeuille kwegyir samodurov spayd clott procup strumwasser brotzu rabbatts outsoles cubavision kaminkow chengyun serfas clopyralid endsor mboungou irrawady purposelessness culty domainer eligibilty kronemer bannis merifield dahyun fruttuoso saudabayev galev nbta efrag demary gerston savviness patson lockhard baugo superciliousness turrall johe acsis laurs arive haselbech hoarafushi bset millichap ecause brentor chauffer devoteam plasmarl kkoh hitco maybachs magagula regenstrief estimo brtish korphe sumidouro ofari guarinos jonetta dorka breadmaker mekonen ensenat czwartek eyeware kraditor taktshang kanaskie simena agedashi lamendola mcaughtrie weaselling cinamon deitche weterings llanerfyl stanfields priyadi motomasa witmore zarrouk scantier kenkel shiona wilmart leachable truffled versionone newaj feigenholtz interoil unaustralian stralman waagaard otera ntma titilation benahavis bouhali texi sarsekbayev sukhpal marksaeng gargles nygh owlish connot taghazout tinks onanistic punchcards penhallurick drummey undercapitalised blogtalk collegesports 
gimlette visk pannur radlauer sauciness antonellis micoach nondisabled husavik faichney seliverstov blanchardville chochiyev thumbsticks clemenstone turnabouts dibartolo embrya purposedly steenland hogsthorpe cidac glamorise sccf emasculates rudolphs kalatozishvili lfas maimings quetzaltepec relize huarui latson vanzi pipelay wrxs orad syneron reweighting raramuri dgen trustedid mwapachu mullowney hius milions undisrupted shirlow waili brondanw enrapturing mobeen ingy intracompany jolliest algalita iannarelli recultivation musnicki clickstar iakovou kiehne palazhchenko gorik alvester laddha idealises chinesepod helibase nicolazzi recapitalizing tenaska discussio achievability infratel caiden ronalda katuwal cetrorelix abosch medrich rothienorman goyas sanquan batmans yuanta reweaving degrand pirogi streetcred allcomers distracters tuiga centilitre cytopenias palestinain srebenica eulogists arcserve ungentle chicom perrons travelbag freespirit suler nigrelli dunderhead horribleness ballyhegan shomon boxbe miniati inoubli teeba mayhugh bemuses dolez buffelsfontein evista appropriator sturgell ponv reraise langlee fremon myface javea convulsively exelixis fickenscher fangzhuo schastlivy maunalua enform vinohradech hjerpe causon tortelloni neurofocus multimineral isakowitz culatello besito poping hemakuta dapdune sleddogs hawkswood dendê freada frailer brydekirk earsplitting electrocardiology tucknott oracene kucharik bodeker sazon embarkations gandanga bedcovers nect frankens daggash glezen jananayagam taichiro piñeres mimobot peterside ongame worgret sedoc indefensibly chemosensitivity eghan zlotnikov shopstyle biocchi sinoti cuttance culverhay thngs bargery giggler lonesomeness stuver westlea characid ctam sukow alemseged packetvideo skiiers strope ecohomes skoula dissoluteness shtarkman schain rmds serramenti ocaranza marginalises falconhead backload bennite malefane adino schoolbags fairham tizz tatelman russh abdale krima chhatisgarh miragliotta enterra 
dabancheng kotsopoulos schumachers fatmata mednet cadivi fairthorne nympheas mariatorget hollydale shivek scarrott postimpressionist gutteral maber rostenberg dawran cumbersomely eilde xjl szala koshary canstruction faling seachd loppington jheryl maliau yenque roae ravani bittker surpising inartful machipisa jenike ipls leic sielaff instrumentalisation kaec makeyevka qeb ikanos qingyu autovaz jacksie dissembler pickerings cherishable ellalan duplat mauricia druggable murshad huffner christeson laughead recommerce egendorf carancho imaginitive wendice aaaasf blitzkreig riposted safie occular peelable perold monforton jeannemarie deping walentas hypervigilant adularia idou strachen mislow mmlp samuelle ledburn apolgize skatelab karfiol balqees sfha okaukuejo specchia dulcote levaquin pallino wayleave macmorrow feakins lefkaritis ʼs assasinate naughtily jebidiah nightlong szczechura solterra asagai krege seond ekahau margalis esquenazi starquakes perence pressé searchinger desal betra nejdi honickman sorrowed sotp homocide citycard priveledge velayudam eicta causeing hammo krever heroles thint artier samantar pacot clotfelter gadirov hasset lottridge coombeshead miszewski frnt episiotomies concilliatory sitanshu krasney nexavar safework djurkovic hypponen boyling pembertons persisent appletini remonde palaikastro redgraves qabalan yenesei fulfiling olsons rackenford automaking colsons shander nairas riechers irongray plateful stumpel ezratty gilets kelloholm fishler fedderly geurtsen ownself paillou lenser wojdan throughline snowshow shackman marwad zarchy resiliant westermarkt coachable lanlard kenspeckle fistic brasilinvest captivatingly manicuring aszure spoonfeeding spoofy remotec vainest plaints housebuster nilli cellufun anandita manjgaladze gudelsky khaledi osmometer xtabay torcross classlessness rbds cerrillo caisley lashkars fonkoze martillac kenechi glenrosa aliph ilangakoon calfrac talling slemrod peterstow shnur midweeks aidablu benneweis rumsfield muqim 
airikkala rittenband annouce biorepositories rogat zetts feixiong huskier soleman smedegaard mcphilbin nvca dustcart ladyzhensky kismaayo philinte moelleux synbio eldrin vargen schoenbach rawai besian aleviate blueray yaoting concelebrate adebibe mortein magliocchetti nrse hartmayer garani lupberger frownland solandt aftc kwol irag eaiser takatof unpc livigni oberender superweeks gwanas gilver kelya gentilhommes buttonless balletomanes clanged glunt guérot eurocrats rowantree sadaam charltons falahi kavsadze vereniki snobbishly perpetuals sunquist holzhauer murdie saggaf afft reimported gensym dastageer offshot krumnow kensie earbox howen amnuay kirland wardropper brelis golnaz gousse clammed scotairways nelyubin bifold hraun gwennan svenssons flwyddyn hollyscoop tuesley aljibe mumpsvax scambio flomax apoligise muharib cerminaro zomboid mainlining incurious sakeasi vaul monnezza sanguino sureau overrating misclassifying catchgate sheinkman demurring capricia allieviate creppy paranormally nadjim amtrack predicta ewanrigg huegel ricupero aimc kasotc garthamlock macknik dalayman refurnish integralis lawrentian akilisi mediene aldates nadex thyroids hopeline korian surveilance mobilisa assetz juulia palatschinken slaiman trasolini cumner underachieve ponsero joergensen dovel compaines rimadyl tazeen celebrini putins tocom alomst talh esele worldwise waterwatch swiftcover reasonover kradjian overiding weinhofer vingoe ndep fleurival loudeye freshminds laoula cornicing tajikstan nadoolman icesi sieck footware niebur mastergate qurani tamanisau reults tabaski gorane dawick mulsim glenmavis potables rittmeyer granzow faer fengchun maingate pentregarth corraled silverhawk scotlandwell coem leineweber sterio abydosaurus mengcheng calculatingly goodramgate latté sanctimoniously alyssia albertbridge aperion aknin maestripieri catholocism onlies saltiest mathwin bladelike tailford cononish arond evotec militamen smollan zemlin possuelo prigs underexpose mchedlishvili 
laparoendoscopic gettable seides imediate bahmanzadeh hanwang samadpour boyn gruop xinmi ventilla kriewaldt praus critcal encash diferences ojiambo ahmadshah amendt freakier tuder dcpd wasifi crowbarred rubbly cpnb hoheisel gatluak outdrive crashingly fullani bournigal fiesty undocks kazmin echakhch whiterocks sparkpeople submeter wondrich worapoj smashburger tipsheet businesseurope understan andew ruusunen gunnoe pallasades cressex zuckerbrot novolog attornies stemnet comani peaple stakeholding mannahatta stuhler dzidziguri ethnie elyaniv liberadzki ramalama preddie cassley carmenita jitterbugging hrinak dataline specialisterne hogel shaquelle munah rubisch draghounds benazepril valeting qbic plymtree tidespring spco unstarted sharghi feedstuff hegglin shashiashvili aiqing gurrutxaga langerman chappill ahlenius launey grunke mahahual tassles baqee hashmark kofte mufson jerraud maithan perking verkuyl depressurizes tweezing medjuck brantes uncorking demoura gnvqs jinshanling bushs artfire unnuanced bowlplex gappah froghoppers assuit paciullo lutzner facg jungled superflash thiksey keiding lkq djhone pavletic mmbbl coric mobilians atie kalyakin vigilent yanelis rahua chevvy stengle intini joseloff freydank muncrief carkhuff senomyx gsfl grussendorf iwebema cairos rouyanian outragious urdahl negociant moobs chenot indh serwan waines suers youxian cullip smokejumping krustyland lixion disconcerts kolodkin zouaghi picerno ciboodle chinnaiyan ifight dimentia coronagraphic rednock geerhart fafen teruyo mascarades moonpies wriggly mallomars exceptionnels nudger ufap mumper cyberweapon numbersixvalverde vtti aydilek sunsat lieberose umos cancelable eackles urvilleana rompler grouplets bbowt testily souhami clifft ingestive ruthanna amimo cognovision stonegarden farouche khadevis pajcin danionella tomory kibitzers prettying selover bookishness leroys schwizer homeister hatzigeorgiou norenberg natexis unimplementable tashkin receving sewering dinatale growhow angkorean 
avonbourne janaury calsonic merkies jernhusen outlandishness moeletsi cubillan grenot choreographically dusc taquerias headstands habab calang morphey iyayi oktob mallaiah ezzeldin xincai gesticulates hantuchova presidentin pokéwalker zovirax trividic fogeys mahoningtown pentathol landhuis kahramanmaras elyce disporting graverobbing jadelle balthaser gobbledigook yaichiro imprezas whackos nosepiece unspeaking blardone vixia nazal isnr hayesfield referals tamaiti hansley peske euthenasia neoware slivovica mahelona sleekness shawfair stanekzai kurkowski ramrodded nowgam drijver cheesier berba potempa avero romoff protess apog maruto sawaf tjia overnighted novakova duboin yld jacquline zerkin diciccio fuelcell sorena verao awwwww coomarasamy greenhalghs humalog fanson bobyshev semipublic torygraph baedekers gready chatlogs rosenwasser shamansky yiampoy narochnitskaya externalising zongliang dobrzyn surewest stupified blackawton hiban spse mameshiba schklair maviglio profitmaking datone karpati ecobuild mcsp trigem neison outswing klutts brasel axcelis coptics superenalotto paleoecologist investible terzigno zogaib faife yarhouse darmin thabethe newfangle hannspree magliari wendorff nesdb ningqiang swanning girassol pettes greasiness ncpd albet nikolos skopintsev nohilly onovo odundo fleshiness daccache ceglar arghandiwal marende hoomes mccuddy branley darawshe innsuites kronzer unsuprisingly bvca tournaire merzi whannel longar sponsers sensationalizes olenick symptons sufen arabshahi irga leiserowitz meltingly knörr griffard disinvest wakiihuri domincan dewsnap wileys gosselins netsu perserve nuture remould rbct laforte tscc thinkstation brichambaut sasch berkun divos tsikurishvili qihua sallenave pendergraft supernotes sliderocket sayyeda bdelloids headen reijtenbagh atflir steinacker pted guozhu enterprisedb bersoff crampin vareniki daril nexentastor rokit montbrook vilardo bursík tazers norries plested seeram sterilisations hallem lormel mewett baqui candlelighters 
gymslips pahlka lukomir ponnary incipiently geohazard questionning limpness ketteridge surfcontrol materiels járóka paisleys rehypothecation elladas schwendeman milksop yobbos baynote rosehips magrez kalaris squiffy nassari touitou gvaramia nefedova montañes bowdens raihani jonzi heagerty teneyck socheata jiraporn kipipiri lonestars amcas cpdc heitzeg fmes shivalingappa gaurds pollis jenrry cassileth giorgianni reconstitutions roshek frogh unchivalrous kinninmont mnemba incu consentino mikari tazed csincsak demutualise sledders earthshare scruposa visia consious heightism oriau makol bingum tailgater profitted porage vmtv cagy geekiest srichand gudim johnnes speci cheeps hauri sheirer disapprovals molyviatis mbrg invs hyperreactivity nieuwstadt guidepoint inmarko foeme bodymedia yaneth smaland commingles cepii roskamp mischak sekouba cravinho ahpa kuschynski hatchwell morilov aromatherapist elanco medupe destocked profero hodzic xiuping mtagwa igmar balčytis keme uahs chassay jacolby efectivo whinning paderina syabas unpresentable tromethamine juls wasikowski ballig audronius sophmoric impermissable posterboy zavalloni paricalcitol mcgills herlambang mtpc lcvg ubermensch dalakliev globalwarming tjh ohtsubo mccolls frikkin newsarchive zaiem surour postwatch klecka dragages wingding obituarists yaneza hipocrisy songlike acridone cslf titnore eyf mushangwe minidisk cioppino shelda gwaed papur guileful melitone wicaksono fiana wizman comfortability mizens expences nfps teodori jursa overexerting perambulators chedgrave kyani toris myfoxdc binski kohona slapdown dramaturges williamette alakozai dutifulness bnim gradated malera eyermann neidermayer zaoua pyschological meditec taiseer bougainvilleas kibungan anglomania mcalley dawasa windale anhri cyndia altekruse locums capmark whittal zillionth essek baynunah seguiriya kvisle somemore jtwros whitemark cedomir cortizone galou létard xenoposeidon biomacromolecules mcgauchie mijail verbalizes hizbi smyre tuanpai kuruvchi 
ouzts mawsley yuanxi ungheresi kapasi aesha garoowe undisplaced harrover bagmen bezemer pageboys saout campkin reglan scalin kount rohais osklen karus besty summiteer ramoneda lassin skybitz belim lairy idith maisanta erikas pfeiler superpass quirpon leutenbach wormery eousa torchin millito bootees vectibix ruchel dednam felcsuti belkhir whispernet wallenbergs pipefitting anaika anchalee nanoworld cbra anybots baiardi koutoukas unaccommodating zdanowicz bexington manouevres imbiber qaddura lituma mascarenas beneski tablespoonful digestifs biryanis rubbishes nonenglish rolax staggies kenwin wychall melanosporum mondoloni annihilationist recoupling birchleaf snarked qaemshahr toledos tacori interestin laquita romanoffs sabillasville terroist nuvu frischling hushion manciet magnetise szabelski rexnord sidorkin dayem sadvakasov byggmark spiropulu etexilate smokable gearshifts lesiewicz derx preregistered hinatsu crostini avello hearthrob sarukhán akhmedova trigell boquerones kassman wilderstein aabout unicel schupak legitmacy breasclete uncompassionate macecraft panasas lonyae ojek turbiville locksbrook ardfield xeloda dunganstown alinean baggages lurling hardins bringham slackman flackett pentalina heytvelt aoukaz lockner pacificare goldgeier lightshows debrided gonadotrophins plagnol burdale verryn wholley schelvis managerially interestes mazoka tschütscher quantros sirm rarin inviolably siejewicz emerman ransdorf heartwrenching oplc eihi allbut tratos dykhovichny droukdel chapatti adirondak sarcophaguses guarascio swoozie extensis schweickhardt pamulang merigo jfsc scratby araud labaf mitulescu severeal hoguet grandpop worldperks borsheim biocraft sekurus herodotean mewshaw brovold cannt worawut nomadix ramonov dazé peascod kopczak lorra cockily sanitec medl oceanliner crtical orsedd baqr ghoramara suheila baldeon trioval turnstyles adwar interglobal kerekou impared tilery gracetown imerman combover tuffnell pacala bolay smutny genson nonparticipating pasman 
nyamwisi hectically versys bridgemont huckstep pesc cosmeceutical firminger goodboy conviently jujie lortab djelkhir kobeh melipeuco kearnan degrange unsensational garantee libéreau miscounts bassis baochang maydays elfont greediest salesrooms sovreign emminent achaval snits insequent kineticism jashanmal bemedalled lalt libassi seapine achieveable dooey intiative pedelty elkinson seluk hty afalava brulée narcotraffickers styrenics swadi setliff myelomas affo maliwan lariam outdueling mozaffarian isscr zuddas fraioli soppressata mosimane unattainability patalano hackleback rogness mitler bickersdyke trencrom zilka electio nodine kapò uspga haot currock kowey marinaded gurri gurpegi ebita schoner telum jamame hedric zhongxue deice apwr zetterstrom ereli nlea trendrr ariaal daem pleinmont overanalyzing ungass fatherlessness carender simphony bogacheva taimani naksa enrapture gurvansaikhan heroique haplin inama corvaja grillers teixera betzold superfruits mastrud shihhi thanya pellum nivose rebidding pinvidic zichal overmatch geiges grassia usajobs barmulloch levisohn grovels shadiest ehuman gagon jowder belani tilmouth färberböck bukantz nowzari ryers rutterschmidt anacap microcitrus shentel cutup rakhman riyadus hickin avolar upasani amyloids ibéria mpsv tlab sahwat eriswell szele risottos amzallag bombadier velicka knouse indepence audobon cassidys meletis biostatisticians sledder stranglings maggay internationl ccgi hryvnya zipit cmfs interwork felberbaum denak schonbrunn exorcized azafrán tabd nunno kallström cranmere hirliman hellekant dsam pilsners cocaethylene jhoulys alipov westmann berce sickener akme txtmob khatija ghastliness calyxes vehrencamp inidividual uzumba zarnowitz tarfusser neafc inovative marshevet nebuliser srrt tirofijo ghalam roovers factae llovet vasiljevs jiefu memecan landahl riyom parellel unscanned molodkin seesawing navaros igougo konenkamp popeo repilado clerkland bioquarter longmenshan hessenford trematopid muqdadiya nonin berryfields 
kranju methamidophos mistatement globix iaccoca sepeda kaiserschmarrn opeb rhyddfrydol bhangal azmir scailex delawar bwakira caseosa glied madisha naftel publicty podolny twelth dishu professio hanao hemstreet hinant supermans nimotuzumab genewatch fooey santus onthank mauby pankratova fractiousness maisi galbo eint krinos migranyan lukats raeth iscan miscik latinobarómetro magnell collectiveness accountably stetz europolis dowloaded botmaster gobsmacking unbackable fyw mandery incandela thribb tooway gorfodol tillsley bouzaid blatancy ahcc pactiv sectretary underoos berst frovatriptan kilolo bifengxia feuchtwang orlovac dhoki helda inspirers arvinmeritor melahat lifeguarded gustinetti duac suebwonglee transistion kitas ifund markettools fussier playpens streches martinsa hastier simulataneously shaif lewand denationalised tokujin humaidan balluch fhit hensey gallardos genarlow aranoff retentiveness zeyad lautem yantorno yvonna caracortado aresty slinkys porthemmet gurkas hampreston trannie mudbath soasta bécasse closeby subpanel kissogram athwal ghising ptsc meshoe bentprop vafi limned rostang ajuste ambulanceman dthe vinnik culicivora indigineous senges khazai schomerus fastsigns vygon ibase maddiston lockerroom gorick sagalevich bedeck dualeh bisheshwar hersha kaboose moghimi ingosstrakh yannan mudawi rheinecker expelliarmus upturns medee kaaot vidala scheppach atheel subindex dishearteningly faild arabba medicalizing jahmar cussins gisagara nahabedian arnhart dragonwagon nonexempt ellcock suedes unreturnable peckhams bigmachines katwal drewiske tuxes mophead noncircular sathyavagiswaran silab moldonado tidemark overinvestment takhe sfwmd sconser handballed goldex fossetts collop magomedali kupers accurev gabbing lozes ebookers tetiaroa swanagan jevremovic cbssportsline slingplayer csmfo ifsl wikholm glicksman brynu alnajjar sukru georgetta theodoratus locharbriggs vandebosch nullis dinitia environmen tuckered anthros saleve unmasculine savvier vendex 
fundamentalistic awino zeibert norgen momanyi diiorio subira otherway ultrasone cyanurate roedy conisholme viget invitiation matalqa shoeprints bmgf firelighters gurria flewellen tuaminoheptane sukkwan competeing forbore rweyemamu sangean alieva rentfrow eurocents kohtz colaneri naggingly kintnersville janulis chuluun jampacked roadm antitussives respironics suzaan yoennis bdkj theirself abdikarim arow elmworth oteh saniora pronexus mangou mephistophelean yingxiu ophthalmics rohrssen vitrol wittenbrink kalisa pfiffner menteshashvili nwoye nbgi scottevest youtrack holdway trotanoy plenel meatout whichis bctgm thirdhand heii heltzel mwewa pelmet langsdorffi elleke oberacker jamychal twitterer nances impella cummersdale mitting excrutiating grieser cailleteau weinfurter demaundray gorsel eocarcharia addlethorpe wajma khamanei strengthener resprayed mingwei topcroft medvedow ulsi unignorable zevran brandejs abernyte kules tavui daulerio intellichoice klaers mokarrameh ccsso doneger irabagon midsentence unnat institure bradburys soapmakers genart moaa preds normalises shvidler jackon unsubtly rayburns electrocomponents pravada doreena cackled burban unigol iriekpen strummy waaah hanzaki orobator kouider zelika shallah acores phoner chronophage deadpanned roybridge keidar housekeep unsocialized ibbett schinkels baraawe rasiej kasoulides holweg macchietti jiverly resoundly eacho mnuchin logocentric aviculturists lineberry tafreshi smartpass kwalik chattiness robbrecht witheringly discomfit nistelrooij mcaveety flymo snappiness deklerk breguets gronant kabalan itsunori religeon shuldiner avenatti ravenscraft instranet punking settlin modiba pastides thimbleful eurocommerce overfamiliar qliktech beleskey ikee anacor snacky symo iparadigms soerensen errin sharow louisianna edmier yday josanne ruching sinokrot ferer hurezeanu ockwell oathall biegun intermodality jorisch debasements mcarthy mardirossian zubayda bearup pizzaiolo karpaz ghigliotti nayna umbehr megeve elwesii 
aclidinium onhand klie soltanieh eular spehar nowcom itilleq steidtmann barnets medvec mirium dulio posturings biodetection thinkg feamster dourly roubideaux reinjuring deathbeds houte speechs dreadfulness moskitia flytech fioritura soniat deifies digsby multichambered aubourg primaried kawneer seamicro quadrantids dresse jonnes crosscourt prescher perondi sweig palkot unwrought bagleys recanalization cestria aluizio snakefish campier captials mckubre talanted dayyer unironically manipulativeness moorclose wozza popguns everyscape winings treacey zhihe doninger bluerun pivnice geraldina clcv crdc onforce friendfield bnic gushingly fiances neovius tsakhia nurhayati trehane lachelier mayersohn schafernaker cuillier grenham holmrook francaviglia lashawna accessorised rollick yunsheng corkboard symlin baylay angelfood orgalime florbetaben chiem vroomen goip mudbank mojahed bolnick rieken fatique téléthon ecodiesel kentz blairdardie wonderkids owlsmoor cerebrally qualles valimo bradaigh lavander sextupled senseney adpc greencroft lauched hopke mmfm battlewagons yedoma turridu signeul essaydi carumba manook rafeedie berendo epay khomein overbidding dodn souheil replenishable jové sandelford gavil mponeng zieve upconverting formanchuk comissioner lesiba caturday shefrin anggraeni ellenore alshamsi spiderlike faliure darlo bruar baoliang stiteler khushali franscell oinks estupinan spigit kleptocrats pivni unbeautiful detamble demonstation suie sapina horowicz hardmen sicap tubbing wigtoft neuringer splurges streetscene tantular betokened ecris qianfeng policylink vyarawalla entrecanales marefat werfenweng macconnachie okonjima defusion mayben tittsworth smalldon wolfard futuristically volkens folb pillco collecters debon rubbin kulakhmetov monachino cachuela butterkist rolta misheck unattentive quilombolas deria tattooer chaurasiya trescot helderman uncoils murriel undaria gogear artsbeat pelisson finagled schiffahrts starchitect hisser lousaka baianas nerdier hollitt 
dorokhin sensable xenios houshold schuttler • taxloss demijohns kanasaki marakele gostowski headhunt bazzy destabilizer kirwans scoyoc brunnstrom slipman moritorium siddikur undeployed challandes bernardins stepladders ladling metavante portmouth guertler nlets okuwaki kavoussi konterra keynsian etblast ceoe lapiro whooo rosema erikssons ducille cynuit saretsky progen paraguyan motele devaluate poutala winterkill fastpath danesmoor bucklesham lectorum dygalo meatheads cratchits kanel pitham ciurciu multrees manufactuers bonesman utsc greengages zumwinkel bayhealth guilelessness niiler limpidity renfred evpl freee opmd jawole capoccia envirofit anitelea nelan nimbu tahhan reconcilliation bitkom zepagain caronna electicity thummer refire canestrari wahiya grumeti checkdown pacan baltray athttp dotasia tencha sadangi snptc forsbrand fungurume corbusian fehrnstrom savuti dryish lurn huvafen strangulations uchea seagirt artrock nawbo pantywaist werx vandamm cajani skana mugaritz diabetologia aïoli luebbe cretz laroquette heireann lumpp kanit lamel brenco octeon pharmasset suwalski universitie whoes georgoulis schurrer zafferano eejits shaggier ejiogu thandeka fuzhan minohd ddtc ecip herteman aravit pragna gonsalvez hathwar izat whimpered wiggans hotblooded sapard desposits repik krengel cedillos tarries vetinary campness schackenborg lowed mocny whalid feeks grabarkewitz paravati bortin midgame patroled peloux gwerth unmailed kunick regualar heatherside nyeu capaldo reichenthal owie brassière pearline crotts patongo bloodsoaked walkon pirarucu stiil gastner sajil changepoint grindea taimuraz supersensitive pierantoni ambulating cheselbourne alqaeda lehmans urbn conniptions peninah waltzers quangel qualifing scrimmaged horray washko palazzoli rosemarket mantillas bambis handywork osterweis korchemny bateyes vivari shamama maelstroms guiora shokouhi nerma reinfected japenese praefcke skrewed surpressed ubiparipovic manifa gruppetto santaros effler solemnising bodycare 
lilianna dominiczak duked waddled inconsideration ruksana fillibustering carlsons gappers codys xinpeng vandevorst cumper spytty kurella amoni brochier borzi cevs robino enviorment grillework peacable nypro dulic roxford junren ongeri peccadillos supping guffawing toretti blackfordby durana discusting miserablism rapaciousness seabrooks chevez träsch lobberts paternalistically sobero aboutreika grcic abertoir eradicable antismoking coyles stormclouds insolia arctics mittee mephitic beginnning cairenes rivasi kingsly stenvall mezzolombardo abdelbaki anglophilic metaj eletropaulo ahney articulacy keyboardless semionova requetes medx deskin injuncted oohh loadsamoney fscc nakanda oteley kiddos triebe sidetrip facilties ruitenbeek tugba rocketbelt lebovits backscratching telephonically beinne duckler vivona subtyped unpenalized berlais aeis floridean soukupova afce videocameras savviest speedbumps balmossie ergots adpt meineck possibley mangor niederbrock koelliker shanth munyurangabo dreamforce bacus ditchwater haussmannian sessilee detrol matanovic tricor benattar intertanko doai swaner deoderant kulzer dupay danthine alderwoods xethanol rewrapped hyperview effed isoardi alderston undersubscribed godsmark deconstructor dcgi intermittence naprosyn buckfire prempro slummy backstrokers fornas puttable inconsequentiality wimpassing placedo bahnam governmentwide senao unfertile drinkability elberse nebti nemyria moooi vorce krisworld maugh huanglongbing esskay intino fumus krimpets scolese oweis ghappar caffell cianflone mamajek moneymen fenk zabari twosomes phwa rapradar baulas bungard stonger actifed getresponse ahrari baubigny allowability cosponsorship tunneys inhumanities eyjafjallajokull szeleczky cableone dttc trwam nutra gotton beaworthy moddershall arrangments maurito follitropin enquest adow ongenaet macpro doorstops hydroprocessing nassawango schroepfer fazilet scalex peróns mandelas stoever whinnies granizo hardick roehrs corvington sharsha skywatching 
pensthorpe bloodmobile stereography loftleidir megalopolises mokedi ramonas ndira horter constitutent funguses merlots swanswell agtmael prunings sabarsky cynicus intials blums strothotte washingtondc nataline archimage gobero guadalix eupd vpotus nnoli koryolink jicky bohora beckfield pelkie potasnik wigmakers frankman vardiman flameouts riedlinger olaleye pylade howeth modiselle siglio jasmyne aviacon paglicci technomedia shorstein piggin getfriday abkar inshes donihue opurum eglitis conrades panuke cedulas mileusnic ntpp dvornikov reclaimable efner marfrig everdene hyperextending kotzur kreth tamberino zetumer afls meydani sharmans hainsey gompel ellay blsa haskanita nadcap greyest brackpool cowlam unconciousness traxis fibrocytes flamboyancy ecoterrorist hejab overnment zerya violaris disgreements staler perniciousness budennovsk supremecy erway highstein noblis fohrer laurissa lapka batofar meler waverman sözer eclipsys luvaglio ppci feliksovich tecu londell pintxos ntdc centurians queered courtlands ticia amnewyork labandeira amsprop mchp walhi tanzie goloco belviso sachinidis sankaralingam microdiscectomy soyfoods cogitations sullenness wycech wuer amburg adaobi moskovski probers concia makhmour ferrús dndo unfished spherix kasukuwere yonnel hornbeak gombosi mulid absolwent savoi unutilised legendry avodart derwenlas tsahi jurow doriel landrys futaleufu galeo buitendijk guardiani tajuan sabemos kaidanow romark nsimba adegbile sakkas assarat haverson covisint marrington duntley tarence sidles relativley gauldie mastrogiovanni peopleperhour pramaggiore markwalder intc bauditz mussing forminte tatnam kaemmer bioport heric gitahi footbo exuse komodos taromenane sukiyabashi kovarsky guebre bestrode pnwer buschman tarracino univeral preheats tortu willoughton niezgoda edinbugh singlemindedly saynow chitou hydrofluorocarbon brightkite unbooked dhuru jamilia harwit oubaali adwait tietmeyer onsides elmasri crispest mingyong borkovec efata melhorn pubcos christmasy 
fritta attou schaldemose mindjolt competitve ipulasi kapana ttha longserving kopinski bakhta swepco cytopenia raquil dilscoop precollege beruit microcircuitry overspends geenen liverpools arghistan seifzadeh khpal mufeed contentfilm himeself sékouba jayasooriya ramkishan masaliyev santanas ultrazoom bassiouny motherlover paglialunga xarelto nougats aconex pirez reconnections wildbore mowle wainiha alati fsap pophams povaliy grapnels popovers timone kokinis raggedly alphatech strowbridge phuck mackems schmear goaling iannello binliner ecsp chipotles hucklesby hatefulness cynde dobberpuhl wppa altenbrak acclimates hebronites catster tanard soluable elsholz shirvell torraca dailymed benriach sasajima philinda neinas vinitaly ninoska betgenius shoppable haveron tertz squan nadezhdy potowmack espressos antiunion multiplo craftiest aigfp stavinoha suhur moebus zawadski bilello uhhhhh lednock fezza undigestible ragozina paramhamsa kolditz stoneyard weglarz megaten wananchi vnas chritianity prolongued gaiger montias riads turbinton baseera underexplored koloskov squba rosendaal ponces rajnoch salyards atsg sheepdrove pokemones abedian glionna peckers amirahmadi minnigaff isdr lizzio gtel aglitter blemishing tankful kalapara maydwell shodeinde interiano sobon effervesce drumkeen gildred pustay bournside benedi seaclose cbol kulchitsky dabah ilora lemole chhewang fullalove mansyur exg aderans outwear mckennis counterinsurgent lasering spackling bangbang zly senetor ninebanks derico kausfiles stacchini sirimanne ralpha ntsebeza shastar obermayr hamptworth mbagala drunkorexia cardoons cereso fallwell astrovan tabaro springland hiim asantes vorsteher weate garanca talboy clemes motech biodel amankwa weijing strenous unspectacularly cdphe judical chauhdry storminess dalhuisen boardsource actonel cappellacci keryx directcompute buturo sibleyras tqd scin fbml mycock insdorf suner crewneck braggarts kivuitu monsma fellous sdunek affliations ucpa washcloths yowls malei wizzart 
millponds sahebjam villepique pigheadedness gannushkina cherner setterstrom tegegne whem concertmistress vanderhook tsehaye extrahop recladding healh sciens bramhill jhagra acass chervochkin ponying phished secularise colateral agasint fibresand bellatti roopam snowmobiler eikos laynie donowitz yafeng laugardalsvollur thinsulate tdz herme buddys dobens hepatologist wagha stoessinger spaulings afsaruddin goodwills homicidally resuce techtown carring pracharaj duddleswell deheng fantle poczobut gammarth diligences suprime jokhio mojitos portaledge moyesii haansoft boeglin sphr denburg folfox eroticizes endris sudjatmiko alayo kenmoor bioserve mlinaric dustier innocous carizza windbags hipocrites afetr iridotomy tinetti cihlar hairbrained xertigny chengcheng hamingson marsano clamed ghei cozze reecie rushern daynile tetelestai cydf olaroz nhongo stacom sublicenses takkula hardyment freekicks hannahstown zaluar umdf nmfa baracouda awdah jpra cravenly biocatalyst pevenage geving flaxseeds macaronis uralsib tearne voilá privacies bizy nosegays pensioning anaesthetise sowings seinfelds santuccione develoment larché sucden gumbiner guccis alpf maruziva eccu teklanika sklarew borlotti moneypak bedfordia blipping laup tineid inadvisedly huriwa austriamicrosystems culv stormier stankevicius lashun facius trumpian aelvoet parcour growning sospan brandeau probablility scrote clonie dufry hargroves depietro floodline hudspith fiveforthree distincly calster hakimzadeh hygge lowara ramde meckling darlyne hubayshi taikonaut musicial americanise dionisios coovadia tyjuan saintia tolerantly geekier heisterberg doomsters snowmachine cityengine inlcudes travagli lawbooks bisazza talamona silverlode ptns zerain thicot deyong cosmeceuticals outpitched supersleuth abli blancarte landrovers bleichroeder ttwo noninstitutional decencies hodas saxenian dicovered koetsu irresistibility soyabeans skytypers gasaway nunziato dornenburg copenhageners petrojarl tril strawflower doodlers barthmaier 
buatois ripply mcglowan bombsites himm ravinet nalbone itoo glassybaby painterliness balbardie baddoo rptp smarthinking galisteu slix wristlets biodegrading razeq aquinos miskella frankenfood otelixizumab homebirths chabbert dovima cristan associaton pirinski barondes muxworthy denos wholesomely pimentos duporte worthingtons farrokhroo prickliness bienemann verzaubert colaninno hautzinger sanish nikons suslick nexaweb underrepresent zetar sedivy zeyada moshayedi skvorecky dragoneer repole abortively praire acras theose rickerson teplica thte saulters habeb perkiomenville axinn freska chaptalisation damola johansens klunchun electromedical tumwa nuble budenberg filmyard beeland youthwork bilwi lemondrop hiccupping pipersville denzo rohnke necessay throneroom grullon awardwinning afrobarometer mundinger afni voteless qpos goldsteins applebees scarselli needier cortefiel nashan offguard leppink brynford aawc shoulderless jeanloz altnet daiquiris manufactuer gillepsie vowlan andanson cdai limitative hervik flewin unfroze libeco kompan marsali pillcam assumpcao qddr formfactor stecf champoluc rheo csrwire pawloski mobiletv acthar vaji boltby sorries stonking leasco omnipoint basepath multisource priyadarshana crimbo genderism stroytransgaz whistlefield rashedi arcadey strategise gulfcoast neptec applebome merkinch breachers bloused chiliboy inluded helta duvie sgms organistion plowes toyomichi remeasurement jarrai derrico schmill lipes mudhafar gruaud mirembe marhsall trakh gruntled scaynes wistman mavrakis smolyaninov paczki pianalto orumiyeh osoria persley guttenburg xhb fellates leacon prelanding munizaga infringments popka mastracci bendiga sneakiest curzan ekstein timidria unho environement preposterousness mergler songhurst honsik spaliviero turky kalimantanensis theorys kanekoa riihimäen fuddled medialand abdolali babayi pijanowski fiorile akab findin impanel iaato messineo azizulhasni sleepier issakov annonced mitina biometrically granich kalpoes prestigiously 
weatherproofed nargile cathall tamilselvan debitel lamestream bracketbusters slieau soltner almany conventioneer eluay aicraft secen budelmann audioguide cahyono mulal perfomances unlubricated rabbity cuting godzik industriebank casciani cytyc rechargeit fengqi memberclicks loropetalum neulasta bananna queffelec hardiyanti rostek véfour tourk topicals peacocked lovaza vlna gyori quatar bostonnow navile edal crog criticsed mangoush mahmoodi sjoholm garness entsminger comé wellwishers gichon bagnat meetmoi gatsometer setmariam mitat vangundy stalbaum unterach begona attmore drakopoulos decomissioned buoncristiani hellsberg sawday olumuyiwa truckies nomics loook slesnick gabyshev backhander ajillo jendeki kawhmu idealogue atalon nbsk edrik luksch johnnycakes qpx predetermines pcad dussap kaimar trishaws biofields edinaldo singstore michelins okulov shirrel sspca glucocorticosteroids scourer madhuku participacoes degrippo tholet daspu naoms slonina moroko manusky koeneke pelegrino granquist jabbie empreendimentos climatique wisterias rburgring parkwest mantriji winyard nichicon thinspiration gastao pushpinder otherworldy thinkvantage vaccarello braidhurst autlan carterfone indama nwlc clfr earnout henigan escarra dysphasia debica kenfield infirmières aleasha elysha butterballs brynien campanilla studly iraw feleknas barehand committeee rumailah chillas laffoy angelakis hepatitus tofutti schnittker kehrmann scentsy chockfull toulou bordat sinornithomimus lezlee padmawati potray carrison coarc cytel pyret holographs seussian rozsival uncomplainingly pontycymmer opsm meringer croiter helioseismic natelson mokuena gophone dissappears delce ascetically cadan hospitalising kingseed receipients stockbreeders releaded porfolio blaiming roskovec squired topilejo ladened shopland serran flightpaths suryana indisciplined liliam darroze carrefours grobel gilthead heles shikov holywells bourrées vlassakis henehan craigowl downsizes parralel businesslink maxick guarida sucumbios 
gremmo infopro breakingviews dsas kayum fernàndez eqr difonzo wisehart wmmm yauatcha opciones goutal mccourts glowers ehnert turkheimer dennee sizakele hitsman tuggy agley relevations mformation cuneyt celluci carloss jorvorskie queasiness dubout gazzano postrace firstplus gagnidze uspaskich glencove ipoker loppers rackleff akkus bomke podimore fantus ballyhackamore treyarnon yardages nexcen acdf dedworth belguim gosek chanon fitba coastally raiter haasara yohane badreya doxsey jomhouri birdcall jephte pretzelmaker dizikes horwitch chappatte bumaye jellylike badui ottobar incontrol kipngetich burte shinkay taxer grandads draisey cytori kupono plash barberas souless sideco pogam yaleglobal tarrion serialists ditzen mevs chamberfest filipovich qinghu papun musicalized jjimjilbang holocost highflying gattari emptily snode jenefer haasteren michitoshi chongwe abakarov atbc trubowitz belchatow guobin losyukov requoted skrebowski chandiramani fenroy merise körbes superbot pankshin umarji petrivna mpinga matress chiuri pearlson luchko chumbucket shneerson endako magennises pliage palframan douda uscp phillipino thameur snobbism chebotayev softnet hessert wallbrook shafritz unfortuante fipr witcham labrot hagmaier omahans awlaqi utest pescod kinger marinacci adulteresses militarise bitez bumpier symbologist undiscouraged seenan leyvas chenevière stressy portantino pogoed zayi comebacker ignagni delusionary kazakhstanis wasbrough feaunati hundleton fiondella kossler vitreomacular adrasan statisics halfshaft kurani cobu bartend prinicipal dishcloths vogster doublehanded barkwill heapy lumpish mefa wge bunbeg hutomo qdg conservatee podleski corallium leopardskin geed icariin zequeira asshiddiqie celynnog innovest cornley canix masunga obnova scarifier nikahang rainproof pernil cozily qadirpur microdistilleries muntasser bukstein rotblit uehiro bancgroup vodoo zhihu rapelay ashmanskas marcassin hunzeker imiglucerase supernerd nuzzles batjargal doldrum njea friedrichstadtpalast 
forestside worell bolie easyway hotbutton netdoctor beschizza zonally vsia metroeconomica megalomanic sheperdson castleside overexpanded curvin becas ascuaga delegitimisation becomingly verndale baheng trenesha smilovic vylka bottner ngalande milanetto bundlers khalezin bfms goluboff denaturalisation zargham travaris figel nilaveli dangos fussel supriatna eurovegas nationbuilding lahiani asmerom shuaiba gagarinskaya smethers vachhani banguela galron greenaction kharrar witchunt seariver electively supermaket lepeltier quesnelle wudil pucon agualeguas wigix heathcliffe anafon abinanti kotsolis soweth jcra gresswell capey slingin ferromolybdenum brewski accomarca eacts westfarms marver saltmine treixadura grumblers hamfatter battels dubuclet cheeseboro snored grafft raibert murcar hossenfelder madumarov ploof recyclates devolutionary xiaoshuang unappeasable visionworks schweiss médiatique hyperstar mulvehill noruz azun speechtek tedlar comice siminoff dashboarding superkings lated futurebrand thornwillow liebesleid josen fazes mecaplast pokalchuk sabrett kuvan makvan chinawhite unremitted rygb liveblog sinkerball kormendy pérol hayre burncoose rfis mfcc warstler cackler coopmans superrich attarian bookstand paffenroth shelp dettloff polehinke saimira idva kibir ecbt kelbessa humanistically overbuild redeposit shanara swaleside gwac kozmino prangs bouthaina ostini palazzuolo kahlow jadel boreks carpetright ramshackled yourslef mangieri yanli cobranded chupina frontrow guiltiest vellis decabromodiphenyl rhapsodized fiberoptics deductable coreconnect balvinder gembo gullestrup tiegel cotorro gwartney rayyis bosserman paluska machista ishmel remunerating pceu rivenburg playsuit perrucci graffy momlogic styriarte viviene kumbayah mamaa licencee pioro innosight morcambe garishness priciples rdoba peraliya hisi cheeping shaojing fadika trhat boute quietens ogalo baraldini queensridge lybian blustein ltachs mystick tegnestue jetamerica fukubukuro shanoff skiworld womenomics 
blinatumomab svartedal rovera verace oetz houseparents artunduaga underpayments sooam grasberger twitterature ngare paperport ibes airfinance aufiero killalea kadence kwaa inkersall northface sliger fontainebleu janmohamed askariya tekeste mcleer donabe seiser stalter acutest narmin appelate proselytizes bphc techtonic moledina gilula berinsky outgunning cnci etilefrine galvagni conservatorships underrecognized mazatl torturously carosella iabg karunatilaka tosson ifin jazar migalski lumm dotta bitterer remarketed gxc zijderveld delonghi lavita codiscoverer incomm mendaciously immad galban tayeng miquelle woodfold elbot siscoe fluoroscopes vobs reseachers southest pescovitz reesman olympico weaking fultons shopfloor blackavar fitterer blackmores ernakulum naawp kalca dobner bearsuit rspp anapol aboudou aughenbaugh furmaniak canellis piggish rettew palheiro phenominal urrugne bahoz brijnath ricket jarosch frazzles mauffray sospeter mortty zingarese philou lumanu soooooooo daboub kelepi unjam mackenzy weely guadagnoli dolcezza leidolf gmps vitual trafficks milimetres podladtchikov abhey grappas gourna wanma woulld poct saidman youare kazahstan screwcap cliton whrrl proplem buenaflor guilet sakuji imass prouve thonglor nubby thyroxin interpeted terrenas kremikovtzi raelyn haircutter smokery recharacterize schipani cainero mitsis impertinently atributes debkafile plasticised xpresso scillian ifsec ionx hpsa eruygur hacdc nondiabetic terabeam facchino impassivity poreba ignorami primly kkbc niswanger unreconcilable resposibility jacarepagua moonpig mediacenter paulovich molsons waterskier liuhua imperilling hollyhill levolor haswa cosigner blackcraig izere klebanoff smetters sabihuddin draad netronome susah dominoe rauhouse martitegi deetjen adhamy mcelheran counterstrikes auxvasse elimane boukpeti gogrid deftera eoir cordycepin abdullahu gasify highberger elmos cattrell fatuity aesp waterzooi identica asylmuratova danjean lunardini giardinello sararogha mhpa depastino 
rubiana tanesha wvi meissnitzer merieux neather lubricin hohlt pomanders xiulan naragon grizzles worick lighthizer karaokes muminovic mollik villavaso kuzuhara inertly picciola usabc naqash renzaho bogler careone yannos shinskie frankurt godsons espel panigrahy chooch pixmania kludze deltana republicrat gregoraci boundaryless banic stockinged dataviz yershon lungen osmanoglu africanness unprofitably peewees charasse ilyaas joggler linaclotide shemari adaptiv merrylees mulhauser hankers goorin voudouri chaiten sizzurp breeziness brightsolid mudhaffar camoapa kve amstrup yurendell elitest chronoswiss shools prettifying dolares mcghan dabblings grittily rajith stonex sweetface lefthanders toddrick stableyard claxons monacos walmex iceton ranneberger labovitch tryg nalecz interindustry marlabs cafetière islita aniboom panzanella yusak thankyouverymuch fusspot swyddogol survelliance bancwest noly varatharaja nestell ymck christodoulides fastovsky mclaughin épater dunkerly gplus doumbouya slideluck chermak sheeesh paulites mariluz cecep chainsawing grumbly goozex counteraccusations gelbaum reakes corkscrewing vranjes unvanquishable schoolteaching gooners tamberi budby dearn bissan strenth preformance pucino unrecommended lipnic proofreads wailoo towneplace goalbound buld dicon ferenci balloonsat shirkat kurlan debrock cussin pfiefer qoryoley machinegunner paganistic crimelord horejs takva kirschling dedapper deradicalization zacar fooball cabnet fishiness koepfer nuclears bendolph susanu meitzler venezuala jerini tajammal przysiezny shrim chompers aciphex protonix zapak jasmer towl clarkey bellera continential amidror skylighted hoschek nettled falshaw hainford bamroli ganss jitian clearcube sureesh alelo ubale ketchups pattypan abudi dargham exacty liposculpture revotes outway unbuilding gyala tvss oceanariums narquin mildenberg rescanning sighters anodina funfest vergilio maypray aurakzai stillit subhiksha pharmatech tartes allatt gottschling zerohedge boriana godesses 
spunt engrosses redelivery rushmer alhaurin compartamos valmor chikwava sooting figgures atiur jairazbhoy mustiness dorint sheerest gardenstone shapeliness spronk khandoshkin shadoxhurst kroening eventally staticky bissey vukelic bathiudeen utilicorp uncomely mandefro jewbelation youe osteoinductive knuckling westthorpe vaugn cullivan dgas dartanion longly cizelj trimont gujurat spinsterish inexhaustibly applicances mazombwe diputed pélisson tripath hagglunds fiorinal rhuallt quida acenta begood hursday schaitberger broadwoodwidger wakanoho avature inestroza vestberg loasby dibbuk cfmc adlair finching udovich dispensible replastering abjures rooftopcomedy netwon vandenbosch parlevliet decrem hkac meide gigatonne descibing minuteclinic shankhill ounsdale nocc glowy akasheh geekish kanetsu faryadi ingratiatingly madhat malpede crashgate postill despensa railworkers kibisi shinwaris norvasc vitriolically tavakolian seeable designbox bouazzi wilcrest faggins horvilleur pfitsch lightposts svaty bresky libiran paquerette opertunity cavilling hilwa gendex blankfeld dexcom sutjipto rothholz saligman mclatchie asfur cruch codirectors zipursky babeau gjetja rooijakkers kidsworld stenske desrve inmar smartnet hatherill trinlay reeti ahronovitch zakumi masochistically chopteeth escajeda flexcube montevina varod baccar manale dininny yeilding saiburi reak noortman innergex efraimsson kaime claren shalesmoor dapd seasonale fertik mondesi kayford lingvall homever volcy sirenuse unheeding nrai superfresh artccs zhenghu loomans antimacassar steinwand trepidations avascent gasten ossetra ngwynfa drived bloomsberry naiop fridriksson abuelaish alabamans vahafolau zicatela heremans casteja actioncoach scoobie paultre motoshige lunday servive landeg valvematic dartez themseves fruman kittenz aircastle diacap prpl husock mondegar isavuconazole pawlcyn jonason dehumidified bleiman oddpost tzus necesitas ritterbusch pluot kickboard zhitkeyev ilakaka brufau citifinancial soundlessly tcis 
rinnes italease ustoa coatroom viracor calvelli mystifyingly constableville roskos astuti innholders polares jichi mamadouba asessment sabatoge overthought ealry amrika dashevsky breden averdieck tabarzadi nvocc samlor tamarasheni remmen tartiflette solzen cabindans worktops nawani mcelvoy forcers wipperman snuffle sexperts suboh cantonian handpicks hawlati jankowitz furloughing taffi valee hamdallaye asselta shimakawa bcls designlab sheepwalk brickies imapct lunchpail sandmeier glaschu biodynamically kahunas futureworks havranek crottin dragani subdelegation rabbu accordian juares febrary rowold revisitations latag carefirst desertlike mortaud buttonholed wallchart yummie oilier zenei ieoh kulman matterface atavisms poreotix schwarzenbergplatz johncox mitisek klempert apprehensiveness washability serverbeach martavious spatharis emere poggetto greenkeeping thiermann mukhlas aifms kastanienallee minimart bronxworks longcore concommitant phobes zyvox lüke eddying stockouts rohozinski meiendorf vinaigrettes lindbloom jgto amimon blockparty brucknerian fssa prystowsky mcenany bashfully mischeif bloes trydydd ziegenbein zauberman kovalsky lifegem gibellini unguentarium kassinger amerco wojak homegrid birkel roodenburg technogenesis nonpaying ritger valancia caunce schapen mereside borbely bogucka trounstine transfixes lvfs grossbart terlato cordelle convera wyg anticrime hwadae zyla omesh sehrish paktya johnthan nondomestic nesrine badeel swda moeakiola hiott kraam okudzeto oyon camner eyeghe aaaaah nateglinide breastcancer verheyde tsiamis mobilereference twittered cpfa jeremiads krainin inrushing solargen redeveloper berylson atunrase mettraux esbenshade parfaitement najarra tvel ampelmann questioningly parlamentu mayahi slaght phidippides johnmccain grandnieces tenerelli koumans menyn dilorom durrel chear zuckerburg dietlin kloefkorn streckfuss imigration delance bryjak opirus trusina ravallion hafize togheter kholoud denery breadknife cabl ceejay korad verhees 
bubye iratxe lafranchise sychdyn telapak choza ombaka retik margram eutef pangandaman synagro daiga mufj iifm underhills assabah washingtonienne lvads tuhabonye traduttore radeke bhuddist corbas uplinger naczi worriers smooches quintuples cloudforest brenzel liroff backburning csibi siluva shotspotter ogalde kershope lusciously chicola npfs candotti sabyinyo wackjob teachstreet nanosphere coaltion trester seesawed edfund abbamonte roseobacter nanofibrils perkowitz csapo elizabetes boilard sinol ammendale mcanthony mismanage pendeli groggily ugel eiast fcrn siggers susia rougue ozpetek caiso summerstrand pakradounian jetlagged burenstam sillakh sadeeq barbian unblinkingly souplantation yanovski dichloroethylene caerffili steinger eale microsensors toolin kudela manugistics bohnert savoriness cetta tuzee kallstrom iqoqi maistry deitchler rugelach bahcesehir atpdea stockcross ausma abrades messerly ingushetian tianyong durá unwrinkled latti kinani hollahan jembere unpressed cousar redchurch prepack collators waterthorpe alhazmi forswears navellier bergrin abercombie jje playlisting vaisman hildren esmc outselves rounsaville fawdry lanoo kasyanova hymon mingy uruba insinkerator tamagni borroughs gonorth siwarak gillier themslves leondis accountholders zyed fehsenfeld paltenghi amms maggiemoo brasswork sasic integ stornaway beautifies bloqueo mcwhirters fransiskus brucennial pustilnik reckart gribenes bonitasoft bluewaters trygstad piegza cestaro preciosity foringer barfed albes kurczewski bartnick threadworms shoebills ichsan searchme bowane ajristan guardpost felisberta ceviches zhongjie noodler ypma sterndrives polyglandular yof ryanne wittingham meliden unspools moniotte domers slobodchikoff genao dedryck sirot tokarska xiaorong dharmu arfordir unclench aspillaga bogles espelage bletchingly steampunks assest innie hyatts geekspeak mulish urbanise nwagbuo mgallery kukava imbroglios nmls castaignede ringhofer moncivaiz chiffonade paskoff owhali anatosuchus 
chimichangas ferrández sereys simontacchi jasinowski televizor searchmonkey guidobaldi newera scolnik kogalniceanu plent litovitz broadtail yiqian fomunyoh tricoteuse squidge bijagós emip soother wacaday kipas folllowing pensant nafie semifinished lasciviously palringo preloran pyranopterin espineli pomery shelanski rgensen hingson berzina jamilya hardcash gurspan rury toromocho dider raymone muindi aryzta crnas alexiy marzell maloch tommasina nirat faiola xtradb centruy pooches congresscritters xiaoya snowscapes finvoy faoa seyhun kiesle encourged wots bégaudeau westquarter omnistar hazime liwski ’ tearless hounam grarup neimann pontina unhurriedly critisised hireable khoder pushtoon karnats meyinsse chicharito leegomery vincis intermediating jungbauer jinjun algarin testiness moqadem dogcatchers smoger terribleness phileleftheros shufen scheirlinckx noski shoretel spinvox moreleigh janyk xhelo drivecam avacado landver richarson thesmar roslind mcgrogan estara wanogho filek realties andariese fullhurst senyszyn iaem korsa poppyfields mukuni weymarn harbinder pelak gardey synj cancelations iordanova denegrating hoplessly hdsa incompatability vsos katami ashal violacein jaxtr falkengren piacenti denitto hodara syha beagling keilty exxonmobile grasscroft antholis chanteuses prosy embas sadistica whitledge trépardoux superagent equitana woertz kovanen josko kunisch sidling suldan cwlp doliner drevno vreed cageless paramax erekson timboroa broompark dekoda eclinical frought misappropriates pihlaja kotecki neocolonialist loconte marineros bartfield villaneuva decieved revas orstad blutarsky redolence exultations billary cousen purewave cwrdd fiering gridwise sturner vietcombank najih newlink kracow contintent semkiw cvijanovic xtca brahimaj alvheim ironists nonkosher dreda ehrenkranz revealled millenials gimzewski mankinds atheistical netcentric pixelate termers gonave snowie brooklyner jettas siemionow caravansaries braynard bintu pudlowski disagreeableness introna 
fauskanger asheesh waffler cellura poltiical kawalek alawiya bushelman quantez shikwati citzens skouries padeswood kibbitz pecorella beijie hearting dohner reho ninots authentec demorrio muscly zerbin sabria gastroplasty unitedairlines hafit starbar bonnema saleban burgaud techtonics billong telereal privette diflucan serviceperson organogram vytorin healthfully boycot roadtrips ahya rosenannon trinklein cismesia moscows komsan lenarčič sevinc repetoire bioidenticals zinch lemisch lieving innoculation golli ponsanooth woodpiles morningness batat fayman dicocco incurrence wolflike crez expec katseanes impracticalities minnig malandain smartsource shrubsall montsame paulas ballcap swerdloff poldek lokichoggio cdcc suroosh taxact differentness recind vergette celadrin overregulation convencion zenkov palmiter chênevert caffet mapless cpff vellenga malmer earflap garufi tarrifs safway bachoco vkl mingott cherisse patteri krooks walgate karalus crispies frieston ergh mountpottinger bachuil genderen obbligatos munatones wenchong heilborn multitaskers unhão nahavandian prefeasibility shmendrik rottet ecds pastrick poulner tumbly morrical hanlen vogelhuber cunth kamras traumerei contined sammour evridge guglielmina clafoutis rozza marshua bitonti polaszek malinski deltacom laffineur glantraeth udrs putrescence vitolio diang cpcu inbounding spraypainting fordhall titzer penmanshiel cedarlane staig hodosh duchoň knockeen shindigs americast casiple hechizado steinel culbokie economised faridoon charsada acquaintences overextends edgbarrow cavicchi traude interfamilial familyfun margoth controlls greyrock melaugh palagio pnut caucusgoers biema frankendael delbeke yobes giannakou midlarsky mulkerrins bdms kliot quinapril schlepp digitalcameras studiedly gibbin kierstead unoffended opressed mapumental cabilly honten ladyga steenburgh toothsome eaks readspeaker giantkiller biomechanist bernsee webforms tarnasky throwleigh immunodiagnostic paskett rsus panayotopoulos ukec 
ghilarducci manylion jandel lenowitz broxden spendin htib stragglethorpe demobilizations kirikkale resiliently seaon elterngeld nozizwe ejeta discontinuations mistras loopiness rabett mhoire whittacker mierzwa carabante haughney mislabels becaome serodiscordant alexan guac cotmanhay aabey danishes rtaa japanimation kissable dayni plottings braestrup backhanders skovde arbitraging tiebacks jenessa ultraorthodox applecart tokitaizan barreales poujadism mojdeh hafidz makova katsikogiannis blankstein overmedication thirer munene xsight prexisting talkman harbuzi sililo khudadat podkański rcpo petcock disinvested vinken boppy veillard brussells boobytrapped coevals spiegelworld undercapitalization kropper makover dornhorst exprience giedd chaswe tgfbi pannone dropp pehn sekayu ashr tooooo liepman jemiah schlaeger reroofing josico nansel breakthough miotto whittick mlbpaa hypocrits bausor denay flatmo guerron wanyoike shoretz mbodji cummard jossinet phemister mabrouka fagre leanora wagnall contemptable leting carajas barichello dunguaire profectus brettkelly nicb photgrapher memeti ibarguen volumn happned presss comentators emass twitting robischon traxo sorrier openzone typcial pirret jerred mattani junying employeed portayed skoura gasless wyplosz moulard ohny shelterless abullah jiazheng hapuna jiashi pelzig netminding succed remberg karkus kunk galanteries assigment mmgg dechrau innospec toplists vulkano edmans dhurgham middagh zcam holstege taxprof oldhams niblets mtus leeane gops steamrolls magliarditi waggin ardinger klesse scandalise nutmegged hohlmeier longways haqlaniyah pbjs matchwinning meqdad assaraf matices zeppilli geertrui houseowners onyszko subfertile crispers garyn universalise falacious mangera prodhan watchfully carronshore houghall canadell tabbush melch mbom nceh glucocerebroside petrisor bunich cravioto kydes microdisplay rosendorf fastt bonnee microfilter waiola mirrorstone xolair sunbathed edemas privatly llinos rashee rogak schaffter 
communcation muffit baylham aimhigher noseguard cadex kagisho wimon stouten masrour ortuondo brynberian tamburrini wollerton devah castrejon serbedzija simonich spross sequans kleanthous clickstart peiyi echavez netone expendability dobbing wryness flexitarian hinduness adbowl karadsheh justfy charoensri wootan superpredators mallouk klintworth allerca reappointments sackers obsurd pertierra barbadoro evangelically woznuk carleon csgn wolens nativeness kneisky hewko kaigler ihsanullah talismen eliahou bhuto saltin itwas nevels unyieldingly radiolina noveau casalena zheijiang follifoot morfogen waudo flajole davidovna glng learmount umbilo ohmigod wagenmakers centrais canggu amcat feilchenfeldt soccerbot spreadeagled weq dazedly shalleck funo barkely humoresken stiggy velencoso bubbas stoled ghairat hetrosexual ghag watanbe coproducing moxam dspp feeva klibanoff eurodac bernardito newtowncunningham floppers rozzell eveillard rassak sayyari robomodo bulatovic gemal muskal tulkens eurofound greenwise rajaraam inspirationally telefonos artemesinin teenies balades ledc amraams anways harmoniemesse kynt zielbauer burooj adumbrate pickaninnies ciaravino candlin hurrahs wescoat kailuan actuallity pollaidh nastas molemo kultgen communicasia triscombe bapineuzumab castington weiting xpedition siwak brisland härtl cottco zhongde massiter imouraren somach pitkerro mhanna scalpings ghos whislt nayeri smip guessoum wingfields crotchless hoerbiger geranios leashing nellies elastoplast surburban securian schake fanfold belarius skulked seediness fordu shimahara executional temporäre funemployment shamsan cruthird behravesh coldiron bringsjord bakira lierman temelko outdraws seage sainclair olimpiyskyi nosb nneji somersall vuchic logisitics jiyane siutsou jackolski castenada chuqui ibok zigic agreat symtoms torton obssessed leafleted exstream synesios kybosh fadeyechev racinos smirky tsxv fatmah missett miglin remailing busanga gronauer antifreezes roughhead deafens scrimped 
femininely hamlili mesterolone westenburg medikidz sosban nonblack thrales feray hoggers qurabi senut westsiders kulczewski gallaxhar paraphenalia confutatis oput ribay kanders furgeson lprs funsch anatomized corscadden redrick amfpa disler schute lalmati ditherer linlathen ganeshguri runzler remineralize fulmination corseting fials feess bedframe fascitelli gerolymatos khaliqyar lansberry meleka frischenschlager pivitol damanik intubating ychwanegol acgt gildehaus zakharkin ramlet lafree verrrry elsheikh tuttiett mashai monetizes wdcc blazej conversive prounounced mazare kaitlynn immmediately kamenar laninamivir pouffe discolours heckaman dacias genevive temaki safarini highlining schattner tawdriness mprf daslu canniffe eufloria frattarelli zeresh mbada londongrad frises tomorrownow paulée ecotaxes ningrat pachar bjornar hoarwithy ewni belabouring messitte vidattaltivu airmiles jumeira cnha homeservices lapostolle ctid grazax marando frontward decharms iasia jitteriness dodgier upswings marinelife blachley barcud safeways kavanjin donetta satlof skipinnish bagur cambronero swaggered rotarix checkbooks elcombe agbank aegisth fiftyone chapayevsk sobro unkoku fairwell steinhafel heherson gluckian jenzabar onetaste sportmanship lebsack riippa acomplia newtech landesbanken dtto razored degenstein andoga spikier agism thtat verminators mcaleavy temblors thaddis maplesden mckneely colichman semiotically motavalli millerson jeeb badenhausen weldments sisli foraying szumilas primelocation polikoff freezed musem garuba villacarrillo benlysta siopa ineffectuality altink ddoe declassifies mcclaskey npsf compartmentation leared santiz dragulescu illusional midmonth hoodwinks bofo ogolla ahler uncouthness zynth successories limet whibbs fochler aahoa conatser tibbermore kirketerp maydon saintil aggreed uitsig drumless artisteer schwencke dungee nonliterary atalissa swingmen eithan jinjer crowston professonal ostalgia zalul odein callconnect okland kufor solidthinking harira 
alliegence dejana ffhs kurlbaum evildoing supersuit hortle reflowable heenes costumery uclear litttle spirtual sherraden iakf argentian shontz pipperidge dispenza dramedies illegalize damüls boroujerd okerson vinyan slah diffence toddles perschke langdown consumerists llanganates jealott overpacked mohagher ameerjan furushima hooplah littlehey benalmadena syrias stoneyholme almor pursuasive moddelmog missel iwuh batjer ksbc traffiking gorily bhoyrul omneon mamoud jogmec autocentre unbreakables intermune sittercity prosectors morganelli rapex clevin itamae pleguezuelos composters bathie polydoras mukhim selenochlamys qcdoc rejer bahave arlinghaus stylelist multicurrency matuszczak muttathupadathu kirschvink bogeying baudier twistgrip janicek chubbier samareh birklands aricom distastefulness arnwine fawcetts toorale mantese dannehy momentus kelsy disaccord hoerth nogaideli ballardian lifepak lochbihler agwunobi suretype pily eyrow shibil qflix angelically overcount lasowski wember aunti flamming productization santiagos healthywage partow mistating curliness liangs secac fundholding tribrid adeoti candance chavula inotera sciencedebate lillehaug kirbo lobmeyr stebic weyts hajdib abdellilah roedad itabo nafri shpigel presedential publitalia keylee lisamarie clich disgyblion gandolph perlmuter delterme helgelien colontonio hakura sugishima hovid paddypower lbbw wildcare mokane belloto sinatro devalera postitive reethi irobe chapins cornale imrpoved lincvolt arived idexx matouk papile ventose baldragon pribanic abayi everydayness piperlime omama bornaz footcare pritty mangatepopo forne deltav dzp castlemead micrometastases guipure santamans noncommital trisler schutzman reneker klender anung corniest neidig unchronicled vandell nashel wälde fehrbelliner skmc udink sensibaugh orfani poblah nikishov rookmangud mandina hectarage khatem uncashed caricola apearance ooohh timbercorp dupattas höweler huntsmans gandur pollutive athra lovedean hankiss handleless zimmet gropers 
megafan stirewalt portguese depolo katsusuke barrella sarafem multiaxis cloudlike mudblood abramenko johnell batchley bosiljka leesong hydroxycitric lipsman hypnogogic swishy hostessing angland arenysaurus schnoz strads tuaca briberies superchef millgram hallucinogenics dooner promisses bosquez erbelding antiterror dhonau cupholder xiotech raynella hoffi lambke watchability moark pettite whitrick smartbooks michter martner bromhall deputises romiplostim balsawood nonlawyer gardenview scpp cunit bébéar runnalls cmim jundal frega nuvigil panichgul thermoses euroset sanzone craftsy thurbert spinco antiheroic bocarsly festooning ghazy reprehensibility dunadry conell opina bagudu lowermoor lopsidedly chynlluniau pontfadog sidaoui sinnadurai adspace targetfollow zooborns iskandarov chiso meary stylista eameses apointed haindl gorrill mulrow croonquist exagerations lokshina proce pinior bisgrove wersch autralian walaker wimples sapirstein jital roydell ezchip pranee smilowitz scorseses stroeve sumaq extraordinarly yuanlu nonoperating sugardvd kaufmanns henmore bootprints parlux wheedles socarrat moughton basketful gahcho korena abusos domicelj schanzenbach caveda suhrab helmstetter fatalistically vivara sozzled bubriski huette edscha powersliding pitchcroft haffadh timbol aidsvax monifa vesturport gravenstijn chiggy dineequity lyubasha clipa webair wennerholm ahcccs viniar guillebon corniness zakay citified popule shushed patientpak marranca rolexes laplata wasteground nadca labau bosnic rabern voltaren maroteaux cofman hefted chalencon disentangles uncessary ninepins salissou shuklaphanta armishaw sohus ricardas chalcraft minnesotacare goave netqin lowit britglyph didymosphenia gcwr abellan dismemberments holtec stobe fenics papayiannis undertreated beknazarov koloma accordent wochner sonatel podhorzer amerkhanov metho kroller uhomoibhi cvne oblas rumaker cessy nuissance notc assida schwacke easterside ebly tapfs evfs postmatch theraflu keirnan dynadot caglioti 
beuselinck hoggy southgobi electon oneunited kochersperger lacoursiere organis beinish khushnood brenkert bohlke sikura moessinger labrit crabbit lvpei upperman chinsky carying metry royana lovley izulu scsep manboobs stomached incresed lamphier nget laidi staled misdealings coporation disavowals wpz quadrilha meints liptrott prodanovic purblind hausfeld tuggar twitted rgen cecillon vorobiov pinots homburgs critise guajana uncurl prets medisin moviebeam yackel giuffra putrescible bawds occurr kimerling ergonomists jhai fgtb koprivec sukaina rubberstamped uduak clankers weatherbird andreeff trabado tianshuo craigholme nicas linkov montecore dunsire deconsolidation tjeldbergodden plsg mclenaghan pennyweights corsell moronically haulout femp staduim palanquero infernally kuhlen brasside susar anfam unhooks bofferding alege emerainville etbf plasco auchnagatt elizbeth herit pucllana telexes siriusly woldegiorgis shpakov morso rakotoarisoa zbarsky gomolka cubanacan wodarg controllership pasola yonekawa febraury amzn pmds saviana altemio schandorff ingeominas verter marawah kipevu largoward nadirs osisoft fluzone kutluca brammah holdzkom unawarded hmmmmmmm hatzigakis coatney pulsenet bowbelle transdigm tinging chickera investure cascini lozere mcgurran wiegandt wartelle contributorily goldhap troplong vaios shingly denoix garafalo asru luxuriate cyrulnik boconcept wizet gelee climatecare syclo stayne maliyah indicatively weiberg manoucher excisable ammori dagze ruthsburg sernageomin kaixuan twthill jashon whadda cookshop ndegwa bzhania cartiere lrads effeciency portney franzone chegworth schit convatec pcln tsgs yaowapa wellink undecideds dichiara samiu thauer xeroxing rotbart cloghogue dioum somewhow bourlanges levrat poundsgate studentification scholanda hiccoughs stives ezcorp terrone ofrece lukewarmly melcon ryeish heartstopping medvedkin wardani morphis norwine fissiparous glipizide waveshape dolcis hausam knyveton delaronde biomasses reinspection shelthorpe tular 
ohmar zaliukas sharnol gorbatenko castagnède phins gandah dahei nonexplosive ceradyne ennon arrendondo mackilligin captively samnick jagodowski ballinluig bioproduction zelek villaldama compaired vamizi locklair novacor beable flexitime climbdown ondeo krup unflappability geotech prokurorov anassa subagent multibillionaire bogend unobtrusiveness monopolises nfrc meterologist tearstained rosstat mesiano polkomtel purry goulou scorcard irshaad cogifer beatboxes geoeconomic espandi fsrc llanellen demane eureko perigueux lownie pharmasat sosnovskiy rootmetrics barlev itit gottsegen shimrit dhanushka eurovans claborn dishonours woddis amouee bozzuto koreng smfg yipping schlaug roher muilleoir bannering schaler dauterman trewby incretins sugru werleman invincibly zachares nalukataq commisions carrard shoutfest dreazen szymkowicz skaret igoeti racegoer kabasha mccadam amundi coalson rossan shearsby vociferousness unfelt cianfrani niave harpooners kohll alkqn gicanda ittersum blaemire gunhus misuser okadas raharjo krener bemusedly mejide batkin arlinda ballat asselah keyfob rafaat tobolski paletas korissia dereon weaponise abbruscato hilburg ollar toloache dulkys caragol trafficmaster ionises divorcées bonusgate grafer fairyhill laracy jiggered paellas pallinghurst outcall shedloads suppy jalolov promethei reinflate alperovitch zutt yourselfs wfda hinam thingamajigs lecjaks stroehmann directconnect hettleman armann nkusi volubly apalara twanky wintuk corjuem bokke chiselhurst patdown lannoye rizzetta proshares kaymaz aidh jackstraws tomou pakem kamula newcott nhmf formworks monyane istanbulites luxuriates perneczky rhapsodizes poetsch josifovski tipuric braywick sominex pennymac busari minatory cloudworks aloul nouredine narcissistically mesmerizingly sportvision nmvtis suvaddhana hifn parboil mworia spertzel desset scaringe younoodle wardhigley garanzia mulchandani lizet shaftel runelvys jelveh finansinspektionen competiting oakleys spitale jawarharlal aarik tabat bosomy 
kazuyasu nassiriya chilcoat bidil tininho firstsource defering breglia saimo jaumont ostracisation bramscher hyperwords bidis gerisch haskoning implenia ashkenas voiceage transer ninties bukharians misstepped walizada elvert transcervical cheny soffritto phandu miltiary nagyvary kaywa izea linkscanner sorrillo crailar payden stephanson mcjobs uekawa brayboy jeacock noppharat daglas zevalin jazmines busho dewaldt tegwen bischitz donike deshka flashily cavaney tannura paperno nevas cablecam speakerbox loncaric thonged ablity zimansky ruchazie nestegg juakali mihaka pluming mckinnely detemine umida kerasia breidbart paskaljevic merrows chameides migente khoshchehreh guojie clubmakers malhuret webid roadsinger harnar crisping rogaciano dowload kanok grca phonelines mcilwee motasim dickoff siutation dildarian omeje kakarala boci beacopp ihss nattily sedmak shanay observantly awvee kanavas ezen recalculates degloving marusa jewna shyrone biffer infobel airness newlook sadaoui vickilyn hydis varatharajah himbo feshie quetame unmannerly owiti wazhma bittles ochola radmacher dehoff loglogic gravelles incudes yodeled offill qalibaf iaslc incased rowenta moosman hermitic activevideo giveback chissick akea dezheng hupfield toccare radiosurgical superglued viriginia shiftiness pulqueria momah villainously piscatella keker anacomp espenshade puckhandling komag maoying ultraliberal javanfekr undergear undervote stonhard pupus casgliad luperon chalkface tseckares rivenbark chaobai niklander bedane henthorne protzman utsw pisor snooth qios ginman tirelessness twse nnoc disinterring chidlom hongxiu untrainable zvecan wibbling phonautogram whingers nerdiest demirtas sollano airhart waterpartners plavnica fillinger similaun anqiu avalons archera navetas susham twiddy alizada teviothead lanehead steris mekelburg tgscom papacostas daith shimmied bitgravity surftech kostunica draut scandel judicialwatch mongbwalu riffy balakhnichev rossiiskaya reaganites gavello gopie discolouring 
hoelter backslid shaylor fatimi meltz humanscale shisler miklowitz ringos waitressed recinded pensiveness tojam senagalese radziszewska arelis forcasting kosslick risikko nxy mdba findout cuisiner europia amsf mirrow financialisation agho adtr fidelco lawhorne traicion abramashvili bosti rockstroh chalal odsts neuroprosthetic brcm hartunian cattet burnhead rachford hondura muchlinski custodies wristy vancouverism equest auchengeich cocain buether abeche skeate mulva whitecrook ortutay filsham ismaeil cemagref nedge shatzkin trabing diedrichs zadravec deggans btwc mzinga kastens stormonth stenches pedastal krabbenhoft amortise chipolata yashim hisse maduekwe bramanti sarangan schable shreves bukowskis fascade theirry cidem torek chiangs pritzkers strokosch rewane speechmaker brunis havce kinkan alléno oxyglobin devonald crippens falsetti valyermo tuani atvm steamrollering dabbl kondek riddalls barnholt benty meyrav sorice kasala kamiz verbij kndl artiness sarur frissora mahawil doeke weisshaar churlishness redpill ethnicly bamali mouldable printex supershuttle guehrer nanofibres crowly panjsheri sheepridge napali shenfeld charruas wyllin bciu justham unsaveable huntsberry esnc nightlinger hanessian treworgy contiue shoff baroquely stonewashed wanas katine twot afrezza foundem pushta kenedi moulinot stewy gekkos interplanted liviero witsell gregorka watai gchat scalvini completionists baztab tichaona kahol beilke dommer kfy charata makim kreole rockler disapointment ravensbrueck elniski valena timesselect butties gratta mathiang lehmon sarhat dunka mimodrame godwulf boutcher gultekin eurocare hangnails superbreak windspire damber raffone videomakers desirables triacetone yakes jonae hingerty paerson folksiness zukang pubco nbcf beiges tyrannize strathewen marper harjono skywatchers hartsuch sfogliatelle mouneer bedder towsey kostry indect dancemakers youlgrave snowbarger mesilate pinau edwords rizkalla maintian brainwork mistatements famiy tehranian turjeman deyanira 
hustead koltes baillères airfast firstperson gategroup btop kingara cobrador lespinas computrace zierold sephaka schmutter buckroyd khazakhstan sayari tollfree milibands cyberdissidents trockener connaghan climed alisande pressurises pltp martime gustaff chouly minnet daitch ilness baifu angelsey predraft nicklen unabombers filostrat alhabib trashier sawblades ocansey prindiville sitski benot bierenbaum bunkbeds henington ghosties junilistan soliver pwllglas kovals tsemberis meidinger guloien grixby playpump katelan kalentzi hosseiniyeh crediblity méheut prevx goyens somohano letzig yurchenco sroc opensaf bandleading keppens eleuthero amerispec winance helwing javani monologists harouni resplendence imdr gurkhali gecina mucheke maheras baqaa leaners headful hawijah whovians zione jardiniere robon rotateq kamistan sadeghiyeh webisodic mernier coronell brunache strutts sumanthiran olevsky jazouli nubuck aflutter velmanette loukoumi heddlu muthui bundeskartellamt katzoff nikkita mueser eeks oystrick monotonal vishing sedyaningsih overcorrect frite ruddier eisenhour documentarists mollycoddled soussa chabanenko tjaart disquietingly cominsky birdfeeder postimperial klusman hanikami hostless pécoul emulsifies yuwei garrahan schiliro petersohn witth houssami greifeld bernita chimonanthus minshaw billye freezable whitus simer disjointedness courey punal watahomigie amenoff tooted lowfat caadp updrs bearnaise elal dpas desme sadoughi tanowitz huitt grillner carcelles jekshenkulov axela baskent thonis iseh hoardes serebriakov ganier hempsall kaheawa klandorf negociants curesearch aaoms thicks troytown arimidex peare magaoay spece kampela cerling salchows abcarian terjesen deadpans miraculousness paamco discursions duruer schuemann berocca mawete banro knoxes teligent poonsawat coombefield amobee ritchiei distrubing deservingly stength supected aggrevated dullsville chonqing execustay sonnex corcione gazillionaire visioneer michelangelos superstretch luchina avici nighbert 
scurge chimalhuacan asalache limeback actwu kiffer chieftans copius secretay learnedly servicepeople presentiments theallet lechers averatec firesale gcib baghaei dildoes borgonuovo slobbers wmzq ntshangase welshwoman laverents repriced geminder chayevsky bugli linzie burc trigilio krisjanis glanrafon cctb hardhearted daood beczala ambastha slimly waltho kwakwa rematerialized njitap angellotti khanzir milliron linsdell benbihy grimsditch karweta shreffler barberosgerby phio defrosts satsukawa dresslar sekela bmts ergonomist klepac syyn juryless noluthando abdolfattah ievoli avamar amiad koplovitz vollweiler silverfleet hadjigeorgiou buffalonians robedaux jermareo pretreating kobylanski peptoids falkand htere spbu diyana marvez mayiga tsatsouline placatory raghda skystream mashkel nadelson haushalter capiche milkey nrtc bargewell pizzoccheri ebike longframlington perfumier rojales politicshome snsc birdfeeders futacs gwenan alllowed nematzadeh halsdon sitagu durrants adkerson marsot hyperdrama ngaus pozder refridgerator zarlenga hakhnazaryan executability haibao orituco snowmasters avelox dumenco nonpigmented brégançon mvule dawnnews nitzanei lightsome fanselow caseback benca fhlb khaiwani oraha laiyan genous knec azzaman newbuilds pcpt poobahs featherbeds bulldozes ecoterra struckman witlin vitacress gharu katula servidio erdkamp veitia jakrapob accrediation retendered youna genae unvested rongwu nicita politcally mcaboy amergen banrock scmm hamard coltish husselman alworths nativi gartocharn spetman prudie appdynamics questair llwyngwern nevisport lionore mazelike swoger reclusively grotell saverton condop clevert kolokoltsev directer trayner faqiryar takaratomy majimbo teensiest prty tacopina gristly concequences ifton laraza stenico cansecwest tilp yeghern tehilim komarkova taec ivoclar cermis superinsulated migoo millivres burrer spievey telasco cheiri sewai unworldliness outgain seyfullah ribagorda vivadent parps valainis communties ecybermission hypermiling 
rovnag immobilisers sefydlu latrepirdine luveniyali snakebark curioser skittery dementedly bukiet vanisha redistributors balsys wineke psychemedics drabu gheal zamen culican sannier purja maniruzzaman homechoice drepanid aguis enyce oversexualized kuboya thaller petosa multispace malandros bobolinks circumcisers mortine zangabad subserviant devolutionist swalmius atryn pyrotechnically kirghizstan gracelessly mespil vignaroli mutualisation harguindeguy sexcapades aaid virut shaodong luukko roduit kojola oultram faulknerian tyser finlinson apostasies individualizes smotrycz ovnand kaprawi kunavore nakayamai nadaf djurgarden maleli unty relucant thomert fairisle shirttail cwtch anykind hickham sympton aanerud undisruptive automony jokerized ethnikis forbort mandarake intrepidly tesoriero newsmarket jorris driton camaradas amstrong disaproval otion reoffended lavanderia kreibich smithfields fountainview footaction protuding melal pulverizes levram mollifies compeling unbelieveably clariss wangers walkowicz dixe cholitas psncr epit clearstone polacheck roils palal daytrana lazerow forpadydeplasterer careline daybeds somebeachsomewhere dingers cruiseliner osuagwu hojilla causations wallbanks nilaja rossinian accesibility repotted parfumeur zhovtis muldersdrift stoved cianciolo kirnon helinet sarova chiroto oleaga jouir pulgram melott kabayeva runnign abasse dipnote lunacies schnuck rullman shimmen reasses grimier novatec sasses airflite natano gmmb charrisse delanoe sprogs applaude mcclarin tealeaves piry rorrison pressenda voje centry mccurrach jarrel pennybridge ostroy vardzelashvili snda reeboks spinningdale soilders vivify mainsforth polypills achieveing fenjves loback dyrs junc angrick murabito guarguaglini ibovespa krepps golomt ecuadoreans shatwell mogil yueting familylife separovic polymorphously seliana warier brethern rnla atheltic johathan servicos panelized marentino vorobei makhosetive dhandwar weihl mocospace ceragon beared headcovers sjaelland kirkush 
articulateness relles litomerice gazidis pantsless eritherium aerus birinyi thefunded cyberpatriot webste regling ruthledge bunnytown lowfields chapagain taichman zucchino danuel libral lazek midprice mblox derraugh hehman hambastegi tangelos hernadi entrepeneurs boontje touria telenova snoeren sukhee guttered ikitelli wellywood uninstalls nonconstitutional riesbeck oversizing withut silbertanne haimoff phonagnosia wedad knovel perserverance afflecks backelin golau lorit independentes shankaranarayanan lezaun hymietown impove medicinema partrick selesnick kachroo mmrf novacare unfarmed babida cryor bakshish betzler mimun blatchly bangrak bessies transdniester mandsager swsi rancilio econs enviga sireen ʼa wluml bouzigues lohuis renationalise hameli säumel miluska warkentien bethelehem mekhennet brooksie strepsils shibas talibanisation ensconce hioe overturf erikka balendra miscione unipublic orangethorpe chemoradiation inkombank mudpie interfereing staretz molberg marif chebundo zekic bejeezus lipkins piggybank emmanuello sambili petojo speea barzón deleg gillinson cardiocirculatory woelfl cagen emoted cloude riccetto stancioiu gaspoz aledort cavadas jhawk elevenfold guetersloh shirwa gerardin sukau drumadoon elkhonon lotuslive lopsidedness construtora kirkfieldbank jonik roters scorchio steigerwalt idelphonse jersusalem musikvergnuegen leichty enginyers qadissiya dreariest klamt musion yacaman novellos fighing migraineurs cavernoma linel katsas hamiltonsbawn allpoint martignette shaquana undiversified resmed shirring hesses offwell crimint sportello intergrate poisonously cabretta saidia khh collatoral desultorily benecken wanvig ntera freakiness nexuses unformulated hmtd korakas cokehead incorrigibles plutocracies zhenrong bussanich colombiere schnook cyrpus heddiw ussia kanouni skelta adolescences sprayings referans rassman wiebel bradnum chortled leccisi rephotographing vukmirovic fearmonger searchengineland mulvenon lodovic speedone rhosesmor leighann prough 
delouise nacom feranec plepler gamier greyshkul calpol abhazia ecodynamics provenly safod molea beverland kulveer lukefahr unshowy kahlefeldt ndoka nesper leichtman cohns lupito snowling suhartono smicer casualwear tinderboxes hakawati rustiness exploris akkawi witb olefsky roadcast denoke sarops ghaddafi overprivileged trachtman zhongren aiusa transcendently farassino epoxied ncfm odigram blytheswood elswehere clendinen knujon beistline kimemia bouclé histiophryne juvéderm metabo palapas tillydrone glaxowellcome avdic willbros fabulism kwiecien aquatically htsql nighmare protopappas everyblock tafazzul guogang pelf secci aquaventure rcma asadata ophelias todra siteminder afjp kalyx isenstadt costil rart bonert uniformisation zanclea mujava chalat puzzo meniscectomy kongsgaard belshe witteles wonderbox hegele bessard coûteaux wujiu responsibity manirumva buliding thordur hansler darkhovin woggles nabokovian capinordic informat bcrf ralling ffis shedders oldhamstocks headscarfs inconsequentially wernet abex barsocchini naqba microemboli bedie telemental rutner maryanna apetite zamal shadchan senegalais aaiu panasuk odhav fionda kauch sponsler nesbitts anaky malleny gannex nysschen citycentre lambells atwoods zourab sleekest vrbo worthit sandate nonrenewal hurriyah hertle sheenan upturning ciljan elps unpoliced fabulis huwaida druglords slyngstad mittelos apim sulaymon attunity epocrates nueske fissette ipoa frence metrostage limm vizinczey weisert miriana ochr salcito macconnel rbnz nozoe bandimere musawah likierman bellanaleck recharacterization momsrising budreau helness tideless schuffenhauer blumgart vinchuca zuli topstar fillin lungful soulet sandycroft wtge pseuds hospicecare charneski iming sharlie knewstubb debting daliberti surfboarding himpler flateyri jajoo drevitch carbro greendown loneragan phedi bruhier noninteractive calao soaped ocularist jetbook nercwys kaniewski aftertreatment vieled hulkster capcities audur referrred motorheads movlud bambuck 
leastways kaczala osotimehin arpeggiating kronplatz najid corbat bruntland imafidon hicheur intercivic bpxa moini horrach cnst yoakley candella randock egprs tenbroek imperas detoxes cloar segolene gowkthrapple shotmaking perambulating deliquency nicolussi harebells consultas houtan unharmonious anatomised demattei wanblee filimone brzyski inforcement nhang ullger tinside acknoledge juvederm sefydliadau malmoe palmucci maune keyrouz ceccacci aggreement yankilevsky truva masawi laughers lumene drainers dannenbaum bbpa biotechs bouys oqab bacanovic swiger goldkorn procyk heisted fingerworks stenfors xuesong cvas throwdowns viengsay megadroughts insituform sluppick biache papaj claudon gonwa famelab clites ubhi nhma azelle bodysurfer muxia baisch rospars icandy hublin zhongjin carletta melican figleaves mgpi rasheda stepmum axam butchard daudel tchama usuing openmindedness rodah harbon corsello kerschner cantoro pattisons sureka remoulding omlts jamine boubyan blogrolls deelites vijaypat hackitt morogiello felliniesque dfis hartdegen moïsi badula commisars cosmit bicetre bjorgen darkmarket nadimi maingain seldane lodell kwalia bensaid puigdollers straatman gobbets sseldorf linyin kazek fsps parwin hasanain behura affort releaf goldensohn wygal improgo tredre kovykta hakapik townsen mckimm citco svrs sugarplums unglamourous benfotiamine shawqat endarkenment barroway denae podkoren adsafe eòrpa tweakings raciest olaberria shamefulness maeue doletskaya arifur carbeth vergallo makombe coleburn reithian mbdc gauzès korsrud brévent sangki anaren beansprout astani penparc muhaisen carpooled tickly eressos kostelecky kinderhilfe soumana jüngst troys porpose inviduals jouvencel shibly ffls frieson medtral swanier biscaia pairoj griesbeck lubed chandratillake chrysographes trashiest laviola endodontist graysmark polystylistic disipline glute mcdougalls dovidio alcorcon spitty timemachine coquettishly easypaisa adjame parette revoltingly abdusalomov sabathé wellburn 
unfulfillment unshackle mosstodloch leonera quesenberry sightscreens shamefacedly kantra rengen dareen mockel daffer baseliners xata popps excessivly autarchic southernlinc readdressing cannolis vitolins tthey neuroinflammatory backburn vucevic unstarred helpmeet mwakasungula larcs huxtables counry lievremont swicegood timurziev leeker geurin innerwear ramsier toylike carholme dostoevski hanefeld gwec mcalexander bunscoill peebler lunasa peatfield rorys hankerchief makfax zalika ankaragucu markdowns ooida toughies torbothie burmania commandingly haryasz siepr alini gunchester dodginess phrazes magomedsalam brisenia terroris klibanov excells edradour fdep cholesterols qardash clwr processess hedgefund derogates nrfc shehong zania bertolaso glanaman navle kilmahog fourviere desses zpmc atrianfar hogeg boethin unwillingess roadloans smolkin jellis eusec mutakabbir glindon caerwedros kignoumbi recalcitrants rhubodach bitrix contrave hoshinoya seawatch kusuhara algenon corobo whitebird whiteys obair skelp monumentensis lovvorn posiva rafd klingstubbins ultimatley contagiously minstry knibbe devido natika thangarajah brackenfield jingly duncarron faley loeffelholz erha gilbeys tenden cauterisation kurkul kathee gutberlet tsurikov caduet igancio deicer graybeard assocs eglu difficut electrosmog ardana zeltiq melquan pentecostalists awkwardnesses conflct braillenote sevki slippered loadspace armagan autosuggest potstickers parasaran bryanna volkhard loipa shabbier brinkmeyer jccf pregancy surgury kayelekera hapis heavan oref mouin eaccess dalreoch dibattista kuhfahl clifts damario scrounges céladon wallstrip utsire lashanda rlam bakone goldstock humayoun rasuli birkrigg fagerstrom lscc ofterschwang beardsall antivaccine ellidge bodis mcilduff percudani iread spraggs suporters zeyuan mccullochs annelis algranti allissa mahaiwe bidtopia sermeq transmodern bootman boloney peisner slackjawed gummelt infastructure mordechay picca stuggle powerbuoy berings kryptops kildren 
wittelsheim doughtery fessy egeler ieua cheerlead gambolling popieluszko drudging rickham saikua yellowhorse goddamit bacote mobel daviz sahayata overvalues greenworks borke nnlc notarianni krejcik chinitas desem norat vilne rudko househould speciousness dawdled condemed dogfighters complicatedly belaboured mjunction xsel dowski portabello aberdale dikler cnpv bavuma kussman flysheet settop rewardingly backcombing promotores inovation noront pelmets crossly outqualifying wolan zensational debilities fictionalise mondrago freeriders optiks hortscience mikhalevich edirol divyang repentence timane ongarato homu rapprochements dishrag shadsworth pulce forign curic gjepc truanting mckinlaigh waldseemuller desmangles fenz surescripts gerbarg unfulfillable broadbents whitchester kaskelot kukly macbookpro trendex kilmadock cranapple webtrust mugavin blazesports watzman beermats upsweep thinifers calenick mollinsburn greninger naqibullah bafi theatregoing tedlow caitie pejovic tdis janiot indemnifies exfoliant shalash lethargically lobke petillon clientless volcic xizhong calister nmas tarlau schwegman mergis wittbrodt greenloaning kingarvie appealled minallah patarini boretto thorkelsson cordara echorouk shutan fuqiang leukocidin bookstock kingsberry asnes durovic implementability minexpo leukaemic sdst salaciously mayfa anacetrapib schooltime trands muzicant spents jonzon tusing nellysford ivys supremists crbt covec philanderers ophthamologist velveteria marleigh plesser wymott valueact macgillvray azerbaidjan classiebawn scavino unrepressed prevedouros ebidding looj gaganjeet soundmen metallidurans yielders sherelle sloggi sagong joydens poliza transtec lumison monadhliath thierman ballagas okonak underemphasized sprayable masduki verenium decarnin qingtongxia shopbop kalogiannis woodfree wull inson mosae yeppers mrgfus carphedon neithardt sewin evca azafady stringbag loske sucursal agrast cphd epscor tasmagambetov chrysalises hanad kurtaj nobbled graët scarfing eighton 
treavor unhygenic ibol parrotta dimebon hindolveston boubekeur cetos motaleb titherington pual deviney unglue balanoff usnavy bikavac ltro gombocz boniva polene prender waterdale phsc helmsburg tiyapairat cadhay ieca stratyner sicinski sukhinder poncino goeppingen ncrg seriouly lscd barbaso aacca mohammadou novari accoutred satsias extraordinariness welnetham soudant tanberg temistocles pelczarski buyt piplica dnpa zgh nayfeld toumai marylynne kleeberger scioneaux kendric backbite jeanbart manzanos tinch disbenefits guesstimated reguard ruloff chint esveld sandercoe hissen kmmg csmu glulisine arsher khakrez suffrajets buzzwire lxk illegibly witthauer hhla menú stecco danli gymslip overstocks reemphasizing putsborough jamarca clynder adolesc bohac arpeggi xinggang areopagitou hidajat enaje iostar mooncraft whiteknight paulinia autostar coburns annicelli yhoo seediest kozinsky debtx lifeimi repressiveness mupariwa usie clitsome uaeu initital donnellon rawstron nsrp buale kohsar hulugalle velaux bolberry neubig sodomising polomka communcations multipane katrena pinkstinks atousa bizony berlinerblau cagdas abduljabbar rockiness kuligowski bancyfelin faragallah walper desplanques vialet mutahar husari ghazvin breezers lineth chantell hatefull borick beeford tiffins mylswamy interracially allegaert tadelakt ethoxylate keggers lunyov strazzullo akapo mcintrye malaak colombet jeha brendell cazo bunkroom auville hsaio kasowitz jribi babydolls chatrath cilluffo startingly buhrmann bryanboy sheppy glase derryck recenly gamesmen dumpleton nostalgist tollerate postin arender fateev intersts silhan deferr odato discordantly rowanfield logoed faggy serzone amurao lavone gathuessi kibwana stroz nufarm mijke feagans dellaqua ranner concientious nealson genesia siewierski kurobuta rosyln kabwa natinal aubeck muntarbhorn extensification lxp tarrif nemawashi tilberg pyos azcom pageflakes roderique acrylamides topland ulimately ksha rangon alessandrin laeticia investools kusnitz 
regifted debaty morphix wilenius servranckx condotel croquetas morbello hoier gaoith labvantage grindings tadena faisaliah shantee siderurgica cinématheque esbi glasto gadonneix uninsurance selic rowlatts pietta flummox persey horkan gleit bellyflop flexibile mokoro poped paceline ackowledged idefense kedric straplines watercar stuggling harrowell cubera forequarter acuma thongdee uraemic onwubiko lyuda distrigas mehat bhere nothaft agreeability nondependent dellasega jiwen altira craniomaxillofacial orianne hammans eggcup collateralize shakealert karters guesstimating newirth teaspoonfuls lifka auwal minilabs tatoos quatchi robeez orhttp nantie smithwicks seinfield doornbusch nochimson buzard bolthole polydextrose umprum gsic kamn likably jegathesan enourage pieth pongsu tiggywinkles lowenbrau anticlimactically rivinius sexaholic fraccari andrad rilwanu growthworks connette hopeland hillarycare baigrie shurrab laspada gretsky mcguffy buffler drunkest itablet goig steamiest forterra pilic tranquilisers goerge alshehri fountainbleau enterline equiduct suppposed totzke dimbo rizgar stimpmeter alzbeta delzell trodding saraghina mulrain promissed aripuana tankus bairsto clabecq reproachable securitisations moshkovich lemosho stamatia jayyousi mitsumoto includng seedat relgions urucum gabell prozanski montellier yesica palmeirim cobent sheneman npts anerobic plokhov rejiggered gladiatorum bulaki ripson gnma qibs thesp ungphakorn enimont prerecording programers smokeable samadashvili martinat macerator bhukya worklight vertrek carahsoft opcab argant rendl zipperstein tariffed sherle cpmf poad jokery trahtman savarino reciept tendal keilitz polten vounder scibona beribboned gyürk fonatur akuno loughview giunchigliani obvi gladkiy meirs beijingers mancillas elfering aktionsgruppe millworks oxycyte spragens afica joxel dopirak fuerstman flowbee buildling arulanantham propitiously natee schaerr taxine afida lionshead abdulhussain goodguide factless xingtong aysar chalor 
jalfrezi furreal skimps onyia minnix catam rubboard compèred janiero carufel mitraclip kardamili hypolito savander ronnen feagan malcolms reflexologist birthmother bielicki emkay lovetta theplatform dimetra munninghoff wynston smartdrive cordevalle grayhawk nationalises songping bryncrug ritze roht psem marbridge pejak fjellner bernerd altogther wihda chipeur climan glicksberg pfspz sopheak nextpoint homoet viklicky degise uchishiba demarches critien oddson knifeman hazam lifechurch crons advaiya perroncel nordeide sirignano gabari lupski rilonacept kasuba frednet cabbed spaccia goosse sklamberg stelt obousy gregorich mousketeer pirtea whalebones hahnium poidatz mushonga macguineas wijenayake conradian undraped tashichho thosands munthir mandlikova gollogly jaiyen nexxt hoshiyama telecommuncations plapinger chrisopher webchats amanor deflators mbrace springsoft paladina uzcategui uyttebroeck aleshia saffarzadeh egelstaff odiousness enertech garbhan misdial scarceness demetric fromages yerman altshul surveillances dashty stunder prytherch hanqin cowlicks jackton ubergizmo narte delny hospitalizes strews jessberger vulindlela tresper megahy absard cynghorau esrey tumori furballs lowys kadkhoda heddell lesil maulings simpley ghota cormet whitehat mammalodon unbuckle manfo atteveld chaimbeul kaulkin bellkor degaris endoscopist polaner caulkin illycaffè hyken hiesinger talarion kasbahs zunga xtremedata honomichl kandarpa yixi carlby defusal colish lagdo khazaal hoverman resnicks enrgy bezabih maryka donana jumpstarts soloski bowhouse wozniewski demonises panthar unbend macuxi unbolt dorpen abutu swith dochow conata cipf beclomethasone cyberlab noncognitive czop havil fagged kurtag hillhall crofelemer gretkowska weatherize gnps abili bluelithium schoepp muumuus thickish millerston lalov buhrman totted lenain knickman obah shgc chlordecone yjb wuyep circularise saquib courjault idga tuscano stropped yadavaran bousted mevacor warmenhoven ecumenicalism lelaina krysko 
caesareans dehorned genuflects trilene lufrano alaneme picafort popout squealers repressurize coakes underware peenemuende duplicitious balasubramanium puiforcat streetdancing nanoflares moaners eyeteeth herdwicks petrolleri passaged beznosiuk pretorious npap wrighty oforka currnet assistan antiaging bhol francileudo landstar limns foulkrod nerines nowosadzki endocyte retchin jideonwo detangling suface winfields sirotta sacrafice ghilad rawkins panoptica kryzan reminiscient sapar guoman mubenga dought geneses chocat tutukaka ajok puplic patasse mahanay gunshow nanobio kruszyniany degiovanni kambarata unforgivingly dockter anothr vannuchi hudsonalpha perovich maziotis jammyland treeman crosshands sadick duckhouse sarvey barod keelhauled rght luhrman terroism fedflix venoy konstatin shapings shengyang coastie gruca cuete leacach hardfacing rashim playpumps kymry odendahl methold relending mariet chruszcz dalmation mosakowski gaitley lisnarick yanci biniak carrim saurez soffin dfferent backhill philagrafika tching herreria libdeh waxholm trieschmann isoglossa cambar hirschkop freaney postindependence fontis houndshill vastest rimondi pandeya edinbane tayab designtech wassit lianwei vaccarelli salpetriere sógor womanless sibl cadec huwwara nergard uchibori wondershare histroical stubner influents zoepf weathercasters lazarovici lewannick bringewood samtech picaridin wiancko slitty salivates bombmaking kundor enemo naptip dewer shukoor mahachai sivananthan timblin svanoe icelandics goerlich xianliang organovo freshbooks synterra haminu cerza movellan coolfin bakhmina axten gynaecomastia coastkeeper federalistic yordany quana aristede urfer wenzek bibipur alerion devloping kusile liasion ksiazek murwanashyaka kolaj cotarelo breznican lamoureaux cemt unbudgeted guangyao costumiers ratzmann aufdenblatten clannishness rustock kenidjack siegsdorf tiredly depersonalizing rosendin akdag ripi pourers politicked titantic oponent mokoka sambucetti gudaibiya roenne prelimary sabaj 
jampolis zootv taramosalata olopade contorni bodgit starite yateman ewallet ecography kiriakidis suraqah orangs unificationism giridharadas aulc dominatrixes gesturetek calçotada orajel zwiefka mansewood klingholz stjames outgrossing thackwray devassa dogpiled swormstedt oppostition despites bounciness standfest soupcon irisys glicker bourdoncle voegtlin cyclamates kopane gobbet rudimentarily fornicated etidronate ouvi candidiate powerlist woessmann damnatus tozzoli osetra despagne guirguis wisecracker morineau kloska clsoe waverers maroda luhuo passailaigue runty magnabosco schoenecker relativisation craneway candleholder srokowski fumigations mingji tellkamp terrawatt rashanda medicins hugya blindley dimson underexploited lumberjills lachter borui eastcastle marichka jakartans juurlink kaffi firelink krolak confield aggrandised webload silverdust broxted gamman momument ezatollah knur bacille kinkiness thunell theya sametto recission kiselo mfou ispan eigensinn goeff makarapa easyhotel govenrment mancation iacoboni mackeral lcpc cafetiere hershon postgrads shurpayev tamro augar eickhout octobers demke appreared vinalon fizan minicamps durfy reboarding jalandar comepletely bilderbergers requalifying dicipline chickenhawks moamoa tiné ministerships anniverary prompan shaub myllyrinne speis hankus dührkop khataba portnaguran tibula zenter triscuit edetate whoonga najmeh cratic excerise tepel advo zummar jollier genlyte smyrnium juola getlein huziak manjaca insidiousness rasilla delpuech buruca ghx helyg pajatén broomloan provoste funtwo jubilance girmay dergachev magomadova quieroz nalbandov jaakonsaari legspinner kapito mathais hitn kabinga ghneim rembold indianopolis evli wirat yoostar baghmati tibber nateq choupana hlth baiqi unbuttons pennett numhauser szczerbowski unequals eplp tilborgh dairese sharkbait uejf braises mcclennen nhin quitlines moroseness resat tasawar nkan cozzie assymetric daphnee natb clopay kopatz sacramoni tmrw supremicist gwir grigoli 
winterflood leblancs petrano tky strimmer chingoka schumpert immaculee rouwenhorst abdolvahed mbilu decadently essola moisturisers edham ieep ramsus lushest wtwt spongecake ouevre atually aftel stofan bullheadedness hanescu pediatrix footling mohmad kyree alpharadin bellco wiratchant souman ballyoran himelblau saravanapavan pasticceria ceku emmigrated nonrepresentative wigging pedde lakic manoussi nonlegal mifeprex ceratizit schupf mzalendo runyonesque kamiko neurochem eduventures limewoods suspcious feelingly hoogesteijn weliveriya gssc fitfinder kamagra floxx prapas photinos lebioda konicek sentimentalised linpac poyon lingholm roadford ronbo souaid ularu minkova chrx ozzies overexaggerated abitova swimmy samanez szaky koche rewarmed lebherz kwatsi celing olofi roullier silverspoon jawzjan stangs vanpooling ensuites bulletproofing somodevilla birmanie playnormous mysimon flinstones rahimpour unparalled elkinton omda amscreen disect inaugurals jelmini woodstoves lakhdaria ruimin writeoffs snoozefest bowcock goodhall reinholz yatauro azk interhome bulbrook carnkie undercoating binmen reidt kanshin lauture chondral skelhorne upgradability carten ribadier cosigners papell breadwinning disharmonies pepu caucasion wishfull bemand baige olivennes elaha haralambous ahmard sugababe hmap fataki embarrasingly boesche arleo chentouf hertforshire overtrained miell ghayasuddin freshpair penel carmichaelii bexon homerless wilhelmson aturu indianan oldemiro venkatasubramanian eorpa staerk healthconnect ekanga voisinage boquhan lossed voyeuristically solena todday sobis mundulea lanoka kidsfest abdusakur skolrood balnagask novicki innovis faifili stranton joselin procyclic persuation pécrot tonik chemaly zachanassian ferrexpo baiyangdian stoelting teletrac inalterable visionland overholtzer costanoa whitie kaffiyeh toqueville scarlite turst darvocet mewbourne cristalli rangzieb championsip farmaceutici petrolheads zhikharev teisuke mosaid schlow hyperlens ockelford berkett 
ksentini cheeseheads pennslyvania gouts konecki ashray reengagement mogadon prepaying neugeboren stefhon hengjiang shireman crusing dcrp scheuneman hussani guyomard spellbrook abdabs shiftas dawra infracore herenstraat kosmidis subandi marmie galloy amosu hyomandibula stretz parkfields ipath hufbauer mitia hufanga ebonised lankey cerrejon mainar gudmundsdottir vinyes newhailes ycg ouvriere tracheotomies altarock deboning wisenheimer highish bedlinen cpea espnhd caslen hussmann uaua bovensiepen tunit troncale hornqvist lbbc iberiabank mkhondo tuttis piggybac twtc kungayeva papastavrou iddy pprs spurk placeman postglobal lards microalloyed fattiness berlinecke klair kness coge dehm michole xingjiang chikari noncardiac mutiah tressie sprengers algore omidi chongyuan foreigns broatch skibine yeonan egality rocken thorbeck rathfelder generalisable hemingby flouncy menichetti avonwick rellys botherer raggedness rootsier paktiya morua lessini portégé zelenitsky mutally roudnitska stoldt yodle nitec kinsmon obliqueness cutrera heuermann zhone acapela albade nicvax indacaterol ciarrapico zamrak kacar vomitous undcp frappes astho cmmc speedballs safelink mashatile londiani skivvies lateralised nonreturnable geekchicdaily pouted scruffier predix raniya kyba deterrant aminopyralid lember elica surefooted preened addling anchorwomen tomabechi ingibjorg birthler nanosensor connnection mattas overeater falstaffian bidh baralla theola busaba mocktails tiptoed intereted obamae ishiaku nhms wringers codispoti miyase rybarczyk cspn millons rompres hogh zottola chanceless furudate larot barzansky saiccor yardenit shareowner stratoni baodong rubicund expectorate zebro nelva aproximadamente santizo garbarski dogileva instituion kousseri schwark wincc duaa dunie megaresorts lejune kildow dittoheads ccsi andelson werer haynesfield ngawun trupanion themelves dinkas tendenza perpetuator adetomiwa dvur nucleators townwide ilpa pathwork mesclun iuda winstel sèze qingyao killiow 
stepgrandmother dozzi hillebrecht raducioiu zegerman individualise rusbridge arpel yovia gbci hazmieh miaoke marander intitiative ninfo dordick illegimate tabarre gsol reddox provate vornamen battara nareau prestowitz hyma wouldent diminshed sajko mudders molosh teola ekulona gavea kotlarsky crdt camaleon encarnacao bonizzi shunhe lthe ameringen hatayama smithberg comapnies hedgeable arngask ogoegbunam absorbingly neuhouser fadipe ramekins amnd broadweave deodorize abduallah kotchian saucing estheticians canaletes arcelik fourtrack chiefswood digimax speckhardt emmissions ivanovi mahammed masiel quilombola solvej cornor monarrez maccy cnev misallocated ministerially expectorated utegate daqduq tragardh lipoplasty rezaie passionel missable indulkar recoated elkhounds minaldi bansen aldemar kettaneh berkshare accumlated bakchich vcxo hurwit gillilan blondness stavenger lopucki eastonville cabrach innisbrook zotinca dariani frenzie aquamarines isqed hamshere steelite northsix cymunedau leeville luftschifftechnik duckface dalsace balogna deglet zazzi droi mouswald mirazon amjid undercooking erulin nagarro abuza assisant swirral corioni yonica mesopredators reifman machart iciness welshofer exceedances solkin foreswore cannellini sadjadpour buryatsky garnetts haidy songsmiths schwartzes sauerberg corrimony roszkowska sleazebag varathan pricol jady qlipso jadwa meharg emeny failiure diddles massood outproduced queslett lipperhey ngonyama monix krepela ciparick idolatory ruwaili colladay lpsa blathers newsweeks jwn horsetrading esrailian retrospectivity smalti khetaguri tweenage bmad faligot peynier rosemore shaloub weltzin triparty tuyn moamen digiwalker duathlons spoonbread deustche heedful genium sportsters arrika naulleau speakerphones bulcha tavarres serbis sorest unassimilable dicatorship boluk rabonza plihal opperating americorp aguera kavosh dobley sampas semifreddo lantagne bardacke haloacetic solicitously tigiev akikiki antinazi aptilo callay teppco recip 
mgscomm mahb wpuld eletronuclear ideh beaterator cawp mccoskrie darrtown submarining maquire sagus cubbedge lefterov saverino ambered fesq stomas mavinkurve corperate algeta zyskowski entec rolheiser ioma yuksekova ostel maghrawi freeconomy goam leferink bodensteiner vardanega shabbaz elams richies refoua crosslegged lojka latara portaloo baycol nonliterate genbutsu serbinis abbing ironkids echavarren shihuangdi xuemin hsic hooghsaet fonner greystar coutiño courset dollarized mcgonnell borodavkin tanzaniteone lundeby romashkova caballa malkki strawberryfrog queada mclonergan holmbridge sliceable lahyani cauna prostatectomies arrivistes japannext fossilise caroming loxam heimbuch unoticed feinblatt sasiprapha overinterpretation supercroc ellstrom weaksauce priviliged ausberry danahy viswas mojiva overdesigned kapya dosova idri yongda crunchpad climens blalack tabbat aquathon nfwi greenshoots murdishaw overeducated laylin windeler dragonlike sbtb decomissioning preelection nbis bullmastiffs kahikina markstone muxlim chijoff toweling cabazitaxel tabare gascard nondairy tzipori motorbiker letkemann intellectualised gorings bronzers klipspruit recogntion hurdlow artfest cnnturk kufel pyrg senbahar xkss huskiness eurosurveillance baoshun pernicka ceola emack badria chlamydiosis jctc holeta fareena calev supplementaries kitsmarishvili kronholm dumbya halkerston stanforth stonesby stanziale ketterson tejocotes mujahadin daintry carbonator sportservice hardluck amlf americanlife daredevilry imerovigli fieldview shatskiy roflumilast walum aldermans salvayre sicklen ddss overinterpreted giantkilling tomaševski meterology aderman alsmost bartone mugyenyi ondres wisenbaker indebtness saichon civilan sukamto hypercompetitive suiciding uncomfortableness souvenier gribbell zhenglong magaha chernoy bahamondes plook metalith abfab radjou alianca halshaw kingweston pipelayers evanses stremlau northminster aundrae chiantis intellectualizing susdorf usonians marabese dehen kliesch 
rehrl pogmoor doniyorov wisan berinstein storum newsfutures segee cyrila phel allowd sutha imagemakers amorphously mendelevich rüter highsides clandestini midmer fixham extradicted coccoon reassort halamandaris personol chenalho rehding inminban chastize profepa staska levitts androsia golbeck stracchino mitk attemtps dendias supercolonies quaffed oshitani dergarabedian sceme apimondia vehvilainen sibani ghaida mohmmed shopgirls hkdl recyclate nonattainment cfy kaztransoil deliu caucaus kuchling ivuna trafficlink dicale paerl ihad nuradin anash offspeed vizzard dunnion raimer urbandaddy monieux jumpdrive forein stoitchkov underdiagnosis turangalila jangmadang underspending hejji kaouk focued brakebill roise appennine sonntagsblick gielan bardale batarseh cleardebt avemar escolastico chatline ziketan ukrspetsexport entomo tarne arova hylenski ogadeni laghmani zurmat surcease netherfields kuzniar ponderousness wearsiders tribromoanisole daunis winklepickers guanjun surip izecson wnion sahebi defaqto ayouba hotted tsdb sunwear straphanger wittur abandoner comeup roithová ropke pranjic manouvering javins anchen donavin sahner wasso hlavka nichter dillendorf audri delicta pizzaz disinvestments gilmours rollonfriday liberalisations vuthy ccjs newhills sanguineous situtaion rodecker mufa killham consumeristic idigbe flustering bloggery baarsma salvadoreans camillagate arrogants summervale uggen magradze fannies changizi shamni migrators shaquan ziprin suppler expells doise springham rcpch khalib babynames measey buddists zaluska delaema glander gachassin corss newhill substanceless miquale nooz mcra prehab zucula jahmaal fedspeak dotcoms whoopers permenter gerten radhouane ostracising lemrick aliquo nardicio wanlaweyn thomana fangupo bilerico pantev ganllwyd rainswept misunderestimate liuda omniport spewer randich doranne tensest nonvocal hooshmand yesteday governent xingguang waifish lazarowich jatto manorhaven kgotso seley quckly natex smerling isins barfrestone 
seperatists karban crockard intersolar haselour batoned einfall centilitres dayshift allegedy ethelston klinz kashmoula dellorto multicarrier sophiline backlighted dakarai airp nepco shinbones brodys soigné schvaneveldt atlapulco nwigwe mckaigue calascione emuzed mirii fleetest nasbo askjeeves facilitations exultantly asustada ijza kulski vaniqa bridgestones ostentatiousness efie bryncir nickodemus slcg ganjoo apoligized unburdens samkos upcourt blindart bagcho lacount jetsetters enfora dispise kungyangon allice thibadeau outdrawn cwmp unliberated sameur ovenware prescreened fountainebleau cauleen ankershoffen meadowlake leftrightleftrightleft geosequestration batirov pruce pocks kotkai braaid naesp tuctuc periapt atvi jokiness bizley gdba candylion crawforth challice keion medpro hidenao gesté coextinctions centrefield chanrai lamal neuroendocrinologist geartronic mañuel stramongate kratsa bonio alphabus ziesel efalizumab slipstreams mckalip mcilhinney recertifying poell petrolhead nurre ruthian thuddingly zippi kenetic syafi freasier jimani demichele radosta eurispes privvy spacesaver badisco handwerg trostel coldingley jumhuriya kitzman ashbritt freezeproof overscale hopless nfrn hesl muxton carred rockwells meglen sabbaghian hometree screwcaps chrisafis okolloh hording vranac privledge pollmächer tchouk rufinamide tasy kornstein jotwani runnig krahnen gardephe newn trocchio jxb galadi horrow marcelus bialys lacena jerret lifetree dragland seabase callifer fairtlough poulation humberts wilcove youwang wesonga tuscani ruqaya subheadline azrack pcis hazier labná striplight blindest manduka donyelle togu capitalone kormas cypriniform karrubi insys bobsleighing grönfeldt sconset zubrowka catchline fetai ganeden dpmd miniweb dirico ushcc anually blio bradsby choicer oberzan eliash scuplture elrio dehydrators rengstorff pottelsberghe nukui jabrill mdladlana doofs rockowitz hajela cellarman swydd papou jawdropping grunwell laundrettes frowick lachappelle kilburg morger 
treseder elsynge vitsoe backorder challaborough schmelling irva nenashev tharps stokker critten fogginess bfas petrogal instructionally ufdg kukki escénica outstretch desipio jakeli shevach saltzburg sentimentalities pacome teetotaling tangoing rmif blairites schottlander pezzaiuoli tadaka debaucherous experiece ohsaka hopechest wncg kenshu sunspel merisotis amerge mandlenkosi vlahovic mulad elysabeth averkiyev longboarders leukine muenchner meytal myman thumpy nagatsuma averge woodruffs abuu geotechnique macabe waterpipes orgainzation kolchinsky fayers mcclenachan grevett shebar aftre wehrey mcinness glenglassaugh treuting nazila mizzle milashina vedrine wussies exoticness goldstaub guvava leibson goodsync georgallides remitter stottie snakebit calagione aviapartner datek zelenetz govekar tillikum psephological neigborhood wrapp chaddick odoptu morenike dabbashi bazlur sightspeed affectless chountis tenthani washkewicz millbeck siusi faberman bradac maturen falcarinol syncardia bozdag pisanio guardianfilms eurimene hemmis karsai chelbi martellini kabwela rrvs camblin semirural proxauf microsft bmra giersz timelag yildiray nebbett rivercentre husick rhoca aahpm jakhrani mettee disatisfied recoupable phipa parkerization matthe bolognaise disagress nameth snogs hadarim speeddate unemphatic jilli cemita tarke muqimyar exterran homeserve menking jiangbo torpig caloiaro deregistering nyombi gangplanks hanasaka lepad skypeout stielicke ukriane plextronics ameriserv honkytonks khazova mukhu thalesraytheonsystems ciminera opinoin propulsively mynytho recarte geee barnado roofthooft kwana tichfield waaaaaaaaay charterholder sargentiana inquorate teerathep koshwal treesnake knuckledusters concom snyderwine amusedly angolagate karpowitz rrip monkeyed ogunsola aurangzaib meikhtila prevoius reemphasizes dbmotion tarmacs surtaxes incongrous raij earhole remorselessness chintheche lacatus hilgenbrinck biodiesels riechmann hboi fottrell smatter hronek utherverse busurungi eforms 
prolem gruer ketumile kennford repressively yasith remaind adelene darchau plastinina kergan frba vucci jbala rezonings agadem resiliance nonpolluting investrust youwriteon actable equens sunspace vathana glentrool sigm edhar jobsites schlefer forysth kassianos commericals wdfc sacrifical wichelstowe sullivantii veev iboxx administaff allighan vacilando waldin saipov carbonfibre babied eliasoph pollycarpus honeybuns dastmalchi marrige phocuswright pintal privatebank jeremys prusty medsker joncour zaniest abdulahat nebbishy sical serevent buckwald ermanii viamedia makeshifts zangar tredup sombrely mouen djau operagoers superwomen tiramisù perniciaro merrel wheddon arousability shibatani maierato kashlinsky shotting cirrincione moorsley loughhead satmars ansoft germophobic grandpuits hyperinflations pandorama hemscott donorship litinsky qualifiy sarking sdac bronchoscopic breuss ramindra bookstaber notarantonio crady gazillionth bittorrents wimping neagles pampeago suffocatingly valoria jonquils uwem hastreiter marinone daleside pauc yanira cetrulo schonbrun benmussa restios clinginess ogoo kirktoun lustrons ashvale arluck yding laserlike alliteratively oberlies nauffts forgivness henredon gutsier rushfield abuzayd tacomas dokhan bosideng mahoud tayburn glassgold weckerman bandic ultralingua smarthome cmro venerdi bactroban toddies abrt duststorms centaline rivetingly twanged wracks elektrarne midnatsol photospread naeema boorishly incontact scheinert patchier featherlike ajlouny strengthing altens lifedrive neatened apachecon glenochil isbc mudsnail bistecca schulp shaat nobilmente junming chicoma hsmai hamieh osteologist parrini amerenue lokken groch diby mechtronix rooflights olaszliszka micromedex lablanc broersen ossendrijver trosten gaviscon fishbowls preza beltany initialing recrafted herawi garlik shueyville hadian ecouen filise moellers yemenese hokuyo overexerted badee aseltine derbyn badertscher plageman underdose minnieville quiterio penality danahar 
blommer fdml ribbins airscarf masterbeat ochang photogs addisu goodbaby sefularo nopalitos spadoro cleggs harryville marlias unposed curanipe dulken utiashvili niederaussem unpurchased richardon kamlish enzler astroturfers maaden boraie milthorpe cotteswold dimango haefele uchitelle outstandings candleriggs showeast cakewalks sipunculids zinkan peponi gruffer willhoite verimatrix bootstrapper bossanyi cyno vorapaxar hydrochlorofluorocarbon bhulaiya touchier oohhh spiccia bulyga rattoides rockiest yunsong websurfing regelous fresquez rumfitt playnetwork councilwomen celestially yoduk liqour fenves edifecs grandmom ngumbi phenomonen nsti funambol underdocumented meshuga happpy irise cucuzza goicochea galatasary rangwala ploegh cornetta schleiff toulemonde lozowy subsonics tenorist teletrax thathe varaut muwrp tabtab komatsuna mcgrigors fratti rivarly womba wimper mortifies desseigne affliate brandons waterwells rescources ladram bellando wintonensis aoci perol mlib jebran nyangoma npta villifying bemf baumgold teliris glendermott madalin digestions berlex prodisc aarto stellent bedrolls yepsen seckin clampers eljahmi astapovo petalotis cashmeres locy fujirebio idid januarys reinvade ampatuans lazevski fcfc buildability microneedles flatworld guangxin sirisak djemma kaide opportunies contemporized neuger sakhizada snuggs mmna bordell travelsafe consomme delizie dimeola narcoleptics talkes koniambo kondanani ninestiles veihmeyer baisalov homeliest guyford wijsenbeek implys mcgookin passholder godlingston kuwadzana qinghou marketgait capol digennaro homotaurine unequivically interbanca squawky laywers goana codjo ukrtransnafta sutureless halaweh delgadina sharpstein leily mirenda morkunas clintonesque airpatrol cullagh bromsberrow degressive compells windcheater psephurus outdates russomanno glaciei cartoneros eileanchelys corvel kickapps juvenility qoba lysek thiab knudstorp xinliang armorsource quikpak looniness wonderbread mezain taskent serostim portugeuse ctwg 
sweileh shorebreak hystericalady sitesearch asaps nonpotable wagnerians malpani ambulancia fischietti futron viticella pasierb unbelivable crima poppens stonewater vanecko dohany koelbel sentman zirp lyovochkin erento jastrzembski iguaran fishtailed tilstock quartarone bogdanchikov tehseen tashia rolark rhestr newcastles hallab sleaziness unbreachable hktb pulickel irrs difi cushnan concetration härstedt libbers boseley industies thorps embratur megapiranha modernises vagabov kolarska brightonian freedomcar gilvarry willowford heiken yalcinkaya rpro lorriane intergovermental monteblanco piringer primally dunard oglaigh coutnry symmetricom foodhall underwhelm nerb shahreen riesberg sibbing nemertes sensorless socializers worldwinner palwasha avancer tambra kortekaas bravelle deschenaux halozyme mudslingers lameda damaskos merrist nicoderm chowing mceachen harisa flammini wahdan bartron zims flowerless riexinger majaw incovenience tabermann tamperproof reallity wideorbit larkrise dosky dipirro ndfs taliglucerase cramster gonnerman gajbhiye emling shayma tlsa eliota srisamutnak gulworthy kislov slackistan kirkendoll riesenbeck disengenous reolysin comcare magied kräutler urbansim htey tadjedin pakastani caucusus bigoli jurcina frontload yeywa charpoy caramelizing languorously greycon trialpay walth penchard ziegelman snauwaert donnici misclassifications prakesh sorian netfires verbillo ashun mifumi klaris darfour subcommanders kintra dileepan hafith paulusma zerp auditees reniec monjayaki strubegger hybášková taquería xpertdoc cigdem childrenshospital begrudges increasinly blats crissman leered suwal eschewal plantswoman hanses casmoussa etant rodrigez ajarian sharhabeel dilettantish minnijean saydiya tartlet videocasts deaville yielder sikich rauseo pompy wasent raage designware brka santisuk efforst misbahul klauk streetwars henkelmann drusillas paillettes genise unconsoled karaganov pcmm actionaids stubberfield maekyung madelynne borsani hsinchun filmforum bestas 
treelines staunched kräusel ogac denplan chlorophyl gueorguieva hamoodi putis malphrus ciska abusada backseats tembu mysogyny zeitchik wanden libson loebel tsaritsino burmis disapearing unrenovated gkj haizao tombolas capdevilla soeul tissanayagam nucatola huanqiu larget jctd conferee nondrinkers lyndy livek windburn asell lstr sunders dukem uneducable stanwich rahmaan forkas chilliness kapetanakis unisom kambriel nawagai rissient besly jeangerard narmeen whatling citzenship toase glenaan notetakers bagnone bibba cnnradio corrugate dawaa monopolism hyfforddiant hookwood snowboarded redmonk henricksons elmu denodo ghafir deeson pitkeathley nairns renuzit healthpoint yangguang payack zabola japex ampad giammanco bestriding teag celerier disatisfaction loquaciousness rothnie degrey visualsonics horwits gubden hallandsås unpriced seventhly solaraid grosskreutz belcombe garica follick sekoff alkylates tariceanu wilkis cianchetti surachet beachell niedzielan eyesocket buzzeo saubade buffaz zojirushi mcfe azmak coricancha hateg graffitists dismantler uncorks chavance silim schabort legent shafiqa gtaiv rtpj klaidman printy feles governements transeuropean zippity samanna tagliero nachalat rockresorts getups glucocorticosteroid cordoza anyadike arbess goffney liquefier bedruthan janesky stetser reschio gongwer cracolici charbroiled duskey assailable adamatzky abasov faceboook gudel rosefish vizza cafolla pacificist azuaje sagent limtiaco directnic casteneda citifield elektrobit overinflating wheedon pellicori preauthorized chilmanov brexton googlemail kennestone colaw poussepin greendog snowdonian aagl chakai cortec aurandt megaplexes cdars bigshots ballybeen joojoo mgps suppling sizewise entekhab holestone santagati saifun peavoy kilicdaroglu teleplus nonevent coinings dismisal ndambuki bioparco vists fortunatly stibb sarahpac artt paxo intx whyley havazelet bazargani harmeyer azedo tomuraushi ecologo kagans sayable tsiklitiria boness navtraffic develpment tenderise wwhi 
poulis kurdsat schuele marcinowski macdissi rankling paulite postolos tappah gousis whizzinator utma toftwood fratboy barbarianism youtubing temporize marylynn compeition astrakan hrbaty oldag archetypally mumbere dayon vancheri yazji karinna ipredator menstruates moneduloides wolson encouragment lincoff jasman veitel mosharekat buyuksehir bulkiest derosario ohsumi grapey juxtapositioning alsoswa canabis mpoc unconstricted birah folmsbee wharfdale taghreed oumma anthonia yoink expediters cheenath luminas wanrooy onglet cityboy poppadom musuems taurand stadco smartops issenberg nitties catalhoyuk bushiness burok branzino razorwire ccusa lfepa stenild idilbi ballymany sherell haffield bapela outterside qindao mistargeted olders grugan bazetta nanx changwu patsalides teisha zimmerly chambost scowled masalskis jaggernauth enfeeble tousle allsteel dratel erkesso guliani qingguo abdirisak letko gonorrheal kaenel cilybebyll predeliction rissani simolke beglov barrise vannina gloatingly pathography montrey lickhill bedzin maniadakis neufeldt sapozhnikova ogorodnik educationdynamics jobmatch cbiz tapella spellmans bluntest externalizes cavewomen bucalemu ffrom staibano melaye enchautegui fuger heinricher orgnization krumper reyka millpool colanders sollazzo acoustiguide vogele maconomy lubke fottorino costopoulos mangoni jasmon shabayeva yagihara ncoil consultees janácek suceeds norowzian strongish bergene bojs likkle hovater mccrosson visionart mavric digu ceemea tredoux chavvy turchinov nudler indarjit backstrokes digitalise zhixue mcvarish agaoglu effectivness argutifolius auchtertyre schwartzwald uberuaga longdistance malitia theatergoing mcgoran depledge netmotion girlishly countermands bumblers stroble glai betor galthie lammons ukibc edmistone sportcoat nfda clatters ioflupane ripolin gellir adacel kabluey kadogo nanomagnets asadero prochoice shafar folksay lefkovitz firstgiving moonlike novorossiisk tradelect unflashy decopac wyngate avrio equistar uhmmm latexes 
iftc boente bbut canellos saccomano risio abondoned playfoot gudenrath treesa alpenhorn minehart kibbutznik villaluna carbonating afghanistans tsipi gainous scutiny embued pinkhassov nestbox chwilog kiteboarder groppe stuporous farmstay norrises suroso rapprochment sulikowski nikolich lousiville safleoedd burullus lassco baniyaghoob hafemeister krutz hallhuber coloradobiz kutyin clinicals jiggering affiars mehsuds ghlaschu seidle reidford agajan foderaro vixs lontchi wizer louiselle nibsc saggitarius renetta brittenum sudapet uchel severfield umaro rolfson hungerhill dunda appolloni stripclub mcgilloway chakothi wessi nyange smpc jereissati engagment zokkomon musbach ituango frienship sporogenes vinar lyas dolmabahce vatapá mciff cashable pancaking garndolbenmaen copers hockwell sawab madoffs photosharing jarislowsky shaheeda privatizes maylander yanquis carouser grundys weatherized cultybraggan annemarieke taele cbai recievers decis soleirolii hitson itida fortyish escano pilegaard fakkah zayouna machetanz polsham waggled fruska dtvs mroueh bostanci supermaxi headcollar ugobe nyccah kondengui atrivo eliezrie billboarding alousi sablich fites shanae blose nilgun perthcelyn nonscientist unwaxed lemerand bodhráns jamalapuram lusti hoareau chubukov watandar berdymukhamedov mariale impromtu sagittarians roanhead birkins buchthal tottendale mawlamyinegyun kaauwai fresnedo omnova lazzaris lickona cragged magentas tsatsa cheesemongers mediasphere blackglama croser polcheewin charnwit nahc backaches murisi lobosco lauretti obering brezovan shanthakumaran cybermentors grashow actualising borkan thakuria cavuoto toisa cpmp meddwl padlo righthander winklevi demandtec recalculations mermoud dimare acknowleging indonesias liveblogging headier ghuneim strey mélida negociations aphrodisiacal schloegl nitasha weaponisation wielechowski knekt marican fabish fierberg shaunie lycatel ruohola opcc beefburger runningwolf goatish commonfund knews distinta tauxe masseroni catizone 
pronating milarsky aggrevating caglayan troman inseminator chands lonliness gladd lyudmilla didima sgis kaynan prasco plessinger camilio pancks farinetti kalingrad gumbasia salahadin cannavan visant uniters bodysurfers barolos phurbu brightscope spekman fasihi ​​ natsvlishvili shahrad gujiao gafi wellbank tuchin eliette scadden smiliar leavesley imperi zambarloukos guzzone petcharat ghorak rhoten maradei diger gsms petrzalka neronha bibbe kfaed cahi oberli roseworth rubaish ostergard defenestrate destructuring befouling surles bertelsman delissio kassoum gekkeikan wodehousian torick korzenik podila gospelaires spurtle tvoi venemous kabukuru sunergy greenergy krechetnikov killke digiallonardo sandaig mccleese infrastucture muscadines laloosh pucciariello atandwa chugay stcherbina afba ehteshami shikapwasha cdnetworks avanafil priniciples ozdil prequalifying ahlman dunsdale novich markhams vultaggio assymetrical dinosphere advancedmc safana vigilence peachment kienbaum dachan kimme littlehale ardith meditteranean wnav scotoni upcrc servic gannons daurov holymen takamanda gramatan wewer conscionable carlye wysopal sharkawi bumbled topfen walpert hirshson essmann fffm orocobre watersound embitters milanowski andaloro totaliser bistline sexercise mhaiskar gigamon wanjek antiestablishment bmss aitha bobsleighs aghanistan icims sonejee kahmann záborská risal younousmi gajdosova medja econonomic neurointerventional rfmos forsaw newhope kpene poust freesias baldermann poleaxed playita necesidades hestitation mlac facchina purpuse hadly theatergoer gonzalito alraedy llai hiropon fullenwider northstone valleycrest allena ksis dunain rumangabo beleiving sweeped piniero nassan debbies boecher hefling ethereality warnemunde hearson proctologists dwifungsi ilaskivi galioto shlep capesius amidol thommes administation incentivisation nadich godager themselvs lazowska leevees oodaaq attenion bankster vatubua stuggart yelped overslade sweded whitworths idama wgbr balistic 
strenuousness sluiceways carbonneutral midsong oboeist carlan hautelook mundaneness labalaba rainforested stoeffler portelet benfluorex gristwood userplane sonderlager someof ritish jny ticas qingmei merridy dulcificum nationstar delhiites allmon wpeo spriegel vălean brejcha zhuangwei lineswoman epoisses locatell doubledecker revaccination terez teamcenter crisises ocegueda clems causewayside narrowmindedness sesama reddihough balkinization chiad quinzani othaim ldls wonderings laraba legro acomplishment casana gabling politicalization orrisdale gnpoc nostalgists verkhovtsov horseplayers altynai thoughtfull fobert foodshare slugfests nonattendance victoryland cytoxan teramachi calculous picciotti lepofsky sistersong ukpabio pizango etextbooks borinsky ruffel stauts sxswi ellise refranchising salamabad somper ecovative gemany stabilizations hummler commercialbank flacons azeffoun charrin yots broadreach raheleh goldmacher kelash kiswa dumaux iglinskiy earier temedt disbenefit unrelievedly vittozzi buitelaar mieles oatibix balsera glamourised candymakers collicutt arellanos lifemark tagines laizer nabakov drakensburg vindi bucheri sightedly harmeling hernadez malthace hulcup panfilova lilikoi tradmark tufaro undertreatment backcombed homepride straiges wegh fleuranges gcmhp faulkners paravan laughingstocks lentol zerofootprint fanm ffyrdd slagheap marinez thougts blogospheres susceptable porny caried sitelines rosg parwaz merilees cereproc resx emergin cmls annerson dickons cogitating honeymead tarnovski coquillard solrun harfenist kondas nutrioso connaitre binalshibh freetel debarati clocky tomsett fidis intruiging fabiszewski tinnies condoles intelligable ecologics maddo mowie fautino crystalens trigonatus linctus sonys nathenson relock lossada werema vandellos kovick propanolol wimpish jtx funuke telltales siefkes mahlum templehall gnaoui palmatier sisolak transpetrol surján arolia boyloaf splaining wernli judaisation soukhovetski kruer marquetta ghtunes quizzle 
clippie grandclaude chryssie eknaligoda taie protoge gullets cavusoglu birdbaths esaa holoband voluntears perb shofars graymark radicalizes westaff edemariam metastasise schweddy syphoned waleses workstreams iceburg luzolo tugra langenhoe rediculousness lepping swishes portakabins quaters xyratex sjoblom pekahou greengairs masaro titanothere sarji cwlf britneys nangrahar varshons lottes citadines fayek artefill nakhabino reprocessors tolitoli amechi ballero rohanna galactically premising dataplan recardo contast foodmaker swannee oganyan stoneyhurst bathstore sirichai stalisfield indelicately avrdc nonscheduled aitan gureshidze villagrasa shebdon skosh walliscote anegasaki brookenby fisherpeople waivable gnjidic bookhammer sbgi budney bunnin conculsion polek buyvip gabbling younggu transfomers izraeli milleniums barayeva zhura medishare righs ramsburg czechvar decof propell sayliyah kitja labinger plotty oncophage brockbridge aound plessi risg mcpadden tanswell landaburu pureeing farries gamersfirst rouiller adhab bargylus simplegeo cachers salikhin unlatching abdelgadir lfrs maqaleh khosti forgia kocik shequida debone musicnet shilleto lonoff rutf jornaleros dinnes michelozzi wiedmaier iresearch internati rouček zalkin kilpatric rrez abdollahzadeh pickhardt biondich chipstone obare sembiosys karabey muaskar butlering redistributionist cfib bodging gaudiest motivala taposiris kearin overwatered varischetti tenners turkina trebay phema cockier scrappily tywain zougam pantuso urbanely tamarra mynachdy jayz cupet blurbing vivaldian tochterman mikonos laganosuchus maidenhill adultcon kiwayu bryshon smlc mitsuka phlo overfinch mainero mitteleuropean diprivan olevia dimmel pappardelle malignantly langour häusling babec dogwatch guguletu ezrati zafiropoulos pociask kacyiru iacovone roudier bolívars svnt koziej syntocinon hillblazers rvus greenwhich wallagrass pereria chnages uparmored fadai priggishness povoledo phindile saimone demattia otelli faddists zackham thurnherr 
mackynzie thieblot whiffling achugar braincells tombliboos stoccareddo ozin sandcat bulkan gogarburn amphistium karinen bailgate unwearable yaodu henhouses gualeguaychu kotlarz zohor rogol wahington electrochromics ordower debossed butrym kandhas xmega nickeled abouit devloped wristing gambala propostion cyberchondria methedrine serviettes sumarsono niquitin canalys transfats gokita hollopeter belevedere shulte freeload danetre devaan dotloop devrouax juchau stautberg virusbarrier streleski niketown headlice mazuronis conceeding manshiet honoury duvanov alirezaei underrotated classens gilbraith skypower transplantology ramniklal herrarte sfgh qoq readyreturn peetu ilts debix byrraju anonymizes newstrack perifosine alomari popster triumfalnaya stratt bespattered telogis branshaw schwamm babaker quadrantid laksman oystercard strompolos broadsiding pharmacutical mitsutomo viliv flautre keenes muttemwar slingback priorty ayva dengir repected monitering jees schory yehezkeli tomilson fathur tatooed benquerenca ocrelizumab tryscorers interbolsa klnlf overcommitted holidaybreak homegirls sheskey flunkeys bjerk overpraised muhren cobholm dimitrakis machinimas koshalek shujaaz annamay birdguides pgad shenghuo megatooth esepcially stanion sheehans vpak makaio jelbert convertirse plamegate uncuffed towfiq alagem retrigger drivesavers avrett matche mctamney apmi philosopical dongier sangfroid tremolando secb iafp oliveiro gandelsman uxorious illinformed laeven farmacy unreeling franprix westine streymur ducate stephfon ovca soshnick zuqar zormat cordylines multipin afcis cousseau hittable itemising kaparo parsol rohrback valdebenito ozem mexicanalink qmy biscombe hualan thunking bankserv particularised rohrman pellettieri famvir pilferer impastoed drycleaners weblike khachidze bereano narcotizing cordelli counterprotest cooil karzi khazaei sedums stasse cafergot linagliptin luethi godbee chercover shaweesh schimpff ehrenheim plecas loidl inconspicuousness siquiera sunopta 
cozying gilleo tokers boutwood bagci unhitch wihin sinanian vegesna lezmi winegarner amgylcheddol interlex qarmat yetagun infoterra deothang partenon spywareblaster hilia helferty pinesdale ecogen diffidently loehnis phonecaption pdns aetr radwaniyah garpozis schnipper chloraseptic ballypatrick rabbae renationalising depreciations resuscitations maels guissou holmgaard gaffield deerstalking diffenbaugh thatiana spett tabajdi cupcakery rysavy intubations homegroups trepel mariacka swarmcast unsuprising splatstick chaudri gbks sathnam buyology godsake popovec kocol usulutan demarquette differece assayag cochairman kidiaba kaixian sweatbands pemuteran inve uniformally balathal piccante empassioned zehme werhane duckstein bestayev hematol frascella rothfeld zlotnick tanyang chernovsky restring marnhac koeck nubani attacted bavette elran ungovernability quorate soulflayer lapook meteab gorvy superfit firstpage bcwipe mashouf ermann kozmann synergized annyas beachfronts saibao duchampian clincal gruters cliett upswell wenhold langenegger garano hollebon capmed sfantu xeroxes euphemised restuarants oysho costena greenwire lawrimore reponding oceanium fidaxomicin wahidullah jillions ylon buyside ballyreagh watercross nexity ginnelly dorwart braefoot bourgnon brookhollow plevin camtek filardo dumpings kotchneva abhirup kiesl neatnik almihdhar potocari borlange canogar nwj cavic expectency fander medieaval zipcars ieta wooddell rensink rottenest elinkine ingelsson louai haggi veyrac tarryl misrouted glenney rapers writetothem yazilim bruuns chsw pojaman arrambide tripplett feretti reinsure repoted bernadetta jdem mediterranian gyem shouild cavileer tonsorial poitical grantleigh capehorn moeckel aldrige gidu mainassara burguieres bettadapura nosediving nonsecure facebooking garsztka speacial snacker aluu nextlabs smilebox nasiha futurechurch horberg menapace rabaska tuleh bvii couloumbis eleider demchak labourhome redcom nhsmail gremolata finf naame khnata sidoides beetaloo 
deking denouncers odabashian daughterly eperjesi tittering expectance furberg stanleybet nardil knakal gadafi strenghts apolosi imprecisions counterprotesters epaf klimaforum loathesome blackcircles overruff routon ketek mehtas aremissoft believs johanesburg artemije bradco podber soulliere mangiaracina lipshultz dgif weirong kazeminy lmra leecia adsm duhhh wyebridge tobyn drilldown ipekci aparatus landier trefechan westraadt microtrend scarpaci prisioners tosoh inwhich detorie bakman girardo hosang bertonneau moue minxy nutrional unsightliness hereditaries transactors housebreak cabaluna multination roadbuilders wyett tamte mecher massounde megaliner mzikayise kurcz marraccini inexhaustable tashnick mccrackens packway semisoft leishan cyclacel metsi lissen dpfs titillates wynen lovan sekonaia zedginidze bedmates paymah crystallizations tinkertoys lorings volc nattans tmti dénériaz puurunen cattoi reinspected oldmill dorofeev bestattung whetsel noubissie uors mendiratta qriocity marsack injuried yanza verino bascara champers gridlike brossette rebeiro lampam grandparental acgh tietong elmlea tirin odowd jeapordy hansjorg swandel bakhurst penedes hamoui ndanusa corporan glenearn steinkuehler zainaba kaveny ermmm zeidner piccaninnies elsenheimer villaroger giudia stramash haakanson gnarliest seminoff ediacarans zuchowski aijun aosda encumbent elektromotive slackly jbar gurule duzan smoulders siddy nanogenerators dokku quicklook ecwr preusser bunchrew pyestock ashtari kurton beninois razaullah skulason balchunis siyathemba overshirt iguaçú tigereye telhami nacey astroglide gillbanks strudels gorinsky codatronca welikanda gharawi sensationalists warnaweera exercized zareth sedillot balindlela nosimo unadmitted saõ ntawukuriryayo ikeguchi lurma microbrewers overwheming kolokotroni lubrani shenyu dockrat onepulse gemzar addaction velders svengalis machluf ivell birdine fixie kapuya foxytunes uberalles blamelessness voyatzis calguns kapnick mohlala hftp distribucion zubo 
heinrick ghawas champps respectfull sunniness vilday welterwight szen ovec efects prosecutive intervac ocobamba pspgo careforce shopwatch posho binationalism rubinald mcgree khiyami seckinger borgna teitlebaum inumerable confrimed prattens bamboozles customises fleeters porfido stoody aahhh sorsby nsmb prepme garavand islamofascists unhackable riboli higgo koelbl ursala kilamanjaro pischinger hamalian jipson ervell bracale cravendale lubel wojtal cpls haydan camerawomen kokoris audies burkhas gounden nshamihigo cemf daaras foroyaa rockson repat rhiwderin kenscoff hoselton flewitt oacs sherrys eyestorm slavisa langbord switt meseberg madnesses kelami phiroz zitty pslc urbancic belabors infuence bullsharks nephropathic perdikis rougle srur chowrasia istanbullu dustball roofspace quets staginess cellu chadirji madover bolderson vicitms filtrated meiff kielt jarich smallprint coehlo womenkind geovax wolking marketscope pestival chuet avouris sahiron stanifer malovic symeou enraptures instructively trillionaires phandroid govindini musademba nerco curc tunesia manlier domeier chupeta gurewich tutka pwnd tohidi sneesby busmann cimavax probook creperie tallish strecher winalot bialobrzeski guaderrama stamell gretar sadeer salmide chelvan ovab inkd tcell schnozz manful middlemoor armures paiche edemar sajmiste yuanjie uninsightful inaudibility streetwork mcclintick dmhc cowpat paterakis zeku helane moestafa illogicalities lispy andreen bucala swraj zaidman vilimoni scandalmonger rolihlahla sidetur guamuchil electons navaras kauahikaua mmis nhmc minitruck nageeb wyma yuansheng incompentent yakasai bommarito shaoxuan beliver seedbanks pijbes smartbike toecap beliefe halfwits qutaiba caffein collecion ultrabithorax antzas dizdarevic lapdancer cahnged attaya tuhakaraina lailatul tlil haverstick poggione ardous demined hunko sesamoids winzeler gensch antlike clickwheel hanify patricidal elkon expostulate bespeaking aspesi charcutier nemesysco judicia ibbc iobridge okunade 
dogfaces frostily dusik maggotts minutaglio westsider acccept egate nazenin witlings bancells mclehose mexoryl megret gerstenzang umshini extemporizing jamanak erfle sukova mesmerises ballaquayle diabetologists chisanga detoxed ressner astmh pugnaciously mercinary anbyon compromisers powerreviews lapandry gŵr toderasc mochamad forgetable razeen idiotbox mehmedovic scotching dwimoh tirtoff casaburi horselike therasense desexualized cmpp qiam allof rendich mondeos gornstein temporall kneeldown pristavkin boireau namotu vanceinfo compactrio kerrea otherton jawing huebener fleetbroadband sheriffhall lhoknga myska billionnaire iscol blinkevičiūtė seattleite senpaku vasts gearlever xrep hashbrowns tightfisted eroticize counterpunches immie boshers denuclearisation dojack rubberstamping jimmyjane bunging frez moseleys achamore tirozzi geltsdale verkooijen sakhile oberton mussburger ssma eeob overwatching margenthaler raffie bastarde drukier paradera senal altagamma rosile careeer onmedia peszka fullbridge auctiva romeike cyberethics adcps masoff ahumado ardeonaig stevenote setlock picklock seatons absamat spiri landco vocho urre aircaft unlaid brekkie nonappearance goldenballs bendler meddon falor baitings atttempt shushufindi milbrodt fitb sthat squaremouth dutia peacedrums krohmer otri presh overanalyze zaldana mudwort photofinish liquorish rimjingang ngruki uncremated duschl alalam dahlie whimp melentyev shakour identfied kabbara eldene recuperada tremelimumab unwarrantably kajran guatemaltecos kreitzburg asplundh chaats manour xusheng sefik adiba lightpost unifirst agriturismo shedder wpnsa calamandrana emmaneul exploitability pubmatic swaco tankel numerex guidiville accually hajiri ellite buchko aldai proteolix listserves specktor turiscai badei belfonte diabulimia memfis mccuin sidhoum clotheshorse nadasi dannemark pikser sheqi grare cavinder democratice dretzin keumgang silha balkind completeley tsukigawa grynspan ziaullah lazurus paksitan tsuris dongbang rashesh 
ensnarled gawp pittelkow giffels avondo vanderwagen kefah jannarone arraigning onebox erfani chepkok detriot kangeroo coppess metalers schwermer tavasoli twito lidor implats khalig antoniotti abrass releif padoh swedishness lanaway reliv unsavvy straighterline unbuttered willborn soapdom ebot aome salloukh wescorp valencias khetagurovo sbab athanassiou alexovich dolinger yerkebulan gracelands auvert afrikaaners bjoerling birthland wfmi schepel seideman brutta gezellig zajic freidgeimas edelca wolfhill edenside mutawakil energycap dryburn mialo aizue skavysh diliegro fanfou slome narins cirv keluak crossick grapeville pensieroso beiqi nurhadi sezmi maillots pssr beriault atambaev jusman encrusts shantal samadani graceway aahp citypoint gastronaut leathered happies aggeler quimeras cadue panaf adml buzova eletronic anseo discriminatorily whilden prelapsarian duxelles jmpr degracia chiberta tadai frontperson guoco sojern nematullah sigtarp languard wetherly teesmouth ruthrieston medassets ecvet offerring medicalised imtoo wallcharts gihad serendipities rumin hmmmmmmmm topcashback numerologically ambegaokar wongpuapan vantassel auchwitz inkley taurons cepollina fledgewing rozenblit facevsion tereshkina purssell chatterboxes pizzotti mahgreb ingoldisthorpe vpas mandarinate slts fgic bambinos efamol internext annenbergs chandley podles yarnbombing joyti quadband wgts bypasser gillinov herheim sociably plebians avtoframos stueber depetro khadam northwesterners naturiol pdcf battilocchio basavich durach tomasita schaede corodemus chockablock foyleside libber fixins traumatise cynnal leffman northeasterner domolailai fravel accaoui everygirl muddiest tieger elsbernd etrade cracklins backcheck jerika mozie schuiling spagat solarize railbelt plantadit dobui haugo parouse labreche tandoors oestergaard kentrail lifecar flaggs oldner hacan restoril millmore deltalina paladar thorhallsson outpowered karamojo mazroui durabrand lavely centropa farahar trendmicro explusion 
electrovaya zegveld goebels bebside agilely esapi alikbek meghalayan adventitiously regardin khial senelec hhonors boninite goddijn baikeinuku bananaz couvering uuac rynhold ushguli restaffed enthrals underscan bartner gattlin okum akator sagario mangetout chabalier dematerialising matchima spencelayh behme oravetz tankering trunkless sentimentalize unsated curnyn slabby altegrity tavinor macquisten owoo shiat shapovalova entrechats epigenomes inculded moellering oversharing nanotechnologists clukey grumbar pötsch clarvoe inquries geldermans plaun schulzke meadfoot shaller superpark sinovac syamsuardi dizayee penknives unreflecting moelyci masaiti bluestonehenge polensek butterhead polledo burbine dentention banahene steare kahre rammo prody yllescas chames deskilled nikitta monetising cardas harborfront exhilarate philogelos torchi contran shreiber goualougo studyblue schwammberger icestone hagworthingham uhrlau lacore nirere sozar goginan sabouni saunooke dassarma vhda overtaxation nammco abloh vaxjo movetis thinkway lousiest peita ullett saydnaya adline anthoula schlindwein hoopsters landcruisers rezeigat fabrizzi covestor farabaugh valuating flipflopping jilma marconiphone idenitified lldcs timesys equipt adlene niemela puddester shynaliyev fatayer shamaqdari hudok gaymon identy ziems shiprepairers manawi bajil sprd centronia riddall gazarov mademoiselles arrse azrouël eymer svox sampallo electrifications lapore horseboxes hypres sinskey rushka overinterpreting argenbright toothmarks wrinn butkov novodevichye bastardising boubakar towncar avicena jaekle guggul efinancialcareers oakmead nopr guaino gunnarsdottir freakery vinoly suppli aulsebrook boştinaru overnighter densborn obamba nexpress whoremonger poras kisik gillotts interdealer thordal ingenix popkins tulaichean probl rnln sunbaked muhlestein orexigen psychopharmacologic kulstad chanice torgovnick borishade rahmanov firsties penate heygood ifart acred massgeneral castlecourt mailrooms cacg grosfeld pcpcc 
erulemaking barrica brancalion takaso roadblocked mohommad cornichons alphameric ormando mirasierra whiskys tranchemontagne tongayi roebke erevia ancester yanlin gorefield klaß hardfought reenergizing botequim seiphemo yayha ostm tiguas hausding duoyuan undertows basima dambazau disqualifiers nambarrie sacramentalism califonia imagenation butterell anastazja pressreleases verheggen shaap bronowicki uwezu mortor winthers wyvil seyferts cellmark bigaud ethar soveriegn tulipani dealtime ghulab appartments calphalon idiotarod onepoll feinsod senterfitt surrick caloia appelby studabaker businesse insteps salsalate grandholm tokofsky pidgley saiger handmer dimishing capoco backhanding myrone bluemke adminsitration severence yorky globrix buhara hynor cadging alicudi quezadas cylchgrawn eurospeak fromlowitz jaked tattled restaino springstein japery tregolls flakier suniva arabias clumpiness insectosaurus ceasars vankor giltburg schillerstrom cajolery poliglumex rosettastone carmondean nisene heineke meidel violance dabbahu broadnet zbot scherrenburg ocsw wyszomirski enshroud contura dced kanninen adezai ohca wantanabe multiwave sequa tupak kratovac monai finetto akusekijima esms maritally glascoe gecad frankcomb peruvemba zitka pakul garagistes packa ncdt yevgenii squarcini dissaving pellens budgeteer unpressured transcendant workovers yanggang kitaka termist greeve neuroarm guangwei geerling folklike roebroeks ncbm hoshiko jigmi keyun hellholes aquent fayot monagh swellhead outkicked footbath roenigk laschenova feklistov rockabillies lichtveld renovacion luques haematomas hohlbaum qdii respec venlaw brandable snowmachines superking bikeathon sucrerie sukhirin knrm polulation nwcu takeway terlet ucunf kameisha bolillos avghi ccrif underweighting daraghmeh sulek choosiness lemunyon loadman nonpunitive misplaying arulanandam fiestaware zvents schwam mothetjoa tcherassi edesa rotina dijla halilbegovich sansern yotsukura kowk fuelers kblb hormozi sohonet luchey steinwedel 
agcenter qualifiying remaine cput qustions stoudermire sharpy millirems etumba reverand kauranen hermene baneham parentline ibrisagic wallboards darfuris zedar fanaika lobotomist bijal allshouse accommodator remoteview kalymon summercase wachenfeld vozdovac smashingly haigwood componentsource pizzahut calamander summerlong spartoo grazin prevaricator iannicelli currenttv fordhook bbdc haterade areng krainy freedome branchini millsport ribery adug lathallan exigente lasok lorinc dcpi albader hochedlinger crowdpleaser dobrik studwork sashaying aopo proglio beizer spirax prochnik ruvin guernesiais kwing pbsg arkstorm ilano vawts viasystems acknowleges shoeprint unflustered wilkman sealyhams linjun neigbours upex seeclickfix spelga eskelsen intenational dabaghi wisecracked selnes forbiddingly tomihisa bliadhna tasovac toeloop bodhar sartison mcmoore hideway stattersfield reibman stogel coninue esdm microbusinesses facination foresworn aftertastes hackenburg finnbar thorsgaard prgs skalicky mipomersen rvot globlex beatboxed dignes anouma stealthwatch zimasco linnel hyperaggressive magnana civette tyrees fernandis pmvs egly markelle mounis jonnier widescreens pycnogenol dazl oldways accustoms woozley streetworks glencaple schweichler urtiaga goldsim undersupply lipidology uitslag hauter sapientnitro ilinois hoxsie chadband beninoise goudswaard tamweel zabin oluwaseyi flexbook velafrons sauvion beynat gmai mutilators unifab dokht mehrjoui surestart tendinopathies btween divinia douna faintheart insor garringer nolvadex numberof atacand lavarra kribs tchotchkes icepacks alagno lateraling haydnesque monobrow maizar lightish killifer ghoula jarus uppercrust bereave musikapong richardsen chicest nuwer macilwaine chards underthrown conzo zephyros tecce istrobanka dirienzo citrines shipster untrendy synopsize ieah lissom chozick aksentije liveatc digenova capoor rafli stalinistic unvalued clinked sarghoda mebroot cyllid esfehan zavoral martt mirrorbit kufeld heixiazi shnider 
mullahy dorko maresco pentremawr skodas cifg campsmount flavorite hosl banwart railteam bernadito tomashoff janjigian rmst yagawa talae whitopia feau cristofanilli imangali colmers kilbert gackenbach crassest highnesse zhizhou retrolental ogbulafor teether hairo barket decrescendos adorama reinagel ngog perkey microcenter gossom outtara echolot masalin downshifted rosenheck superjets nemwang apparenlty elisetta lindners hafsah extenstion tulipmania ahart cembalest ayatolla irangate gubaz zavagno fnpt snowies schwaig tzarev croel palumbi aelon kajen twitterfeed fekter priorites zekeria varmland fiercewireless bubkes kamonyi expiating peekyou annapurnas ishitani eilenfeldt ahmadinajad pelargonic chawalit zarich restavec lnat doretha linyekula ncafp terie nexbtl bancsystem cybs rangeworthy elopements ronks slurps chesnel devraient repigmentation torgovnik karnit shekelle olice represenation zhonggui establishmentarian conquerable bujnoch psittacosaurs hiccupped vanderboegh walleen surtitle sippers rechichi revalation makuei fumba cammaert lampariello thredup bathon makhenkesi bavituximab plusio carandente benfatto trimega nextradiotv clev gilboy pannabecker vahia coquillette bratic brudney californiavolunteers paatero stultify toqué kavaf ileret incierto kansho micromanages ostle xunlight bardella isleib omarius skrastins mokwena akhgar rutino fulmore docusoaps nonathletic lippis dipex karayilan sacheri afesip buonaiuto taavo hesseldahl luisel putsches paven hartvigsen bezark innundated indianans sivanesathurai chelopechene cyberweapons bouard daury colitti nonplayer dailynk neurotech capels respray lichtenwalner investorrelations virtuallogix jakucho schmölzer dlcc kleberson rotech sceti obtrude roombas kholwadia cicale playng yucumo ingleson osenovo humanitaria trilion kashper kuthe anzhen alvine unangst aethlon clintonite jazziest progams jianjiang junkshop nungaray yabunaka oskoui golubchikova magomedtagirov dangana elshaug ncredible tisin ouisa lisenby nuraini 
ruecker ebie onodi somnambulant fallu unjustness dvorovenko consalvos tlapehuala hootan lisis relinquishments tenaculum palek fricassée comilang reconceive firtree rosegarten collucci ejehei unblindfolded prestedge voordewind viewerships urli malgieri ababeel yumyum yandicoogina shouwang atabani imbeds tottingham widsets vasterling bleckman tomljanovic khameini innumberable argouges rhissa xlx vexingly dervi gullino malasian cuemaster boxter wherewithall alekno schorle nigori vangard unloosed lekach harnek prepurchase sacn endter sielicki amanjena pelevine microcalcifications amatrudo exoneree smwf abereiddy bettane webbies riverbay otpc palmenberg muolo ackert refinetti busness councelling nasiganiyavi schlubs morizo beikou shokhin dysfunctionally deerstalkers frankensteinian bencivengo dusks anghie bisoi ringlike engressia frontloaded prattles dimwittedness onetel ratnesh flanz aggrieve delcour nadali moosajee szrom progessive minczuk reinject indevus antiriot bortolini microculture oufit kevern crosswater metrinko wlaschin wijeyadasa antiphishing roséan korge siguiriya ajwright yellan shellsuit lazie acdi spadeful austrlia skived reqall girnius swigged vivra alaitz duhnke guaging ricelands zerrouk surk hailpern batsuits broughs dissemblance papaye respo setoff repubs fudger whiteways totobiegosode kingwill dawki sintonia lejnieks meei strenk freddiemac counteroffers hemwall pyenson aaqil mcclear hunstable reasonalbe jeromey acofp gabarone hyperventilated autoexpo peppersmith bibbings judiths humorlessness raydel nonsuicidal chamie jaggedly bettaney fortissimos lutker navic lasis riddile birzer shayeb shanzhen auctionbytes medvecky ahic härter superheats taneal houseowner solino deynes wensveen cmtl blackhearted transoft scarify sauerwald moreish antigenics fanciness pisaturo nigussie oppourtunity bochinche flunkey raynesway reinikka saadeddin mvumi feltmate kleptomaniacal activequote differenet regasified bestrides basardah fulsomely coiley onama meganne sentras 
batom chilcombe disembedded trunkful playcast tanaiste inspiriting blairon sneeky huangci knaupp wiid euroclassic reimport ghullam finl nared socìetas gachoka jackmanii trikilis declarable rismondo basbous godé fatahian oilrig sdcp furriners downtowners pompholyx farbio vangala suhardjono inmage nordson jianfang sqo dcmd scossa cozar suctions pigmenting monstering meglioranza sweatheart inkpots benthien oneof vernola alore berettas hogwart ffcb salac cayler doonies autopacific winiecki nondefense southstreet allscott demichiel overprice scorchingly berru wajsman sebonack givebacks trickily mullaithivu kotting perambulate cavelike nicandra craigour guenot formidible powerlabs perkinses aabpara apls obayomi jolita shkurtaj vanezis fssd kapetanovic scheft alyanak retrains leskin posole palansky puigpunyent devay bieniawski nutmegging albertrani hadjuk goryunova jumbish khales jegher femara defensman avapro savient spoofery dharoor wolensky uogb yhency cuminestown tresspass pensilva chikwelu percuil safyan overinvolved reactivations woodfalls varelas muhtathir zabrocki pentons ligocki jensvold vaneman mullee tolemaida macrosty unrushed chariman jauntiness nnpt nogy horible asssociation kawale bollore prommers redounds freeborough maglis mensun stratstone vikuiti neurocare bustami marenzi dalmane illmitz francises undershorts angwenyi amadine dabic veddy hctz gruessner bifeng recaap coccaro diminuitive bachur enpa charleses cyberpatrol jannuzzo walkstation roppo polysorbates sianel challengingly iverse karkkainen greyman posedel barfe neilia jinal aftermarkets hajda turismos citkowitz zoidis trotskys ungraciously midlist magneux kamysz sinosure wihtin lesha repellers hybritech fulston blitzkriegs balconette fitchet outscores virtis dipenta inteq rigl oftsed neindorf hindcasting corehead dorleac vukic kolola pfis shovelin baszak melanee englemon wbmd bowlhead insm nolhga keynoting laussucq unstowed medbøe belayet lishen cheesegrater subcription tracewell tambunting 
adolygu harshav faruqee allahdadi sissified theera tmsuk shugak omniscan chevettes pikestaff phambili rhinefield kuchenbecker lukonin lifestreams polarn ebps boxercise texmex blondish shishkhanov wallender derrion rusticate nsia olfson sameem whizzkids mazria bestir lakner kenshoo meghir krauthamer reannounced blousy zirok photiadis smia reinvasion simbex musicforthemorningafter gothberg feasters genteelly flometrics knickerbox feuillatte tulchan kitcho quietman copans wythnos bravinlee sanctimoniousness mathez szyf mamuyac nonroutine kamikatsu ingla mizani gpic puedan goadsby istore krowne northacre deganya kayishema spielers internalises muhummad anthuriums bankengruppe iped heiligman portuguse cgdc rosseel infantilize schenone elbagir beccan yasouj hudgell callfire deerbolt leiblum ineloquent cussedness haslauer gotic molindone thecall feminising akipress esquerre suriyasai agnitio neelofar madrilena folkloristas overeats whizzgo majit logomark dunlopillo mcleodusa schear nonvegetarian turkstat riversley renoirs gadekar tigertext ristelhueber hubbie shaprio hoffinger sylvor liepzig ersek comorans souryal politition zere brideau celizic pmti glasfryn weinhart thumbplay duperreault madencilik plestis shaiq shabih soapies antimodern redevelops negligees torarica tagai brothman naggy igcs hahahahahahaha rohtenburg electorial longheld rooseveltian overbook thermie abfs dothard sintu echouafni noshki mediacurves boosbeck readytalk larrowe wanabe appetisers baumgaertner deminish xiuyu leftfoot nakhle cloudveil kohail strentz germophobe trotignon ledgemont naoma ghosheh headcover drotar preatoni dynamicops chantra elchlepp fanteni ususual celebutantes alguire leivinha haemost sagerman shraeger colliford gardler brutishly sendlerowa autoban cherkos smull golfballs wafertech mucinex reitell logicvision zatloukal gleichmann lazca artigarvan wooky idyllically odabash bbumba rosheuvel efit elridge taea driade mcanelly hitsp gelowicz talebzadeh steppingley kontakthof vrae 
satava vinacomin shpiel techsnabexport kaptel hyperinflated duessel vileda dombrovski masscap moonfest kheirandish supersecret pyszczynski fenkel proposterous eastell koufman zierath nabintu chorcha løkkegaard cusos heicklen vlock artwerk schwabian thanachart wbenc depinto usaec fenproporex mirakle kozodoy arreaga kforce wepler tsalikov lapica pezula tavaria joudi corviglia softgels kapahulu karber cretons scrapbookers twinbill mansouria communit ciza kriwet bdhf breakish pacesetting weisbuch shogan vivisecting schramma hampnett gameboys mcarabia lairige strokeplayer svase bibelots rivermark belltel applera acomplished njvc septicaemic felini rmhcsc rawood exubera curnutte vechicle bajorek vashishth volumetrics gongsheng grumpiest ramlow peratallada eservglobal troublé getequal unartistic lytal strausses pedri machiavellians pazel ridded ecotect corveloni minimorum shamburger chambar perusals landgasthof trammeled plexxikon derinda dalpiaz toffy zafon amirkhanova marzola mutayri dtas repot bassuk villification suncare khomeinist luari lollygagging fireplug halaco triessl tendonectomy falby gallahue ehnac inflationism dardentor flueckiger lickteig fettled domenika barme mummiform ivideosongs unopinionated rubeor intransparent vercoutre profiteered domenichini monolines gentled overapplied hejna maching ballbearings finers skycourts millitants schlüer homecomers grommek gamt franzo keiaho rugasira drčar casulaties groov noonmark elderflowers grauls battlemind ecosmart softkinetic besmirches clubgoer prediabetic ibbeson depouilly multileaf nebulousness gorenflo corace mahanna ramadin carbinoxamine pacquaio gobbetti foldover furbearer windsocks dabab wollack hladowski opinium spykee convivia devauchelle suul dallasites shareprice kulinski inglemire dinola kevane schwehm magovern contritely youthlink begginning batterberry biodigesters gedow kocken afonwen luging neratinib kenzler borqs corrance qureia malcontented rilin pcati prizant tanora anshakov hied bridgefield 
locavores falafels markhouse uspt sombreness babycare witthoft jijun cheesing therkildsen haegele madaleine tastemaking abvs awooga gemlike tahdig effertz autralia calingasan kechik friggen netshitenzhe koskas taleon sarnez gunewardena polymethylene microbloggers hostelworld lyublinsky kwanchai mythbusting askale utsteinen barimo taybarns fibia dissatisfactory comvita consisently mccarra drugan kabbal irurita janangelo soliant copemish grimmelmann unhinges dubitsky changey ifanc sectra romich bronis mazard tognozzi friebert afue descripton dovale zuska yasman kollins jalai afirm ballyowen pochat turrent komisaruk homoeopaths advertisng econsult shelver frix fuyao electioneer onmessage minibond chumminess endulge pastings beanywhere winpenny kohrman weals funkified champine hugheses umguza gähwiler viskase geospatially matw testiment jzj borther xaiver splutters gicheru diddams lincolnian gtec cwik lexcycle vljs writeing wheatleigh styrenic tabiou shough erhlich skimboarders amantaka calestous pesacov stolzius medallia depletable resing biomatrix constantiner unexpressive trilliant sooreh xybernaut kulakowski chunlei boffeli travelodges scootie hubberts spongelike mirandized mirandize alss bulabula mojaddidi choudhri beva tardies fuoss barankitse shidlovsky shanavia pitie estrellatv borukhova thoroughman gywnn goerges goldenrain wtfc uspca firb plunky chanc emptyhanded mumc nwanze cgnpc steckman opthalmology boomeritis minova humdingers sketchiest stoutz brodcast caschetta sarnies engrossingly anounce concience noud compx debottlenecking kanharith fynd lansbergen ucberkeley whiteburn sikkens alphacat switchovers dbfx locricchio chind vanned cruddace supporte lilos nickiesha rosebys wellawatta cryobank homosexualist redbaiting kimberlina piks abrham stoskopf zenergy transparência ayelen aboudi lejuez stieren dunnhumby farmout ngobese poltician hoedemaker househusbands acik karakus dillenbeck mowlavi belluci uchc doup subzwari availablility borchelt visudyne wimbeldon 
idlet macchione beytenu qaddumi tajique fromong gonul jihaad eucerin elkarra solarcentury ortegon arciniaga alavian headd fireworx menedez softlines seroka oncampus chmela deserie rybeck colee supramax baleira rvers chitr dziuk molaschi malinoski dellape barcalounger narcisstic godah miltant naudero marily kostow olmer eqecat foulstone kasav adahi athat cutchins rigaudo nancledra chtr nosherwan faaf tjmaxx unbuild boffetta coopi seipei tchuruk ngosi histrionically lalomanu kachinsky falconnet quiter mcalear pinked malalane navajoa vakapuna raduka peacher honeybears sudby defalcations gabot giessibl solovic ebulliently garone yanshen rerras junny origamist kopenawa gunel tariffing humongously owassa squeo foryd northwester forgetfully lochardil macromutation unsanitized thanklessly sadec delamontagne fremer readle fruz goventure hollandois bulgurlu tseycum usdx descenza ncpdp dogheads frenze betaseron preferreds lating revoy nusoj dispirit staadt palled ebitdar milyo tappolet aurn wellesian quaintest sherbedgia behin citroens marcellos donio axiotron zinkann peform xojet woodpigeons ratnesar karisa outspark cynhyrchu jurados tuputupu repasts youcan syman shimel amtote zerola devoré sukhois cbocs rúm geona srlc nuvox plumchoice gingersnap sheinkopf lellenberg seic prexige machinelike fighers lastinger plethodontids caotang tarpin abreham hipposonic magezi moghaddas mishak retkofsky bogacz holtzbergs gomart kurniasari walnes izdihar enpei grappolo evanzz ieak picosulfate fredenburg welfling blinkoff reserveamerica hempsell cigler cvik koelling onica tranquilised sinnette earthecho merran nuumbembe juking cadus rushcroft neftegaz hazmiyeh schmadtke refought mossmorran sinovel calcifies malary gracko trangressions wickenberg kazinform akoi kcic keerthisena sandboard machalek galeai spherification clixtr mascini cervalis tracesecurity revoz excoriations distroyed amrican arduaine etxeberri hasagawa shagrir tannaghmore redways anihilation olow cachay domnitz memorystick 
labyrinthian enegy saadiya binyuan blazic riofrio gokay tachyarrhythmia borgny sedmoi shokusan griebenow ejk sunsetting roofbox tomt gqt incisoscutum gotkin ottica stovroff kavenagh zaimoglu lewisdale grimshaws imazapyr knautz yonemori anthropomorphizes congel cysylltiad sierwald junade trupish isrl madivaru mewl pretti elomar courters benningfield borsodchem buvette kalist intrieri propser vellay rashkin milliohms technopromexport wuffli tamug kerevan perscribed squirrelled kernerman darbee fowzie teeshirts samels soothers weiger plasari gallick diginity girhotra komansky inveigling belorus soild morandy caralyn stoley hawlata razeghi didik alasin suaveness maxner joset edenwood adminstered abstractors invincea pennyless anado jurgielewicz aona magnun lunchers kemkers latifiyah hupond dominiak sambhi shepik khupe spykers cliffhanging absently eabis midpines domori peterbrough revivifying mickum nonstatutory invisage melaney hutcher paliivets netshops quigo turny lamak ghurkha earcup skab springfree mpumi sigurdardóttir depoliticizing muayyed repeate leising narzan suseno burkland elkady scratchpads toeholds timecard abhra nerakhoon lowbrows oponyo ghayas comandantes gmwda gobie esayas lerners zuora klasko kokayi cinematch exobiologists luobei bouga delusionally cvrs chetra biaxin reichgott requiris woolite counrty namad ameril tembagapura mockey snickometer fabby eirich adineta bogogno kubrik interract qchat eliaz nonaccredited brudevold glenbranter laglio wnats havebeen atayi lowgar polyradiculoneuropathy klutzes hearsts moshen xiaoquan sebastain rousingly discoved jouzel tradepoint alahuhta birtwisle girotra questionned vasby hazina borgerding pcga seakeeper wangjialing borntrager magdaline renumerated stubbled harneys atfa mayakovskiy rumery jacunski reibstein intraracial peolple nicaraguense responsibily prasidh klaber blairo ferrufino gallactica tareck sahaku arciuli uplc nofas yurov blancher worklessness milhaupt tanayev moudgil langert vinopolis nejma dnsr 
xega wilgoren amplex levkoff mittelplate gissa mattusch akeroyd cataclysmically ignjatovic siegelbaum myspacer defog bowab aseza rousell africian beraza cretul stabaek schéhérazade heatproof ciatti raushenbush scathed quadrozzi convis labeur precentage romers insightec mplms brasilero chanton sheffy aliwa dhamala acelera thilmany tainsh jędrzejewska novitas qdd expatriots bridalwear niquero cernota novoseven normobaric guoming sumroo sorah poed cthomas kibumba glendronach nucletron sneiderman krusoe zephirin dmard deskbound tradgedy bistis chaweng nthi pathologized cumulous youmail brazzale karunaratna admc nordvig gcobani proselytisers tukuafu tritely looing loyle killinochchi smbl proceeed nijmeijer supplementally audika todger spiridellis tenderfeet clynelish memorizable smattered dycks litterers wirelesses keville nasibov casalotti zuccato losec xalatan messiri rafterman upskill shousheng ierardi epicerie waliur trouillebert aquaterra amstelhof replants kazlas daedone vandebroek guiltier moninder marfork interlog gumblar talafar rattin boooo peopole hausknecht skarpnord thembisa chmc smru inutiles pongpaiboon tarrell ddisgyblion vetterling nizich rltv hinni retirment longlists metronorth sellgren posterboard crescencia cloues cleberg claimd salomao casterline ayrshires stuffin repossessor dabelko overcalled shinnojo mocktail noerdin chekroun colusso furtiveness oberwetter mafura enterpise drymonakos zacharda krasdale jusy sluder bekke fareeha casesa assisters sulkily archirodon borho reputationdefender nickelberry rejigger rhinocerous wbmc possis schipol lindelani hatzes fhlbank mrtyu liversage caucchioli intellecutal kirtzman rdq ffostrasol nownownow élitist vianet shakiba recirc khakoo emergance turndorf ebco malane jarg backpedals crommie saamiya treehuggers buffetts joele dezman osaghae kilburns cemea kalogera madaen ukad henaac straggles kattegatt cirtek mayhle efvs mogran newschools bierling garwe minezaki pogram tegge ioco searchingly wasiqi salky kalloo 
followes jiskairumoko fejerman krajisnik dupler ranchita ftrans mccarthyites ambulancemen balmorals potatoland jowharah transportion belkas caly colbon jimador nidorf jreissati unplucked saarwellingen tomcito sunich whackjobs adesa schops dalesio piyasvasti tsetan demarius chimdi kronkite finz musahars ballycolman wybie stottlemyer amranand pargman anandalingam danhi shabwani linkon depthless camers puigdevall xacti relandscaped gerel potpie racsa olotu cepia plumrose ridgelys driveaway mozafar vidlak muhire officiators wuerker zunshine underlinings blisset guoying palmerino pepparkakor machipongo gashaw philpots senkel standale ferensway zebrano colemen ceysson ovulations satterberg eoms kavee flightview lebanons solemly fangman cffi finel maundering negotations gieringer farouqi challahs juliénas lehotsky rubinowitz stridex twitterrific venezualan edingburgh tallyrand econoboxes taglieri sheiman mualem gpy pommies credulousness invaluably toukie arrowpoint lmrabet ervasti arsu citröen liszewski osunsanmi overcommitment schisgall giraldez shamalan aquisitions trigen vanrooyen powerfulness wilbrod limbourgs savou masakela kalvitis macwillie jennerjahn terrico demolli ciic catrinas alamolhoda yalies minibook shibis wintemute haijiao padillo juenger sciquest mccarton wierzel hazelhead ggtase brefs bendich hamane sheratons kulat chisi plakun cdus rafea liguasan mcdermotts uncontrolable hulkenberg legistlation shilou salkantay darkazanli unmistakenly fatuzzo touristed snorters braison andolina noven hegsted surgicare cadapan griddled bousada plcm usibc ebersman moontoast turballe hmoud samares blusters shubaki cowells rumberger derestricted announed germaphobic taggett botticellis margol mammuth erace nalge firemark refugios montepeque gratuitousness wetters greendykes sharil kinno sitrin fneish mabhena badush rendleman cadged oveur kroizer trelliswork chagares ratners hekhsher unloyal aspec morledge kamuntu biesty habibiya robeks refinancings paprikas grrm toking 
computerising corodeanu milkings kildary skyfari kidswear bushcricket bollixed wallcovering schoenwald ffgs analytix birkenstocks sefland arvanitakis bernholtz joltid dellia elcoteq decends buckmire teamwear belby establisment sufyian innoculate ameriques procuraduria yunhui innotech ruwaida liolios cheapy rutowicz vigoa instituions amouage aniasi casias lennmarker yonto sauteing rouas spruiking inmobiliario foriegners extemporize cancercare winehouses numberi diddums nonggang bucklo brauncewell undercounter glybera zharmakhan sibeam stoutmire osscube unreplicated mohau disneyesque croute inescapability groebli dollarisation appendino bernabè cmit tiii dipity contabilidad humanick burtone sheddon verex kalishnikov climbie texturizing sikku recurrance gringas hipnotic exquisit gogava qeybdid hashash defillippo musonye darunta rahmouni geocacher autocentres changingworlds embelished leitenberg ureb nealey baringdorf notarizing klimts dementium kafoteka barnshaw outracing jounalist querce websafe joek anzueto parayil bicyle dijlah rwdi glaspy donnent monigan naadac olitsky guirane humungously wallstrom legistation excuted slotboom gavilon wnes mpowerment plastech klawonn sticka aspentech bronques ballestra grevill platoni asss khuria skittishness jonagold intolerence ciarrocca smoaks versacold yalennis linza zabini rubiner zhixiong establishement deliv lavatorial layva episcopals tagong poluleuligaga habid garwick tshepang superhead spainhour urband neutraceutical rhynchophis aspirus koelmel immunising distaghil blacklaws tingled drwy mevlud shelledy breteche zhenmin mooches lieberfarb damione spirko imbrasas lazaroo sugery chusak canvin bashmilah cabieses opande spaisman coulsfield goidel smooter burklo virgoe lylle ikhwanweb redprairie khodos unrequitedly destoyed interheart mingqing schoolcenter therminol hightails sokhom zhanar coshed limbering oxytricha apwg luochuan defenselessness zeyda bootiful bevine enviornmental freecom pitoitua oberwaltersdorf wraggs 
arnautovic silima sicked tapili yamgnane asymetrical sewnarine fassihi milberger haeggman deresse freidoune pelosse chirashi venissieux adamji vogueing comor conchord kobau garavan giannandrea panameno bamdad dufresnoy tymor elbmarsch xchanger helgren trabbi khaymah levya moujik benezette akhlaghi pietropoli infosurv multicourse smartwool throwout altimas drefelin hxm delvings foria barcaple debtholders boissonneault petalas sorial munyakazi nationalmannschaft cambadélis circustances escrows keilen kissenger lifecam sourer chumleigh meciar bigeyes chailleach moralised hymenoplasty agbami keyaron zeebroek glimmung bombardini imune balinda ffriddoedd harach djinnit etirc sparts forbeswoman neumos trickler borwankar braais kerper peforming inoculates fleita aguet abdelfettah khajavi commy fioriti loller firebag rtog chakaipa louks shabout tekori coway ursuleasa loudham foulards vonderhaar goetzl xcalibre blochs habetz tawafuq houjian jerie hypergrowth heathcare wishnie nextar stst mitalipov camdeborde wellenberg owyang llrw schouppe otkritie counterproductively truley saridakis aubey lillestrom trevers stepgrandfather virnetx ailman globeleq ibercaja reveiew salmansohn strm asaib yilishen eliadis jaquemet liscano yehiya csbr ingelise slochd brandix orthofix junhe mariaca saood yoffee dubens mynbayev sprawler restabilize sfcs khadraoui adrenalized simbolon bescos kinghan yachtbau tsigdinos trokel skywalking yefang corbato midelton datoo killt dajabon tcdl masisak kiminori raleighs minging timesand klebitz nauls coffland fielea curtley mailshot macroy rebloom deicers sundkvist buyanov hatboxes brookses unhealthier rathina luparello moraal oukaimeden jodean devyne callgirls papple zeromax kurzyna scratchiness flinter sebc touble crockpot foofy kusasi azilah slicking bntm dinerral zimmerle sardonicism hippiedom wtkn veddahs refashions wiebenga stavins darmer nbsc freefloat dufficy subashini sysytem kilquhanity representivity simman positons powerblock outstaying roeckel 
flowmaster eckholm gyorgi hasibuan naipospos brittlestars ayeli huijia transapical housam dettra pekins gwrs malayev bocevski llorenti swingby paddywagon wavery maniraptors janaagraha krueckeberg schmall nizri martore epbs burred levulan biumo windbaggery tharrington carped lomonte wizzit mcgory kasserman denouements rcrc sertig percoco saidur trendspotter lupetey priniciple nuthetal gasprom unphased unhorse karwoski beringe arboc bonterra intiatives stiglic mokambo skilcraft monsees mateljan levantado splodges artsquest aforge mwaniki huachen perogies videoscape aveion breezier serykh congree bandwaggon zerefos canonisations tumminello yurkiw weisburg samoens lichy abukhater sioban cruciverbalist extortionately crocked dubik baaaack charnvit egerson predevelopment potbury accg trainride clottemans wizbit trundell lifevest lammermuirs ghosthorse hassas basateen gavora hasman shaoyong sumaidaie ocloo opressive skjodt waltic nuvi coneway anthropomorphise hungai perfformiad mavimbela vetrazzo australopith tamuly luchezar adivce advanstar apruzzese bowoto leyzaola ireporter eriam vonkleist taveau houselife intradivisional defensives monacolins keevill somfy raychel nonprime maestracci pgds yergeau blueeyes kostelac touradji avtec carvello turnd bumpf seref multihued pulmoddai adriyatik neutze trinajstic areheart tdbfg scdi monocrop gemütlich rummaneh peterken dozoretz ultralounge schuhbeck sharyati agunnaryd laquinimod ricigliano bewilderwood tripitikas huthi chakushin kremlinologist unclamped glaciergate techine dardennes zahwa sapergia musngi canoscan tizza jived sztykiel pishchalnikov salathe scholnick microcurie copaque mobilizers safecrackers menotropins ziso naimah zenithoptimedia intelliquest sobelman adjunctively cloudwatch guocun bourzac clearsky thornsbury cibils panaderia changewater myocardin denene madziva mezuzahs abdalati inexorability bioculture eitf vitaioli vickroy nonstrategic morotopithecus turken rawland yibi dmitro sigurjonsson merighetti ghaire 
bouteloup beugre apsco oddsmaker rustically sunnyville hpmc linkery pottermania turiano cimpl netsanet muirtown ffpe underbidding empirix farrey lahoris kubr letford garapa capful purevia clavichordist diamanté mcmannis oldstead jahon scunny schnegg ddess gransport irace ardiansyah cosn bellweather binit imjingak sinabang grandkid casasnovas mediahub redjeb rfec zacharakis raufi futureless tomatoe phillabaum foodwatch atomises merlau passacantando myxo winzenried ayung maryjo mumtalakat neigbouring panden shaminder choragus katesbridge sockless sebasco woehrle vernae famly dllr chemnutra malangré hajis jarwan manorohanta plaschkes felidia confidents whitnell repremanded hahnfeldt pfannberger xianting ohiri negotation suleimaniya cashcade schofer feczko shriller roulez nearn carnaroli gronberg keilholtz wattenbarger prometea vasileff videoplay apodeictic dispossesses detered blogpulse lambreaux kinsell forthside sicortex cartoonlike nontenured kyama mancrunch mavrinac unsaddling zloch bendien dustups ibrayev mamool turcas recentered idoit popularisers shopsavvy gawked fritzel biblarz thowing sharrers mailander cerron tolins marqueis rejiggering devestation gramanet powerfuel expe palliatives broussards iditarods ulsh currid hérita verduno natsal eroticization easynews ziha wiranti scoopful niyitegeka hallowich rehospitalization devincenzi nadarasa sabriya desexed pollman savaya nsofor eisin ubisort belatacept shabbiest changsan axid norber delievered gebregeorgis tukssport khedafi goldbar outift koloroutis sheikholeslami borkenhagen savlon widemarsh siarad hdaci zushan porshe giddon carrizalillo intersec kervorkian ferk easybib shamsia barrasa mhuintir filopoulos swordmaker ureilites profesionals leigha austraila eldrick eggbeaters dickers kotzian highquality emary calderaro adako mowaffaq devhub farthermost lostness abdulmalek barrista giboney softbook colesbourne paln waginger nydegger xpressions palancas fobb schoolmarmish showcourts greenbergian americinn romick 
semagacestat lloydspharmacy espinos recommits acquaah brouwersgracht rcnc yanobe jossen koenigssee onts sawadee chelminski clubpenguin undented garske zacaria uchizono sniggered houseal khanjari kozerski kalfan kagaba antimalaria troutner raspin neuropsych klowden longpen bermillo hatzius calvay epigonion dissatisfy icue willebois collydean hungerburgbahn risoe tornoe tanae romanengo broekema denbrock txeroki lanum immitation inevitabilities wahaca ntag exida reuinted frbny lopresto platteau bythell asociation neighouring lapera naumannite kollmer competitiors taghrid summerell incrementalists eerier kolpack sheinton baytree marivent phonthong suglia archaelogist hoeksma overcut strelsin silverley lichtenheld darelle askernish segatti elandsrand priszm luciene infragistics chorost mindflex nwando rydze tunebite paediatrica cubasch wakfs hasselback akill atmopshere reborning sophoclis plassman galliagh sloshy jaibi ostick plasticy schmiedt sanca reportin wolfango ethirveerasingam grami econic msif mckerracher semapimod klocko carabineri incandescently mlilo evilest norcros bitsakis throttleman anthonisen mccoskey triblocal notal besilate radya descarte bionovo firelighter clopping placemen ruchat demisse paratek menzes ehring schmahmann chermont merholz stockpot aronhalt valentenko tulles pidsea thsn mujirushi sereboff thurairaja sealcoat elosegui crohns pecentage whirrs xitao graniero gawlo earthwards simerini orthosilicic justanswer riggwelter aorund brouhahas debrosse andruss cityroom akba toneelhuis thanee kambwili selesky dropback jesenik curnutt herdswoman annisul enoughness occhiuto biblioburro parodically kresch challege cowtan cannarozzi yucho telnic sholley louisian rieuse tradmed trevemper bedritsky qualies holtzinger airworks hernquist bahts pappelallee erehwon simspon jetboil tabouleh xhelili jocic pursifull schuchman vilstrup amross decifer markwardt abazov phaswana differin vitrano toprol bartoshuk guardistallo adulating septermber bezirgan anisi 
daulaire showhome tafani gunked ptec boobed karoke antitrafficking kyriakopoulos stulman nickl nightastic guibovich belskus kamynin kalisha centurys billionares froebelian cardie utilisima jackovich barnosky weister bertinetto finaldi crowlas mingxuan clearable flarer sandpapered lobbygate aqqaluk wayfield waterflood baseej govloop churlishly vismitananda lodenius racerback foddering mangara barotraumas levale deferrable blackfriday andex sanminiatelli narjes haggiag lvns acorp muyi gcash overdecorated hibhib sponseller stedim suncadia rebanding usbf gospelly uncurled zommer pedicurist luem falahat nanasi previtera globaldata wintertons bevon dailylit chunlong homevestors spinkai addustour parrog simulus petrosaurus popke smri passback communitie wassana sabans oligoastrocytoma talbieh goucha tchatchouang ndebeles mellace magnetti penalities jaroslow felpausch cdrp mechira lagrassa millesime ntegrity vistec ramjeet mukhisa andraz ickburgh liebovitz dotts offroading snowcover supertex tuev imposimato eutsler cibernet kazman laermer fordcombe dtna scheuering senghennydd wassom melness mantarraya cravotta kattie croddy girlier alphine stacul brangaene dramady guinazu drugscope jasiel windrem dlugach sevenzo risius sezno nyingi reflexologists dorritt mejstrik unblind devaluating summerfare fetoscopic dancemaker hosseinieh frightener shenita shakr adalaide bianchino spiritclips nqetho ctmm kickstands stodge brosens aseff metroplitan zues peagram wigdortz yoshitani supergrasses valstar xiuyun pannekoeken presentationally kullgren sooud pupillages humenik autoeurope lindia homebuy teladoc latzer fraxa kamimoto kortnie kamarck earthsat klaeden harethi kogito oschner staney speciosissimus gubmint fireable worrad jackknifes pouchy anticlimatic gyrobike chelada haeji tjibbe rolontz burnishes bosdet winspit tarciso sliderule stultification applebys cutietta lusciousness mariton lupson autom renegotiates prbi pauperisation copperwork stateliest restringing jurdan liechenstein 
engineeering pndd cotonniers qateh kagwene sayansk scrumping behuria hvps mizera escobars chewey kempski ceraweek arres serritella tinkerings cerrejonensis cken wyeths prupas hoosh troadec tuntable trendalyzer ruebel maykin duckfield kopczynski blus etymotic wahner teamaker syafii biohackers prauss bernoskie amwal andary oclaro slurped lanzman musicane awrt estrangements setterholm churchbury tendrich eruditely telk crovie cominetti swaybar arogant zeelander lamana buthaina wosb kermabon sieracki deward gphone mozafari gjorgje edig deviceatlas unibomber wolsley kasulke robotlike galab tazu hawfinches gaouette dierberg soneda dranginis sharisse sufferd goldentree mepolizumab fradette shirtsleeve disillusionments ddaear fatemah sunlamps peachcare slickwater vlcd spluttered viacord berryden petionville extintion gulladuff pheap alvery leonidis aquacity flacking grinspun ilkham wenglish rabena ridao yordas buonaguro levans cevipof aosc mannello duzce nantcol zdunek helico slong entrup whiteleas stringari indeck jochumsen wellstream mataponi niere mwesigye meroi tremblois kaleme biadillah rapporteurship mosala cuiv posegate chhon valentins opthalmologist panmunjon maryl multiconfessional domjan harlans marzluff pointsplus peoplefinders simulative entraction zguladze fuyushiba gerami energías pernin fortfield nurowski lwara ravishankara apologias alberstadt travesser yennifer uccio cannongate fitisemanu braveboy disastor pregant infusers pantymwyn vaco highleigh ilcho rochez vapourising underplanted cenckiewicz shadjareh plasticene ramierez drca bline doje symingtons hermelink pedatzur lionette piacentile sandpapery cubbyholes knofel kalogeras feiger organistation tweetups ovalles chainwide shoptaw lifesouth vandrevala barbis satyadeo broccolino rabdhure keylime pohjamo sadean joudeh unpartnered furfari superstocks bargu aquilar colbin prenups buddakan helsington powerpacks lipworth moutsopoulos intereste tavelli gibgot vegitation allaho mashatu buulo dynacast olagbaju 
schwerk solventless wigga haymans arthenia banze deddeh garzarelli spahic elwak bjog wideboy naderites esraa greenmead deflations durado microcentro yawei cordani normacot atishoo oportunities fleischacker trahn amireh falceto tonnere rensberger aerni timberhill nooooooo nisgs munarriz consignors adcommunal shulong maralee incessent raddad shabes elfred rushgrove aherf mamduh hizzy lucratively ayalde deskphone progammes quailed rhymesmith militzok skytown rattanakiri sudoko supportsoft kwasniewska tungesvik totengco airmedia catholes hirshler baldeosingh degrease hittisau elshof danovich narenda shaali trepidatious overperformed jamiyah klun axtmann variani rephotographs longroyd outdistances nowais munud moggies shirtwaists thuronyi farw otls jorgan fallafel goatherder kosterhavet narvaiz kacelnik xinzhu exonerees khurts ollusion yaffle lilts zabaglione denmans kumkapi kemfert moonshadows alessando audouy korsts dioli gendell slebs photiades bramman gueliz slideout biosaline segements xenicibis rolos nithish polybag skittling wicoff bamieh sightholders hauntworld dirouilles pratices yakker abcn eurocrat sayedan fulmor mahmidzada luja eritoran underperformer tremopoulos weddy djibrine suttirat midwifes medplus noncapital cheesbrough madhoun juknevičienė rhinodoras kazai respresent lanxon mesomorph shahpoor gasgoigne falkands chemtob alestra regualtions kurshid macropetala sensoy firends pantalons studdal czisch salinated balefully intrax yarrowford emruz monochromatically peschanski oudah trencherman hafedh koitz chitiyo espley scambos ahmady morells talek fqr melome soonercare ogechi arrowbear mpes fukahori lasondra hethersgill santions cucs flatteners futuresonic lehnertz ngoun matovic mckelheer ciprianis eough wiita kuncel duddleston prudenti zenti maarty rttemberg firinne drummerless adelir dizzily bullishness siadar pauperized conwoman nblsc dragonwave cinespia vathia outmuscled bailenson nyongesa burfi soremekun pccy hedonics connahs balyoz drhp harsimran 
caihong muyin lagrue pfaeffikon disinformative twitterings maalla amerock tassagh jilo babybjörn defencelessness morrisonn cevik chukri apegga hottoni wreathing tokuichi sanahuja atepa hardbitten pluspetrol sesji virginmega azadiya iluvien hazzazi mirina merico steneck enomatic thixendale dominca untransparent behroz bigbelly hyacynth niqash shapelessness tarascio darias returfed insurmountably cappellazzo firther imaginis champigneulle signvideo mijac malkenhorst sherjan pgcil conforama bavarois yonghao zubeda glenturret hadjicostis burnikell altaqi haigs aanma bollwage pongthep kitsyn nonpersons keigher barbaris mashtal chabraja iwanski knockings mjallby aldeanos qpod langrock schweinshaxe raschhofer cocoavia gershowitz cheapies targeter boppre grmn bollix obopay lendell zeppenfeld reproachfully doffcocker doubleline kellestine onevu virginiamycin ppuc dormeuil ciggy guilliani vanderbeken usaction ssed trimeris hitachino jbwere nvta sentencings corbieres herkert whatshername gawanas satyana bovt gozman incontinently kikambala pravachol payloader reamin tuddy cothron sengamalam fahrman apidra dolatabadi sagapolutele waldholtz bobińska kapron chepkemei ariail goodone prelec vivix habersetzer raptly envenomings lizarbe valhallians yogurtland rudahl neigbourhood raner stablize nacds ijegun mindorashvili barigye ajdarevic beddar biotch subeliani yablon colostomies shpigelman embarressment joisey smaghi trilokpuri garoña wynick headshaking taherkhani asiantaeth cbtl fuzhen ubci omache denkaosan tarmiya filewich baiman konczal ukaid beesands didas starcite molyvos vandierendonck salpigidis jackmans dawdles scentific temozón frother lipc milvina lonedell jaffin ufot hissong kleinfeltersville ronneberg naseef janian pharmaca kooza clozaril wojdakowski korostelev gastrique kondratowicz nulogy tkos hajizada kwaje noncoercive nimalan kruglik generose telbivudine penello unliterary laurson unrelaxed sagrillo biswamohan grachvogel myard circulans khusbu voguish beiliu lubinda 
iget inartfully ajuba rancy distributers rauchway unsually wizardy workingwomen chmelar ganakas unappeased lital salera outdating serice nanogel ebrahimian combita jermin audioboo slettedahl wories sreemathy imbongi aslyum bartended jermel mussbach minibridge micromanagers cozaar shemmari breznitz reiteralm harsant gafor dewji chilin ndesandjo bugliari labovich peepshows apotheek talibe pettinella hufner gylve responible hipoteca unefon microbusiness capato mdri haike udmf novabay ammonds provokers submeters gutgsell rewrapping gagara liegey bouillard apey measley kotrikadze zeidi okarma flaxby brochstein ricciardella azares trongs scrooges horsy wallenhorst tinyes adorableness lucimara dracup czapnik propulse sillick vanmatre schaumber kylies compartmentalizes organsation emeklilik tamadon lackies tidi scrunchy bartran chargrilled gouray messchaert massud boteco griethuysen sauvignons bicurious chuanhui ivds guernesiaise baldanza crvo ollivanders talisma walstrom mitham bourbonette beideman bcpd powerphase winnning newstrom terible durgham lobala khromova rubane veliyev jehmu sasparilla shorja chimbalanga digitises dhahiri weixiong ozmo stegbauer onsm recyclings kilpatricks accuvote smartbox akhunzada cabalettas belobaba vyborny bourhane egros woodforth thambwe nazeeh daimlerbenz owlia bushweller noritsugu baicker ormers raggie lynly agamez lostant cogsville wellpet zawislan wqma chandleresque underspend kuhnhenn kubbeh morghan sicced qubaysi revies cabei perorations realtions mcclorey gribiche aerophobia shahwan webanywhere hablen pluthero shindaiwa teleopti dohoney bochert cuddlesome allera bendross dongpeng kilninver aestheticizing ebondo expropriates dsit geisst sinot adelaine pwyllgor ambulette nadama stanfordville dipersia fagerlind jazera honeyborne thermoteknix keatsian qinming gandus umtri shahrudi twitterverse inclue milbanks fillman zisblatt kajwang weikle wolfgarten redearth teseq fernworthy lefor akse curlicued wetsus guidette bakhchanyan bosci 
blobbing planès saghiri laugable chesshire pravex internationa beckam wordily northhampton miljo bornhak gravley ageron koperski blamelessly linkia talaris mcintoshi bawdier theise fdas sistare naghmi ipico novostey twighlight lackadaisically hirbet lewiner bedwar questionaires apeing fielakepa theyear akuei colonoscopic kinam xmpie kisby famlies lukianov brandent alimta kutlay sirchia eitm patrico hasbun barbequed crackback zalesne urbanizaciones suceptible dunaire failled sathre toniu overwhemingly mountainscape insensitivities alimentum backbends clearpoint tepperberg barkoff sultanzoy klerks rudesheim cabassol ifire greenfleet bobulova ticketweb coinless delwart tppf intital bosland roehrkasse lashof afterwork miroff drevon acatzingo killavullan covich unaffectedly baechtold maelle unsingable madland petrominerales yonus barafu audiocast genteq siglar farham slapsticky kaletra eurusd lueshing androgel changgyeong dairygold euraque avtur memela fischerandom kinkier seitaad kopecki ourisman crescendoing ahmadenijad kerbstone gayego alpharma rptn radosevic downshifters trevo naghshineh emmad bluekai prestipino petreikis koswara asloan ballakermeen shaqa kamagata studioworks kolish disparagements yatkin esbjorn socitm komid absord movewithus decadron aygul piraha inorganically kinmond spaffords asmbs romcoms nshmba akpd furmanski khardo panard guideone courroye lahouti laborte hibba mdms stevil hardegree mamaliga genexpert cifta zonderkidz popularism talusan paycuts feanny sorini eeeh liccy gopers beertje ohios reshipped budathoki mtvnhd honghai koifman carruba trilevel cautionable honglin unnos allerslev wickramanayake elementeo symplicity rtuk hornedjitef geolo vccp reismann framlington vpsos pygar ratably healthly paasschen boxclever peschka awpr misadministration serenic peaceniks homina holiman xingxiang entrepreneurially terpeluk colbertaldo consumated simponi letroy bioc shbak medlink sitefinder sordelet yaling kapandriti customizegoogle sennitt rotavator 
gilburne esmeray gingersnaps tealights optomistic roelfs lovieanne buzzmetrics membathisi monasch trifield kraisintu musni facelessness selenological kispert cohabitated iekeliene virigin mulemo sakvarelidze riscassi smallhold damians cityfile textainer sisic sowood spriggins abstentia haberg lving banatwala nishma platkin shnewer shyanne zubaid bocsa memeory deitra undebated sharkawy jolkowski neuvax casnocha nansan pansion vacumn confoundingly palnackie draemel mochammad algosaibi unbreached macbeths ameican devaud gerlan bluephoenix suliaman mahfoudh medr bateses korinek volberg cusak hrouda jaziya thuring nagayuki xiuqi knockemstiff givner ludivina midpack taramasalata gerrad slashfood doxsee nechak ogonis criticizm mogilino myvu decarbonising eact uatp alacchi dropshots prometic evain doubledown bannermans glamourising anecdotage demolishers compari coloseum khadambi flâneurs kuppinger glitzier heartware nexbus abashilov camowen lsgt imarex sisavangvong gimpl mcconell thuras reefat mevushal cacaphony sedore amantino tillysburn touchtunes semsey cdoc thieren ofice obaigbena mantraps mashishing parvizi kuittinen hayli radition haouari chrisi giek testolini nonmetropolitan gluteals kronur xinda kpatcha lrri klepach linkbee glante frengo karubi myobloc otuam redivide ciggie consituency moukheiber resculpting qcue destoryed blinged sailab bouhnik apiarists woild holimont alameh veremko fiwi smci unscary otabenga caleca frerking pjhq kiww bobsguide tvnotas lietch gvidas meteorically matynia wiessmann manferdini jamundi geolearning hicox dobrish oppresion lanice dacquoise trosclair brocher windpipes fulbrights guayaberas gepetrol varnagy kashief kosanov raineach sandhoke ungloved whippe squarest raspiness revelaed okech zhongzhuang somaia dudettes andrikienė natik ramanarayanan darnowski gentlemanliness adhp khaidarov cerebal plutzik santigie fallibilities schaeuble clariano dodgily funkwerk ashiestiel goosestepping neugent orinoquia mauel structual ritazza ameo 
solamere kopište lizzimore basjoo dmva lynnea lukoshkov waterwings synchronoss itgi ralfini pretsch autho nourian sageview hassanien hedigan tavai corsendonk siriwan jaybo verdiem fuljenz yarmash sweetners molak marcalo ispor glru encams youba helyn wateridge dragonballs muschett accountholder clemencies gemmologist frometa europejskiego porstmouth waterspace consumate fossilizing mussawi harkess uncanniness gyory zwahlen affilliate dramesi hiroichi frogmarched presentability whatsits nefe sweere szavay carvo wanjin sangota rinconcito gobey fomentation fondues beore trakin awwwwww liftboat benata magande pignoli wormuth mbow dyrell mariches kasdin tolou parliamenary ecoterrorists brushoff wuethrich dundreggan buckberg bhagidari mcgarrigles yamith targed understeering immortalists swecker astropolitics newfel laudonio lowcost leadframe glorney hestitant eythorsdottir jianling mhrp keag guerze cidery cococay mccolly mellau wolchok lbsf beefburgers denaturalize numayri ssrt twork bergstroem portalatin brunellos oddsac traffiq satisfication clawbacks publc narcotized hartpence colombopage detoriating brassiness stivanello candelora fidanque matricidal extened liveras ibrohim einfochips misetic teamworking inzlicht rockwater hoedowns koteswar tackier solih fleapit apetit handleys pebsham trustafarian qumranet nardy lunak impenitence nonhlanhla molgaard oversample boguslawa nxstage roadsweeper therapod inderal dewitts tutition dapuzzo lawleys ombré merzouki tootling demostenes lotoro repya susica anbarasan flightlink troj winghouse raiments btmu mienis jasira flagyl tittilating mizban rodean mingli ludila sosostris brachiosaurs grammaticas hartshay masiko liptsin amanzi flyouts almery vatnajokull edilov maliah olubayo bookin solaces ceiops shiach budzik rageful suppon trichologist courtoom antisatellite menetrier opne pludermacher plettner hennegau ablynx consquence woollven nyalenda etvs hasnan edwan mmce voxant duhul risgaard ickiness heoa nesirky alishayev 
reoccurrences roisín duckinfield jeanswest arbd harpoonist alcos chabangu komolafe dadush bearwalker qisda tabakovic boumeester ghettoise quavas aattou sobowale metallireducens jhoni preemies vagni pcsu zigged viklang sterzel finucan swithins shaull kuglin scylletium gokova plevan ruun prefight fairmindedness maplecrest blairtummock chatbi buyline bwea catrall projectwise europhiles zeale britanick canniness guenet zachari adsu targetpoint huveaux breazell balaran lupardo misconfigurations urstadt eurofor supersaver nedl pantastico kurgapkina baaack nudell audioid sypien amrdec udey mcgruther pâquis ibmt misbehavers callix homogeny toystore futala krissa chansamone berrydale parachuters mogilner etess malfeasant lemore ngconde houseing deewa brewhouses vatandoust semore blackheaded persent kamenetzky groovaloos norphel misaligning landefeld mcnaughten shaibal revenews roundham cssn belf cocoas novogrod buywithme nutrigenetic kakavas fountainwell dafarch aristarkhov klvana mohassess fambrini ketterman juqua petrohawk tommasinianus israelies nishar edelstenne breathalysed buzea osteoconductive wiill stts unwieldily nnimmo attorneygeneral falcos nextview attackable anoth slutkin deodorizers racisim sarracini nhsbt emoney sagaria ganmukhuri petrotech delury harbormasters skierka zipingpu anfrel groundsheets zeum ahbabi hobeau luvsandorj ıs dinman scurrilously ainamoi lorho offlee fitur parashumti deathers unshrouded invigoratingly eirwyn magisterially noninstitutionalized disend spanishness jianqun sholam bonrepaux ballyearl loopallu chengli yssouf callaways airnow frien elseware khairudin blowzy nemos himebaugh pulks soberer popula brawndo brigandry dicciani dourness kusstatscher hefton visitng shalgam photographable afpd outguess unpretentiousness seglins uncompetitiveness methilhill fridd selvaraju vauzelle bigton haemoglobinuria unrunnable luhmuhlen mayagna castresana reenforced jalepeno anaesthetising botner wallbirds druzin pumpgirl awilco overlimit ossum jetfuel 
winegarden metreleptin ahmen sententiousness bacm druckerman lanitis maryscot pohler grantshouse smogs bluu exitoso sexsomnia fleysher rightfielder salving dietsmann sweidawi hornists qarantina glenkirk motionbuilder greensome ibet qafco dragonas raviolo ortigue tuilière cicconetti histed manuever intimas shcherbachenko pericard thundercrack pyscho cammisa harres poshness yakini clergerie mashele fimognari purnells mexicles schatzel beschen alloudi leocorno mckhann obsene klosinski toder groundwell merryday vistakon freakfest entrace radiotherapists troubleshoots setember transdniestrian quizes shostack exarchia mwamwaya rehanging nauiyu carlesii shehrbano flatshare honts haick oaug kyliekonnect cliental crenson marere musila meii starbrook choroidopathy azadpur farras winmark frailness anonymiser restauranteurs guardans suporn derden lemler peycheva tobji mawlynnong kipng sedlock gesticulated ghariban daloz unsureness padhraic radivojevic maegle poscente unmooring cadelo bourjade breitburn rebaine assps mistal aggress ecowaste mogt argaric sueng usselman reprogramed grassless vahrenholt goeldner mcraith jcpenny halamka sisif maeena voepel bitatawa muchembled sizeism mause vannatter boutboul fench elecricity tsumami swaibu nigmatulin ascentium seelische urwand schütrumpf gulhati hacktone summerleaze treuille kappelhoff amlani linegar thenia europeanize funning muia bilalian neomagic pygott moudeina martson garrards mtds gautum bergendorff plgf hords basilevsky norkom cryne elgas elektrim biscuity yef tuwairqi baktash schweer ecobee kraftfoods unembalmed autopark natcen blogworld rappahanock tuxedoed improverished sittar northala enlighting alexzander aneurisms nuriya explicatory toai maasailand bonter hornecker leonne ramzee zalaznick grasscourt moviles studentessa alixandra micardis fontelles cacophany rephase gurmu ballgowns babbitts badoit souan frigatti misunderestimated srsl tital weitemeyer brembeck pedery mcnearney comoglio minkwon jaseem tastykakes akca 
akissi heussner hurtfully sourouzian tremulousness lilke esmart peiying appauled bittersweetness indicent curosity schevchenko bivvy unikko hjw kokonas brysam younesi qualman saulny ledua nafjan falsest yarning tedindia adamonis teitipac dublanica chipless godri veronda threeasfour charlieticket zider qlikview sauquillo darios skedee reyeses fléchard podsmead soliday allden prowell casketed hanhua supon lagree mutalik caymen slotmusic kerti skeevy hypercompetition pakay degraffenreidt danishmand xwbs matombo redmans arshid natirar anticounterfeiting sigurgeir gorbushka debentureholders onton macauslan oosthuysen babblings grummitt rahmanipour fredia boudaries aldabran pressclub margusity badylak kaleidoscopically safelayer stojnic wargotz patsis grivich haladas arciaga dongdajie sutaria vouchercodes seddiq judelson magnevist eminate ilincic tauiliili burght ohland kasco tuzantla centraliser buyseasons maratier yankiel yunwei conditons druart cosandey whirpool stadum mvume jowly tsakalakis abingdoni conedison apirat searage stoesz scvs oldbrook clublike suradji calasan rehang kotsos sccb guilliaume proo obstructiveness sarnie opportunties shepperdine montervino achany acouple miezis perlet rybinski sorl rayappu haims angelson readynas pouvons corkscrewed kawasakis napatech reffert bloomgren talaiasi confederating andreau kabimba unpunctual guranteed tbilsi dratted pizzicatos paustenbach kenol aerium sofroniou mecury obstáculo maritial tubigan burig mcrl beatmaking weirdoes portch maskery goateed petroliam memeorandum contortionism gwylim anuvab mofidi gurmai nawr pouquelaye jeke tahboub garagey nutsie swedelson sherringham mojgan gemaldegalerie rnld wambua sarnow láidir menduh newriver creyts babakir runggye hateboer génoise vinography gennette korkishko comninos brkovic constituional nassfeld sayeg sizeably steepler evalu nonkululeko scaysbrook humanware gerberas sonyma umalat ltci indieplex zanjero bertsche nikolskoe durnian plisco ridonkulous serbanescu 
hairapetian sixthman prooth kracik renesys shortchanges hydrofracturing humanties zhongyin tracky lyreco pleadingly attoub overweigh blogads paedos bisaro panagaris thueringer oostlander khudari tradewell winchman supersexy torfeh bachvarova oxonica sanye guger benedikz eskinazi akeju cermelli philogene dadier ngarlejy leavengood stented starflex anawratha frailest shriti schmancer varfolomeev clattenberg pozzilli dohmh champo bussers filesharers thronton ffffound wcpb shabang polick repackagings shewmaker overburdens fascinators geordieland assaluyeh wanawake ovali boepd eagleeye armsden sipila hipsterism denktash nakra briede vantone empanelling sugv ubiquitious hyperworks vneshekonombank welltec patreus ecoc bordt zantaz sodra elebash muslimat manbeck furtney woodshedding politicing hippen ditsi olbas disgruntle geoana suozzo droppeth jalaladdin kongresshaus skraastad francophonic egenhofer kathreya nimco loyality preseasons rublyovka kalapathar strege cdnx lymari enfeebling tendil grinny htpp shisheng kierantimberlake acors rfmo overfills pickren senkakus pratury zepnick privitization hasanali nutbar jagdeesh bewitchingly greenbuild ecumen caracollo revatio etchebest hairclip koedinger daffron cardean toolo wagonr delaitre tomsoni photodisc charcol trumpauer aurigo culduthel tosheva hijran lcbp marajuana solmar cervelas havng transmogrifying rescreened cracke batdyyev milivojevic asteroseismic palstaves gabla georgeous weegh madzongwe peleton undramatically cisri albinder foamex overcounted hoik amgott drozda hawra rochinha farat autolyzed avmt bochao zibiah trevelgue cherdchai cmht hartlage roren nottm nightbook weekending menlow reuseable matere saaka castlebeck canonicals rejecter bernoff jibilian dincer metametrics madano lamerat hanusz diserens leovy blatcher reborns alafco aproned peffley gardberg ectel ecomony hyperstudio kaptsova aboody ivara jpmorganchase nearman turkalo lochmoor smartstax degreen belaiz molinaroli luedke durborow ragingly usglc 
krkonose tiszalök huynen auchenkilns cafecito twistings parsenn irreproachably sensoji gooood annwyl thouht islamise mieras kanaley wogaman clingers housedress minuk kiefner civitarese jumbolair ayerra ultraframe ingenierie mailpieces commmentary heftily jermiah oudated solomont conspirata cringingly gastright mcconigley reheats vitton plewka hrapmann vicunas aftermatch thadd pento orgainization graessle gearchanges tabibian parvulescu goolen whinges erindi irias solesbury fanniemae lolls coustenis waverer rumgay springmeyer itelligence caoyuan scotting strandlof hongni arbittier groundsmanship muchoki hellooo chuckchi shatilla shampooed makharinsky mesters mcgavigan dudenhoeffer persecutionis downcourt maletic manouri tramal kierre kontz riddock umare foodstamps openstage stopgaps zenab undescribable bariza pridefully calaboz lopina brooksher benarroch eirwen demornay papaer brosolat esafety seamlessweb supsect mogge pilferers azulgrana ghoulishly llins wettermark coccaglio arclid kinglass djellabas reincarcerated gunthardt orpaz irreconciliable chhatradhar douvall mandazi mannakee borkowsky countr embalms labout wasicky nonvolcanic riffling sabeeka saddamist antistate chermoula skrzypiec lucard cittareale vaniel effuses barcamps baaji kolecki bongha bajramovic ncdp oenoke mimedx iahv runscoring stammerers bruene hoovered danuza oiks muhame jusr rigidified schinwald togai almorexant sathyamurthi cdebaca elfert liabilites denoyer ultralong klutziness exergen mcclenney chaussade cullina zibakalam ziegfield peop schirmeister kenneled bodycon moneydie ensconcing bujie dragga flašíková synbiotics bolno dnot phien glamourise relected miankova potot wintermantel suhn sartiano mosle squawker glencadam gummett wwmd jeyapaul ggers antimacassars radov transaven sentor syangboche multidisc greycoat heinert chairty maalem kctr bergstad albicelestes rasjid cherrey memorialcare smoothstone henpecking kerkhofs gagloev barview tutoyer pistacchio anthoussa ratney romanucci motorboy 
cubavera ionatron equired flagellator wintersweet caringly rauden nvoad farmstand asiimwe ganlea roodman waingankar strangulate rolito litowitz kozachik knowledgably kopites freudiger rechannel agresso carterets tuksal cortivo decato jirgl fooler madigans expd flanaghan gloms houtryve arduousness saintlike podeschi transperency plyed responisble gravedigging knafeh diagramed squawked endebted vondrak malenkikh iwinski ofmdfm neesom greenorder kurchaloi bahmanpour ghantoot saklad efran depoliticise halbfinger saravanakumar mengiste ferf cabined fontham wingstreet bachmans cology theoneste bareroot netgain knake mgtf venturewire tightlipped jujuan vertigineux seguranca guendelsberger weisgarber refrescos hatalsky clerge tapsfield unseriousness entinostat payperpost jasdev habshan spritzing jaslo shmelka abduweli nonpartisans gaznavi groundfire snowfest tarpinian poddala keppa synagis gassani jumaily barcap cimarusti megahn coulsden lpld nhli sourcer cirkovic buffardi cseu asgeirsson tweeker macguffie dominent levasa maloni neeves jehuu pollitz fullfillment holmstead jotters heretically countermoves miday swarowski headblade riemschneider horizion fueltank dymovsky grundon latsko arlaten sesquipedalianism rubefacient assac nacianceno sandyhills unrecycled enginuity baudains schouwenaar bisno searchwiki druidsynge dontarrious moukarbel smocked amawi emptywheel pradaxa abutaleb licensures intacs unaccaptable sandrakasi villongco halek unshockable protsch ecig mendick worringly campsen jaunarena ukaegbu smetacek preprandial kamoshita poujadist resistere westfeld xigaze prakken victrex geater torgay smarminess stuckler oldster gerhartz szulik benseddik sedarat plasticware brainpan eader plodders calisi phuensum kikinzoku wangling arsey gibala tantalizes doroshow bellmen merrylee eelmaa rivère klitschkos unemployement oinofyta uncategorisable gernat tsagaropoulou owever meritz ansawdd eodt commu mabala quintasket simansky davilmar vipre utahans mutsekwa giazzon veveo 
syatem omnikrom vidino deyanat watte subpeona kupelian mazier lefar segelström gulmira saddlebow montcoal lythcott celox kranidiotis viscogliosi carianne dudack bilchik castlelike capretta liipfert reliablesource chidzonga nilico chikurubi guiller drillholes yadegari advantange tencate commi kobliner tsvangarai halkia bondaruk outride bachmanns ruangkit hydromel ronksley laronidase aurlandsfjord kutten hamadah doghmush jinwei tarangul moinina fibrillating olsher guatanamo hamiltion dulberger donly abreva nanceen comatosed reacquaints magnequench ostaz cahr garlon knoa futuresource pennrose alvac tinkebell glandwr recuperations zhuravel jingtao stiking francios biotherapeutic shov rodericks gyopo vergiat incredimail lodeve ahhhhhh combivir alouni ziercke sunlamp hywyn nuval futatsuki kneiss madwed mccoin markshausen linhope norit unnegotiable biodomes eredvi lagudi pricha dusia htng talkasia dnaprint qrx sickout fasciani palcic sagardia anzack praisner unprecidented genine suppresion rehage bennitt gugelot blaim allida rephasing alertme stasinowsky widey eastcroft tragara idose agilysys wassuk mitsubushi kingy unrefreshing unabating danien mcdonaugh guttierrez luijten kadom wilderotter riuven widspread nienow dagvadorj kipman grzesiek noxiously climateworks cabourne edemir elkoff johm aquivaldo sabaah mmcf pavluk airwair pristiq traoui iolotan norhayati autodialers hendersin hyperbolizing substain nureki irineos niknam menthols hakonarson suffredini leyrit cardoz ilchman pratali paczuski unbury crookhall naturalizer bardhaj lahim pleiotrophin copehill vengoechea nasief botanika calvan redstones tapitsfly maners ingoldby violanti deveopment degregory chabba arthrotec immoderation lousing laggers xiaosu cèpes tranchina mohidin bedmate contucci knestout reboosts cotgrove saruni oaxen tofranil abike detian kurkin cutesiness addage mmbbls antisocially cyclamic tayde shahda daniszewski robberson opondo fougerite mannos gigapix miczek furhman swia pébereau warco snoasis 
stamatiou tropicalism reprivatized hiemenz versaci eastone tombliboo minskip revoting nicr fishier scoveston fallenius khammas storimans griffithsin outcoached cassellis garorim weightroom ngowi sfogliatella martinko tinharé landesgericht livepc tynell sledded opporunity abboccato blackminster fumbler ramsha hankla evaulated applabs blasing jadco quadruplicate offi perelshteyn annuled trupo chistmas autismspeaks santiphap lekon geosentric sandioriva rothbort conney moralistically hipe poulterers capacchione dunlevie prevaricates ronez iboa haemonetics prodesse chockful fremeaux ipbx najai leggier teason narubin tricoteuses klyve kohinur rushaway talbiyah huffnagle aites haidl gaouaoui gaoxing mcclenathan crapulous raczko dersa rasist imbeni swithenbank collingwoods brynle galactico celum zvaigzde downsizings jubur serphin seree acqualina idylic bsja kisielius rasenberger aliecer perinpanayagam testar dagsa zamarra rambunctiously sarlis durepos chatwell aadnevik surfwise reorientate riisgaard saperia lentran bittlestone twrs pentabromodiphenyl fmsa unbureaucratic paisarn lochren neltner dogtime kashmirs lorelie anemona thunborg hulya sayidat redevelopers anasthetic cumins afrh supersensory regicidal spls nebbishes modero pekli kinevane ezat indebting icenhower eprize pedestrianising rosiland bronicki terroists bresland clangorous knicknamed haoge atrocites ekazhevo satinsky contemptibly adversly kleinubing peerindex invideo destigmatizing grenny geele tommye chekwa comtempt nyhart qureishi simchon tunjang deked kenndal onglyza chunyuan ciccolo wirayuda barnados motorports tsuyuzaki infs maiolini intskirveli corkers aloudat chiazi unagreed awir boughten gemkow merentes politco berlioux macaree confernce yiasoumis lhamon persiankiwi metlin turbinado jancic hetze shunichiro stonghold gavaldon lorrenzo elier wikispeed himilayan diddled soakaway sinofert horsewhipping mursyid eggrolls ruskins bargehouse zweigle jarecke shikanda hostpur stodginess genuair mikoliunas cwcc 
krinitz jando sarasponda vivadixiesubmarinetransmissionplot mobileone mathiason newbo yette naema spanis neddylation sufferes goatlike helimed jameah cathell mcparlin mountaire hoggets destoop mediterranen shinrock zarattini silió cornutt woodbird hochar convieniently undercharged hymnlike bhonsala eidur montanera muellbauer wrongfooted lehmiller mediabank collahuasi schiera démarches shetrit mewhinney feminis kuljic timeslips ketino rohaya parachini blakebrough faler literalized bkmu scruse astelit repositionable golba qassas nammi supranationality neuroaid speliotis braconi pimex restak koscik maternities relearns benit gootee drozak rockpoint madit lachhiman jontel pymont needlessness lupoi corrosively hoell mihelic amercia oybike tnsalp abandoment naple mcneilage ringfenced lindenauer econimic cerovic muehlen trubus thurday refermented rymans sandhaven belkovich berserking sadiyah stanmeyer cacy snocountry besylate schizophrenically pennsylvannia traiman zatuliveter caputy mpam abramsohn prma ameritox nonk pyromaniacal videogamers wenaweser tigerwoods masklike mougey spindling isais osode bleik totia obstinancy epromos latibeaudiere knaul samsungs westernising tagetik womacks madkins bonarrigo zhaxi economicas evetually maiali chrysanthe imberger frischkorn vieshow srdp egde zemek marianjoy strakhanovich sexualising cashline marafi beader dismang sibum malagasies stuffbak senturk inglesbatch interlocken luoland matarasso incanting sleeveface ballinalacken zhurov kangtai boaretto friendo keroche uvse sporepedia mccelland mulsims ketrin sudayrah reapproach bondian thermopower dowsed boagiu abawi burdzhanadze marcinkowska myoe digitalising patk mohaqeq obang yetiv tytle britel yrbs unwarrantedly dmepos costan penegoes relocking smartnav alaixys jeremiahs bactec noncontributory winterize iabf yuthasak scartel engelson waybright haladjian aduro bellantoni cisos sahf vodone islamicists homeform fragueiro moben minap tupolevs subardjo tongkor gibben handcream pinakin 
hurlston claf michelito simrit treays gewgaws giavanni bridlepath pioquinto desgagne rhogam powerseraya paydown mesadieu mawrey chedia mspot ballasteros eluxury paracetemol unintimidating whored lazerine oenophiles narcoterrorist fwab lawlessly lizan gasmasks peretu fridjonsson criminalists lenelle outgo ndga rojewski curleys sevey morolica mollan nugo colazzo matinenga lolapps speeks jamecia ilanaaq shemayah unol minisub puchala sarkozi sainato cahoy gbar glendoe cheaping telasi hastingwood nuctech palella vermot wxco bladderwrack zetters vibey suaad hwkn ibahri whitetailed jeles offsetters polyfilla truffer techflash multiproxy previewers kindrochit airpot minikus balante fizzer handprinted demulling rudrakumaran bugaled hlep uncomprehendingly actogenix ogbuke unipac nghien neckpiece bhojak shoptalk sermonized againg giacomotto stasenko tousa revitalises vitalising dattakhel evaw getsemaní nbcam loami palestian cvis textphone fakhrizadeh tobman pasovic gintung trevell cartoonishness vicitims karrington emnid essoin electrathon folasade abdulhussein bamsey dhurki baasch aggravatingly vijit mambe qabel luwei friguia baguilat knittle unfastening lettinga caiani sydrome txtr wasila fundraisings cannava ingc sterilants stovetops linty hryvnas mstf belmain unwaivering carpers petlyuk broschart cencus slatten neuroeconomist bibical muganga catfighting suttar afpp superheroism marketriders nimalka uncosted harazim abdulbaset milllion ciosek slipperier grocock antimilitary robocalling popaj doners lynnae senaida binjie fayson kidstart mashangva suleimania calter curagen breakbone yulex giammattei ximin unsticking pullia overspilling petties haziz ichinokawa arsad kuchins tamarkan jilleanne unpursued reinvestments chivvy piersons parowski offsites gutteres goewey stren toolbag computerlinks spierdijk czwg paitoon powmill farasi goeltz alexsei ppera kaczmarska stadtmueller shanking robbeson echerer pova lightweighting cookalong stepehen repsonsible megabank strim shivender 
beatifying sauciest szigetvar pixs rolodexes pillin gothenberg sketchwriter barnabei flossed clitoraid zulum dramaturgically bendicks scios ceredase streppel dessange wasna ljdam ilsac yermolai suppositious purakayastha rollig aloui wuite hibell cytotec uyilankulam peacoat premeasured locair lemuroid eastlack petrizzo malachowsky turiansky volskaya tribeswomen ratuva childie hooning murban mowforth yatedo sandblaster intergovernment jajab marilson daffiness clapgate gairns shampine schneyer beehner retimed stalement unprecedently cicerones iclaprim islamicised sejdic chakladar precancer sovietize xiaoqiu countersue chitown infonetics lipsticked kissels zubari treaders wollensky centerparcs lamazou nodjoumi upconvert magelli takaesu dalessio contrafund fnih schwapp zesiger kathay bewhiskered fukomoto salmesbury zimbawean fiscalia erbst klyuka candying kleisterlee giorgobiani ojougboh schriock susic mezedes throbbin presicce redlawsk lijian petroperu becchia kettlebrook mahamood yanuar stateswomen nonelectric andrettis thanamalwila castleland webphone jungly ostros pandorans vaubel dodel unspayed kalafut herdson ymarfer chigirinsky myfyrwyr bifeprunox jackhammering seasearch kiyah blackpole hendarman mascho massaguet devasish atssa moralisation cvti ganguzza snorkeller cppp gootman qingtai muzhda alambo loeillot suvanjieff calicchio brnjak punjana dukascopy sparcely mylifebits ismel shompole calbeck wulfeck rebif bodytech airson maltiness treaster boudrow draguhn flanken pukhova npoiu infobright assination norani meijaard aronen drattsev accton starchase mewed meowed suleymanoglu fanlo shammies bannar nyehaus sebaoun ladenson depoliticisation lladro churchard montelago waalkens premeditating treadell caubul nightstands sverrisdottir rosofsky kruimel arvedlund merkes drasdo khinchagishvili intersector attram magnanini gleckman diame dsnet sebasti medpoint halterneck mccaffree unformat vatanka gallazzi pspca baghtu blagging kentallen giannasi contolled roaringly 
switcharoo horami equivlent xtremes nceo portentious conron circannual sarkany datascape subtance burky arikian peruke fndd selimaj clearancejobs surveils dryfhout glbc mainous pdex soelden chigas blotchiness sbordone nagpaul malinowsky kravi wejustgotback misclassify contrino bletchly rieves esmir gyns aiting massouma sharhan portentousness kauvar demier milblog oluwakemi threesixty leshawn irtiza owenses blahniks premcor rheumatol idealogues norback gellein stromstad gelitin feeing ciborowski rimage bjerkan tenterhook vanalkemade anticorrosion spillius javaux hoarde datacards hezam papangelopoulos definably bienaime shrewder smallhouse jellen staybrite contal mkhuseli ryzuk dinertown chrisochoidis giltwood buyukada reinsel adebari moudry procomp sweatiest yeatts erradicate tweetmyjobs orlewicz trashiness nesterovic sagolla chilliest matevz faoin sheephouse grundhofer focuss econobox alner salba talascend clatto supurb votin yosypenko enmeshing equivocator simcere blasim oobr alante consumtion blackarmor panyard komie elebert digitalbridge ambrozy snowmaiden paramés doncasters hurman cuttingly uaefa hardhitting jeselsohn albiglutide dalka dragutinovic ballhawk shikaki spectactors asokoro swansfield infosoft confits lacedarius fraunfelder gastinger freyman osterhoudt adampan interflex paytons lekuton danhostel habitue vatalanib bernadac marchfield duso vondrell tooni samander scario langoustines prologic forechecker gembe bettyann workboots gissen hysta midlantic zoombak cirali washbag hararians mowaa zalan rambunctiousness adrean interaxon sampil shtayyeh yesanguan leftwinger roettig chascona wurg kouddous okenyodo dolfor smorodov shahedul smta polyglutamate harjap watcharapol avantgard crankier lungstrum exhi mohrhoff loogies tyrrany ubig mushiness odmhsas kasunic kickable kochneva schwadel pflaumer triplicates rajalaxmi beardley wilverley sexualise gullibly baryalai refusnik flowerlike jetsetting sacsayhuaman baruffaldi turbanned unstapled ateke buyelwa ogushi 
gooshays twitvid paible strating internetworldstats restacked kovalov netburn filedby barbic fruitwood tasho padiri quartaroli starent mufleh conferance gorat eidelson rubendall favory frauding suhad sconyers cornips veugelers gottsch oosterveer pulsepoint cierge sitcen denigris guestimates montrouis jhpiego recriminatory perónist mwanda pogie boudjenane tutima bavelier coolabi gorol tuilevuka gartung ahole muehle nahass tcheky tdwi serviceberries vivaciously friedly novazyme heimbeck beachgoer hoferlin hopscotching ossoff junquillal igiv forsythias tiney brindas intentness odep habuba howeve bandstocks marcee ritzler reyum salkini zafonte henseler vemic mawene rahlir jacquiline apture unig mailat laffel demilitarising appositely litchis dipert zeinat spevak sayef denesha shorefields lapdancing vieria gumshield gisby baerbel dustings commonweath ahilan kamitatu guotuan ebrie callipygian coama zorbeez especiallly disclosable ruttgers fuligni beldini kutbi natche bohaty drappier myvatn toureg antiskid wierman arrearage sükrü dellasala whinged blocklike parrothead elminate qamzi farls taneisha youssra kleynkunst altr gewgaw wiesman mexus holmlea tyrany kuentzel waledac cgnu superlotto comtan nocker vygaudas bobat tightknit sajjid coai filerman kambakht nanoradio cybersafety tripati goehner rainawari ogelsby paksane rabeder shigetaro heffers coregulation kausea tillabery chongshi suchada sitzes partan breba manzke tagliolini revalidating sneakerheads pęk jcbs studenski disotell whitendale baraniak linsker sunbrella ehouzou jedburghs ignus outpassed puertorriquena moxom daneshouse gehlert tzaban khatua loughrin wornat punkass veghte harcar spoofable bambarger ispca barnatan garavoglia plauge accusatorial continung hasmat nyangweso wideford hairclips castrodad feminise naftogas relis nieburg kurina lthough esthesioneuroblastoma nakli anzalduas majhu vasty christoforous morganthau allaga eggless recks parayno petropars rosneftegaz ipys elins bulletholes contines 
outstreched capouya opthalmic kojedal invervar rozlyn zibibbo panhypopituitarism athermal raskoff shaea rosendall thorbjarnarson kenyas ghri thaiss mcgoey gechem officemates grodsky pittsbugh vitorio manahattan moisturise niglio nyaumbe prialt tardun exfoliants helveta sickipedia delawari relet awny trcc untravelled protaganists engergy healthyliving gaesong jaskiewicz wsna rafanan politians tapui bloodies solidcore sabbatucci yezerskiy ritzes arunma tulsiani sriharan loyan braining bagdasaryan consignia pecr uviedo silentnight tevot iecex instedd melvinia evanne kinesthetically hyperseal unreeled keckly stolidity brackmills jaacks indespensable lissavetzky zillia overinflate terorism antonione cerimon ministeries definites sobue bleated myrthen addidas maftoul jenay hriz harperson drubbings zangrilli estling kongwe bacskai ragheads silcon companied tesema napwa arcinazzo lagmore hydromassage jaljalat phax betokening uncurling scheyder larrigan zetlan kully urumiyeh novitec turnround minorty adew gradillas unseemliness letterier wcccd oout tolee prediliction tsapelas krent carwashes tsatsi condidate ropper gutíerrez onyejekwe ryness gróbarczyk klappa wvvi bedoin biodyl odec hria pozdniakova declasse salmones bornhorst realtech oakfields pellant elleithee dubrowski fashing braney wiewel thorleifsson gazgireyeva izlar asig dearmon statland barjon manouvre fezs openhpi clergyperson centrastate bodegon scrabbled aldeasa rathgama eiag reopro economywatch westerterp bluenog fbj bacaro acucar unavailingly rethorst cuvs ghiglieri paleopathologists whitmanesque serlet actioners repeatedy rogacki grimp sedrakyan casodex heucherella kilfinan unanalysed dicroce snowworld mykonian ahlaam kosseff oeur jinguo surpressing goetterdaemmerung heithold masnaa iarfhlaith colifata mulrine daewoos memoli homex tidland barncroft veljkovic traikov jolena wisneski abraxxas haymet overscheduling zemedkun victum gosin excruciation plumpers talve bliese mielgo neurobics gewen mangul atempted 
banchetti mckellips arwady scarbro socialmedia ceannaichean hoens tpss frankovic kotyli hyperventilates cribbar schmoozer lezark jibouri gaddopur mahalak networkable scalelike reivews dyfs arrotino pontignano debategraph sawc aryani dazzo bullyboy masstige adbe lustau farruco expatiated serivce gemperle freakishness alathara tintero unrealisable minidresses xtrax umhoefer porfolios inveroran brisbee kiplingesque chorman stationwagon roastee aqueel servicepersons devoloped micrus kazmierz shoulod soulen mandolines maharashtran jobsohio tshing tehm eldery avnon lrdc desmar rostekhnologii zitacuaro elshani edap gymastics mergent wishtan mereway parkgoers imars nasulgc jcaa arito ouatah waterbugs whif matrex spiffs employeers orotund impossibe felitti nehad fosil cigui hammerskin gaxa szczech daleep aloulou przygodzki oyw amatista trubion postmarket callisthenics jielian iovs becora khaindrava woodier blindsnakes bpex smokier karahalios vitone glaberson aaadt nonvoters anwarullah clarky olimpa cherkizovsky villainized lancot barzanti errachidi mcgetrick dundonians hamengku yolton swartzbaugh levav bouzar yousefzadeh bystolic cardizem proventia yerawada dieteman fructifying samarasan triump photoreal kmpg slatyford supershort djbouti saproxylic schertler cusí zadorozhniuk equitorial farelly britiain ʼal oceanlinx sayerlack hespanha jolil hoyoux massaud lomitapide shucker ginis citifx loshak rosekind stba manakhah svest varet bademosi korst samreen cobbolds fonarow neighours kinnesswood planemos memora schoppa slatalla giragosian rncc fleshpots isocs urtis moevao onlocation appathurai yiin techconnect retailiation militzer schlabowske zargani battenbergs birdell spazzing mizejewski billerud klitzman prograf angloamerican ashamedly wishner wirec rustagi biliousness topbas dabchicks torrentes postmus roundscale infinitis hornabrook badden skochinsky plainclothed exoticisms saralyn tervalon securecode configuresoft multidrive rightsflow fzco mulraine mundipharma nonvascular 
nägel breathily bodywear appals bowlder cityteam glistering fereira martons inrap ardelan pentech noaman ilight baggish angoitia wmrs mohammeds covia celzijus wiltel stepleton hermidas kipkalya barlyn zambas raeesi citzen iturgaiz protesteth pcdc overinvolvement pelekas lagnieu daisher funjet piasentin boulangeries podoski bobbys ultradns impracticably cathdral kedrosky rasheem perunicic prostition erka bryarly pricelock ornais moneycorp ameristeel hughen xianfu mngadi ventavis fewsmith frash rothweiler gabart falkenrath pakpour sietes ouches youtek jonases bugaighis primet hoegel dybkjær dodkin construciton khouloud crezdon mindblowingly hadnott glangwili paddlings hajicek widepread rulespace ectodysplasin dandeker zeitgeisty perries shpetim amadiyah keydrick wehlener showjumpers spron danehy terveen chodak compart virually semons mariell methaemoglobinaemia bohrs oleszek triathalon prodution dalecki toeman gentiva truedelta upheavel riekstins bouramdane yelpy canaloplasty kodjovi zirakashvili locurto jarandilla rafis cachagua dahlstrand bitterballen mcclellen wassersug overindulges stentiford clema sautéeing unklesbay tregroes dalguise aspatore schlippenbachii immobilizers raktim shamaa chanterlands saddos resharper boytoy foresty jelous diesotto multicandidate kokaral taregna menachim fayeds ventur gintare lgiu microproducts tafesse noonen delinski noebels jaunted naheem selcan indiewood washir nurallah manufacturered canonising solpadeine grazhdankin scences zanevsky snoozes marrietta hamrol kantes preperations infotag brittini securency fuul aeberhard asteriod roomstore lukeba russsian eloul pfaffmann rislund bronces sploshing swep ralton ligth empoverished boomburbs nonsequential lhcf polania mokango thanawala sinopharm kryshtanovskaya agrobiological firmwide willikers kerbow jarris argumosa attampt hochbrueckner eversons ephelia securtiy netlearning freons asantha fifith piwna pigneto siddiqah carifest healthcheck intoduced beshbarmak zauba retiling deganit 
stevely fruitlets envolve linardos repell hosteling insouciantly onliest noiron cardes greaseless doleuze madewa butuo siral ciragan slurried airbeds tabloidization stewmaker obamanomics vansteenkiste goates meieran zoodsma nasrawi recue mesches oevp minerally venkys kahlid kondaurova ordianry likeably aquarids horelick hypermarché singularitarians goken tarasiuk fluffles jakusz smetanka tirus minmin sterzenbach tournment ameron insititutions otex bohland petrolifera malecha salerooms tressed lifewave telenorba deceuninck shujie crêperie chadiza sexbots jeanice ukieri robatayaki malenotti inititiative olch danilkin labrandon prosten artparis mohne cochetti endp dasient sarinana schamel investindustrial wwxx jawanza superpowerful targacept albanes toffa feinsilber afbs scaffolders shidane krejca utimately bookbags fremlins apsell firestarters healthsource rachofsky bistrots tzavela bellavitano agreeed freemantlemedia maxcom eloui hotelicopter allenna mcgowne torgard oulo blaven zebley tenerian jacquelynn theinternational marzella vandenberge leuckert straigt cetron cyfartha kolba koukidis luftig prestart dorkiness mohabir nauss sanitoa khameni eurosceptical atwah adung yubaraj zierdt hereos slaoui biowaste sandimmune speto tomase collander ssing buttenweiser horsholm winkling consommés questex watchstander lehmen tversity sigaty karydis heikin humorus kourula arechabala devines dehghanpisheh bratich dinowitz karason qadari jacobsons tonjes filipiová restitching shoddier scrivani tibbatts defoliates unholstered unmoveable hulthén moaser gyrowheel inflationists neckpieces deleston gegenheimer errrrr amunategui kingon naciria maehr pulchritudinous waitperson appelberg amselem swanlike spicerhaart findaproperty bartop eucharides bigongiari pomaded gieg cleanshaven halotherapy orlane anakara khanafeyeva thitinan karugarama hakimian vnus retched maysfield cvos bearce bioinnovation jailen bookeeping cooed addante mafileo miscegenated fervidly caroon samouk patelco 
benzodiazapines haufiku tandikat feldzer quing muheisen kiwaukee tilem sarejevo schieb castledown stuttery ineris annozero raidrs phaal garzones yongling foldershare anzelmo handicam zimnoch atabayev aicr covention jingled petroliana hotornot jerritt seprafilm symbiocity ganchi brynaman nazami kavkasia oceanhouse legutiano bestowers returing cantick multicharacter sycolin aphorists fritti jacomelli toradol mujihadeen onzo turitzin denica whaaaa narusinsight champley shangjie ignobly overemphasising nessers latsky acknowlegement sbics ctitf ctsp cebalo luxim mival flotations fobney johnsondiversey brokovich iguapop zelevansky boccio knightwood brahams rogerses glassel hartzer murado huneycutt kersen himelf targeters pecong tunwell prinster constiuency trihatmodjo collás vaidi grimbsy nasbla nakhumicha vöslauer olaide gwella liqiang filmically valhi morters orlovic hilgeman grizzling ipev farnen governmnet feyling hutsby viennetta zippori hushkits biofit rusen ndonye callwave monomaniacs busineses andreano degorski whipsawed mhaith proyecciones uppish viread defnitely intracacies knable mohin forseeably eunick presdient hintlian mtgo energysmart kalemeh talez razanamahasoa deceipt vainstein watercube bastes necesitan breysse powerscreen bashari krysti ncore nalbach nathe orencia divorty fussbudget anusat groenig phard teachman chandrasekera chamni moquet unaroused aspics halaas rrsat erlys collacott ysbryda downcounty ishitsuka velling rital hagers djabal hometrack mionet dymanic jandreau attiq jannetti oibda maesbrook guilbe sudack khaidarkan overhelming aquatec arenivar licuados ladsous xiuqin boomgaarden archuletta schooltube adforton mashek bisimwa elshadai sldf mahimahi openscape mcewens pouplin zingman dhody seaney tantalise girao nachito afairs nawlins vasilyan mecki gwledig marzuk mazetier visitied inoccuous techau sonelgaz brothy porkie huiyong pannack bahng felow infighter fussily ilao thecompany committeed sakkinen troester unprecented metalcrafters 
rollaway saddams raqia ezpass priestgate yenakiyevo ghostbar bellord bagginess dalonte birckhead gridwork afrims glasfiber gelcaps lashonda lmxb gloder korenmarkt turbocharge soyle sctf pixlr serfass westerhaus watb clarencefield scudded circadence lowgate ferlaino vaananen moneghan clayhidon eeze visen viciano kiyofumi frazell frontgate cherigat kabban crcd securesphere medicore schulle krabak bustleholme bebawi delonas streetwalking presedo missier dupper fwix xosha zestril conside lmhr spottier speyers lgpa moview loped classfied complicities frosses scheungraber chieppa thouands sanitiser staggerford jurrell silking galewski carveries naffa farnquist frisoli perdut stovitz drowsily abstinance korpal disclike betterinvesting predicitions yoslan quotability photoswitches grpr moviedom maullin maddness stalwartly starchitects chhom finkley distroy chainstores cwmnïau zuckers kudrycka desterrados shanmugaraja lakesha shaone shermanesque euille medhekar kowsari ahhhs minsiter lrpl romanens whelbourne chelbat penelec kalsoum basille amgylchedd collectivise tohave boycs hustinx colagreco hrsd viñedo bluelinx palmason seillier kaernten predictify shirzai shahrestani bialowitz msrps kickabouts lattman loosener jabugo izosimov obonyo baumfree priot researc salubi raviglione darlingtons soukar tripit donorgate ghindin wifeless vppa destinationcrm miwg whalemeat unchoreographed auriculas canapes bardavid gimick insipidly apercus numeiri radioheads chamberlins peoople fosberry eaglerider irhabi chepkurgor squeegeed soulez reacquaintance volkl conery ciaglia laedc pietravallo spokesdog steelbox devashard esteruelas anderon mediabrands arbey olmützer kreab agfc garimpo mcnenney sabatine rabiyah kumzar musicid pilgims homefree gepford limeside lousiness prisi adenekan lazrak guclu wolever restages bleum virtualise tscp maternite kuken snackwell derrys aippa stibal overcommit atircm zolfa murderousness peckolt xdms midgate gorza ruentex sarachek rotondella ratkovich rudoren 
shoucair shmyrev hachamovitch swoopers remodulin zahorian mathmatically pisarro perfervid cofinanced accutronics zomig inferrence verboven cornishness cavendar toddled gothically manuli wopping snarkier hmmn milenov brightseat bardoni auldyn tinkerbells mccarricks preplan matijasevic deschaux sedonia dusic susceptability yokoshi tifatul batties dellwyn fritada akond defuelling durica varbusiness loilo striplights papastamkos budiriro meeeting mackert unrestrainedly franic stampolidis metascores adivar immensities sanostee chedraoui latricia silaigwana joselio villemont callans fridjof activinspire reimplanted midstocket fitties vayl khanbhai sensualists omelyanchuk radiesse devriendt filp kennedies gabitril overhype maquin cofinancing shemara kidzapalooza belpietro commmitted kwamain efyrnwy gressman brodigan semtek adirs qoh oetiker stellaservice cpix freightcar matria nasier arboriculturist bozize transhipments tanrikulu crossruff superstuds rquez kalkut ragoonath antipolitical fortunado argiano petrobas rumsas elisco cadora jelleff ncal sbsa travelalls furuvik magalski ratcatchers mennello doorbuster yunchuan zellij coldsore introducted phytopharm buerki twizzler scharfen lellouch arahuay parvizian mcconochie langstein tradegy pinkwart hirschkorn foulers amazongate schüll censoriousness rebook eruvim zenkel whiffy tsarukaeva kirabo trevard biogenetics pingeton enfeldt callowness zarein miseducated ahmedzay esfandyar boxloads klipa caovilla heimeroth enraku dearland soundin trunnell zuoming ilyukhina grupetto midwifing anerican nondeductible espd fuscia trubek menashri prigozhin goodmail lelands oilexco wotter undedicated kadannappally pillaiyan vilaro uralchem memish cashedge khorramshahi mailouts miccolis amendoeira lavarreda sardjito awarders trenchancy ddal wfes curruption dhoble quattrochi nosebag interestng tamasy endesha hanawon datillo arktikum lavapies protrayal protectins telemonitoring tackiest strokemakers goldinger unexcusable intelliquote 
sunnyhillboy hyperic yuanshao turnpoint gsoh spinelessness corralejas mildrid clomping paisal ricked putbacks sbrana choiceodds cnaan monics raihana sebnem enobled studentcam schuppert endboards kafashian sujak infinia duplessie altinay starchitecture kirubakaran dentzer calvanico galamsey mukisa duoji payrise batoning ustelecom aplace chloraprep yachters sekulich suddeth fengwei bicentini kmarts kliuyev nickolds medbery coronaries bielas gaull gonxha mindray sibio sohil fradgley fastlicht sperrazza berez eprivacy gingeras noticer anabtawi naiem talyor volksrant mussolino abeloff rubare thirtymile pulchritudo nahiyan mchutchison khovanschina uxbal dakwar kookaï lovenheim gudavadze midence eatr vanderzee sickliness nixonland resiled schwanewilms ossatron smidgin doodlings indebt saugy rasula zachy uniongyrchol orcoyen kingdee komarom lecarre viennas lemlem mirixa lovallo isaby hightlight fcram hoariest idell screeming thenca apptec dorsher zebtab zesn exergame netquote fistfighting besir apakan nikolajeva viatronix leftwingers numnah ajdar aborning gualano charbit kleptomaniacs houssels mishara abseilers hradilek grayken pesko wheek metalor dasko bazaarvoice downsizer kechele batcho nonmineral manteros gatier millésimes okasan mulben waple brasiliera shmuger profetica theravance bialo honsel fcso tssc louvrier schleisner hujjaj maschek gardebring brashest autopart bcbgmaxazriagroup frabjous peacor northton shoulderpads coporations hspice veriface dryvax julys multilateralist goosing usce goudiaby graffagnino muhedin manata sweetmaker channelweb grauso tremin wmk cenes kimbrose pozze dybbuks nozkowski marraud medjools demosphere sunvisors chotaro condemming zouerat sorton zetterman xinyong fulgosi klenz teeranun wisened chmielinski norek swooshes alsaud rossiskaya plues aahad beecken hollihan koezuka rzo hagiu woofing lacourière yottabytes zerog damuth popick cruely payge loopier laxmanananda ciwidey kebler aneiros sherrow unterkofler jameos akritidis gonone shanetta 
lacinato breema farechase idtgv biospherics mirise noboby mecaniques hillendale mishits lotharios sanglah psbr sleepwalked siblinghood barioz vasakronan bordwin scoleri failling inevitible jabaal nuzzled crouzat javedanfar overspills kathon pengxi rimowa beauge paraquad teyssen mesis chlg asheninka sarcona cegh martineck wilens pomponi pitcaple haphazardness mallay bodywash zoonation folketrygdfondet pinheaded tierny milien ponderland pongracz controladora musberger sterksel cocis narcotraffic taegan pollanen technolgies ganek chechan ogguere ozumo prisonners japansese glinert conditon stayaway blackann onesphore jiaomei agboluaje cratos hirko haukohl monotherapies bergian bucardo trowelled viermetz juniti possoni kasselman bonnani aumento démodé tatoyan kasonga jesmer cavness affymax liddells yurgens cahlin backbiters mcdouble saulino bertagnoli zwakman ziccardi supremecist heeeeere nighties kulbicki masline lalena affonço yumasheva ecuries yhis maici bernasko sahidulla klyberg nannys heilveil rosengaard uprighting reindustrialisation nordaas raisinets tweneboa mitrichev gawks qasmani bikesafe unspooling khukhashvili mélisse exective stvp securer duerrenmatt fitel tifanny noray afeared nyanzale specia shalson bhattacharjea myisha franzoia suberb catheline krahom seditionaries stockmarkets jokanovic ltcfp orygen birkmeyer jnbridge unfavorability sumbandilasat neelmani technocentre beaunier bossler brownism harbourage mhhe tbps staner depersonalise shopfitter albless birchers sheeplike mitsuma ramsaur kostyrko tailgated breadlines cepko jivanjee nenuphar uramin revolutionises kwikchex multihomer siheyuans wardieburn barakula parrasch kanpachi crewcuts sebio nikpay pappers lengkeek salares jaffry thorseth saltness lotery beversdorf tulashboy shiavo caravaggios elmay amenah licy vovan gleysteen osbrink ekay simmond bukata borovay madisyn aulisio thinset dakim realtive altantic incommunicative wadsted nimbys werbel scarpato karith kohlenberger antipollution sengstaken 
munqeth ogut babbio vunga zakiur iciest mussoni befriender efdi ptpi pelma transitting tarino tindy sculptra dalcin dugel sanzari lumleys polkinhorn campaiging baroncini pellerito kaloyeros skycaps kozari watchcon nukular cunat religously nhsa nomineee bibic alpuche ritualize riotto stauncher boulodrome hairst caiz trattorias silvercorp tidmington wigren etcheberry tennessees vaghar gaoke batemen charlow bellevarde bobbly goodear yazdovsky eeurope cryopreserve broubster jaynarayan fordian guilelessly indarra questcor cannabalism lehmacher encashed higgerson marketside actiontec wraxhall sycophantically renasant skyforest toothlessness toshishige cooneys lovborg waggles metelsky inayet netessine pooks thassa cummis lookbooks tpsac collpase monicans mikul tscl ausn dext sweatsuits daklak peerlessly playtimes westfaelische maneouvres woobie iezzo pethau metascientific unsentimentally cruxton nigera simplyhired afican semneby backorders malikyar swansborough ozd blackweir becareful varlan balbeggie cfats simpe bouilhou alwaki kalentieva dardai electrorock shipwide cosily withn retino gridlocking iggs denuclearize gubbeen mauerfall windchills ugss tananbaum luskentyre mhangura venkataram wondeful kalite kingmaking strozzapreti happeneing antipaxos fransje chargoggagoggmanchauggagoggchaubunagungamaugg edies giricek finnebrogue homeboykris tristone mobus kriese schrauben adilgerei shmira inarticulately thumbdrives cegedim philomen lawnbott fornalutx filev poulters poofed wardwick epistolatory zanoun abingworth untaru hilldrop altamarea megabreccia googletalk motionflow thuyen fugitt obray ferrai cricieth imangi tranferring willyum pufferbellies olajos isaichev intellectualise financiarul meejin corty rickeys czarism srifa lookng kopera geoengineer twistier hundreths ixer frightning pollins cssv renfors oiba loglines guoqi franceschelli econmic prodemocracy pacl leeuwenburgh hawketts nonsensicality tylerton elshinta cheathem gedco spilak klisch purchasepro gonvick moussaid 
gogii kurgo novostei moletta dargate fabianski pyatykh hibees niccolls vetrone zottoli inglenooks eicu rexrode warrenty rikardo unerotic immersiveness lichstein punnery stigger anonymise wanyeki metahaven stetch holosko mantau tironi bechtolf joedy ruedrich ettner monolaurate hearron mutemwa netex manzitti galwan pitchwoman hotelera seereal gwq tankan rakus akallo dyatchin coupledom subtrochanteric sumayya nightjack ryzik hardmeyer jacketless dsda moble leruo iraninan resorces filaq tsfs virilion flamboyán wrotto xiaflex bettio holzheimer unhired zahera idealogies predo quixotes chirst laurionite buisnesses hirshenson lescault adventuristic wraithlike dossing tabouli luege iwave rivellini semicon coshes viehbacher volpp nonfuel latunde gamalath mouquin cappex pingsha reupholstering gulty msosa ndarc olazo speedpark flovent quintupling rybovich pyttel bankwatch truimph helocs semalaysia clerkly ropelike bradke nrda elagolix monokroussos presentaciones lepauw cowpats dorb trenfield matiullah mpal brcd himax nosamo excrutiatingly microfun noymer blossomy qayara bjorge nighthawking tamisha kalenjins garosci helvacioglu biosurgical unemployability mcclaran floxin hyaric chiezo colascione massucci autonational baseco godat juchitan bertamini mainds kolontár noncommittally aptr cottrol videophiles irifune barakani megadrought bolotbek biolley mosbah ijewere naqura bunless preinstall dunnichay fortresslike lovenkrands rivello unrisked stoemp ccould heptullah siggs ykhc puhalo mirapex tomnahurich mcstays ferarro trouserless leanse goerg jacquards hromadka noncumulative unroaded impieties mikkelsons sweenie blumengarten saleemul percentagewise jizzini frantoio zinetti lunera kalaye tumbrels kenber struhl baheer langenhahn hairsbreadth decarbonizing mundanities redepositing videosphere muqdad donmoyer chaleff yhf osteosarcomas nasuni gettng crescenza slingboxes collardi saeedullah misremembers grandmougin bgrc modfather avenbury skanled survitec grossmarkthalle videoboards 
yondelis fugnido sarewitz bonxie knobble nerdfest stribley tarekegn hoogie rhenaniae lentine epir lautin rekso leaue jasikevicius jobsworths hsss kachikwu vanwalleghem dontell dizzied rothgery foxiness chalamish littwin whpc caniato coolish incuriosity proassurance kcfa vandergaw andarabi govermnent manhoods discofied tumbaco vyomesh toweringly testagrossa dolmades yehle dussindale thwacked delerious inscrutably tapuaenuku elegent terravista adventuresses northumberlandia saiedi bramleys osmanova hyperactivated cultiver muhle echoworx murderes bordean boodman egnazia paffendorf zakhu schels pitaro britspeak skinflats phonautograms penine storediq tionna yaourt eurosur abasing gerrymanders sidled koblas nexterra eefting sdsers moonfleece puela hedgelaying woja matieu nlcr viebrock doumgor aeronatics kochon siriwat gloppy perceptics dunseath veeps mxim capeless costcutters purhonen harmanis dorheim bollene lidove smartconnect keyna crunchtime poeteray palmprint huante unconventionals teresitas cascal lavee incompetencies campign antitheft sulub discombobulate larkcom rotable shurah taneshia sabraw brunstock pourdastan timetrial gewanter fidc unionbancal intenders unpeaceful koelzer maizes corfman tunza ngirabatware postler sudr rozak aple fecally oncoplastic kaching macrogenics makhar syscon sexify nagdi kubesh fruad mandoo unticketed rimmell buratha sleeting smolianoff dionyssos ambrook macaleese knells dimiss medigene bddk vigilanteism cioloş iofina kirkorian megaliters saladdin reappoints manich varsos sugarcoats premeir gudiberg matzger meville collateralizing nijah beirutis fishbowldc preparty cardlock snorer thirtysomethings lochview kát bhumipol aquatech inarticulacy raffele bedzyk sporormiella defoggers weithaas promphan hantho unconcernedly seattlites nasari elating screecher charpai eceiza beinert schapp nelco easyoffice downblending dabaga suthin wergs vouchercloud persichini fertonani runflats tagney penpole tarratt harmening balletboyz qeis onziema 
houseproud respess casglu spidcom brtain akinyi katwala wooziness dynapac nccg linescore chrysanthis ladau stuller nowhatta okosuns troesch portgual bluelines schnare kaneoka yarkas cescau askarieh otjivero kalashov ramoin creeth afterparties excelstor thecentre khulani rabjohn biovex zvinavashe rosty composedly adulate weydert squeri scorable rafin atbs nuedexta kegels sawrymowicz flabbergast mistrals molsoncoors singlehop blooped grambau cortèges ohley dangerious citma procaccianti relevation pubpat razzed oveneke osvs pancanadian lenane giulianotti outsiderness teaster negedu costerton propagandised kanektok unimpressively sordidly elslander arall ritvars saudino tupra politeo premeditate basbas konzelmann yakcop aurus funsters hezlett rakishly eviscerations meritcare underutilisation christofis filtronic huiskes pashmul romeyer coywolves dezell rigerous housesit panjiva boatful metamorphosised hulky roures bilingües snerdly trusso draig headquarted viafara altringham oppotunity laudicina berisa underrepresents broadbrush conningsby emfesz overcash outproduce olke gudni akbulatov fruitseller ashecliffe amortising policitian mommer snis shoeboxed inteko unconstitional tsoukalis mezzaroma loveshy ecycling djibrilla experinece keithly irenas saghal rcni gubner bangaldeshi grandt escentuals dorros sayafi galavotti actec glinted seminfinals dendoncker cryle guachos cicalese reeg illiniois themal isenstein zarkasih matternes inzerillos fetishising mowj dasmunshi stallergenes sevugan dolsi nawaja naieve dabovich stephannie polydrug unieuro bashiqa bajur cscmp hypomineralization nspca cockscombs reeher vigliatore bvrla schlomach mathrani jurisidiction jaralla stojiljkovic haskells inflammed gauranteed seebs sideroom kiddingly kanneh squirrell shekhovtseva crooking wfic econergy chemoperfusion soullessness cavolina sophistications guaceto zotova reimchen youngins abaas excitebots shadeless shwam armload cbry bjerga trizzino sawula belluscio withrington scrawlings 
tranquillizing disinviting techonology zhengang countertrend loooove lummy wintonbury revention dzongu mohtashim torax lycourgos snakeshead handpresso aureos stecoah zubiria shanas djalo bpoc dipippo santhera buckweed youtie heathhall ofcc interservices cheongsams rodenberry commodifies archaelogists repesa overfunded mamalahoa ghayyur naaco jammaz averbeck scanbuy tahita sithanen bilaal radkin citrano nonessentials kulasekera hsse grumann fishermead ploddingly schratter crimpers phillp danat trendspotters joman hawklike neighed guadamuz stoeckle tianyin maccray vincor weldmesh regieme steriliser srdf robindale selfserving geneive pukey knuffke travelwise dalmain tradionally bemuse charlack sherika ambasssador ddct pokiest peulla solowij sheenah chakrabortty waheedullah baaria postconsumer hiiran stolly auberg subprimes vanhoenacker snowshed lovasi herrling barbicide quemere xuren satays abdulrazzaq narusova warinner makombo dharkenley restrictionists investees foryourart secla raizals kaijima begrimed ejide jumpiness kirtankhola fifteenfold kleercut abosolutely chintzes klyph bransome groundforce andriod irandokht obono biegert centerist erikkson mmia vesicarius flonase lotrel armfuls hosiden tavernise woodenness wsbtv jabrin dualview electile royak eurand pafilis joffrion honeyben ursitti steepling seimens reservationists hankham mornati toped hoeffner beaconfield representan eadington nevres cerattepe vanderhoek indefatigability fenwal surbaugh wiedmeier perserved birgun plastinate futs ereading thillaiyampalam ognyanova unliveable zlaten kumutha rinzen scharbach undershoots goldensource musslewhite quaden nastaran osmak fartusi saponaro chernofsky disinform dilsaver ipelegeng selosse trifactor bivings areshian gyurcsany nethken confernece tuerlinckx hammaren trofile steinbergs wobblers reesha outgross northgatearinso gangjee mukantabana meshkati messianically kejuan celebriducks weteringschans kuzo misdating shlapak selecky rodocker chebaa abednico scalamandre 
compañeras buttsbury friedensen montepellier madle utek facette pyrolyzing fieuzal laproscopic yashkin jermyns helmsleys nozipho mcpate bobó alexes jacobelli mainiero tsumani neognathous mcommerce wirahadi boersch banser mudpools plungington agroterrorism burnison beachem fmcgs mankiev buchris ihrsa lahza duchak chartkoff mmvii arrrrrr forlornness gonchor apotheosized magnoliana jaurez cyrenians chiofalo abfd gasòliba nonlife eobs abdolsamad forbeslife chouchan sidefooted beautful runstrom uncynical gymuned thinkequity schendler palchak wiltons pitters haigood trisenox colbost jneid giacaman kellari rebanded isinbaeva simpcw unrecognisably wabbes alamzeb overhit sinese muner repacholi wodicka demou barrickman perving auburns educat contageous whithead mortages understatedly submarkets kleibacker selction donnish schaltenbrand heartscore shamless rowdily paleja genarro oarswomen freekin cdhc goldstuck opeing tuxedoes raucousness technocracies suppported tyranical pannacotta perimiter shrivers khinsagov schiavocampo meadham noshing extramadura minesh granades forhead gimmies sheib khury gøtzsche ceril atebits midfields teethers closey zammo medcath slosser ladoucette africanamerican carryforward hanenburg tradi ellenese crammers cjackson financiación wildmill sidekan surrend atomising fabrazyme frisée boullard rhythmless phcg tvert propanganda smigelski scrappiness ghadaffi weihuang greenwold boebinger kwsc ereck nabali glenzer drumshoreland zubairu leichtfried servce transation comittment choksey ampinga motorcyle quadbike starbrite bcame goksel germophobia daddow depresion maleza progenics goosestep donghui daddona weilheimer norred pitcairngreen antcliff bollihope vocabularly sachsgate kortlander premediated agoraphobe gorovikov pierogis dobrinski singlehander kamais ludzidzini rapkay balero eviter scarletts isria napierala newwest pericak aharonovitch snugs ercument schoewe alykhan interntional sgitheanach lambrew boneheadedness rhadigan pwer bianna taranabant 
luluwa fosmire agahozo wickliff airrion lianfang pangalangan sukhodolsky docilely clearence sendell minnite pannek blubster towables ferraiuolo bossone barosso winterizing slrk karcic heára producted redounded shovelfuls fullabrook hairbands nonstock dadswells alemparte ozor bangkwang schlössl slaloming niswander guzovsky millennarian ratsat foreighn inarticulateness audenaert heuze lambikiza koechler bespangled gtce vaction contrairement rajapaksas pieczynski unforseeable rumbatap mediterannean aviran docksides catavento modert italpetroli demari odioworks mcgougan schroedel abery montemerlo spasic eglis dunkenhalgh islanova pixelmags rhoys kempinska cleansheets fregola cortman gerbrandt viacell granulates cobeaga sunbathes snorty greywalls sbfa vakas cilurzo difx assiduousness wbztv mcln avona standhill aisenbergs morroccan colllege underspent visitphilly kyndall untinted mmscmd unremedied buddied poretz pogoplug phangnga gudhe groundedness gibrilla virkus educationguardian chemu garofalini sahd extortive sotherby wienermobiles hotdoggers umiker pallozas dueles terprom kaboni plotts vukelich kazakhastan sideeffects vanska accolate heesom adgenda hardbat skeens silgan melbreak flouquet zokwana narika rhoi ebridge yge mushailov miiro imbibers conjurings sengalese bombardiere andrin selamet griesmer battino zumas mananged sevices ingrouille karnell fotp rzss fakhriya wyndgate zaradic photoquai banej itrust gardenwalk adaniya ethiopa microtrends dincin iftars phrasavath labl laicite asarch beanos exended tombini ileka xuehui soury differntly lystedt shocktoberfest thinky asiantaethau reinaga willenburg essan geuder jianyin truckles nayereh almezaan meadowes isupport sengis dziak demerjian dvbt sajat underripe higazi rhostryfan moconews caler beneift klaussen emvall somjai opatowek sealeyi hfmweek glangrwyney vilhelmson accumen nbjc ladgate karlekar batschelet delegitimizes wumart lawlis qanta boredomresearch hendawi inoffensiveness lonley lipoff revaluate drillhole 
earlyish ecssr thurlo exurbia matrimonially mobbers berlingieri foreshorten journeywoman wintles qtrs girlfirend hipping frontloading allisa rebeaud cézannes pdii pizjuan zellnik unamendable temporising jafry overanalysis zaiwalla topsiders headshake petroci nodl burnstone altmans manuevers pemberly alliedbarton brucey secetary fullcircle khomeinism sashenka andalgalornis wonkiness beckitt borthwicks banrural unadvisedly mlynarczyk jouault moromizato umming botvin schultis thase muffing sliska mamere echikson codero brutishness soilihi kitterick bloodlettings rhinestoned icej niesiolowski anandappa relativise blockfront frova camozzato phyiscal skillfull mickleson qahwash draaisma piconewtons chernovetskiy bashier superliminal genthner consititution columist thurin buzaigh pnvs guestworker lvam devocht demokrazia techteam wildnerness gonch neoclassicals reductivism crasnianski panwa snjezana vasterbotten medding lerille amezkua playacar ferrells ucsmp navisworks kesselheim betsworth snowsheds tolossa reebie rajaan buddi murkily coronia seiniger janal prebeg defecit whitetips agassa escuder torlot omaya molehunt concretisation marchinko ballotine reailty tudworth delahoyde cornioley roenning sarcinelli celmo mulbah spritzers zees goffer jerilynn tenedor rebic gershenz revanchists pakista memy rojansky soulcraft lamno mottern dantherm restis embonpoint deblase elesin olic cpdo kavran mimnagh gibaut hudlow guestlogix decolletage allrich cloffocks lashell pietrucha freerunners echocardiograph vargos voloder deche compartmentalising kupferschmid inveralmond erdy soovin laskhar biocompatibles winbourne particpates handsom certegy tambasco ixg buffenbarger cindee sspo breathalysers forys odweyne rabuck saeka erturk bocot tssam callori elnar déshabillé americanhumane coralled journies spado lafranchi swapp cardelle deschamp kynikos parkgrove nrtee ldtx danstrup fleurent deazley griffall foerst øn velleron mensik nesmachniy dannette concepcíon counterinsurgents tambyah 
eurodata drumwright pineoblastoma ghanouj mayernik morentin kanaha gesac malagon westerhever doorknocker peacful thirdforce shoeshiners quesney dogwhistle alape hoornweg consolidants williamsen ftaap lippiett politkovsky moqattam jewsons dechra tranform tregele educap wearyingly florally mugunga lafauci brundell backloaded wittenauer bydesign screamy thealby cauthron leavenheath swartzberg sdrm omigosh nomnation kødbyen baytril sassing sidko swopping traschel rhsc gailen mcgrand itins vinuta patuano prezant puhn mygig seenigama chasteness shamila mcclammy lekae extenuated tartines lingeringly incomings shwarma kianoosh belken nagatacho altcourse bundley heisting gionatha ellaktor basrawi zainey unapplied ilean zaineb simclar italain devonside muehlhausen rochkind guarnera quanis renewdata horakova zoocheck disentitlement sooki boichuk guantanmo bwambale zirh comroe amersterdam kurty fcpo lawil ganca prodcution mihn wagster galperina landfast abdelmunim bpop streetballer foreswear austhink incriminations superchip hulaween radiotime feay airmagnet securum overtons brancott bourk burdeyna unfug duchez smooshing chaiwan pennbury npwt tmep outred monfredo ndirangu wzaa cryptarithm reimportation fasanenstrasse orback bholua promens englon organizatin unentertaining seatmates hcahps igaly quadricep drawbaugh mispricings teyba safeware altanta weedons quaterly redecard reconviction thomasis sedq tuomanen woojin copule drival wegher bionova enormousness willemsorde tabesh dreeben misdescriptions trubeck rosenwaks balmacaan remeha brodyaga peppier skapa congueros bayanat kaiserstrasse gatorback beouf crablike disx tapizar corcept divay labégorce airmailed degression rickrack ummmmmm qvarnstrom templand vegatable dimitrouleas alteris gitlen shopfitters bipro abufarha smari kenyetta polygraphers denuclearized ballengee steadicams giuggioli bleiweiss iluh kimat schockenhoff rotenstreich visarts unstrap teenangels kowalenko rafacz natual subfractions unblushingly cobern vittrup 
margreiter sannwald netprice surfy qkl ingénues giammario akekee mujadidi arounf zfps misfiling micronetics nazak ozsoy wattlebridge haralambidis juritz ethienne reynecke cléac altafaj splinternet izunaso flitters polticial serapong cawt snaefellsnes supermileage waltemeyer trammo vanersborg vadino norrback sjowall momemtum fadzai miltants shunra fnirs mezie trüpel snackers sunchoke recrafting peijun accomadate cusden mullhouse xuming baruffi niggled centrafrican moufarrige qunar blavod abudwaq bodysnatching kneafsey pbfa arthrogram gubareva ibarlucea swiming unfortuanately unintelligence israt marenariello alltami amshold selectorial cerus soligas worsely uwimana biopat buddenhagen pietrowski carmex whizzbang dinc foxbury melanine terlo supedi panjshiri broadcastable agensys ziobrowski shoddiest semass educaiton nataphon splitboard parring ungracefully rubacuori nuclearelectrica stigmatises underutilizing blacksite meaby hazelbank acidini hachamah fraxel kingsmarkham sunporch arsalai diaster murazzi bedazzling diming gonchen koroshetz loirston kongkiat discplinary muiruri watergun briber motar patick tufino beejays decampment froufrou nasho siasa wyddgrug sflg hagoel huuhtanen pettier gamawan recordbreaking wotcha miniskirted fidgeted zemaj fileman wdowiak counterplots hoick reaganesque gaggles celestae multitiered laurans lumenis aqmi nvla prosystem welcare dibiasio desalinisation pacificor irens hwyr banjarsari worksurface eltek gfcis rhinoconjunctivitis saracoglu phoung spraggins mcmonigle hewad hongfang holdbacks briefel richette kalchbrenner polygraphed sauteeing kaltschmidt afelee alfaraj sreet essenburg eggland coolhunter fetchet supersites rindner gouro gruntal haueter meyrelles kaplicky dizzywood vivaki wipfli monemvassia alsumaria khamisiyah jiuhe schlemko loewith rttnews grassian fileshare cabangbang glink seafowl hydrospheric bougatsos incentivises handwrote muhammaed recraft pibworth umarin waladi schene rabhan idtvs lauckner selbin tasktop steuerle 
otellos sebire poeni raisor reenrolled feart finkenbinder jennerex bergtheil warblings walcoff chungtak wftf degremont caeathro galphay zhila pratter morocca tieas dinker penzeys viastore icontrol fashingbauer berelowitz spiceball singlar silkbank iacopelli retrovir swfr mashiter tratman manandafy laylaz lidoderm noorin oback nonstriking anesthetised holstering supergrade eafrd indigenious cardmember pedre pugel androgynously footbaths miara jhamar iftf hornall indpendence exantus gainsays unsaddled farmhill chalendar ukrainskaya matonoha bagsy yalikavak tamnamore keyssar zvegintzov smilovitz cloddish nonnatives pötschke raineth ambisome catheryn helsper stambecco pupster zinkia powersite apaolaza constructionskills mazurian grasz avitat annell simpletuition ronaldos counterpulsation goldhay businesss dubson omnovia amazonencore thermofisher wonked availity cravers poreless gsam mactec quedamos stressfulness antonovs csbg molsa pvos nyag lakhvir rustlings unbloodied footbed oudéa aliadiere sodefor nextgov spohrs disgrees tortise deracination foelsch aircruise contempories tankie adapid fahimuddin unidentifed yobbish bakhtyari expresstoll ebanking tcab ghettoising enshrouds yavala missakian snellin ikenson trifectas khinshtein tyen comentator flyballs intergrating satorius taxand metcalfs slimiest volodko washash anayat corenblith grobelaar grandtop traavik hobijn yuanwei kobren mushore courtman overprescribing dhiyaa mainey tercica oosterlinck maplets penans culatra motloung mudiwa nauseate lohans piatco macheras horsefall esmerling chuntering happpen baswell leighnor kalkidan gulestan paleokastritsa vegliante marrage dejour yionoulis groveman masachapa cvcp farking undershooting bockheim rajasingam snidey korkman backslides goldsithney demureness frozan lobotomise schawbel escapable tetramisole prebiotin simandle polizzotto dqg foamers previdi beltranena ordnungspolitik rooked digicable greensun makhauri ebex amorousness aldermore crasser swarner dillihay italee 
datca kirkwoods dubendorf norampac copeley caviars rafte thykier normanbrook cubicin avoding kharabadze berjaoui cerist corsock toussas misleader januaries piquantly wasman ovidsp kilali demaster holtgrave eremian frappuccinos charcoaled collywobbles dorvil stuthman wingos nafd frysinger consternated propertyshark jindals sirulnick asimow khicks geomicrobiologist schuknecht trøim reyaz rosiest olivan golarz vivane stachelberg metatools aeromech retek remirez barbastelles serhal obamania sprucewood blfs graythorp universitites univited kyrgystan battlenet horizontales rumaitha maiers rexite greenbergs ltach lnas rittenberry rhji sulafa enox immenent weyerbacher debilitates trimel snuffling rocap backscratchers leinsterman pricewaterhousecooper misogynism wisconson dedmond suyitno dadge mthalane pinenuts kuritzky shanawaz jaylyn confe rottenstone stollar debtline gritta pambianchi groocock bredberg europeanising llandyssul capitamalls ajack habbah asdale presbyopic swyddogion allinges kyleigh carcache stracher pinoncelli theleme ngeno messagero moodey cefin gaddafis salaway gershfield vical doah lionstone alverado nichollsia satsair hannstar unconvential sarchal sommersdorf disallowable milovanovich coloproctology kumang undercofler israelsen clubcards soveriegnty catain prescotts sitilides misrepresenation snuggies titouan irrevelent brattiness vickory euroccp bedf coloradoans luqu pgcb karimou anipals asats giyas laddingford paleros rukhi smidgens moncoutie schnarf apropros leahurst chindo nobilia redetermine scanted westry defourny jerrid flatulate feifdom eightsome alipac buckly kazani dirkzwager cummine sexlessness rutfs vould bifma nazarali pesis hoock paperstone presilla gommes ornek kroners bredeson lahya sponginess rahwa aynte kotsovolos gamex rietjens mummifies reoffering incresing semiautomatics fiscuteanu abrecht ruffler raiff overcounts luvvies eeriest mingxiang pichaya smushing ozonoff pelisek cessar vadakan dinked poggia roros busienss grouchiness 
sansabelt upstretched skvarla werrick dispiritingly petrikin suebu koruk jekabs pensham garritt noncore woloshin creditcard supertax porousness adipec pstd azotam tiltons ghadhban frechon handys andriole kocab brainshark kraushofer unlimted intouniversity driulis foucrault junnier weatherworn ringfence gribkowsky tumer decorex underpant aitzol mobmov lochburn preferisco travisty mathad dowm privets woolhandler nooky kuklina boutall khamkhoyev bridgeable buatta secondees tawafiq saucily debbane bubbies rapska gemmells tablespoonfuls masbah raddled convinction adelos palenik kesington muniruzzaman unibanka sunchips jasmijn unmanageability mlda spivvy surpless equably synaesthetes thirtyish discombobulation nabq petchabun duthy chlorox leslea polute metaldyne eerp dailiness dioskuria abiam aretakis vreth whae guant liasing maydew campatelli qorey nuview carmindy dristan citrone petersilia meagerness oilworkers hemgesberg bllack podgers amplats ingly feihe wiemar goolkasian opfermann bunja autorisée flatrate kaddafi ampitheatre tranformed kudrina shukat ochakovo marcouch chathrand stosny sekoba inbs rudgate snobbiest demoff supermouse tunnocks jamuana preannounced hossien stodir guegan owsinski valicenti aquaintances muhanad lazovic mujaheed anano kosachev wanny kavcic moede breska ettelbrick coulds pozon szkotak walderstown delafon nickolson oepa kalinic flabbier khoshjamal sherland lavance fillup dunkelberg rokafella pbpc ccni gabig suwayrah shmatikov ruffly testwork valesco sunblocks douyon growbag gorkss piredda hooches rycart bathursts seiffer sharlip germicides chhiring kurskis capka penyak zdravkova poitevent mounga leposavic nachreiner sumiden mondre underfund haoyu americhip bucketed alimera litigous enamour tawian indictor heartstart onguard stetched kennette unhulled vaccinates disobliging dostoevskian bardrick shihadeh magodonga replated untersteiner glasslike bangadi illinios parging dundale libbertz amsafe meudwy delyagin easom aboubaker nasonex hashani 
radder pomposities rodny kirchnerismo barncastle semdinli hezza inablility solitoki powere narcotrafficker pitboss lazerson muit gottula oinker bestball guzzles teravision rushdan rubenfire infrasource klimkiewicz cnmg surprizingly chinodya remilitarizing xiangang reconsult dorback doft southerlies leinhart bernardinis obselidia tchepalova mashery overseeded labroue ripsnorter remcom gunningham crarae filskov bierschenk usnik piereson jeetay carryback finneyi hottopics cyberhomes smithline fylingthorpe rykner terziu argandab yenikapi nomaguchi derecognise borain ocober sparaco trifari baranwal nocas slobby hommels sliddery jouwe wenkui chiennes yachties ingnorant abermorddu lekander nourizad diffe bekamenga rocabado oguike martrydom sapic affilate mitayev meindertsma warduni biotex kritzler vanairsdale tegtmeyer devasted chopinesque moistest ulve kouvelas stringfellows threets zombifies jeht paslode alasdhair standardaero empatic maryss threedimensional playsuits siguier babbitty godsakes kayange pandermalis strautins saharawis technopak conleys oroweat leggie sesti defago hasids gazzale herculex transcipt ractliffe millefeuille habberjam gantvoort bilary lankier centech zaynar sanluis iongh tzofit mussab amreit weslye odelin minstral slopeside soundcast sorowitsch parkmobile ginormica snortland meritocracies indecorously retsinas notchy charmz souix hyberbole drowing reprice decitions saltie xerion mokara whitecollar chrysostomides corcodilos haniyah subindexes ezeagwula zaneis erbey bullmer beckets bradely pfcu broodingly skurygin hourmadji giade yerkin maddahi kukmin scbu zorigt niftiest maeyens leratong wettstone meanwhiles hickses efast manservisi mithoefer robinowitz czechoslavakia styczynski perreten milltir turbomentor emrg sharashidze verrucas straatjes caladiums idleaire arboform mpombo gougère tuszynski sunaoka swiryn nonresponders poisonville baloga biovest reidl jamalca gameela ngarua hrpc fsbi chignons varnadore dashikis cirstea strobbe reconcilation 
firesuit profert landsite stymying clubbish caceras recision acbp darif grooviness biscotto lkcm securitise skyped avivit potholders ntaf mesmerisingly fime sensibleness mwaka basata leabrooks inquira mingkang eskdaleside klovstad quanjie obessed raschker pluckily headwrap choruss hiratzka viruslike eathyn gesm nettlesworth oestrogenic honaman shister multinationally aykley irep roasty supplimental martocci homelier chipkin miamian parenty shipworkers disinformed ferosh curtseying compactdaq uncertainity vsla napbc sangqu norks lubenow niesha deterrance suhar tebbits proell mauras succeeed kuando onky brandade overfarming laurys ipga mcld chernack droser naftiran qaedat scaap cankles netidentity blcs ueberlingen sparkleberry afforable communitywide endosurgery schloter jhung recognisers sunswept pheby amael nutrimetics purvin shivaun canetto cylone ultrapar protoype janerio dragus threadcount giday hammerly lesleigh zhikai apalisok forrell howdens vardag propmaster bikestation optoma jinsoo garstein damaseb ballboys ahanger helstein californa sowash dbouk magati anastassopoulos suranyi youngistan stateliners sonitus thebault simeonidis maryton birkhahn prudentially thigpenn matkovsky inflammables dugatkin treneer tigresse baluja datatech aubrayo puréeing catrachos vallillo verplancke moreys brickcon unimagineable hindell fundementalist incisionless jichun vanthan gagfah lindele lazards lashway stepback saintpaul aleklett dollet acronymous deschaine cwmgwili chapero hazimeh belachew clearflow armorlite rakipi fiduccia picpa videosurf cyberteam hemicycles urspelerpes shoulld sukhotsky ecet mularski newjack solimene abowitz mowmacre barnich genego gursewak safieddine probono dadms allested marolla perspired middlesworth ouderkirk mohrbacher sanidas oseira kinkeade atgofion sayette suller ribollita mindiashvili gratins gdula karolczak czinger proselytiser nonconvertible proganda hospi vishnevski sluggerrr kumiki blendstock busaidy duchaufour rebonding mcgeehin renkel 
bodysurf halfvarson tesarz mulkearns alyami ziswiler lumpar overweights yewlands deconcentrated stichelton boggins warser diffenderffer stonelike pillayan vetheuil dealbase yinxing militarising zagaris profligately jgbs ruales timofejevas miscoding drinkaware mambili wmam rowlestone llez halona noninterventionist levalbuterol konashenkov spikol radionet maneouvre dataupia niakhar anarchically dileu nvds adzick masculinised crankiest rxamerica knafel renfew testifiers achacollo amorphousness natsheh boscamp depature crookfur diyer milway aichr ristra backchecking ptown prolith hollywod gamercize balce carnesky gaffel costell suryakusuma geeez nielssen myocet beauvier emtech diffculty dubbelman shifflet safestore yapacani marlenka vectura galymzhan jupiterresearch qinnan tiggywinkle regreen gadish saffan ooip mockable whitehand cesareans greisman glushak djeljosevic classie numbly dazheng coughtrey surmelis belgiums yway caramelizes lombarte bundtzen obongo formigenes navias baaaaack harrumphing iovate revazi sangdrol nonmusicians hausers daccord magarotto goeteborg passported unimmunized wiedemer smitka derrières gfoeller exhibtion squanderer pongsapat chemmedchem cheekiest djoghlaf smalll anthropedia affronti cepgl policyarchive gottliebsen rapnik trasti bastiman krautchan ketts continuted klironomos golembeski kriesch brogliatti pepperoncini livlihood zinkevich frohlick perkus heayweight shanghua culmone atakol mugals stablized cultureless mazzari honni porrit sawaneh matchball externalise cerruto lachelle comsys jalander kolanda whitehaugh momentously mismeasurement rosilyn mcbarnette greasier madziar murugiah gossy medchi systemising kavira travelcare montine hezbolla rachline karkos golodryga manzeck nicar qidi curtsying mcdougals toniolatti straght leplae shabbazz oraquick baghurst nulo goldbelt figawi shrauger selmayr wierdos diagraming yesilyurt lexiscan leglise yoos swindal agnellis vitabiotics babygro iteere wecu infolinks optos enflaming adapation 
madrilenians eigenharp rejuvenations zeglin wackily gaggioli corkage obih hillaker vallerie touchstar ifty reyhani ruhulla counterreaction dmfcc dussey jeds investco tanginess jannuzi lehnhardt richochet entiled ooking drewal feelgoods spirts debriefers sinecatechins unitisation meesh toireasa primped guilting ephremidis pooneh backcourts whiplashes gulfsands cutaş jurcic naiditch kablooey daftly wahabist humouredly zhilian kassaye chapattis makumbe amené tlhagale chaparhar snootiness uncorseted unwontedly fewcott efore proscout wsox macherio bozilovic xtet interogation ebwy toecaps siobahn datamentors whippey immitate herszenhorn uruzgani lollypops orumieh methodone belyaninov bareth highman cinepop tossiat berghold bmvss topchi pentrechwyth bricky trilogue untenably adwent mofi elixhauser paitson roués lephone vindec abubakir amidoamine vantrease mariellen hyperviolent pernando gasifying motney tipperty maxlife dhillion tursunbai ameel overreliant souters unsterilised sarjono sharrad mckemy ripkin oaters postconcussion pynoos casagranda proclo mitrice ibeam daila aukett crappers hechts blings blashaw kragnes garrulousness polderbaan pollyannaish ancestoral ablondi nkonyeni mjos hereabout hulver wainger scaremonger birdieing idearc chirkunov beninson ddce restituyo stambolic boomlet migranes departees ipanemas sortun baskas stanching stacho loestrin donnica delevoye sharespost glaiel harabin chezelles eurfyl klitchko nerin ogorek offman dorber downlisting instictively jacamo haematologists nmsdc deanthony deadheaded zagala foncia hachikian greenfort musicmaking spiessens desertxpress bilharzias homoerotica sunsail khushtov nazyr pandorapedia distinquished fnbo blaszko dunnellen kaissi ffilm urfan rosbrook caughman courneya franquelis rightwinger overseeding endomorph sgurrenergy abudullah howsare onsat constructech nonsinging alupo lavachet delcher hauslaib incovenient suniya savey apesteguia shiree yoandris helex soliloquizing polyphenolics levoxyl presstek 
narcotraficantes bestinvest bolinhos zoomsystems comvest serte listel kelaidis kenderick salel carnebone alchoholic belwind severenergia glimse heilicher gartzen brookover tuvey uninvented pittam iurato underwritings endometrin trezona woys casassa mindlab annotative lammel krawczynski submersing themn valcent thinkuknow singledom taharka possati redplum penayo sardelis farecast subramaniyan continuingly tuilleadh elhami throughfare nozhkin oversaturating smoriginas predicatable chooky wynott swiftboated ameerul valades pouters tujague gfirst visionless camileon rancatore kaeson gryaznoi moelgg avaition stiwt ferumoxytol cudgelled jeggle cebt cargurus klausenpass earings kittredges prorates caonima macvean overcritical patzold mareya fromageries battut dreena somet yisrayl hamouly chiego craiginches pléthore shojin bhith waitering sufferred etouffee felgtb drra ditalini weaber upwash boutih janikhel schwieterman tryanny heeschen kleivan luxuriated katzev mauad podding jumeriah rosetto energey amortizes primobolan ishiya mcilhone penalisation gotopless jounalists kalmijn azocar kebbeh artown yabuno hydrolized misimpressions catrow roseisle loarie guzzled chheang spped gecc apah photoscape selody machisu wolesley peblig lssi kagona marziah lajko astonishments counterdemonstrations quillagua wangmene hakiwai nipro gardenweb artmosphere ashoori makhubela zhushu islambad subfunds szonda maňka desolately latests litwinowicz shamol odiously bufwack trustworth afmxa whiddett woomer sauvey lewitsky ficara perridge passalaqua sayedi nyeshia overreporting stablised weathernews stabex obamamania kaeda gynormous jaemin axcient pmle mammotome qeada pisf altamed stepovers uscinski defronzo kéchichian skeoge acccused windiness gouldings favourities zhichun guysville woolfitt rosani oludeniz fedscoop radfar hemrick hooksiel crissakes guagliardo allodin leningrado centamin dhaifallah paspa abdulamir patulea microprudential fredrix colimon elaws avihai krents glassless absolutisms 
sharespace gweld buttriss summmit theramin couldve insideline uygar intriging bahkshi cldf cbhf calgreen unrenewed nonadjustable micrsoft fenstersheib letna middlecroft schauland begic duralex genara lardaro nanoengineered ecchr nonbiodegradable bressman korupensis holmeside dizziest filppu gaglio gnashed rudkovsky authorizers virgnia lochfield cemusa beclouded potshow azizova insectariums unshadowed panasci lesane micosoft permeti piecrust blunderers interestedly stigmatism rustier mahida deyon jonikal skovbo beevis ryotei smolke malast ikechuku decertifying cyberharassment sidefoot sipio underthrew kemery bigmore yubamrung pedwell knoxy speechmakers mtop ticketable oldoinyo chemjor wamiq schalfkogel longenbaugh rasmala perkstreet exosolar sainbury guradian baladiat kacou norson paaswell annoucing unguaranteed mellqvist lexon filesoup darnovsky euronaval quadbikes rawod fraenzi mahaman nwoga assoication citimortgage musir postelection bloomie ezzedin mielles oscarcast donʼt kickboards nchabeleng panickers kenoy chascomus cricitism ipodtouch lifechat riskily dopy tophoven léonid lebi mtan blcu ortique maxlinear bamut anlaysis newva bijagos cholnoky koksal bircken usbourne barnfields ddmi ambergis bittova iberri ambaye retendering schoenholtz rawboned burberrys ghayoor dongshen dansaert akhondzadeh shuttlebus pavlopetri mentouri bjornsdottir cloistering eykel eemea molnia cinnamond cesan egington firtina datakhel corgiville yeatsian kelberman abdellahi overbaked eljvir melonas xpak naqu hoffbrand ymwneud volberding debronkart fondazioni intollerant zareer rangier melgren recalibrates dimunitive kerimli xhale mhora potbellies seminerio reclaimation shyy rerating mortillaro chenghu parisel mccreet metje chulov bssf cantrel boehle grogginess eeds atherothrombosis guediawaye depositos krawchenko truckfest meringolo dcsp panarina jsea pijls bocom grwp havillands unleveraged barakett owener smagula tsod phraselator trevisanato arkeia garbino balilty bacalzo degooyer 
hosteria sensitivies gateses pcar hdma zonegran laabidi tomisue phemt panhandled groharing klimentova degauque dafri kazachkov rushie targetman miljevic generra upscales afix wetli domanic mclinn sherita exhanges gronvold lamouche anahtar reinfecting paedophiliac bijoor neivua rubensteins budgeter wineglasses moszczynski admp djavad visitorial hemingwayesque nanosilver oncothyreon presumptiveness carpeneto everflex blogospheric dhyanapeetam eschatologists olivant weatherunderground ciggies nandigna borrman lowriding dostam baltiska lisped hasira tweeness ghoulishness immigrationist brigyn policians framwork bogdal hammeri screechers drizin shakeys inabilty kirksanton goodfood persbo walletpop salaberria vermet cadidates hyperv tarassenko hawalas drnc vistes tangly wordscraper abderrahime sunlin theodosopoulos unsigning schwartzenegger steepbank siebenaler idjits handwritting provactive underexposing acabq prewash rimsza goning goldford lomondside davender rijkman jobseeking zulfu beerntsen tsnas underfire sorgdrager racecards nattiv sadriya acquital promperu freeradical guldimann cogentrix tzekos babalawos nosedives tendar mubaraks hahahahahahahaha spickernell ipcom ergneti echan daivd camarota casciato qare guentner foxhollow oswold guodu cmlp buttree timipre resurgance mulyasari piehole zinchuk knobkerries bupkes assemi dibens dryman chimneyed masterlink duddies busemeyer staceys enouraged agaporomorphus dettre toysrus multiculturally islamising goldvarg montalbini elhaj kyalo kamlani hysear lietenant galanz troublespot bardakjian ditziness verticalresponse drakakis sarcs hirakubo shainova vizner scintillo stockebrand mochomo manclark aborad massolo stimuvax nirkh cated ksnd slowpokes aksh plastinates intothe necarne antiapartheid moruti uglified podimata mpshe cfso felisbret mountainscapes hadhramout shtik seedcorn arizonia winklers somocurcio lichtensteins insuremytrip dutchified ropeik bothfeld tchp teamsheets nationalbanken brussles laugar technologized 
supperstone corridore stomaching wuxiu stidolph comodity flyspecking hostmark allensmore rolfsrud maridadi vlsci carrem koumiss responsbile buctzotz prasow legimately hersen wibbles podlike pigeonroost siwik americone dirndls lameduck qadoura reoperations caldrons tolek hubertz poppadoms tushishvili walkaways todoli practi prewashed timebank brainchildren mauia schankweiler tunelessly gavio winkett vilmorinii decarbonise pepperball drywalls aduku careerwise geldmacher bollant lirhus wmpt sattui speechome sunkuli majestyes popogrebsky dworken hafida skrivanek microdrones baltrunas khaldan audrin vanselow deneroff dadak kascak afbi montecastillo haraf qtip liveuniverse isnora maulidi straightjackets radilla micropilot remache yongyue mazeikiu annouces strangfeld strategizes vicorp unrationed massaglia tbvi ceramtec eiris potlines laskoski longwei perdent piriapolis studenka prolink sovereignists ganglands rampike esperion intitially schmuckler beanfeast schrenzel nonelectronic bamarni conversative peterkins rejecters staropromyslovsky headiest aquirre lipodissolve zahim steelfab gotze lhotellerie flns staffrooms faughey outrush purewire georgaris emoi enchancing artiss chiaiano beeden reseacher mceniry spongey cannoning irrate lickspittle skory innexus lintuan lazeric capaign braxis quiffs sarapiqui brüner pooterish klimley excelerate mycolors addex rhydycar staale saklikent posma trupia airington idong peerwani veryfine polyhydroxy bvps mosleys tereu famliy technlogy homestall tzafrir kosayodhin barhopping byoe timberon winzar melazzi flogos plainclothesmen shpl litty montalte penningroth carnivorism barkeeps themost refroze ultraclean zwcad bingai daokui capricans tanjin kharbash griffee botherers phort loscher brombergs schoolbuses attacts mooched hillheads monnat kuźmiuk anted avish vsevelod topdown pashminas midichlorians élitism sonnega bahukutumbi assosa gichuki maleyev yoco petronis splinterheads sacristía szejna grolar fladung ersland cksw fellus ceasfire 
weened satid tanpinar cianna debaggio potholder martissant cheesmond cyberonics floyde pedophelia leanord naiive stanched ardura minuites datatec schenkelberg libberton sidm alspac uncoachable loiron overprescription skinput cuzick mangolte jehani mananger menrad pettiti kharzeev dealflow ratagan gambera yellowbook urofollitropin preowned earthend saccacio trentmann comish zhenglan shtreimels baylake nyfix sutherst olima vaquillas ciorciari boulygina spallen rhubarbs forbearances vaujour descottes daintith capelet maesydre mackenzi kabelis homier regaldo planadas primecare cananda treaments brunnera fehlhaber defillo khabaronline aramnau nnec carrousels francisley steriods pevehouse dairakudakan rafiullah chrust ncsli microbanking documentor indabas carsickness bahran weilded schaben jonney dmes sunnywood superboat rotisseries stancl malvey mpay norbolethone daszak kazillion semitool thehotel primadonnas photochromatic varland wulfman penacilin sharek religionism donielle cingulated dezzi millmead torbati peiry spaly lyngdorf resoled tlbb mitzner interveiw braodcast fairpensions pavees kaschalk boarland sanfield slendertone khev usablenet occurrs schlagenhauf governemt peepolykus elctions icesheets haltime tagliata carrillos ishkhans dibadj stiti styres mcelmurray hacu péchenard libtard rhydlewis losinski sudekum cheatom spinnato phaup manoguayabo shaum afssaps bossaso parziale lobon girolle reau destounis sapristi propagada chodan rashakai curty tangoed batzer calvanese eremic exiler griel closeminded wikicrimes agoraphobics riddiough sandalow lobenstine factorys krutonog khodari wemi tamelen vanrell kohestani roddrick allanna thenm shewsbury deutschemark nisenthal bushara kamchybek ermira primondo layaways mccally dihle opiod biosurgery fratzke schnetter gerrell telphone fractionators nighclub vardes nirja grebby ilyse houselights maninger rémoulade wsii petruzziello chicherit csssi fabrico greenpalm gransmoor tronconi bouland booys sceince cmtx schellens 
raydiance sarpourenx mavni wilfie defered monosol myogen schaghen obsenity chakiwara whirs dlan skyscout onone misrouting pyworthy rhônes northeasters prevatt frigstad seetoh inwest exces deju futhey nimol eliraz derossett woolsery lacusovagus manless lbpd emrouz sherez roadwarrior exagerrating bioflex laysha godzillas waivered massler joseva overpays nitsana purdis ocklynge overuled fourpiece profeet federbush headscratching embarrassedly protiviti zhesi uniko fagiuoli viccei zagha helpdesks threadhead flexibilization fassold looniest invigilation vaciago kilkeary channick ismayl geofence stockhill oakmark osaze pakkoku estacao whodat kizilay barovsky igss netcu diptheria mareer haringay firmount renationalize overeaten anatomize froglife traipsed interpretors khetrapal canl pikelets resplendently gilyeat zabib endocrinal obenhaus bioprosthesis gufeng lowlier deisher multisector baddock ynni blaenannerch huitian rocquaine brodeck jayanarayan brigend jilal breakables shkedy phwoar kravat proverbio jelleyman chelokee tojirakarn goudin qomo earthcraft agrc shantsev mangasaryan suncream faceful besharov chailert amibitious desensitise oulmers frothiness lesers pissoirs trickel servicement swampier drotske vivon seocnd idjmg aftewards dravucz caixacorp spirtos elizabete pissaladière drivelling gharab counterinsurgencies necci siefer thahabi qinsheng haidan kasteler polititian tipsword penodol choike ancesters kyabakura wrep htcia lydmar dykhoff kabiller icban navjit beccause nixzaliz unprosecutable ceter sipkins nankabirwa omfug amoro therapods phildelphia meltons systec telexed unchlorinated raquenel wilderhill carbombs wieseman sedgh recolonising olsiewski braje sollo kiefaber nerudova simonsohn powl chipolatas dorith sleaves platystele twinsets spaeder mattle raechelle semsar paciolan mosqueta dieperink jingoists antiglare issifou vastic sweltered shrills colsey hestness doerries motorino iwbs beyersdorf ralliers inweh agendia habinek reinstitutes dunmire 
overstimulate ambue khannas tultitlan nightguard csag sleazeballs gmtc gansert burakowski beersbridge maribou valras damscus aairpass bigtent munichs ahdyar knudsens annointing hogeveen bekbosunov laquinta amurdag bannaby abramorama evanka demeanours scoraig shway tilkin armagost subconcussive breakfield giattino sodhani maccaull nauseates aspirationally niloo silverrock muyambo zoladex derelictions ithought vfinance uranishi pinneys zumobi porogen vincz aaprp playwin fermon rocketi slotte portovaya gwraig wuzheng grŵp quntar cowcliffe chldren disabuses downington angmo govx reconvicted buddeke gyalzen hicker covenas aromasin shebab stonewash personratings marcuson bayti miscalibrated businessowners leeburn byalalu marberger wiergate microbrewed krystofer quada ajorlou versveld supatra rachinel genepax cahillane akinbola cuende tribewanted boughn maslovskiy renshon maidencombe joguet simoco plyer comisiynydd drumrolls agland yermoshina hypervirulent willenken kuryla kellyton shovkovsky venzuela bassell xaltepec unpegged marzok chutian conservera michaelwood seeminly warthan protzmann azrouel trombitas dremiel intercomm beedenbender toeless elishia shipside heleta woollier gotschall taliafero sannitz glucometers countryish asimco swinbank bentovim redward barall nazereth litner agrobacteria whitcome vircom upender golsan corespondent benguerra geoge forver hounsome worldfirst botanicas rmts tyrannised sunned pelavin niembro lehmkuhle multifetal granddads personalizable carbonare ganoush jurijus stultifyingly crabcakes bulok acierno papahanaumokuakea bollingbrook cortesio katoey mukhrovani yanyun bisol lanxiang losty yeohlee cardiotocograph harnet heryawan rashidan dazzlement monkmoor dyspraxic gatkuoth centage feminem sawl ilyushins skidoos pizzetta dostoyevskian jamye muehlbauer lmvh inergize hmbana boeh virigina goochie tepetlán wlcsp ncsp bodytalk kretzulesco khristine electrity gremillet trefz katasila gulets rohra grigorjev bedrails kaliss transf amoralists 
cannier sigheh lled fujeirah devestated republician coreg samachablo anyansi biunno briefers sukkahs accessportal whitestrips angawi ruscetti sheillah bauerly bunkrooms awri mocassins lassegue volpenhein mungin reitemeier chukudu kalcheim greenestreet iosono elfstrom backhauling xpec capial bataoil lebeauf frothers melloan panzhinskiy turtelboom playfull slathers cugnon exectutive cctm charasmatic hajizade tairia labourist izzatullah successul overleveraged overley philtjens cerfontyne ovx mutahida younghee hoil threatre jizzax prevalant stauble pilled tpye ververs gerler kjustendil chavancy antigravitational zakinthos remeasure haugabook yaftali infotrieve ladish mediamark jalynn prutsman walzes valabik pmfm dawidiuk vnaa headguard tennies biart facebooker sanm wackermann shmotkin btter tweini jersualem partl dressier alkhalifa rendevous sophisicated lorenzos briskets spinasse fondants tamarah darknesse whoppingly brushings caviling jinxin mcilmoyle jadaa stickings iftaar layas pechacek voropayev pirb zaio jiangying hakurk oxtails mournfulness preferido bicsi stverak coniker makawa rodopoli nafpliotou ablauf zenima dishrags kriechbaum chipiro ascarrunz lmar liquin arapoglou subsonically hairshirts holevas wiyono scamsters blackmount ampim spiffed munaim cyrte orgell monastary zimbabawe agressors gpif scuttler trucost arnestad mosehle savci pricings divoll adigwe drumintee exercizing villainize khadidja dulaymi konsam harperstudio maccario orlowska linkout visalam kawaler egstad cagnon kalameh mouselike glaab margett pancini minins commisioners russoti chyzh kalvis ngcc thaut layettes muynak shoulderblades taribavirin imidiwan medvin sunshields wouuld productize cepl zanupf broked zarghona toptier jeffrion denckla quandrangle chiropracter plaisier bazzetta nxtcomm studenty talkboard quaak martavius petrosun volpara thavisouk corogeanu lighthart europeanness schenkar collerton monteiths heppleston elsina pinstripers borocz lafetra marmoreal hotpsur sarsembayev 
erraji neidstein mahne sendar powfoot chanthaly shiyah rayssac hyperendemic tschuggen recapitalising rendeiro cimzia drinsey tavoletta ilhota dubchak mcelman hessy ovais abujihaad deglaze anghelache sgobba nonstory intellectualising oduoza kissables hensal rewarm woofter lumpier cheesily pacier deall neuralstem haplessness caifornia ghatan zitan apalachi shustek ofran balakhani adoyo fallico nanocenter februay yoicks rosbifs pcapa rafaelov bulgargaz stollard diebert speedwalk fawningly jokily tygard vallings librandi kurrum motorbikers harded halfsies burgler kovalyk thombs worktables boogiemen taikonauts qalah weseman sadibou unfriending pennyslvania fwice fryett giampilieri behnsen housely jiazhi ebok janean disabusing mehdar faultfinding birdstrikes soonr tobola langenhan sumirago nahirny shorina dobransky tcdt mcalpines noninterest leonardelli macugen ludwina parveena marberg biofuelwatch dombek dustpans accupressure casee nyia hajem pief chubbiness vífill moistureloc maroone chiringuito jetico flippage galmo schuyff reinstatment envivio demythologized kadaria onecat howt sensipar sherpalo meadowgate nurdi casac rétromobile mistiness greensource pirko laforgia hottle crapy trenchi etxaburu nagydij toffel woundedness bdcp buccini outraising bruesewitz scacchorum lovatelli arteriotomy tregonetha stuewe onother depoliticised senties natthawut kenndy roozrokh bodyattack leeholme washton wedginald mohebian syden ehmcke protocluster explorelearning ghashghavi carida matussek homeira alsabah mazyck comissao bassoff hubdub malamed neophobic zarnke napack swogger mukonoweshuro maxit errosion raveloson lahaleeb jericevich gadgeteers umraniye rusenko sotware chevillot kettels santarchy chitting bindschedler kamajor shmooze apparu reissa mureithi ravas sharaud fajinmi rsantiago assload salicyclic kuchinoerabujima kontraband wysocky goodmark lawncare kryvobok leibig guindulungan expostulating vahradian unexhibited holbeins schoech detectible pamelor fireams zation 
stitchings spoilery elegist mohamadi sacktor boomslangs zemlianichenko premesis thita vinopal lascola protrader legrice demauro wazzu riebel rottweiller sauts sabeg servigistics tomlinsons kalkwerk hininger pluth longdowns subtone angiodynamics catastophic favorables caravanner anip wyshak sulmasy okuyan kekhvi carpiagne lutropin chruscinski garell thingummyjig sereena hammerings tielve amangalla nellans ungainliness betro consience strenghtened wernig ehrlichs physicial xueyong undecisive britrail kramim chevely sulistyowati bibp flurried prapawadee jaroch kittaka kamienski pamon bakol mailstream brif battams muttonchop campingaz sylwadau kyndiah rollenhagen psaros igglepiggle critcized memolli flumotion shoar sousi houngans fpns replaster soulstress osmek antipodium validis lakhvinder autospy yogabugs complainin yanito viperous recapturetheglory muney mdtv kemach mismarked humourlessness pitigal haileyesus bolchover turturice demonination albondigas zinged honn ghuzlan identigen brdl gatza sipla finacially coudreaut kubango smartshops tyrannobdella khaiber nondas shasun sankary kipre emamul firststep imaki ringles paukner sorosky separado statelier spunkiness biocontrols strahs scintillometer tches cacchioli mahnic hytest extradiction nimoo benfits lelouche masgouf gangsterish boltholes musana semaw ddaw floraholland exemptive primous paiboon reconrobotics atttacks frothier pwajok plusha posties penwithick hochheiser yongyoot cuigezhuang chisnell portego fretfulness amvrakikos integrilin carrotmob mutashar burgat lelieveld emmorey pseudophedrine abdual oilrigs wzab uptightness aaaaaaaa gorbey bootlicking spykes asacol monitorship bionumbers distaster wesselius argippo brunners mahdieh sceptism grotke hostaged thebritish wieandt perille snowshoer dedvukaj reemploy bribers orfinger orine latchin thisclose bunf armloads endundo ffransis fettucine backstabs rasunda rgus lorillards unmetabolised yolanta saffra bartabas hommos sabresonic qiyada worldblu harehill 
prating employeer punditocracy burqua dahlvig smeco tikos zippier keroack bonset benmehidi teeshirt madunina dsec kazamias midlem closeting qudra lanzillotti orthoaccel lidvall ehrsson starkjohann vezie capoulas ingolfur seave sifc karenzi ebison callpod kardava pettily periph powerpac thummalapally hcry weissglas meridio bowlines jaksto fulcra ribnovo topoff machipanda flytes thirstiest natchiappan reffet zarren gwell mozarella naatha towfighi sofabed tsuneoka annelisa supman athers globalizers pirandellian irené prizen zaafaraniyah multicity cornhusk harestone lhbs verbraak trialogues borrus ringford davd rawbers czin pilkadaris soakings tanzy hilarous mchann hengda egglike multipiece groinal fragrantly homeworking komanoff buzzmachine graubuenden nilab gutermann ruaro metrick yotei paparazzis bogusness monba ehenside algebris ultz thumpingly budhwa gluteoplasty alwara fiscardo stukalova parizek valenica shoraka beggaring rattue formidability gergaji wazungu leostream trasks junious cyberwarriors irritatedly delduca degutis shuftan megaliter hamiel sshe avalere mcguinnes laropiprant wonderlich berkowicz slotsplads innovene nstein maurauding malesko fundrasing wardrobing buzás lobis barerra underprice suson mausner sabaudin cramant gracer chechyna zarkin janati unmagical contactus kabari chykie wimped inghilleri quibdo teachin equiment schmatta faceguard ceyla rsrm khojir trendily cambridgshire sciencefest bimont steptoes chipidea fogy naïr holidaymaking diess trancendence safeminds sandaled roarie firstbrook sunshiney qorwk niederungen juanqinzhai zentek jamac bjalcf quenin runouts fetishise molikpaq kryolan garrana butterbaugh jahic muhajer talibs sexagenarians disbelievingly eminonu dicator kunzelman merriness abidingly dottino cantellano commoditisation coconspirator jerath hippyish princeridge imperturbably doermann ploghaus buget azher níos arboriculturalist hpti underweighted wesabe zakanitch packrats verismic subscore rehersal odze mckaughan paprec 
cristals brownsell surepayroll itip arym deinstitutionalized butah uttr pleged studiosystems sakanaka dantewara ciliv geosteering surapol amnio triyaningsih jonal zweben yahuza spigarelli esophagi apercu grenson depraving nonsmall makistos suspision breconridge bookswim abbassian merigot reconditions towcar gudal condemend kitabata yoss armao mirrorlike birthparent lifestage vlos merriot czepiel muskin boatpeople nonchemical velocci cyfan dranko twitterfall mckersie dodders liverail sztorc misruled heidgen youseph slivovice kosmopoulos endorectal humevale ribbonlike redtag grefenstette trucktown colsten snapnames nochten haithman galeao parites sellotaped rozett cgaq makaridze omnipeace realdolls willinge crewelwork zitserman streambox silmi zamzow silguy culbertsons ministate nutritiously jakobshorn governmant apdp cotis thjat digusting ivascu eperon batalona barishnikov ofpra interupts rochlen reissman mazuryk schnupp certifica menacker rohdes manifester noncompeting eujust tranformation schuchter economistes aripov cathlyn musaab megary jankovskis husbandless sppi thulagi escalopes rooda inadvisably tabizel unacclaimed baathism eclerx purpled improvemnt nontariff consciouness moec aboutorab abousamra liazid rutlege baupin sarotte televerde southernness epidem kuwai chharia richieri teeuwissen bepotastine bouhail geldolf wolynes diadkova pelerins ceud tcpi jence gonazalez keanae qualfied pirouetted maunderings khachatourian vaccariello upscaler vasilevskis crasset toensmeier snugli ollson hpra scld icescape myfi cubicularis biop mdtf basargin batelle brewley nidri viracept sexbot devictor sirard amenagement dyches kepak diddymen rabidity mancin tonnato pressgrove galliver vixxen vaalco fumblings khvichava djodjo adorability mabrook corebrand ipel kaelke fuerzabruta schoolwear discernably deutschemarks boyarchuk recellular jubak sogebank micoperi izbasa haetzni sahri munkenbeck megaband koukkula jonthan bigpark jagusch overconcentration petrecca strempler teint 
kabando confidance radzicki thicko backflipped triantaphyllides atgwu nonslip ismailzai hoves resilence nasto insideradvantage canori devit gearknob irrefragable ulpd rigths exercisetv trackies unassailed strenghtening shakhshir visionmaster padera meachin sivuqaq merdian hellevang koetting microlender pinters acquiror manizha naftidrofuryl ledia creakiness fastco zuckermans mogmog econohomes dulvy zaryn lishchynska paperwhites socialthing escorza khamene cajanek geiszler tonked espring leggitt liaigre kasetsiri bragagnolo crowdstar samknows treillage emulative ehlmann steptext besifloxacin llinares pafumi kofmehl macaulayite disarmanent sdvosb sporanox philanthrophy multistrand insourced tropiquaria adnexus bobroff bubl nehst passats phuthuma emmisions expungements rootball chatlines gcy guariniello helljesen boultinghouse remarket pcrd impaneling jocelynn bougourd brownite bacigal germanier indys buveuse uncoached dhliwayo undersexed pautsch daniluk africanised sindani increas buybuy rikhvanova christains tamfourhill waakye homoine aberhonddu hurón itʼs bergsund hallac betj iranophobia yayale smellier gancheng harbuck gochar floorpans beverely bamfo openvibe carnglas presciption progesterones fasahat tresspassing sulkiness groft muuto prescheduled gigle beatrisa nanopatterning naeng jailable eavey enchaine beyou rawitz disected sennowe diogelwch terino osna netflights caherlistrane edpr modric fairfull unroadworthy kariamu vermuelen agcom ballestros esot spiriva wirtschafter crisislink breakins zermeno eurolat merjos cringey machiavellis obesogen ngpl scrra weikang tarriffs soundoff tavalaro bedevilling fragomen lahudood cipralex sepich koleston olukemi overcorrecting slopey bbam stemagen chiadzwa marketised pongsudhirak unarranged cintec morgansen ruez eurobancshares saadun serviceably feting kellagher overmighty abergils haemorrhaged balzacian hosue glasslab crespadoro interivew corsentino mowasalat sourest oktapodi galvanises coaliton compsych uncollared 
boutrous asppa hureira banpro karabus cambr gulvin allrighty coffeeheaven sarene inhi lovenox weigner spethmann pvsa teraelectronvolt mcop chinse eggstravaganza warmish redrobe celebrites dustmann roeselii marraffino arking heithaus rosarie autex harvati recessionista brodnitz datablog rebooking takeh krema galatica ripston cooklin woollands multistrategy zubiate concertacion ambroeus marsey hireright urazov tackies dazhalan tweaky muscley cameronism antiwhite postform mokgadi schwarzenneger croff closests misiaszek thaumastos wheeeeee tanic sempell butkevicius manisco embyros mellifluously chiroubles moerk ramraja sercel malesan caleton hrer arnaoutakis cappucinos counterpanes sisha evenett surkhakhi cermony konskaya hoitink recarpeted lmod varqa goleizovsky dcsnet carsberg knackering mohhamed agitatedly raybin bupkus ocensa abyd cpcn avalanched saril kejun hief ravjaa balkholme adamiya freedomnomics shaunte holiff artworker dunetz naspe unscrutinized oleandra unbuckles takete darlys uniqlock otbs galluci unsheath schellhammer freson crossparty mineseeker unmolded kadhimiyah håkensmoen pettiest kiogora swaid arhuaca tobold lanxade ghirga rissole xifaxan silkiness thermoplasty getwellnetwork chuanfu vicoprofen artworkers sahlstrom corrag gengsheng crocketford alikozai woooooo tabankin dipalermo gildenhorn concilliation troughed yolaine emmins exchequers papalo synplicity timpsons busniess randlay remitters vivenne knifings haileselassie vechile ibrado finanical rigside elasticised brightwells endrik buttry eyptian foodmaxx sopped consumerland nyweide vlahides teuben mellizos retrofest amunition farreaching torbit mellila kuvaas krabacher eloxatin hekkema thanenthiran mobiler kymani akqi xingdou illamasqua newgard zonolite fendry pakhomenko tenschert brusiloff bombiviridis trubey muralee almondine vovkovinskiy wauchob maccalla dessources dinte vicous marole lifesource zhangazha criticalblue fatboys jwad dingzhi bakoyanni marjam heimoff roeca kucharska singsongy 
hospitalise kuersteiner testrake veyance unkeepable absoluetly munire draznin amercians bodypainted pittilo aiada doskocil fiftyish fujayrah inists sweetcakes adacher casva hexaflouride thissara tortorice etchebarne richels hser wolwedans ception seniat menahi joschi bousses karolides paazab gantly regusci mazarron stainthorp aguirres safle mondro impossibilty predecesors trotts pakpahan sanaria turmi neibert billpay rittie achabeti ieci breeziest doueh hypoxico pzena cannada dcypher jerrianne barrenwort pargneaux linselles moamer grockit weejuns dudenhöffer ghafor crapness rechecks scruffily krmc diino duravit airbed bidzos timebends biesk flaux skirsgill andollo miscikowski arnika miteb alkmonton museles widevine soetrisno pittsinger ermonela mediasurface yanguan kotlyakov transcantábrico respon disincentivise yulis duroville htfc dusika ifec larence azimbek dathorne nbra reedijk polygot wheelnut boedihardjo thiazolides rideon quatela nondrug soltow sergeac gropings chippac willockx cagily downswings ruias ticor burnsong alexin cannucciari godsday wasch okeowo indistiguishable beyonc offiical phadia pisasale healthspace bachatas taupes embler ridiculas gaszynski strbske siyoung odwaga chimore sedapal betrixaban kimenyi semuels gelbmann sarsam chigishev jursidictions savain zylinski suddely diefenderfer thecurrent dejevsky dlesk shinnecocks toowomba tortiously griffeath scruffiness morizio inestimably nellas jittered barkema jetzer kushler aecio picowatt reevoo slsp nobbling nciia flosses vascellaro publis salteñas occupanther vallandry actigraph fattiest vasper toorock sbaraglini sureno zennstrom pumalin brumit aschieri grampp laclair noncarbonated adminster sarwal noof dalfen agem hogberg caffera hassanain kubrickian revolutionarily stotlar dqed ehrenhauser orderd alonside coproxamol brotherish eletronics andalsnes trsl bjorgvin frostier bedazzle cancelmi simendinger jamesetta lewtas squishier ladny pinenut ricot delectables vasconez tumultous switaj maltam 
wisewindow habip aliviane crimereports promac usarec slinga bernfield weaponising ltat driveling meienhofer makus constrasting loprinzi barneses refigure ipzs quibblers blauwet narrowish sooliman fasps ragoon mbuzi chappelear swantee workwithinwork talkfest tecco lakiesha bdrs akoh ashja ofhis visocchi spudding yousouf mahvelous janece hewgley pengana cablemas momentousness theatricalized jagh disaffecting accustions toama mothershead abck decarbo clovenstone clms mutualised agnelet rattal amoes governates camonetti gargurevich dimento heith karolewski tfank penpoll franklen energos inderstand scintillated tarpy tsujiura railrider succcess greasestock takanishi shoemate delemere tavendale implacability hallbauer lfec carryon dimished riniker gilbar krikler cinton murkiest krstajic gerodimos eastler dubais forlines ygal enfys anagh cresitello honeydukes linkohr oliker koczi dreich perik folfiri peaceloving dellibovi turbow mabanga goozner meinshausen agoumi airwick asssessment maldanado pachtman broan kmsa basirat dwtc taioseach filc foreca wahh glencanisp pepke surenos bizuneh haemorrhoid linkenholt zongjin latecoere debenedetto katzburg ventilates irgcn avmed shakshuka weindruch cappas prescibed overanalyzed ceglarek quian apeh hebblewhite ploughmans hammudi petesch kissas yageo albiev simec chiranuch spielbergian cappio onismor condolances webdale kucik dangamvura akec somini kallweit fractionals aminda samoura odce shenergy datanálisis hagelauer goldbergian murdani loewinger riddens connetion yanadi scrumworks truckdrivers bedazzlement trilaterally omwami ergezen loonier goedhuis yokokume cinciripini zacinto bergeman tribalization breeanna qizheng clydes oussekine delrae protectio isayas limbones mightly metastorm obano vanjoki massify fréchon outling shakili wojtak keffiyah sunesis upcounty hugoson loutzenhiser mallins dncc oriard churchley targanta pement micrometastatic umholtz dekom idalgo geogia floderus aredia taona cicpc realclearmarkets sloaney hulnick 
genr mshini energyguide blueant poulicek spokesong kransco gatefolds unphotogenic cockfighters anastagi stowells thomashow hartcourt kazza kebony bogliolo pornified egpyt samarskoye lagani khosrokhavar inextricabilis foliofn merksamer ploddy pimkina coutlangus minikit xopenex herdes umarzai yetty threathen allaw spyrus hvlp nvcjd stigson covault larizadeh nessinger kiejman morparia baquer timbit renaissancere rhaid divorcés nakarin nonrealistic wepco humilation tokoza hottrix pantsman vithy thobes misdials hualon abdiwahid xinpei selari solastalgia rodeohouston nonken kudrik electronuclear comag courtesty petruschke methar kudina unshackling shriven mahaley soggier paikan leakiest achrafiyeh furtherwick rishe caringo softcat bourride khudzhand purità perod kurnev newsbites duraflame reupholster keeril mintzlaff isotoner staycations nekunam tomazin banaa bradmanesque daunts vogelheim beetge gladedale exceptionals ssst bermudans keyhani zwillenberg flashguns nbbs paintballers emmalee aftershaves gwybod micta krejcir duhy puscau carcelle canegrowers techserve tfets snitty deliberators fatuousness kumuka skovsgaard montgri fishmore icli ansol lendable invesment superproducer saleslogix whipkey barraques sprigged santine oedi desensitising antiviolence arséne drakeley pugnaciousness geleijnse ijcic arhaus mwampembwa basketfuls fingerwork adultry pekgul bukaty delus gpaa tpct papastavros mcclam underpriviledged effient levenston ltpa ihda daypack orbusneich antidiabetics malysz grelier shumer nonsports profligates baibakova bandelli limberis corelation pantperthog hockenos jorrick shieks ghahraman crashy meckfessel aweau eservice jomba curvacious overcapitalized condroyer mshda novespace ushar ubcp fampridine yogli teeples crystina kgoroge fassbind brookly erteszek echterhoff bonefishing sturmia fenay panathenian momjian mustansiriyah antonick aaslaug bfei elsbury batheja duperval rozner photgraphed scharman earnse vosgerau seventythree notini wedgy rackswitch 
ballyarnett ganthier cognetas zacny reciva suljic amfibus orwoll siggil lareo utahamerican wardheer yukky cruiselines treelined hbsc chitau leitmotivs defexpo shoupe zhengs fotoweek fastpoint yingpu dqd douzeniers amcol gervich edrych preisendorfer sharpcast unopen capparell funduk jobserf vilebrequin wvcm cyberagent policer foroohar multicountry cheeking simens jaunary sarbox hcam fnsea munwha karokhel punchiness derschau shilda peverly velaglucerase callau khadhar swyddfa squirrelpox horsepool chesting supramaniam motshabi sfta dumbiedykes peskowitz bathmat ballywillan yugraneft rattly winterscheid bybox craftmatic gambriel alauya vizplex dekosky supportors kleptocrat effusing denamrk restino smajlovic colarado milligauss reithmayer theyt spraints portugual xsite guarentees disario lavrakas pedrone shiane montazah malakpour bounciest kaeslin denudes kilnacrott pfaender suraev levigne flaen godfinger chérèque perfomers ubbeston marchinhas kilinochi chatanooga genebach cambricum ovodda disapperance vervoordt woodycrest parsh montefiores badza lagattuta superthin amaizing deperately vspc roadtest vtss vieregge dodard dudzinski vastani castellito tullygally kinoki karokhail palsey milipol leafier benbassat demil tolpeko gilleard mcosker digitalchalk razzetti hyzaar algesiras valueoptions tdameritrade partenariats nadhem deitchman merlone ellaone sottish wehrwein iflo puritanically seduccion weepiness polycotton osuri withdrawls dreki penasquito laraby bimetallics windmueller stilettoes ahmud kovachik joepa winkled toursim paressant kozima stupple nestwatch chirtoaca madridistas dlodlo rachmanism achos flunkie almaleki jiau adlakha moderow cosit mattscherodt lafraniere azadian gmpta lemenager resecure tucsonans schouwenberg jabareen altoumaimi parraguirre citysafe rikleen tersigni lcdx someonelse immunotec hueppe kittila skorykh realini prolia perer ehambe hearbeat yijinjing zeitgeists libow yanny stylewatch corter isaps ukio humayra bosselman groveled speid monjeza 
direness tuleev aurp cartizze malayappan blockheaded multisymptom hfas cloque sanderijn gkpi kaskeala ichill kysar shrinivasan bingemann martevious bivb representitives tajarin doind shuyong bialkowski kuryanov portec teitzel tefap mikaya rabineau laphil junsai everbridge vapidly huldahl hinkles reichbach clattery amadie korisha penkivel neuhart subseqent arabise theary shamsudheen menomune aomar nnal ramassage footytube gayboy lopex mercadito boostrom funnymen mcmurdie glozier waney raiber shambled askk opaques localnet thisthatandtother mmpl asnawi ribeirinhos heredad janies nuvuk cheishvili lohafex lokeris hanooti nhanh unitholder rotozaza tratner webtech wpy amerifit optitex röcken ojani gallantree mcbl abdukadir encouter rathakrishnan wangoi pubilc crafer gylfe tebutt strugling delrish colomendy negusie ydri hryvna beautysleep nxea enfeebles ogboru rukwanzi ghostworld poernomo cooncil dramis ovulates vermund kuyl matemwe nongreen portovesme psycopathic volkwein apsītis onhollywood warrantees injunct enrd tokyoite amerus sieć photoswitch shdema coalco kroloff crustastun cullan ctna justifing satawu oleoylethanolamide decriminalises mehne breat borysik labarda colboc llwyr hillsmere milbook cvision kdhe bleeper medmerry genasense oblimersen homedics baoanan millest unsoiled nyias rukshana coastbound twcn procrit crymble cantankerousness mattich adiana diloreto fritolay snifters yately whupped ridouane joycie frontlist timberwest unstimulating braslow recalibrations vitrola thiopurines underyling dilnawaz neurocase signability lighteners schreiberg tlachinollan macin phsyical ritualizing drawdy newschaffer opalach sweady nwec struldbrugs purply yatooma sexted ffirth kristic utoyo johnstonebridge maringo charnota republicon anesthetizes pirla chieming compretta westerback flatish coolmax shamarr bosfor zwingle judisch ihamuotila saitas werren nkadimeng xinqiang nativistic ringbinder idesign raliegh bosanko hillan misell crepps aracinovo dankerode helcio guanliang 
rockspring felito theede welching julong ncin maasbommel nontextual wharrie gwyrdd trumark windups kizs snapvine sillanpaa defensics striplings selka anastasijevic redtops imamovic trendwatching overtasked garaicoa edles mycoupons servicechannel arbaiza dishonesties geolocators urbanbaby podowski prausnitzii cosying apuc triboelectrification shinewater electroshocks happell brittannia icenogle rsquo machery eicholz chiropracty tavey stumblebum mydicar nonsecular acuerdate undule muthumudalige stinkiest breglio sebarenzi woronzoff scheana supercomm mexecutioner jpatrickbedell healthplans tarica larowe comsumption ultraportables hijji yrg comodi gmita despouy omot ntsiki smarmily bolinaga envirocab llun brimmeier edctp poufs klackenberg fdaaa mickah bryder truckling vatagin alpargata efficiacy anstine palistine taxcut lagrell masaood leevers ospca truvo martinrea ehhhh shuval classily dishevelment effeciently fastbreaks pairts bleick lokker alkhatib pulwarty fedral nyuki unpredicatable manguzi thorsteinsdottir piretti bloodymindedness horizontina stucked demandingly doback gimnastic yuce overfish ulyatt wowzers campest petray prouser upback epratuzumab greeners francelino wbrt screwiness gismervik economizes jaheel alkhair bcny euthanising edgwick sciton stuttle veggetti laydee vrus toters xiexia lauden cunagin portentously galantuomini vaclavik nortenos cueller aramini trown polytec gyar lacoe puffier isafjordur moistly aberhafesp chavarro technolog hurlow oshins nasbe noncommunicative apnoeic temares coachroof guayakí caramazza alderden chaderchi allstetter cosseting mersley balfanz muskateers folchi racivir hargon fruiter topseos deantonio shuzhong buzzarté lhergy zinwa miscalls bagmet salaams saveock locksets rochom preszler tpus schwankert bvmw teamlease thosand bekkay kadakin crushproof restorick ucap kitabat fohe briso temporise dipuccio souare alhuda tranched ameris jolean chengue kipred dahdal wmz cohera grotenhuis impishness herbalgram vlassi dubiotech 
dayboat midcycle admax uncarpeted mobiclip flushers ohhhhhh mascaraque fargione zongfu sovreignty calavia mertinak huzzahs hardscapes vohr wonil zestfully semisubmersibles planktos bramhaputra odimba skulkers gonxhe elegua blamers yaer strangerer harasym morcenx eyg illegitamate exploitatively tschannen stuffier poliwood affifi parlyament huttary simlab raaum ibandronate nanofilm nioxin onegeology nonsupervisory egemonye spotkick wilmhurst leslyn lylia bonier holzhammer edsinger cowberries ciolos naysay huijser guleff killenard replating cosabella distorters simerly interiew shoreh barankin marben bartech sunaryo haygate bluehybrid saccoccia bunnyland dezcallar plattin sidewind mogaka multipanel cushenberry superbrat sorillo heckbert yanying audaciousness tarabarov rumormongers cnbv hotelplanner nallamothu jarillo yuewei earcups raynaldo oktoberfests pentrebychan timecards sturminger lochsie stojanowski nsmt gastar jovetic bajilan loafed laffa namrood szpiner marvet seebrig glenborrodale bonvissuto souchard companionably uryadova insiderpages applica horaire bastuerk feoh sellek adakhan decarr kapatos ketra datasphere abbasgholizadeh poppema behalves salahat falteisek saksin verticalnews shockumentaries peskoff jabbarin semmelhack chewables wearden zimmerstrasse mazdzer schneiderhahn pearlmutter securitymetrics funiture ngodup surabi dangour theier natcho pethokoukis apichart kuljanin bendas poltically hardricourt amanresorts tabery dynamex piconewton trogolo dalsass bilsborough flantz sotheara kolaches morgunbladid decodeme wiseacres ouranoupolis androsova covali cèpe flightsuit edmondus sonosite gawping caergeiliog kandlbauer pettem torotrak myregistry potholers snoozed fawza superlawyer haroutounian needly clavenna benesova christodolou sweder cancerbackup torita gazundering chilren bachchans giddiest embrassing funez spaunton albenda platais icier maamari ntes gillane helitech solerno ddwy postpubescent versweyveld effusiveness raikabula bigged haddah ishkanian 
monicagate weatherstrip sicr teleatlas igrc misspeaks hartbreak crownvetch logroll plauged mashakada firex turneresque rapaciously yearout forewarnings kuratani vraalsen duartes shebly fi ecolodges barthau touze crotonville titstorm mainshill matrafi devanei hannick reconciliate geotec norgle sutel sevene direko samalut acclarent cohabitees nahawa mugira fingerhuth beleiver staffies ghufron infometrics honerable innovasjon vaporetti herdlicka uncertaintly athlinks piccino marbourg abdramane yitta journeyers easycar clairoix baldisserri emblaze suksan sculleries moayedi thriftily mataban kulju morlin dunavan seani kuratomi canniest writhings grandbaby carthorses ijet tadier rehabcare czamanske straussy stretchmarks underwhelms etkins chinitz nvrs shvitz dougill immunoadhesins janneys videocam schoepke triptik mezuza utvi smartening heuwer spellbind hafstrom wislocki neej yangaroo jessies zigging mhashu farstone kishkovsky szorenyi menochet zuitube kirkhams garganelli beztu mcelholm mmviii innoculated comissiong fidelite spherics wellbeck coremetrics zuanic gluttonously wascals booooo caucases péchiney jadedness malawai magnetites matchet fogelsonger regionalise schave tmst coffen hamisu neilyoungi texai pixeled petatlan ghulan proir tarantinos axsys whomes mwando pianissimos explotar membrez shiffler jeruselem opthamologist juvic quinnett lowyck bydd sundareshwarar cablinasian prescripted atttack alsation fallibly cacutt overprotect dishabille heavican raiken jeroboams piccillo hollowly shiryaeva uchannel raile tackeray cyberthreat bookrunners demeurent outfielding matuzalem forestweb tolerx migaud pharmanex polykoff democ calcuations microcantilever achfary midwifed drumbrae wwxt atsutoshi arrugadas butteries apoliona chivan dogwalk marivi crescimanno blacklegged germanika slingbacks mukit forcasts nexar ankunda geigers belohlávek arkland pedott topcu comercials nelton barrachnie ifpte culinarian nundroo brandied trasviña cuases dpic psychologising topolanek 
futureit trokavec urogynecologic photoshoppers shiia perparim cyclamineus yadvinder reawoken propps gottdiener bexxar weliweriya departee fldr delubac mewies lisotta furlined drumskin muntinglupa egre surive celltrion calato houttuin howdle witkos wjzw euromax tanezumab damco giornetti biolife matisses frieds stormily maribavir heptathlons bluring charitible swallowable bengdara msdc centuro adetula tolvaddon akaretler cakar khiel ryonbong mcdarby letiecq honigsberg puddifoot afriq zeif ballotting baldaro educaton sperian tundergarth responsibile calpirg souping nutbags chressanthis soparrkar penymynydd quietening bloviators tarre wardhouse disunite globalmedia denerley pomerai ecrg airaudo nobriga dispaly hcrc gradd ndongou adblocking reaffirmations contentnext vainuku stranocum producton blugirl darbelnet heidelbaugh milborrow farabow yaros papaflessia vietnames pownell gijima alcwyn ozat kuoy baltschug tegryn mcneela darpakhel planalytics jospe wallmart ofman completism breul witlessly goodmon fasinating collidge grissini booksmart zikria orjiakor accuvant hofshi czlowiek unroch masculinisation sunber kimbisa waggishly adfer southernism dilallo megatrade cadas priestlands rosciano honein changelessness arbora ramattan masseron hironao nø ccbn ausmin atfer kerven brannoch snogged dexheimer calamitously brasell cnla unlet cayennes fdls ezj iniciative bolnore insurrecta adickman boultings peripherique marcavage ndvf falnama manettino opinionators thinnish mitrova milarch oirish moraski jrti blaenafon snatchings souffrant emmission terenteva balouchi gentzkow pamams opprobium rtdd dwon sweidan mazzaschi crowstepped scramp scroogenomics natonal woodlarks jaabari betacarotene chepchugov teborg dehumanises borgstedt xianghong ortrie heatlh bankcards tynesha bablock andwele stelara unattainably vansville kloes furlotti tromps cheslock minxes flyhalves harnal lasjan preh valizadeh allars boogyman gwernymynydd gimara microfluidizer hardily saydia bakkavor ventenac 
luncheonettes coshquin recoat broadworks valteri satpol sparest reynié meeja disovered bodgan constitutents opiods rtpark inchmarlo opdebeeck nonvenereal orgiva gunshops revani oracy whoof overdramatize genwal sasbout gilberthorpe weissenkirchen shirtdress monterio llanfilo tabreed baczkiewicz helmly yushau dolmio konkatsu schtum freckly kubzansky inverarnan slomer volkening thabani itsma mignardi sentate hansala frenki blumenthals xevo tasat pawky pgmol kieny felcman visualforce allgaeu virgis drasek dampha fairoak ysios mistele forsight balcons roitstein inuendos nicoe godineaux bonnerichthys bezner galuba worldfamous emiew decarbonised cleron whuh ladman mroue tsirbas sustiva clarinex attemsi marinhos carrageen machulis meea mulongoti giradi zayim whirred kibbutzes aquascapes ayap chapparal schnuelle bananana puckrin alfridi bysouth waterwell lushes overinvested kowtows detangle usbg congolose risbury pdpt ffom pehub smuck nfsp simring keyamo omoyele dabinderjit hamparian pixillated blubbing spaetzle sönksen wilcomes disabilty pamart correx glamorisation zurnal ncaf rauterkus prognosed lenett lenkei karrasch verfuerth delcath rehad segerstrale distrubed puamau webslices boedker ahogada mershin elmaghraby copling paneque wowtv purnhagen coplink cagin hellotxt malaren dobrovic glasgows duann nylag philhower catcalled pinapple surburb kopetsky hypos teenhood backordered kokoi iapv khastoo kvinta stageworthy armanis kurbos benchlike tyhypko xiaochao filches ngakoue banyjima linkups gliadel ealim tschorn mudzingwa amberry rissing jozic gemmayze verenda politcians enviromentalist meleady habarugira desfosses biosource srob slanket garazh galafassi nollette dasanayake deitzler yewen nster alcr laluz talgarreg restfulness nmhh immunodiagnostics befogged samray zanchini charniele dbes taimour llok omgeo salava kuriansky oldington ledue flunisolide msim zarudneva balir aaric varvasaina uncapturable dakake beatlemaniac dhusa stingingly blankies khaddafi hadman vendormate 
elasmar sgpt byamba luvians leejohn marcoci playfighting kabballah carmaking mcnorton khankhel collateralisation frenck wallbox glimmered puligal mersiades billionairess friskier branstrom horndogs paulann nunnelly katsenelson kulchy abdirisaq grislier mylie hodac buschow schnozzle goldia tolchuck diease zometa echosign tindyebwa lahovnik chamil cefx vahidnia dlask bandarin rhani meningosepticum heilberg ecclectic presidnet scalici dentistas pouffy lavinge staysafe kossangue gorblimey furriness gourgeon weeren discimination labordi execrably ogunjobi disater sceney pichan superviser vekic fraility reslizumab jernvall hillo pluckiest ireports rowinsky mkvi meglomaniac giradeau laicize watina cohibas iecee pulmonx fintor echazu jugulars laygo eght bioservices cornisha zondas greycrook franzetta syndric espeed unclassy chadsmoor sabril raffinee impingements smartsearch shipmans mbithi chunquan mililtary toremar obamacons furanones vayrynen rizaj liquescent expessed totvs hellmans saifulislam srbska edfs attcked carbfix migereko hillstead wyhs spoofers hosepipes ikhwanis policical yongon stoneyhill charwood gittus valvulopathy dedring dustiest jankovec colcrys unsoaked longyis mikhalkin mndp digney splashpower homeaid santano berragan agnieska seanez azcueta watanagase planemakers frontwomen xingwei freebrough swieringa lonning rhinogs michni gillece toppenberg chainsmoking chattily currenciesdirect seasonless elwynn singlemost barazza wakodo demaci keyani muntadar plimsoles hembrook gruenenfelder ninewah portege interopnet binikos globic papco wernyol clydesider klish ironpants ultradense wanded ikena kéréon pesznecker waycott comras seveali jerrald mcconneloug nopporn speding arduthie adirus kopping martiza oncy rowlson hamhanded lenchwick ngsa panoff attebury capozziello sarvananthan malkovitch dalís michaeljohn imrg skrtl wurmbii ziaulhaq saposnik silecchia softic shopworker gengnian skycouch bovett treasurable industrializes chicozapote bromidic nervewracking 
rococco tasir adjagas gasparoni homewear beanballs chiense inkaterra veganic kuppermann suspensefully badrah trémolat decling doomster hursts cigarrettes altitudinous inconvienient levra spokeman yushenko aquapalooza unluckly repsonsibility magrittes edtp neckbands waitng grimmel kapris triscuits anaemically feinsmith montuschi molndal anastassiades woring anabl rebublican ehrenthal likel grunenthal couthy tnav ontarget upperline shucksmith bouverot arfeuille stefanello ffdm minimills cnsv hellowell hvms naug mapjack munizi agenices nonnarrative saviotti aleyna letterbreen swiftie transcosmos mcaleavey mamand baetge carolos downhiller streetlinks cediranib sooka oatt kartoum idacorp macchiatos xonacatlan boylepoker lavorante coutino bahrampour wearings gnomen chironis rauzi ostomates cieneguillas merchdirect diguido yuzheng bullocking ginevan postprimary soyun junsho misconnected anleu thundr tortilleria beache sunhats hardouvelis  souz avtr excutive stenquist ruchbah schlip sviblova qureish helil osorto yesodey galvins hypercore craffonara karbouli sebrina lolololol donnachadh romatet mdvip moderateness edidin facebookers thinprep royere koping diasaster berdymukhammedov nicoson hellfires essaioi unctuously meditor siamwalla pourous sordillo gcla aerogarden sexts regrind gidel knolton ketchmark kucova redcarpet marwoto zahary siroty pwsa zwally ozden kjellander elctricity dunifer rheos kowaljow policitians ttsi yolie cusanero stanislavskian gurthrö provent brownsberg dehavenon stephanopolous brookgate baribault hecklinski qaneh minivehicle spiegelstraat lenette zaninovich bacharan expan bodybags garrida divebombing trichelle jamshad jankins smailovic clevage wvcs visisted becomin demine pentrehafod rahimian nassri toosh kichler bratke smythers masoomi poborsky sítv honny pottered ishac cloggy uchytil umitaka kyohwaso mmboe kimjang quianna tufft keenyn federalise touchtronic prouzel terorists stoelwinder magdziarz proliance deloittes tecchannel poddars socialst 
melafind wathne mkiva yoggie killstreaks whipsaws puoy eopa cloudcomputing faultlessness managerialist nuvia magerman eivers overpromotion yanbaev cupas buczak datafinity sebban meretsky mutawassit centertel tekah voase lafico gourgel pashton xinhau schaepe jaquan updegrave parttime mpeta attactive inconcert abgr pomahac sedenquist hypocrasy baniata trussle axiant catchily brovik phama pleet siphokazi mughniya mesz ctsb disloyally zajaczkowski environemental traeg pinnies verc kryostega mjsbigblog itinerans jaggs abelino icpw surfas mccullar hyperdynamics bobier dialoguer dodgiest chritians bgca iavoloha langleywood krtk tyrka mccanne mammels saccharolyticus kameen chisley awyr versfelt nyone sapropterin pureplay mosaka divertingly basejumping khilanmarg dankgesang qalyan scheucher delahooke anmyeon lyutenitsa madyun hrubes parsely meimou kozhemyaka shuchat industrail oxfeld ncbw gulalai schierhuber wriglesworth bregazzi wisnefski zieser ynon sexpots hyperdunk manbert chitengo ptcb ahlering zagaja serotsky tanatside abassan tulshiram banchetta stuelpnagel inurnments envestnet ysern selloffs fretlight roitz stelma xinmiao noteholder jacomini dimicco recidivistic montengro sakihito spaceview almasmari mavy purnick kakule shaffrey parches superstein helsinn lovobalavu lotensin nchv jailyard offsping sarler causeyside fazullah skylounge coryells demoralises shubart raisiny kutelia feedmill tromping pressue alecha cakelove pancrelipase pirozzolo ehrnfelt runningen falteringly azoz metselaar hollee activits zannie boltman aryashahr proventil lonsway rvrc zhongce preauthorization boikarabelo marazul khalisadar schlievert leasebacks xueping mchendry butit canfa oplinger nevrkla hamdaniyah seghatchian jermantown menick maštálka buydown analysists oxbo szabos kerusch colleluori raincity konchalski slobodzianek imagesat crossbanding ivania yaghoob meain selecao toked kirstens savvidi kaywin odidi remanufacturer ibhs garacad vaijanti nellor dyax nzsx souflias fishworks ponten 
emcore luckraft braidholm lipsen bartecko batterings mohallim amoke weyel eees xomba erhman malenky lavieri bizarrerie lancz partnerless ances needlesticks ansfield reciepts emeter eviler busalacchi jeukendrup alcarez erdbrink vegeterian moamar gptw mauvernay slipchuk kaival prayerfulness carlozzi bintel kukic hasecic hindujas kleptocracies uptrends seaorbiter dettling brigalia repella kipros talmidge cluness billʼs haddows musicpass soetikno kornitzer pittburgh impactive mironyuk goteberg attenboroughs pescaíto aholes bremzen tpma scordo hepatologists econned efejuku structurer supporta urssaf ddyfi grudi onlin shoplocal osvath maheson vermeesch swotting lastrella muwonge couriard iproduct straddler neuromedicine bezard marchiony miceage dosmukhamedov transmyocardial barbequing biemer markbygden institutionalises avmf fizziness untether hotpicks baldiris corperations carollton pasteurising tsking civial ringstrom orangette gallicas cocucci ultraswim jvania mustert sixapart volkwagen viravan moiree amtower dpuc schwirtz randeree lacole iwish pellacani oddicombe merryweathers cighid overdrinking cdha serrailler abdelbasit ferrah devonna wayto bejach heinken bmxs miligrams vivagel mson cynnwys poltroons farmaceutica inimicable muawia bristleworms tatsoi gegenschatz quipster pnemonia pettry pareos philipshill truculently xius bloomsberries handbasin keehner gammick poutch hourig cheptai armelie rogering credem isailovic lealamanua quamut przedmiescie carmoisine harriger mediametrie kastrinos aulbach zumra criscio oversweet mcnenny sheshinski shasteen rbts collegeview carveout towerblocks genuflected godfilms robsons toolmarks tomlan schieman glencarron reagonomics collagists kennes uncomplicatedly thongkongtoon salesclerks decimos mircosoft overtrading sabaawi syrahs concupiscent adede acidless robet parquette septmeber akrum datebooks paradors arauquita bucossi comdemned ephemeres dachelet queiró telcagepant goddaughters rodgin mentaly stepneys laczko commoditizing 
oversaturate choriogonadotropin shtum taubate nonregulated kassid bekking knutti pickable vaubaillon kardasz redbulls tabacón leathering ironfire whipsawing inmans tweezerman bcbgmaxazria fishapod bruha ojdanic urell chibaya mursaleen tumpach opunohu prizel scrace poliform duncow retrogenes giovannucci catnaps schratz endplayed taigs iniparib gereida ecoupled giantkillers esmon harway charicature kovell fidaa roadworkers aaphp whitelees collapso yirrell llywd manging reminescent qatami bussinger tomskneft undercharging incumbants comanies etyen huayong knobkerrie puedpong canalettos shiastan levittowners pories dechane toffeemen poeu baleegh traficking mmwave buerck muré nonpoisonous meddyg palisi sinuplasty bluffness boumelha virent hatefest kavishe mostaghim markiet carelink machie overbright johnann coralling octabromodiphenyl untaxable gyegu responcible herita exhibiton ramchurn mazyek beuracracy humanises geither hyfryd meatiness vetrovec gruppioni edof uloth rampersaud knockbacks overmastered pearton boysenberries resole adamkhel officiale morestead pishtacos overproduces sacrilegiously reciprical logorrheic cohabitations iannacone ppan nutrisse waicu curfewed onanist aboudihaj scratchmann glubok mossbay rialtas garibotto dalcetrapib zwak taliking cantens zizola mntx millimole owamagbe trentside jackalberry tacambaro prudishly muraca bilefsky consulations xway philanthrocapitalism neverlost gellideg baverez debens lifto siddharam nasariyah maasz nezamuddin apotheoses mwangura papura saymeh kufrah lodrick itronix moldowan lefotu meglomaniacal mohtaj ptcda aghassi vanderbuilt dagunduro stason pacn misconnections tydtwd turnto geob reassortant ryhurst nelso amortizations nktr aousc rizai freris prochymal outsted drelich turocy vourloumis farchnad gubernick chaping sadker penfed collegeamerica macdorman simsch starbucking maryfran mohanram tooher cotliar mccrann lvhs triaminic vbvoice copehagen midgrade shawni birkelbach geoapi coifed disapearance pressie 
hulhumale unsown bilks powerlessly acuras hamchetou budburst ukic wetherhold keinon minues congressdaily goolding bosasso lafeuille propertyfinder ncib pelem testimoney naswa microsavings bondioli nephros anbaris ecycle telocation lawniczak fanball ummersen warmists faveri irrestible popkey solodyn entrail stagebound stylefeeder placke rozerem doomsaying kwangba bpom highchairs corteges punnishment reihill pedofile steeber necastro britnee muzhakhoyeva winkey babyboomers proscia bibw derrieres zandanshatar boecke stybel wintek websurfers inury schornagel hiree thwap unpunishable islamicism sniffly bontan tryscoring penisa ekaitz strihavka detatchment febreeze gonnot cchq wilcha paraag rhianne ensha fanok darious hachigian schoose anticollision poolton seekin multimillions taeye melmed whiplashed seegars thonhofer rakieten blueworks umsted istreamplanet disintermediating kezman abdirahim olhao braciola leathard satnavs hackhurst polyphenon responcibility boinking stooging standardbearer soussou lajdziak tippex athome persisters iwerddon creachadoir wildstone citlalli weissflog mexcio staford hasanovic sundem ythe perfidies pielenhofen chrz clucked hensrud ishum akgun shuneh indeginous gapkids twinstead callater microcephalics mipdoc hungriness tsco rerp vmag overperforming epsco acoem auditel marinza adeley gallichio farci kalanke shakiri koblik apointment galgael lacelike ivonete girjet uffner photofiltre taware petville jethrow reliberation kalvarisky xixin worki staffline pgmob balilo overdesign shailagh keissler alminova mussayab vtms shilled sulock theanyspacewhatever mourby libé sonterra corefirst xoie gilgore innovata torbinsky deinstitutionalize capolingua randels bowster lashinsky pikulski roushill douched ithin visanet ballarò groesfaen kameaim widmeyer protrays ellabell cetnik buidlings aigcp gidani cabinent mailshots shoptime deyton swankiest rocknoceros glamorises vaunts jacobses lamplit jeacocks mediakit rondstadt paquay digipass frisselle satchu 
vervotte snowblindness irfe arbitraged slingy superpremium heisbourg coziest frienemy bioforce ogryzko yasawas witterings knobkerry wouldd senie sebirumbi larynxes rescreening sadeghieh buscall qualstar uncollateralized dyudya misfield intot madrids rongkun zepf oppertunities bierma plocnik stinke cushiony boxler kayitesi gullable correctitude towungana michcon lösche postlaunch fxall abdominally bloemraad rebeuh rebwar medla akwesi califronia recogniseable morganchase ritsaert grinney trussells rafiqa revvy koonings teclas shelbourn juanicó philipina zdanowski darsley chunxia kabei andriol tryers dfls jesselson waziriyah hirael nagaa gravitz choedon echline carticel fatteners athron yendys soaringly crackel verderosa glossiest listpage mesalamine minujin hanaka recrowned schervish richfood affy breki ackard larazotide huahong gudgel coroneos weatherizing ahoua mislay ancramdale bordoli sacopee outburts varjabedian komanduri distortedly cavalluzzo stifado gildings hollfelder rushmoredrive spectular redelmeier godhwani weissfluhjoch tyreman mcginly jackness thinglab mamlouka hypertargeting fledgelings fullcourt ecomomic ryklin kaufelt rouches teamorigin southalls djugashvili meillier resonsible indomitability excape netherhampton crummack mazri corun rmmc declassé webernian crestliner baruchin ryohin schönhaus sucessors rhymetime crgt scalemp edmos barough administrable kallmyer groupons recapitalisations resitting karamagi chittka overmanning carclaze blaschak galouzine daychopan mauksch cmdbuild clockwatch fruteau jatas cmws robata wellmeadow growt freakshows zinurova calliflower cannisters semitruck decsions latifullah jingchao childres kalarikkal vcnetwork cusinato kayabukiya gottcha carseat architzel twittersphere ellinais itpro vicosa fridy logistex ecda blackhills aramendi pareco searer unstack cyflwyno otherized nairashvili nonallergenic unrented mangelsen eurpean mckeirnan frasinetti willimas bangstad kawesqar fiending msxi artumas visvader inishfree 
canadain hemopurifier aritonang donohoo hezballah samolis kaminskis mexder mugniyeh jobid minimall oaklea restasis eanet spichern glowered highters thobeka txtloan janets sehc sutisoft dutartre busybodying vitarte rongwo emergis scamwatch gnezdilov telefusion dyton nutballs fobts familian lauriski hoketsu dispised gheith sterilite smellers parsani pechenik semaconnect baghdis trachuk cutress vanderlugt potful drauniniu colorectum mongerer spallina keepaway expecations inceased loveluck grumping atill longevinex willerson hirgigo peoiple umberson clevudine pulper ciolfi atherothrombotic nhpau outsanding finzels earthkeepers ketziot slicethepie borowich mceowen comicconnect damnjanovic rebkong palmgreen badl bhgh scrocca squalour hammocking tihwa sciame mandey bially zorzoli wasington diplomacies henglong lethendy loyens mwda neigboring calerie sevylor szemberg destablizing kreuzinger chandebise starvelings babaoglu golanski healtcare spiritueux nakatsuji elephantitis kanstantin uninvite snugged pharmamar passur ruogu hpna davaajargal biscaglia bumster evaulation choppergate shierholz shanmugarajah unfortuntately manoschek souleimane eskilson sweeta pasoans sophomorically szakacs intelegent guffawed mahaweel inlayed deniably staba corrigans rhapsodize mistfrog quikcard nonchronological dawdler wybar nightster ignominies keychest mcilree kurodake ranit brondyffryn chotai maedgen mavuno andipa dicketts reincarnationist tuayev cadfund lynnerup tbed cyberdefense ameriyah womick ghazaliyah gissendanner scenerios dooking pantigo nonteaching congratualtions waufle bariay purevision elmalich mortgagebot hellobeautiful slyest devaughndre drumossie superhospital bellardo kocchar heafield nanzenji bizwiki taddonio relistor electropsychometer maritas claustrophobically fanah coonie yaoping mcneley birrificio befoul deloire gauffin pongcharoen nardolillo applanix munning drec crucet rossanna rikhotso kiswana thanksusa trailblazed denegre dazel trafnidiaeth xoma superferries 
lightful zaafarana provea ahemed pansea madhosingh khulumani goodhealth audeon estruc accordign dunkling meddlin urumuqi nuren birthmothers asiainfo shrops campagn everyboy hissene egnal gurgl ccmrf efama seldens sohair cevis fujihata seex msgi amerigon americanising collission corrons chowkay produ mcanderson quotational lipot blatstein aucott breaktimes chatterers delahoy rohsenow flipswap krahenbuhl jourquin georesources dressner slimier potupchik cpst clubley sleety ungenuine vdel rogalsky janullah blipped sourgens oters remainig destor tamelander chausseestrasse piroli nabco brzak afdhal sriprakash karakoc torregaveta moraca funwall shurbaji kupetz goltzer gelband departmen zaffarese tscharnke chivvied underskilled cenziper englightenment bresnitz mapule percussiveness knog blwyddyn cinquin yacon zhenliang ofrasio sfari auyang claragh shioi huelsken myfoxatlanta teixiera ananenkov shoukhrat rameck ducroux makondo grueneberg politov mlotshwa roeloffs incented moneris banlaoi laregely derserves clader hmmmmmmmmm unpicks lakshi kangho herapin krystexxa gurgly militarty podair noooooooooo tayberries bouffants kornum losan starvest ppda hypocritcal khaisman rubashkins demofall sweethearting trasylol freedmans mvis goldmanite motobike oneda sicolo figueruelas scriptapalooza dorrier advertently bougerol schoonveld larushka phaseouts fiterstein oremland geomatrix schlagman wisegal holyer newhampton engery fiddaman lightpole hasiotis gasm milnot pearlised kaloga lancor riqqa romaan sardone thorthormi disidente quaterback pozzale oularé rhythym beglitis highdeal fiftyfold afghansitan magnatude tabosa ulzheimer kjaerulff knifelike corraling pratfalling disolving hahvahd romeyka shites makhzoumi seggie devicevm licuado newboys hanchet norooznews völkers dndn bulicame helmsmanship different unreels greenscape ivimy negal leaud bhumibhol zenobians skett kvakhadze hiliary domboshawa grovelled jeré alphama pliszka childeren sustenna evidian bistate lendle arhabi steigrad 
zentaris acromas shoegazey surgenor jalkoti startrans badoian carmelitana dreihaus unilens illhaeusern clintonism gridlocks rassweiler vaernes rmcm nonjudgemental nullriver caipirinhas stuffily navyblue fjera sotio kerness robell tamamoto paefgen metronomically caison vanter undercapacity pashkin bankroller iniatives couglin nassem unhygenix rumaithi szejnfeld torrentially bognanni tellies gennara blynyddol bokelberg antoney arrivers sophana tushes abaurrea qiqihaer jatoba miswired sssd quandries margets unmountable paolicelli rueing vukaj sagid gwragedd technogroup goye gamecubes creandum wardag nanoantenna encinosa araton mulvin decentralizes tromsoe technologizer musayib reprotoxic £ nigbur monzingo counterassault koopersmith mcclesky menactra levenwick hanaan voelckers greastest bitchiest inditement muehlberger beiong theatening mellas lavinthal inititative intersquad represenatives roughhewn hallglen juliens intreccio magallan tibilisi langinger djelic stroia atlhough wiswe brickbeard alyawarra borneon pbsct dacra cheiffetz privilaged aigurande tranquillon frapin syphers polygrapher underproductive veleva shawcor stuas backpeddling koerfer mezcals closley namenda ferrarone etailers nabanga parmlid karastan gougères oversulfated lubéron ehlke narseal salihovic shikse sumptious safetly billyball colmant licalzi mullahcracy canetta singiser aswirl jarbawi scapicchio dicotomy profauna raclin bronger hahaya cygnids upstreamed muthalib rqi racak socalist hamantaschen aqeeq borovic breathalizer blackington abtan vandemeulebroucke sodomise zagornyi cibulas bohlig kalaje guilleuma islamofacism bastardise vantrix kochanov texaplex orcl heulyn diangi librans toybina mingtai foucheux nyoraku trixibelle wanggang ibison courtway listerners cariatide trishas resynchronise carpinteyro masciale seriocomedy witech pinksy counries ndemo emtriva losson holbeache whiped sugarbabes mondane pediped zubakin ifshin warech baramullah llowes magtira lukehart ergos swopes yaam cloddy 
omniums iruretagoyena biberaj naward nikolce shklyarov kasraoui yakexi causualties bewleys inbody gehle calandriello birp kloeden vidailhet ferrety burlier demagogical fergi phinnaeus dayami tibetians kleinbard ponturo visund lightproof mouneimne carbonnieux ichael teeniest pysche welie rittenmeyer olaim nytol goldworth centiliters veiroj eural pokertek forham elidel stiffelman amrami mcleister unscrambles guriceel dipso sanani wheaty pushbikes domaille generosities uncowed postretirement apollus granularly trugs puttered ghaws humilated clabburn camft mstation masne uscom indemnifications bizcom bimes stheeman pathela konoba mcitp marwolaeth barrowed sarachandran cherman mcswiggan tognarelli amnestying sawtimber greehey troyak breakfeast apiafi ofthis djoker gulenist abinbev yahye bullhooks phuntso motaeb flybus kanacevic hooha mandzukic safder bobrauschenbergamerica redlihs monthairons unarrested mcphaden hammertoe radnedge drico reinflated mudathir mindup haozhou latinode daejan orasure thrombogenics pithiest unfastens danglard dvortsovaya nympholepsy immiediately litheness bolshaw nonremovable akshayuk megace exalgo ccsn cmarket tuckup sosinsky filipini atatra lichnosti becharam fousing millberg supersector baveja scansnap shallying cosr uncensorable wory donig rungan fouhami olasewere remindful myriah crickard bogatin ipcm pornification naddeo hakkies topnote majoriy breiwick subverters spde felux krauel watherstone develo serping koukoulas amdl imigrant muttonchops tomnacross piersen standoffishness ballieston cplg trublood burklow kettenring outshout auctomatic iveth yusifiyah radioss camarthenshire matshidiso xlif schnobrich dyddiol ahmidan olympitis skalleberg sharwoods joerres netcomm belarussia jarawas cowardness tummies maridueña marwyn leevy oosterdok tocquevillian videoplaza kosachyov bitorrent sloooow ricinine husarska luned poweful staglin sculco penhaul hostopia requir zelyaeva honourables hasselstrom sairr razrs collobrières compny waynetta klontz 
kerick mopi optimark semiabstract fnia accelerographs pluots takishita hostry ayoreos microblogger fausey blahoski crowngate uncurls kourakos kerzel tooman babyfaced overated cbay sportcombi farrows calanchini mellowest trollhattan volach freijo damanged inhand frubes daetoo onvoy declairing mavroidis imporoved redniss fukino cering luthan gywn bellinis filgo ameresco imcl melya dasenbrock proscar vpso intertalent elitetorrents jallal cirtain tessieri piperade bluehill beruwela indolently ghenghis thromboprophylaxis prstore amabassador quintais khunkitti chichakli tyrannise progessing stepgrandchildren toeachizown intoduce fifton englan kabum slopers pasw gwag rostas menstrually accoun solartaxi paillette provacateur migliano untoned nefzi seonaid unstageable evolet septuagenarians schmoozed botney ineptitudes unpotable bamfords lebovitch yarzeh edgler hifx baldarelli weeing palmary interupting twinspires sadowy tazzyman plantscape nvicp yassaie dutchness preprepared shafan counterdemonstration boertmann glaría jancek meiquan goughie zulehner standardbearers vizit plumo rabouin shimkovitz udalagama tongkou zurzolo rathr bexa licketyship siumu highjackers kruszynski geobra pacenza faeza sornosa gallowtree farookh huitson calabacitas ibeer greenslate zaney helitanker souvenaid basravi simpers nobl arkeith renkart rhapsodically beautee gazgireeva acuo keshishyan eskiimo zerotruck skyfuel naroth wienert bessye netequalizer premerger georgiopoulos nutopian devlieger sinapse sharlon sexstone penichet angiotech barandica arciga aaden sherifat sowetans endotheliotropic nsduh pontificators chappaquidick abkazia goodstadt barnies flexidiscs korinne szapary sigersons prokopanko cggveritas wilhern trusdell jantzi caltroit inattentively buiness kvarnen vistica mcmuffins nuvis kodas karakoy naifs meglomania snappiest garmont jints apocolyptic panger ramic hojjatollah nonsuited tzeo charlena gereffi educability mushada hudetz hardheadedness pagnoncelli jovicevic exoticize nanah 
wapakman judys stratex congresss admix bekkouche basumatari frostings nspf mahlyanov bartnoff bruckshaw searchings intalio samcam corretora menday benifited muonelo rubot coachloads mashreqbank ceausescus insited strangleholds servheen heska mustchin welldynamics ajilon fomenter steppingstones poqui cortadito antitobacco listmaking incant osbaldo edbi schmutterer salorio feebates rochsoles gowning reprivatised cenicola uroda annnounced toifilou islandsbanki barix makuxi cachetes auchleven mplsound forseeing vandvik navane loftman pepperstock leaguewide gottis reinemer francouer joevan harbertonford timesmachine upwaltham locf abuhamza rwsl taggen huppi ptds mascioni mojtabai caukin excellus mxenergy miralax pixxi pourfar microbrewer coprorate affilliation trexel zapien starey gossipmongers bonning openpeak seatholders discerningly pannaway räth sensme goathorn yakutumba websky vidaza opiated bedenbaugh neurohr quaegebeur unsearched abdurajak dugat bamlet sbux jeager soard humaidi chatshows rashpal oratz cnnstudentnews nabokovs moneytoday hipocracy iconomou reoganisation liabilty congess giancoli xcaliber petroliferos agrisa miszuk secuirty wotorson abobe greenhealth sensless gronbeck stratusphere elciego misgoverned ekaya mccartt empaneling bretherick zinnov rodnick opolo azzarella irela starroc mouthrinse holylands cryolife veppers salatto charai teychenné unpractised rgis chemsitry atrovent sachino dudeperfect kutschbach koure unibrowed ejaf brazille xuebin tivit bydlo imiev mobie yunchao fidolia hamdiya razoronov dosman ostasz alboher hakas scullys ovacik straehle benayer obendorf joosse hlaa masselis photocoagulator ablard ronnette aankoop honaryar kohlíček hargeaves dunned comdisco narks nadt jaroenrattanatarakoon snaffled couoh rashbrook anascape kayonga damam henningfield figlar svrluga sadatullah olenn obrycka mandozai mastah kandoo nicom itacare admistration sugal saynez macroplastique sanitisers khreis beauvilain bamattre strongarmed tirgoviste pecoc 
marinich cruysse pultizer knoell textese floeter disports eatmon chwaer paycheques ballysally xanthohumol szkutak stablest yossie conking mazaika gadari scottà jelden smeargate islamberg professiona toudic terrrorism bootcut euopean kiondo apaydin donckier altiere kuettel manorcare yeardye oreign sluttiest giddyap loecknitz rspa ignomy demeuse loway turgeau depomed bindmans orderbook brenkley accomplised ustaoglu cabreras posang palermitans dankness thefrisky retroaction kamoa ravdin whirlie masonary mantris obam sargen webbington stefanides einich relati sivley releived suscribe muraka saegheh sedor miskicked covd inpart assemblys mapplebeck recapper exfoliator solazzo yogaworks paynet pensants kulhan erenga galick morgage linkwise cazzulani lacek intercontinentals workamper skypein nonelected bioarts standbridge mcil fendering lisneal kuretich alikhel friedgood olwine tinglingly carseats horist drisko terrorisation solagh skachevsky robotuna diivory rewrap schleuter ribcages foofwa drillfield goriness stressman pianosoft hurner licentiously olekas underlit monogramming ltvs grabbin widmyer johannesdottir markwells sinafasi kapitannikov casasanto pohorylle chirrups yamanya avcen yarusso nairz fishcross guarrasi valutation lamouret burggren kilinc aaim correra zarazua tunçay soumache calebasse innovaro ikhtilat naake steingold nuqui fetullah ghausi alwash muvunyi mittromney alapini viewspaper kapidex deborrah colorito lounibos cpaj stng adecn okayish posilac immitating disfigurations inzalaco deathcare khajepour recyclage louiville lenval daraio grasstops schertwitis dubbya hirter skittered tslf yenilmez preannouncement arangements bootay remediates gaidhealach sackfuls futureu unconservative ashleymadison vancsik biafore favas interferers moonridge pompas mrkonjic dormmates gruny caputured longhaugh dugue urbanowski pentling mansiz zuritsky wischmeyer adjustors loudhailers dbsi fdns mohammud gildroy mandleson bruegels pamfili audon powazek gabbers xuyu oshoosi 
kgia hitschmann leisurecorp unbolting polanksi wymeersch spectrial sculpter ringingly savicki tamileela frostiness octavias multigeneration earleywine cowett grubtown soporifics myfoxny constitutionalize abhore fenceless falleur nigol lotting motari ridgell burches ariizumi ngosso rukhadze kenlaw komejan repentances scmg dolmus infc kothgasser mutsinzi rothensteiner abthrax obesandjo dgmt shishito thisyear pogea changdu sabree screenwash skiver botul rasker velensek lppv melexis rotomolding turtons cathys simkoff youtubed eaee veloza gassick nebrich yaers schertle earleir shigeie girliness ardagna fonet recr dentinger synchrophasors nmcf reinjury kerrington setkiewicz fuzeon zostavax quarantinable knabusch skaalen vachara mwelu ascanelli tosteson balkanised saltcellar duplicities merendon mesterhazy mompoint nataro impoco pinfeathers arcsa reinterview lombardis hidrocapital palestians durao turgunov misn xinchao finncap ishmaelia gsces childraising fenerbache kenmark ruices dignataries darunee camwell bassarath druillenec schmitzberger condems forcément lowensohn palestines boyos desparation fifis funloving cyberdefence clubbier bissonet svento kleinfield tuohys digitallife luoshui peranteau svetkey bevi racily forodesine addlepated unicare slaggy noviye antirejection fernstrom hannides urbutis partyism rugambarara edgily bigstockphoto milns boersig bonxies tayet kronmiller pfaeffle quarterhorses energid kivutha fiercesome achillion shakibi adawe tedactive fleak broadpoint caprasse prescriptives pasqualati herzon sigvaris giampapa weakend neurovision chinaaid toxically wahadat readyness hipotecaria hilchey bohua clearport quelynah ukcisa mougeotte roft umberella skiercross fücks vdoe zelnorm larrson shishmanian bakeoff juluca schrode murithi calvyn unstuffy kremlinologists gussi itrip bregg goung odones hasselhof mudfest glugging nahhh candidtate baltierra aarika nonflowering nakasai fandila qibing hmmh langemeier multiculturality vidricaire technewsdaily sarasi 
aksyutin camellos loyds rameys oehm disintermediate suduko ntumi taparko overlearned outpunched dyann vicitm obamacan accessorise kabateck tstorm spregelburd moreillon beatlemaniacs mismanages sabertoothed alliyah arrowing nyeholt abdullahs pearley accessorizes lacarte onesy somolia sommermeyer glunimore noncancer zaidy farkers hshieh guejito dailybeast euromarket supplementarity tresemme chibale longhaven korolyev arfat superciliously beserra farinet feic nachchikuda gurstelle buttkicker technophilic perdziola mamhoud junqiu murfi thinc spyhole pieron cammillo unplumbed townhalls shufflewick peopple penuwch royzman schmaljohn galmes centralpark emop rhayel doobee semonin gdsn punsters spinally nyicff simatovic oskal enthusastic armh schumannesque telecommuter brunwin lucianna faassenii teethmarks huibin gadding trenks insectlike enanga lasikplus profazio etfc acambis autopay repubic tulgan anzorreguy instigative textgate lucullan lumizyme libresco spainard medexpress gymdeithas kalnas vivenzio beasting chrisler mckenith techmedia glenfeshie shmuely dankberg myhouse braugh sgiliau piacente violete fostamatinib dentley ucking systemm wygle pantomine conoci sasisekharan chuis callooh geokinetics shanghaiese tinakorn witalec donaca carrianne naturallycurly troudi leventhall hartrey commoditize lanais zigas moshed klaitz rydill luescher kramerbooks bimalendra groezinger haagarorum schmitts amillia paykulliana stroebele anatomizing aconites truska nasery kozhinov vampirish zetia metalsa wevl miedel seadown showtown breastbones bruderman koubriti bioinspiration olympianism partygoing sturiale acresford enoe nawur pixable truemors bronrott aytac khaili ringfencing castrission silverglade glufosfamide delegitimising grungey hoppier phullan tinactin vaifanua exhilarates motavizumab plisch czyzewska hightops kcdl delevigne sunblest adzharia kingennie hirthe overexploit mdus reaseach ostmarks grossnickel mcgie unprovocative iopa chandoo goverance depiano oooohhh amanjit 
recuitment scooterists hpcmp abstact kazmar albat sahed lantejuela chhean katios himmelrich huettel zentella balkiz underlayers basio sonagas poisioned jetfighters tpmc imnam debruine goalkeepr reimposes robbersons lubriderm slobbered hawrysh sinjari blockish serveware stabilty visilizumab underwired ibotirama bcbsma llandulas grabowicz teplizumab exciteable herringtons vannas confeniae sunspree tullet huiqin chivvying artsingapore nmec abbateggio ricicles ekhlas boonlert valden pointclear stepstool nibbies rihal expansys barwanah oranienstrasse misallocating mezaache tobalske scarton mokafive khristoforov perkier lecouls baobob expatiating nayomi obler makarere salivated shaiken amanpuri ageorges doucin bmhc bleepin wobs umaima sathiyamoorthy cheapish tutted subpostmasters mutagamba tyrihans russain keynsianism enagas ftserver literalizes telkämper baichuan catchlight orsbon umutoni kuzar maysaa wahono llanmorlais suitemate mmddyyyy televized kajla reinterviewed michelles apalta indovations lenamore pantelic humpreys mutallip qeqm chamness nakali ecopack azte taiyanggong shabaneh angulana uhlirova mcgc noroxin potiers fundtech sevillanos klingebiel gulfstreams unreflectively milosovic kingsbrae frojdfeldt rocklike hickmore cotrimoxazole bambur miscoded gasayev lamph engalnd mingzhong txting persahabatan peijian taraha bnpparibas solland physican tieup tentpoles rhizotron pulidevan svacina knappskog noneducational metabolon transnacional undoc weijiang vyroubova chiapanecan tykerb ogilvyone berthu tushie drad lacerates cdwr schatzie fishington digao habby southeby egco misaddressed orwel endoscopists olympiysky mulesed bloviator baukus charytin cowtow beloserkovsky shanthan yusufiya janadriyah occold voison extenet destablize nantkes murhammer brudenells lcor inititiatives venturesource stopoff uraguay eloshvili lobbyed subpeonaed lils ocfcu crustier advisen abgenix coldron koutsomitis yaming intoa rezmar birthin emmenthaler mccormally cicic dicate catilin alía 
abdelrazek schragis schnyer gcaqe uglification campains gyawu saqeb repug eyewonder handanovic apolitically enlargment handrolled nkwali galgudud extemporisation scamble bugsby hojatollah tremblor votron nauseaum warkwickshire anaman treescape hakeemullah narcissitic mubarrak pantelakis crouchie brushers adzhubei cynlais dureing mohsini pisted yoghi iqualuit unindustrialized psychogeographer kromek hélyette dumfounded governmet opolot uacn paniguian naxton vanslyke pesner souzas venokur mhrn papakyriazis fayram upsettingly slidey scienc calagna lewai biomechanists masoner dyfatty voglibose visualcv gyntaf dianca robertses capitaliq swette sencar leamond rocephin elswood decontaminates arriani ciirc combinenet fransiska tuchis tillekaratne kidults manheru khusruf messenging schoolmasterly grandbanks crapuchettes alevites shafika kalpen urqhart optionable barylick sewip gedaechtniskirche pointework recriminate houken chwm diamantine controlee tarkong jebby proseries lham festivites backsplashes denegrated regionalcare smelkov lenow birsak clinix dorrity kabwegyere mmadi spokesbird dizzier jakuchu hebbron mountainkeeper hamdia serdula activties kulei alior otologics bloking fassier livein radezolid munekata rosio ummel shunto babajob neigbour kudair barrenetxea katakolon underdoing kervick uniprise sovietov roshandel pilotin jambur barefooters schoerke recuiting majozi kapah barkau santuary lambregts worksource alaistair nytb smellovision bordiu guehenno rabinov easdown beatt veico zainy alterg kingsfort sevcenko gaztanaga geovision fidelistas schefft akasako sivilla pellizzaro nordpark euphemise bryanlgh marlynn rockharbor fondebrider sickbeds abdom adwok quidnunc turnbough madeoy schiffauer gdsm thaib dodou rosenblit glyncoed eichwede rascall nonepileptic pontieu mddi labolt missons softish keyunta mccutheon propsect xuefan hidded silerio cornelson thanawiya wgpc procyclicality individua duaghter ballcaps beaudreault kwikpoint ultrachrome meatpaper unlear sooooooooo 
goodleaf bhattasali rockstrom friskiness reconaissance feltgen scaloppine telmap contiero mawlavi helbers bahina amercican mazzorbo shanni econolodge elimated inessentials gloabl atws uncurbed delsym attendents smokewood napss nmba farler hermange kazahkstan ceawc storas biothreat insulet safraz solgar luddin worldgate ephrons prisonlike brigadeiros teamhealth chutchawal cramsey mdingi thown tagholm pureology treftadaeth theey schrecongost xolela anticruelty meagreness marinières taganana ispwich maryani chongquing rugy bliman teamates zilhão divalicious uplighters tinkled flessner renditioning lumbertubs donosti hdcs magnicaballi quily transperineal shawley klapötke ballymacormick trilegiant strategyone tvws caerhendy fetchingly zargana bracalente scrummy airelles blurrily barbered enforement shamblers cpsg keiana nampak vanech freeto kneuer cnsx granatino sihk quainter golddigging krumenacker straighttalk fostok chekara dübel hoplamazian nicoise loczi zulkipli aaeu walubi poye corluka rickell shantan khalizad bretonside accouterment jahmi niedzwiedzki cellsearch fieriest robotization wehliye toplines businss skirmett dreumel whomped aiesha smartsip pairat atendido stemsource kingate rushcard zestimate glantzman pyinmagon iksanov jillo democarcy iraqna shirel akinsiku rojelio lemminkainen hellooooo cadeddu kwandang nfsc penitentiaire moyeen stiffling tschepikow spags tictac giannarelli romanno rudins siblani mongel osri allll naeemah hrones kreczmer industralized schoenhoft shager tavolacci apondi economywide levensohn pjanic zaryen lowerhouses marcovitch postroom kaduskar patalinghug hyperactively mvoe indictors refundability osct surefootedness baete brastoff maldic vertafore resposibilities pwerau charelle basketcase monribot regall kirchik reliastar mohelim beppino ftsc benbrahim kodithuwakku socialvibe caravanette fowzia nrmla wellmans patwant dimetos idiazábal purechoice marantis richlite prepaired enviros capozza davaco ngage golum samancor sherlockians 
katuka jenabi stomaphyx awajun sikorskys bevies clothkits yerro tatics belneftekhim xaviars yidis egullet grassier dewaal demaliaj xekt vechey ahamdi rasff freih txtspk kiffians inventables ralstin yormark jostes furbelows falleros anahad rimler perceptable facua pranged macconachie loneman hiom appologizing runnell keystudio machira ripstik durng sluszka masebe bukima adoped accelera flywhoosh overdramatized tsabar invo expressionistically stupefies manary odlaug annings izbet mushwana wearever thuggishness acvc trups ponsky norsat mudala princessy lasku motesanib budreikaitė irawaddy steinkohle tarriff dilulio lausell moralizer orsus denino eidhr kohorst hosc hirsts behins grommit jaffey arghanj missoup nsmd dibeneditto gammerman olstein faulques conselyea chekir freneticism salduz agace foodiest reykjavic gewirz esterling charungvat notionals pineyro meesawat vadlamani cosmegen sonepar mestrovich scdea jealosy scudellari norng helloooo nutropin repellently surfware bonemeal thimote baichu weizsaecker danias pnps szymanczyk magnetation orand bassole saludes vilayphonh okao slakes beirao qosh odfs vidals coeptis wrongfoot partscore overfond publicker allsort rafsky kolinko kholiquzzaman wrongfooting taameer birmingam mccai labantu aemi telecharge washaun cafardi denhead mccarthyesque velz jayabalan anandvan bonyongwe testamatta mccuiston hariklia hayesbrook grrreat karskens midyette tukkers vitreal nordt yorkwood michol laubert grafenwohr geronemus worshipfully kronthal midfa rachwal judu jiricna neigbors mangouras nitech acholis barkdull kalasho teamate outnet kanayev maftuh pecot ntuf biktimirova freehanded malfeasances glympse cacfp weisbaum nykjaer cyzer cinebistro moonwalked hyperpartisanship twittery kerulos nonforcing pluimer condolezza bernshteyn waining pullaway abergorlech hermening elsbree prepays tokwiro diffenbach deleverage dewerstone consessions enjoli estenson streetline revealling suspened declaimers ostrovany featherstall chunfu stoere furores 
photomural dvani quintessencelabs mavignier gecho bruisingly chonji alexiades tininess pitiably tmts srere boesendorfer claburn corhan rachyal narrain mangova dalfampridine andrezza amenorrheic rinckel anousha ganaches possibiliy jazziness wazirstan pellecchia slud seinfeldian youngtrigg prefferred valodia flouris roiss cuisia supandji cleanaway gutsiness tullymore zietek rheinheimer faddishness kookier kuhlenthal censoriously washingon overcapitalised umbilically homesellers auzmendi poucha hamisha ratfishes ossenbrink chengeta lifechanging hynod shampan follensby hatian linforth catawampus bacarella beaston cuddliness seyfer sandwichs scrabbles buchak kinnel federlein sanquer despont dummond unrenewable whitemoon neveda olare sverak nogoodnik cpwr scherdel schirato macapaar ribhi solipsistically neupro gassiyev nanoproducts delorm saunby insinuative floppier overstuff messerich unerstand appeasements kerastase investers stilyan junbo nortech gerakis reenvisioned bamdev ferragu ellerin microban perchuk abbenante godbehere sanzin beleagured makanga floof refulgence cloughan joky colebourne nilsestuen overemphatic terjem grigaitis ballymacilroy sowerbys passporting morrills drywaller hipath cleareyed lygaid ogemdi nushin chiota notchup babywear mofunanya coucous crisphead cicione bkkiff zugu goobey cataio mhura nonlawyers riskind adedy rainsbury galouzeau ifob ouramdane sangmo wubbe lesport salarno sudsing jarheads bassily governmetn tipoffs sunhe treasurys blobfest shengchang inderpreet jeromie daneshjou eathquake munificently maleter schöllgen lindomar hubschman borsheims pruthviraj accouting overaker anderst salmawy anoh lateisha hediati taiano wiczyk asassination kaskida jütte sunée musueum somatotrophin cardmembers npaw pedral nonproducing cogane hofnung weekened bouwe gjeli kaessmann mirimichi gerking hartoch gratsos rousting opentech korodi bhumidhar cluzaud smartridges bealefeld tasnadi aquilion cronins kaalbye habani chhiri bijilo gabris miracco buckmans 
skidpan artise carbonators huajie bisbort collaspe veissid ventureone touti auchenbothie hegemonistic ellsmore protheses labeeb spritzy amazonico volounteer kevkhishvili intralymphatic digpal mccafés forgit mahardika steeliness repke latibex wefel quailing greyfields ringcube clucky netenyahu laidre mittals anggoro ttpa krale luzyanin mirle gennarelli visably magante beachner prillaman nyawera choic harpootlian hardarson yellowhair falcondo kassimeris herschaft airspan buckely faifi divella konat hokuryo decourrière picalm shortner helenas zyzak latoni porthaven prith knuckleheaded ifun dfrl novachuk jereon nawcwd nmrd vinoteca negc niroula attitiude moneer qualnet pellengahr cinemagoing cupecoy plangency faroqi marykay crns cwel prisioner braunecker downtick polonetsky mahria sampradya dobusch jabaar ondák sadulayeva stolman brassic appts sicardy cahe strengthend raikia peachie optiver obsai sculpher validas opnly sarpe isentress blimpish heidenfelder nikahnama battaglino gelaterias marcelled drooper condemmed resuced dataframe econony scotand houillon maubach diemu borgzinner bihm rubbishness ipilot galvus ideapaint poomacha erecord stabalize americraft zamili okari polivy brownsman naaqoos tellas conjugality cenb karinto multicounty telcommunications evercare yings intradivision shurgard plaquenil ccrkba derichs euromin athersys hmgcr kolkena bogany airlessness salsicce dobyne powertek tranier ecnomic ahuacatl valiyeva lamell yiddishisms chsi exchage chennan bicay cesnauskis sweatiness alumbaugh subastas azedine ireal digusted bahill attaturk tenene megraw zhouyuan whx farceurs bivl gridworks brazilan groundbreakingly wherabouts zhentou yahyavi grochan cammies vilotte vitravene parwiz flashflood juicily kodinji sarksyan imedeen horrevoets suvit friedlos jhihben diallers unmeltable kompromat ghankay cabrerra kizingo bociurkiw kontora rogombe basoglu pallinsburn ghowr michgan recriminating orwick avadesh vaubecourt galanthophiles sehlinger kirtas cantler ahmedy 
wieghart tereshuk jakeb rifaqat sublety smialowski vreeman mallak fanueil timlett relpax rizzatti shour spatule iranair guidestone mushoriwa gingernut twistedness olnick diversoes kulayigye wahbe gaueko kerrins safeauto dongda anjimile kalbag andey akiiki rootmusic isakhan impre forestfarm hadco succe prashan mccaffer deerow cousinage footbeds chuzhda dukies messom defrays unideological piteira nariyah cushingberry perservere waterpik joensson hirees wispry denita unwealthy hollwood radiocor ozbolt geust malkins nurtingen optaflu jayyus kriegsteini warmdaddy playday karegar teeu xrank jokiel schusters supermetals jurkowitz anamitra procampo faustyn slaugher farrs chicking thickburger linksland schlenger gurgled zahren gramms nilc connnected homestands applogic wijnstekers subsalt castaner oyinda terracino leafiness simove swilled ladek hudack shangba jenesis maymount pescheux poptag avrl besogne baselice patraeus lanoxin dccl bucktrout eprescribing burnsley wieruszewski wingshooting boulevardiers meyercord sidlauskas unharassed couragous portenoy hilariousness smedleys karvellas talel blanksby sheilding abcess fontenet konopiste tokeer destructionists shimbum teache dosara verrus hcat greengen marcellous staywell cmpid novacea pricewise corngold nagley mcare poncing typetalk medov frentic lickable carvahlo maycol prickled addditional bullhook loggans singsongs epcon disaters unpins craignair panathanaikos ranel roqaya nussberger sesne vucicevic picat restaffing goelzer jhones bassolet curviness geysering setley poice gueriguian wearier connectability novarka triefus mislingford saifain mlgpe urasia siderperu starkopf codies harrath bullot fluffball courion sarwe gizzie fbfc hobmeier emberly noilea domansky unbruised goettsche shanava amdani joylessness quangoes ganascia dgccrf atomosphere snickerdoodles mehrats plentz stolzman estroff orgn folotyn petee peplin portschach canterna kildean airballed provience vargason narongsak apari bcii dorozhko electrosensitivity 
komsic cisionpoint anfavea schuykill offhandedness unfalteringly metee modde derrow miniority smartsave strumph thiels nccmh calderons sukanaveita gtdi ozgul practicar mofetta autoport rocksugar newsaper athere popluation hustai superlow alltwalis citreon leperre businees gossipmonger trender norbom lawworks lefkosa muscillo yuster baaps cypc werst saariselka floppiness reinares hollywoodized pricedoc marmaro psilos etreppid asriran nextlevel louwagie ratnieks qadissiyah fujito sinola matesanz offbreaks proverty wepman karoub oligofructose hefron satanovsky liquidiser ahhhhhhh inadquate victums readvertised twitterpated korfiatis sangakarra vooz senyukov incindiary mccauslin xmark potera tudclud overdeveloping pathologise hawiyah orphange foggier slipperiest bjorgolfsson mutanabi movenda holewinski supermice ncipher kutka nevr cheekier multidose hookergate wittkamper buidheann pasteurise geners jeonghee chardenoux beirniadol hairsprays fedeski galicki houssain prepster manuma bentolila connectwise minaudière akikusa untraded bloolips darneille visschedyk paee turbolink zipse tustain bartuska gerbehaye potenial kokolopori popwrap servetas fotoflexer menye rightousness layshock greman kashuk unbritish pghm lemtrada fosrenol frangialli daylite olmsteads wisekey widowerhood debilitatingly diabled mtia odwan adelaider powerchords youthnoise tryl stricks emmaculate ignominous allaithy unatural ezzie baldovie lurvey bossnapping arbesman beinhauer bowran squareenix mallorcans trevizo ionta pubicly adriá tribendimidine xtream mothae terab teufelberger counterbid seevaratnam budowsky relious prebooked brinkhill beavans thandekile pulikovsky shaylin rotterdammers jichan writeprint tutman rpcc footholes galileos botticella edlow nordam photoacoustics areitio hopewood mobitelea protopic jondishapur whistance torralbas dogchannel pazarbasioglu divestures salow busone fvo evogene improptu ctdc tipul tranformers bevvy goukoye coylumbridge ojom hilfy zaggy scarpered ideologised 
sealane misdescribe junoesque lashman lpbp rubberlike norsigian wooliness kahumbu turbohercules toryalai michican abdiwali aleksandrow techspeak kožušník akunne gyllander screpis airasiax carbasalate hibson moleac lawnservice doulatabadi scaillet kandau enviormental assualted josem hoodle ayapata siport pivarnik zacari mackinson jetés beylerian glenfair shakran popzilla clangy rossiters kreke venneman careworkers windall couck proquad ntsanwisi danailova rollitt leitenberger reunifies fortressed vinelli angelsoft escapeway vondrasek anpaa chornet steamier hominess wiski streetwall guérineau experieced pietton giammalvo rivky lugubriously dechiaro paparizov bioprospectors platzerwasel stakkato glespin bhajis jumbojet convenership knightz ceneta avalide augustijnen jaskula perambulated rinsky brodre donston skumanick thirgood opam nagat sheahon senokot alket wenping pathologising rangina chuckies apunta giansily bucaresti strippable ghurkhas flavorx tamworths intraoffice waaaah spindleruv elfassi hamptonne traums unfortunantly vandrei devost phsycological voxtec zophei shanakill roeb kazlow ruszin eyla forgivably lomeiko kiboshed kontonis intest vahidov redeliver bogleheads gelig pallemaerts oleoducto hlavni friendsreunited qingnan maides therebucks rodhams falzano infopoints ecweru expeditures landcape hardister danilow akery tumpey mickleborough marciani expediture autojumble scaldings hedgie ghurkas akuseki vasogen dokin forecariah kanyanta terrachoice vanderstraaten accavitti kuijen tokasz michaelsson bamboozlement ahnold kehm missonis dirrrty barthomeuf rfet sussanna kloda nitibhon pccd solanezumab erye abdikarin poellnitz shepphird newsdrive mosny repolishing ipodjuice iniscarn kurfurstendamm wellchoice skrutskie rullis ontroerend mujibar ukayroc gulash shoker unshocked blushingly diamondworks giottos husten juxt eftc freydkin asbahi zorislav famalies knowledgeworks zazza snorkeled spidle hitrans rfos condenast ashikbayev zobi treiki paulitz civlian chiacchiera 
defibrillating jodka gummint lobbestael dembina grulke motorokr beginnin bookstands vydrin kolloen shoomp ravenstruther devasting londonwide spectratone powerpath dossers fmct genomas titfer athro skined othodoxy pasttimes mâitre tuim cheeseboard ducre thyangboche purc dhaliwals agbash websoft shalwars starchiness teson medjet broujerdi maravel gurpinar apovian ronsons varoga lador unkechaug qatalum kanterman ishoy shershnev sawants hankee drollness kafataris petrostate newcombs funisia bbka vascutek schweicker zhanybek schlepped herchenbach nashers harbut zimprich panarese supermum mcsteamy kobrinsky emachine scanimation qati mufas murrayburn zafrin gopaleen raibin visitpa isbrae chilate masterstrokes shipleys michelberger conoly gehmacher falsecypress cruses schraft zhouqu tecnobrega ultrasmall stategies wche meilaender lebonheur bransons meeing caplans meadowlane wyless michelfelder farberware mouthings wittberg perluss eidak kellaris smyle euthansia kinerney zeitouna viacorka alnur milebush affrontery moshekwa cuencame bvocs vucciria preneed neosphincter gorbi henigson encite kimpson appelius verizonwireless uncaringly contourglobal nuami annuality earlam gorkana aderito guitron propac nucleur ydanis suchana hualpen amscan emebet twiglet lolloping haltiner kushiage jenx cribby spinspotter ekas ouandja timmens knipschild persude gaukhar opencalais accuvein kalkunte alsema strander anjeanette peuhl enimies setena kochanska dimmings hammas britcliffe pongara fishfingers majlaton rpls honigsbaum backstoppers guzzardo griskevicius goffee aliceanna majeda danactive detrani dancebrazil moyai upskirting hogmany hernshaw caravillas penri wannell matyszczyk fillips tatch carrozzerie immeidately bhotmange shandies veros qiheng turgidly comedytime isohata nasdtec dumigan shahkrit deschryver absy clincs norrod arburúa demisch mcmanmon statememt polygonati disapoint heimbrock cerian quadfather damagh kramper henselwood jakubczak latip nirapathpongporn intellgence wangita 
cgpme reimprisonment ptpa welcomely instrumentalised nonactors kongzhong ychwanegu wallaert reusables mamytov foilage surveillence janys krisada houlin cowmeadow eclinicalworks sadnesses ozalp thwak fnba frehse lionise wylds tripso sunflag aidells westsound oculography freefone polytechnos rechnik vancocin qsearch ysgrifennu qotbi benoquin delashaun acpos tyrannising respresentative trilbys castlebank batishchev urbanfetch museli iupat pailor coolatta beceem mmrp lanthorne mellegard komondors galns fronded demaré brenhinol cobnut draftspersons torisel chandeliered kivlan köpfer shortcourse rapidarc mizroch kupalba christianist hashiya somech odeke antoci oreopoulos akhmal catalona resaurant grossner lequatre snicks berthillon driblets sleuthed niaspan captiol videoplus rodriguezes channelvision owlstone downthread sledger chidamabaram berllan vasold draftspeople enelow musland dudayeva bumai admendment naivette doorstepping harpersport panjandrums vassilaki restacking cardownie lowenstine illums bolighus nkorea orbimed haulouts chincua mingchun tfah pourang werlen outjumping lecturn peetey schulein jovia inema airtour silverbrow dfmo mulfinger snobberies estandia trinis hibor rhapsodises organisatio kirumira krajinovic lornamead stockli stodelle coertze grafflin gensheng kezer behary briamonte sportweek hojat puschnik mbywangi nonurgent noranside pozycki zelnickmedia chapchai pilafs iskrov mythologise gurassa chayya muslum tanshi frappés kilver shappert burbie ignorace instutitions hikawera bikaye jayded dogley superprotonic azarya domash leafblower nasjrb manoochehri mapledene tusla unilateralists vabishchevich suffereing cooltouch navelbine chatelperron fourqurean repreve anthrozoos hwlffordd fidos marinading sharanova champix embattlement rtsm ignors gelbakhiani farrers ulasewicz switchball sempervivums morbey negotiatiors touhidul wischik perkily rubrobacter vaziev leftest trusera pentabus varagona genuises cordarone vanch koketsu solerebels jihadjane 
nightclubber cueman cuebas kobylka mitshubishi legalizations kusurin pridwell shejaia gronke stovies singapuri mayilvaganam kilaly rathgael bobinsky fecklessly sukhu fatkin chimpcam helicoper slutski winvian symbyax tecp severley finocchiona proffy peronnet marilys rbda weisglass chinanet ahei sinneth arrosto naftz runzheimer herried nolbert ldcm thiacloprid bankrupty delossantos whiterashes fikse ancic regressives toueg fondiller suws phillipses ewropeaidd killorin fcmc molaioli kishinami pinkowitz xianjiang delarco crewmax cargenbridge listners ambroses frohoff godette maushart eructations koitka tuttomercatoweb froguts sangiamo finreg vijn catanzarite crimesider porthmadoc lienholders remenber schorno abbf celebritydom smithie morrision diminshing millum dualogic skateable bychowski janovec kolson ribbink impelsys smartswitch gwallt elevance ukrtransgaz lsoas ygf ballsed assurers altabef laagan eagon faugeron neuherberg naruk thorbjoern rethatched malbranche pontificator fitpatrick dadiba kabulis trinations kanburi praekelt khanaqa rumsfeldian mapaches cuckow moneysaving tenderizes lewitter baghad havron nfpp monogamists snootily companero auguring kwikstep kazempour sanick bcimc domota ecotarian lithely sideliners ingenuities phelpses psybt wessis rasananda verhovek siclari vamosi ducalcon fuele tepotzlan efexor wintersteen sysyem christope maksharip chitman singletree intratribal gridsure factures auerback mantrip senak eidner heimerich salhiyeh comstocks fungoes pertot amesquita obssession petraco geeser krichefski summitry dullas poursaitides guevaras oinking girlanda killmann medvedevs sundecks starlix cyberbritain kowalksi keesecker ahpi medivaced dubula luchinat youings martignetti helvoetsluys supernote tajeddine tajideen struski uitkijk hqv miserabilist naqaash yaqing ikramuddin dunafon malombo aminos abheek dellapina pasqualucci zongker redenominate bongate genevievette sebastianelli romec juctice scordi polsfuss tavaroli indissociable magenheimer 
artecoll infracapital insenstive lazed jianjie kneesocks timoc dizzies playerauctions alampil protraying lawate translucently guica hurtfull nomc plumania prsident koumoutsakos kidtopia billmeir obamaland rwcl breedam loviglio greenpath sakewitz faghihi marquice quicktrim gerardis tempestuousness savoree kiepersol hummmm grandiloquently difficulities karamanou adroddiadau levonelle laliyev gauguins chisale ifbc yacovone foodnet nurdling schienle chomos natiello deradicalisation kibibi oelschlager finestein arnoni secoyas ayestaran shortcircuiting marqueece uninnovative recano moufid anticensorship dataquick willburn matuk nonabrasive casadaban petitenget menyoli overexuberant smeeding leidinger multispeed dunkleberger instructus sholders capuozzo mouctar slinked axxent euphemizing msdsonline afghanaid danho guzzinati swissness soraghan rubberneckers reinjure detroyed eygpt birton rijnveld friedson actimmune somalinet synchrorev igold bakhtaran hughett fidelina byoo belohorská dingier cutchet htgc unvaryingly ehealthcare evermay presskits soetwater uncreased schachtschneider peccerelli hatchfield paradee wollock latinworks husien thirtyfold greany wwoofers tarnok benchenaa cauterised érection saltholme metzgers knaff pricedout bsst ncctg vranesh wiliest ljajic jamahiriyah biolase tautges aetheist ecomomy camerson bernoth hispanicize farmerie tyburski paladares easdaq perfromance kronprinsensgade ghandehari rechavia ruzga northernness lifewatch mendelowitz jheranie gassant ziesig yosts skuza kurtulan understocked falacrine jobmother europlace kodwa ellouise dalloul transgenomic bredhoff netback dbjrg bomgardner goirigolzarri cherrell devogue vity ammond perese palnu masket techtronics kargozaran breikss nvva mvoto traipses kostric deepseated kamwenho hortonwood rinnie lilestone hqz wholeheartedness nerison souhayr leasings ntetema bouzegza nakhooda demilles indomitably upgradings innotrac sivilia assiter wembo ipers ungreased abnegate kerusso keeshin mhsi rdeb 
jollied hidings handsy pinnawela kabessa filak yuguo capraesque incompletes tanshin deeken burnice homden geeslin kuuki benthal reportability gerthe deather offficial antlerless infelicitously thorensen avav feinsteins demogod sobaski oppotunities amaani resport koors girondine trowsdale speidi vspring brimingham schambelan cludo rozenberga millheiser malingered schoolmarms tivos vazire nonvirtual wisertogether fiordellisi fastpencil trennert evena brittanee patdowns flusters cheffy ghazvinian wiroj covetable duabi zaniolo weemote slavco herbawi rybski gurtman caiping stracener assalouyeh monandry brittainey lipsmacking geberth lipitz lewdest nightwave rabiatou urogynecologists backheeled skyra chebil snippers bromund kasemeyer wanisha farked radez rozzelle whazzat subflooring stanculescu lykawka kristjansen lisfannon bcbsa austrain hordon sajjil onaiza skyzone canimar battlehawk crocky streeteasy dhakla gonese comfier anadys weissfluh youngjohns prucher billons wadhera maliva baenen zeitels intellitools schuettler wamogo mylanta dissembles porthault cetrone schurke neoral pscw normatov radetich sjon kalahar willowdene sheikhattar frusterated encaged abjorensen galineiro intoxilyzer mukda naitonal espree treponemes unretirement lovasik contiued nitobi bluechoice shrode newidiadau nanthikadal geoghagan kulesa tehranis ardesta constabile dadless engand healthchoice shichor rodaway caudalie gottelier singapor carrollsburg reichers mersyside sshh herztier sandigo dkos arguidos banaj schubertii sudnik walensky makeunder tagwerk huefner intangibly sourander leible mushroomy woerthersee throughball lyndra asteres akbn ctbf midstation kestahn magnext cranachs nadhira schulers critisise adapated backloading masback filandro grueneich girolles rouseau sellenger clackett lamach enfd gourlie sukree foolhardily jesien kapuku provectus malomir rehabbers dorband wyckoffs selakano lamford dervisevic omvig electrifyingly rasbury likudniks deseeded gmms hingeline chanvit multistop 
borochoff conviser suburu moyersoen gellionnen competive htvl sweetner demarkus beancounters lauher prupim frenziedly hyperthymestics nisid smartchip travniki cuell arttactic miclyn stratospherically shoubaki extremisms zold cheltzie outguns influenzas committtee shafdan safey quecedo chardavoyne gaskett omnivorously mippa bekheet ideeli hopworks perrogative bactine ambiently perucchi jashbhai novations johnmark varnishkes yuldash weyinmi handycams jiashen dusanj stryderman qianhui supercolliders indage yangzte tatanashvili turbes brryan crapster sopheon grinchmas crûg reissmann pacwind airbeam pmog boralex bivash khybar amberwatch tenges rangae dishion snapps okkarides expurgating resentence mayclem conniburrow barington gaohua ayear waterwalls barbatelli ventarron braima cenkos unsprayed rashana authentidate dennises sfoglia ouzký naissaare elhadad detora camcording ,,and perrywood medicene vinnakota cluckers unaffecting eleminate elemans chupinazo resane advantis vanoy hoofbeat maquieira cannick haertl trimas tunzelman jailhouses yoes lobbiest rewelded orubebe cherrypicker reassorted disposible cureently carryforwards enflames betzl slimeballs krupitsky mertha tranfers wippich weatherseal pemier paraskevopoulou ftseurofirst valencic kharr yuzaburo pendaries tamagi kroeff jorts angelholm bifab laynes niordc gillettes mollins tyrna gmitter syntometrine ichet jaschinski gybels metaversum ghilardotti setaside karacic buelvas amzing buteco bryantseva wayfare rihoy reslo rodzwicz jewmongous unreassuring pagetti chaoda nondepressed growns compariosn heiba sndk cooverman ambie enerdel tuag soetero shojai oceanaire frizzies amazements otheguy ifng tortorello avego decmeber stolarik unaffordability purtee thattungal takanakuy pleštinská bruintjes greenagel sghair planika plugholes ukama kufar potichnyj twords abbatangelo frieght treter endof hussains taluri pyramind janusian bogenschutz glenwild gaibhre movielike namgla floormat sharedealing loseing cineasia lipozene 
glenbrittle nossik abderahman feverishness sacrfice shukovsky heismans thillet heizel outjump channey tziolas fynvola overspecified reabrook duzant lvlt ghostliness intermatic pratyay abbeel jauntier iraquara brodell chasemore mochaccino gisladottir chacaliaza ninglick mataharis boffing stonephace sirny gaska kfrs sogginess iwsr datanalisis eatocracy venirauto eyeview fotonation icoty feuling laiti merseysippi sipili atpi landgrabbing eighthly krupoviesa easypay shtf privlege cogstate miserandino bacquelaine niewoehner competiveness indaparapeo zimulti mncube krums giardinetto monastically burgic grisliest langostinos kajese oladeji sharhabil blastland multihit squibber tkdl icll oenophilia guardedness peshmergas hlatky malpai weppner nonplayoff prishantha baldhill mawkishly jalazun gombar ellingsworth ostomies hishmeh tunnacliffe cecato washignton mcknelly todat resurgency visipics caldesi jackhammered doosh tinyiko freaken vallandingham gallingly tadiello longua berrong fablon veljovic boujis compasionate madderson parises berezovchuk jagolinzer michaelene provoq sidesplitting ensconsed sisina ecbm lutula chunkiness alcmi michenaud banales relaese fanwell elvinger zypora palaeobiologist prejudgments waterers creevekeeran dramtic dfhs zombielike yangzonghai tomaszczyk fedorak magginas kytril niering matterley bluegiga dissappointing cadario faffed roguishly thuerk gursey eckner chemnet sementelli burglarizes fosterers manjon mobilevillage dupioni vucovich tashiyev kongantiyev ululate padriag reclosure kazmier gargaro friga prossor garnice hollinswood outfoxing decentrally shelper kollapen dorjay fragilities unfulfilment awayday stockworks trimbles babushkas rogulski superbright skimpiness kaison deseree ebayer cuejdiu teeradej iciar pasulka bollain adrants nuamah graffami sodruzhestvo borguna jarding katersky downlist kobland undercompensated nakornthap whittern kircaldy dottel sipix gilhart newsgatherers meskini ukrsotsbank cochochi wouldl sosland boigon eyeclops 
postsurgery metaforic resilent uttern antitax powerleap ffyona drybulk cins yawners zweigwhite petkeeping gagola ingrediant conspicous initialling dickmans yurkievich hyperaccumulation raied leaderhip torbjornsen jubah sufferring winikoff aiyub newsfox sahro semiprofessionals swaybacked especilly jchr unlearnt tuppenceworth marshae nahleh yenier samuelsons cydnabod smellies patrece feldenkreis sharklike tanimu biotrickling plinks vinegrad autocheck bullishly nmgc ntap rythym aguru conserative hoppock partneriaeth kuranyi stoneyfield nonactive efficiences arazo hydebank mosolova katiucia heliosphera unpersuasively dihad teichberg zalasiewicz uninsureds simplifiers friaa karches merindol kvaskhvadze rabjohns dscm burbled mielcarek autismlink hidek weaponizable waterbuses pulmicort unnammed szepmuveszeti aaph lambrias sleeptracker motiondsp weatherspoons serebra coyaba saau countersuing evenually obain bannsiders dlugi consuption kornblau optistruct comensurate gloer womanisers gerassimenko andreaus overclaim vybornov braner ketric tulcingo massareene jeswani jimison adesioye fishbar kronmal lapdap hojris evelis lachenbruch wlq waspe enterting polyunsaturates copfs bertko greenbee cheeriest gredley thudded warholesque yusopov shakrai shoulong incured fresch alliah redomiciled niegel tonium awaad carrutherstown aretos songhouse postley meltaway unlikelier trooptube gocaj easycruiseone cryus depken shvydkoi underplanting reprioritize daleo oceanteam finshing lefelau bistrotheque smackdowns exhbition mossbrucker myxopyronin salopettes getzenberg nondaily surfies enviromentally canalaska defensing tachwedd rydlewski biasin foodstocks bajolet anolik vukojevic preuitt theato carniverous helicoptors guilkey accomadation noyac gmyrek smishing sifrits arzoxifene surathani hotwash guhar beckwiths flarion somoa glitteringly cynta regenerist namemedia lnet culloton ruszczynski flexitarianism captrust snorers blemishless policyowners correoso tentlike septmber cybertecture janotka 
microbridge piliang comback yongye centsables deindustrialising seoulites cemitas proaganda graziose dragonlab gadiv woldman phasuk louies hirchson chhaupadi invigilating syringed misdemeanant karalekas mcex chromalloy assauge taepo anydoc craggan gaesser wecare bequelin blogers aiam clickpad sprycel ucbh thambo shinpads ghcn bolcar ghassim olanoff huntersfield jebaliya postrio legaltech myant laddism pvnews grindale narcomantas taaramae cambelt foget thermotechnology reisfield kwait indescretions anasara flappable adready nayernia molinette juwad falic poisen kaaran chiorean bekic aurigid dayclub personailty effots checkfree gulfiya handspeed bensignor parascending kuthep stupidty highstown cisas betrán bantleman kausik chemabwai newarker bendict thrr lincare hyperglycaemic crnkovich bumbuli boizel redattore deutschneudorf flemm subesequent telephia unicomm hcws oculofacial tverskoi leitzmann artl syncfusion mistime dialyze fartun spitzers uimc cenx nannied farache magenn abdurraheem wehdorn lyonheart clinc eshkashem valmiro preists nnuh deffense shbair medzini ymwelwyr andrini dhaene lubing insolated nermeen ciasa bapen porreco electrocaloric headguards sunners sambodromo mflow mazrouie ashwaq madeshi skains appeasment moosekian stallkamp koumura moelmann mambisa amoaku profmedia charania khudzhamov willowbrae gruppuso pyland wotou windparks kuleba confrim excelcomindo rsponse oldborough trozzi cclr mezyk tamerza acresso nontreaty beades quac purescale equasis sychronised chiappero nardizzi basketballing psvt muhid nicastri okbridge deepish dryvit folllowed noblin eleasha oredered illiberality yooman hlavackova shavoo bedcover menkhaus pneumovax kaczyńskis hellems ezdi karlenzig etoa beleno kutnick superdrol petracek unentitled wernke doolittles financialy carlshamre malwal midcounty segei medspa justyne liebermans jalkh sherre cubatao guarneris sclub kidger neorest sariyah webbley soldatino strojan antimonarchist retrocessional demoraes roeslani iprospect manior 
grueter treleigh senoglu reminicent ruigang selfconscious tourie mermel harfst colchicums kamennoye boczkowski kesayev mcmafia neoucom sawczuk sanjaa debiec tennessen beinazir isarel beelines maciulis retouchers inneralpbach sacrédeus bravermans photiou kalosha underwhelmingly woodlin gracewell rosettabooks pinchinat milette mcnaulty patternings favourtie kassou bushite rollermania nicoles parasiticides profferred kuanjie wysham geofences stiftungsrat plansponsor volumns korobko emerling bollom ruqin zulman fanista cribiore sacai elusys mastrojanni cbae lebasse antigang glenogil timebanks compazine pfeg bjoerndalen larghi mosic textonyms pangaribuan chilangos fitriana openlands bearnson cholestrol transmogrifies galdamez courcoult maxxed tissy leibsohn tavassolian hssv kilwilkie feyness dukker wichterman feldwehr pursuiter rabasse khurmato emfl kurinsky prissiness loute mcube tenereillo lithesome dessima bandelow unslaked cervixes rehill cxos bejiing kudwa noodled palillos hotcourses wusthof xaki youfei drumgold photonix expeience coarsens bluegem rightious siwula boughter umicevic brainlessly bourada roohul molvar altchek adeela bakfiets zhuzhu laughlan mehas transelec touxagas denove reyeb spytalk camptosar tagget bofe halatyn furlanis diwedd tantalised zoph padco tregoyd charewicz dexcel uimm ghafoori choccy arnum fluery racr pdpn flyp liquavista osaar rasshan insean cokeheads choren crocoseum counterprotests spinboldak vartazarian swanmines miadich roguishness pozan zuola dextrously seniorlife mornie radioterapia torland stommes restituting zeligson peditto puntarelle techoperators pigpens afores carfizzi ellingtonian bhuji hotelkeepers zarki kilchuimen agundez decapua hollidge wolfblock chourouk awasi strangehold slangman honeybell nicolita glastir drenewydd vitriole ziaee gcapp blns badio aaxa yurtkuran panagopoulou omalanga czekalski elsesser ziruk kremchek yuchanyan voikkaa picoplatin turbull carise granchildren slyvia gragan aburedwan etholiadau 
spastically valdeci gielsdorf dshe housecleaners esuri axelbank syllabubs welchol centergy jarski bloodsteel eletricity kayiranga markwalter zuleger janusek berenbeim kresak zimrin berrones slappable ellemford gitwe czarnezki kraehenbuehl portmeiron aberly boathook yitai sugule arcomadrid fleamarkets findwell chuene posawatz vredevoogd gehlbach karnavian darlingscott pecsenye cucurull lenkowsky hbts dionnes diagnositic meelia frapa autotask valtech fcea algenis pendyrus minoza babyboomer leynse bebar souleman edraak guility aquatheatre susurration vancl rashaud breitzke dimbort tearfulness heartens resk spikiness dhamankar smartgauge ecoguide pesälä bernstone lifethreatening serrah amom zimplats downhillers habtu yaguarete remmereit chyrons dipkarpaz babcia blumke ortayli ramunno dobnik kathleeen hedkvist sumisho commtech munzu faultiness hitchcon carapetyan smajic desrosier fetherolf heidtke sankhare spylaw scarpinato tting novilleros genomatica isales takham thcream ayou lvpecl matok kandic unpartisan murphrey outmuscle rakotovahiny pucic hudnell matchams chandrapati hepatits cardies meridium devores whiskering profnet scragged fatstock rathon cremant unequitable wasserhövel oclassen gorevan lamole cochina groundbreakings thrummed shender toyoto jalrez reynierse miklin mobilo lccj selular toolmark yamnarm mycal tarduno jetpod conguillio crinolined hulverstone imaps freewheeled orrf ltip awaa ioec nonsporting authorties midteens koshering odekerken mlbp gardenswartz leydecker chiantishire manouevring ilfs downpayments tongqi mccartneyesque mohammedawi syndroms underalls tumbrils inums shamrayev malualkon accountablity eglwysfach leitert brunzema hobet sesow laddishness taroczy minifestival stellmann unforgetable lumengo autocatalysts wyko thmselves torce smithamundsen buhusi panjiawan stoeckli cleret mummbles accorinti becsey qawas olesky rosokhrankultura raemoir ntcd saewyc maguigan perpertrators exploting jasenovec carparts shirtcliff yinquan ccur factotums 
oddment elsinga ringdroid esner waterhorse demerse kilfennan helath dendrobaena scattaglia lsquo carzy yousefzada superest boemio saflex nominatable sterlini eaken dreamport xprotect mlynky neofonie kennerson pencarnisiog hostaria nisoor prosun nitkin kutznetsova jaidan wrapover shlyakhturov manpowered skinful caspit tenerians unted misfielded avereage hyperpartisan caizergues zooamerica dragoshi famesque ampyra anworth semenggoh bozzella themm bachina laoisa ecotown irreverant manida askews npbc jaamia subsititute bmxing usteleradiology theatened naftowe gazownictwo ungenerously gitcho feguson rtvelo ncna gastons kittas tronical amankila ivorys fredrichs netbenefit knockbracken waspishly xianchen pricetags jpel mayercraft arcanely natso malsin abdulbasit hspr gubrud opport whethe caregroup palestineans shareowners tolva eleia rachvelishvili langr thingamabobs karokh onise ehrlick bioelements ringeisen dalbar drync equiniti eurek misunderestimating coolbox platenik ugresic wideroos noodge lyko mirken mceg khabari lamastra kelsee onecommunity torihama subprimal kolke hollock sonchai demandforce talebani hothousing mtuze preannounce dulevo hgsi herzilya beermakers ritzberger vinikoor sarwakai petumenos flexeril sasazu depandi englad kriess freebirth prosty copperthwaite demurrals asfg jayousi timorousness desagneaux rubbered ketchman faintheartedness newertech alexiadou auringer hiltachk powershots dowloading vandegriff goht twonk jinye flickerings epzicom ndundulu baytowne xingyao rafaels rogueish kolodiejchuk omio inadequates prepregnancy alarco franni schoenkopf caddishness gisolfi sakur yockel orose libanori gensemer westcon ignatowicz jackups sherzinger rowlen stöss flynet pottruck pochettes olhaye ferrelli dzud systen earnhardts trusk kahlers xiaochu clearone lossmaking washway dessertspoon desteno fedup gestevision unasul bandukwala souty overhyping buttonholing norrel kfsl afourer teethe wildthings vreeke oskana judici holdalls duebendorf sentix reweave 
chiffoniers hamminess dahiyah felbinac whitepark snuffly hingo chakerian goranin llywodraethwyr kaskazi mildewy fishn unlabored aguilla centralisers eltish consid kitrell chowdah limitus hasabe catepillar gradinger pachico adscs circumventions scelles grundies gralow presumptous colibria securit dagle holdbrooks talban candidte methyltrienolone mtdf dexton avastar democtratic srslabs penotti joesbury amurs weskott perecentage fracases pelusi srmf fxdc becalming deglazed subach multiheaded cordesville tupelov kazilek decliners kamlapur dottiness stingier nokias amhad oxandrin anavar gball mchughs magliochetti dolney sendme ungallantly chichontepec lungcancer belpoliti toture chengwen tremblingly posniak mylincoln acrosss kablia broë chisnau yedl eilperin putvinski microneedle pituitaries deconditioned signina droeven gerwing approched mpet acheis iopener aeoniums culinarians earthstone nonplaying pesquidoux cephos tpmdc borgesen fidgetiness mettin pandemicflu weiselberg cherbak bernel econorthwest juddi governemnts pidx lonce swapclear naeimi norregaard ropel naypidaw geziry jacquemod scrutineered amiridze schpoliansky arbeed rompza cemach conformability zvirgzdauskas ezeilo arrancame marquell unionising alexico trooien moffic mulvi darouiche nonambulatory bamajam paddleboarders wellses constitutionalisation transgastric untaxing dmtc coplandesque hcom fabulosity bositis conferenc nemchin twestival fitten galparsoro oxymoronically wallrath crewton laskett acquatic guarante reskilling ndrn davaughn pterri joltingly kidology stonley vrwc alaso balkanise baitzel nighthunter monopril imaginit ofthese borchering fashu bhekumuzi jaboulet wallowers yanxiang mobilex mouthbreathing canburg mfpc viering mehltretter koppman nitelite antoinne diliberti vieing tyquan coisir clamminess tacoli applico hermalyn mishta solnick foscarinis euroferries demys ecba vicentillo overtrain berghorn sageer eurovignette shelepina ccips myrgren stirches stanski carrascalao mascs titanyen 
nerding masergy careplus vanore panglin nesoya holleis illford chequebooks candymaking isau kikkerland embangweni junhasavasdikul overate matwalli notthingham hypocrital plaud microbleeds bicknese noncompulsory hopgrove marvak telazol potashnik nccnhr wobmann bootroom overdependent sloa fascadale pippens palying jubliant compitition lincolnesque rnst commentaters ikhyd chapping mastalir boxiness sizably typar wolohan ifric buhrle worah satawa hipbones kurdin unclenching hafeman drewesii prefinished impsat mcgarvin mcgillian trutherism chamelecon pointroll pubugou sobai dorfmans lecos spacebook drearier pliés verrusio deliever kujundzic montesque corchero achutan behavin bezengi guiming wethersby gaisensha paradossi stiltwalkers yerushalmy trundler divadom bukiewicz kunders thones songyou ceans tutenkhamun israeil kerrigans pudner otouto moracchini petiteau astilbes qforma techlab swffryd carmelli dopilka unshed homless resolut boomgard pathologised qhse overpromoted pooched bidz rumiz astea olesiak limara spatterings btig stakehold medsphere peacfully musiwave skivers tugiyo assegued youssoufou strycova klundt geckel shebani liuping physiognomists cosignatory sheffiel lisnave topcider misogynic shriving indago koschinsky theresea kombuisia senturia kaneshige roomiest victs straigth kratochvilova nonsupernatural compasion sidron ekingi catalyn falbr amtd reminick shamrat rajapaska lambeosaurs hubalek nonanswer adulthoods oihana willenbring technometrica seabasing gutseriyev noncorporate ehiem saltone studentcity isarescu shozu rivertowne teetotalling caramelise knezovich mottinger ewarts stepgrandson chattier pigged corprate soundtown dalixia addenbroke dibinga giftwrap saffia castagnetta nantongo nonguaranteed conventry distastful shatterbox whitbreads outpunch bijapurkar pospiech proconsolo kuznicki llloyd pontdue piglike krzynowek murketing guyliner manocherian gerties uhsm mohanned relativising jonckers nedder hissers skordas americatown phongthep strongstry 
tedia fluking headily romaeo thinkstrategies lomox anytone bigdeal owlery talkboards dutv wainrot rakitic eastdil sirnik radsan oztekin ghadanfar havlova quiverful testwell scanscout killins sovx hungtington tumakov fykes zavesca cowboyish taskila reckson mspmentor stayover kaass blyther hesselbart wilusz malaisha mapathon fisketjon irisguard énarques hamli kesayeva garmser sigmundsdottir windoz decreation delibasic hachilensa weifu naturex petrocco ochinko scrumming merar mingkai ffynnongroyw planco cakarel verifica cytotoxics brezenoff rrazz sudairis upshifted semifictional noncombative sawafta infantilisation hartocollis unefón ratemycop immunomedics grotelueschen trichloramine gerbron mcmeikan twistedly poorish airyhall ahsas sungate melser gadael bppf unmonumental nellson banwait invariables lafemme abdulgader smobile gilroes cueta weigandt lomsadze kourlis hardstark henick cholewinski nucelar seepe pelpuo laleena fishs wenes unrelentless firststrike expencive klimavicius abadinsky chemaf chasni jeylan raffah ulsterwoman pennequin kruskopf ipeer tarraf triyono masserene megaships bohos elians cvat emambakhsh gruendemann oafishness karaghiozis follw alertus wickramsinghe filty mirescu barnesy naffness newstalkzb skordelli overegging remortgages lulek boutari sjekloca tokman fawole kalonge lovetsky jakobsdottir intergy basestations liittschwager abusrd brightsmith orsuto adref canaval portered megabanks saighan finicial stulla theword skolar alvita ghoryan clunked christmasses thimerosol desserich europcr dockser trebble slart mereham talalla finace mudroom burritoville eslake soberingly behid imperical harringey komossa miyasaki nakedcapitalism pozdnyshev fahrenthold chimerix mydamnchannel vpci sbdcs wrecklessly westernone hollekim colaprete nomfet industralised lunkheaded bueser beckstoffer palaiokostas kobylarz carslberg oweinat coreco derveaux cogitated jocked maecha ryavec utlimately supermagnet cyberoperations pinces pushdo jannice photoespana glittenberg 
repulican ironiya pursuading globalgap cesmat evangelicism spafinder tchale buchori kaelo kozhov ‘ gudele gladiatorus sanez flipshare civilianisation firek egoscue phaneesh brayn gocycle spoonists gulsun aestheticize sermonise kondic ekasala helitankers newlins budgetting younsi funktional jaffre utccr gyrion erold gandas disinflationary tenjune xpressprint lafig clerkish gargonza amidan kolevar eatings nextage helthwyzer niehuus rashun attorny mothing livant hocha calafati bouillabaise ergil motorguide deathbowl kopeloff downclimb bookcrossers antonias kartin consumerfed pourquery tzds atlanticists asmedia besmillah makansutra mikulecky commercia angiddy hirable nantana suzella mapeley sixti tankosic lscp kirsta nonconsumption nuturing fujimorism securitising flublok zizhuyuan comunications hollinquest costermans printables awail wheezed pyonyang miksche mileaf buppies eurotelecom govnerment curnick sarbayev rightfield lumie teatimes nplate safco wobbliness szul keysaney piratbyran cumbiamba ruppy ricucci cottoy darce czitrom butterwegge mohssin gloser jazzes curatorially thankgiving brazlian liant anaide wonderfall sevc badnaban tristans sulskis seliverstova flokati bpgc sorabi asaish wokorach socialable partnernetwork glunk nmrx manazel ennico kinro nattavut dietziker noodlings dadoun pettazzi unspecificed txtng alburn normany extenstive maarib upgrowth niklison kickings graffius impressionability poitevien rittgers bawb edrms baltanas tairu cygielman aution terissa treehole incredulousness fueding russound kabemba apharwat jodphur rsbp martabe goalward qeemat wutke sateh healther tremolandi zhuping makone ntdoy piranesian imomali fachette pammie voytenko dsgv haferman mollart giorgy najja asheley valups roseen whitticase alesco acorss fawziya hatefilled fonté eastbanc arenot saffre arthroscopies lapucci detoc tapjoy gastronomists  abent smolinsky spisto visitflorida tiaoyutai gweithwyr shabdarbayev kuechle indata behesti woozily sumray mmbo sokudo kachine 
hupert talecris namerow thistly peyankov pondel homefires kermits rssd dursts scrappier gnaas zewdu stubholt migl launchbox buchaille jerera cteep gillygate cosmetiques mullahey tuffers slepnev epiderm solantic torjanac vslas pelletised porini comraderie caramelises duans roguy lehwald fehilly zimonline atiqi huorn komesaroff taxs carfare mittleider summun confiteria losties kanunnik cyxymu shular driis mccrickard pluff volpes pittances katsarou delgardo schoenebaum shantia winnner asadoorian overcaffeinated happyton kadetsky megasites gondas jirong ticketnews cockshaw cubias bannissement kuduru talkathon yrcw lecarpentier munsef posiblity winpro jibriel xianchun mcblain clendaniel chessum angelisa castlebrae nassiriyah drillable ingérence madakhel genkin devilishness varicoceles protoconsciousness evoras semenkovich princella listerial ayurvastra horua rumbak zonko bradshawgate enternships chestertonian nervier thanapol zachelmie barasia cheonsu locasio encoring splotching wintergirls delbianco kamaron gfig stci bellalago beushausen birdbrains foodtech suleimaniyah goshdarnit cimab jamere lalchandani olaciregui almspeople yuwali emelye mncn abbyasov trinton nainima bestthinking policinski stovepiped allick elagina wiedl konyn machanga solartech descouensi coffeepots roomfuls mayhane machikhel myfaves masoum paniza bentleyforbes elemendorf mithering nogee hillfarm gordyn okeelanta haslehurst theemithi dzierzanowski niveen bedoni knottier chanterie miniucchi buscato sungwon goolag lmag unsnarl spectable kichel potholm disraelian palacerigg coronor gloton newenergy meggyesi moniquet datagate drapey arterra tandrup summersell cieba pospech blotz backcare antalina ruchei edgedale maliti weaked denrées arshadi niehus julfest passionflowers tillema romenskaya muntazer takashio teamsite nihiwatu lesak yubraj veltliners veneficum rusalca anthrasimias forticom jheng valencio rouhalamini minwax deunta eiderdowns goedjen jumptap badric interceed linburg resue abdalhadi canden 
hurder microbia iljans kaleja governmen libanan ubran dowgielewicz fuzziest varlotta carepages chuckin puthukudiyiruppu kourtni hovatter detyen languir adviceuk plca baczkowski drcm totico dwis compnies zelenkova tohani gphin dubovi usbsf tribalised maqtari ghadafi hhrs intrinsa eberli ostenberg soaraway sancrosanct texass brownen premeditatedly saucepot cityvista atifa wilfing wastefull duchowny yelpers sestaret kruszewska floridans rogeriee photofits antza moyad simitra rationalizer sliwka zianet servents dollor agritalk rechler lhag edesia eyadou cherylyn firms parasiliti laxest superweeds zacatecanos frontpoint pedlosky splittism deroch waiflike extemporise stridulations kadaré terrorisim eichele phoslo jihangir zestier mundorff strugging mynmar sedotti desertscape cheongdamdong chernof slyusarev nulf lauryssens krizmanich declinism machreth solomeo wordle vindico slumdogs humilate hydrapak grandclément parapropalaehoplophorus passsed targui alsanea pacquement shaed caisi elemis monteleon creepfest xoft upshots genery khadivi liando vidana rukmangad blaymire mathiew trzeszkowski schimelpfenig brechet zhoucheng baugham melera verschure prority jarping kontarinis gypsyism soodeen transracially stuker catwalking maislin wsri jreri dornbracht betani mocal repercusions glyver yahyo bnrt chysler rajprasong mompreneur pennypot batakovic chouquette duesler mangelwurzels poethlyn douple bezzubenkov mimmick newsosaur famis ajorlu updraughts vanica nvó cornettos eonnagata shaboo bouchar haimowitz casanas padmanathan hironoshin mushey investgation punchless ibaceta catney sheehys loglisci unerasable gehua biznet modirzadeh leuh azade splattery aastrom sparlin humad acrophobe sallard kastanek mailbu pazornik onieva slingjaw franticness gureckis copolla okropiridze ritziest giulani bikinians bolofo reprioritizing storewide torreal craigielaw crouin huppuch daranee mokhiber rsvped cabrnoch qutenza khvalynskoye quivery shigal clintec abeysundara pffs lizak laminusa jcct lvas 
medialunas arnots bufali hadzidakis quanxin maeslant healthteacher vrijdagmarkt kotwani svns woldt lyron takirambudde transitcenter ratemds berekely thirstily halusa craythorne paupiette kannam obetta clct girifushi shiratsuka goldshield ikejiani monkerhostin waldmeir donatelle abrevaya blistery stockrooms ooten kalkilya caplis postrelease orthodoxly mdst pitchforked sokolin nosazi colat jeapardy mitschele optium shalaan shehanshah ellenbecker varshavyanka schulzes radco izaga merscom feistiest krindjabo returnables hospenthal manyani powerbases headiness asankya farakhan eldonian innovent electromotors longeurs nuhanovic intersouth georgens kissman larfaoui harges pvcu cheatam rehabiltation ufland gyeonghui qutbal meleca beibars chermen abdukhadir iois immodium poeticizing hyvonen decosse herriges mallandaine cosmotv trajal lankapuvath overoptimism chrysohoidis vujadinovic biomerieux beleivers wimhurst bleecher subserviently nonart footlik almds sevastopulo ginandes humanbeing nonowners chinses aarone hulser weatherwoman fatas damnosa wheatgerm megaraid galatolo ssoe hubbies mccrimmons financialservices sausan milwiki gamboling atender outhalf amaroq exillon rafert tactis priobskoye llywydd strozyk mormoris spiegelfeld generousity mybo dogeared lucato montamat abdilahi fvpf dicharry harridans xbot arbinet nooristan wantchekon shaleum chocoholics hankerin shigakogen moscaritolo schweiterman idenix speac pazdur mouthrinses expatriot scrunches osgoi demiglace firstflight ceiron gxowa yolan kayrol lbdr beluah shaggily furlatt trochez devakumaran superglass gerstenblith sqw sadafi keelen sicherer expostulates esfandyari nombo bozhong coloroso womblike nonjury karadjian gloriousness clasketgate durgahee frontbenches thalesnano gogola snildal karradah machnig rabeling revus bairey vizjak palistinian predicter collegeconfidential grassmayr losingest mereenie piong systinet denit smartcar batzeli tiumen phfa gormez hursting kither insmed hassina sterilises karukinka 
gavelled stonner mobayi pspn toyloy transatel bayoil regenia tangerini bravata zwirko natrecor caravantes marset zubasnabar ainain wizansky hulihan leendertz abeckaser massarray trollops soooooooooooo shoushtari putaway seeen edul lovain dopier distration rosenholtz loreille techlink sasmaz loebinger reiging bonagrass iupa smykal etoken railbirds concusion wawruch digene forsees etinde marakon gastrotomy kielich yuewen roffler follath jerriais schizophragma dasic buppie sacharov peochar hertzke bargaal marguarite duboef constantis compeat jadaf mgas loumia hiridjee willowcroft railpass unwigged downgradings ottenhoff strandheim hadjimichael ozgo komkommer honeybadger rephased respun hicl nereyda veltre treebhoowoon duirng sharethrough delre ballough reroof perevi limra maestrecampo eorpach thearon forestries zarandeado grimana pixol findomestic naryanan neimo preseren violari polymedica cmops mascola kaltbaum flatscreens inkdata fateyev ashker sisso furda norlandia cprg contnue saroeun crystaltalk budaors yessayan magwire promming mobissimo nwamitwa planemaker dinampo mojie redpants umberta enchainé mealbox parliamentarisation summerley damiran machanic kalivas rengin ganswein gjana credu isturiz unreachably mcpaper sherkan alure emrose bioterrorists debussian eurorealist secretarygeneral incease irrelevence stemis winsky retailored gbod airdefense lulejian revarnished dreweatts krajnak scoffingly zengana cardpoint boogieing debrecin vorley dohse khidi staidness coicou garabitos wadir hadibo rutskoi maureau morkūnaitė hoofy alkifah chioce staloff fireraising ultrarich undercook iseminger commentateurs beanpoles jamaatul juleanna mcwalters venofer twinsuk larium hilots rakotoarinivo nishta burtonshaw operationa nozad maciborski dankest ungallant rippleffect bonstein ideaconnection asrgs wilpower makahs axcan fredik haredit soltanifar unpragmatic basestation battaglio auffhammer minendra scoreable tickover pgba qualita petraske balsille garrera sneineh meleta 
talampanel woolliness rfma bkis rummyroyal blayn receeding misalliances scruntiny qole socheat hfdc offp chemobrain boosaaso aebleskiver kitschiness slatterns afriat bulletproofed finaid kundai quinceaneras sammys kabati kambeitz swingingest risonare marinates finkelshteyn cieszynski xeomin kharadze donathan blumka asiko handblown excellium cavco kabealo nfzs macedone streetcorners babiera krepon ramages pulles blanas turkomens provopoulos superhigh udenze applehans deyun ibbo catalfamo pocknell ppgi rpcl goodhind lynchpins gobbing madail gueridon heped islamicise oders karaiskos silkily buesing kabakumba hsmr muehling kollwitzplatz jande pocp hindsboro fclo mccarson braininess astringently zwetsch quavery ravussin fundementals angiomax queasily ojul patientline crabbiness martiz yturralde jalang solvesborg hortsman balber shintos priorties meteast leciester priklopil chaorach admetech victem baciagalupo jurade reinherz draskovics tamau loungani sphp metalline martern foodbuzz junketing chanomi mofya blogdom dongfan altintop octagenarian watercoolers arpil rahate ncys innefective bellying cochetel oversexualization flittering downdraughts abduhl kennnedy watchetts rebundled dornic milevskyi chengelis engrain loretha altiparmak belto curanderas smwm liberalizes hellga jassat demarzo ngxoli gerlis butmalai execunet windsurfed exected mgma eaggf sarway ultrasonographer scoill ontak laalou orginality surfcomber suchlicki soulquarian seffrin earthport houplain lazere cyruses pasich trimac kazanjy vernakalant ropeless muchdi honberg calongne tutterow enstar sheratt luftballoons shinbrot unnessarily mcclanan andwas hernried prophy corporat schield unclenched mobots nusairi hollensteiner meidt chimmaya hanunuo tamica helicoptering surpirse myhal himeslf stroumillo kannady llanederyn menstral woelfle peruto arocatus annoor niemuth decailly unjamming hogendoorn carlike muzdalifa voigtsberger kuloglu hoors velocix goalside vasys finchatton coffaro zhuwawo andrelated myatts 
truscan laulan jariah rotflmfao newsevents billgates dranove risibility qpsa queenies pauperizing challeng cynnig chadborn daelman kaplanoglu airwire simpfendorfer fleischli meuniere zeltia mardomsalari kamlari peripherality malhorta pierazzo comitment silajdzic smolarski swithering radovcic twmc rosselkhozbank przybyla datacash kabatu blynyddoedd gasifies exploracion daliesque minocha hempels pennenvironment fawcette guarinisuchus bingers currupt sasseen parapattan banteaux mhip cinebarre dudina yugosphere stiltwalker chessen dubit dotmed dahaf helrich wishfulness iacopi gongmeng pamperin tallymen tulawie reverance sunoo wakpa vollet ekawat economis blaid pecnik bebear chimtal gerogia bickson avroko binkos stooksbury chinula maxxaudio dcar datscan gandrange wearingly nadesapillai horribilus magavern zatorre wheelgate viprinex strohschein deonta barchana estepp rouman afssa mingenbach almahata campaignin dailin misvalued samanthas klitscher sirantha couvillon vouvrays schlosstein palmour tromped nextone nazirhat bioform pollers twitscoop hrebejk gibbers wieske planéte coachin chàvez kenerly qedumim merily medfusion mirwald cytox dŷ aizhixing iapac ofits tsagaev deconcentrate skyseer arway tezal esary rabenold mellamphy camisón mattaliano shibor marrapodi cachel rossminster mceneany cethromycin reconstructors airil hibe moderaters schenter anfac vixenish coolcullen taunya dzhakishev olymic falcondrone gulbai fondamente mazurak khuloud pbteen grindler unwhipped kulcinski heterodontosaurs siegessaeule prosinecki charisa riness manssor tcktcktck nandrive dularge marazziti gethi monocracy snarfing wolkind jaffeholden strummers totties brulant jmma dgacm fcag fastiggi mansionization hospiscare floorcoverings rovit durrence obamessiah sanregret underpays healthcard zhanjun rhones koolhaus hatworld mubai ocelote fimmvorduhals felco ffliw electrosensitive priebatsch hightailed birreria emster gussying arjuns shirtdresses borthers abajian badros invirase wellnet detzel oxby 
lesnicki photoshow appenines urunana belyeu menji homeusers unfashionability inemi badiani leakier ylan pennisula neumuenster gravatai kaukenas synnwyr insensative spravedlivost proximagen fisca lumeng wilhemsson polakovs eglseder overbred socialiser hadil franticly buffings brashers youngsun sadove sokou bokas cordozar nucks madobi lupanare sajar intelence swarbrook graessler dammerman trony munkley gadabouts viazzo computors coughter raceco allopass amplifon gladinet volnard avaliani apmss menvielle naroua shumeet sainbayar hanakamp roseberg alsol trugreen kuntoro nightclubbers laox viitasalo janetzky settting afdo bickelman dicier handsewn huipu datafeed maidlow seiman marchioli dragomán dilyn spacie wigoda victimises consituents iradi felipo cuciurgan memushi hfsa munnoch arachnophobes curations miskick marokvia nzonzi shengen bluesign progession mossbawn unshakeably wafering ghdx minggao pityingly caranobe usband wafflers vuth drumlike gorgadze hopefullly integre cattlewomen eborall meritocrats realny snoxell tarenflurbil knuchel sheliah kedersha tearily trhis siliconware nonsampling angore grundvig monterubio ukli burkwoodii yibao namwamba memolo dogwalkers extemporising warpspeed phillipon superzooms requete malbecs hilterman rakatan carnick basuta trbic gentic unblinkered godean cundle guilliams misruling navindra johndrow therapuetic denuccio ymlaen mothersole saltry poppyscotland astoundsound outlaying thinkbroadband etzkorn gulfood twittelator satchit nlihc nowling arlingtonian siezure majonica ondracek cenzic povall unsinkability sassenachs koterec undercapitalisation kizilkaya chauzy kovalevska zzzzzzzzzzzz chinaamc labioplasty yugonostalgia irregularites coldstreamer trymaine cupless olphen mabledon minicucci kriegstein sonorously tawnies tazered iddles techonline trygvesta randzio aquavits khinda pemgroup gerontocratic rawanchaikul tactico jermie conciliations baimoensis braques kepkay oteha andalucians trebing hedayatollah flimflammery siennas 
burghes acaai kevelos finnfellow perinatologist muliana qifa wcdi arapata rouder raulini maleos prezista trickski tarogi waxworm ivhs petzen guilermo niyaki realtionships wolridge parliamant moneyspinner saddah eliès gorund goehler lambah dörentrup twihard gearrannan pereplotkins gravner learningexpress housewifely maskiot akhileshwar gravagna faiez ueckert gillingstool pecoul hirawi mulpuru bellozanne zanner khomeinis cerphe urgen dinnae funkyzeit noldus tailândia marlines azzarelli johnasson markridge aezs calderan efinancial unknowables tekturna jookie daintier intere loungy quoran marshalik embroilled jaoshvili screenline testim eggcups mietto waxter monjaraz coldhams eurocracy levian calimocho haband twinem talenfeld weisfeldt jsda castellinaldo undisc neurostar lagravere godapitiya rajohnson lederhandler camardo alberquerque mopup chiclero breckconnect balatony hommen microlenders acjw faizaan ilaris llanrhuddlad clinto nonmuslims martier raptiva parrin kumbwada rhiann vitalo grncarov chevanton runarounds nobama gttc eurostoxx iorworth backpedalled sinsuwong rothwax giezen twangiza djenane shyte ebct brynllys icepower daozheng unfond cristocea imado carloz nicois nanomagnet lemacks kinkri fennebresque certificados manufaturer voldermort iopc curseen hscrp reumayr strimmers imponderabilia capline benecio wainting cibers daschund unpardonably trafeh emmitted enria simcor amarger nflx qualit trevia caplon eteraz borosage unamimous cytokinetics hailers stryde virtuosically detheridge tarifs madhwani buddenberg mwangaguhunga stickless unconsulted ivatury fontanillas maradonas magern jumpshots baskis celldex ecovoyager somaliweyn sabauddin conar esye akkaz shanghi tablescapes betao slarke nonurban lelliot junquiera executiv ooer stepha schoneborn moanings yeezys celgard wiedenhoeft nonsedating munlo lerga ossies jalaledin knitzer baudelairean supremicists intelegence coaltrans elahian sences aarin zavadskaya zéribi housholds bossasso achievo hervi mcelwrath romaniw 
fosback firmdale mpdus boliva hammily oetsch ngci grillings scamster achelpohl harkett mouthers verhoff crystalox tarado redinbo minibars plunz kentwan panicing loserdom muhamud sufferable vollebaek burtless powderject geldorf costive unrecyclable schnable grogin jobbies newgent redialing xpac swica froideur scambaiters seropyan lohachara habibinia skyroom shiral athashri shukarno posners dakdouk viznar outswinging carryalls tescopoly christoffers mulyo pgeo snohetta ruholamini uyttendaele fhlbs freeflowing suddeutsche spiffier maldeikis craigengillan hypercharged snarry griscti sudnick soberest hlang carradines meixin arriaran secuity dourer mesenbourg tawke stoniest cordex seretide depersia cmpmedica cmpb rillette attabi magincalda harild mozypro formenting syste airboarding healthpro szebeni heartspring skiiing dumanski jeneen nycity mahbob womanize rubatos corrpution karbovanec omnisource amcore tonery spaciness flummoxing knsy ambulence carmat chazi delsman challem achata taukafa altzheimer bundred truckmaker braught kurzak wavebob sartwelle ristroph orelans syatt sheddocksley univesrity cpzs gravesides pharmachem adeiladu haberdasheries aplf couldwell fineran nonswimmers otarola legaue evacuators mpitsang fereos esnol mainegeneral libala afghnistan qnexa benoits seviour yenegoa fineable vermögensverwaltung psam facpm phytonutrient nurik crispbreads semilong meresamun tombol andews unfurrowed bjoergen millionares anesta tornagrain stolzfus emphasys yelstin goubaud anjale hawkishly minuting chifundo hullis lambeosaur wloszczowa sensemaya sandalled iversens kleinhendler westfalische belabed tchico pentref enmeshes watanabes idustry reaggravating predeccesor marenberg yahe weimeraner premajayantha pennsyvlania lightningcast dickover paccheri kodzo biggoted undersupported schulkin identidy foppishness scrummagers siphan govermment eqii niknejad donerson eckaus duncrue shaghaghi drcc latisse wackadoo antisa farcus aerothermodynamic aprilla vuic subaiya mlam boinod 
lletty bostelman melonguane orbicule beewolves reassortants truanted huler orgias cazmo nyiregyhazi upfitter repored yakumi deptartment minidisks greedheads documenary gunowners intitials overfamiliarity piquot reposa consecuences helathcare gunterman shimanaka steaz cottoning geraghtys ikarian fentora kinstellar selzentry imposssible phormiums charrieriana whump boguchanskaya boonekamp corecommerce zigun naglazyme galsulfase scardelletti sabatello phippses shadowood nonfactor felmlee yanire porking misallocations wiould degroof pcubed kisiis mucoadhesives ablikim ehsi biriotti misadvised kirkston asawin rxnorth legeros giganticism judaise presentencing juiceless claireece hauda chelseas volex loderick sarwary litigiously jacksplace musati ogunniyi webathon perbacco unwinable schmechel dancejam wayser wirex yakhdan longhini wafels avinza cutwail girat ziraba jopari thruppence ssnc droba ngwira sakalis sukhova cssiw sokunthea aptuit islamaj ameriville laiping dilxat holencik eeewww masese seafronts longcourse cholerton toblerones labañino dushkina fazalur claymates shambas pelletiere lucoff chutkan maximedia bierbower majimboism redrag bithel hinterlaces conveyers amons perplexedly razziq yaggy takabuti healthscape popcast tukuls jaiani travella unseasonally loofahs skion yozzer duveens gomedia sspt menks critcally andriesen stretchier shawana ommar dragseth koziarski lasama imoogi tingvall klarissa vidir gtgp tuvshinbayar unneutered poisened hanesyddol myride merluzzi mydroilyn preseident profounders alafoti stollsteiner alfies vendanta killinochi interuptions poortown metiner bearishness sousvide psittacosaur retured casarett lanita aaahs kimberlain iaro bischel shortcircuited foreced yelsky hematopoeitic malliouhana zakouski couthard marjority mochileros eavenson popcorns arkans pochepa inititive dugle srla universecity kampang gloddfa tangee tanigue wreghitt boltonian bournmouth drapper taekjip hilgart fullham beefcakes overeaction stremel alariachi tarmachan 
hasab monaveen youngwoo kalimanzira orneriness fusf mantuo diavolina hebior rebaza chelson pyrolitic cvvm rompel magliozzis kossek scobleizer tyab marghoob exent cavalho spontex itvplc outrushing cpfs munajid babymaking underclothed felloni mcmurrey kliemt manghera wichy kinglas aljabri outgang legalisations jennilyn toquam charmil multihyphenate connivers brighthaupt minocin monickers argentini hveragerdi bartmess grivko contradictoriness tacticts ayouni crystallex impm nietsch shouk uninstructive pdois haynor karadzhova atci samsami opuc kloske gillert neverthess stoiko echenard lagreat tonier bourtree swaggerers cloman umgd hejl prewired husbandly eminant deflationist grytsenko rodearmel albosinensis miomi bellio evox ncade literalizing slanker pteroptyx sanudi drmo sulaikh cannistra eqipment yvras freh cateura weskeag tycko magwilde raxit renationalizing monfried bolakoro wiederer blunderings gladwellian swedishamerican darstein geissbuhler geronzi viviant fierek melquisedet westmuir chillaton bellshaw microprojects sylviana waszak acqusition clothilda labeij stansbery optumhealth khumbanyiwa sentimentalizes worldheart tackman qteros smartrend softsoap pedrocco sherbin festgoers crimebeat shustak maureece europewide wdrs dysautonomic bhagh harpst retailleau kamide ubinetics cuona archfoe yingxi mehlberg chafkin podrían oduma pieminister macroalbuminuria pinballing poleyeff angery rheolau defendory plantsmanship rajdamnoen kemoko mzembi tedmund stansall mcfarley käch majolo wadongo muhe wudnt fliegelman shedroff overkeen unloveable anticalins negotiatior vellanoweth noticin manglings coldstreamers cahalin stückl dunmores vanillylamide wansee santona joshes highl grosveld masieri lahud puglise khatiya mcbrady ausp quiffed pandalabs witlessness nonbanking relion epaule stasya stoopball stronati ensequence maufacturer winski gschlacht ratoons racinger mateyka berezowitz taulafo talanx rudwan myhres frankfinn overdrafted welsummer collaboratorium tokusei pohlschmidt 
kazakoff dalehouse plice gillim mavizen boritzer aaaaaah mikolášik cendra winnercomm appbrain ghappour floodplane staffy allidina mirós mostaque tantus clienti realstar incents sutiexpense ziagen lerue lemnis distractedness ionix sadlon peformances coppolla taepung maximiser detkov moiser sobaru eckern hollywould scapagnini seymourpowell vapnyar skabba sporicidal moota eddye bearnageeha levatich kigawa grassfires wriststrong briggeman pdbs unfresh ffch snowcaps bialka stajner staycationers moualem wedick disis gochfeld bhall difinitely arrar pendy contrac smiljka primarys buckmiller gourneau anzorena marcouiller scriptpro securest herbein brideaux flotel displaymate baabaa pharmas tecaccess gricia economc ginobli litepanels kingzett atropo hammerling overmanned nusseirat vanwart foraminotomy arington witzenburg mamond sinesis leszcz gwinett thomand oakervee poundings jieho grippier pushier mersman boogying kwadzo adrenalyn eorann nasutra actigraphs vasotec prinivil dsdha alternadad krissoff meerwala kiffy cortera birgfeld ricasa harmohinder gadarif pihlstrom chocka dotorg lumpectomies banxico livecity finlo venezeula fallbarrow featherbrained keiy bergmanesque unflappably cornova agossi campiest golagha sunw nusta swedroe kagda kissine realview laudamiel mediumsized internaitonal longevialle leighninger harshed stokking oceanico guilano poghisio chaisaeng estrace juthamas andrunache nrbs twdc glitziest hukins astorina beachbum moushaumi langlees gantumur chialvo tantalises sekaggya gruebel flowerbomb baraclude olimpos mitoji ineffecient retière jeannene cowhey trippled sciencexpress cubita villiard gbms ceidiog nativeenergy inopportunely deante yanoviak pretreat mosterd zeljka strianese busches zypad andrewandrew mamozai bergsrud citytime shalwitz arthouses crotzer subaccounts sunscape jegley tracleer nyregion brugnaut loanmod woelper transdniestr schoenbohm shimari tappable dwpi neomedia delusioned pruhealth raneen relan simplifydigital brydes urgup nosologist 
grocki deeker kaldenbach fotowatio geophysic promptu bridis biosys lscb antiamericanism nonwork synapt lengele undersung yawningly stennack mareli hanian viveurs goldmeier abosede gloveless grassfed lievense comeek americredit minqiang pdks schkolne euroyen athoi willibrod amnor sweer formfitting diweddar nichanian mandere algt boutris iqms atiz pfgbest knuutila kairy estreller stepinski netcord kandids sharieff kimmick braml garrotting napeo comisiwn pollaro proietto direxion waghaz citgroup salhiya christofore nonuniformed chipaumire rehabilitacion rovaris hanakee paternos aeol lanau muttawa yuchai preaubert wakeups folletts hypocrates convivially reado magarshak alavarez dimbos peoplesupport altropane netspend mingquan bluenoses nonhormonal prebil comng pepperjam altunian cristinas oversells mbet qannik tundidor esipova cpdos cutsie tigerskin sonodynamic phantasmagorias salord thars prowar almli fasto danielovitch seniorcare jibao plocker bolgiano kampachi qudoos cyfle recanvassing marqueze deryan irlando bamcafé duii chasseuil rackow kurnosova feugère matoo dieugenio sholz runbacks pesapane bureaucratise ariannol sebangau kubatko dispossesed balaresque meaklim kharji ngure punctiliousness wisard humanizer jinpan glicks truppi magestic yosbany jangl icjb monastaries allweather rodalquilar sollman premeal ptolomey kefta alsobrooks sandyknowes dupaul cluizel giulesti critcizing lelly rahala reiffenstuel rhywbeth penzes insurrectionism itagui shaladi welched vlahou reineccius wennemer khorafi volumizing filmmagic lutex halfax sakelarios lepse xigaoxue shaolian ooof hebu brennig velsor fristrup hayz muchauraya dessaline alaksa responsibilties pontoh nederkoorn pomsox grimiest sourceone ventureworks sfsf weadon reavealed denoument sanhuan peltor sentrysafe rydbergs gmed rightsignature rosnano bezielle trilbies visitpittsburgh nakanowatari unbusy facilitie csosa cutomers niepoort kazmunaigaz bifas bmsn speakaboos snicked yeowart miuntes wolowicz shatswell sphb 
sanctimonius panfried mirnehad nonelderly lancearmstrong winiarz journalis kohly kricker tirua motznik delusionist snicko gluhwein liberalness chillaxin boyev ridgback sizeist uniqema cisel leimsider enouth probablys disintermediated penkair bitchery acephalic rnsa intelisano ettalhi snowsuits quanities microfilters theamerican mowings gallanagh zimiles faugere bioenergia fstd saechao abseiler rvucom disappoinment huiet mppi demitrios zosyn okiharu intermet coltellacci wohlschlegel icepicks karambir oktem simrill lhabu alokozai kursman heshu duadji polkes farfur proboszcz vricella buerhle bahdon multidecade cocksureness ogap skiied praill dicecco syaifudin daunivucu rabuor haartez underdosing aganda rowark fincad abisaab peribere llabres murenzi altegris buechley nimblest fetc oestmann noguiera sellz contemptous avtc kandao komombo unappealingly krenwinkle stranack wirya simulcrypt leviten sebagh faceing dejarnett knocknagoney kohyama midsemester camellones shawnda iotova stapert ecotarium adconion fasanaro catryn chakrabati appinions stuebe precontractual tlaa asiate kosciuszki whoda catylist gptv rendlen saslong adfusion milstd dubrock buzhala fortrex gazump chargeoffs realtysouth durned croftlands kunsman shaktu kennedyesque laudin wanzhi gallups sierpina paylocity grandparenthood ballyard schneberger silvesters garimpos abdrabou cupcakestop lemorin laicité zekria cadetii mcintyres bubbliness kahakuloa javacool sayare toutanji alimar decaid bowins vashee zagre webrangers jiam autorite edci cybots higginsen distaval carvajales hybridpower sukeyasu brizzio biodrama labarbe yaaqob summize ecoark snpl kaziu mostof dissaprove gwertzman kustoff eforts monarcy motorexpo kressen sumeth brilinta fuggy cukurca hangabehi swipeit direz eckoh agostine embaressed moschofilero cuggino lightener deffered cyberpolice chantaco egoavil dumstorf estabillo krezel dustcarts gentianes cphpc rodenator californina politkovskaja veddw embarrasement ascofare beckmen drthom lapada littlies 
multihulled bookbuilding mahvish gutturally gaunter bramow guzelimian leinin umerzai poipoi islamshahr phuah drawerful opportun walkus lacors harringon planey vengenance hofelich cvff offenheiser lowham gordonian blakovich krenke afghanstan takavesi freewest amercans secamb playgolf cptr zibel mergener chinches seuer couba starclass seghal latosha inalterably silagy confortably tamboen jmba catrell nkuku dutybound magagnini felesky jezibaba cramdown folabi gvtc gordys rugero lunchmeats rexcorp leedle conflab crankbait gerds klion hazut repenteth kyriat wonderingly cdmrp tabbakh barick haisong djemah escritt tikvat smize hollowest surdin jacoub noncomedogenic antaviliai omerovic baissour kaboudvand klarich colbertist bodenmiller groeninx tekura golotsutskov sollitto mercuro gbubemi alfoneh bogot enpocket sofugan towergroup seviche hiruy annahof fratantoni goiabeira buzzd slaughterings glenzier inceasing bogenberger dalaro disaffections outduel herminator marlaud fitzerald trichlorethylene cbkn spunbond reshetin becquelin biotherapies tilburn ottomeyer punamiya barakova skordis marszal zélindor shiit jeremis vumilia camgian nealley portie serhani rickarby marandola uauy thepkanchana rozzers darkmans paracuelles readys demerges campara botchergate levinsons giubbilei edusoft jacqua dualchas bodhnath bildman sitcommy ankudinoff udovicki rearrests fundholders uncategorically pontolillo nken keyontyli reprogramme gittrich obamafest glitterbest zhadobin toktumi plumerias copney swilcan corpselike schwenkler teulere gammex pennese vaghari shakhova deutschebog pauperised bdrc homecall wrily cytos bracek lindrup winsomely bettocchi sibbach hozelock castion wellsphere businesselite gabbatt septwolves alailima rublein kuehnlein idox bsharpsonata superstruct shwani debriefer whyles retaped crescendoed wedner hietikko aaraji sidanko schenkein extrabudgetary pingg corticeiro unilin goedbloed koreanness shishas duffles dadhwal antenatally jounalism hiec decrepid venardos epals 
russoli coldiretti lichened fdtc awdc corrolla akoskin uafm tanys mfhs avruga groehler matejovsky jecca eathen nwnw sheeeesh saltshakers republicca goettle monfortino lerck femgineers microgenres drzal atatcks fudds magicard wyeside notbe reincarceration translatlantic menawi mutungo gaidica gotvoice fbma kishenganj bathaa backpain gorlier semakula galumph cramdowns modelworks compugroup groskinsky foudland sdvizhkov tamaela shalvata glenmachan howgills bernsdorff shernhall merseysider nonreportable halabjah fashionableness killaloo svia floersch countermajoritarian chengetai hemmingwell clocklike cornici gostic realitywanted pastic mintwood austock swindonians fadumo offseting menrath squeeking hypocaloric meehans exculpates showily loiederman sevmorneftegaz ridong blgm dewormed smashups dyomushkin kocijancic leacann thaobh foofaraw sacrified legbone shurqat queuer smartcool rychleski zrihen unselfconsciousness muguti hydroenergy responibility resel tetchiness saffiotti iobe youssry shillibeers overindebted villement giorla mohammmad gadw fayant governmment gabeler eisenbraun pepic gorbold leisureville collombat obsta dowdier kraskin snowballers hysenaj dolfino ecott bornfree fergany leverence mavaddat grewar resetters furiouser carhaulers dwurnik fetishises pretape frankest jenkinses rhuhel begner leefolt jenev mayed banlieus todner freakily magunga februarys pushiest neoclassics noncommunication schui kuehnert eindoven squitiro kallakis simantha pratheepan hickorytech nacil cyberactivists healthtrust lijit agiles hirsuteness powerbuoys superchic jesdanun gulliani sluglike gortex pentapeptides powerpad buckleysandler purbecks comforce mksm guadalupanos merkavas khizeh kharatian emoly goffard belinde vinals hoovler bakkevig sigwalt ajones hermangarde villicana vietman hungy uhpa thostrup esotouric audibled vladymyr chevedden timbercon unczur inamed dalati nonvitamin karaganis spatisserie heinitzburg sparapani sentencers chvotkin pordy norpramin firrantello 
swooningly museminali oivind overprescribed monoply liddard goalkick bejzat schuey neohoodoo activitists regionalising tienna cortelco pevero cradlepoint cromme pharyngolaryngeal holters alvarsson guidettes jcrew pixon arshed outhitting icpas istabl switerland kodindo eilu adhyaksa shrooming stridence adcolor hantro maurading gvep pharmavite shuffett somprasong mvaas prevor sharpatov ahhing fourwinds spaghettini meing irelend cciced gunyon brincko machingura unsubordinated brilliantined bitsberger easytown milenkovich fofanny agleam tataei homoepathic vartys cspro bagl dowsey megabudget majescomastek housemade abdhul chikez ufizzi avator grawunder artrip paudorf knockbreck aestheticising ayanoglu irbd cantania sinawatra taxachusetts mitsubish ecopy sabip paskiewicz unanswerably armagen lengfelder allerdene pertoldi cosmopulos taddio katiforis langdeau babkas rationalizers menance souléymane huslin menveo franceye reedlike montmelo uscategui ctagg agrofuels packnett obstreperousness cinemanx kocen recchio woebcken menupages bedmaker affrunti nuegados dukette darzins constituants clickables lbjs finuoli lovrien rhude schöppner viilo wanabees surenas shalley auconie zhengzheng chalcroft vottero offlimits weedless sovie betreiben mizuni muyale agjobs panjwin callvantage shadowboxes buzhinsky concernd taffety tskj abbaya mypublisher kahumba sehee folksier haszeldine mutrif barbach bedforshire roglic subsidizer olarn dippings perell mawatheeq sunkids easdon healthview efmr chunter cirendeu lovesounds gottliebova intelispend parnaz senizergues lancovo numide rewatchable gerdak furadan olsenboye geyskens ibaviosa instantaneousness bouchiha nleomf picolotti waltonen jesture chunping alevels caluco swashed defenestrating makaela chitrabon shallman onhttp otarian sangzao pollyannas negotions dialaflight porgo norenzayan nebras genderblind craggier dumsday croisieres openhouse errupt abeling pieraerts kendar agovino yisan weatherbill lybba embyro vidosevic madeover pansiyon 
chipkar delgates chalghoumi stapely basqueness murasawa remunerates hanick docca ndubi ieni suhy dundees hweg healthfood fillpot presinal overinterpret isotis kimanthi sloter dandana sprinboks januzzi stodgily synflorix rabanel namibrand premsingh embeda artventure viccaro moalin gaisanov caleen linktone peterkiewicz transperformance kudlacek mauritsson carpluk gagor siemieniec romatic amazins lavasier cafedirect llif locca supertour brooming bestsellerdom shaqil braintech lightcycler senesac excercize salleras sonthoff fingermarks schoendorfer undivulged schmerge alhajji goryachko alcaino sermonize endorsable egmi alldred cotweet shinguard lshc underthings wedag preboarding carhaul guindilla elitek jardee bribesville candlewax betsafe alayan crisfar aquafarm drayna fenyn hamahara tranquillising liquidised bollgard abhorant segaram duwisib kondaurov innerpreneurs boxroom viap delegitimised stuerzinger shalina micropulse woodburners legalizers deulbari palies noncurrent timeliest esikia neflix ddeddf grässle yahnke wakings gudmunsson sciple dadur andrezj izundu mudawar preservación tashiev fabrizius mbcn sanjayas triossi shengying vakacegu halahuni kanamma fyfyrwyr numaniya partnerworld govdelivery delapp yribe muntaser budgetarily greenopia tricast ngler soprovich livng welcomingly gilliot cupchik droptop hashimzada sarzanini eroshevich mentgen fajarina wolozin sweetshops abeibara thaweesak hauptli zehfuss schmocker drehers jurisprudentially kantstrasse mexicola kabbia octf aelod zaitschek holslag ragil seamoor yusanto spryly durabook yamileth mcdermed vardai auwarter morphotek wimpier lamielle cimatrone creditreform baybio squidlike warmist douetil empoyees incessently scppa nukhazhiyev pockmarking rypkema budweisers besuited estrasorb matlary sisvel grandstander intermissionless enzos photodna ensus pipart tacnav investorplace ziegelmueller discomfits teeson discourtesies scardapane mmcfd djidonou zdralek owiso cemetries siwakoti roussouw lykendra hulahan 
tezampanel galzerano bliemeister tarculovski stuppy bestirred clarklewis discrace dressiness zuffo peales chidlren jfin ucdmo maoxian thatwas resperate rosprirodnadzor belcom stoddards swistowicz eatingwell gankhuyag wamai lativa arsema lounguine hyundais telik szello flakiest widyalankara baumohl priveledged karzais mcgrathnicol zibari redpeg nonathletes skiby croony graduat gedarif imaginero tongswood makowka ngers diemtigtal terer cusati foradil typosquatters munith gameplans frowzy triozzi schokko moviehouses laznicka modishness enewsletters bitomsky severcorr kardel mistimes perfectability meaulte peformers romazzino evrony monachs jeitawi mercher polictics uncrustables ivobank smartcenter mizners investisseurs mendendez raisingkids anselmetti androidguys kjaerholm rathode kvell mbrt snippety flageollet duanna waddya sekitani mbomio shipquay bhailal deflon phlebologist ehly srebnick unstrapped probléme lamperth denty koik bestofmedia economicaly louring chioco tapey mamouri manslayers jiabo feckner capla farimex fanuzzi keybanc bspm jaromin gabrysia dyarbakir underresourced katorza jalalludin laperrouze tftd geosa hadlaq fltc caucasin camaign cepelak arboriculturists rymers succomb brainlock folksters bowmark superlarge unexcitable werrin trively najlah vegging overcalculated honex nonroster cblc appworld linseeds dandey mirarchi givernment niangadou georgeou brownites landsource amicas amariyah ensky futuresex dandling romash schuerger peadophile sadeek hobrough deveoped ireo alwad pervent starpharma pointiest peacemonger bedecking cherenfant blockdot turnas frockcoat rameriz eggos joester takash nitrofen kirtonkhola bankrutpcy kinleigh grunander jinqiang rotundone alberes hoehler sczech fedroff fakka saakashvilli shahidur ariannu geravand millana brainforest canyou murfee casopitant virganskaya miscontrolled chalupsky jagging metsblog paparrazi jaakke nickolenko koumis fassotte crocop shosteck nationalsecurity farmstands sustainabilty kuwik shaabab tongyao 
rightsize whaps courics crosstabs windenergy masonson reupped shengjie rayaam innerlight usdjpy steudle giebelhausen akahane opont brichet viptera wetly refrom almajiri nunemaker nitzkydorf harrenstien addm tianmenshan parrson wickelgren miatas chipmaking nanomoles clampi djabar biddlecome backcomb hawkamah lotempio sheese goldbogen goldhawks myot constution grogans phonetapping imigran wimpiness conservers wesla gweinidog colonics britsoft weetjens clapometer lyssenko suhandi constar cotcher fullol altamuskin kovida sankov bohstedt nosir pbxpress hladyr nimrawi firoozi nivalin americhoice flurrying simonia intracommunity unsatiable retailvision homosexualists leisman viehland azda schwarznegger audacities ioanes fiim swaba dipeso valdan unthawed houseroom scatback damous anberber woundingly paslow finsley schnaidt toelke stressfully diamonbacks scrawnier andriamananoro solamar birillo frape everdream blugerman canolbarth denbigshire motiani limons chhoekyapa ldar ballacloan huco miljen muhmand gripers xiongbing rungi thuer orsow kaewkamnerd strizhev reestraat mawji basei sadrau wemos fulbridge dillute preldzic guskiewicz tsuneyo intelligroup nadaam egotastic tepidity lorencin equitisation jitloff skimpiest legetic sagovsky ayoo bourdonnec biodeisel salesgirls laviv puchase brotherwood garicano medcare lourda vanderhill exomos vaniak nicma myrddyn suweidi ecstasea thamanya vergnoux safenano topups shaherose overridingly mtoko commisssion krkic bytemobile micunovic zaffuto pamodzi monib mouphtaou exonerative gilgoff wormersley vorkapic aloxi atutxa vesselbo cauterise skomal mediaedge refraim proscuitto zannel torries harthiya serener butzke wittkopp raea joyella hehea sneery tarlike freilla kittlaus nallely uxua mercadeo ohmed nesvold skiptracers oligarchial grmovsek lashay chavarin nafee nankani mailee destigmatized kajer steffanoni shofield simonovis graveses stoba nontribal baleri laysiepen sheepmeat plushly techinsights malariologists leeae jinren lontscharitsch 
launsky estefans milovidov trpceski cucurto sheepshearing novantrone elecnor thoumieux imrali isports peckerar xyzal merler wiredsafety fatheree republicn ccdo mauly ldbrain pascuale decitre embezzelment overparenting resegregated sopexa trayless kiyawa ferraccioli mirthala foodspotting fittall manchay kanyen cdii rekulak roscar sellergren privateness fujimoristas sulphonylureas crosscage lowgrade silversneakers yodler paulsin cndr madagascans bramell committted oncourt roesing addlington cflp motodev skcin mulverhill coddler miaohe shortlanesend kanatzidis tianren diulio cnoa orrante mhia tonankai gelil amanatullah spyk kafalas bearhop caurier sorgato stroehlein textplus chuvashov swaptree annc problaby robohm designcon deviatovski dreisler botach californicate sushinskiy chocalate dejac hosptal volosky cashay verazzano zimberoff mahmoodzada spasmed belongia dcip sinkable cardtronics biancoceleste ppdi malpasse bambling mytongate proxes redsell efforting photocalls nonclassified airprox centenery manhali lasensky bilions tancer sosinski sinuousness bancruptcy cosmeticians curbsides ozguc overbill noshehra doht fasslabend rhoni ukcmri vawd bjorkestra amygdalas dillweed streetbrand mobilebeat aurilla feuchtgebiete agdur protetion unbylined petroline monistere tramontozzi hyytia shortbreads gauntness victom tzampazi unumb execuitive cmev ticketleap fulminator onseong initatives azzoli kettelkamp rahmin clonking rubefacients llidi cubukcu eliphante xhua prixes yuexing thirteenfold giley woroud geniesse seonag tschofen melodramatist druba rositano magyarosi tubruq deurwaarder janoyan almight presgraves cawrse bethworks druglink cornah zorome honnington cuvees keybridge skyports segalini mausoom nekhaychik lisnagelvin droopiness mpel kahayla reawoke splashbacks civilans fcpf datolo cryans catastrophize zeulner chvc schoolboyish campanologists aquasplash shopdropping iwonder videomaking antiseptically armenistis flexibilisation michelletti klueger crofthouse 
reservationist microfleece nsengimana emanuello mojadeddi wafty breaktrough fragmin ecel huvane hildburg skirr arzerra zainabu mirosoft neelsville saintula registrational toporoff kolymsky diogel weidenhamer gellerman delapoer soedarmo breea mohaned oversalted hŷn svonavec nigec weitzen sabhnani fluffiest kirmanto sheduled amsec nooriala gyrru bernann volac abdourahim raithatha heakin babyliss rodabaugh lymphopoietin inautix sekur sopera tedxeast prakosa safeen sovietskiy orcun ibfd voong darparu hosseiny westhrin surrended sandostatin iritani florinef avgousti bidegorry shaanika flocoumafen travelportland wolraich werschkul fabtech bermanzohn bedstuy hammertoes crabaugh compasspoint burnoose triccas afbnp trillon taiy illegall modernica prefunded subservicing superefficient nacr irascibly orcy ecochic kittoch roona suitemates sassily undersells liphardt magrao basyir ladaris cockups nonblacks dulsori unknowably mangassarian gratl khodori bestel omazic morayniss excessing eniel baimurat shortgate zeszutek verwilst ballcarriers nasry kabaha deminishes safeseanet abdisalan corevalve wynx rushaid baofang barnacled intercytex tjoka bratzel touadi lawenda mtlqq wheeland crystalising kösters voped gotzis gulity arbabzadeh kodner ebank slickened ibeo mandoon bbag cauterising almeira vulovic riase uttecht reteach noridian issott dentaquest properity iskoot densign mckenzy togarashi bcbst caufman kuryama logvinov adimora midcalf soooooooooo kozlowsky nalci spectracef peterses oikomi maksin aliferis intersegment vetoryl motterlini guadian enzhu tacoda sweazy malbran yankess askana schweiter barrowing repoxygen hibey yogman visioneering llds cheriegate rixie rastetter blameable berkenfield kanowitz yurk cocktales angelette sarikas yekaterinberg exaggerators molczan sungar belhoul dannals roundbush larotonda cheskis hoiem myrup voraciousness siblin mulleted retainability aqazadeh slavishness bushco tasali ozgen blabbers bamat eshaunte cbrj cavates megaclub flyblown gaulthier 
luzkhov lantheus nicholashayne sayedabad methu fatula zaback knutmania wunderkinds descriminated dapartment kardous pratomo dujkovic chedham sleepiest whitts roubatis athenaeums opportuntiy cbnc savageries starkevičiūtė whitwham nessiteras frontcourts ganeshas skeered lebida uptegrove milnesand kuwaits mecary alexza grisliness commodites dharmapalan friestad raafa haymills overwhlemed pursestrings fanhood halldora ooil mungers vautrey chunxiang holtslander bonczek parkatmyhouse alderwick wimik hogervorst lucquin abdiqadir ribstein yakutugol scovilles baybasin actemra quellenhof deeanna danday asrv ognianova volpendesto vetcogray slipenchuk epiduo reputationally rorshach beanee dingfu ivvr opprtunity nazek enior rapdily rocketbook meterologists tushies sautés wybodaeth youssifiyah cavm moeny zerina bosinger obag sfiso aminosalicylates travelscene rrev wondertime indusrty luotuoshan replikins sceats kemigisa lingerfelt pacheo ringwalk postponment decalred styleless gieseker gimcracks treatery covar palmchip inovis ienm peberdy countesswells predjudices birthweights qioptiq kafayat wawrzynski tertsakian teixeria bastianello brauser expotition tepotzlán boardies polically meltoff sorur explainin ferrlecit domboshava dataworks conrol ehcr jdcc stoleshnikov songfests petursdottir haakan flation hussell betac gghc covalt doumato mgcs scanties suyatno uninterupted gjenero ségalot vitamind libforall megabuck propeack fuhrerbunker arijon rondeli oudea shivalika wrapit shufro saparmurad freshfarm bouskill greyber eurolist mobclix schappacher sovio baghaturia applink shorters tweeze trustnet naidex janetos sheffielder jcwi coldbloodedly hoobrook nesnplus stiksel jotr gventer kinnings teochow rightie pancost detillion babyyeah bibbers aneg nkombo sauvagere anham implmented jharkand tamishia adimab zlinux jabob ifcj ritchhart mantsch heeeere yunessun nagareda eleviate responsble kiconco klebolds cravenness witlox moezi borghoff ngaruiya snackable rohmat depatment photoframe 
kubaisa marktest mbeke raydi biersdorfer atyr retraceable alrajhi lybrel nfumu temik mennill shareese aisight asdso mcalorum superphones kfhp ipsonar wohle heinbockel dinasaur puct farhani nerazzuri loveing entrepeneurship subowo enrst gardenless magazzeni okusanya primness dipersio milenky preppiness kruess slfc schwingel rulemaker offthe rdcm idiz myvouchercodes staidly ehealthinsurance calambokidis avandamet xlhealth hatherall okeyo stuiber hellmold huseein armanino legasse mersbergen veteto tegni gonek overdorf tauting slavelike maguwu mishura boniwell kulibaev nightdresses hawkpoint chandrakasan tfwa garshelis qadus hanci decarbonized luckson frownfelter weece namina cunsolo harsheim spritzed emergengy communitization linak nilpferd sibello guyaux chalkpit tastily economi azilect patchworking ammart shamshatoo grugel supermiddleweight renane schurenberg motcomb unitedlex disported zarkovich bondareff acquiesence oitavos bardino moygannon dystopianism fadeaways chuvit lobukhin zargun screenagers yrfa metwest inaugeration udemezue krajacic murm rtea ainardi brainstems smartplant resown nethawk illahun purivatra papalexopoulos dreamiest conomic rouyet xxviiith birded ligorano yinghong cogcc laipson anpac legarie sherbino sakakeeny rakosky yazel wagemakers erbol bueti paradysz cornw bimha jungala demartinis maialino zovath caplat creditex turver soays mannato felhi slipskin emmans babyfood girthy mathendele ballydonaghy schallhorn evote embrassed usprotect rudebusch eveyrone konyk ammor marcelles ejii helloooooo mcgivering alchoholism sonfield nomaan chenia tjarnqvist danneker extavia wtert gibbscam hnilicka sarandrea moletress manjang bdti atoki ambitous maldron oktavec asmundsson bennathan biojet keefauver mouillefarine backslap sixteenfold marketaxess varasano peoplecare sgarabhaigh multicard sameere laserline delao ofeibea youselves oneriot dreamtown ioli liskey satellier kelsoe weitekamp butchness ignoramous tumai mimb prepandemic carmountside zanzinger 
steelriver moneyback drabbest hadijat relishable siyavus chiavaroli capelles fordrough vatuvoka colllingwood galichia denegration salkovskis degideo lusuardi myfortic kijivu wotring barkor clolar mehp edgelab heavyhandedness forard kruzich pizam mundow heshmatollah kingspark requelme gillibrands pregis yaverbaum rajivan riverso wifelets handlova swiebodzin sheronick kultida jumale frothily wadnaha remail suddock felzer karadogan papped gausi thainstone rolofylline wasbir depoliticising rippons spiehs tambaro szczygiel guduric hradecka puckery siliconsystems snarkiest rcgm jumhuriyah roddymoor smuggs whitmyre contompasis transis drevets schnure wolflin koyle merkels tarnstrom colclasure plasil antimiscegenation ghalanai catroga blansett ultramarathoners busefink toumaz uthemann floatables miuro rethemeier vanquishers xcellerator yeffet wagtendonk lofalk diminsh jfit kueffer stroppiness imboela satphone cvrg admati loyko moany inestrosa mileyworld delapidated peerialism tirschwell sheppeck genender tripartisan korndorfer ragatzu lemaine dongrong ngarongo kayse canally kipness biofach bettinga olibeau travelsphere gerasimowicz sirls aleece schooyans puckishly frederikson telecommutes boozin scoblic reinarman eulala kremo shetgaonkar sipkin cameronbridge demutualisations gueros veyrons anticrisis ravelian buttheads spalga tiwonge novaquest decarl dinned greeland quershi wazne cruisy itlay aidcp incomptence rozmarin dinela tumorgrafts biogeneric guiltlessly sentel bridgelux ministy photronics myscene baharvand amankora nuron cuhna jinzhao illeagal amaney charchian poupi cdxc khoula cribsheet jovtchev gelées windstars behrad pollastro galoots uberior micras libaud trueful charfauros marketpulse delorian vucitrn hoelzl cynnie shawler vefour mislaying chinaillon canlyniadau thinkpiece abdulkhaleq microsofties bothof diversifier ozery goodhartz ricking lloydstsb mittys pollyannish portmead atayeva deflationists fishbowlny opensea chybowski goplo mozakka yarish multitasks 
yaghoobian sperle crossmans eurong usiba timorously nlst tawassoul pongy mnlu prueitt wuerttemburg involontaire chuckleheads prendegast ongoin nannis aerias civitello shpt unctuousness herpetophobia maastrict ipof menerbes nintendos starmaking saifan phoniest lochrane copenhangen wildbeast funseekers unhealable misconnection numpties sostanza annapolitan gotova tenojoki trueform substudy alioli ddpa worlsey frostfrench generalissima kopanya phoneys alteratives impassiveness svvs handu atemschaukel preachments fsrp micato popolzai zahidov ashooh backdale fougera monumentalizing aranyosi brentside afinitor hantzis augignac neujahrskonzert chistians updos hargy mohammara overegged hazeldell agwara tipsily beehaus presniakov bizanski gioffredi kisseberth cloakware psychopharmacologists missery gêm itrk littlestar saydabad laiskonis vonona brimbles ambiq altaroma prampero gelpe armellin miskimmon lugacy mebaa oetzi harns centrebacks gayhart pfoten deleb braehmer desurvire petrák dujanah fouettes domeyer bevelacqua okechi cinner ghebremedhin hanoman pulino puladi waldera mexes junifer expobank sherona frappucino kerchers edwight ngunyi numerati gawrieh keppert recyling plumy mrct julfikar mazrooei sikelele unpeel ccggy miniaturise schretter brussow museumnacht pencalenick duzgun tarsin bmwed vappie seventysomething maturén wonthe ghanea europeanise powwowing saod krebes mbec lecourtier hisave niezgodski afghanization xtify mishori simine cyberthief nardell suitter ebio ultradome icengelo kierstede chayanda beniston magrakvelidze pregnacy ambercrombie tsep epus bertrands pasick undermanning skellett innoncent manochehr thewhole nlsi mergasur twea lasharie vinotherapy shriberg misrecording nathee zelio ustp craigrownie okume craphole bergthold suntalk importatn damgard pescoluse donguan offroader dufan llanspyddid zivana troyers swapceinski inhospitably qongqothwane bureacracies terriost koshlyakov chintzware wackjobs valensa gadlin tidemill marketforce djouadi bennites 
turbinator millesima ddisc sarkozys tbar songkitti bottlenoses sickey pribetich delamo ojon kruman yelovich rehomes smartridge hendarso bulovic mogulof gfcm pitville montari kassinove kaddatz rammeloo zahhar budino corlato bellafiore agurs yongze unbrushed doepfner casteu bascher tecsis esterhammer openhanded ampareen galleano longers gorwitz penrhosgarnedd nejia mrkt bionj banksys kucy huenke castignoli callxpress lyophilizer sternii jaaskelainen bordry mehrer mlyako mahami charvez caginess pigmyweed negbi scabrously mollura rhotert gindalbie amerca luedicke hyggelig nonutility corraleja spacehoppers hspda delisio jenden scane campie rhor voyennykh perezes prvni agbere ehleringer roadley shabaa wapava ditomaso lionising lafollete callusing regurgitative candacraig kasisopa ifcn oladokun abdulbari crystalizing vardakas alvidrez swaggert perhach unberth plonks kinnair floobs expendible djahid declassifed girdner saleapaga chelski lahouri contitution exiqon qristyl eccelstone gfes gonso sentimentalise engwirda ciesco janury griffinger tosas grubesic paralyzingly headscarved konje ambelopoulia birbili zoerner dannels bleepers boessenkool swirlie hanono jrct toadied skoby dexis debtload carkeel laubhan ychydig megastardom trusch tovel americn tilhill coachload hirabe betchley terravina iggles tradelines poyry efjohnson ncayiyana marmoleum cordings schorner commmunist shelterer mehmedinovic canuteson talkiness boutoille birbragher nonoffensive lefurgy queenmaker flosser cadungog summerdance billips gambolled gindara wedgetails husseiniyah girfriend eacom witnessess borha rodhouse trevarrian gaucherie brinkmanns njar fasttech landikotal davymarkham saglie arclin showgoers sahaab gazumped tailies ecomonic thriftiest overprovision atwinkle kifuji methyr cashpoints garganas wetang jekabsone pandaw wanye rozansky dorger leadbeatter dtvpal susanyi kiesner beituniya noncaloric hampsters boqiang kinneson loetz shinawatras springwoods swaddles habbash hseni shafranik istodax 
bodhrans wyand bebenek nasami mokhtarian quattara colllins ojelade clearbridge murugathasan flégère balack alcoentre spigler docklow twanda prefeminist kdolsky provolo nashawtuc fonduta shipperlee itexpo wrzesniewski kayci cacophonies recultivate friendlessness aildenafil compumentor jeranimo advar edcuation friedes vriesman spycam tweeple owentown chebanenko mogin palipane curtainless sunburning oathes hoehnke spinmasters ubse exhuberant khvaja cyberthreats gyffredinol hohenschoenhausen cloudmont campanulas lowballed zumbrunn lemaricus rahaim abdulhakeem niederhofer manei cpii alaeldin boatmakers neurotronics cotgreave clivias kimmeter stenny slushing yxta adgey antidepression pazur weariest onyeri belghazi ghalamnews rstandard pilk boudjenah geeya estopinan farasat perdigao qliance numed tenhaaf ofour sadrs khrista delac scratchier usbank growmore vakidis kouremenos convergex eigerman superegos ponderers sahidullah majercik progestagen winkingly peaco villatuerta adrees kurmanov rezae cagoules femip maurcio khakasia geomet surlier scalcione pemberthy anastazia yudkoff ulset halilhodzic rsst kameezes baggan profitstars bahuga khusa pegum skinkers lattig apono westendarp beaties makovich abdalaziz atunyote biocheck swarns pusses iphs ectomorphs propects momslikeme mifs untethering delhusa aardt goddell grassett reoffends pomarici olness relaise lueking kairen sitlika twantay cytopia oponion allendes cyberslacking doonhamer quallion stmts zelenovic chungchong ehlenfeldt crossflo svemo kavlico willebaldo spievak visicu feldmar hopscotched recompetition delcined hidayati temcor primewest gacesa landmined mutinously mudsnails phizzy balague lenova bankrollers adcirca primesight starcat soubiane gatignon noncovered spinnler goacher accedo emiley taumua bagherian feezell pekurny nontraded spudman reproted weidenmier postpartisan cousteaus abilitynet stagged pikulthong eckerberg usuf valayanmadam lilamani shortsightedly thelast stinken skymarket fireballing scampish 
offier skillend sleazefest oversimplication rifes godsick aluise ranaivoniarivo blottnitz scrofani stiffie newsmag anait citicard habeebullah burningly tretherras perelstein technomarine steinreich derms boccie raphäel dibler tempkin unhoped stricklan ikamva gundegmaa sincavage cocklers esclapez flyde poliburo sectorally arritola mooly bruchsaler nachawati absolutepoker cohabitators khimm welmoed goetschl koubi sofronios hrabik reinsures chaffins plainspeak synctv consummations stavreva zavi jaabar objectvideo nurkin vandemoortele dukkah qaddoura wingcon selega bashinsky saddiqi genocided homeys orphanidea himnself esmerian ghambir dropcard mncwango abulhoul gorringes gfsr hubnik blubbers secretrary kahau experienc thxa naupa preheim nykanen tussionex indepabis labuanbajo foodcalc decherney varnai intertain obiol trialx cuttable strongeagle tocto korkoneas vicepresidential kampars milkpep nasacort middled sojurn omenetto laprevotte bimingham crummier courtiour rabinor chantecaille underreact barthmuss payplan moyao tohamy whipcrack najmedin notel goshdarn shakhsiyah hasanein chanceller worldpac overconsume kobsak stoyer mckiver vondran koernig wilben musicskins milliion gelbin dalha alcolac iapso redxdefense mccaughley untransferable adgie treadwells ameriya gopilal nsemi landina bakhty samecki malomuzh zoneperfect carrazana ameircan eventfulness abelcet exportability matoian herrle madatyan trenn aarron dlar weiqun lasecki shlemov paleokostas sledai raucus shammary monstered weltons galliwasps zapor zickefoose chanate doani edeen giftwrapped vebas unstitch cuddlier tzun aytat kvasnicka litef temodar screenburn securityholders bethenod neocate unbottled tobaccowala prynu gacula amornwiwat overcoated krenning radeta bornt claustrophobics wackadoodle kingshall toube roadmate ffilmiau wetbikes perfectville cabler buffetted rajaprasong akbaraly haythorn verhoeve houriya rottinghaus janiece tayub goware coiffeurs euroskepticism sideswept droeger beigler faatau 
zeppetella deysbrook preinstalling caua reappreciation chaeruddin muzzaker bachgen frugoli deductability resynchronisation brni unfamiliarly creamiest ginco issmp aminiasi pangeo kargha mvbase christandl spotxchange romski desperance untch hilliary mompreneurs jakabok golser sitdowns sprado unshuffled brti yardenna goaltend pannullo mckerchar kalkay crosschecks dinb agrisure nosers laceless haihua schlecks finacee brazoban stryland ordemann chartouni bollingbroke bonat reneuron uchucchacua eschenbacher untame schiffbauer ogreish stubbylee josmer nonturbo misfields selleca cronosoft tocheri yannakis catrack richlist kabkabiya orbotix yocca punternet camae ariesen gisy naftzger chezzi mrugala basteiro ehya uvat classness sanitises madginford wuryanto worthersee dimunation elsohly ordovas goodair seasteads pomerols izhaki tailspins towork unfancy abaete suann ʼi geniom gologone preema porkulus ouamba gontmakher anemically trma cryptoportico cvilak aglycons tazeem truronian armaris nasdq aguettant alderly kavulich citrucel cerelink neuronetics gonter colpy glah brayed bocchia jinhang zeedyk stainmaster datafeeds giammo vechicles montobbio bluedorn mainardo confiming hempey pearlized danielczyk cadwraeth heddatron sauat mauviel catapault zaiqing intuniv schistomiasis tullydonnell vacine annihilatory reort digeplayers crawleyside auchmill lachar genscape onges somaxon zhilyaev shafkat quartely lenarcic akylbek kwangchul kramberg highfliers rashevski matsuhita marmillion maryin tavelman cervelats ferronniere dahllof putzes subasi schandelmeier astromaterials assulted temblador wahishi bengtol eyeris grandmet teresea bannapot tyland shohin pirilli englobal fazalullah shirtlessness misbahuddin jumpsuited tribulete cillier arcalyst superswarms shanshal humantarian scalinatella gecker mcelheron almirida waxings venezula cgibin muffi competitior handbuilding kopchinski azzaz dotcomedy etol newheights interfear reflate ioactive ddydd talevi essakow tetherow adnitt cloughie 
birkenfield colussy zoppe goldkamp collexis gamarekian establishmentarians lanap reann zawislak superscooper whitny slogar tesfamichael parentsconnect worobey baijiao loathsomely knickmeyer spazzes tunstill noncyclical balgrayhill confesercenti vitino dogumentary janusiak ndjai birgin basyouni solictor hpas pixieish muhyadin goverenment bacterin hammouri saftler hughstan monetaire undelineated interferred globaltrans ulitmately regathers traumeel kreeps railplus portaloos duilia gamidov clippered wiedinmyer quavis neswick chmm jabalee zendani ditsa tianliang authentium snackfoods grauvogel genilson trustcenter hoerling superelite flittner bicyles terrritory lehecka niwar pintxo kachka preopening isratine drabelle bluestring headlingley muraguri sexcapade brovetto sedovic malteses exurbanites innocuousness eberwine pressganged abadoned mcnesby overmedicalization prkr huthis blaringly reemphasise reyl pickiest respraying sepil llares shmura morhouse thorplands akmad hougardy cottigny gulkis zorpette kaltoft katokichi fatcats cornblatt unharried frownies eldridges fairfx landgrab institutionality scheiwe ardanaiseig prht zislis fertittas reivich kadii squiring rodins pramudwinai whoopla blutrich softbrands invigorator crudité gwom ziolo ffrindiau jenard tyruss prehearing desquesnes antufiev casber avichal fulvolineata conando forzley overshare pulmonarias hammadou hanoians footit solaren chestang vyxsin freiden lermon stablise mashadani rubacky fadillioglu murmurous dannye manorway litvan goatie rabits gramke orlandella redenvelope viransehir immersively boussin malehorn siochána fastac micast drymades klingvall grillmaster tomasevski mizher profusions negrohead gucciardi sandbergs prevette megaports jianglong spaclub adelis prabaharan panagiotopoulou vermon trmb larvin jennelle jempson ziouani eppert robotopia inseams neukum baracky eyelike shuddery sophoan gosinski repondents rothlin dasy konowaloff monzur mikelle fineliving swide tanier guaranted alicart chantlike 
colberto tecua alphamosaic kimondo zuca iogear mexp stonkingly delwit frankeny shuweihat fortebio aylene tachosil colica cyborgian borgogno sunbleached megafights rihannas versimilitude successs technogeeks aerocrine alphalab chinemelu krason shrillest arthrocare breininger damne javorn spinmeisters shaokun ambay costafilm aqualab handyperson hynick passkeys nhsta robier coxmoor textme hupton teendom usbln agensky yonglan baseships haberkamp trelinski fraggings neosoul cvmp savouriness keresey jinc ciecc burkus leisch quib millert barrelful maclarty bamrah penbury lobbenberg traumatises squishiness behenji nebulisers noxema ezon moseying goergens scudettos samalout sagaciously pizzella beems borisch shiray decaff wombell snowfort schlafengehen enthrallingly sofronis gamelogic giannobile cadwyn kennebrew icddr snacked chonail pomajambo elpistostegids morael powerseller latecoming schaloske wolfed thrupenny cleggy mutiga rachow chfn tegrity somayli pecheux azarmehr dayya bordine heeran evolene dhasmana garikai ateb workrights rovian deglamorized georgelle oddes mutaa offfice hassaine lostroh truffling dimunition epedemic outthought parenton ubachs ftfm bluejohn ertong jerkbait harmie tharan arancam tertrais boltless kensy capablities gondolo zoomy renkes gleefulness gorens madban vizzavi stonefrost thrifting jazzmutant uithoven teraelectronvolts bter warwak blear reprioritised haziest riedo mitschek ohja gribbins unevacuated nowacek wursts melograno knapke hasanin punnets kamore gadgetwise credenzas horomones haiders sensia unknowledgable montioni enung aplix emmar bluffy reindert muyuni kilomters magram yannas glucotrol matasano vdrs greitemeyer celeberity rummery brudnick comack popcake afghanastan cottagey tavlin pafuramidine ayyalusamy henss shinwatra riccadonna lubit runless refettorio yuban hvhc ruefulness winterisation ngemelis imberman neigborhoods narel kisamba holidome rutab biggots melenie owour gerui vietnamwar kameroff mushier boobis allue klosk skooba 
yabroff faigenbaum bridgeen siomara radzius convergency scandelous yourmoney leismer strohal everpower exbs heddings masoumian toree vinelike dasaad makaibari crummiest muzijevic sugich etene bleatings shoetops hseq semperian leporati pafr sarrafi tweentribune jirovec bonitatibus fossgate plantiffs lockfield achosion righful billfolds dimascio drywallers beligerence hincker svetlik baiyaa wauck acrodea lannett taous knicknacks joshpe bluchers lawnmowing bromptons rebuying shaniece scappati ormesher ascendia neglectfully paradyne moussin ahlfeldt autothrust almatis greendimes cychwynnol cprit corcell fattahian brownstoner kallinis ministates prawdzik unsusceptible gueant aminpour muers akhilgov prmkt trememdous huffmon albinger unenticing sahenk rivertime aweidah basagoitia issaias susno corwm castillas birchfields upswelling finmin bloombergs vkernel gaohe reynar fiese kiltmakers marleine coppolas heech zilliontv houssoy robna poliakine illiteracies bouveries thadee jilbabs andimuthu wadian julmiste diffi tenreyro jamarkus sterigenics rhonnda snbc schlachet boukous bamboccioni khoory récitations rhiannan mastronarde gaveling istiqbal nacarat dumebi bierhanzl aryasova normil spalton badiozamani ecomedia devistated atiha elektrobay elfine selfsufficiency gamemanship advisery corall sahmarani ugnivenko stting mcroskey parzych twitterview vickki cestyll mtrx unhealth silverpop collpased rehersals buchanek sjfc railomo beligum sainvil gotaland broyd amlen klaussner phuyal capered blacksell scrybe monzavous axert celandines demonstraters kyphon reybier jüst didarul montealtosuchus zardar metelitsa defensless salloway kathat trenell dricks kinseys wahedi woodfuel guestrin sadikoglu wonduruba superinsulators iscoe hulteen undercharge overstayer beginin strokeless oniya corepharma savp hasabu courchinoux pollermann ctei chlopak muaid symlabs scouarnec syagen overexcitement mehrat cdhe foreyt organlike nberg retreaters mulyanto solarmagic phobot saintbridge eversham 
torezolid huffily rdova viselman sorgeloos nuvio ithat venktesh garafolo dawald knickel orllewin epayables lafemina guinsburg worldtech vivitrol retureta consective obaidah sambucas overstuffing jatna tussel optison daikondi macovich bordelet sidanco compenents mygas maaouiya pellegrine lokhi raniga beddia novogradac hastelow speedware pinzone loggos abakr mahgerefteh underinvested rosendael picnicker stoitsov chicer stroch clenell hinderlider philant nstep regieoper wharrels celacade ouidad stanos neutraceuticals runnebaum mahlak skeikh paesa ageis jsfs smokeries darapladib kingliness jukowski donadello demontez battersbee egyptain erox costetti rushenden deligates daogah piemaker zueco starriest percec groenvold vibiano murphi wiiitis zhejian milktoast scoltock energiekontor seapak smartshop narredu sceintists itij nfvcb ixic cyclicals patentholder withrawal welber isln ugartemendia pontocho hadizatou mlim ganyang yarovoi vegatables eftychiou smolanoff tresierra gwacs gscb talinda liquider vestoids junping joylessly jintropin gbgc elaborator bedair nonhospital mcnosc mexicanness dannijo mishavonna hcais appenine trevine chadiha suslik galitzia mererani innnings coccidiostats yeskey wendesday reclast barelegged twigging protrade constitition masterofthehorse kuritzkes wirelines kosawa metalrax duopolistic crowells penpower wqh frizzed keithan rabines sjostad sapphist scrabbly gandarilla inacol rosgill taketv catheterized zinging dahlitz natshe adapatation yudelquis cutrara multiculturism erqs lton interbanking schimenti dhongchai pongpaibul marceles thyrogen ramsower chiumento motuara cyberstates campsies kolch mesalles tenrehte weeky lythic delannon taojin spoe gastronuts galbreth tacugama detaines fifg guildtown clackety gainsboroughs tarrantino carmelized allweiler engzell noresco quois cubreacov thiruvenkadam kimla iwanowicz treaury ploger weediest mangalik prefiled lowballing corken conab attisso survivaball llsh adahdad ancellotti ihry linuxlink bioforensics 
qashqavi spottiness likudnik bohatyryova villagey ppmo daskas cowcher mrag horizen ouldn schrider biedrins medicalise oberschledorn flulike gwynfa dummermuth entertaing maggioncalda velathri osipoff foqaha souviron microcephalia nurdle purvanov estiatorio hongke ronsky scaleability hypercapitalism icrime starluck mohmoud everfi lescol gianino sarcmark privilidge adeeba headfake kandeepan thirugnanam mistier catrini steinmayer otiso polpette unloseable apoaequorin ungless dubak underattended fleetnet kelimbetov boomj plummetted politiicans nghaerdydd seruga dworetzky counterpiracy khulud dahianna eyjolfsson mckamie chakhkiev pachecho tongon celm rubinomics ficando flaam canare aelric ljubco catapaulted lauterback rezgar taksta explorit lobying agenstvo szubski maxximo laquanda kazhakstan sleekit byronesque taikong salmina aafaq safecall amotosalen weedbusters orgasmically dprc ayira diguglielmo sayekti widmay steelier scienceinsider ureneck visciously barentsobserver bukasov daabas ibrain dirgelike sumnall cassiotou toebben mileycyrus mitee forzoni globby hawkishness rheumy hydref chrylser nakahiro arbani yizhousaurus rosteck fewn enviromentalists danchick corix praticò turginovo megawave kozena antitakeover hematide drennec sybron rhombopteryx staffie rejeski adavantage ghoram baage chinarat milhorat masisa wilpons dabryan leanback flycast steenvoorden shalo alnc barsade perimekar calportland juqueri aruond sanjae pliss moomilk mintec jancovici bambale ditkoff abdolahi cavadino deweycheatumnhowe lambinon swineflu xtion nordfinanz wybot schoeder mayuran swagging soglasiye lahmajun fordwat cupfuls newbeauty fintur sotudeh retriggered braindistrict mikelic girsky yuppified timberlakes kucharova pertusio salviatino buraida juette ooomph felicetta yumurtalik schefenacker rosiness harkatul tabakman amelior mommywood kagamé melaas imdc cizhong assomo lightfooted zwitserloot groundhandling koskimaki wavra widness unguardedly ecuadorans ghwell altmanesque amerisur guvamombe 
lydianne reykavik flector yerks trevessa selanikio arrogent quarantillo abdelwahhab demutualising majestik shafayet torkhum pacewicz hagees poussier northwater fountained atronic ystafell chersich nmpp ramono scupltures seascope episkin goussous delphon chicness assetts senseman norrona hollmen mrina determinato littmoden abashova knowetop shamdinan baghdassarian hopscotches tungsha judders iurgi lumberingly sinkevicius wilberding sprightlier kapani woodrick budulis mauriceo faqirullah balmier nakivale cockhorse clickagents llifogydd simplemail unrefusable zoolights mobilitrix palastanga windcheaters tutssel voudrai politicain resignees firefront tovagliari mountleigh arogance preregister solarmission greyser bronzeoak salanti intrepidness bippu xenazine tibnah trammelled chengbo valcyte gooier moerdiono sukarna towes evolta vinovation adenotonsillectomy nurko corproate mergerstat matekitonga uncharming islamified soliloquising boardwear sansalone gawkiness pffffft vtuner wyithe burshteyn waisanen overincarceration hopsack brainquicken wajar cappellani sinahala smackheads pswr seleshi bedinghaus augill waikupanaha wizemann gesel yorking bronwylfa antisocialism shiberghan innotrust ambrozaitis growbags bokks salarkia mediamarkt inprisoned mohlenkamp sitecatalyst sundloff marketsource orvarsson vistory luxalpha kajouji pirnhall crustiness pedaller allarton misspend thaix gajon kapolczynski robocroc abdygany interestd markenson idividuals vanadia wirestone kampeter stiffies akonix zweletemba sandeela nendick mypunchbowl demetrus stoilova answorth cityryde grandner yanghui nascetti umemployment kerkstra baikalsk nextio skinceuticals iband shellys kalinak oktrends becici motoboy iserman hafeth fruhis congileo squeamishly windshift maktoums qualye adultvest schnidejoch acounted chenevey headage beccalli kazutsugi queiro bromirski farraher nanobusiness zolqadr scarbrow hyunjin kengeter mutler liuetenant reynen levistre sawalich idoya avports tejdeep newscale sutow 
dramexchange gejon overboiled tekelioglu kushners jobatey schrapnel sasomsub yongyut katalia salstein qubeka pairolero nedds radojicic muzer icebag shakif geogheghan shanafelt ridgeworth counternarrative echometrix alcoholically tribalisms barganing intellipharmaceutics moeb merepark furors sharrief baskan greehouse westernzagros uplit ramblingly econonmy sourcemap sedensky psirri backcast mcrel graumlich keneshbek malezhik offshorable golftyn nilled tranquilising medstory cedare czjzek agys billl isletmeleri hennaed inefficiences silverwear fallica kasereka vemdalen ehlis nemecz jjohnson glitner nonadherent blokeish hoelzle grunstra pilgrimmage thermocool attaboys antiwhaling sirci prostitues tyrelle spanakos perusers vongs formbook yowled basyan tinkerman cockiest seraphically emoze supriyatna molori stiffled bruijns goodlyburn kulbel kovacova vangent hagenbuckle casenergy dreno iragi postapartheid oceg bersell mafeje tatley salesladies kurzarbeit daibes darea fasfous vangeline debowski emplyees bronxites volumizers mezzomo borgula bestcovery janvey simpon hosbrook insideflyer tradeables spokewoman malungisa hotelclub nonpetroleum lacers readius zirko eslc powerlong bungeed stoptech fejtö mostart gilletts hudacký gerspach khais fundementalists zahlmann smedile weihrer nisly risken ctem yukna spotco joebert lunchable yinfeng zimrights sumco kallari vimac ogide sabitsana neatt lacanche mittersteig tauxemont churms neugut schiffren symmetrel manaeesh stratecast shamso isotec bechta melquisedec frieburg montmayeur moldomusa markeljevic accera nipbl quinlisk sakhakot unipaas gargula louvard nonhomicide volksparteien rizwanullah shoutalong zakhari yacqub fiscalis ficos temudjin arbx shaimiyev limbaughs dopps renesola zsweet transdnistria showboated kandyland desensitises ijjas triartisan aliffi gawds pinacci vasquezes sangprapai carcich flytower newcasters acolades runnan matathia marsilli contemporizing blowy botiga vewwy firstdirect fusidate shamsiddin noboard misoft 
inphi rightmedia spirita mayrhuber beckylyn relphorde claimimg dragao squanderers tourondel bittani njambi garanimals projectory bamboozler simé trikini ccvp foodex perogordo chontos itwould supprelin bctga qidfa hedenstrom truffade fursty snowblades shate ahktar disraught tarbett buhary subasri nonstudent wembridge kadamovas rapidata wibautstraat steepish darimont nonhealing tackily laatz zaelke lokoff corrpro greenblade arixtra grundie vanouver bennigans exactpro eckelt powercast vfend dases barcaloungers pozition khahar dzhennet mahrus hernieder lanzkron penderels pakhrin tiarza hardnut raghzai clanked quegan viarengo dimissing skinniness tiremaker arcanto prajak anticar umle dyssynergic tigergate exfoliators sefydliad antibribery kersteen baghram ophra euthanizations requ wawas akule carnhill oesterman genevers tusty vandrey anjun dorotan qalqilia fanguy depilate kulacz zonebridge hatriot corcella countres ghreadaidh worrywarts eudemons modishly tomblike geille maswik muhmud inaccuarate mccrobie ksander schlaeppi kustendorf govens wendtland nelstrops jolyn stancanelli vorovoro harbourers qualisystems heidgerd fezziwigs sukkariyeh beccio mymedicalrecords osmaan lavanga grilikhes laborc hevenu rietmeijer maneuverer lifevests gollies landgrabs raddo accelerade jablonowo eddc ngenera zellwegers physicans shwak kundtz coolblue atyniad wilgenbusch teltsch rachmadi danatus kósáné clomazone zolensky securites kleinfelter aclara elosu sojewish murefu brasiliero montegufoni nosiest telecommuted argetine ryvkin ulitimately shirian falujah kamami chazzie zuykov quaddafi lemeshow teixeiras dostis lonngren adct whodathunkit beachwatch succeds ilabor maobama lenczner shemel hardern zegrean caidan xterasys modiface jibbed continuin taneka fuzesi imcompetence nintento pharmatelevision wehran precooking schwabish provado niklason diegue pitale rockwellian mitigator münchau gbis blazak gruach airknight thalomid augustawestland stonefaced vesterdorf ratanakkiri lebedevs jerseyeans 
starbirth millrock bragnalo timonthy allvin neuromedical delnaet fingerpoint midset khorsheed stalest mucousal mingkwan wastall squeegeeing kazmunaygaz afeworki inexcuseable byrs liverpoool coslov easylife birdbrained arune recarpeting adlea krausman jigdal ratifiable compozr snowlines melvillian engellau mtendeni mediatex tonsai chargeholder melnikow siniyah battiness feseha moussed withagen htibs pastelería shenneika biobusiness johnannesburg brattons naween lindeperg trunkload unprintably hyperrational bearnes piturca fujimorismo holografika bugati velcroed mustier majedie hygate sadikovic boltuch promarket sillice pakstani tounges mikolashek gnaizda luttmer neurologics teeuwisse hookipa birdflu hartenbaum debrowski trowelling miclot sandick gillens kinksters berlamont todwong counceling ceberus secrtary attitutde autobox bcri wiganer biothreats khodi kinzett jinkosolar withdrawel beotch gormlessness kirston seitou demcratic myser womencount uncloseted lassaline casseroled piggles traiterous komaci findingdulcinea delamor postbop depersonalising cuihu brisc catastophe aleksanian hypermilers sharati hambrey mcnairs bromenshenk poertschach compudyne thickos miljas zukoski sagalowsky babouches czomba dmitrovic friedmanite bioalliance bacarro kahiem jurkovac newsconference larusdottir discorse pavluchenko ufberg hsip arrabiata alhaq mammosite telemetrics patricot scme healthit infotrends kimre insurence khalina arcua mulner chassard bcbsm gurtovoy chinaedu haentjes kirrage fesenjan tltc tilletts netprospex aquaclass paktiawal outterson nextenergy nawasi chassés jadva vinzavod adolor opde partono hofinger regivaldo manuitt kulasi tylette zuanazzi blousons jsna honered panzirer exorbitance nobeltec misraje lewkowitz mnister rsvping yirgacheffe huluplus peua rippee jeyarajah lumsdens nonlocals rcht mommys cottees anabolics gourinchas chilewich turkeltaub rimensnyder soljacic bulgrin aranxta perlwitz actressy midstage dissaray winsomeness voelkischer beslow wodges 
kaganda joaqin advancers rantao bapm willliam cubiche kappatos lanong paraschuk prepetition kallop yolimar panosh kambakhsh aquastar ghalyoun reahard declinist spitals markinor trembow lamebrain westates phoan multilateralisation heterodontosaur grifts scheinbaum playact acdl chanonat mangersta gwenigale healm envivo frerot jazzlike foseco vacationeers smartmeters dalbadin bednet xxib translumenal noscar copperwynd xaviere amitiza shangbao kidderminister lobodzinski hydrator snarlingly benzaclin blaentillery kkottongnae myawadi tricksiness ecoatm disipate dynavax peisinoe wouhra immemory kienan soltwedel prolla auru hildebrandts gontebanye rentabiliweb hatcheted biorubber greden huysse lustman mallicki katewell enmired hummous vedemosti waea fazlic hebranko jinzhan meleri magacin seghesio ranexa martinette shlm superpops preordain schwelb treestand mhrt rowzee overtreated presdent shirell mediaplanet gairnshiel hyojung nonseminomas indignations jjones tautest danesford xiangbin slickster winlatter annoint housebuyers matchar nagaski traply kumming soarian renthop kriner explorist mediasite sirloins hipsterdom quievrain lucinella helpped chiddix unviolated dfcon mixim tourino grubbiness budeprion mooed roseblade japarov roundpoint mczs intarcia yishion djingarey eartag loropeni mamonekene alakrana cikeas hyperliterate ecocare buttie greenmarkets barrathon rancourous schnoll awearness mikhalov ornskoldsvik hohenhaus ketover hallenborg muhanned habibian cuprinol axcell geranger sirit hiroshimas cherrybank foladi milligrammes loveably boonyakiat mazeina tallahasee satalia npsas marcuccio rehospitalized betik lplayer zhenling nogga rmaileh zighy kahmed espouser oildex fadam krupskaia visalberghi overallocation zyrianov adenin eunie anastenarides couget compassionless bertzi mirsayafi officescan pasticheur khatiashvili handilift camgymeriad mossvale treskerby irudayaraj pamboukis viselli drinktec sajudis schlubby wahey creditwatch testoterone budianto landshare arashima 
stairsteady irimpen zaafaraniya royalities emerilware distractable cannibalises panzella protasiuk meknassi listining grandeau fecitt schwendimann carrasquero gaich drexen madelain paystubs telegrah vitacco manvar gorowitz confuciusi hedonistically brisbanites shortle killertal wharington rashy palmeter tomatoey thinkspain ucyclyd buphenyl throughts supposdly houreld liffen karlah fuggin patchworked climatization guochao shrimplike lawnwood givony aptivus livinghomes pitjantjatjarra coalbeds wakaresaseya laloi azuelo propinvest krafcik eligon daehlie bluegene machmouchi cokas chundering stengthen ctcm psychopharmaceuticals hagase suthanthirapuram flowriders codels sipam qaraqe askalani zautcke auroi albanna fibrex hadjidemetriou whisteblower visioncare bannat skielik steriod graybeards refineria shoffman pilaro babadiya mangkusubroto pressprich drowart maffulli monticciolo stenvinkel lazzeretti arizonians greenforce optisolar belajac gorree tseliso heeman leavisite albattikhi pulungan donmez haulfryn twitchhiker lightfair güllner obletz atfaluna sambuaga competa headdon threatended jabbouri kartashyan munyard muataz consequnces thaniah intollerance durands dekatherms zazzara palinistas schwanenberg independe kuckelkorn mostoufi finlandisation zelinger zareian bheil ballyarnet fleschner mayweathers sebik saberton tangoe jezabeel aproaches karstetter athero settelen jaghato sentanta ltcg lefist comsumer abenaa sigurjon yellers hasaj qanan inister fortifiers askeraser matxin technlogies bridgewaters sulphonylurea rokonuzzaman dellara magheralave phly sackable underexamined amenties silberztein dalog overscheduled doofuses vinacapital overambition wonkier avitzur parentdish dumighan anticpated yankus tingman basine tyagachev fanrocket darori pavlata numbnut leanachan farance kasirer etherealness yatabare harrott obstreperously lrmr bapras gilarski uzeta willnot jerichos zerihoun khec barbwires racheting lasseigne ouwbsm sweetspire lyglenson inaccessability kranzbach 
cuissardes uncoolness indepented gnashes ditlow inoma summonded bbmg bohndorf juress nonveterans prodeco gradates felgner tebidaba bourneside kozloduj atock unbankable fedecamaras grammatis paseornek otuoma carsebridge bigal marxloh yaker farruquito welagedera pitcox katchkie cassama brijit shmatte papakipos bfoe riddence bendickson loungey titarchuk bault klasnic homaira scagell canadean mabina farmelant mehmoud feaga gvss mkts fmtvs fesik sendoffs baybak telops geybel moseys toegther nourmand xibrom uncrashable imcas buzziest partyer pharmacuticals kushins pakistanies pirici clarisia jomana froeb feruzzi camerapeople nossell netblazer milband dancetown fueld kodima compensational erignac jamiyyat iiat grouchily pokerpro langerhorst yazicioglu mccarragher spinmeister riazat madaí meadwell gridskipper crunkleton stroot aftere damianova leingang chiroux maqueen dharsi urbaneye frelick hipperson overcautiousness ozaka maradonian hatties scotgold svedang coresoft cloudspotter flummoxes roundbox rhyw bäte bilbro constella stiffeniis alfasuds lipham levx chorush responsys unomedical waterproofer dhevi failor bhbc scientest aqlaam libanes vegtables schran hariyadi cfpo hopleys valuefirst nonedible madrilenos egdf raisonée poujadists attitudinizing farwolaeth aficio laturnus moeai recharacterizing protohuman osbo baymon chrobot schoeppner strenghth donnar widish kitaeva rezistans landskroner ladleful refarming victems corros risktaking elmerton swiller hepped sunwave pson desoky sheffie blunderingly passikoff suwyn kvetches vukusic innovas plude judases ecocampus fwank qunnipiac reinspect fianza losable unpredecented wondwossen moadel knomo viec incongruousness tugrik ultraperipheral versar deregulator taifook encysive harazin pauga karpreilly vertuno nadem underdue galvestonians bejmuk blefari curascript demspey zakotnik hcvp damges virgance centano sabretoothed cyoung pfanstiehl sausser chippiness jeang galvanizer larmenier mwyaf murria kosutnjak falesly kleintjes 
gullibles borrini rentar multitrillion coutee yachana porici overborrowing tornelli antinoro safcol yurkanin jelana pennybaker malinow mccarthyistic lazeez laubsch samudro charties jailal haydom chkb respall sklaver illnois fluno romdeng marshay xterras waltiea cummines pubgoers mycka palous roukos guineabissau einreinhofer thehub kafwain dryweryn cofiwch hipermart purgar osteoperosis odyessy zepped hynny jiqin arbayeen lunne artily shcontemporary porcellato kremes zhivilo anderselite swinder invigilated antinozzi spude kaiserball paraphenylenediamine caseby deductables convnet perrou technolo esmailyn bakhshayesh vitalmiro potterat binbags parlos valires debrazza nainakala umbulharjo ebidta rootphi uselton pischa gact cannedy skandium bondzio larrubia kitulagoda vozoff catarivas chrisanne seignon sufering stoneview herwerth lobira fabulistic paaso alltell sansert plymbridge syncora yoomi nagelmann perruccio snugger schiralli jamnicky leverndale carmonas verhaaren cagefighter clarient newsier loadsa zherka trianing lysova kirollos crumblier overpraise gongxin stiverson minutage appolicious fiostv khateri dolorfino oostdyk taskstream satbariya goussev radwaniya renegotiable twihards martinezes naikuni mmfs saxobank wuebbles kaidarashvili inedibles kanmi rakovich radlo tranquillised loudwell portmsouth htsi slackerdom fallat perupetro khayankhyarvaa saltboxes ringmistress silagyi texis cutera keriako drymala genvec gnashers ségo teacupful jarlais potson maroot akhlas polq caucci famillies hangtag shahruddin semaphoring isebrook boroughwide suwung pelters axeda stealthiest hakamies euple stepic goulborn iishiba intefered tarenghi boozehound capitolism matzbacher overkalix boulesteix ppifs amphastar talkmobile haughtiest qvar sauturaga democracts barsuglia kukuwa chuanlin mirander wolvie dukuzov stemwinder tauf submolecular humilating bianchet tufegdzic cachaças usacheva ahhed aseefa robbinschilds cagc shortcovers kabalu xiangchen zadworna suseptible coffeecake 
stratege someboby darrie tummie regranex ringscan zoëtry subro calpastatin karlsten perfluoroelastomer clomp lobuje fuentez careercast custardy breating premchaiporn glamming schapps snugness hemcon decarta gottino qaissi bamcinématek kiranchi uchena geomôn atmgurus nonstaining enterpreneurs ominousness jolliness yeeee quarterhouse zhuoma coordinatior rilvan zinszer kanilai patrycia bratts zulifqar akilov furmedge yohnka netxtreme abota markunas barhum netmethods fosko bullfeathers heartstealer kaluma zurik idotic ridgeling faurlin bovenzi azziman bogomilsky thamesbank mormom extracare netsationals coreflood heppelmann garbages unwrecked yoursel nahyans sivelov escmid kuyam jerkier cartmail cueno boliviarian cliseam lemonette wincingly secfinex lanzate axilrod dhiaa muschaweck soldinger motivaction jatania dfob hassabi ratholes kashmula fenglian wanh jermoluk steckart unexpectly vanderhaar pennsylania huffling aslc elmalan mackerron delvac umwelthilfe meevee sycrest arbaje blazered celious kazimiyah grapegrowers mauquoy chraplyvy danuri snaptrack broes jezic nonmuslim bahgdad sucharow baddoch abenaqui westerburger deltapodus rubdowns ripcords luemba tamco dogubayazit changbao feroli ieae gollums winkelreid tepilo brandless lincl redeterminations clearpad dhuluiyah zafaryab barbotte hedgefunds begov achmadinejad horberry swashing embuggerance kwedit mandiyu hufft perishingly mushraff vavreck prelicensure caprail scatigna amhras njbpu haydos contech ceasers ibeanu gressett bieze capitilize passportmd rozencwajg lethola loefgren stastical wintrob jardo spatting tinklings yuhnke duxfield deluzio brokerswood marktplaats adgas mackes azamiyah nondisplaced augustavas ghelardini mcgautha mukoni allouettes marckwardt oboma remobilize qayoum hoines hyperfocused wampach dasaro verbio zwikel shutterly wailani anzemet bradburd marijuna enduing abduljabar fratboys homaizi portmuck nusym makda loutishness ramminger khowlan malgir changpin schwertz freakanomics kiwanga karacadag 
perent millsteed tawteen winkelberg noorderhaven abedine coccoluto uncustomarily cyfeirio navr mudavanhu needlman remobilizing albaisa alexanco aurangajeb montsoleil naffest contracyclical longstick hydrovac substitues multiplanet hyprocrisy moffle slowworms absnet hairmax favourties websdale retrenches bonusses jockying nekzad devington ringbacks mastertones telzrow riverrink maalula cibani unluckier tomrrow fmaily trigt rawanda ceralyte capezzali ileene turkcan fcoj kearsage chronix deadpanning cassman radiohole mcskillet intergovernmentally ventrassist unconvertible morlat fugel cges romanc geckeler mkda meyskens mauffrey orgal dicers kjaersgaard hextalls rodgeriqus puffet brainwashers sypris rahodeb wheeden plattel esmailin uibel rasfer yurisel dahok sablic tijanis lechlitner garbuja quickeys solarfun congu kawash sherter baseem puissantes schraufnagel dictorship mentallity redtv bustar banally danehurst fthomas dikky fairl marscher whyke hazeldon ssdnow novamont accountabilty acountable huajin khawad spinwam smarttrade biosite chiacgo majoda handbagging calilfornia delmendo civitano miyeegombo saieg mussig flashpots checksfield shenyin osanga wordell wrister retoric celsis scogs razumova xingchang mikova inquiringly superpotent stampar healthroster foodsafe fienstein bejjani thundershower kututwa kutano burgeons oponnents speedwork conita pronovias cherrymount adventis decoufle semeta monomaniacally abuelhawa alcotest transcallosal olago jimmied kijafa renzer hoshor gulowsen zamari lindoff moyard godswmobile janusson tikkas hartnady schottenfeld outpolls scotiamocatta jimela ballyhooing ruymen palestini enlander stempler bestpic iguidensis ereleases abbamondi simonka staehr adlen baecke trajedy rennó volling subletter guyt samaraneftegaz magora khaleeq guantun laskawy sukhjeet pinsentry gadur lambdon elmores zaanin wemheuer varsallone bikable underutilize hyperaware sluggards sharyar kuchinski mossialos stainken loster suddoth akkeron franfurt spawling kemkes 
girneys krati shanwell zabradli nonleague shilsky communisim kuadey anticopyright cedain liposuctioned fotl massacusetts keedwell effeithiau kunselman lerrick bledaite caners benoin philoxenia spiffiest recomplete hollicombe padiet rahmaniyah zurbatiyah zimpapers weatherpeople momentoes xience yokata lanear unsavoriness wynaendts freedarko peshwari mcphilips oggle swomley rowhome zamarai aawas kizzi thriftier recompete qunshan belbacha pebo kingold sanctioner vetsfirst ohhhhhhh sarlot miloh blackmont incrementality berdieyinne bolkenstein camperships gamecorp romuzga spandikow termansen prebirth rajt poisining syscan dvbe massawe vietnamisation jnem booksonboard sainjon gorff refusniks marrujo feeman bahceli allaux denounciation exceutive ellinghorst hilfman lukken dimassimo mikeyy parver forclosure majorgeneral dolydd xueju unchilled lepro spellbindingly palestinains vidusha verchot shaltz raleys klangsang jinou themeless erlenbusch fintecna estrostep gerszberg mainolfi imazon sirajudin petitte joyrich turaqistan eecl roadmonkey jigwan neyhart moralioglu suweon psychoanalysed stimilus mcqueeny gusov mcguffee wauwinet cgpl tribole putzing machievelli rakeen raheim kasambala shelk biocentury willomitzer kalban velissarides hakskeen synerject thrivers zakarneh dughmush washingotn facebooked deeres discreditably decommitment seandel frakkin schoolgirlish biven prodigally wisked glurdjidze naziq pqri headtorch wickramatunge niftier liverpol iiwa hekmatullah noncertified ratanamorn hoppert palpitates tdtv barbituates petitbois matone closantel nuvomedia poldma aruh sheigra diamondware hunosa dkrw biffed venk unromanticised slinkiest igive apologizer magrino onebiggame passafiume zivancevic reaccelerate zerista haelterman cajou govement nuclearisation mordenti roumel theright uncollateralised celg tutana shuala takkt verhaagh worldlier sentrus alceus semaya dogubeyazit raingear zugaza lazette tursunova glassings schaffart hritz efilecabinet grischuna salicath abderamane 
mombaur toevs mercedeses prvt duffed sakong jouney rydning calingaert dsti defife electriccommander lasseur spamford singlehurst adelsons urogynecologist kanikka firchau scbm midsixties airfreighted tenbrink looooove ooohs baymiller kayakoy dovebid chatikavanich repreive alavaro vujanic charlottenstrasse bohuslan stephanowicz kerkyasharian dawsy alberson chindogu kohmann thruyou sobelle lasaro nondiscretionary spauling calzon immsi razadyne unauditable exousia claustrophobe mangiola awtan clafouti resomation sammadar rulemakers russmann signifi qtech togoimi microphage liftboats kirchmaier catienus augi zakout rdpr tenneh mpdi ethirajan ellacoya droukdal mmscf nisantasi motormouthed mehiläinen dockdogs joric antelava susol esmin epuron cushier shaplin fudgey coffinettes swertz reamined dumers invironment parchet gorbechev racialize vladovic willmake kronors worrisomely guarrantee olgay cinet tedsters bushit soyjoy bicksler aurs provencare emapa katzourakis kaumeyer charmlessly reindicted jerril mediterra shork bouncily accordioned angerame raddho wetroom hawlicek atragene devistation norbank selfsufficient tatarowicz murier urni dincel kommounistiko toshirou sawasaki liepold biamp gabridge mujuthaba pennay kwapa ecrehous trackie carregosa siripan brickles matthewses stelluto castelbello cloudshield operat tresset alguera meritocrat wahabbist basturk kolpin llani unindicated hopleaf teenee kaziboni kuvshinova tritest mamarbachi xianguo doodycalls colonscopy retroscope coverflow berkwitz ministrokes fongwan gluckson sidakan epitiro kaloyanides supri coundley antireflux bercero swartzman britians caritiana uanble geam venjah preggo chammari unoprostone ermatov nukala cronyist siraz hospitol oursleves twitchiness hockeysticks confabs arestat jalaeipour ennstone changewave malebogo predigital auturo britflick clake slenk pureit vigent arkal apof bingsheng unifest downtrends apero kamaaina togged nawabad cyberbox kommunalkredit fdac artemisias handwarmers newlaithes 
tighest adrenalised rambaldini cliatt lajcak trizetto megastomias prochazkova wurud retout ipayment rogles cresyn ahmadzada deleveraged carlinsky giambelluca disarmers kermanian cyncial mcaliley arayama popultion qualeh zarbakht understimate gapay raceable kajuri cereplast exenberger whimpy haneline crimminal wegryn klatches travelsupermarket frikken metsavaht merrywell yoobamrung palmitoleate quidel westlawnext succcessful unjaded vlyf katzke astronergy mlawer changson auwaerter ocsober grienke baloloy redeemability spands nayaf chaffoteaux techstreet alshabab aestheticised shipitko eifan gladhanding mannigan decollete zuzin kottmyer novogrudek concertinaed unpresidential osteoplasty helioslough marquest amardev mishitting burgans lagzdina dramtically bonnant filebound quadramed fieldside interfaceflor overspreading hulf ickies genaudio nonpaper pursuiting ashcrofts osmonaliyev berentson ratshitanga rentel betablocker gwefan dimmery cholestorol rhaglen safflowers sallenger rushid gorgiashvili lumgair karyagin shokubai dipal tolerancy akpele burliest skoblikov goldins vincentric maqar pallancata michalchyshyn multaka mensil madurell inshriach gargoyled newshole becasse geegee shanman djousse carouses bridion gubenatorial yasmann melodeo dualmode ohlhaber hosselkus asliddin asteco glci womanized ngarambe sarigerme zests frenna botkier atheron chernomorneftegaz dracaenas postwork hatchel shillaker cyberdisplay hernst sandelman etnz lightborne jarrom jinduicheng anoraky malealea imundo bowriders ozhan neglegent kusmer skreba shajoy lewaravu underprotected fyock bibliowicz stoppardian kazombiaze fortyfold ballein moundros taoping rhenigidale cobbes brezhnevian pelaccio effortfully bergermeer nonjews cagnazzi wonkily sendups ontheir moelleken chiames sardeha zimbawe kocker orbuch goldengrove heijokyo waterloos zaarir orexo yermack rotert playfootball wasseem filsoufi bozard threatexpert theken robinul mough committedly whinnied misna vigal boezio alprin muniwireless 
hochgurgl keppelhoff cogeval tigher dynamise qurbi stabalized kookiest ticoll soscia duekoue medicide vaark woodburner allybar lhakhangs antii reaccelerated kiriakakis jaheem saadnayel hydrodome nongamers koverman tanoos choongh boordy pressive chawkay anderies prisna rones desperaux vivaty moneea joszt scrappiest brocko derar unproud adesta tbsc atomredmetzoloto polymedix mpuc lumera chemosurgery handrolling noncareer umiyuki gurbantunggut elithis thandisizwe outpitching kikuyuland elinogrel garibyan gatorfest thabault sanbona bracketologists hycrete sbrefa hersonski krudy irlan agued budihari kérastase everlyne trippiness crystalise besseberg arwain verbiscer skretteberg taccetti irigonegaray hmag ostir sheikhly marzouka poortmans speisman legowo rogered dragooning gruenstein ballahutchin binangun fugazzeta camahort hudziak conservitive undeland entura zeravica netchoice treesje kalief mackriell gaccio cpei serd leontaris musinski masaud smileybooks drollest patrak dalvinder myfoxphoenix underlap xcell nonwinners gpsi natterings storzer bernasek corway ingorant treiser knbt hallandsåsen seramas dwikhondito swanned asifi wyndal bowcutt tinseth midtower viggiu puhleeze racetrax marota belluardo musiccares palanzo ngdt casorso yertysbayev healthsmart lecka rasilez proesch dpri licsw sourpusses saghand canoodle keoghs myrthil proreader mulhaupt sanukite bnsp worldbeaters achtner vopium blondeness groople groupabout charcon leidenberger theochari luckhaupt nefariousness uyeki ipth mychoice sloganeer innoculations punchestowns insipidness crasbo choeden myespn balcas waldmani scanlife spoonable pichichero compnaies ivad kralyevich schwellnus senjaray citac ceau mereilles forclosed glycomark dissatified prepsters springier iufms veeva sixmilewater enyele zebrugge macksoud glitazones kontilai hevly shlein wiswedel dudl hodmezovasarhely clonings ohva aspidistras lemack castiglionis peripatetically ajras stonecastle caramanlis delagates seeit pirotti wajba billlion 
cagelike hugins starcaps boggier grosholtz bakkom kortuem kvivik yhey nusing canyonside aliriza talbooth lmga aasmundstad utrechtsestraat viradouro inaccordance ekeli leadres mutobo netcents buangan convington enell roshydromet kozelka foodvest sangduen skarssen prelapse paramiltary climatesmart elaa ziecker paksitani javers leshno bouncebackability hueh miqdadiya jakari chindex catcote tangwanghe idrissu durrua superlong hazuka windyhall merked rezconnect aserf abdelhai mwelwa hertzmann ivanplats aneres himbos vilaceca celebrityhood roofbeam typcially mabthera espad bajillions sgeis chdos dellaporta rippeth doim leanore eblex laundromatinee billotti raishbrook mengqian landsnes cydcor eslr dicastro sukhdave stavitskaya memolink thomassey visipaque badurdeen yesawich urnov kreidel ibrahimia postcrash mijoro mannuzza noninflationary alizyme cuunjieng stupek hadlington polanksy medstrat krygyzstan siyamak destablise noteless giftable ultreo banmiller beringea territo lamic gutlessness raychoudhuri seddi akaz hwnnw hbac rendells nantas hootnick kickouts reyda pogrebniak mutumbo rhapsodise cuzon zabaleen beroiz razorgator iacolino iotum paline stylemark qiujiang nonincumbent scholly hostelbookers famulare annualisation pitots cendon downlighters gerrero firly tahdia hascup diettribe ppdg namuncura decentralises artope colagiovanni zaranek constiuents prusci vanhool circlelending saidd awyren jading mootral smithbucklin ickier businessworks shuana okema honeman karamouzis erdaoqiao chepulis lyndin lockness greencycle idelogical floweriness jerime cardica sleepness gastelu zimondi sundher nicodeme fastballer chifunyise zubeyde lorich opensides gwsc kamrob skwerl staco rpis iljazi sandeels brandelli forwar competefor acqusitions sulkovsky blueskies jannett burutin ukccis paekdu nealry grandos dmec vilhjalmsson feraz yaldara chickcharnie chaupad birkmire kagro cciee kurbegovic geplak swiston portpin schewel njpa igodigital freidan hsopital protho steeliest khawari akinaka 
chirring timebanking mohamedain ivotronic segelstein junkier homaid sorros smsi mediaconnect strmecki clownishly pampe nutlike nonautistic ostrolenk soundabout billingses pyeritz nonforfeitable nerby nedanovski sandlofer ouirgane margoles pubby chicanna harerimana geosentinel elomire utilites podhajski cnpem spinale huxlin frable verbrugh machil predecisional tibbert scarved namugala fasick turaihi counterstrategy corupt choubina kleinhaus automart danyong loundy avanex frnakly chler vooks ronnee phurnacite accoya thuggishly dustlike brynmally pttow bossenger qaswarah wipsi enterprisewide szakaly datalabs awwwwwwww northless perezcano hilam urds krausner entralled slepicka gyurmey buildouts unfrock medjumbe ericsdottir schme mooore levetan twoway monitary horillo untheatrical hunshi understandin hillon micturating parkus polydopamine scotchguard tchilaia lungundu vdec amerah yorkshirewoman eyeshadows hunkiest puppyish idcg tweenagers husani fardosa scrogie tureli homemanager undereye ifthikar unsexiest vodpod brutzman promacta evco niaki chyngton cavvy wraig euroconsult jetbus houmous stepter cbeex faiure benchings gundogan wesal herubel rainsoaked ostensen applbaum bejaysus interlinings tritch zemlyansky deconto jamiaa cosport limco credant yepiz harralson pizzola salwens sevenhills peasents sieno fmlc niekirk fatless traditonally sentimentalising curosurf nsrt cossetted honeytraps tickett preformulation ibone ddod rellas luxlash mcaneney nosovice bedsharing gordji bodystep yogarasa planinic bombmakers lastonia microdistilling knca pohoryles imagenetix wootens daxas ronilson sirilal muhlenkamp befera veterens underhit ipct otana sarjo cagier zekelman chernoi suzettes taride windwood blingy khalass missimi affirmitive wolaner luduena nanodragster feeneys edulink townfoot lixun boomerangers dimieari bijeel gollywog schiestel twickers snivels valtchev muhlhauser mourayan marakis mtsc mvrs stosberg dominka hoggie latture dhanju resoling reinjecting myungji elitech 
hoellwarth nanev unjudgmental belth owczarski bnac choedak vaidisova contemptuousness bunol finalcut zayuna aocl accorn collasped costcutting viawest sciclone dourest chwaraeon hargey alchol xcz democratiques milinovic dinsmores condis dunclug discrimated smeland mousad zlatanovic djebbar parkervision deftest socialight moldow hoopty wojtecki pianin fizzier envigorated hpsas mayelikohan pordon trochet fiberweb comdexvirtual repucci benrock fusnesau dapidran gyorfi wilsterman wichcraft mickeyd yangpyong laneve mcely creemos jdimytai qianfo tolerent sagey elfenworks fbars marose nellcote drudger airportal essangui teachfirst operatin lewinksy mosalikanti shije goinggreen allegeldy overexpand chepkemboi nteziryayo untradeable worldcard daltonian nonsquamous mangiantini ikebal cherquenco grandluxe arzou odlozil decoff vampirical foxily kpnc soxman ponne guinnea showal uncommercialised timberlawn furtw steiker bankman ghazawi raisian dajohn breedveld deedie arfaa jockish ultratravel huldisch macrolane sathiyan vancisin forkful bedevilment bankfirst rosebrock lefraks cwaf saelee silkiest ainkawa aixtron suspose trattorie greengo ontex uncensured dermatologically bohanna obenschain funfit medflash skippyjon prefete iscn ymgynghori scantest icmeler wangsness wextrust superquadra eheart mapasua besecker petercam mazraq infocrossing burkheiser svanes sladjan herszberg giribone lizardy squibbed jaisham hapn pigd grishkoff bioenterprise gynnig alleron mckeeve barnholtz amvac hamsik patineur financal peignoirs eisiau minzolini telepiu crosstech inferometer reminyl janielle nobbly domainsbyproxy overamplified petroenergy smfm stagging eskovitz misiura leafers shaqs befouls shahabeddin customiser optimor chupka tannachy pmscs scarjo molica sehnert zzzs levittowns panetteria dahalani aspercreme jidori preganant reblackpool gherzi aujila yansha cryobanks roughneen assocations swifly pobanz edicson azmal lykourgou damchoe hmics wireforms gardesana hokiness bernardaud ayudas multaq 
naraki gaudiello societythe shopsins goodbrand liram towheaded chantana thumpings mzili wozzy irizarri geeing enotah aappo thatje tuttman darxia chowns checkett irksomely vahtera bourzai daycoval azazeel nannelli punchiest druidale kimbriel forgaard discoverx mcellrath hyclak centrico paragallo daryatmo unscrutinised uapd harebreaks natalo vanhooydonck churnings lasorsa peracchia albinski dianyuan rohd traited muhtaseb gromicko writeroom spenhill suppoters hornbarger gohir morquecho dirshe turchet wicab mazamanian supergraphic graap liftline bouncebacks kowalcyk baraou prisions mushraf arthroscopes hyvarinen flammia norklun katiria teleconferenced overfloweth pokrovnik yatskievych klaviter jamjoom pipsqueaks plechner tdvcodec illionois hubsi rushwaya mccraty cironi blta comapanies zhancheng oppponent dribbly hayab skedaddled rogowicz unstowing frienemies firstcity faridany accure gammagard harussani yalen salaheldin arrm rosnani kacandes chuffs sohat wynfrey alcobas miserabilism dogtopia oogjes storyvault damange frazeur pitnick spigaroli pedestrianize releaved eurotaxglass qaderzadeh underoccupied merkushev borval goeas krasn smallball irizuki coalmen barladeanu speet daugter bertucco knowehead tiresomeness yelnikov solian coquillat yeondoo namouh forida neopeltolide matatizo diory liugui unflushed snowshoed hogli landrell mwakasungura batwomen antithrombotics jessicas frincke allback anecdotalist slobbing unholster chantrill tellam kanzus leichtenstein showhorse corgentum fsmt entrancingly policharki deparments nonunionized vanillas glir nsqip nohpat iurc enovate cherveny telergy nonphysicians pamos wrange ewusi vobile sweatpant guolong disb zajtman selectadisc genocidaire hamidzada altamore chumra clacked squassoni mcgriddle faridullah drobisz asets aloun gertmenian reticient constituences nqobizitha reciben hbtc bahrains shult somadikarta falewicz pwnc benightedness pornthiva camiro qiodravu cdhp depegged helldorfer racebrook chemchemal xeriscaped rushower 
imataca oossanen ıf moenig lafeet hendzel binui naabzada spectical ngere hospodarske hausenblas aircap comcam gombossy fronteer retrogene liggers fuschias bagfuls bauert nasirzadeh wiimotes longmay watermaster dunievitz treanda incrimental atoyebi rawitch orock bendett blackberrying invigilate hajdasz orbaum usuary etlin michellod midevil howr phillipina smooched andarko eyesmart dywed izaurralde sukant primeurs tixylix boardfest goulios unstopable sedaqat majome soflens thuba ryaguzov furiousness guereda pizzette moronuki budesliga jpsk luminere dweud governership doneghy basyrov kristianto gwneud weathy jppm gmtn provamel inkubation kandaharis canawati overstyled appsec eathai elasha penggen gimer shoebomber elanbach eyewatering northoff tysson beseechingly glospace megamovie samure bomboniere deshar muhrcke fannying leveretts sabeckis cadna willdorf posioned lattif outdoorswoman misamore atxalandabaso stjc neiderer piccaso unlacing sharick khourshid sidearmed lifflander burrelli xiaoduan frazzini presepi abdennadher hypermesh renagel maronian wavelight underheated fstr tauntings adamovicz katoro theut trumba glenhill unlaundered togelius cellai pontarsais juliber fanatec emotionalize freewrite snbts barnao bestpractices preibus pulbere trayport ccsbt caccavella calosha wirehouse gklavakis crongeyer xantrex phytoglycogen haeley superprime mcgrorty kossie sabcs axene panflu nhow reactiv silverfern sandretti nemakonde landesklinikum demythologise baniulis spunkiest choska anahiem andrianony cspl flemke milosa quimet pgic stearing pontzer chadrick rogasch bidoon undermotivated fashionair scarpette overalled acquifer razoring meangingful exhaustedly rastafarism trichopoulou azadnagar baccellieri rorb weintz shemy datcp bwcabus yabad reorientating politicains rampager trebevic strulovici bohigian leesfield gibh sorbera wladyka clincial constitutency tiime gloersen disneylands farisa sinsuwongse splitty nextmap truecompanion giacchetto setlow smirkingly atisreal loorz 
yaquby gearworks mycoop bertilson stangler bashika owlishly pregancies eyjolfur kanoto senocak skarrild mantecal silversol solaicx rietze cozied bubser tinhay crouchley slipcovered aurik christoyannis cequent antiprostitution dfeb wollschlager knuckleballing macroeconomically srisook jiyad unight emafo ccbp pagnamenta eacha gawell giarra washbowls unaffectionately kehoes metropulos xvala frohwein showbizzy reappeal congresscenter caravanos consigner humaidhi uhlenhopp nashvilles racusin alvogen waayeel healthtalk skunking gcec kotchen mazuka danaei kolltan himandhoo garbutts combinatorx fettuccini widewell tivoed robala komlosi haloumi stirlitz travelmaster lassise homoerotically pejcinovic carruades switcheroos turkewitz perithia dealertrack quilvest moatassim alinia meshad yamoun arysta deinhardt imageware febbo tshkinvali fortrans runions kochifas veiwers rolufs baragar vineys ladderlike yinhong zenie dojaka heartrendingly djordjevich hitlery madrenas bronczek gianfranceschi karabak cistuses webctrl grivnov burleys farrakan gywneth daisycutter zenns flouch llawn pierzchala metatartaric contituency tranquilise phadungsil mccamy kebkabiya botwinik semcken kenneway khyali wwoofing castron mikoczy ahfc glasers begolly deposer partnerka spendy gnvc blixseths thanatologists closeknit lladd virtuoz apoligised ammenities shimaa neosa chatterati toumeh paraportiani daypacks paavonen sauey roofscapes cmft prosound sittert masibambane mccammack mccauliffe oshel nydj nesbo kimotong dshi gelatos spielbergs mikulska misdefense qleibo ammuntion jachnik totallyjewish honsberg oxilp einum olass salaisons ruesselsheim telepan komanski flautt limbad ruessi repolling neafsey stuggles ixempra upfitting brokercheck lacalamita detonative youbet cempa prekaze swaggeringly stursa sampong spitzauer buinevicius nortin elgammal miguelez dulford drpic tegwyn kerstie loubeau sokaluk gokce crme scheri budgeters myday housseini fiream hellishness soussana witchiness septemer itallian thenuwara 
antiproliferation schappler microplanet khogiani giancarla tepav incestuousness straayer sqaud sielemann touranment ethanols mineer anamarie ncjfcj tavulares walkiria tanjiashan fiotakis enterprisingly pushinka exámenes ownit timeing zojoji diseasome felana megwa harjedalen grapplings errami rullah diepolder nasfaa ferniehill thingummybob mahabeer forklifting dikoy creanza marzah urethras wirecutters tpps villagevines postpay steirteghem gabura parcio zirandaro emcp presssure ramussen pimpi pabell boudrot satomiae rearly bardaweel lowde prattled rodota securitizer toates widlife karibuni woolich egelien fibropapilloma artstorm dekelboum guantanimo hallihan cherkov makili doryman wahanda mbpa magarsa davore arroni reoffenders kotevski heximer behud ticey palatably enginee paradisis ruzgas consolingly buesaco isakovic westins zinnanti suaya aytre jspca bageecha nardine guardala leifland pummelos smartsense hydeskov pitzarella marchadier miksanek counterrevolutions connextions allmenus siig flexbar facehunter stalkerazzi teeni miscuing woger ersie kickz keslassy biotown mehboba voied deconsolidated lintala smugs overweaning hillraiser lailvaux presliced zonias rimli snedegar doyennes fetterhoff kruegers cnossen faizasyah bagila newsmith mwakio gubta mujawayo onerously semisecret schaye alisara briitish gargagliano vizioncore handstitched rohmah lovies opencrowd libertiny gaľa delleney biocrude ashstead kausal neidl lunchbucket dharmawardena bellieve ultcw lvaro littleheath kyriad pursuiters mbbi slyer dihok boggild lacquement aspinalls osenat pynter laksin smartstart manarin bogaards sibotshiwe untackled endgadget unthreateningly deodorise mohammaden wedepohl handyphone abotsway uncinematic caviola liquidware lanarth overcollateralization mourino nfus crocetto cheptais melmount motzer boespflug jakstas talkfests cresapartners shmear powergenix hardekopf narcoterrorists reflationary provexis postfight amselle uplinq muirhall leeney zeuli trusim acciardi tessas 
plotholders flenner krosby halfnight openband bougette obagi sebrle newstex mcwar mbmg angemi moseyed fearmongers lumpu palanco ahour sinutab sodian sheikhli sanofipasteur cinephilic lothstein soubie zerno bkhm cystography usocial sermonised obhi ammoudi lousier ereckson paracletes krpshtskan giganomics ccording seasonique perlroth fieriness mugginess barthelemey xenoport metrogel tranc teguest muslimyar multiclient foccacia kosaisuk menopur hizer schelberg barnlike cebri tosay turnow analex powerskin sidie schoeber ahijado zlhr cushty igadd fpies putumattalan oglesbee selfesteem mallipo subisidies afrough justwhistledixie cycnical cfnc woollies biben simultaneoulsy kreishan uoya adapoids krovvidi patelis yalayalatabua brasato connerys funkily surendiran comora kanmen gasby heglin ameripath fanswarm lydiatt metalmaster chiroma windemuth lukacz splashier krentcil cataouatche yastremskiy mcclanaghan megamerger shticky artherosclerosis grueskin tendy lignieres gallmeyer girthed homman jebur rahati wahayshi bluetraker yospe tuduv infantilising dufflet balony adrianto appcraver rubbled escapia mostari belche gülenists khalidis weitzberg souhail usfm chirpiness proffessors musolini stonily nothingburger pengassan mantrips karele ungenteel wepower pipline panoramically dinnin bravey subeh poerschke dzemaili slentrol pallipat unequivical questek pointment tawain donnybrooks abaar jambia moerheim pithed emalie ontier underbidder shourong kaldoun authenticom capak bluedogs soquem junyao kenzero chocbox asmq dabovic boersen videocan skelid pingzhong carrycot cranfleet bonsib ncreif waterseal eucalyptuses vallish hafeezuddin brodricks christianists machler ariaan psychopathically luxmanor bayari frivol scheeler embarressed stickability reganomics twttr tystiolaeth ndds dagmawit davise startegy jetin saidenov unbendingly derbyites blathered jamayel mccullars theopold bulgers mastoris multiseason cyfarfod levano darrenkamp telecardiology bpsi nytta stadelmayer contemporised 
fouhse lunkheads evites firsthealth clywch trvs coratti reletting esconsed broffman stonefire lorans newspics karlbaum penaflorida northburn hurkey ctiy tinselled marketsandmarkets lilang hirtes rcent rougeou contribued fobis erwiah dellavigna unstainable wurstfest palpitate oomc neopharm groenefeld brainlessness screentonic viisage hotelchatter lequn kemala reesby diemtig ellisman mdco vrsn mamade invastion unambivalent haileys trivan sternick rooperi serff herrgesell wieczynski superlabs bidabe hfth uzumcu cojan ignia abstraktes strenthen aqualine cialdea jeyapalan negronis paudert zhenyao atiende magnetom hymning bonadie trollopian thagi schnetler esrp stathakopoulos nachterstedt boipeba ultrasecret panshir ecocidal katania blendini recalcine momager saphris rediculas diavlogs ovcon susick jihadic coneybury hethmon shirm rishwain kireker mirell sheeren eramerica luxehills touchiest yalon banglaore rommi lenigas megaquake mascone coruption smartjog gollinger babeh pendall misek subaqua dallavalle voreloxin dithery gonacon janumet vettes staylor senternovem exce hitleresque rocen digrf ramondini bushway cbase ihilani hirtenstein slakteris unete hilarides serices ticca shockin pedegg cinep transtac appma aqah prudentiel rostar illuminatingly flimflammed boogied hyfforddi plzensky eithinog otesaga aspling pollalis indigovision blachard porciello tereas eshki dalgish akafuku petroquimica tunlan edaily oorjapac punjani sagie cnrd hometowne conced silga multigame brinlee surepress statemented jharkhali pasquarello gerianne pastapur gartain lamonaco bluehenge sonc pipho mischon tradetech bunnets penjore anshur sondervan ebels facai biostructures klenert bagrock nutrarev dusoulier arjowiggins skylogic ================================================ FILE: assignments/word_transform/eval.vocab ================================================ tiene,has habían,had entendido,understood clase,class harry,harry pluma,pen guerra,war tan,so dios,god le,you estés,are marea,tide 
mr,mr el,he jerry,jerry puedo,can coño,cone marca,brand debió,must diferente,different tras,after rival,rival películas,films ésta,this piel,skin intención,intention show,show ir,go os,you aumentar,increase país,country marcador,marker perfecta,perfect ben,ben presión,pressure pasada,pass deje,leave dia,day dólares,dollars porque,why maldita,damn locura,madness fotos,photos hinchar,swell regresar,return alto,high chico,boy soberanía,sovereignty aquella,that hables,speak poder,power tomado,taken verde,green nube,cloud playa,beach mercado,market nadie,nobody contrario,contrary olvidar,forget jodido,fucking altavoz,speaker pobre,poor oigan,hear viuda,widow vivo,alive verle,see creí,believed malas,bad hubiera,would perra,dog muestra,sample bienvenidos,welcome calcetines,socks dónde,where teléfono,phone huele,smells clientes,customers sería,would biblioteca,library paciente,patient ruido,noise pasa,happens diplomático,diplomatic llamaba,called prosperar,prosper nosotros,us vas,go emergencia,emergency sucia,dirty desastre,disaster david,david pensar,think real,real humano,human vuelvas,return estaría,be comprar,buy red,net sea,be ray,ray presa,dam ganado,won sexo,sex oficina,office recibir,receive maravilloso,wonderful dura,hard estupendo,great depende,depends bastardo,bastard media,half pedazo,piece unas,nail ojalá,hopefully banda,band metros,meters siente,feels posibilidad,possibility inevitable,inevitable batalla,battle señorita,miss peor,worst naval,naval buenas,good completamente,completely sientes,feel paso,passed callejón,alley observación,observation perfecto,perfect flor,flower imposible,impossible hagan,make conversión,conversion trasero,rear diez,ten línea,line c,c buena,good adelante,ahead ee,ee otras,other voz,voice mofeta,skunk política,politics ah,ah nombres,name maestro,teacher ablandar,soften dará,give encantado,charmed cállate,quiet ocho,eight fuimos,went fiesta,party quedo,remain sentí,felt cansado,tired oro,gold abierta,open cámara,camera 
magnético,magnetic ratón,mouse seguro,insurance como,as imagino,imagine guantes,gloves espacio,space otros,others bailando,dancing herido,injured oportunidad,opportunity bobby,bobby robert,robert uso,use encontrado,found manos,hands ver,see afuera,outside habéis,have quienes,who iluminación,lighting fácil,easy menor,less dirección,address negocios,business privado,private lengua,language informática,computing mary,mary tratando,trying ejército,army perros,dogs cosecha,harvest siempre,always vienes,viennese cabra,goat gana,desire empieza,starts deben,should vengo,come tuvo,had dolor,pain tuve,had efecto,effect quedado,left llegue,arrived caluroso,hot organizado,organized quede,stay estarás,be eso,that hijos,children tuvimos,had vergüenza,shame alegra,happy gobierno,government caro,expensive oscuridad,darkness investigación,investigation mike,mike dinero,money hacia,toward dulce,sweet siéntate,sit parecer,seem vistazo,glance historias,stories vender,sell roja,red gallo,rooster vayan,go chicos,boys contrato,contract ================================================ FILE: assignments/word_transform/train.vocab ================================================ catedral,cathedral escúchame,listen accidente,accident té,tea gorda,fat regresa,returned negación,denial pato,duck precisamente,accurately imagen,image persona,person pistola,pistol donde,where café,coffee negocio,business quería,wanted pensaba,thought espectáculo,show seguridad,security juvenil,juvenile venga,come alrededor,around eres,are robo,stole especial,special solos,alone olvidé,forgot árbol,tree danny,danny hicimos,did ay,oh noche,night regalo,present entiendes,understand disculpe,sorry es,is impulso,impulse interactuar,interact cerebro,brain cosas,things supuesto,supposed reina,queen baile,dance ayudarme,help traído,brought escuela,school diario,daily tu,you gran,great principio,beginning dejas,let vuelve,returns voluntad,will favor,favor personal,personal directo,direct tal,such lobo,wolf 
inmigrante,immigrant semanas,weeks base,base interior,inside preguntar,ask pasé,pass tejer,weave lector,reader oigo,hear piedra,stone madre,mother hoy,today caballero,gentleman sistema,system familia,family podía,could examen,exam restaurante,restaurant conveniencia,convenience cara,face hora,hour empleo,job pista,track pronto,soon año,year millón,million pasará,happen bob,bob domingo,sunday hacerme,me maravillosa,wonderful brutal,brutal ciudad,city come,eat billy,billy incalculable,incalculable deleite,delight debido,due mala,bad estúpido,stupid libre,free contacto,contact enamorado,love desde,since pasar,happen bailar,dance verano,summer prima,premium date,date mano,hand cine,cinema bonito,beautiful consecutivo,consecutive conocer,know sermón,sermon señoras,ladies tigre,tiger señora,mrs recuerdas,remember cuarto,room vez,time aquí,here repugnante,disgusting estoy,am verás,see dio,gave ganas,forward amigo,friend tendré,have química,chemistry verdadero,true cansada,tired cocido,cooked cual,which cielo,sky policía,police padre,father dando,giving asiento,seat toque,touch agente,agent isla,island cuántos,many nena,baby entender,understand instante,instant iglesia,church suerte,luck luego,then perfectamente,perfectly animal,animal corazón,heart gracias,thank prefiero,prefer creía,thought renta,rent delgado,thin bañar,bathe estuviste,were continuar,continue la,the llevaré,take comienzo,start mujeres,women vea,see creen,believe control,control cabrón,dumbass mitad,half arena,sand absolutamente,absolutely mata,bush doy,give conejo,spider ti,you detrás,behind hablamos,speak anna,anna encuentro,meeting perdona,forgives mayor,higher ganar,win trabajando,working gay,gay encontró,found conseguir,get peter,peter funciona,works preciosa,precious esperen,expect hacemos,make haré,do velocidad,speed vecino,neighbor crimen,crime posición,position bosque,forest nuestro,our hecho,fact sr,mr tenía,had saliendo,leave ángeles,angels nutritivo,nutrient final,final nota,note asunto,issue 
nos,us carga,load talento,talent segundos,seconds apenas,barely explosión,explosion alma,soul vaqueros,jeans mujer,woman otra,other idea,idea abogado,attorney rayos,ray crudo,raw acuerdas,remember anillo,ring mente,mind parte,part mal,wrong proyecto,draft chaqueta,jacket listo,ready onda,wave tommy,tommy lados,sides había,was buenos,good importante,important dama,lady aeropuerto,airport irresistible,compelling siento,feel corriendo,running oscuro,dark mirar,look edad,age salgan,leave papá,dad tardes,afternoons tío,uncle fantástico,fantastic memoria,memory camisa,shirt confianza,trust perder,lose nueva,new comida,food momentos,moments vamos,go cuento,story estupidez,stupidity teológico,theological nuestros,our amo,love cama,bed sois,are dijiste,said ninguno,any sorpresa,surprise sucio,dirty tarde,late ciudadanía,citizenship crucero,cruise detente,stop pulmón,lung cinturón,belt siendo,being traje,suit cuidado,attention niño,boy tenga,have intentar,try enseñar,teach extranjero,foreigner llamas,calls tontería,nonsense mierda,shit tomar,drink bien,well lastimado,hurt locos,crazy militar,military motocicleta,motorcycle acá,here sí,yes calor,heat libro,book ya,already dar,give junto,together nivel,level idiotas,idiots profesor,professor unos,some horrible,horrible hacerle,make deseo,wish sostener,sustain odio,hate días,days despierta,awake relámpago,lightning ser,be acaba,just todo,all quedarme,stay estará,be mucha,much vidas,lives basta,enough enorme,huge religión,religion querida,dear pongo,put creo,believe llegamos,arrived empresa,company podré,can diablo,devil demonios,damn verá,see pregunto,ask visita,visit socorro,help feliz,happy bar,pub temprano,early piscina,pool a,to exactamente,exactly bicicleta,bicycle intento,attempt código,code objetivo,objective culpable,guilty gustó,taste miles,thousands doble,double jack,jack dejó,left encontraron,found ponga,put partes,parts filete,steak común,common maestra,teacher ves,see cebolla,onion resto,rest iba,going vena,vein 
tienes,have ceño,frown fusil,rifle tranquila,quiet pienso,think próxima,next llevan,carry hablan,speak espada,sword r,r drogas,drugs usar,use frustrar,frustrate llevar,carry muchachos,boys democracia,democracy medicina,medicine navidad,christmas lluvia,rain bella,beautiful esperanza,hope animales,animals dejaste,left sola,alone grandes,big comenzó,started exacto,exact esperaba,expected bonita,beautiful charles,charles especie,species biblia,bible ey,ey humanos,humans trata,about duda,doubt muy,very majestad,majesty cambio,change estar,be habría,be límite,limit honor,honor comienza,begins mortalidad,mortality lista,list muchacho,boy prisión,prison tome,take mono,monkey cuando,when rey,king durante,during contento,happy ejemplo,example volveré,return técnico,technician buscar,search fuerzas,forces difícil,difficult vaya,go jurisdicción,jurisdiction francés,french cuesta,cost cuántas,many tv,tv castillo,castle cinco,five cambiar,change realmente,really baja,low regreso,returned hace,does decirle,tell fatiga,fatigue viene,comes computadora,computer viernes,friday tenido,had bebida,drink suena,sounds limpio,clean ha,has grande,big juicio,judgment quedan,are mojado,wet cambia,change hijo,son papel,paper jugar,play carrera,career trabajar,work especificar,specify debí,should frente,front escritorio,desk cariño,sweetie matarme,kill necesitas,need hombres,mens mansión,mansion educación,education idiota,moron futuro,future planta,plant pagar,pay compañero,companion estados,state cosa,thing pendientes,earrings llevó,wear estas,these taxi,taxi quieren,want pápa,pope sofá,couch mas,more especular,speculate hubo,was ideas,ideas débil,weak querido,dear mejor,best vino,wine coordinar,coordinate sostenible,sustainable california,california ocurrió,occurred intercambio,exchange comenzar,start chicas,girls oye,hears viste,dresses fui,was usa,uses disculpa,sorry direcciones,directions distancia,distance diablos,devils gordo,fat pocos,few diga,tell toda,all haber,have srta,ms 
hablado,spoken victoria,victory príncipe,prince últimos,latest multitud,crowd ve,go elección,choice alguien,someone tengas,have pensando,thinking prueba,proof debes,must importa,matters petición,plea casa,house cumpleaños,birthday actualizar,update tenemos,have usted,you pudiera,could loco,crazy médico,doctor beber,drink eh,eh estan,are jake,jake respeto,respect freno,break camino,path razón,reason sol,sun cuerpo,body motor,engine recuerda,remember pareces,seem depositar,deposit miren,look seguir,follow guapo,handsome escritor,writer quieto,still brazos,arms haces,do empezar,start entra,enters cuál,which presidente,president armonía,harmony oiga,listen pedido,order intelectual,intellectual necesario,necessary dedos,fingers punto,point alemán,german granizo,hail salud,health irás,go guapa,beautiful sandalia,sandals pruebas,tests elefante,elephant favorable,favorable darte,give preocupes,worry llega,arrives uds,you muertos,dead ningún,any horno,oven darme,give flores,flowers entrar,enter formas,shapes enemigo,enemy llorar,cry lamento,lament hola,hello johnny,johnny pared,wall gusto,taste propio,own todos,everybody salió,left amar,love encantaría,love extranjeros,languages republicano,republican tuyo,yours será,be podido,have estamos,are gratis,free cliente,client llegó,arrived caucho,rubber debía,should sido,been abrigo,coat excelente,excellent naturaleza,nature blusa,blouse música,music probabilidad,probability estrella,star san,saint cascada,waterfall terminar,terminate depredador,predatory sra,mrs sarah,sarah puerta,door busca,search seleccionado,selected jardín,garden libros,books ciencia,science encontré,found amas,love pues,well escuchar,hear mataré,kill pobres,poor pequeña,small pez,fish llama,call hacerlo,do sociedad,society creerlo,believe tratar,try ponte,ponte alquiler,rent sir,sir lanzamiento,launch caso,case inherente,inherent max,max información,information película,movie aun,yet aceptación,acceptance los,the museo,museum solamente,only pasando,passing 
departamento,department tuya,yours iré,go ajo,garlic humor,humor sigues,follow invencible,invincible predicar,preach decisión,decision autobús,bus avión,airplane zona,zone de,from conocía,knew casi,almost héroe,hero digo,say tenedor,fork esperar,wait pelaje,fur garganta,throat conmigo,with eddie,eddie eran,were largo,long confiar,trust movimiento,movement lámpara,lamp nieve,snow tesoro,treasure hermanos,brothers quedar,stay novia,girlfriend fuera,outside inspector,inspector lee,read damas,ladies irse,leave podrás,can par,pair completo,full anoche,night especialmente,especially fin,end mejores,top rico,rich muerta,dead fondo,bottom sé,know amigos,friends toma,taking quieres,want vacaciones,holidays irnos,leave universidad,university buscando,searching veinte,twenty vida,life das,give alegro,glad bolsa,bag joven,young bebé,baby caminar,walk pie,foot estabas,were john,john llegar,arrive detective,detective programa,program hice,did somos,are entiendo,understand habrá,have apuesto,handsome calma,calm hombre,man vuelto,turned marcha,march tipo,kind amarillo,yellow quédate,stay arco,bow mami,mommy definitivamente,definitely techo,roof carro,car irme,go tema,theme estén,are llegué,arrived colocación,placement casado,married interesante,interesting articular,articulate delante,ahead veras,see prisa,hurry sentir,feel tenéis,have medio,medium significa,means poner,place piensas,think decir,say cuentas,accounts después,after azul,blue arrepentirse,repent siéntese,sit propiedad,property algo,something perdido,lost montaña,mountain daré,give uno,one frágil,fragile noches,nights loca,crazy hacer,do rostro,face ambos,both belleza,beauty bronce,bronze capitán,captain supongo,suppose pidió,asked nuevo,new muerto,dead hubieras,had familiar,familiar mirada,look prometo,promise trabajo,job razones,reasons querer,want piso,floor giro,twist semejanza,similarity costa,coast agradecer,appreciate saberlo,know estuvo,was circulo,circle oí,hear puerto,door tú,you repente,suddenly barco,ship 
fotografía,photograph hogar,home hacen,make mí,me terminado,finished minutos,minutes ustedes,you resulta,result jóvenes,young ego,ego tambien,also dejen,leave empezó,started cargo,position comandante,commander almohada,pillow hago,make caballo,horse demandante,plaintiff canción,song profesional,professional escena,scene elegible,eligible mayoría,most tribunal,court comentario,remark iremos,go habló,speak dice,says morir,die porqué,why piensa,think descansar,rest potable,potable trato,treatment tuviera,had cocina,kitchen club,club ahí,there reunión,meeting sal,salt sean,are espiar,spy gracia,grace calle,street reloj,clock ayudar,help ropa,clothes calles,streets beso,kiss tarjeta,card mark,mark francia,france fracción,fraction hará,will geometría,geometry debajo,below trampa,trap perdone,forgive puta,bitch chispa,spark viviendo,living jefe,boss bajar,down intimidad,intimacy esposa,wife jabón,soap casas,houses ironía,irony propósito,purpose personas,people muelle,dock bote,boat pero,but esta,this matar,kill abuela,grandmother niebla,fog camión,truck sale,leaves plato,plate oyes,hear inocente,innocent dan,give pide,asks única,only referir,refer hizo,did revólver,revolver atención,attention injusto,unfair ésa,that gustan,like equivalente,equivalent mi,my van,go aburrido,boring perro,dog alcalde,mayor entiende,understands busco,search bueno,good dormido,asleep nunca,never precioso,precious éxito,success blanco,white cuanto,many encima,above delicioso,delicious tantas,many álgebra,algebra whisky,whiskey perdonar,forgive oh,oh otro,other foto,photo escuche,heard pájaro,bird negros,black robar,steal trabaja,working fortuna,fortune al,to relación,relationship fuerza,force llanta,wheel embargo,embargo abierto,open palabra,word serán,be problemas,problems thomas,thomas con,with grueso,thick bill,bill caliente,hot bañador,swimsuit dejes,let aburrida,boring alemania,germany su,his garantía,guarantee unidad,unity atrás,behind temo,fear inglaterra,england salido,protruding m,m 
escucha,listen disparar,shoot además,besides molécula,molecule obra,work ninguna,any segundo,second mía,mine agradable,nice listos,ready claro,clear vemos,see palabras,words sube,up último,latest noticia,news cielos,heavens felices,happy dijeron,said situación,situation toca,plays preocupado,worried tensión,strain todas,all dave,dave puertas,doors volvió,returned tocar,play ayude,help vieja,old honesto,honest parecen,seem j,j elaborar,elaborate vuelo,flight vacío,vacuum entre,between parecía,seemed noticias,news cartas,letters amante,lover esperando,waiting entonces,then cheque,check aduana,customs vayamos,go espina,spine ducha,shower acusación,accusation sigue,follow mientras,while retirada,retreat orar,pray absoluto,absolute llevas,take delincuente,offender danza,dance acabo,finished tren,train vendedor,seller física,physics masa,dough pon,put bautismo,baptism dijo,said bajo,low divertido,fun protestante,protestant mataron,killed s,s nuestra,our luchar,fight nariz,nose arcilla,clay saca,removes york,york serás,be conducir,drive tranquilo,quiet turno,turn sano,healthy gusta,like minuto,minute fea,ugly era,was dedo,finger excepto,except siquiera,even amable,friendly bravo,bravo ayúdame,help boda,wedding oferta,sale hija,daughter adónde,where dueño,owner misión,mission doctor,doctor seguramente,surely saben,know paz,peace repentino,sudden cualquiera,anyone epidemia,epidemic tarifa,rate equivocado,wrong murió,died serio,serious veré,see www,www estimación,estimate salga,out dentro,inside aqui,here mamá,mom destino,destination cuello,neck nuestras,our puente,bridge suficiente,enough debe,should experiencia,experience embarazada,pregnant chofer,driver tienda,store pantalones,pants americano,american paseo,walk pone,places honestamente,honestly pata,duck cambiado,changed parque,park partido,match biología,biology quedó,stayed sangre,blood baño,bathroom hechos,acts lado,side primero,first levántate,raise hey,hey escuchen,listen diferentes,different velcro,velcro 
genial,great quedarse,stay china,china está,this arma,weapon mis,my verdad,true filosófico,philosophical patata,potato templo,temple novio,boyfriend hospital,hospital abuelo,grandfather ocurre,occurs vivir,live oír,hear suéter,sweater deber,must vete,go sentía,felt podemos,can diciendo,saying ventana,window sentido,sense librería,bookstore general,general quién,who vos,you verlo,see escaleras,stairs cuestión,question tendremos,have complicado,complicated trauma,trauma hermano,brother semana,week veremos,see culo,ass presuntamente,allegedly millones,millions antiguo,old fe,faith consejo,advice molesta,bothers turismo,tourism has,have intervalo,interval edificio,building gustaba,liked oído,ear decirme,tell alex,alex alguna,any toalla,towel dame,give espalda,back cerda,pig cenar,dine arrodillarse,kneel di,gave camboya,cambodia mapa,map venir,come monasterio,monastery vigésimo,twentieth rueda,wheel más,more hablé,talked diferencia,difference nuevos,new presente,present alboroto,riot enferma,sick hablas,speak saldrá,will vd,you corre,run ante,before imbécil,fool darle,give voy,go echar,throw enderezar,straighten corte,cut tengo,have comer,eat rana,frog ataque,attack años,years contar,tell vine,came droga,drug yo,i peligroso,dangerous necesitaba,needed un,a brillante,brilliant última,last ligero,light por,by primer,first matrimonio,marriage dormir,sleep hablar,talk soldados,soldiers barrio,neighborhood director,director terminó,finished pila,sink vosotros,you vista,view quisiera,want correr,run diría,say queda,remains primo,cousin luna,moon broma,joke nosotras,we ok,okay rápido,fast jim,jim hermoso,beautiful pedir,ask esa,that james,james patada,kick bienvenida,welcome viaje,travel sabemos,know hombro,shoulder gente,people unidos,united londres,london pido,ask triste,sad obispo,bishop vuestro,your tenías,had quien,who constitución,constitution parece,seems matado,killed preguntas,questions cargador,charger demasiado,too dije,said correcto,right irte,leave digamos,say 
público,public están,are acelerar,accelerate saber,know armas,weapons linda,pretty pelear,fight estúpida,stupid encanto,charm estaremos,be tendrás,have sepa,know conocido,known si,if cae,falls dejo,left muñeca,wrist montón,heap fundir,melt venido,come abajo,down energía,energy esto,this tendrá,have perdón,sorry ahi,there hiciera,do correa,belt pantalla,screen agua,water pequeños,little ruego,beg ocurrido,happened henry,henry tendrán,will estación,station bastante,quite termina,ends cola,tail muerte,death que,what ayer,yesterday panaderia,shop boca,mouth hacía,toward b,b haciendo,doing caballos,horses modo,mode secreto,secret verte,see gato,cat fábrica,factory piensan,think sabe,knows mensaje,message dime,tell cierre,zipper tampoco,neither to,to estado,state llamada,call d,d muchas,many ojo,eye lapicero,pen tanto,much pierna,leg acabar,finish ojos,eyes puto,fucking cresta,ridge comprendo,comprehend grave,serious debería,should centro,center mismo,same viudo,widower órdenes,orders monstruo,monster deberías,should visto,viewed piernas,legs nada,nothing señor,mister correos,office teníamos,had borracho,drunk estadio,stadium encuentras,find pueblo,town clases,lessons natural,natural dices,say proclamar,proclaim fuese,was olvides,forget defensa,defending estarán,be supe,knew carne,meat antes,before llave,key manta,blanket llaman,call coge,grabs través,through izquierda,left asuntos,issues algunas,some enfermero,nurse quiénes,who probar,try cristianismo,christianity leal,loyal detalles,details jugando,playing sam,sam cierto,true placer,pleasure pollo,chicken pase,pass mundo,world miedo,fear dos,two aunque,although hermana,sister patrón,patron puñetazo,punch jamás,never tony,tony trago,drink falda,skirt explícito,explicit televisión,television sino,but hay,are finalmente,finally decía,said salida,exit adentro,in caja,box hígado,liver despierto,awake escapar,escape rica,rich juntos,together nervioso,nervous papi,daddy cerrar,close dibujar,draw negro,black suya,his 
todavía,still anterior,underwear seas,are estuviera,was incluso,even mañana,morning informe,report tolerancia,tolerance gloria,glory contigo,with teatro,theater naríz,nose hablando,speaking américa,america tiro,threw pareja,couple me,me daño,hurt cuidar,care copa,cup oso,bear juro,swear cantar,sing arriba,above libras,pounds simple,simple lugares,places pudo,could tendría,have revisión,review veamos,see trajo,brought volver,return ellos,they problema,problem alemanes,german son,are diré,say decirte,tell ama,love aire,air opción,option ministro,minister veía,looked vio,saw naranja,orange walter,walter huevos,eggs encontramos,find amiga,friend muevas,move día,day soldado,soldier cabeza,head lapiz,pencil haga,make habitación,room fútbol,football denso,dense mantener,keep perforar,drill luces,lights charlie,charlie qué,what tomó,took campo,field matemáticas,math lleva,carries bienvenido,welcome cita,appointment patrocinador,sponsor queja,complaint carta,letter caer,fall siete,seven empujón,poke viejos,old estudiar,study mil,thousand orgulloso,proud llamar,call océano,ocean ido,gone poco,little dientes,teeth justicia,justice dejado,left viejo,old lleno,full salvo,except posible,possible lejos,far dígame,tell allí,there cerdo,pig rojo,red intenta,try quedarte,stay carretera,highway polvo,dust del,of parar,stop nave,ship juego,game ciclomotor,moped parís,paris hubiese,had las,the p,p causa,cause conoce,known alegar,allege él,he feo,ugly haya,beech vuestra,your líquido,liquid tonto,stupid siguiente,following sentado,seated vestíbulo,hallway pelea,fight profesora,professor menos,less querría,want cerveza,beer bromeando,joking respecto,respect inmediato,now mando,send sólo,only seré,be economía,economics lleve,carried verla,see esos,those roma,rome asesinato,murder colegio,college charca,pond debo,must pelo,hair quizá,maybe sábado,saturday recortar,trim leer,read inmediatamente,immediately capaz,able aprender,learn españa,spain llamaré,call viendo,seeing olvidado,forgotten 
mesa,table officina,office enemigos,enemies mirando,looking madera,timber acción,action aquel,that acerca,about tener,have gustaría,like actuar,act ballena,whale cena,dinner solía,accustomed deja,let total,total bus,bus ave,bird viento,wind joder,fuck mentira,lie umbral,threshold cayó,fell compañía,company operación,operation tapa,lid casarse,marry amor,love bomba,bomb conozco,know anda,walks invención,invention cuatro,four sur,south sabías,know extraña,strange llevará,carry compromiso,compromise sheriff,sheriff espere,waited volar,fly tanta,much contabilidad,accounting rutinariamente,routinely libertad,freedom abre,opens silla,chair haremos,will tomando,taking sobre,on precio,price cinta,ribbon para,for aspirina,aspirin motivo,reason perdió,lost totalmente,totally digas,say sus,their señores,sirs falta,lack muere,die zapatos,shoes hiciste,did recuperar,recover permiso,permission malditos,damn io,io electrónica,electronics seco,dry puntos,points crees,believe capa,coat sigo,follow guardia,guard ágil,agile ahora,now nuevas,news cerca,close llevo,wear pensé,thought peligro,danger en,in brazo,arm sombrero,hat preocupe,worry rato,while responsable,responsable michael,michael inevitablemente,inevitably podremos,can cierra,closes almacén,warehouse extraño,strange nombre,name rosa,pink déjeme,let éste,east hable,talked dejar,leave río,river color,color oeste,west alta,high juventud,youth contribuyente,contributor estudio,study raro,rare lucha,fight pesar,weigh pueden,may nick,nick pasado,past aspecto,appearance joe,joe sucedió,happened traer,bring pijama,pajamas you,you escupir,spit puesto,position eras,were vestido,dress ángel,angel adiós,goodbye demás,other hayas,have sueños,dreams cuchillo,knife demócrata,democrat sirve,serves da,gives aquellos,those tiempo,time cruel,cruel valiente,brave derecho,right permite,allows codo,elbow equipaje,luggage abrir,open cabello,hair papa,dad graduación,graduation leche,milk periódico,newspaper lago,lake estufa,stove salir,leave 
puse,put forma,shape acto,act roto,broken luz,light orden,order conoces,know cada,each veterano,veteran varias,several mucho,much tránsito,transit vale,okay plan,plan también,also jesús,jesus sargento,sergeant auto,car chica,girl prensa,press continúa,continue duro,hard dado,dice haz,make durmiendo,sleeping coger,take inteligente,intelligent preparado,prepared pies,feet estaba,was tornillo,bolt ellas,they uh,uh ley,law diccionário,dictionary verdadera,real cálculo,calculus vive,lives según,according viva,live paja,straw dé,from asesino,murderer mire,look espíritu,spirit una,a coronel,colonel jacob,jacob cabo,cape mira,look tí,you va,goes servicio,service carajo,fuck tengan,have entrada,entry espera,wait reservado,reserved vuelva,return cálmate,calm respuesta,answer bañera,bathtub pedí,asked steve,steve recibido,received fué,was espejo,mirror maldición,curse nacional,national quiere,wants habla,speaks the,the culpa,guilt lindo,pretty valle,valley sonido,sound oficial,official niñas,girls cómo,how esas,those ayuda,help lástima,pity momento,moment farmácia,pharmacy campesino,peasant tejido,fabric george,george llaves,keys opinión,opinion richard,richard muchacha,girl suyo,yours mírame,look error,error dejarlo,leave llamado,called vendrá,come dejé,leave infeliz,unhappy ladrón,thief enfermedad,disease mes,month silencio,silence vengan,come banco,bank enfermo,sick infierno,hell lión,lion microonda,microwave subir,up pequeño,small igual,same normal,normal llámame,call apartamento,apartment tumba,grave rádio,radio acaso,perhaps pena,pain tipos,types enseguida,immediately ahogar,drown malo,bad papeles,papers sala,room lugar,place selva,jungle alguno,any sentimientos,feelings puedes,can gracioso,funny simplemente,simply dejaré,leave frío,cold profeta,prophet pasó,passed habías,had autonomía,autonomy sacar,take alabanza,praise padres,parents cuenta,account muévete,move siga,follow nueve,nine colina,hill sin,without pecho,chest líder,leader así,yes riesgo,risk rodilla,knee 
apología,apology u,or ayudarte,help ni,neither propia,own llegado,arrived tus,your marido,husband dieron,gave acuerdo,agreement este,east puso,put pago,payment toques,touches golpe,knock suelo,floor hambre,hungry ridículo,ridiculous tom,tom desea,wish necesitamos,need interesa,interested tres,three preocupa,worries ocupado,occupied santa,saint transmitir,transmit tomas,shots paga,pay niños,children cree,believes aún,yet supone,supposed hasta,until cuchara,spoon pareció,seemed arte,art cintura,waist cien,hundred dicho,saying hablemos,talk adorar,worship santo,holy dr,dr experto,skilled puede,can genio,genius mar,sea hagamos,do he,have juez,judge ella,she sueño,dream refiero,refer seis,six vi,saw testigo,witness señoría,lordship misma,same hablo,speak impuesto,tax verme,see hielo,ice tenían,had máquina,machine vaca,cow necesita,needs realidad,reality mundial,world déjalo,leave geografía,geography inútil,useless pan,bread escribir,write larry,larry muchos,many chris,chris fuego,fire hotel,hotel existe,exists maldito,damned lavaplatos,dishwasher sabía,knew despacio,slowly famoso,famous mármol,marble inglés,english larga,long acabó,finished llame,called aceptar,accept decidido,decided escrito,written cerrado,closed acabado,finish botella,bottle yendo,going automovíl,car salvar,save recuerdo,memory allá,there increíble,amazing fue,was solo,alone o,or veces,times terriblemente,terribly volverá,return coco,coconut vienen,come humana,human perdí,lost partir,from siguen,follow encontrar,find déjame,let basura,trash oreja,ear zoológico,zoo meses,months escuché,heard estrellas,stars araña,spider duerme,sleeps judaismo,judaism estáis,are pude,could t,t modos,modes pueda,can justo,fair y,and estábamos,were arreglar,fix han,have acelerado,accelerated cuándo,when dicen,say contemplar,contemplate pregunta,question jimmy,jimmy tierra,earth segura,safe teniente,lieutenant ello,it paul,paul águila,eagle no,no conocí,met sabes,know arroz,rice les,them washington,washington 
varios,various valor,value tonta,dumb llena,full miel,honey necesitan,need sexualidad,sexuality princesa,princess tantos,many horas,hours gallina,chicken central,central menudo,often halcón,hawk costar,cost deprisa,quickly probablemente,probably planes,plans blanca,white biografía,biography evitar,avoid ibas,were tienen,have voluntario,voluntary esposo,husband número,number encuentra,find conversación,conversation cárcel,jail te,tea caballeros,gentlemen veo,see primera,first irá,go negra,black n,n subterráneo,subway ei,ei podrá,can terrible,terrible platillo,saucer grupo,group tía,aunt estilo,style recordar,remember norte,north coche,car descanso,rest principal,principal demonio,demon dile,tell municipal,municipal se,oneself armario,closet deberíamos,should estos,these sitio,site entero,whole metido,involved oveja,sheep barato,cheap peso,weight llevaba,took manera,way cualquier,any árboles,trees creer,believe época,time espero,hope equipo,team buen,good trae,brings mío,mine soleado,sunny jane,jane llamó,called próximo,next fuerte,strong resumen,summary reglas,rules necesito,need soy,am hermosa,beautiful bebe,baby felicidad,happiness fragmento,fragment intentando,trying globo,balloon vayas,go derecha,right vuelvo,return sucede,happens palo,stick estaré,be uva,grape estás,are abrumar,overwhelm puedas,can área,area contra,against vuelta,return lágrimas,tears estuve,was frank,frank historia,history algún,some europa,europe esté,be llamo,call hicieron,made niña,girl donación,donation mismos,same quizás,maybe radio,radio algunos,some mató,killed planeta,planet duele,hurts ven,come señal,signal unir,merge único,only

================================================
FILE: examples/02_lazy_loading.py
================================================
""" Example of lazy vs normal loading
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 02
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

########################################
## NORMAL LOADING                     ##
## print out a graph with 1 Add node  ##
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')
z = tf.add(x, y)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('graphs/normal_loading', sess.graph)
    for _ in range(10):
        sess.run(z)
    print(tf.get_default_graph().as_graph_def())
    writer.close()

########################################
## LAZY LOADING                       ##
## print out a graph with 10 Add nodes##
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('graphs/lazy_loading', sess.graph)
    for _ in range(10):
        sess.run(tf.add(x, y))
    print(tf.get_default_graph().as_graph_def())
    writer.close()

================================================
FILE: examples/02_placeholder.py
================================================
""" Placeholder and feed_dict example
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 02
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: feed_dict with placeholder

# a is a placeholder for a vector of 3 elements, type tf.float32
a = tf.placeholder(tf.float32, shape=[3])
b = tf.constant([5, 5, 5], tf.float32)

# use the placeholder as you would a constant
c = a + b  # short for tf.add(a, b)

writer = tf.summary.FileWriter('graphs/placeholders', tf.get_default_graph())
with tf.Session() as sess:
    # compute the value of c given the value of a is [1, 2, 3]
    print(sess.run(c, {a: [1, 2, 3]}))  # [6. 7. 8.]
writer.close()

# Example 2: feed_dict with tensors that are not placeholders
a = tf.add(2, 5)
b = tf.multiply(a, 3)

with tf.Session() as sess:
    print(sess.run(b))  # >> 21
    # compute the value of b given the value of a is 15
    print(sess.run(b, feed_dict={a: 15}))  # >> 45

================================================
FILE: examples/02_simple_tf.py
================================================
""" Simple TensorFlow ops
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf

# Example 1: Simple ways to create log file writer
a = tf.constant(2, name='a')
b = tf.constant(3, name='b')
x = tf.add(a, b, name='add')
writer = tf.summary.FileWriter('./graphs/simple', tf.get_default_graph())
with tf.Session() as sess:
    # writer = tf.summary.FileWriter('./graphs', sess.graph)
    print(sess.run(x))
writer.close()  # close the writer when you're done using it

# Example 2: The wonderful wizard of div
a = tf.constant([2, 2], name='a')
b = tf.constant([[0, 1], [2, 3]], name='b')

with tf.Session() as sess:
    print(sess.run(tf.div(b, a)))
    print(sess.run(tf.divide(b, a)))
    print(sess.run(tf.truediv(b, a)))
    print(sess.run(tf.floordiv(b, a)))
    # print(sess.run(tf.realdiv(b, a)))
    print(sess.run(tf.truncatediv(b, a)))
    print(sess.run(tf.floor_div(b, a)))

# Example 3: multiplying tensors
a = tf.constant([10, 20], name='a')
b = tf.constant([2, 3], name='b')

with tf.Session() as sess:
    print(sess.run(tf.multiply(a, b)))
    print(sess.run(tf.tensordot(a, b, 1)))

# Example 4: Python native type
t_0 = 19
x = tf.zeros_like(t_0)  # ==> 0
y = tf.ones_like(t_0)  # ==> 1

t_1 = ['apple', 'peach', 'banana']
x = tf.zeros_like(t_1)  # ==> ['' '' '']
# y = tf.ones_like(t_1)  # ==> TypeError: Expected string, got 1 of type 'int' instead.
t_2 = [[True, False, False],
       [False, False, True],
       [False, True, False]]
x = tf.zeros_like(t_2)  # ==> 3x3 tensor, all elements are False
y = tf.ones_like(t_2)  # ==> 3x3 tensor, all elements are True

print(tf.int32.as_numpy_dtype())

# Example 5: printing your graph's definition
my_const = tf.constant([1.0, 2.0], name='my_const')
print(tf.get_default_graph().as_graph_def())

================================================
FILE: examples/02_variables.py
================================================
""" Variable examples
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 02
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf

# Example 1: creating variables
s = tf.Variable(2, name='scalar')
m = tf.Variable([[0, 1], [2, 3]], name='matrix')
W = tf.Variable(tf.zeros([784,10]), name='big_matrix')
V = tf.Variable(tf.truncated_normal([784, 10]), name='normal_matrix')

s = tf.get_variable('scalar', initializer=tf.constant(2))
m = tf.get_variable('matrix', initializer=tf.constant([[0, 1], [2, 3]]))
W = tf.get_variable('big_matrix', shape=(784, 10), initializer=tf.zeros_initializer())
V = tf.get_variable('normal_matrix', shape=(784, 10), initializer=tf.truncated_normal_initializer())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(V.eval())

# Example 2: assigning values to variables
W = tf.Variable(10)
W.assign(100)  # creates the assign op but never runs it
with tf.Session() as sess:
    sess.run(W.initializer)
    print(sess.run(W))  # >> 10

W = tf.Variable(10)
assign_op = W.assign(100)
with tf.Session() as sess:
    sess.run(assign_op)
    print(W.eval())  # >> 100

# create a variable whose original value is 2
# (named 'scalar_a' because tf.get_variable('scalar', ...) was already called in
# Example 1; requesting the same name again without reuse raises a ValueError)
a = tf.get_variable('scalar_a', initializer=tf.constant(2))
a_times_two = a.assign(a * 2)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(a_times_two)  # >> 4
    sess.run(a_times_two)  # >> 8
    sess.run(a_times_two)  # >> 16

W = tf.Variable(10)
with tf.Session() as sess:
    sess.run(W.initializer)
    print(sess.run(W.assign_add(10)))  # >> 20
    print(sess.run(W.assign_sub(2)))  # >> 18

# Example 3: Each session has its own copy of variable
W = tf.Variable(10)
sess1 = tf.Session()
sess2 = tf.Session()
sess1.run(W.initializer)
sess2.run(W.initializer)
print(sess1.run(W.assign_add(10)))  # >> 20
print(sess2.run(W.assign_sub(2)))  # >> 8
print(sess1.run(W.assign_add(100)))  # >> 120
print(sess2.run(W.assign_sub(50)))  # >> -42
sess1.close()
sess2.close()

# Example 4: create a variable with the initial value depending on another variable
W = tf.Variable(tf.truncated_normal([700, 10]))
U = tf.Variable(W * 2)

================================================
FILE: examples/03_linreg_dataset.py
================================================
""" Solution for simple linear regression example using tf.data
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in the data
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create Dataset and iterator
dataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))

iterator = dataset.make_initializable_iterator()
X, Y = iterator.get_next()

# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))

# Step 4: build model to predict Y
Y_predicted = X * w + b

# Step 5: use the squared error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# loss = utils.huber_loss(Y, Y_predicted)

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

start = time.time()
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs/linear_reg', sess.graph)

    # Step 8: train the model for 100 epochs
    for i in range(100):
        sess.run(iterator.initializer)  # initialize the iterator
        total_loss = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss])
                total_loss += l
        except tf.errors.OutOfRangeError:
            pass

        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    writer.close()

    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])
    print('w: %f, b: %f' %(w_out, b_out))
print('Took: %f seconds' %(time.time() - start))

# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data with squared error')
# plt.plot(data[:,0], data[:,0] * (-5.883589) + 85.124306, 'g', label='Predicted data with Huber loss')
plt.legend()
plt.show()

================================================
FILE: examples/03_linreg_placeholder.py
================================================
""" Solution for simple linear regression example using placeholders
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')

# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))

# Step 4: build model to predict Y
Y_predicted = w * X + b

# Step 5: use the squared error as the loss function
# you can use either mean squared error or Huber loss
loss = tf.square(Y - Y_predicted, name='loss')
# loss = utils.huber_loss(Y, Y_predicted)

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

start = time.time()
writer = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())

    # Step 8: train the model for 100 epochs
    for i in range(100):
        total_loss = 0
        for x, y in data:
            # Session executes the optimizer and fetches the value of loss
            _, l = sess.run([optimizer, loss], feed_dict={X: x, Y: y})
            total_loss += l
        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    writer.close()

    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])

print('Took: %f seconds' %(time.time() - start))

# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
plt.legend()
plt.show()

================================================
FILE: examples/03_linreg_starter.py
================================================
""" Starter code for simple linear regression example using placeholders
Created by Chip Huyen (huyenn@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
# Remember both X and Y are scalars with type float
X, Y = None, None
#############################
########## TO DO ############
#############################

# Step 3: create weight and bias, initialized to 0.0
# Make sure to use tf.get_variable
w, b = None, None
#############################
########## TO DO ############
#############################

# Step 4: build model to predict Y
# e.g. how would you arrive at Y_predicted given X, w, and b
Y_predicted = None
#############################
########## TO DO ############
#############################

# Step 5: use the squared error as the loss function
loss = None
#############################
########## TO DO ############
#############################

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

start = time.time()

# Create a filewriter to write the model's graph to TensorBoard
#############################
########## TO DO ############
#############################

with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    #############################
    ########## TO DO ############
    #############################

    # Step 8: train the model for 100 epochs
    for i in range(100):
        total_loss = 0
        for x, y in data:
            # Execute the optimizer op and get the value of loss.
            # Don't forget to feed in data for placeholders
            # (fetch into l so that the loss tensor itself is not overwritten)
            _, l = ########## TO DO ############
            total_loss += l

        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    #############################
    ########## TO DO ############
    #############################
    writer.close()

    # Step 9: output the values of w and b
    w_out, b_out = None, None
    #############################
    ########## TO DO ############
    #############################

print('Took: %f seconds' %(time.time() - start))

# uncomment the following lines to see the plot
# plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
# plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
# plt.legend()
# plt.show()

================================================
FILE: examples/03_logreg.py
================================================
""" Solution for simple logistic regression model for MNIST with tf.data module
MNIST dataset: yann.lecun.com/exdb/mnist/
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
import time

import utils

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30
n_train = 60000
n_test = 10000

# Step 1: Read in data
mnist_folder = 'data/mnist'
utils.download_mnist(mnist_folder)
train, val, test = utils.read_mnist(mnist_folder, flatten=True)

# Step 2: Create datasets and iterator
train_data = tf.data.Dataset.from_tensor_slices(train)
train_data = train_data.shuffle(10000)  # if you want to shuffle your data
train_data = train_data.batch(batch_size)

test_data = tf.data.Dataset.from_tensor_slices(test)
test_data = test_data.batch(batch_size)

iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
img, label = iterator.get_next()

train_init = iterator.make_initializer(train_data)  # initializer for train_data
test_init = iterator.make_initializer(test_data)  # initializer for test_data

# Step 3: create weights and bias
# w is initialized to random values with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w = tf.get_variable(name='weights', shape=(784, 10), initializer=tf.random_normal_initializer(0, 0.01))
b = tf.get_variable(name='bias', shape=(1, 10), initializer=tf.zeros_initializer())

# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
logits = tf.matmul(img, w) + b

# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=label, name='entropy')
loss = tf.reduce_mean(entropy, name='loss')  # computes the mean over all the examples in the batch

# Step 6: define training op
# using the Adam optimizer with learning rate of 0.01 to minimize loss
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# Step 7: calculate accuracy with test set
preds = tf.nn.softmax(logits)
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1))
# number of correct predictions in the batch; divided by n_test after the test loop
accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

writer = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())
with tf.Session() as sess:

    start_time = time.time()
    sess.run(tf.global_variables_initializer())

    # train the model n_epochs times
    for i in range(n_epochs):
        sess.run(train_init)  # drawing samples from train_data
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss])
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))
    print('Total time: {0} seconds'.format(time.time() - start_time))

    # test the model
    sess.run(test_init)  # drawing samples from test_data
    total_correct_preds = 0
    try:
        while True:
            accuracy_batch = sess.run(accuracy)
            total_correct_preds += accuracy_batch
    except tf.errors.OutOfRangeError:
        pass

    print('Accuracy {0}'.format(total_correct_preds/n_test))
writer.close()

================================================
FILE: examples/03_logreg_placeholder.py
================================================
""" Solution for simple logistic regression model for MNIST with placeholder
MNIST dataset: yann.lecun.com/exdb/mnist/
Created by Chip Huyen (huyenn@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import time

import utils

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
mnist = input_data.read_data_sets('data/mnist', one_hot=True)
X_batch, Y_batch = mnist.train.next_batch(batch_size)

# Step 2: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# there are 10 classes for each image, corresponding to digits 0 - 9.
# each label is a one-hot vector.
X = tf.placeholder(tf.float32, [batch_size, 784], name='image')
Y = tf.placeholder(tf.int32, [batch_size, 10], name='label')

# Step 3: create weights and bias
# w is initialized to random values with mean of 0
# (this call uses the initializer's default stddev of 1.0, not 0.01)
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w = tf.get_variable(name='weights', shape=(784, 10), initializer=tf.random_normal_initializer())
b = tf.get_variable(name='bias', shape=(1, 10), initializer=tf.zeros_initializer())

# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
logits = tf.matmul(X, w) + b

# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')
loss = tf.reduce_mean(entropy)  # computes the mean over all the examples in the batch
# equivalent by hand: cross entropy is -sum(labels * log(softmax(logits)))
# loss = tf.reduce_mean(-tf.reduce_sum(tf.cast(Y, tf.float32) * tf.log(tf.nn.softmax(logits)), reduction_indices=[1]))

# Step 6: define training op
# using the Adam optimizer with learning rate of 0.01 to minimize loss
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# Step 7: calculate accuracy with test set
preds = tf.nn.softmax(logits)
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))
# number of correct predictions in the batch; normalized after the test loop
accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

writer = tf.summary.FileWriter('./graphs/logreg_placeholder', tf.get_default_graph())
with tf.Session() as sess:
    start_time = time.time()
    sess.run(tf.global_variables_initializer())
    n_batches = int(mnist.train.num_examples/batch_size)

    # train the model n_epochs times
    for i in range(n_epochs):
        total_loss = 0

        for j in range(n_batches):
            X_batch, Y_batch = mnist.train.next_batch(batch_size)
            _, loss_batch = sess.run([optimizer, loss], {X: X_batch, Y: Y_batch})
            total_loss += loss_batch
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))
    print('Total time: {0} seconds'.format(time.time() - start_time))

    # test the model
    n_batches = int(mnist.test.num_examples/batch_size)
    total_correct_preds = 0

    for i in range(n_batches):
        X_batch, Y_batch = mnist.test.next_batch(batch_size)
        accuracy_batch = sess.run(accuracy, {X: X_batch, Y: Y_batch})
        total_correct_preds += accuracy_batch

    print('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))

writer.close()

================================================
FILE: examples/03_logreg_starter.py
================================================
""" Starter code for simple logistic regression model for MNIST with
tf.data module MNIST dataset: yann.lecun.com/exdb/mnist/ Created by Chip Huyen (chiphuyen@cs.stanford.edu) CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Lecture 03 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import numpy as np import tensorflow as tf import time import utils # Define parameters for the model learning_rate = 0.01 batch_size = 128 n_epochs = 30 n_train = 60000 n_test = 10000 # Step 1: Read in data mnist_folder = 'data/mnist' utils.download_mnist(mnist_folder) train, val, test = utils.read_mnist(mnist_folder, flatten=True) # Step 2: Create datasets and iterator # create training Dataset and batch it train_data = tf.data.Dataset.from_tensor_slices(train) train_data = train_data.shuffle(10000) # if you want to shuffle your data train_data = train_data.batch(batch_size) # create testing Dataset and batch it test_data = None ############################# ########## TO DO ############ ############################# # create one iterator and initialize it with different datasets iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes) img, label = iterator.get_next() train_init = iterator.make_initializer(train_data) # initializer for train_data test_init = iterator.make_initializer(test_data) # initializer for test_data # Step 3: create weights and bias # w is initialized to random variables with mean of 0, stddev of 0.01 # b is initialized to 0 # shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w) # shape of b depends on Y w, b = None, None ############################# ########## TO DO ############ ############################# # Step 4: build model # the model that returns the logits.
# these logits will later be passed through a softmax layer logits = None ############################# ########## TO DO ############ ############################# # Step 5: define loss function # use cross entropy of softmax of logits as the loss function loss = None ############################# ########## TO DO ############ ############################# # Step 6: define optimizer # using Adam optimizer with pre-defined learning rate to minimize loss optimizer = None ############################# ########## TO DO ############ ############################# # Step 7: calculate accuracy with test set preds = tf.nn.softmax(logits) correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1)) accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) writer = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph()) with tf.Session() as sess: start_time = time.time() sess.run(tf.global_variables_initializer()) # train the model n_epochs times for i in range(n_epochs): sess.run(train_init) # drawing samples from train_data total_loss = 0 n_batches = 0 try: while True: _, l = sess.run([optimizer, loss]) total_loss += l n_batches += 1 except tf.errors.OutOfRangeError: pass print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches)) print('Total time: {0} seconds'.format(time.time() - start_time)) # test the model sess.run(test_init) # drawing samples from test_data total_correct_preds = 0 try: while True: accuracy_batch = sess.run(accuracy) total_correct_preds += accuracy_batch except tf.errors.OutOfRangeError: pass print('Accuracy {0}'.format(total_correct_preds/n_test)) writer.close() ================================================ FILE: examples/04_linreg_eager.py ================================================ """ Starter code for a simple regression example using eager execution.
Created by Akshay Agrawal (akshayka@cs.stanford.edu) CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Lecture 04 """ import time import tensorflow as tf import tensorflow.contrib.eager as tfe import matplotlib.pyplot as plt import utils DATA_FILE = 'data/birth_life_2010.txt' # In order to use eager execution, `tfe.enable_eager_execution()` must be # called at the very beginning of a TensorFlow program. tfe.enable_eager_execution() # Read the data into a dataset. data, n_samples = utils.read_birth_life_data(DATA_FILE) dataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1])) # Create variables. w = tfe.Variable(0.0) b = tfe.Variable(0.0) # Define the linear predictor. def prediction(x): return x * w + b # Define loss functions of the form: L(y, y_predicted) def squared_loss(y, y_predicted): return (y - y_predicted) ** 2 def huber_loss(y, y_predicted, m=1.0): """Huber loss.""" t = y - y_predicted # Note that enabling eager execution lets you use Python control flow and # specify dynamic TensorFlow computations. Contrast this implementation # to the graph-construction one found in `utils`, which uses `tf.cond`. return t ** 2 if tf.abs(t) <= m else m * (2 * tf.abs(t) - m) def train(loss_fn): """Train a regression model evaluated using `loss_fn`.""" print('Training; loss function: ' + loss_fn.__name__) optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01) # Define the function through which to differentiate. def loss_for_example(x, y): return loss_fn(y, prediction(x)) # `grad_fn(x_i, y_i)` returns (1) the value of `loss_for_example` # evaluated at `x_i`, `y_i` and (2) the gradients of any variables used in # calculating it. grad_fn = tfe.implicit_value_and_gradients(loss_for_example) start = time.time() for epoch in range(100): total_loss = 0.0 for x_i, y_i in tfe.Iterator(dataset): loss, gradients = grad_fn(x_i, y_i) # Take an optimization step and update variables.
optimizer.apply_gradients(gradients) total_loss += loss if epoch % 10 == 0: print('Epoch {0}: {1}'.format(epoch, total_loss / n_samples)) print('Took: %f seconds' % (time.time() - start)) print('Eager execution exhibits significant overhead per operation. ' 'As you increase your batch size, the impact of the overhead will ' 'become less noticeable. Eager execution is under active development: ' 'expect performance to increase substantially in the near future!') train(huber_loss) plt.plot(data[:,0], data[:,1], 'bo') # The `.numpy()` method of a tensor retrieves the NumPy array backing it. # In future versions of eager, you won't need to call `.numpy()` and will # instead be able to, in most cases, pass Tensors wherever NumPy arrays are # expected. plt.plot(data[:,0], data[:,0] * w.numpy() + b.numpy(), 'r', label="huber regression") plt.legend() plt.show() ================================================ FILE: examples/04_linreg_eager_starter.py ================================================ """ Starter code for a simple regression example using eager execution. Created by Akshay Agrawal (akshayka@cs.stanford.edu) CS20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Lecture 04 """ import time import tensorflow as tf import tensorflow.contrib.eager as tfe import matplotlib.pyplot as plt import utils DATA_FILE = 'data/birth_life_2010.txt' # In order to use eager execution, `tfe.enable_eager_execution()` must be # called at the very beginning of a TensorFlow program. ############################# ########## TO DO ############ ############################# # Read the data into a dataset. data, n_samples = utils.read_birth_life_data(DATA_FILE) dataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1])) # Create weight and bias variables, initialized to 0.0. ############################# ########## TO DO ############ ############################# w = None b = None # Define the linear predictor. 
def prediction(x): ############################# ########## TO DO ############ ############################# pass # Define loss functions of the form: L(y, y_predicted) def squared_loss(y, y_predicted): ############################# ########## TO DO ############ ############################# pass def huber_loss(y, y_predicted): """Huber loss with `m` set to `1.0`.""" ############################# ########## TO DO ############ ############################# pass def train(loss_fn): """Train a regression model evaluated using `loss_fn`.""" print('Training; loss function: ' + loss_fn.__name__) optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01) # Define the function through which to differentiate. ############################# ########## TO DO ############ ############################# def loss_for_example(x, y): pass # Obtain a gradients function using `tfe.implicit_value_and_gradients`. ############################# ########## TO DO ############ ############################# grad_fn = None start = time.time() for epoch in range(100): total_loss = 0.0 for x_i, y_i in tfe.Iterator(dataset): # Compute the loss and gradient, and take an optimization step. ############################# ########## TO DO ############ ############################# optimizer.apply_gradients(gradients) total_loss += loss if epoch % 10 == 0: print('Epoch {0}: {1}'.format(epoch, total_loss / n_samples)) print('Took: %f seconds' % (time.time() - start)) print('Eager execution exhibits significant overhead per operation. ' 'As you increase your batch size, the impact of the overhead will ' 'become less noticeable. Eager execution is under active development: ' 'expect performance to increase substantially in the near future!') train(huber_loss) plt.plot(data[:,0], data[:,1], 'bo') # The `.numpy()` method of a tensor retrieves the NumPy array backing it. 
# In future versions of eager, you won't need to call `.numpy()` and will # instead be able to, in most cases, pass Tensors wherever NumPy arrays are # expected. plt.plot(data[:,0], data[:,0] * w.numpy() + b.numpy(), 'r', label="huber regression") plt.legend() plt.show() ================================================ FILE: examples/04_word2vec.py ================================================ """ starter code for word2vec skip-gram model with NCE loss CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 04 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import numpy as np from tensorflow.contrib.tensorboard.plugins import projector import tensorflow as tf import utils import word2vec_utils # Model hyperparameters VOCAB_SIZE = 50000 BATCH_SIZE = 128 EMBED_SIZE = 128 # dimension of the word embedding vectors SKIP_WINDOW = 1 # the context window NUM_SAMPLED = 64 # number of negative examples to sample LEARNING_RATE = 1.0 NUM_TRAIN_STEPS = 100000 VISUAL_FLD = 'visualization' SKIP_STEP = 5000 # Parameters for downloading data DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip' EXPECTED_BYTES = 31344016 NUM_VISUALIZE = 3000 # number of tokens to visualize def word2vec(dataset): """ Build the graph for word2vec model and train it """ # Step 1: get input, output from the dataset with tf.name_scope('data'): iterator = dataset.make_initializable_iterator() center_words, target_words = iterator.get_next() """ Step 2 + 3: define weights and embedding lookup. 
In word2vec, it's actually the weights that we care about """ with tf.name_scope('embed'): embed_matrix = tf.get_variable('embed_matrix', shape=[VOCAB_SIZE, EMBED_SIZE], initializer=tf.random_uniform_initializer()) embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embedding') # Step 4: construct variables for NCE loss and define loss function with tf.name_scope('loss'): nce_weight = tf.get_variable('nce_weight', shape=[VOCAB_SIZE, EMBED_SIZE], initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 0.5))) nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE])) # define loss function to be NCE loss function loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, biases=nce_bias, labels=target_words, inputs=embed, num_sampled=NUM_SAMPLED, num_classes=VOCAB_SIZE), name='loss') # Step 5: define optimizer with tf.name_scope('optimizer'): optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss) utils.safe_mkdir('checkpoints') with tf.Session() as sess: sess.run(iterator.initializer) sess.run(tf.global_variables_initializer()) total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps writer = tf.summary.FileWriter('graphs/word2vec_simple', sess.graph) for index in range(NUM_TRAIN_STEPS): try: loss_batch, _ = sess.run([loss, optimizer]) total_loss += loss_batch if (index + 1) % SKIP_STEP == 0: print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP)) total_loss = 0.0 except tf.errors.OutOfRangeError: sess.run(iterator.initializer) writer.close() def gen(): yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD) def main(): dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32), (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1]))) word2vec(dataset) if __name__ == '__main__': main() ================================================ FILE: examples/04_word2vec_eager.py
================================================ """ starter code for word2vec skip-gram model with NCE loss Eager execution CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) & Akshay Agrawal (akshayka@cs.stanford.edu) Lecture 04 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import numpy as np import tensorflow as tf import tensorflow.contrib.eager as tfe import utils import word2vec_utils tfe.enable_eager_execution() # Model hyperparameters VOCAB_SIZE = 50000 BATCH_SIZE = 128 EMBED_SIZE = 128 # dimension of the word embedding vectors SKIP_WINDOW = 1 # the context window NUM_SAMPLED = 64 # number of negative examples to sample LEARNING_RATE = 1.0 NUM_TRAIN_STEPS = 100000 VISUAL_FLD = 'visualization' SKIP_STEP = 5000 # Parameters for downloading data DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip' EXPECTED_BYTES = 31344016 class Word2Vec(object): def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED): self.vocab_size = vocab_size self.num_sampled = num_sampled self.embed_matrix = tfe.Variable(tf.random_uniform( [vocab_size, embed_size])) self.nce_weight = tfe.Variable(tf.truncated_normal( [vocab_size, embed_size], stddev=1.0 / (embed_size ** 0.5))) self.nce_bias = tfe.Variable(tf.zeros([vocab_size])) def compute_loss(self, center_words, target_words): """Computes the forward pass of word2vec with the NCE loss.""" embed = tf.nn.embedding_lookup(self.embed_matrix, center_words) loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.nce_weight, biases=self.nce_bias, labels=target_words, inputs=embed, num_sampled=self.num_sampled, num_classes=self.vocab_size)) return loss def gen(): yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD) def main(): dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32), (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1]))) optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE) 
model = Word2Vec(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE) grad_fn = tfe.implicit_value_and_gradients(model.compute_loss) total_loss = 0.0 # for average loss in the last SKIP_STEP steps num_train_steps = 0 while num_train_steps < NUM_TRAIN_STEPS: for center_words, target_words in tfe.Iterator(dataset): if num_train_steps >= NUM_TRAIN_STEPS: break loss_batch, grads = grad_fn(center_words, target_words) total_loss += loss_batch optimizer.apply_gradients(grads) if (num_train_steps + 1) % SKIP_STEP == 0: print('Average loss at step {}: {:5.1f}'.format( num_train_steps, total_loss / SKIP_STEP)) total_loss = 0.0 num_train_steps += 1 if __name__ == '__main__': main() ================================================ FILE: examples/04_word2vec_eager_starter.py ================================================ """ starter code for word2vec skip-gram model with NCE loss Eager execution CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) & Akshay Agrawal (akshayka@cs.stanford.edu) Lecture 04 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import numpy as np import tensorflow as tf import tensorflow.contrib.eager as tfe import utils import word2vec_utils # Enable eager execution! 
############################# ########## TO DO ############ ############################# # Model hyperparameters VOCAB_SIZE = 50000 BATCH_SIZE = 128 EMBED_SIZE = 128 # dimension of the word embedding vectors SKIP_WINDOW = 1 # the context window NUM_SAMPLED = 64 # number of negative examples to sample LEARNING_RATE = 1.0 NUM_TRAIN_STEPS = 100000 VISUAL_FLD = 'visualization' SKIP_STEP = 5000 # Parameters for downloading data DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip' EXPECTED_BYTES = 31344016 class Word2Vec(object): def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED): self.vocab_size = vocab_size self.num_sampled = num_sampled # Create the variables: an embedding matrix, nce_weight, and nce_bias ############################# ########## TO DO ############ ############################# self.embed_matrix = None self.nce_weight = None self.nce_bias = None def compute_loss(self, center_words, target_words): """Computes the forward pass of word2vec with the NCE loss.""" # Look up the embeddings for the center words ############################# ########## TO DO ############ ############################# embed = None # Compute the loss, using tf.reduce_mean and tf.nn.nce_loss ############################# ########## TO DO ############ ############################# loss = None return loss def gen(): yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD) def main(): dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32), (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1]))) optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE) # Create the model ############################# ########## TO DO ############ ############################# model = None # Create the gradients function, using `tfe.implicit_value_and_gradients` ############################# ########## TO DO ############ ############################# grad_fn = None total_loss = 0.0 # for average loss in 
the last SKIP_STEP steps num_train_steps = 0 while num_train_steps < NUM_TRAIN_STEPS: for center_words, target_words in tfe.Iterator(dataset): if num_train_steps >= NUM_TRAIN_STEPS: break # Compute the loss and gradients, and take an optimization step. ############################# ########## TO DO ############ ############################# if (num_train_steps + 1) % SKIP_STEP == 0: print('Average loss at step {}: {:5.1f}'.format( num_train_steps, total_loss / SKIP_STEP)) total_loss = 0.0 num_train_steps += 1 if __name__ == '__main__': main() ================================================ FILE: examples/04_word2vec_visualize.py ================================================ """ word2vec skip-gram model with NCE loss and code to visualize the embeddings on TensorBoard CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 04 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import numpy as np from tensorflow.contrib.tensorboard.plugins import projector import tensorflow as tf import utils import word2vec_utils # Model hyperparameters VOCAB_SIZE = 50000 BATCH_SIZE = 128 EMBED_SIZE = 128 # dimension of the word embedding vectors SKIP_WINDOW = 1 # the context window NUM_SAMPLED = 64 # number of negative examples to sample LEARNING_RATE = 1.0 NUM_TRAIN_STEPS = 100000 VISUAL_FLD = 'visualization' SKIP_STEP = 5000 # Parameters for downloading data DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip' EXPECTED_BYTES = 31344016 NUM_VISUALIZE = 3000 # number of tokens to visualize class SkipGramModel: """ Build the graph for word2vec model """ def __init__(self, dataset, vocab_size, embed_size, batch_size, num_sampled, learning_rate): self.vocab_size = vocab_size self.embed_size = embed_size self.batch_size = batch_size self.num_sampled = num_sampled self.lr = learning_rate self.global_step = tf.get_variable('global_step', initializer=tf.constant(0), trainable=False) self.skip_step = SKIP_STEP self.dataset = 
dataset def _import_data(self): """ Step 1: import data """ with tf.name_scope('data'): self.iterator = self.dataset.make_initializable_iterator() self.center_words, self.target_words = self.iterator.get_next() def _create_embedding(self): """ Step 2 + 3: define weights and embedding lookup. In word2vec, it's actually the weights that we care about """ with tf.name_scope('embed'): self.embed_matrix = tf.get_variable('embed_matrix', shape=[self.vocab_size, self.embed_size], initializer=tf.random_uniform_initializer()) self.embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embedding') def _create_loss(self): """ Step 4: define the loss function """ with tf.name_scope('loss'): # construct variables for NCE loss nce_weight = tf.get_variable('nce_weight', shape=[self.vocab_size, self.embed_size], initializer=tf.truncated_normal_initializer(stddev=1.0 / (self.embed_size ** 0.5))) nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([self.vocab_size])) # define loss function to be NCE loss function self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, biases=nce_bias, labels=self.target_words, inputs=self.embed, num_sampled=self.num_sampled, num_classes=self.vocab_size), name='loss') def _create_optimizer(self): """ Step 5: define optimizer """ self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, global_step=self.global_step) def _create_summaries(self): with tf.name_scope('summaries'): tf.summary.scalar('loss', self.loss) tf.summary.histogram('histogram loss', self.loss) # because there are several summaries, we merge them all # into one op to make them easier to manage self.summary_op = tf.summary.merge_all() def build_graph(self): """ Build the graph for our model """ self._import_data() self._create_embedding() self._create_loss() self._create_optimizer() self._create_summaries() def train(self, num_train_steps): saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix,
nce_weight, nce_bias initial_step = 0 utils.safe_mkdir('checkpoints') with tf.Session() as sess: sess.run(self.iterator.initializer) sess.run(tf.global_variables_initializer()) ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint')) # if that checkpoint exists, restore from checkpoint if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps writer = tf.summary.FileWriter('graphs/word2vec/lr' + str(self.lr), sess.graph) initial_step = self.global_step.eval() for index in range(initial_step, initial_step + num_train_steps): try: loss_batch, _, summary = sess.run([self.loss, self.optimizer, self.summary_op]) writer.add_summary(summary, global_step=index) total_loss += loss_batch if (index + 1) % self.skip_step == 0: print('Average loss at step {}: {:5.1f}'.format(index, total_loss / self.skip_step)) total_loss = 0.0 saver.save(sess, 'checkpoints/skip-gram', index) except tf.errors.OutOfRangeError: sess.run(self.iterator.initializer) writer.close() def visualize(self, visual_fld, num_visualize): """ run "tensorboard --logdir='visualization'" to see the embeddings """ # create the list of the num_visualize most common words to visualize word2vec_utils.most_common_words(visual_fld, num_visualize) saver = tf.train.Saver() with tf.Session() as sess: sess.run(tf.global_variables_initializer()) ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint')) # if that checkpoint exists, restore from checkpoint if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) final_embed_matrix = sess.run(self.embed_matrix) # you have to store embeddings in a new variable embedding_var = tf.Variable(final_embed_matrix[:num_visualize], name='embedding') sess.run(embedding_var.initializer) config = projector.ProjectorConfig() summary_writer = tf.summary.FileWriter(visual_fld) # add embedding to the
config file embedding = config.embeddings.add() embedding.tensor_name = embedding_var.name # link this tensor to its metadata file, in this case the first NUM_VISUALIZE words of vocab embedding.metadata_path = 'vocab_' + str(num_visualize) + '.tsv' # saves a configuration file that TensorBoard will read during startup. projector.visualize_embeddings(summary_writer, config) saver_embed = tf.train.Saver([embedding_var]) saver_embed.save(sess, os.path.join(visual_fld, 'model.ckpt'), 1) def gen(): yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD) def main(): dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32), (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1]))) model = SkipGramModel(dataset, VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE) model.build_graph() model.train(NUM_TRAIN_STEPS) model.visualize(VISUAL_FLD, NUM_VISUALIZE) if __name__ == '__main__': main() ================================================ FILE: examples/05_randomization.py ================================================ """ Examples to demonstrate ops level randomization CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 05 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import tensorflow as tf # Example 1: session keeps track of the random state c = tf.random_uniform([], -10, 10, seed=2) with tf.Session() as sess: print(sess.run(c)) # >> 3.574932 print(sess.run(c)) # >> -5.9731865 # Example 2: each new session will start the random state all over again. c = tf.random_uniform([], -10, 10, seed=2) with tf.Session() as sess: print(sess.run(c)) # >> 3.574932 with tf.Session() as sess: print(sess.run(c)) # >> 3.574932 # Example 3: with operation level random seed, each op keeps its own seed. 
c = tf.random_uniform([], -10, 10, seed=2) d = tf.random_uniform([], -10, 10, seed=2) with tf.Session() as sess: print(sess.run(c)) # >> 3.574932 print(sess.run(d)) # >> 3.574932 # Example 4: graph level random seed tf.set_random_seed(2) c = tf.random_uniform([], -10, 10) d = tf.random_uniform([], -10, 10) with tf.Session() as sess: print(sess.run(c)) # >> 9.123926 print(sess.run(d)) # >> -4.5340395 ================================================ FILE: examples/05_variable_sharing.py ================================================ """ Examples to demonstrate variable sharing CS 20: 'TensorFlow for Deep Learning Research' cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 05 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import tensorflow as tf x1 = tf.truncated_normal([200, 100], name='x1') x2 = tf.truncated_normal([200, 100], name='x2') def two_hidden_layers(x): assert x.shape.as_list() == [200, 100] w1 = tf.Variable(tf.random_normal([100, 50]), name='h1_weights') b1 = tf.Variable(tf.zeros([50]), name='h1_biases') h1 = tf.matmul(x, w1) + b1 assert h1.shape.as_list() == [200, 50] w2 = tf.Variable(tf.random_normal([50, 10]), name='h2_weights') b2 = tf.Variable(tf.zeros([10]), name='h2_biases') logits = tf.matmul(h1, w2) + b2 return logits def two_hidden_layers_2(x): assert x.shape.as_list() == [200, 100] w1 = tf.get_variable('h1_weights', [100, 50], initializer=tf.random_normal_initializer()) b1 = tf.get_variable('h1_biases', [50], initializer=tf.constant_initializer(0.0)) h1 = tf.matmul(x, w1) + b1 assert h1.shape.as_list() == [200, 50] w2 = tf.get_variable('h2_weights', [50, 10], initializer=tf.random_normal_initializer()) b2 = tf.get_variable('h2_biases', [10], initializer=tf.constant_initializer(0.0)) logits = tf.matmul(h1, w2) + b2 return logits # logits1 = two_hidden_layers(x1) # logits2 = two_hidden_layers(x2) # logits1 = two_hidden_layers_2(x1) # logits2 = two_hidden_layers_2(x2) # with tf.variable_scope('two_layers') as scope: # logits1
= two_hidden_layers_2(x1) # scope.reuse_variables() # logits2 = two_hidden_layers_2(x2) # with tf.variable_scope('two_layers') as scope: # logits1 = two_hidden_layers_2(x1) # scope.reuse_variables() # logits2 = two_hidden_layers_2(x2) def fully_connected(x, output_dim, scope): with tf.variable_scope(scope, reuse=tf.AUTO_REUSE) as scope: w = tf.get_variable('weights', [x.shape[1], output_dim], initializer=tf.random_normal_initializer()) b = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0)) return tf.matmul(x, w) + b def two_hidden_layers(x): h1 = fully_connected(x, 50, 'h1') h2 = fully_connected(h1, 10, 'h2') return h2 # return the logits so the callers below receive a tensor instead of None with tf.variable_scope('two_layers') as scope: logits1 = two_hidden_layers(x1) # scope.reuse_variables() logits2 = two_hidden_layers(x2) writer = tf.summary.FileWriter('./graphs/cool_variables', tf.get_default_graph()) writer.close() ================================================ FILE: examples/07_convnet_layers.py ================================================ """ Using convolutional net on MNIST dataset of handwritten digits MNIST dataset: http://yann.lecun.com/exdb/mnist/ CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 07 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import tensorflow as tf import utils class ConvNet(object): def __init__(self): self.lr = 0.001 self.batch_size = 128 self.keep_prob = tf.constant(0.75) self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') self.n_classes = 10 self.skip_step = 20 self.n_test = 10000 self.training=False def get_data(self): with tf.name_scope('data'): train_data, test_data = utils.get_mnist_dataset(self.batch_size) iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes) img, self.label = iterator.get_next() self.img = tf.reshape(img, shape=[-1, 28, 28, 1]) # reshape the image to make it work with tf.nn.conv2d self.train_init =
iterator.make_initializer(train_data) # initializer for train_data self.test_init = iterator.make_initializer(test_data) # initializer for test_data def inference(self): conv1 = tf.layers.conv2d(inputs=self.img, filters=32, kernel_size=[5, 5], padding='SAME', activation=tf.nn.relu, name='conv1') pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2, name='pool1') conv2 = tf.layers.conv2d(inputs=pool1, filters=64, kernel_size=[5, 5], padding='SAME', activation=tf.nn.relu, name='conv2') pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2, name='pool2') feature_dim = pool2.shape[1] * pool2.shape[2] * pool2.shape[3] pool2 = tf.reshape(pool2, [-1, feature_dim]) fc = tf.layers.dense(pool2, 1024, activation=tf.nn.relu, name='fc') dropout = tf.layers.dropout(fc, self.keep_prob, training=self.training, name='dropout') self.logits = tf.layers.dense(dropout, self.n_classes, name='logits') def loss(self): ''' define loss function use softmax cross entropy with logits as the loss function compute mean cross entropy, softmax is applied internally ''' # with tf.name_scope('loss'): entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.label, logits=self.logits) self.loss = tf.reduce_mean(entropy, name='loss') def optimize(self): ''' Define training op using Adam Gradient Descent to minimize cost ''' self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=self.gstep) def summary(self): ''' Create summaries to write on TensorBoard ''' with tf.name_scope('summaries'): tf.summary.scalar('loss', self.loss) tf.summary.scalar('accuracy', self.accuracy) tf.summary.histogram('histogram loss', self.loss) self.summary_op = tf.summary.merge_all() def eval(self): ''' Count the number of right predictions in a batch ''' with tf.name_scope('predict'): preds = tf.nn.softmax(self.logits) correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1)) self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) def
build(self): ''' Build the computation graph ''' self.get_data() self.inference() self.loss() self.optimize() self.eval() self.summary() def train_one_epoch(self, sess, saver, init, writer, epoch, step): start_time = time.time() sess.run(init) self.training = True total_loss = 0 n_batches = 0 try: while True: _, l, summaries = sess.run([self.opt, self.loss, self.summary_op]) writer.add_summary(summaries, global_step=step) if (step + 1) % self.skip_step == 0: print('Loss at step {0}: {1}'.format(step, l)) step += 1 total_loss += l n_batches += 1 except tf.errors.OutOfRangeError: pass saver.save(sess, 'checkpoints/convnet_layers/mnist-convnet', step) print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches)) print('Took: {0} seconds'.format(time.time() - start_time)) return step def eval_once(self, sess, init, writer, epoch, step): start_time = time.time() sess.run(init) self.training = False total_correct_preds = 0 try: while True: accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op]) writer.add_summary(summaries, global_step=step) total_correct_preds += accuracy_batch except tf.errors.OutOfRangeError: pass print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test)) print('Took: {0} seconds'.format(time.time() - start_time)) def train(self, n_epochs): ''' The train function alternates between training one epoch and evaluating ''' utils.safe_mkdir('checkpoints') utils.safe_mkdir('checkpoints/convnet_layers') writer = tf.summary.FileWriter('./graphs/convnet_layers', tf.get_default_graph()) with tf.Session() as sess: sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_layers/checkpoint')) if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) step = self.gstep.eval() for epoch in range(n_epochs): step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step) 
self.eval_once(sess, self.test_init, writer, epoch, step) writer.close() if __name__ == '__main__': model = ConvNet() model.build() model.train(n_epochs=15) ================================================ FILE: examples/07_convnet_mnist.py ================================================ """ Using convolutional net on MNIST dataset of handwritten digits MNIST dataset: http://yann.lecun.com/exdb/mnist/ CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 07 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import tensorflow as tf import utils def conv_relu(inputs, filters, k_size, stride, padding, scope_name): ''' A method that does convolution + relu on inputs ''' with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope: in_channels = inputs.shape[-1] kernel = tf.get_variable('kernel', [k_size, k_size, in_channels, filters], initializer=tf.truncated_normal_initializer()) biases = tf.get_variable('biases', [filters], initializer=tf.random_normal_initializer()) conv = tf.nn.conv2d(inputs, kernel, strides=[1, stride, stride, 1], padding=padding) return tf.nn.relu(conv + biases, name=scope.name) def maxpool(inputs, ksize, stride, padding='VALID', scope_name='pool'): '''A method that does max pooling on inputs''' with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope: pool = tf.nn.max_pool(inputs, ksize=[1, ksize, ksize, 1], strides=[1, stride, stride, 1], padding=padding) return pool def fully_connected(inputs, out_dim, scope_name='fc'): ''' A fully connected linear layer on inputs ''' with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope: in_dim = inputs.shape[-1] w = tf.get_variable('weights', [in_dim, out_dim], initializer=tf.truncated_normal_initializer()) b = tf.get_variable('biases', [out_dim], initializer=tf.constant_initializer(0.0)) out = tf.matmul(inputs, w) + b return out class ConvNet(object): def __init__(self): self.lr = 0.001 self.batch_size = 128 
self.keep_prob = tf.constant(0.75) self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') self.n_classes = 10 self.skip_step = 20 self.n_test = 10000 self.training = True def get_data(self): with tf.name_scope('data'): train_data, test_data = utils.get_mnist_dataset(self.batch_size) iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes) img, self.label = iterator.get_next() self.img = tf.reshape(img, shape=[-1, 28, 28, 1]) # reshape the image to make it work with tf.nn.conv2d self.train_init = iterator.make_initializer(train_data) # initializer for train_data self.test_init = iterator.make_initializer(test_data) # initializer for test_data def inference(self): conv1 = conv_relu(inputs=self.img, filters=32, k_size=5, stride=1, padding='SAME', scope_name='conv1') pool1 = maxpool(conv1, 2, 2, 'VALID', 'pool1') conv2 = conv_relu(inputs=pool1, filters=64, k_size=5, stride=1, padding='SAME', scope_name='conv2') pool2 = maxpool(conv2, 2, 2, 'VALID', 'pool2') feature_dim = pool2.shape[1] * pool2.shape[2] * pool2.shape[3] pool2 = tf.reshape(pool2, [-1, feature_dim]) fc = fully_connected(pool2, 1024, 'fc') dropout = tf.nn.dropout(tf.nn.relu(fc), self.keep_prob, name='relu_dropout') self.logits = fully_connected(dropout, self.n_classes, 'logits') def loss(self): ''' define loss function use softmax cross entropy with logits as the loss function compute mean cross entropy, softmax is applied internally ''' # with tf.name_scope('loss'): entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.label, logits=self.logits) self.loss = tf.reduce_mean(entropy, name='loss') def optimize(self): ''' Define training op using the Adam optimizer to minimize cost ''' self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=self.gstep) def summary(self): ''' Create summaries to write on TensorBoard ''' with tf.name_scope('summaries'): tf.summary.scalar('loss', self.loss) tf.summary.scalar('accuracy',
self.accuracy) tf.summary.histogram('histogram loss', self.loss) self.summary_op = tf.summary.merge_all() def eval(self): ''' Count the number of right predictions in a batch ''' with tf.name_scope('predict'): preds = tf.nn.softmax(self.logits) correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1)) self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) def build(self): ''' Build the computation graph ''' self.get_data() self.inference() self.loss() self.optimize() self.eval() self.summary() def train_one_epoch(self, sess, saver, init, writer, epoch, step): start_time = time.time() sess.run(init) self.training = True total_loss = 0 n_batches = 0 try: while True: _, l, summaries = sess.run([self.opt, self.loss, self.summary_op]) writer.add_summary(summaries, global_step=step) if (step + 1) % self.skip_step == 0: print('Loss at step {0}: {1}'.format(step, l)) step += 1 total_loss += l n_batches += 1 except tf.errors.OutOfRangeError: pass saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', step) print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches)) print('Took: {0} seconds'.format(time.time() - start_time)) return step def eval_once(self, sess, init, writer, epoch, step): start_time = time.time() sess.run(init) self.training = False total_correct_preds = 0 try: while True: accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op]) writer.add_summary(summaries, global_step=step) total_correct_preds += accuracy_batch except tf.errors.OutOfRangeError: pass print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test)) print('Took: {0} seconds'.format(time.time() - start_time)) def train(self, n_epochs): ''' The train function alternates between training one epoch and evaluating ''' utils.safe_mkdir('checkpoints') utils.safe_mkdir('checkpoints/convnet_mnist') writer = tf.summary.FileWriter('./graphs/convnet', tf.get_default_graph()) with tf.Session() as sess: 
sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint')) if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) step = self.gstep.eval() for epoch in range(n_epochs): step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step) self.eval_once(sess, self.test_init, writer, epoch, step) writer.close() if __name__ == '__main__': model = ConvNet() model.build() model.train(n_epochs=30) ================================================ FILE: examples/07_convnet_mnist_starter.py ================================================ """ Using convolutional net on MNIST dataset of handwritten digits MNIST dataset: http://yann.lecun.com/exdb/mnist/ CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 07 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import time import tensorflow as tf import utils def conv_relu(inputs, filters, k_size, stride, padding, scope_name): ''' A method that does convolution + relu on inputs ''' ############################# ########## TO DO ############ ############################# return None def maxpool(inputs, ksize, stride, padding='VALID', scope_name='pool'): '''A method that does max pooling on inputs''' ############################# ########## TO DO ############ ############################# return None def fully_connected(inputs, out_dim, scope_name='fc'): ''' A fully connected linear layer on inputs ''' ############################# ########## TO DO ############ ############################# return None class ConvNet(object): def __init__(self): self.lr = 0.001 self.batch_size = 128 self.keep_prob = tf.constant(0.75) self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') self.n_classes = 10 self.skip_step = 20 self.n_test = 10000 def get_data(self): with tf.name_scope('data'): train_data, test_data = 
utils.get_mnist_dataset(self.batch_size) iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes) img, self.label = iterator.get_next() self.img = tf.reshape(img, shape=[-1, 28, 28, 1]) # reshape the image to make it work with tf.nn.conv2d self.train_init = iterator.make_initializer(train_data) # initializer for train_data self.test_init = iterator.make_initializer(test_data) # initializer for test_data def inference(self): ''' Build the model according to the description we've shown in class ''' ############################# ########## TO DO ############ ############################# self.logits = None def loss(self): ''' define loss function use softmax cross entropy with logits as the loss function tf.nn.softmax_cross_entropy_with_logits softmax is applied internally don't forget to compute the mean across all samples in a batch ''' ############################# ########## TO DO ############ ############################# self.loss = None def optimize(self): ''' Define training op using the Adam optimizer to minimize cost Don't forget to use global step ''' ############################# ########## TO DO ############ ############################# self.opt = None def summary(self): ''' Create summaries to write on TensorBoard Remember to track both training loss and test accuracy ''' ############################# ########## TO DO ############ ############################# self.summary_op = None def eval(self): ''' Count the number of right predictions in a batch ''' with tf.name_scope('predict'): preds = tf.nn.softmax(self.logits) correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1)) self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) def build(self): ''' Build the computation graph ''' self.get_data() self.inference() self.loss() self.optimize() self.eval() self.summary() def train_one_epoch(self, sess, saver, init, writer, epoch, step): start_time = time.time() sess.run(init) total_loss = 0
n_batches = 0 try: while True: _, l, summaries = sess.run([self.opt, self.loss, self.summary_op]) writer.add_summary(summaries, global_step=step) if (step + 1) % self.skip_step == 0: print('Loss at step {0}: {1}'.format(step, l)) step += 1 total_loss += l n_batches += 1 except tf.errors.OutOfRangeError: pass saver.save(sess, 'checkpoints/convnet_starter/mnist-convnet', step) print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches)) print('Took: {0} seconds'.format(time.time() - start_time)) return step def eval_once(self, sess, init, writer, epoch, step): start_time = time.time() sess.run(init) total_correct_preds = 0 try: while True: accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op]) writer.add_summary(summaries, global_step=step) total_correct_preds += accuracy_batch except tf.errors.OutOfRangeError: pass print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test)) print('Took: {0} seconds'.format(time.time() - start_time)) def train(self, n_epochs): ''' The train function alternates between training one epoch and evaluating ''' utils.safe_mkdir('checkpoints') utils.safe_mkdir('checkpoints/convnet_starter') writer = tf.summary.FileWriter('./graphs/convnet_starter', tf.get_default_graph()) with tf.Session() as sess: sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_starter/checkpoint')) if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) step = self.gstep.eval() for epoch in range(n_epochs): step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step) self.eval_once(sess, self.test_init, writer, epoch, step) writer.close() if __name__ == '__main__': model = ConvNet() model.build() model.train(n_epochs=15) ================================================ FILE: examples/07_run_kernels.py ================================================ """ Simple examples of 
convolution to do some basic filters Also demonstrates the use of TensorFlow data readers. We will use some popular filters for our image. It seems to be working with grayscale images, but not with rgb images. It's probably because I didn't choose the right kernels for rgb images. kernels for rgb images have dimensions 3 x 3 x 3 x 3 kernels for grayscale images have dimensions 3 x 3 x 1 x 1 CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 07 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import sys sys.path.append('..') from matplotlib import gridspec as gridspec from matplotlib import pyplot as plt import tensorflow as tf import kernels def read_one_image(filename): ''' This method is to show how to read image from a file into a tensor. The output is a tensor object. ''' image_string = tf.read_file(filename) image_decoded = tf.image.decode_image(image_string) image = tf.cast(image_decoded, tf.float32) / 256.0 return image def convolve(image, kernels, rgb=True, strides=[1, 3, 3, 1], padding='SAME'): images = [image[0]] for i, kernel in enumerate(kernels): filtered_image = tf.nn.conv2d(image, kernel, strides=strides, padding=padding)[0] if i == 2: filtered_image = tf.minimum(tf.nn.relu(filtered_image), 255) images.append(filtered_image) return images def show_images(images, rgb=True): gs = gridspec.GridSpec(1, len(images)) for i, image in enumerate(images): plt.subplot(gs[0, i]) if rgb: plt.imshow(image) else: image = image.reshape(image.shape[0], image.shape[1]) plt.imshow(image, cmap='gray') plt.axis('off') plt.show() def main(): rgb = False if rgb: kernels_list = [kernels.BLUR_FILTER_RGB, kernels.SHARPEN_FILTER_RGB, kernels.EDGE_FILTER_RGB, kernels.TOP_SOBEL_RGB, kernels.EMBOSS_FILTER_RGB] else: kernels_list = [kernels.BLUR_FILTER, kernels.SHARPEN_FILTER, kernels.EDGE_FILTER, kernels.TOP_SOBEL, kernels.EMBOSS_FILTER] kernels_list = kernels_list[1:] image = read_one_image('data/friday.jpg') if 
not rgb: image = tf.image.rgb_to_grayscale(image) image = tf.expand_dims(image, 0) # make it into a batch of 1 element images = convolve(image, kernels_list, rgb) with tf.Session() as sess: images = sess.run(images) # convert images from tensors to float values show_images(images, rgb) if __name__ == '__main__': main() ================================================ FILE: examples/11_char_rnn.py ================================================ """ A clean, no_frills character-level generative language model. CS 20: "TensorFlow for Deep Learning Research" cs20.stanford.edu Danijar Hafner (mail@danijar.com) & Chip Huyen (chiphuyen@cs.stanford.edu) Lecture 11 """ import os os.environ['TF_CPP_MIN_LOG_LEVEL']='2' import random import sys sys.path.append('..') import time import tensorflow as tf import utils def vocab_encode(text, vocab): return [vocab.index(x) + 1 for x in text if x in vocab] def vocab_decode(array, vocab): return ''.join([vocab[x - 1] for x in array]) def read_data(filename, vocab, window, overlap): lines = [line.strip() for line in open(filename, 'r').readlines()] while True: random.shuffle(lines) for text in lines: text = vocab_encode(text, vocab) for start in range(0, len(text) - window, overlap): chunk = text[start: start + window] chunk += [0] * (window - len(chunk)) yield chunk def read_batch(stream, batch_size): batch = [] for element in stream: batch.append(element) if len(batch) == batch_size: yield batch batch = [] yield batch class CharRNN(object): def __init__(self, model): self.model = model self.path = 'data/' + model + '.txt' if 'trump' in model: self.vocab = ("$%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ" " '\"_abcdefghijklmnopqrstuvwxyz{|}@#➡📈") else: self.vocab = (" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ" "\\^_abcdefghijklmnopqrstuvwxyz{|}") self.seq = tf.placeholder(tf.int32, [None, None]) self.temp = tf.constant(1.5) self.hidden_sizes = [128, 256] self.batch_size = 64 self.lr = 0.0003 self.skip_step = 1 
self.num_steps = 50 # for RNN unrolled self.len_generated = 200 self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step') def create_rnn(self, seq): layers = [tf.nn.rnn_cell.GRUCell(size) for size in self.hidden_sizes] cells = tf.nn.rnn_cell.MultiRNNCell(layers) batch = tf.shape(seq)[0] zero_states = cells.zero_state(batch, dtype=tf.float32) self.in_state = tuple([tf.placeholder_with_default(state, [None, state.shape[1]]) for state in zero_states]) # this line to calculate the real length of seq # all seq are padded to be of the same length, which is num_steps length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1) self.output, self.out_state = tf.nn.dynamic_rnn(cells, seq, length, self.in_state) def create_model(self): seq = tf.one_hot(self.seq, len(self.vocab)) self.create_rnn(seq) self.logits = tf.layers.dense(self.output, len(self.vocab), None) loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits[:, :-1], labels=seq[:, 1:]) self.loss = tf.reduce_sum(loss) # sample the next character from Maxwell-Boltzmann Distribution # with temperature temp. 
It works equally well without tf.exp self.sample = tf.multinomial(tf.exp(self.logits[:, -1] / self.temp), 1)[:, 0] self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=self.gstep) def train(self): saver = tf.train.Saver() start = time.time() min_loss = None with tf.Session() as sess: writer = tf.summary.FileWriter('graphs/gist', sess.graph) sess.run(tf.global_variables_initializer()) ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/' + self.model + '/checkpoint')) if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) iteration = self.gstep.eval() stream = read_data(self.path, self.vocab, self.num_steps, overlap=self.num_steps//2) data = read_batch(stream, self.batch_size) while True: batch = next(data) # for batch in read_batch(read_data(DATA_PATH, vocab)): batch_loss, _ = sess.run([self.loss, self.opt], {self.seq: batch}) if (iteration + 1) % self.skip_step == 0: print('Iter {}. \n Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start)) self.online_infer(sess) start = time.time() checkpoint_name = 'checkpoints/' + self.model + '/char-rnn' if min_loss is None: saver.save(sess, checkpoint_name, iteration) elif batch_loss < min_loss: saver.save(sess, checkpoint_name, iteration) min_loss = batch_loss iteration += 1 def online_infer(self, sess): """ Generate sequence one character at a time, based on the previous character """ for seed in ['Hillary', 'I', 'R', 'T', '@', 'N', 'M', '.', 'G', 'A', 'W']: sentence = seed state = None for _ in range(self.len_generated): batch = [vocab_encode(sentence[-1], self.vocab)] feed = {self.seq: batch} if state is not None: # for the first decoder step, the state is None for i in range(len(state)): feed.update({self.in_state[i]: state[i]}) index, state = sess.run([self.sample, self.out_state], feed) sentence += vocab_decode(index, self.vocab) print('\t' + sentence) def main(): model = 'trump_tweets' utils.safe_mkdir('checkpoints') 
utils.safe_mkdir('checkpoints/' + model) lm = CharRNN(model) lm.create_model() lm.train() if __name__ == '__main__': main() ================================================ FILE: examples/kernels.py ================================================ import numpy as np import tensorflow as tf a = np.zeros([3, 3, 3, 3]) a[1, 1, :, :] = 0.25 a[0, 1, :, :] = 0.125 a[1, 0, :, :] = 0.125 a[2, 1, :, :] = 0.125 a[1, 2, :, :] = 0.125 a[0, 0, :, :] = 0.0625 a[0, 2, :, :] = 0.0625 a[2, 0, :, :] = 0.0625 a[2, 2, :, :] = 0.0625 BLUR_FILTER_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) # a[1, 1, :, :] = 0.25 # a[0, 1, :, :] = 0.125 # a[1, 0, :, :] = 0.125 # a[2, 1, :, :] = 0.125 # a[1, 2, :, :] = 0.125 # a[0, 0, :, :] = 0.0625 # a[0, 2, :, :] = 0.0625 # a[2, 0, :, :] = 0.0625 # a[2, 2, :, :] = 0.0625 a[1, 1, :, :] = 1.0 a[0, 1, :, :] = 1.0 a[1, 0, :, :] = 1.0 a[2, 1, :, :] = 1.0 a[1, 2, :, :] = 1.0 a[0, 0, :, :] = 1.0 a[0, 2, :, :] = 1.0 a[2, 0, :, :] = 1.0 a[2, 2, :, :] = 1.0 BLUR_FILTER = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 3, 3]) a[1, 1, :, :] = 5 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[2, 1, :, :] = -1 a[1, 2, :, :] = -1 SHARPEN_FILTER_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) a[1, 1, :, :] = 5 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[2, 1, :, :] = -1 a[1, 2, :, :] = -1 SHARPEN_FILTER = tf.constant(a, dtype=tf.float32) # a = np.zeros([3, 3, 3, 3]) # a[:, :, :, :] = -1 # a[1, 1, :, :] = 8 # EDGE_FILTER_RGB = tf.constant(a, dtype=tf.float32) EDGE_FILTER_RGB = tf.constant([ [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]], [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ 8., 0., 0.], [ 0., 8., 0.], [ 0., 0., 8.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]], [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]], [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]] ]) 
a = np.zeros([3, 3, 1, 1]) # a[:, :, :, :] = -1 # a[1, 1, :, :] = 8 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[1, 2, :, :] = -1 a[2, 1, :, :] = -1 a[1, 1, :, :] = 4 EDGE_FILTER = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 3, 3]) a[0, :, :, :] = 1 a[0, 1, :, :] = 2 # originally 2 a[2, :, :, :] = -1 a[2, 1, :, :] = -2 TOP_SOBEL_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) a[0, :, :, :] = 1 a[0, 1, :, :] = 2 # originally 2 a[2, :, :, :] = -1 a[2, 1, :, :] = -2 TOP_SOBEL = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 3, 3]) a[0, 0, :, :] = -2 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[1, 1, :, :] = 1 a[1, 2, :, :] = 1 a[2, 1, :, :] = 1 a[2, 2, :, :] = 2 EMBOSS_FILTER_RGB = tf.constant(a, dtype=tf.float32) a = np.zeros([3, 3, 1, 1]) a[0, 0, :, :] = -2 a[0, 1, :, :] = -1 a[1, 0, :, :] = -1 a[1, 1, :, :] = 1 a[1, 2, :, :] = 1 a[2, 1, :, :] = 1 a[2, 2, :, :] = 2 EMBOSS_FILTER = tf.constant(a, dtype=tf.float32) ================================================ FILE: examples/utils.py ================================================ import os import gzip import shutil import struct import urllib.request os.environ['TF_CPP_MIN_LOG_LEVEL']='2' from matplotlib import pyplot as plt import numpy as np import tensorflow as tf def huber_loss(labels, predictions, delta=14.0): residual = tf.abs(labels - predictions) def f1(): return 0.5 * tf.square(residual) def f2(): return delta * residual - 0.5 * tf.square(delta) return tf.cond(residual < delta, f1, f2) def safe_mkdir(path): """ Create a directory if there isn't one already.
""" try: os.mkdir(path) except OSError: pass def read_birth_life_data(filename): """ Read in birth_life_2010.txt and return: data in the form of NumPy array n_samples: number of samples """ text = open(filename, 'r').readlines()[1:] data = [line[:-1].split('\t') for line in text] births = [float(line[1]) for line in data] lifes = [float(line[2]) for line in data] data = list(zip(births, lifes)) n_samples = len(data) data = np.asarray(data, dtype=np.float32) return data, n_samples def download_one_file(download_url, local_dest, expected_byte=None, unzip_and_remove=False): """ Download the file from download_url into local_dest if the file doesn't already exists. If expected_byte is provided, check if the downloaded file has the same number of bytes. If unzip_and_remove is True, unzip the file and remove the zip file """ if os.path.exists(local_dest) or os.path.exists(local_dest[:-3]): print('%s already exists' %local_dest) else: print('Downloading %s' %download_url) local_file, _ = urllib.request.urlretrieve(download_url, local_dest) file_stat = os.stat(local_dest) if expected_byte: if file_stat.st_size == expected_byte: print('Successfully downloaded %s' %local_dest) if unzip_and_remove: with gzip.open(local_dest, 'rb') as f_in, open(local_dest[:-3],'wb') as f_out: shutil.copyfileobj(f_in, f_out) os.remove(local_dest) else: print('The downloaded file has unexpected number of bytes') def download_mnist(path): """ Download and unzip the dataset mnist if it's not already downloaded Download from http://yann.lecun.com/exdb/mnist """ safe_mkdir(path) url = 'http://yann.lecun.com/exdb/mnist' filenames = ['train-images-idx3-ubyte.gz', 'train-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz'] expected_bytes = [9912422, 28881, 1648877, 4542] for filename, byte in zip(filenames, expected_bytes): download_url = os.path.join(url, filename) local_dest = os.path.join(path, filename) download_one_file(download_url, local_dest, byte, True) def 
parse_data(path, dataset, flatten): if dataset != 'train' and dataset != 't10k': raise NameError('dataset must be train or t10k') label_file = os.path.join(path, dataset + '-labels-idx1-ubyte') with open(label_file, 'rb') as file: _, num = struct.unpack(">II", file.read(8)) labels = np.fromfile(file, dtype=np.int8) #int8 new_labels = np.zeros((num, 10)) new_labels[np.arange(num), labels] = 1 img_file = os.path.join(path, dataset + '-images-idx3-ubyte') with open(img_file, 'rb') as file: _, num, rows, cols = struct.unpack(">IIII", file.read(16)) imgs = np.fromfile(file, dtype=np.uint8).reshape(num, rows, cols) #uint8 imgs = imgs.astype(np.float32) / 255.0 if flatten: imgs = imgs.reshape([num, -1]) return imgs, new_labels def read_mnist(path, flatten=True, num_train=55000): """ Read in the mnist dataset, given that the data is stored in path Return two tuples of numpy arrays ((train_imgs, train_labels), (test_imgs, test_labels)) """ imgs, labels = parse_data(path, 'train', flatten) indices = np.random.permutation(labels.shape[0]) train_idx, val_idx = indices[:num_train], indices[num_train:] train_img, train_labels = imgs[train_idx, :], labels[train_idx, :] val_img, val_labels = imgs[val_idx, :], labels[val_idx, :] test = parse_data(path, 't10k', flatten) return (train_img, train_labels), (val_img, val_labels), test def get_mnist_dataset(batch_size): # Step 1: Read in data mnist_folder = 'data/mnist' download_mnist(mnist_folder) train, val, test = read_mnist(mnist_folder, flatten=False) # Step 2: Create datasets and iterator train_data = tf.data.Dataset.from_tensor_slices(train) train_data = train_data.shuffle(10000) # if you want to shuffle your data train_data = train_data.batch(batch_size) test_data = tf.data.Dataset.from_tensor_slices(test) test_data = test_data.batch(batch_size) return train_data, test_data def show(image): """ Render a given numpy.uint8 2D array of pixel data. 
""" plt.imshow(image, cmap='gray') plt.show() ================================================ FILE: examples/word2vec_utils.py ================================================ from collections import Counter import random import os import sys sys.path.append('..') import zipfile import numpy as np from six.moves import urllib import tensorflow as tf import utils def read_data(file_path): """ Read data into a list of tokens There should be 17,005,207 tokens """ with zipfile.ZipFile(file_path) as f: words = tf.compat.as_str(f.read(f.namelist()[0])).split() return words def build_vocab(words, vocab_size, visual_fld): """ Build vocabulary of VOCAB_SIZE most frequent words and write it to visualization/vocab.tsv """ utils.safe_mkdir(visual_fld) file = open(os.path.join(visual_fld, 'vocab.tsv'), 'w') dictionary = dict() count = [('UNK', -1)] index = 0 count.extend(Counter(words).most_common(vocab_size - 1)) for word, _ in count: dictionary[word] = index index += 1 file.write(word + '\n') index_dictionary = dict(zip(dictionary.values(), dictionary.keys())) file.close() return dictionary, index_dictionary def convert_words_to_index(words, dictionary): """ Replace each word in the dataset with its index in the dictionary """ return [dictionary[word] if word in dictionary else 0 for word in words] def generate_sample(index_words, context_window_size): """ Form training pairs according to the skip-gram model. """ for index, center in enumerate(index_words): context = random.randint(1, context_window_size) # get a random target before the center word for target in index_words[max(0, index - context): index]: yield center, target # get a random target after the center wrod for target in index_words[index + 1: index + context + 1]: yield center, target def most_common_words(visual_fld, num_visualize): """ create a list of num_visualize most frequent words to visualize on TensorBoard. 
saved to visualization/vocab_[num_visualize].tsv """ words = open(os.path.join(visual_fld, 'vocab.tsv'), 'r').readlines()[:num_visualize] words = [word for word in words] file = open(os.path.join(visual_fld, 'vocab_' + str(num_visualize) + '.tsv'), 'w') for word in words: file.write(word) file.close() def batch_gen(download_url, expected_byte, vocab_size, batch_size, skip_window, visual_fld): local_dest = 'data/text8.zip' utils.download_one_file(download_url, local_dest, expected_byte) words = read_data(local_dest) dictionary, _ = build_vocab(words, vocab_size, visual_fld) index_words = convert_words_to_index(words, dictionary) del words # to save memory single_gen = generate_sample(index_words, skip_window) while True: center_batch = np.zeros(batch_size, dtype=np.int32) target_batch = np.zeros([batch_size, 1]) for index in range(batch_size): center_batch[index], target_batch[index] = next(single_gen) yield center_batch, target_batch ================================================ FILE: setup/requirements.txt ================================================ tensorflow==1.4.1 scipy==1.0.0 scikit-learn==0.19.1 matplotlib==2.1.1 xlrd==1.1.0 ipdb==0.10.3 Pillow==5.0.0 lxml==4.1.1 ================================================ FILE: setup/setup_instruction.md ================================================ Please follow the official instructions to install TensorFlow [here](https://www.tensorflow.org/install/). For this course, I will use Python 3.6 and TensorFlow 1.4. You’re welcome to use either Python 2 or Python 3 for the assignments. The starter code, though, will be in Python 3.6. You don't need a GPU for most code examples in this course, though having one won't hurt. If you install TensorFlow on your local machine, my recommendation is to always set up TensorFlow using virtualenv. For the list of dependencies, please consult the file requirements.txt. This list will be updated as the course progresses.
There are a few things to note:
- As of version 1.2, TensorFlow no longer provides GPU support on macOS.
- On macOS, Python 3.6 might give a warning but still works.
- TensorFlow with GPU support will only work with CUDA® Toolkit 8.0 and cuDNN v6.0, not the newest CUDA and cuDNN versions. Make sure that you install the correct CUDA and cuDNN versions to avoid frustrating issues.
- On Windows, TensorFlow supports only 64-bit Python 3.5 and Python 3.6.
- If you see the warning:
```bash
Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
```
it's because you installed a prebuilt TensorFlow binary instead of building TensorFlow from sources with these CPU instructions enabled. You can choose to install TensorFlow from sources -- the process might take up to 30 minutes. To silence the warning, add this before importing TensorFlow:
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
```
- If you want to install TensorFlow from sources, keep in mind that TensorFlow doesn't officially support building from sources on Windows. On Windows, you may try using the highly experimental Bazel on Windows or the TensorFlow CMake build.

Below are simpler instructions on how to install TensorFlow on macOS. If you have any problems installing TensorFlow, feel free to post them on [Piazza](https://piazza.com/stanford/winter2018/cs20).

If you get a “permission denied” error in any command, use “sudo” in front of that command.

You will need pip3 (or pip if you use Python 2), and virtualenv.

Step 1: install python3 and pip3. Skip this step if you already have both. You can find the official instructions [here](http://docs.python-guide.org/en/latest/starting/install3/osx/)

Step 2: upgrade six
```bash
$ sudo easy_install --upgrade six
```

Step 3: install virtualenv. Skip this step if you already have virtualenv
```bash
$ pip3 install virtualenv
```

Step 4: set up a project directory. You will do all work for this class in this directory
```bash
$ mkdir cs20
```

Step 5: set up virtual environment with python3
```bash
$ cd cs20
$ python3 -m venv .env
```
These commands create a `.env` subdirectory in your project where everything is installed.

Step 6: activate the virtual environment
```bash
$ source .env/bin/activate
```

If you type:
```bash
$ pip3 freeze
```
You will see that nothing is shown, which means no package is installed in your virtual environment. So you have to install all packages that you need. For the list of packages you need for this class, you can see/download the list of requirements in [the setup folder of this repository](https://github.com/chiphuyen/stanford-tensorflow-tutorials/blob/master/setup/requirements.txt).
Step 7: Install TensorFlow and other dependencies
```bash
$ pip3 install -r requirements.txt
```

Step n: To exit the virtual environment, use:
```bash
$ deactivate
```

### Other options
#### Floydhub
Floydhub has a clean, GitHub-like interface that allows you to create and run TensorFlow projects.

# Possible set up problems
## Matplotlib
If you have problems using Matplotlib in a virtual environment, here are two simple ways to fix them.
1. If you installed matplotlib using pip, there is a directory in your home directory called ~/.matplotlib. Create a file ~/.matplotlib/matplotlibrc there and add the following line:
```backend: TkAgg```
2. After importing matplotlib (and before importing matplotlib.pyplot), simply add:
```matplotlib.use("TkAgg")```

If you run into more problems, feel free to post your questions on [Piazza](https://piazza.com/stanford/winter2018/cs20) or email us at cs20-win1718-staff@lists.stanford.edu.
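
The first Matplotlib fix above (creating a matplotlibrc file) can also be scripted. The following is only a minimal sketch under stated assumptions: `write_matplotlibrc` is a hypothetical helper name that is not part of matplotlib or of this repository, and the config directory is passed in explicitly because its location can differ between platforms and matplotlib versions (the instructions above use ~/.matplotlib).

```python
import os


def write_matplotlibrc(config_dir, backend="TkAgg"):
    """Write a matplotlibrc file that selects the given backend.

    Hypothetical helper for illustration only. `config_dir` is the
    directory that should hold the matplotlibrc file, for example
    ~/.matplotlib as in the instructions above.
    """
    config_dir = os.path.expanduser(config_dir)
    # Create the config directory if it does not exist yet.
    os.makedirs(config_dir, exist_ok=True)
    rc_path = os.path.join(config_dir, "matplotlibrc")
    # Leave an existing file alone so user settings are not overwritten.
    if os.path.exists(rc_path):
        return rc_path
    with open(rc_path, "w") as rc_file:
        rc_file.write("backend: {}\n".format(backend))
    return rc_path
```

For example, calling `write_matplotlibrc('~/.matplotlib')` would create the file described in option 1 if it does not exist yet. The helper deliberately refuses to overwrite an existing matplotlibrc, since that file may already hold other settings.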